jsoup: The Java HTML Parser
Scrape, clean, and manipulate HTML with jQuery-like simplicity in Java
What is jsoup?
jsoup is a lightweight Java library for working with real-world HTML. It parses HTML from URLs, files, or strings into a DOM following the WHATWG HTML5 specification, and lets you extract and manipulate data through DOM traversal and CSS selectors. Because it tolerates malformed markup and builds the same parse tree a modern browser would, it is well suited to web scraping, data extraction, and HTML sanitization.
Unlike raw regex approaches, jsoup offers a clean object model with methods inspired by jQuery, simplifying tasks like form submission, attribute modification, and text extraction. With zero dependencies and MIT licensing, it’s a favorite for Java developers needing reliable HTML processing.
Key advantages of jsoup include:
- Real-world HTML handling: Parses messy HTML like browsers do
- jQuery-style syntax: Familiar CSS selectors (e.g., doc.select("div.content"))
- Scraping-friendly: Follows redirects, handles cookies, and submits forms
- Cross-platform: Pure Java with no native dependencies
- Security: Built-in, safelist-based HTML sanitization that helps prevent XSS
Perfect for data mining, web automation, and content analysis.
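As a rough illustration of the sanitization point above, the sketch below runs untrusted markup through Jsoup.clean with the Safelist.basic() preset. The class name SanitizeExample, the helper sanitize, and the sample input are made up for this example; it assumes a recent jsoup release such as 1.17.2 (older releases named the class Whitelist).

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class SanitizeExample {

    // Hypothetical helper: keeps only the simple text-formatting tags and
    // attributes allowed by the Safelist.basic() preset, dropping the rest.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        // Sample input with an inline event handler and a script element.
        String unsafe =
                "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>"
                + "<script>alert(1)</script>";

        // The onclick attribute and the script element are not on the
        // safelist, so neither appears in the cleaned output.
        System.out.println(sanitize(unsafe));
    }
}
```

Safelist offers other presets (for example none() and relaxed()) and can be extended with addTags and addAttributes; which one fits depends on how much markup you want to let through.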
Why Choose jsoup?
- Simplicity: Intuitive API with CSS selector support
- Reliability: Actively maintained since 2009
- Performance: Optimized for streaming and large documents
- Flexibility: Works with fragments, files, or live URLs
- Clean output: Pretty-prints and reformats HTML
Installation
Add jsoup via Maven or Gradle:
Maven
<dependency>
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.17.2</version>
</dependency>
Gradle
implementation 'org.jsoup:jsoup:1.17.2'
System Requirements: Java 8+
Code Examples
Practical jsoup use cases:
Example 1: Parse a Document from a String
If you have HTML in a Java string and want to parse it to read or modify its contents, jsoup can do this in just a few lines of code.
The parse(String html, String baseUri) method parses the input HTML into a new Document. The baseUri parameter is used to resolve relative URLs into absolute ones and should be the URL from which the document was retrieved. If that does not apply, or if the HTML contains a <base> element, you can use the parse(String html) method instead.
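A minimal sketch of parsing a string with a base URI might look like the following. The class name ParseStringExample, the helper firstLinkAbsUrl, the sample markup, and the https://example.com/ base URI are all illustrative assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseStringExample {

    // Hypothetical helper: parses the HTML and returns the absolute URL of
    // the first link, resolved against baseUri ("" if there is no link).
    static String firstLinkAbsUrl(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.selectFirst("a[href]");
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<html><head><title>First parse</title></head>"
                + "<body><p>Parsed HTML into a doc.</p>"
                + "<a href=\"/docs\">Docs</a></body></html>";

        Document doc = Jsoup.parse(html, "https://example.com/");
        System.out.println(doc.title());                        // First parse
        System.out.println(
                firstLinkAbsUrl(html, "https://example.com/")); // https://example.com/docs
    }
}
```

The base URI only matters when you later ask for absolute URLs (absUrl or the abs: attribute prefix); the parse itself succeeds either way.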
Example 2: Use CSS Selectors to Find Elements
To find or manipulate elements with CSS selectors, call select(String cssQuery) on a Document or Element; it returns an Elements collection containing every match.
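One way to sketch selector-based lookup on a parsed string is shown below. The class name SelectorExample, the helper textsOf, and the sample markup are invented for illustration; select, Elements, and text() are part of the jsoup API.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectorExample {

    // Hypothetical helper: returns the text of every element matching cssQuery.
    static List<String> textsOf(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Elements matches = doc.select(cssQuery);

        List<String> texts = new ArrayList<>();
        for (Element el : matches) {
            texts.add(el.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<div class=\"content\">"
                + "<a href=\"/a\">One</a><a href=\"/b.png\">Two</a>"
                + "</div><a href=\"/c\">Three</a>";

        // Links inside div.content only.
        System.out.println(textsOf(html, "div.content a[href]")); // [One, Two]

        // Links whose href ends with ".png".
        System.out.println(textsOf(html, "a[href$=.png]"));       // [Two]
    }
}
```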
Example 3: Modify the HTML of an Element
To modify the HTML of an element, use the HTML setter methods on Element: html(String) replaces the element's inner HTML, while prepend(String) and append(String) add markup at the start or end of it.
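The setter methods can be sketched as follows. The class name ModifyHtmlExample, the helper rewriteFirstDiv, and the sample markup are assumptions made for this example; pretty-printing is switched off only so the output stays on one line.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ModifyHtmlExample {

    // Hypothetical helper: rewrites the first <div> in a body fragment and
    // returns the resulting body HTML.
    static String rewriteFirstDiv(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        doc.outputSettings().prettyPrint(false); // keep output compact

        Element div = doc.selectFirst("div");
        if (div != null) {
            div.html("<p>lorem ipsum</p>"); // replace the inner HTML
            div.prepend("<p>First</p>");    // insert at the start
            div.append("<p>Last</p>");      // insert at the end
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        System.out.println(rewriteFirstDiv("<div>old content</div>"));
        // <div><p>First</p><p>lorem ipsum</p><p>Last</p></div>
    }
}
```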
Advanced Features
jsoup supports professional HTML processing:
- Form handling: Submit POST data:
Form Submission
Connection.Response res = Jsoup.connect("https://example.com/login")
    .data("username", "user", "password", "pass")
    .method(Connection.Method.POST)
    .execute();
Document dashboard = res.parse();
- Proxy support: Scrape via proxies:
Proxy Configuration
Document doc = Jsoup.connect("https://target.com")
    .proxy("127.0.0.1", 8080)
    .get();
- DOM manipulation: Modify HTML structure:
DOM Changes
doc.select("div.ads").remove();      // Remove all ad containers
doc.select("h1").addClass("header"); // Add a CSS class to every h1
jsoup vs. HtmlUnit
Key differences:
- Focus: jsoup parses static HTML and does not run JavaScript; HtmlUnit is a headless browser that executes JavaScript
- Speed: jsoup is faster for pure HTML parsing because it skips script execution
- API Style: jsoup is built around CSS selectors and DOM methods; HtmlUnit exposes a browser-like API (WebClient, HtmlPage)
- Use Cases: jsoup for scraping and parsing; HtmlUnit for testing dynamic pages
- Dependencies: jsoup has none; HtmlUnit requires additional libraries
Conclusion
jsoup is a versatile HTML toolkit for Java developers. It is well suited to:
- Web scraping: Extract data from server-rendered pages
- Data cleaning: Sanitize and normalize HTML
- Content analysis: Parse RSS feeds or web archives
- Testing: Verify HTML structure in apps
With its MIT license and intuitive API, jsoup is a strong choice for Java-based HTML processing.
Similar Products
- Apache POI XWPF | Open Source Java API to Create & Modify DOCX files
- DocX | Open Source .NET API to Create & Modify DOCX files
- Docx4J | Open Source Java API to Create & Modify DOC and DOCX files
- ExcelDataReader | Open Source .NET API to read XLS, XLSX, CSV and Spreadsheet documents
- FileFormat.Cells | Create and Update Excel files with C# .NET