
jsoup: The Java HTML Parser

Scrape, clean, and manipulate HTML with jQuery-like simplicity in Java

What is jsoup?

jsoup is a lightweight Java library for working with real-world HTML. It provides an API for fetching and parsing HTML from URLs, files, or strings, and for extracting and manipulating data using HTML5 DOM methods and CSS selectors. It suits web scraping, data extraction, and HTML sanitization, and it handles malformed markup gracefully, building a parse tree much as a browser would.

Unlike raw regex approaches, jsoup offers a clean object model with methods inspired by jQuery, simplifying tasks like form submission, attribute modification, and text extraction. With zero dependencies and MIT licensing, it’s a favorite for Java developers needing reliable HTML processing.

Key advantages of jsoup include:

  • Real-world HTML handling: Parses messy HTML like browsers do
  • jQuery-style syntax: Familiar CSS selectors (e.g., doc.select("div.content"))
  • Scraping-friendly: Follows redirects, handles cookies, and submits forms
  • Cross-platform: Pure Java with no native dependencies
  • Security: Safelist-based HTML sanitization that helps prevent XSS

Perfect for data mining, web automation, and content analysis.
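
As a rough illustration of the sanitization point above, the sketch below passes untrusted markup through Jsoup.clean with the Safelist.basic() preset. The class name, helper method, and sample input are assumptions made for this example; Safelist is available in recent jsoup releases (older releases used the name Whitelist).

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class SanitizeExample {

    // Keeps only the tags and attributes permitted by Safelist.basic()
    // (simple text formatting and links); everything else is dropped.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        // Hypothetical untrusted input with an inline event handler.
        String unsafe = "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
        // The onclick attribute is not on the safelist, so it is removed.
        System.out.println(sanitize(unsafe));
    }
}
```

Other presets such as Safelist.none() or Safelist.relaxed() trade strictness for richer markup; which one fits depends on where the cleaned HTML will be displayed.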


Why Choose jsoup?

  • Simplicity: Intuitive API with CSS selector support
  • Reliability: Actively maintained since 2009
  • Performance: Optimized for streaming and large documents
  • Flexibility: Works with fragments, files, or live URLs
  • Clean output: Pretty-prints and reformats HTML

Installation

Add jsoup via Maven or Gradle:

Maven



    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.17.2</version>
    </dependency>


Gradle


implementation 'org.jsoup:jsoup:1.17.2'

System Requirements: Java 8+

Code Examples

Practical jsoup use cases:


Example 1: Parse a Document from a String

If you have HTML in a Java string and want to parse it to read or modify its contents, jsoup can do this in just a few lines of code.

The parse(String html, String baseUri) method converts the input HTML into a new Document. The baseUri parameter is used to resolve relative URLs into absolute ones and should be the URL the document was retrieved from. If that does not apply, or if the HTML contains a <base> element, you can use the simpler parse(String html) method instead.
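
A minimal sketch of the two-argument parse call described above. The class name, helper method, sample markup, and the https://example.com/ base URI are illustrative assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseStringExample {

    // Parses the HTML and returns the first link's URL resolved against baseUri,
    // or an empty string when the document contains no link.
    static String firstAbsoluteLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.selectFirst("a[href]");
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        // Sample markup; the relative href only becomes absolute because a base URI is supplied.
        String html = "<html><head><title>First parse</title></head>"
                + "<body><p>Parsed HTML into a doc.</p>"
                + "<a href=\"/docs/page.html\">Docs</a></body></html>";

        Document doc = Jsoup.parse(html, "https://example.com/");
        System.out.println(doc.title()); // First parse
        System.out.println(firstAbsoluteLink(html, "https://example.com/"));
        // https://example.com/docs/page.html
    }
}
```

Without a base URI (and without a <base> element in the markup), absUrl has nothing to resolve a relative href against and returns an empty string.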

Example 2: Use CSS Selectors to Find Elements

To find or manipulate elements with CSS selectors, call select(String cssQuery) on a Document or Element. It returns the matching elements as an Elements list.
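
A short sketch of selector-based extraction on an HTML string. The class name, helper method, selector, and sample markup are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectorExample {

    // Collects the href of every link nested inside a div with class "content".
    static List<String> contentLinks(String html) {
        Document doc = Jsoup.parse(html);
        Elements links = doc.select("div.content a[href]");
        List<String> hrefs = new ArrayList<>();
        for (Element link : links) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Only the first div matches the selector; the sidebar link is ignored.
        String html = "<div class=\"content\"><a href=\"/a\">A</a><a href=\"/b\">B</a></div>"
                + "<div class=\"sidebar\"><a href=\"/c\">C</a></div>";
        System.out.println(contentLinks(html)); // [/a, /b]
    }
}
```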

Example 3: Modify the HTML of an Element

To change the HTML of an element, use the HTML setter methods in Element, such as html(String), prepend(String), append(String), and wrap(String).
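
One way the Element HTML setters might be combined, shown as a sketch. The class name, helper method, and sample markup are assumptions; the helper simply returns the body unchanged when no div is present.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ModifyHtmlExample {

    // Replaces the inner HTML of the first div, then adds content before and after it.
    static String rewriteFirstDiv(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        Element div = doc.selectFirst("div");
        if (div == null) {
            return doc.body().html(); // nothing to rewrite
        }
        div.html("<p>lorem ipsum</p>"); // replace the existing children
        div.prepend("<p>First</p>");    // insert as the first child
        div.append("<p>Last</p>");      // insert as the last child
        return div.html();
    }

    public static void main(String[] args) {
        // The original text "old" is discarded by the html(String) call.
        System.out.println(rewriteFirstDiv("<div>old</div>"));
    }
}
```

The setters parse the supplied string as HTML, so untrusted input should be sanitized before it is passed in; text(String) is the safer choice when only plain text is being set.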

Advanced Features

jsoup supports professional HTML processing:

  • Form handling: Submit POST data:

    Form Submission

    
        Connection.Response res = Jsoup.connect("https://example.com/login")
            .data("username", "user", "password", "pass")
            .method(Connection.Method.POST)
            .execute();
        Document dashboard = res.parse();
        
    
  • Proxy support: Scrape via proxies:

    Proxy Configuration

    
        Document doc = Jsoup.connect("https://target.com")
            .proxy("127.0.0.1", 8080)
            .get();
        
    
  • DOM manipulation: Modify HTML structure:

    DOM Changes

    
        doc.select("div.ads").remove(); // Remove all ads
        doc.select("h1").addClass("header"); // Add a CSS class, keeping any existing classes
        
    

jsoup vs. HTMLUnit

Key differences:

  • Focus: jsoup parses static HTML; HTMLUnit simulates browsers (JavaScript execution)
  • Speed: jsoup is faster for pure HTML parsing
  • API Style: jsoup uses CSS selectors and DOM-style methods; HTMLUnit models a browser session (client, pages, and form elements)
  • Use Cases: jsoup for scraping; HTMLUnit for testing dynamic pages
  • Dependencies: jsoup has none; HTMLUnit requires additional libraries

Conclusion

jsoup is the ultimate HTML toolkit for Java developers. Ideal for:

  • Web scraping: Extract data from any website
  • Data cleaning: Sanitize and normalize HTML
  • Content analysis: Parse RSS feeds or web archives
  • Testing: Verify HTML structure in apps

With its MIT license and intuitive API, jsoup is the top choice for Java-based HTML processing.
