PDF to HTML/XML Conversion Python Library

Free & open source Python library to convert PDF documents into HTML & XML.

What is pdfminer.six?

pdfminer.six is a free and open source Python library which can be used to convert PDF documents into other formats.

Here's a brief list of its main PDF conversion features:

PDF to HTML Conversion: Convert PDF documents into HTML format, carrying over the text content and attempting to reproduce the document's structure and layout.
PDF to XML Conversion: Transform PDF files into XML format, capturing all details, including fonts and other elements.

Getting Started with pdfminer.six

You need Python version 3.6.0 or higher to install and use pdfminer.six (recent releases may require a newer Python version, so check the project's release notes). First install Python, then use the commands below to create a virtual environment and install pdfminer.six with pip.

Linux


python3 -m venv venv
source venv/bin/activate
pip install pdfminer.six

macOS


python3 -m venv venv
source venv/bin/activate
pip install pdfminer.six

Windows


python -m venv venv
venv\Scripts\activate.bat
pip install pdfminer.six

Convert PDF to HTML

We can convert a PDF document to HTML format using the extract_text_to_fp function provided by pdfminer.six, with the output type set to html.

Output

The following screenshot shows the HTML file generated by converting the PDF document:

Convert PDF to XML

We can also convert a PDF document to XML format using the same extract_text_to_fp function, this time with the output type set to xml.

Output

The following screenshot shows the XML content converted from the PDF document:

Conclusion

In general, pdfminer.six converts PDF documents to XML without any issues. When converting a PDF to HTML, it carries over the text content but often disrupts the overall layout.
