PDF to HTML/XML Conversion Python Library
Free & open source Python library to convert PDF documents into HTML & XML.
What is pdfminer.six?
pdfminer.six is a free and open source Python library which can be used to convert PDF documents into other formats.
Here's a brief list of its main PDF conversion features:
- PDF to HTML Conversion: Convert PDF documents into HTML format, carrying over the text and attempting to reproduce the document's structure and layout.
- PDF to XML Conversion: Transform PDF files into XML format, capturing all details, including fonts and other elements.
Getting Started with pdfminer.six
You need Python version 3.6.0 or higher to install and use pdfminer.six (newer releases of pdfminer.six require a more recent Python version). First install Python, then use the commands below to install pdfminer.six with pip inside a virtual environment.
Linux
python3 -m venv venv
source venv/bin/activate
pip install pdfminer.six
macOS
python3 -m venv venv
source venv/bin/activate
pip install pdfminer.six
Windows
python -m venv venv
venv\Scripts\activate.bat
pip install pdfminer.six
Convert PDF to HTML
We can convert a PDF document to HTML format using the extract_text_to_fp function provided by pdfminer.six, with the output type set to html.
Output
The following screenshot shows the HTML file generated by converting the PDF document:
Convert PDF to XML
We can also convert a PDF document to XML format using the same extract_text_to_fp function, this time with the output type set to xml.
Output
The following screenshot shows the XML content converted from the PDF document:
Conclusion
In general, pdfminer.six converts PDF documents to XML format without issues. When converting a PDF to HTML, it carries over the text content but often disrupts the overall layout.