Open Source Python PDF Parser Library
Try this free and open-source Python library to parse, read, and extract text, images, tables, and other content from PDF documents.
What is PyMuPDF?
PyMuPDF, historically imported under the module name fitz, is an open-source Python library that provides a comprehensive set of tools for working with PDF files. With PyMuPDF, users can open PDFs, extract text, images, and tables, manipulate page properties such as rotation and cropping, create new PDF documents, and convert PDF pages to images.
PyMuPDF supports the following features:
- PDF Document Reading: PyMuPDF can open and read PDF documents, allowing you to access the text, images, and other content within them.
- Text Extraction: You can extract text from PDF documents, along with font and layout information.
- Image Extraction: You can extract images from PDF documents in various formats, such as JPEG or PNG.
- Table Extraction: You can also extract tables from PDF documents.
This review focuses on the extraction and parsing features of the library. Splitting, merging, and page management features are evaluated in depth in a separate review.
Getting Started with PyMuPDF
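You need Python 3.8 or higher to install and use PyMuPDF; recent PyMuPDF releases may require a newer Python version, so check the PyMuPDF documentation for the release you install. First install Python, then use the commands below to create a virtual environment and install PyMuPDF with pip.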
Linux
python -m venv pymupdf-venv
. pymupdf-venv/bin/activate
pip install pymupdf
macOS
python -m venv pymupdf-venv
. pymupdf-venv/bin/activate
pip install pymupdf
Windows
python -m venv pymupdf-venv
.\pymupdf-venv\Scripts\activate
pip install pymupdf
Extract Text from PDF
You can use PyMuPDF to extract the text of a PDF document and then run simple text analysis on it, such as counting words, with only the library's functions and standard Python.
Output
The image below shows the extracted text and the number of words in the PDF file:
Extract Images from PDF
PyMuPDF can also extract the images embedded in a PDF document in Python. The workflow opens the specified PDF file, extracts its images, and saves them in the current working directory.
Output
The following PNG image was extracted from the PDF document:
Extract Tables from PDF
PyMuPDF can also process a PDF document and extract tables from it. The workflow opens the specified PDF file and collects the tables found on each page.
Output
The screenshot below shows the table extracted from the PDF document:
Insert Text into PDF
PyMuPDF can also insert text into an existing PDF file. In this example, the modified PDF is saved as text.pdf.
Output
The text inserted using the above code is highlighted in the red box given below:
PDF Text Recognition using OCR with PyMuPDF
PyMuPDF can perform Optical Character Recognition (OCR) on a PDF document. It uses the Tesseract OCR engine internally for text extraction. To enable OCR, you need the Tesseract language data files, which are available from the Tesseract project. The input for this example is a PDF file whose text is present only as an image.
Output
The image below shows the text extracted from the image present in the provided PDF file:
Conclusion
In summary, PyMuPDF is a professional tool with some clear strengths and weaknesses. It handles tasks like OCR and text extraction well, which makes it valuable for working with text in PDFs.
However, it is weaker at extracting tables, especially when a PDF has a complex structure or many pages, which might be a drawback for some users. It may also require additional components in certain situations, such as the Pandas library or Tesseract OCR language data files, which adds complexity to its usage. Despite these limitations, PyMuPDF remains a robust choice for working with text in PDFs.