PyMuPDF
Open Source Python Library to Manage PDF Metadata
Try PyMuPDF, a free and open source Python library for reading and modifying the metadata of PDF documents.
What is PyMuPDF?
PyMuPDF, also known by its legacy import name fitz, is an open source Python library that offers many features, such as parsing, splitting and merging PDFs. This page covers only how Python developers can use PyMuPDF to handle PDF metadata tasks such as:
- Read PDF Metadata: PyMuPDF supports reading the metadata of PDF documents, which contains information such as author, title, subject and creation date.
- Modify PDF Metadata: The library also allows modifying the metadata of PDF documents.
- Read XML Metadata: PDF documents may also contain XML metadata, which is not limited to standard document properties like author and title and can hold additional metadata. PyMuPDF lets developers read it as well.
- Change XML Metadata: Developers can also change the XML metadata of PDFs using PyMuPDF.
Getting Started with PyMuPDF
You need Python version 3.8.0 or higher to install and use PyMuPDF. Install Python first, then use the commands below to create a virtual environment and install PyMuPDF with pip.
Linux
python -m venv pymupdf-venv
. pymupdf-venv/bin/activate
pip install pymupdf
macOS
python -m venv pymupdf-venv
. pymupdf-venv/bin/activate
pip install pymupdf
Windows
python -m venv pymupdf-venv
.\pymupdf-venv\Scripts\activate
pip install pymupdf
Read PDF Metadata
We can read the metadata of a PDF through the metadata attribute of a document opened with PyMuPDF. It is a dictionary that holds the complete document-level metadata, such as title, author, subject and creation date.
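A minimal sketch of reading the metadata, assuming a recent PyMuPDF release that is imported as pymupdf (older releases use import fitz instead). The helper names describe_metadata and print_pdf_metadata are illustrative and not part of the library.

```python
def describe_metadata(doc):
    """Return one 'key: value' line per entry of the document's metadata."""
    # doc.metadata is a dictionary of document-level properties.
    metadata = doc.metadata or {}
    return [f"{key}: {value}" for key, value in metadata.items()]


def print_pdf_metadata(path):
    """Open the PDF at 'path' and print its metadata (hypothetical helper)."""
    # Assumption: the package is importable as "pymupdf";
    # older releases are imported as "fitz".
    import pymupdf

    with pymupdf.open(path) as doc:
        for line in describe_metadata(doc):
            print(line)
```

Calling print_pdf_metadata("example.pdf"), where example.pdf stands in for your own file, prints one line per metadata field.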
Output
[Screenshot: metadata retrieved from a PDF using PyMuPDF]
Edit PDF Metadata
We can edit the metadata of a PDF by passing a dictionary to the document's set_metadata method. The dictionary contains the fields we want to change along with their new values. The change takes effect in the file only after the document is saved.
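The following sketch shows one way to apply such a change. It copies the current metadata dictionary, overrides the chosen fields and passes the result to set_metadata, so the untouched fields are carried over explicitly. The helper names update_metadata and edit_pdf_metadata and both file paths are illustrative assumptions, and the import assumes a release that is importable as pymupdf.

```python
def update_metadata(doc, changes):
    """Merge 'changes' into the document's metadata and apply the result."""
    # Start from a copy of the current values so other fields are preserved.
    merged = dict(doc.metadata or {})
    merged.update(changes)
    doc.set_metadata(merged)
    return merged


def edit_pdf_metadata(src_path, dst_path, changes):
    """Write a copy of the PDF with updated metadata (hypothetical helper)."""
    # Assumption: the package is importable as "pymupdf";
    # older releases are imported as "fitz".
    import pymupdf

    with pymupdf.open(src_path) as doc:
        update_metadata(doc, changes)
        # Save to a different file so the original stays untouched.
        doc.save(dst_path)
```

For example, edit_pdf_metadata("input.pdf", "output.pdf", {"title": "New title", "author": "Jane Doe"}) would write a copy of input.pdf with a new title and author.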
Read XML Metadata of PDFs
We can also retrieve the XML metadata of a PDF using PyMuPDF. The document's get_xml_metadata method returns the entire XML metadata as a single string.
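A short sketch of this call, assuming a release that is importable as pymupdf. The helper names read_xml_metadata and print_xml_metadata are illustrative. The first helper treats an empty result as "no XML metadata present".

```python
def read_xml_metadata(doc):
    """Return the document's XML metadata string, or None if there is none."""
    xml = doc.get_xml_metadata()
    return xml if xml else None


def print_xml_metadata(path):
    """Open the PDF at 'path' and print its XML metadata (hypothetical helper)."""
    # Assumption: the package is importable as "pymupdf";
    # older releases are imported as "fitz".
    import pymupdf

    with pymupdf.open(path) as doc:
        xml = read_xml_metadata(doc)
        print(xml if xml is not None else "No XML metadata found.")
```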
Output
[Screenshot: XML metadata retrieved from a PDF using PyMuPDF]
Change XML Metadata of PDFs
We can set or change the XML metadata of a PDF using the document's set_xml_metadata method. This is less straightforward than changing the document-level metadata, because set_xml_metadata accepts any string and replaces the complete XML metadata with it.
To avoid unintentionally deleting metadata, we first fetch the complete XML metadata as a string using get_xml_metadata, then use the string's replace method to change the desired information. Finally, we pass the complete XML with the changed fields to set_xml_metadata, which overwrites the entire XML metadata of the PDF.
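The fetch, replace and write-back steps can be sketched as follows. The helper names replace_in_xml_metadata and change_pdf_xml_metadata are illustrative, and the import assumes a release that is importable as pymupdf. Note that str.replace changes every occurrence of the old text, so the search string should be specific enough to match only the intended field.

```python
def replace_in_xml_metadata(doc, old, new):
    """Replace 'old' with 'new' in the XML metadata; return True if changed."""
    xml = doc.get_xml_metadata()
    if old not in xml:
        # Nothing to change, so leave the existing XML untouched.
        return False
    # Write back the complete XML, with only the matched text altered.
    doc.set_xml_metadata(xml.replace(old, new))
    return True


def change_pdf_xml_metadata(src_path, dst_path, old, new):
    """Write a copy of the PDF with edited XML metadata (hypothetical helper)."""
    # Assumption: the package is importable as "pymupdf";
    # older releases are imported as "fitz".
    import pymupdf

    with pymupdf.open(src_path) as doc:
        changed = replace_in_xml_metadata(doc, old, new)
        doc.save(dst_path)
    return changed
```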
Conclusion
In summary, PyMuPDF is well suited to metadata manipulation tasks. We can easily retrieve and change the metadata of PDFs. However, a notable weakness lies in the set_xml_metadata method. It accepts any string passed to it and overwrites the previous XML with it, which may cause unintentional loss of information. To avoid this, developers need to implement their own logic to ensure that modifications to the XML metadata are correct.
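As one possible form of such logic, the sketch below refuses to write XML that is not well-formed. It uses Python's standard xml.etree.ElementTree parser, and the helper names is_well_formed_xml and safe_set_xml_metadata are illustrative, not part of PyMuPDF. This checks well-formedness only. It does not verify that the XML is meaningful metadata or that no fields were dropped.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(xml):
    """Return True if 'xml' parses as a well-formed XML document."""
    try:
        ET.fromstring(xml)
    except ET.ParseError:
        return False
    return True


def safe_set_xml_metadata(doc, xml):
    """Call doc.set_xml_metadata only if 'xml' is well-formed."""
    if not is_well_formed_xml(xml):
        raise ValueError("XML metadata is not well-formed; nothing was written.")
    doc.set_xml_metadata(xml)
```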