PyMuPDF

 
 

Open Source Python Library to Manage PDF Metadata

PyMuPDF is a free, open source Python library for reading and modifying the metadata of PDF documents.

What is PyMuPDF?

PyMuPDF (historically imported under the module name fitz) is an open source Python library that offers many PDF features, including parsing, splitting, and merging documents. This page covers only how Python developers can use PyMuPDF for PDF metadata tasks such as:

  • Read PDF Metadata: PyMuPDF can access the metadata of PDF documents, which holds information such as author, title, subject, and creation date.
  • Modify PDF Metadata: The library can also modify the metadata of PDF documents.
  • Read XML Metadata: PDF documents can also contain XML metadata, which is not limited to standard document properties such as author and title and can carry additional information. PyMuPDF can read it.
  • Change XML Metadata: Developers can also change the XML metadata of PDFs using PyMuPDF.
Getting Started with PyMuPDF

You need Python 3.8 or later to install and use PyMuPDF (recent releases may require a newer Python, so check the requirements of the release you install). Install Python first, then use the commands below to create a virtual environment and install PyMuPDF with pip.

Linux


python -m venv pymupdf-venv
. pymupdf-venv/bin/activate
pip install pymupdf

macOS


python -m venv pymupdf-venv
. pymupdf-venv/bin/activate
pip install pymupdf

Windows


python -m venv pymupdf-venv
.\pymupdf-venv\Scripts\activate
pip install pymupdf  

Read PDF Metadata

The metadata of a PDF is available through the metadata attribute of an opened PyMuPDF document. It is a dictionary holding the document's standard properties, such as title, author, subject, and creation date.

Output

[Screenshot: metadata retrieved from a PDF using PyMuPDF]

Edit PDF Metadata

To edit the metadata of a PDF, pass a dictionary containing the fields to change, along with their new values, to the document's set_metadata method.

Read XML Metadata of PDFs

PyMuPDF can also retrieve the XML metadata of a PDF. The get_xml_metadata method returns the entire XML metadata as a string.

Output

[Screenshot: XML metadata retrieved from a PDF using PyMuPDF]

Change XML Metadata of PDFs

The set_xml_metadata method sets or changes the XML metadata of a PDF. It is less straightforward than updating the document-level metadata, because set_xml_metadata accepts any string and replaces the complete XML metadata with it.

To avoid unintentionally deleting metadata, first fetch the complete XML metadata as a string using get_xml_metadata, then use the string's replace method to change the desired information. Finally, pass the complete XML, including the changed fields, to set_xml_metadata, which overwrites the entire XML metadata of the PDF.

Conclusion

In summary, PyMuPDF is a capable tool for metadata manipulation: it makes it easy to retrieve and change the metadata of PDFs. A notable weakness, however, lies in the set_xml_metadata method. It accepts any string and overwrites the previous XML with it, which may cause unintentional loss of information. To avoid this, developers must implement their own logic to ensure that modifications to the XML metadata are correct.
