PdfPig: Advanced PDF Text Extraction for .NET

Read and analyze PDF content without native dependencies: text, positions, fonts, and metadata

What is PdfPig?

PdfPig is an open-source .NET library for extracting content from PDF files without the overhead of native dependencies. Unlike PDF generators, PdfPig specializes in reading existing documents to access text, font information, positional data, and document structure. It is particularly useful for data mining, content analysis, and document processing pipelines.

Key advantages of PdfPig include:

No native dependencies: Pure C# implementation
Low-level access: Precise text positioning and font metrics
Memory efficient: Handles large documents with minimal overhead
OCR-ready: Extracts text with bounding boxes for downstream analysis (PdfPig does not perform OCR itself)
Apache 2.0 licensed: Free for commercial use

Ideal for document analysis, text extraction, and PDF content processing.

Why Choose PdfPig?

Accuracy: Handles complex PDF text layouts correctly
Performance: Benchmarked faster than similar .NET libraries
Transparency: Access raw PDF structures when needed
Active development: Regular updates since 2018
Cross-platform: Targets .NET Standard 2.0, so it runs on any compatible runtime

Installation

PdfPig is available via NuGet for easy integration:

Package Manager Console

    Install-Package PdfPig

.NET CLI

    dotnet add package PdfPig

System Requirements: .NET Standard 2.0 compatible runtime

Code Examples

Practical examples of PdfPig's capabilities:

PdfPig Extraction

Example 1: Basic Text Extraction

This example demonstrates how to open a PDF document and extract all text content while preserving reading order. PdfPig provides access to each letter with its exact position in the document, enabling advanced layout analysis beyond simple text extraction.

Output includes:

Raw text content in reading order
Page numbers for each text segment
Basic font information
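
The original listing for this example is not present in the text, so the following is only a minimal sketch of what such an extraction might look like. It relies on `PdfDocument.Open`, `GetPages`, `Page.Text`, `Page.Number`, and `Letter.FontName` from PdfPig; the class name `BasicTextExtraction`, the helper `Describe`, and the path `sample.pdf` are placeholders introduced here for illustration. Note that `Page.Text` returns letters in the order they appear in the page content, which usually, but not always, matches visual reading order.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

public static class BasicTextExtraction
{
    // Returns one summary line per page: page number, distinct font names, page text.
    // (Hypothetical helper, not part of PdfPig.)
    public static List<string> Describe(PdfDocument document)
    {
        var lines = new List<string>();
        foreach (Page page in document.GetPages())
        {
            // Each Letter carries per-glyph details; only the font name is used here.
            string fonts = string.Join(", ", page.Letters.Select(l => l.FontName).Distinct());
            lines.Add($"Page {page.Number} [{fonts}]: {page.Text}");
        }
        return lines;
    }

    public static void Main(string[] args)
    {
        // "sample.pdf" is a placeholder path, not a file from the article.
        string path = args.Length > 0 ? args[0] : "sample.pdf";
        using PdfDocument document = PdfDocument.Open(path);
        foreach (string line in Describe(document))
        {
            Console.WriteLine(line);
        }
    }
}
```

Keeping the per-page logic in a method that takes an open `PdfDocument` makes it easy to reuse with documents opened from a file path, a byte array, or a stream.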

Example 2: Advanced Positional Analysis

PdfPig excels at providing precise positional data for text elements. This example shows how to extract words with their bounding boxes, enabling tasks like table detection, form processing, and content region analysis.
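
As the listing for this example is also missing, here is a hedged sketch of one way to work with word positions. It uses `Page.GetWords()`, `Word.BoundingBox`, and `Letter.StartBaseLine` from PdfPig. The `PositionalAnalysis` class, the `GroupIntoLines` helper, and the 2-point tolerance are assumptions made for this illustration. Grouping words by baseline is a naive first step toward row detection in tables, not a full layout algorithm, and it assumes a recent PdfPig version in which coordinates are `double` values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

public static class PositionalAnalysis
{
    // Groups words into rough text lines by bucketing the baseline Y coordinate.
    // Hypothetical helper; the default tolerance of 2 points is an arbitrary assumption.
    public static List<List<Word>> GroupIntoLines(IEnumerable<Word> words, double tolerance = 2.0)
    {
        return words
            .Where(w => w.Letters.Count > 0)
            .GroupBy(w => Math.Round(w.Letters[0].StartBaseLine.Y / tolerance))
            .OrderByDescending(g => g.Key)                      // PDF y axis grows upward
            .Select(g => g.OrderBy(w => w.BoundingBox.Left).ToList())
            .ToList();
    }

    public static void Main(string[] args)
    {
        // "sample.pdf" is a placeholder path, not a file from the article.
        string path = args.Length > 0 ? args[0] : "sample.pdf";
        using PdfDocument document = PdfDocument.Open(path);
        foreach (Page page in document.GetPages())
        {
            foreach (List<Word> line in GroupIntoLines(page.GetWords()))
            {
                foreach (Word word in line)
                {
                    var box = word.BoundingBox;
                    // Coordinates are in PDF points with the origin at the bottom-left.
                    Console.WriteLine(
                        $"{word.Text} [{box.Left:F1}, {box.Bottom:F1}, {box.Right:F1}, {box.Top:F1}]");
                }
                Console.WriteLine();
            }
        }
    }
}
```

The baseline is used instead of `BoundingBox.Bottom` because letters with descenders (such as "p" or "g") lower a word's bounding box and would otherwise split one visual line into several groups.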

Example 3: Font and Metadata Extraction

Beyond text content, PdfPig provides access to document metadata and detailed font information. This example demonstrates extracting document properties and analyzing font usage throughout the PDF.
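
The listing is missing here as well, so the following is a sketch under stated assumptions rather than the article's original code. It reads document properties from `PdfDocument.Information` and tallies how many letters use each font via `Letter.FontName`. The `FontAndMetadata` class, the `CountLettersByFont` helper, and the `"(unknown)"` and `"(none)"` fallback labels are introduced here for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

public static class FontAndMetadata
{
    // Counts letters per font name across all pages.
    // (Hypothetical helper, not part of PdfPig.)
    public static Dictionary<string, int> CountLettersByFont(PdfDocument document)
    {
        var counts = new Dictionary<string, int>();
        foreach (Page page in document.GetPages())
        {
            foreach (Letter letter in page.Letters)
            {
                string key = letter.FontName ?? "(unknown)";
                counts[key] = counts.TryGetValue(key, out int n) ? n + 1 : 1;
            }
        }
        return counts;
    }

    public static void Main(string[] args)
    {
        // "sample.pdf" is a placeholder path, not a file from the article.
        string path = args.Length > 0 ? args[0] : "sample.pdf";
        using PdfDocument document = PdfDocument.Open(path);

        // Metadata fields are optional in PDF files, so any of them may be null.
        var info = document.Information;
        Console.WriteLine($"Title:    {info.Title ?? "(none)"}");
        Console.WriteLine($"Author:   {info.Author ?? "(none)"}");
        Console.WriteLine($"Producer: {info.Producer ?? "(none)"}");
        Console.WriteLine($"Pages:    {document.NumberOfPages}");

        // Most-used fonts first.
        foreach (var entry in CountLettersByFont(document).OrderByDescending(e => e.Value))
        {
            Console.WriteLine($"{entry.Key}: {entry.Value} letters");
        }
    }
}
```

Counting letters rather than words gives a rough measure of how much of the document is set in each font, which can help separate body text from headings or captions.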

Advanced Features

PdfPig supports professional PDF analysis:

Image extraction: Access embedded images:

Image Extraction


    // Requires: using UglyToad.PdfPig;
    using var document = PdfDocument.Open("file.pdf");
    foreach (var page in document.GetPages())
    {
        foreach (var image in page.GetImages())
        {
            var bytes = image.RawBytes;
            // Process image data
        }
    }

Bookmark navigation: Access document outline:

Bookmarks


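    // Requires: using UglyToad.PdfPig.Outline;
    if (document.TryGetBookmarks(out var bookmarks))
    {
        foreach (var node in bookmarks.GetNodes())
        {
            // Only DocumentBookmarkNode entries point at a page in this document
            if (node is DocumentBookmarkNode pageNode)
                Console.WriteLine($"{pageNode.Title} - Page {pageNode.PageNumber}");
        }
    }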

Encrypted PDFs: Handle password-protected files:

Encrypted PDF


    var options = new ParsingOptions
    {
        Password = "secure123"
    };
    using var doc = PdfDocument.Open("encrypted.pdf", options);

PdfPig vs PDFsharp

Here are five key differences between PdfPig and PDFsharp:

Primary function: PdfPig specializes in reading and extracting text, positions, and metadata, while PDFsharp focuses on creating and editing PDF documents
Text vs graphics: PdfPig extracts text together with its coordinates (in PDF points), while PDFsharp is optimized for drawing text and shapes (reports, invoices, forms)
Document access: PdfPig analyzes existing PDFs, while PDFsharp can modify pages, add content, and merge files
Advanced features: PdfPig exposes font details, bounding boxes, and document structure, while PDFsharp supports PDF/A standards, images, and encryption
Use cases: PdfPig suits data mining, OCR pre-processing, and content analysis, while PDFsharp suits report generation, PDF manipulation, and form filling

Conclusion

PdfPig gives .NET developers detailed, low-level access to PDF content. It is well suited to:

Data extraction: Mining content from reports and documents
Document analysis: Understanding PDF structure and layout
Accessibility: Converting PDF content to other formats
Pre-processing: Preparing documents for OCR or ML

With its focus on accurate content extraction and low memory usage, PdfPig is the go-to choice for PDF analysis in .NET.