PdfPig: Advanced PDF Text Extraction for .NET

Read and analyze PDF content without dependencies - text, positions, fonts and metadata

What is PdfPig?

PdfPig is an open source .NET library focused on extracting content from PDF files without the overhead of native dependencies. Unlike PDF generators, PdfPig specializes in reading existing documents to access text, font information, positional data, and document structure. It's particularly valuable for data mining, content analysis, and document processing pipelines.

Key advantages of PdfPig include:

  • Zero dependencies: Pure C# implementation
  • Low-level access: Precise text positioning and font metrics
  • Memory efficient: Handles large documents with minimal overhead
  • OCR-ready: Extract text with bounding boxes for analysis
  • Apache 2.0 licensed: Free for commercial use

Ideal for document analysis, text extraction, and PDF content processing.


Why Choose PdfPig?

  • Accuracy: Handles complex PDF text layouts correctly
  • Performance: Benchmarked faster than similar .NET libraries
  • Transparency: Access raw PDF structures when needed
  • Active development: Regular updates since 2018
  • Cross-platform: Works on .NET Standard 2.0+

Installation

PdfPig is available via NuGet for easy integration:

Package Manager Console


Install-Package PdfPig

.NET CLI


dotnet add package PdfPig

System Requirements: .NET Standard 2.0 compatible runtime

Code Examples

Practical examples of PdfPig's capabilities:


Example 1: Basic Text Extraction

This example demonstrates how to open a PDF document and extract its text content page by page. PdfPig returns text in the order it appears in each page's content stream, which usually, but not always, matches reading order. It also provides access to each letter with its exact position in the document, enabling advanced layout analysis beyond simple text extraction.

Output includes:

  • Raw text content for each page
  • Page numbers for each text segment
  • Basic font information
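
A minimal sketch of what such an extraction could look like, assuming the PdfPig NuGet package is installed. The file name `document.pdf` and the class name `BasicTextExtraction` are placeholders chosen for illustration, not names from the original example.

```csharp
using System;
using System.Linq;
using UglyToad.PdfPig;

public static class BasicTextExtraction
{
    public static void Main(string[] args)
    {
        // Placeholder path: pass a real PDF path as the first argument.
        string path = args.Length > 0 ? args[0] : "document.pdf";

        using var document = PdfDocument.Open(path);

        foreach (var page in document.GetPages())
        {
            // Page number followed by the page text as PdfPig assembles it.
            Console.WriteLine($"--- Page {page.Number} ---");
            Console.WriteLine(page.Text);

            // Basic font information: distinct font names among the page's letters.
            var fonts = page.Letters
                .Select(letter => letter.FontName)
                .Where(name => !string.IsNullOrEmpty(name))
                .Distinct();
            Console.WriteLine("Fonts: " + string.Join(", ", fonts));
        }
    }
}
```

The text comes back in content-stream order, so multi-column or heavily styled pages may need the word-level or letter-level positions to reconstruct reading order.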

Example 2: Advanced Positional Analysis

PdfPig excels at providing precise positional data for text elements. This example shows how to extract words with their bounding boxes, enabling tasks like table detection, form processing, and content region analysis.
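
One way this might look in practice is sketched below, assuming the PdfPig NuGet package. The grouping of words into lines by rounded baseline is a simple heuristic added for illustration, not a PdfPig feature, and `document.pdf` is a placeholder path.

```csharp
using System;
using System.Linq;
using UglyToad.PdfPig;

public static class WordPositions
{
    public static void Main(string[] args)
    {
        // Placeholder path: pass a real PDF path as the first argument.
        string path = args.Length > 0 ? args[0] : "document.pdf";

        using var document = PdfDocument.Open(path);

        foreach (var page in document.GetPages())
        {
            var words = page.GetWords().ToList();

            // Print each word with its bounding box in PDF points.
            // The PDF origin is the bottom-left corner of the page.
            foreach (var word in words)
            {
                var box = word.BoundingBox;
                Console.WriteLine(
                    $"p{page.Number} '{word.Text}' " +
                    $"[{box.Left:F1}, {box.Bottom:F1}, {box.Right:F1}, {box.Top:F1}]");
            }

            // Illustrative heuristic: treat words whose bottom edges round to the
            // same value as one line, ordered from the top of the page downwards.
            var lines = words
                .GroupBy(w => Math.Round(w.BoundingBox.Bottom))
                .OrderByDescending(g => g.Key);

            foreach (var line in lines)
            {
                var text = string.Join(" ",
                    line.OrderBy(w => w.BoundingBox.Left).Select(w => w.Text));
                Console.WriteLine($"y={line.Key}: {text}");
            }
        }
    }
}
```

Rounding to the nearest point is crude and will split lines that mix font sizes; real table or form detection would typically use a tolerance or a dedicated layout analysis step.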

Example 3: Font and Metadata Extraction

Beyond text content, PdfPig provides access to document metadata and detailed font information. This example demonstrates extracting document properties and analyzing font usage throughout the PDF.
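
The following sketch shows one possible approach, assuming the PdfPig NuGet package. Counting letters per font name is an illustrative way to summarize font usage, and `document.pdf` is a placeholder path.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using UglyToad.PdfPig;

public static class MetadataAndFonts
{
    public static void Main(string[] args)
    {
        // Placeholder path: pass a real PDF path as the first argument.
        string path = args.Length > 0 ? args[0] : "document.pdf";

        using var document = PdfDocument.Open(path);

        // Document properties; any of these may be null if the PDF omits them.
        var info = document.Information;
        Console.WriteLine($"Title:    {info.Title}");
        Console.WriteLine($"Author:   {info.Author}");
        Console.WriteLine($"Creator:  {info.Creator}");
        Console.WriteLine($"Producer: {info.Producer}");
        Console.WriteLine($"Created:  {info.CreationDate}");
        Console.WriteLine($"Pages:    {document.NumberOfPages}");

        // Count how many letters are set in each font across the document.
        var letterCounts = new Dictionary<string, int>();
        foreach (var page in document.GetPages())
        {
            foreach (var letter in page.Letters)
            {
                string font = letter.FontName ?? "(unnamed font)";
                letterCounts.TryGetValue(font, out int count);
                letterCounts[font] = count + 1;
            }
        }

        // Most used fonts first.
        foreach (var entry in letterCounts.OrderByDescending(e => e.Value))
        {
            Console.WriteLine($"{entry.Key}: {entry.Value} letters");
        }
    }
}
```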

Advanced Features

PdfPig supports professional PDF analysis:

  • Image extraction: Access embedded images:

    Image Extraction

    
        using var document = PdfDocument.Open("file.pdf");
        foreach (var page in document.GetPages())
        {
            foreach (var image in page.GetImages())
            {
                var bytes = image.RawBytes;
                // Process image data
            }
        }
        
    
  • Bookmark navigation: Access document outline:

    Bookmarks

    
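        // Requires: using System.Linq; using UglyToad.PdfPig.Outline;
        // TryGetBookmarks returns false when the PDF has no outline.
        if (document.TryGetBookmarks(out var bookmarks))
        {
            // GetNodes() walks the outline tree; only document nodes carry a page number.
            foreach (var node in bookmarks.GetNodes().OfType<DocumentBookmarkNode>())
            {
                Console.WriteLine($"{node.Title} - Page {node.PageNumber}");
            }
        }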
        
    
  • Encrypted PDFs: Handle password-protected files:

    Encrypted PDF

    
        var options = new ParsingOptions
        {
            Password = "secure123"
        };
        using var doc = PdfDocument.Open("encrypted.pdf", options);
        
    

PdfPig vs PDFsharp

Here are five key differences between PdfPig and PDFsharp:

  • Primary function: PdfPig specializes in reading and extracting text, positions, and metadata. PDFsharp focuses on creating and editing PDF documents.
  • Text vs graphics: PdfPig extracts text together with its precise page coordinates. PDFsharp is optimized for drawing text and shapes (reports, invoices, forms).
  • Document access: PdfPig analyzes existing PDFs, while PDFsharp can modify pages, add content, and merge files.
  • Advanced features: PdfPig reveals font details, bounding boxes, and document structure, while PDFsharp supports PDF/A standards, images, and encryption.
  • Use cases: PdfPig suits data mining, OCR preprocessing, and content analysis, while PDFsharp suits report generation, PDF manipulation, and form filling.

Conclusion

PdfPig delivers unparalleled PDF content access for .NET developers. Ideal for:

  • Data extraction: Mining content from reports and documents
  • Document analysis: Understanding PDF structure and layout
  • Accessibility: Converting PDF content to other formats
  • Pre-processing: Preparing documents for OCR or ML

With its focus on accurate content extraction and low memory usage, PdfPig is the go-to choice for PDF analysis in .NET.
