PdfPig: Advanced PDF Text Extraction for .NET
Read and analyze PDF content without dependencies - text, positions, fonts and metadata
What is PdfPig?
PdfPig is an open source .NET library focused on extracting content from PDF files without the overhead of native dependencies. Unlike PDF generators, PdfPig specializes in reading existing documents to access text, font information, positional data, and document structure. It's particularly valuable for data mining, content analysis, and document processing pipelines.
Key advantages of PdfPig include:
- Zero dependencies: Pure C# implementation
- Low-level access: Precise text positioning and font metrics
- Memory efficient: Handles large documents with minimal overhead
- Layout-aware: Extract text with bounding boxes for layout analysis or as input to OCR pipelines (PdfPig itself does not perform OCR)
- Apache 2.0 licensed: Free for commercial use
Ideal for document analysis, text extraction, and PDF content processing.
Why Choose PdfPig?
- Accuracy: Handles complex PDF text layouts correctly
- Performance: Benchmarked faster than similar .NET libraries
- Transparency: Access raw PDF structures when needed
- Active development: Regular updates since 2018
- Cross-platform: Works on .NET Standard 2.0+
Installation
PdfPig is available via NuGet for easy integration:
Package Manager Console
Install-Package PdfPig
.NET CLI
dotnet add package PdfPig
System Requirements: .NET Standard 2.0 compatible runtime
Code Examples
Practical examples of PdfPig's capabilities:
Example 1: Basic Text Extraction
This example demonstrates how to open a PDF document and extract all text content page by page. PdfPig returns text in the order it appears in each page's content stream, which usually, but not always, matches the visual reading order. It also provides access to each letter with its exact position in the document, enabling advanced layout analysis beyond simple text extraction.
Output includes:
- Raw text content in content-stream order
- Page numbers for each text segment
- Basic font information
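A minimal sketch of the extraction described above, assuming the PdfPig NuGet package is referenced. The class name BasicTextExtraction and the method Summarize are hypothetical names chosen for illustration; PdfDocument, Page.Text, Page.Number, Page.Letters and Letter.FontName come from the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

// Hypothetical helper class, not part of PdfPig.
public static class BasicTextExtraction
{
    // Returns one summary line per page: page number, letter count,
    // distinct font names, and the page text.
    public static List<string> Summarize(PdfDocument document)
    {
        var lines = new List<string>();

        foreach (Page page in document.GetPages())
        {
            // Page.Text concatenates the letters in content-stream order.
            string text = page.Text;

            // Each Letter carries the name of the font it was drawn with.
            IEnumerable<string> fonts = page.Letters
                .Select(letter => letter.FontName)
                .Where(name => !string.IsNullOrEmpty(name))
                .Distinct();
            string fontList = string.Join(", ", fonts);

            lines.Add($"Page {page.Number} ({page.Letters.Count} letters, fonts: {fontList}): {text}");
        }

        return lines;
    }
}
```

To try it, open a file with PdfDocument.Open("document.pdf") (a placeholder path), pass the document to BasicTextExtraction.Summarize, and print each returned line.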
Example 2: Advanced Positional Analysis
PdfPig excels at providing precise positional data for text elements. This example shows how to extract words with their bounding boxes, enabling tasks like table detection, form processing, and content region analysis.
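The word-level access described above could be sketched as follows. PositionalAnalysis, DescribeWords and WordsInRegion are hypothetical helpers written for this illustration; Page.GetWords, Word.Text, Word.BoundingBox and PdfRectangle are PdfPig types. Coordinates are in PDF points and, for most documents, measured from the bottom-left corner of the page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.Core;

// Hypothetical helper class, not part of PdfPig.
public static class PositionalAnalysis
{
    // Describes every word on the page together with its bounding box.
    public static List<string> DescribeWords(Page page)
    {
        return page.GetWords()
            .Select(w => FormattableString.Invariant(
                $"{w.Text}: left={w.BoundingBox.Left:F1}, bottom={w.BoundingBox.Bottom:F1}, width={w.BoundingBox.Width:F1}, height={w.BoundingBox.Height:F1}"))
            .ToList();
    }

    // Returns the words whose bounding box lies entirely inside the region,
    // for example a form field or a table cell located by other means.
    public static List<Word> WordsInRegion(Page page, PdfRectangle region)
    {
        return page.GetWords()
            .Where(w => w.BoundingBox.Left >= region.Left
                     && w.BoundingBox.Right <= region.Right
                     && w.BoundingBox.Bottom >= region.Bottom
                     && w.BoundingBox.Top <= region.Top)
            .ToList();
    }
}
```

A region is built with new PdfRectangle(left, bottom, right, top). Strict containment is a deliberate simplification here; real table or form detection typically needs tolerance for words that overlap a region boundary.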
Example 3: Font and Metadata Extraction
Beyond text content, PdfPig provides access to document metadata and detailed font information. This example demonstrates extracting document properties and analyzing font usage throughout the PDF.
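One way to read the properties and font usage described above is sketched below. FontAndMetadata, ReadMetadata and CountFontUsage are hypothetical names for this illustration; PdfDocument.Information, DocumentInformation, Letter.FontName and Letter.PointSize are provided by PdfPig. Many PDFs leave metadata entries empty, so the sketch skips missing values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

// Hypothetical helper class, not part of PdfPig.
public static class FontAndMetadata
{
    // Collects common entries from the document information dictionary.
    // Entries the PDF does not define are left out of the result.
    public static Dictionary<string, string> ReadMetadata(PdfDocument document)
    {
        DocumentInformation info = document.Information;
        var candidates = new Dictionary<string, string>
        {
            ["Title"] = info.Title,
            ["Author"] = info.Author,
            ["Subject"] = info.Subject,
            ["Keywords"] = info.Keywords,
            ["Creator"] = info.Creator,
            ["Producer"] = info.Producer,
            ["CreationDate"] = info.CreationDate
        };

        return candidates
            .Where(entry => !string.IsNullOrEmpty(entry.Value))
            .ToDictionary(entry => entry.Key, entry => entry.Value);
    }

    // Counts letters per "font name @ point size" pair across all pages.
    public static Dictionary<string, int> CountFontUsage(PdfDocument document)
    {
        return document.GetPages()
            .SelectMany(page => page.Letters)
            .GroupBy(letter => FormattableString.Invariant(
                $"{letter.FontName} @ {letter.PointSize:F1}pt"))
            .ToDictionary(group => group.Key, group => group.Count());
    }
}
```

Sorting the CountFontUsage result by count is a simple heuristic for telling body text (the most frequent pair) apart from headings, which tend to use larger sizes and appear less often.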
Advanced Features
PdfPig supports professional PDF analysis:
- Image extraction: Access embedded images:
Image Extraction
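using UglyToad.PdfPig;

using var document = PdfDocument.Open("file.pdf");
foreach (var page in document.GetPages())
{
    foreach (var image in page.GetImages())
    {
        // RawBytes holds the image data as stored in the PDF (filters still applied)
        var bytes = image.RawBytes;
        // Process image data
    }
}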
- Bookmark navigation: Access document outline:
Bookmarks
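// Requires: using UglyToad.PdfPig.Outline;
if (document.TryGetBookmarks(out var bookmarks))
{
    foreach (var node in bookmarks.GetNodes())
    {
        // Only bookmarks that target this document carry a page number
        if (node is DocumentBookmarkNode docNode)
        {
            Console.WriteLine($"{docNode.Title} - Page {docNode.PageNumber}");
        }
    }
}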
- Encrypted PDFs: Handle password-protected files:
Encrypted PDF
var options = new ParsingOptions { Password = "secure123" };
using var doc = PdfDocument.Open("encrypted.pdf", options);
PdfPig vs PDFsharp
Here are the 5 key differences between PdfPig and PDFsharp:
- Primary Function: PdfPig specializes in reading and extracting text, positions, and metadata, while PDFsharp focuses on creating and editing PDF documents
- Text vs Graphics: PdfPig extracts text with precise coordinates (measured in PDF points), while PDFsharp is optimized for drawing text and shapes (reports, invoices, forms)
- Document Access: PdfPig analyzes existing PDFs, while PDFsharp can modify pages, add content, and merge files
- Advanced Features: PdfPig reveals font details, bounding boxes, and document structure, while PDFsharp supports PDF/A standards, images, and encryption
- Use Cases: PdfPig supports data mining, OCR preprocessing, and content analysis, while PDFsharp supports report generation, PDF manipulation, and form filling
Conclusion
PdfPig gives .NET developers detailed, low-level access to PDF content. Ideal for:
- Data extraction: Mining content from reports and documents
- Document analysis: Understanding PDF structure and layout
- Accessibility: Converting PDF content to other formats
- Pre-processing: Preparing documents for OCR or ML
With its focus on accurate content extraction and low memory usage, PdfPig is a strong choice for PDF analysis in .NET.