Open Source Python Library for Text Extraction from Images
Use PyTesseract OCR to extract printed text from images, with limited support for handwriting.
What is PyTesseract API for Python?
PyTesseract is a Python wrapper for Google's Tesseract-OCR engine, an open-source tool for extracting text from images. It lets developers convert scanned documents, screenshots, and other image-based text into machine-readable content with a few lines of code. Tesseract is trained primarily on printed text, so results on handwriting are usually much less reliable. PyTesseract is widely used in automation, data extraction, document digitization, and AI-powered applications that require Optical Character Recognition (OCR).
The library is especially useful for automating data entry tasks, recognizing text from screenshots, and digitizing printed documents. With support for multiple languages and various image preprocessing techniques, PyTesseract offers an efficient and flexible solution for extracting text from images.
PyTesseract API - Key Features
PyTesseract API provides essential features for accurate text extraction:
- Image to Text Conversion: Extract text from images using OCR; accuracy is best on clean printed text and lower on handwriting.
- Multi-Language Support: Recognizes over 100 languages when the matching Tesseract language data is installed.
- Preprocessing Compatibility: Works with OpenCV and PIL for image enhancement before OCR.
- Searchable PDF Output: Convert images into searchable PDFs; scanned PDFs must first be rendered to images, because PyTesseract does not read PDF files directly.
- Boxed Text Output: Extract text with positional bounding boxes.
- Batch Processing: Perform OCR on multiple images efficiently.
- Cross-Platform: Compatible with Windows, macOS, and Linux.
- Open Source: Actively maintained and free to use.
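The multi-language feature listed above can be sketched as follows. This is a minimal illustration, not the library's prescribed workflow: the helper names `missing_languages` and `ocr_in_languages` are hypothetical, and the sketch assumes a PyTesseract version that provides `get_languages` and a Tesseract install with the requested language data.

```python
def missing_languages(requested, installed):
    """Return the requested language codes that are not installed, in order."""
    available = set(installed)
    return [code for code in requested if code not in available]


def ocr_in_languages(image_path, languages):
    """OCR one image using several Tesseract language models at once."""
    # Imported here so the helper above can be used without Tesseract installed.
    import pytesseract
    from PIL import Image

    # get_languages() lists the language models Tesseract can find on this system.
    missing = missing_languages(languages, pytesseract.get_languages(config=""))
    if missing:
        raise ValueError(f"Language data not installed: {', '.join(missing)}")

    # Tesseract expects multiple languages joined with '+', for example "eng+deu".
    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang="+".join(languages))
```

Calling `ocr_in_languages("sample.png", ["eng", "deu"])` would then run OCR with the English and German models together, assuming both are installed.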
Advantages of Using PyTesseract API for OCR
- Automation: Extract text from images automatically for document processing workflows.
- Efficiency: Convert scanned documents into editable text quickly.
- Accuracy: Provides high OCR accuracy with proper image preprocessing.
- Scalability: Ideal for processing large volumes of images efficiently.
- Integration: Works with AI applications, databases, and text analytics tools.
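As a rough sketch of the automation and scalability points above, the following loops over a folder of images and collects the recognized text per file. The names `ocr_folder`, `default_ocr`, and `IMAGE_SUFFIXES` are illustrative assumptions, not part of PyTesseract; the OCR step is passed in as a function so it can be swapped or stubbed.

```python
from pathlib import Path

# File extensions treated as images in this sketch (an assumption, adjust as needed).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def default_ocr(path):
    """Run PyTesseract on a single image file and return the recognized text."""
    # Imported lazily so ocr_folder can be used with a different OCR function.
    import pytesseract
    from PIL import Image

    with Image.open(path) as image:
        return pytesseract.image_to_string(image)


def ocr_folder(folder, ocr=default_ocr):
    """Return {file name: text} for every image directly inside `folder`."""
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            results[path.name] = ocr(path)
    return results
```

Each image is processed sequentially here; for large volumes, the same function could be called from a process pool, since each call starts its own Tesseract process.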
Getting Started with PyTesseract API
Before using PyTesseract, ensure Tesseract-OCR is installed on your system. PyTesseract only wraps the tesseract executable and does not bundle the OCR engine itself.
Installation
Install PyTesseract and Dependencies
pip install pytesseract pillow opencv-python
On Windows, install Tesseract-OCR with the installer linked below:
Install Tesseract-OCR (Windows)
# Download Tesseract from:
https://github.com/UB-Mannheim/tesseract/wiki
On Debian or Ubuntu Linux, install Tesseract with:
Install Tesseract-OCR (Linux)
sudo apt install tesseract-ocr
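After installation, it can help to confirm that Python can actually find the Tesseract executable. The sketch below is one possible check, not an official procedure: `find_tesseract` and `configure_pytesseract` are hypothetical helper names, and the Windows fallback path is the default install location used elsewhere in this article.

```python
import shutil
from pathlib import Path

# Default Windows install location (an assumption; adjust if you installed elsewhere).
WINDOWS_DEFAULT = Path(r"C:\Program Files\Tesseract-OCR\tesseract.exe")


def find_tesseract():
    """Return the path to the tesseract executable, or None if it is not found."""
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    if WINDOWS_DEFAULT.is_file():
        return str(WINDOWS_DEFAULT)
    return None


def configure_pytesseract():
    """Point PyTesseract at the located binary and return the Tesseract version."""
    import pytesseract

    path = find_tesseract()
    if path is None:
        raise RuntimeError("Tesseract-OCR not found; install it before running OCR.")
    pytesseract.pytesseract.tesseract_cmd = path
    return pytesseract.get_tesseract_version()
```

If `configure_pytesseract()` returns a version, both the Python package and the OCR engine are reachable.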
Code Examples for Text Extraction with PyTesseract API
Below are examples demonstrating text extraction from images using PyTesseract.
Example 1: Extract Text from an Image
Extract Text from Image
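import pytesseract
from PIL import Image

# Windows only: point PyTesseract at the Tesseract executable if it is not on PATH.
# Remove or adjust this line on macOS and Linux.
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load the image and run OCR on it.
image = Image.open("sample.png")
text = pytesseract.image_to_string(image)
print(text)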
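The basic call above uses Tesseract's default page layout analysis. When the image holds a single line or one uniform block of text, passing a page segmentation mode through the `config` argument may give better results. The following is a minimal sketch: `build_config` and `read_single_line` are hypothetical helpers, while `--psm` and `--oem` are standard Tesseract command-line options.

```python
def build_config(psm=3, oem=None):
    """Build a Tesseract option string for PyTesseract's `config` argument."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    parts = [f"--psm {psm}"]
    if oem is not None:
        if not 0 <= oem <= 3:
            raise ValueError("oem must be between 0 and 3")
        parts.append(f"--oem {oem}")
    return " ".join(parts)


def read_single_line(image_path):
    """OCR an image that is assumed to contain exactly one line of text."""
    # Imported lazily so build_config works without Tesseract installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        # PSM 7 tells Tesseract to treat the image as a single text line.
        return pytesseract.image_to_string(image, config=build_config(psm=7))
```

Which mode works best depends on the input, so it is worth trying a few values on representative images.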
Example 2: Extract Text with Bounding Boxes
Extract Text with Bounding Boxes
import pytesseract
import cv2

image = cv2.imread("sample.png")
h, w, _ = image.shape

# Each output line has the form: character x1 y1 x2 y2 page
boxes = pytesseract.image_to_boxes(image)
for b in boxes.splitlines():
    b = b.split()
    x, y, x2, y2 = int(b[1]), int(b[2]), int(b[3]), int(b[4])
    # Tesseract measures y from the bottom edge; OpenCV measures it from the top.
    cv2.rectangle(image, (x, h - y), (x2, h - y2), (0, 255, 0), 2)

cv2.imwrite("output.png", image)
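`image_to_boxes` returns one box per character. For word-level boxes with a confidence score, PyTesseract also offers `image_to_data`. The sketch below filters its dictionary output by confidence; the helper names and the threshold of 60 are illustrative assumptions rather than recommended values.

```python
def confident_words(data, min_conf=60.0):
    """Return (text, left, top, width, height) for words at or above min_conf."""
    words = []
    for i, text in enumerate(data["text"]):
        # Rows describing pages, blocks, and lines have empty text and conf -1.
        if not text.strip():
            continue
        # conf may arrive as an int, float, or string depending on the version.
        if float(data["conf"][i]) >= min_conf:
            words.append((text, data["left"][i], data["top"][i],
                          data["width"][i], data["height"][i]))
    return words


def word_boxes(image_path, min_conf=60.0):
    """Run OCR on an image and keep only the confidently recognized words."""
    # Imported lazily so confident_words can be used on saved OCR output.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    return confident_words(data, min_conf)
```

Unlike `image_to_boxes`, these coordinates are measured from the top-left corner, so they can be passed to OpenCV drawing functions without flipping the y axis.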
Example 3: Extract Text from Grayscale Image
Extract Text from Grayscale Image
import pytesseract
import cv2

# Load the image and convert it to grayscale before OCR.
image = cv2.imread("sample.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

text = pytesseract.image_to_string(gray)
print(text)
Conclusion
PyTesseract API provides an efficient way to extract text from images, making it useful for document digitization, OCR automation, and data extraction workflows. It integrates with OpenCV and PIL, and with suitable image preprocessing it can reach high accuracy on printed text, which makes it a practical tool for developers working on OCR-based projects.