Open Source Python Library for Text Extraction from Images

Use PyTesseract OCR to extract printed text, and to a more limited extent handwritten text, from images.

What is the PyTesseract API for Python?

PyTesseract is a Python wrapper for Google's Tesseract-OCR engine, an open-source tool for extracting text from images. It lets developers convert scanned documents, screenshots, and other image-based text into machine-readable content with a few lines of code. Handwritten notes can be attempted as well, although Tesseract is trained mainly on printed text and results on handwriting are usually much weaker. PyTesseract is widely used in automation, data extraction, document digitization, and AI-powered applications requiring Optical Character Recognition (OCR).

The library is especially useful for automating data entry tasks, recognizing text from screenshots, and digitizing printed documents. With support for multiple languages and various image preprocessing techniques, PyTesseract offers an efficient and flexible solution for extracting text from images.

PyTesseract API - Key Features

PyTesseract API provides essential features for accurate text extraction:

Image to Text Conversion: Extract printed text, and with lower accuracy handwritten text, from images using OCR.
Multi-Language Support: Recognizes over 100 languages, provided the corresponding Tesseract language data is installed.
Preprocessing Compatibility: Works with OpenCV and PIL for image enhancement before OCR.
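Searchable PDF Output: Turn images of scanned pages into searchable PDF output with image_to_pdf_or_hocr; an existing scanned PDF must first be converted to page images with a separate tool.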
Boxed Text Output: Extract text with positional bounding boxes.
Batch Processing: Perform OCR on multiple images efficiently.
Cross-Platform: Compatible with Windows, macOS, and Linux.
Open Source: Actively maintained and free to use.

Advantages of Using PyTesseract API for OCR

Automation: Extract text from images automatically for document processing workflows.
Efficiency: Convert scanned documents into editable text quickly.
Accuracy: Results depend heavily on image quality; preprocessing such as grayscale conversion usually improves recognition of clean printed text.
Scalability: Ideal for processing large volumes of images efficiently.
Integration: Works with AI applications, databases, and text analytics tools.

Getting Started with PyTesseract API

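Before using PyTesseract, ensure the Tesseract-OCR engine itself is installed on your system. The pip package is only a wrapper around the engine's executable and does not include it.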

Installation

Install PyTesseract and Dependencies


pip install pytesseract pillow opencv-python

On Windows, install Tesseract-OCR with the installer linked from:

Install Tesseract-OCR (Windows)


# Download Tesseract from:
https://github.com/UB-Mannheim/tesseract/wiki

On Debian- or Ubuntu-based Linux, install Tesseract with:

Install Tesseract-OCR (Linux)


sudo apt install tesseract-ocr
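
After installation, it can help to confirm that the tesseract executable is actually reachable before running any OCR code. The following is a minimal sketch using only the Python standard library; the helper name find_tesseract is hypothetical and not part of PyTesseract.

```python
import shutil
from typing import Optional


def find_tesseract(binary_name: str = "tesseract") -> Optional[str]:
    """Return the full path of the executable, or None if it is not on PATH.

    Hypothetical helper for illustration; PyTesseract does not provide it.
    """
    return shutil.which(binary_name)


path = find_tesseract()
if path is None:
    print("Tesseract was not found on PATH.")
else:
    print(f"Tesseract found at: {path}")
```

If the check fails on Windows even though the installer has run, setting pytesseract.pytesseract.tesseract_cmd to the installed executable, as in Example 1, is the usual workaround.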

Code Examples for Text Extraction with PyTesseract API

Below are examples demonstrating text extraction from images using PyTesseract.

Example 1: Extract Text from an Image


import pytesseract
from PIL import Image

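# Windows only: point PyTesseract at the Tesseract executable if it is not on PATH.
# On Linux and macOS, remove this line when tesseract is already on PATH.
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"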

image = Image.open("sample.png")
text = pytesseract.image_to_string(image)

print(text)
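
Example 1 relies on Tesseract's default language, English. For other languages, image_to_string accepts a lang argument, and several language codes can be combined with "+". The sketch below assumes the matching language data is installed (for example the tesseract-ocr-deu package on Debian or Ubuntu for German). The names join_languages and extract_text are hypothetical helpers, and the ocr parameter exists only so the helper can be exercised without Tesseract.

```python
from typing import Callable, Iterable, Optional


def join_languages(codes: Iterable[str]) -> str:
    """Join Tesseract language codes with '+', e.g. ['eng', 'deu'] -> 'eng+deu'."""
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def extract_text(
    image,
    languages: Iterable[str] = ("eng",),
    ocr: Optional[Callable[..., str]] = None,
) -> str:
    """Run OCR on an image with one or more languages (illustrative helper)."""
    if ocr is None:
        # Imported lazily so the helper can be tested without Tesseract installed
        import pytesseract

        ocr = pytesseract.image_to_string
    return ocr(image, lang=join_languages(languages))


# Example call (requires Tesseract plus the English and German language data):
# from PIL import Image
# print(extract_text(Image.open("sample.png"), ["eng", "deu"]))
```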

Example 2: Extract Text with Bounding Boxes


import pytesseract
import cv2

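image = cv2.imread("sample.png")
if image is None:
    raise FileNotFoundError("sample.png could not be read")
h, w, _ = image.shape
# Each line of the result is: character, left, bottom, right, top, page
boxes = pytesseract.image_to_boxes(image)

for b in boxes.splitlines():
    b = b.split()
    x, y, x2, y2 = int(b[1]), int(b[2]), int(b[3]), int(b[4])
    # Tesseract measures y from the bottom edge, OpenCV from the top, so flip it
    cv2.rectangle(image, (x, h - y), (x2, h - y2), (0, 255, 0), 2)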

cv2.imwrite("output.png", image)
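
The h - y terms in Example 2 are needed because image_to_boxes reports coordinates with the origin at the bottom-left corner, while OpenCV and PIL place it at the top-left. As a sketch of that conversion in isolation, the hypothetical helper below (parse_boxes and CharBox are illustrative names, not part of PyTesseract) parses the box text and returns rectangles in top-left-origin coordinates. The sample string is made up for demonstration.

```python
from typing import List, NamedTuple


class CharBox(NamedTuple):
    """One recognized character with a top-left-origin bounding box."""
    char: str
    left: int
    top: int
    right: int
    bottom: int


def parse_boxes(boxes_text: str, image_height: int) -> List[CharBox]:
    """Convert 'char left bottom right top page' lines to top-left-origin boxes."""
    result = []
    for line in boxes_text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        left, bottom, right, top = (int(value) for value in parts[1:5])
        # Flip the y axis: distance from the top = height - distance from the bottom
        result.append(
            CharBox(parts[0], left, image_height - top, right, image_height - bottom)
        )
    return result


# Made-up sample in the box format, for an image that is 100 pixels tall
sample = "H 10 80 20 95 0\ni 22 80 26 95 0"
for box in parse_boxes(sample, image_height=100):
    print(box)
```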

Example 3: Extract Text from Grayscale Image


import pytesseract
import cv2

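image = cv2.imread("sample.png")
if image is None:
    raise FileNotFoundError("sample.png could not be read")
# Convert from BGR to single-channel grayscale before OCR
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
text = pytesseract.image_to_string(gray)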

print(text)
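
The examples above each process a single file. The batch processing mentioned in the feature list can be approximated with a plain loop over a folder, since image_to_string also accepts a file path. This is a minimal sketch under stated assumptions: ocr_directory and IMAGE_SUFFIXES are hypothetical names, the suffix list is an arbitrary choice, and the ocr parameter is there so the loop can be tried without Tesseract.

```python
from pathlib import Path
from typing import Callable, Dict, Optional

# Assumed set of image extensions to process; adjust as needed
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def ocr_directory(
    folder: str, ocr: Optional[Callable[[str], str]] = None
) -> Dict[str, str]:
    """Run OCR on every image file in a folder and map file name to text."""
    if ocr is None:
        # Imported lazily so the loop can be tested without Tesseract installed
        import pytesseract

        ocr = pytesseract.image_to_string
    results: Dict[str, str] = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            results[path.name] = ocr(str(path))
    return results


# Example call (requires Tesseract and a folder named "scans"):
# for name, text in ocr_directory("scans").items():
#     print(name, len(text), "characters")
```

The loop is sequential; for large volumes, running it across several worker processes is a common next step, but that is outside the scope of this sketch.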

Conclusion

PyTesseract API provides an efficient way to extract text from images, making it useful for document digitization, OCR automation, and data extraction workflows. It gives good results on clean printed text, especially after preprocessing, and it integrates with OpenCV and PIL, which makes it a practical tool for developers working on OCR-based projects.