Published in: Data Engineering

The Hidden Complexity of PDF Processing

Author: Elena Rodriguez

Published on: April 14, 2023

PDF is the cockroach of file formats. It was designed in 1993 by Adobe to faithfully reproduce printed documents on screen, and it has survived every attempt to replace it. Every business uses PDFs. Every developer eventually has to process them. And every developer who has processed PDFs at scale has a horror story about corrupted text, phantom tables, or mysterious encoding issues that consumed days of debugging.

HARBOR SOFTWARE · Engineering Insights

The horror stories exist because PDF is not what most people think it is. It is not a structured document format like HTML or DOCX where content is organized into semantic elements like paragraphs, headings, and tables. A PDF is a set of drawing instructions: put this glyph at coordinate (72, 650), draw a line from (100, 200) to (500, 200), place this image at these dimensions. Unless the file carries the optional tagged-PDF structure, which many generators omit, there are no paragraphs, no tables, and no columns: only positioned elements on a canvas.

This fundamental mismatch between what humans see when they look at a PDF (a structured document with headings, paragraphs, and tables) and what the file actually contains (positioned glyphs on a canvas) is the source of all PDF processing pain. Here is a deep dive into that complexity and practical strategies for handling it in production systems.

The PDF Format: What You Are Actually Dealing With

A PDF file is a hierarchy of objects: a catalog that points to page trees, pages that contain content streams, and content streams that contain drawing operators. The operators place text, draw paths, and position images on the page. Understanding this structure is essential for understanding why text extraction is so much harder than it seems.

Text in a PDF is stored as a series of “show text” operators with font and position information:

BT                          % Begin text object
/F1 12 Tf                   % Set font F1 at 12pt
72 700 Td                   % Move to position (72, 700)
(Invoice Number: ) Tj       % Show text "Invoice Number: "
(INV-2023-0042) Tj          % Show text "INV-2023-0042"
ET                          % End text object

Notice what is absent: there is no semantic markup whatsoever. The PDF does not know that “Invoice Number” is a label and “INV-2023-0042” is a value. It does not know they are related. It only knows that these glyphs appear at specific coordinates in a specific font. The relationship between them is implicit in their visual proximity, which is obvious to a human reader but invisible to a parser.

This gets considerably worse in practice. The text “Invoice Number” might not even be stored as a single string. PDF generators frequently break text into fragments for kerning adjustments, font switches, or simply because their layout engine decided to flush the text buffer mid-word:

BT
/F1 12 Tf
72 700 Td
(Inv) Tj
(oice ) Tj
(Num) Tj
(ber: ) Tj
/F2 12 Tf            % Font change for the value
(INV-2023-) Tj
(0042) Tj
ET

Now “Invoice Number” is four separate text fragments that your code must reassemble into a single string. “INV-2023-0042” is two fragments because the generator decided to split mid-string. This is not exceptional or unusual — it is the norm. Most PDF generators, including Adobe InDesign, Microsoft Word’s PDF export, Chrome’s print-to-PDF, and wkhtmltopdf, produce fragmented text as standard behavior. Your extraction code must handle text reassembly as a fundamental operation, not an edge case.
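To make the reassembly step concrete, here is a minimal sketch in Python. It assumes each fragment has already been resolved to page coordinates by a library; the Fragment fields, the reassemble function, and the y_tol and gap_ratio defaults are illustrative names and values, not part of any library's API.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    x0: float    # left edge of the fragment, in points
    x1: float    # right edge of the fragment, in points
    y: float     # baseline position, in points (PDF origin is bottom-left)
    size: float  # font size, in points


def reassemble(fragments, y_tol=2.0, gap_ratio=0.25):
    """Merge positioned fragments into lines of text, top of page first."""
    # Group fragments whose baselines lie within y_tol of each other.
    lines = []
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0].y - frag.y) <= y_tol:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    result = []
    for line in lines:
        line.sort(key=lambda f: f.x0)
        text = line[0].text
        for prev, cur in zip(line, line[1:]):
            gap = cur.x0 - prev.x1
            # Add a space only for a wide gap with no explicit space present.
            needs_space = (
                gap > gap_ratio * cur.size
                and not text.endswith(" ")
                and not cur.text.startswith(" ")
            )
            text += (" " if needs_space else "") + cur.text
        result.append(text)
    return result


demo = [
    Fragment("Inv", 72, 88, 700, 12),
    Fragment("oice ", 88, 112, 700, 12),
    Fragment("Num", 112, 132, 700, 12),
    Fragment("ber: ", 132, 154, 700, 12),
    Fragment("INV-2023-", 154, 206, 700, 12),
    Fragment("0042", 206, 230, 700, 12),
]
print(reassemble(demo))
```

The font change between the label and the value is deliberately ignored here: only position decides what gets joined. A production version would likely need per-font gap thresholds and handling for rotated text.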

Text Extraction: The Deceptive Simplicity

Extracting text from a PDF is deceptively simple with the right library. pdfplumber, PyMuPDF (fitz), pdfminer.six — all of them can extract text in a few lines of code:

import pdfplumber

with pdfplumber.open('invoice.pdf') as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)

For simple single-column documents with standard fonts, this works well. The problems begin when you need more than raw text, or when the document has any structural complexity at all:

Reading order in multi-column layouts. A two-column document places text from both columns at various y-coordinates on the page. Naive extraction that sorts by y-coordinate then x-coordinate interleaves the columns. You get “Introduction This section discusses” followed by “Chapter 1 the methodology used” because both lines share the same y-coordinate but belong to different columns. Determining correct reading order requires detecting column boundaries first, which requires analyzing the spatial distribution of text elements across the entire page — a non-trivial computational geometry problem.

Whitespace and word boundaries. PDF text operators position glyphs by coordinates, and many PDFs contain no space character between words at all. The visual space between “Invoice” and “Number” is just empty canvas. Text extraction libraries infer spaces by detecting gaps between characters that exceed a threshold (for example, 30% of the average character width for the current font). This heuristic works most of the time but produces errors for tightly kerned fonts (where legitimate spaces are smaller than usual), wide letter-spacing used for aesthetic effect, and tabular data where columns are separated by visual alignment rather than explicit spacing characters.

Font encoding nightmares. PDF supports arbitrary font encodings. A PDF can define a custom font where the byte value 0x41 (normally the letter ‘A’ in ASCII and Unicode) maps to the glyph for ‘Z’ or even a custom symbol. Some PDF generators, especially those producing PDF from print drivers or from specialized publishing software, use custom encodings that make extracted text look like complete gibberish. The only reliable way to decode the text is to use the font’s ToUnicode mapping table, which maps the character codes used in the content stream to Unicode code points. But not all PDFs include ToUnicode tables: older PDFs and those from some low-quality generators omit them entirely, leaving no machine-readable path from bytes to characters.

Invisible text and overlapping elements. PDFs can contain invisible text (white text on a white background, or text drawn with the invisible text rendering mode, which neither fills nor strokes the glyphs) used for various purposes: searchability layers added by OCR software, watermark mechanisms, or copy-protection schemes. They can also contain overlapping text elements where one piece of text is rendered on top of another. Your extraction pipeline needs to decide how to handle these: include invisible text (which might be duplicate OCR text) or exclude it (which might lose legitimately hidden content).
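The reading-order problem above can be illustrated with a small sketch. It assumes word boxes arrive as dictionaries with text, x0, x1, and top keys (the shape pdfplumber's extract_words() produces) and that the page has at most two columns. The find_gutter and reading_order helpers and their thresholds are hypothetical, and real layouts with headers that span both columns need more than this.

```python
def find_gutter(words, page_width, min_gap=15):
    """Return the x-centre of the widest empty vertical band, or None."""
    width = int(page_width) + 1
    covered = [False] * width
    for w in words:
        start = max(0, int(w["x0"]))
        end = min(width - 1, int(w["x1"]))
        for x in range(start, end + 1):
            covered[x] = True
    if True not in covered:
        return None

    # Scan only between the leftmost and rightmost inked positions,
    # so the page margins are never mistaken for a gutter.
    first = covered.index(True)
    last = width - 1 - covered[::-1].index(True)
    best_start, best_len, run_start = None, 0, None
    for x in range(first, last + 1):
        if not covered[x]:
            if run_start is None:
                run_start = x
        elif run_start is not None:
            if x - run_start > best_len:
                best_start, best_len = run_start, x - run_start
            run_start = None
    if best_len < min_gap:
        return None
    return best_start + best_len / 2


def reading_order(words, page_width, line_tol=3.0):
    """Join words column by column instead of straight across the page."""
    gutter = find_gutter(words, page_width)
    if gutter is None:
        columns = [words]
    else:
        columns = [
            [w for w in words if w["x1"] <= gutter],
            [w for w in words if w["x1"] > gutter],
        ]
    ordered = []
    for column in columns:
        # Bucket tops so words on the same line sort left to right.
        ordered.extend(
            sorted(column, key=lambda w: (round(w["top"] / line_tol), w["x0"]))
        )
    return " ".join(w["text"] for w in ordered)
```

Projecting every word onto the x-axis and looking for the widest uncovered band is about the simplest column detector there is. It fails as soon as a full-width heading or figure bridges the gutter, which is why production systems usually analyze the page in horizontal bands or fall back to a layout model.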

Table Extraction: Where Parsers Go to Die

Tables in PDFs have no semantic representation. A table is just text placed at grid-aligned coordinates, optionally with lines drawn between cells. Detecting and extracting tables requires inferring structure from visual layout — a task that is trivial for humans and brutally hard for software.

There are three main approaches to table extraction, each with significant trade-offs:

Line-based detection

Look for horizontal and vertical lines (PDF path operators) that form a grid. Cells are the rectangular regions bounded by these lines. This approach works perfectly for tables with visible borders and completely fails for borderless tables (which are extremely common in invoices, financial documents, and technical specification sheets).

import pdfplumber

with pdfplumber.open('invoice.pdf') as pdf:
    page = pdf.pages[0]

    # Extract tables using line-based detection
    tables = page.extract_tables({
        'vertical_strategy': 'lines',
        'horizontal_strategy': 'lines',
        'snap_tolerance': 3,
        'join_tolerance': 3,
    })

    for table in tables:
        for row in table:
            print(row)

Text-alignment detection

Analyze the x-coordinates of text elements to detect column boundaries, and y-coordinates to detect row boundaries. This handles borderless tables but struggles with irregular spacing, merged cells, and text that wraps within a cell (creating multiple y-positions for a single logical row). The algorithm must also distinguish between tabular data and regular multi-column text that happens to have aligned margins.

    # Fallback: use text alignment for borderless tables
    tables = page.extract_tables({
        'vertical_strategy': 'text',
        'horizontal_strategy': 'text',
        'min_words_vertical': 3,
        'min_words_horizontal': 1,
        'text_x_tolerance': 3,
        'text_y_tolerance': 3,
    })

ML-based detection

Use a machine learning model trained on document images to detect table regions and cell boundaries. Models like Microsoft’s Table Transformer (DETR-based architecture trained on PubTables-1M) and the PubLayNet dataset’s table detection models can identify table regions with high accuracy. The trade-off is computational cost — you need to render the PDF page as an image at sufficient resolution (typically 200+ DPI) and run inference through a neural network. This takes 200-500ms per page on a GPU versus 5-10ms for rule-based approaches.

In practice, we use a cascade: try line-based detection first (fastest, most accurate when lines exist), fall back to text-alignment if no lines are found, and escalate to ML-based detection for pages where text-alignment produces low-confidence results. This three-tier approach handles about 93% of tables we encounter in procurement documents correctly on the first pass.
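A cascade like this can be sketched as an ordered list of extractors with a quality gate between tiers. The sketch below is one way to structure it, not the pipeline described above: table_confidence is a hypothetical score (fraction of filled cells, zero for ragged rows), the 0.6 threshold is arbitrary, and the gate is applied to every tier for simplicity.

```python
from typing import Callable, List, Optional, Tuple

Table = List[List[Optional[str]]]
Extractor = Callable[[object], List[Table]]


def table_confidence(table: Table) -> float:
    """Hypothetical score: share of non-empty cells, 0.0 for ragged tables."""
    if not table or len({len(row) for row in table}) != 1:
        return 0.0
    cells = [cell for row in table for cell in row]
    if not cells:
        return 0.0
    filled = sum(1 for cell in cells if cell and cell.strip())
    return filled / len(cells)


def extract_with_cascade(
    page: object,
    extractors: List[Tuple[str, Extractor]],
    min_confidence: float = 0.6,
) -> Tuple[Optional[str], List[Table]]:
    """Try each strategy in order and return the first acceptable result."""
    for name, extract in extractors:
        tables = extract(page)
        if tables and all(table_confidence(t) >= min_confidence for t in tables):
            return name, tables
    return None, []


# Example wiring with pdfplumber for the first two tiers; the third tier
# would wrap whatever ML detector is available (run_ml_detector is a
# placeholder name, not a real function):
#
# extractors = [
#     ("lines", lambda p: p.extract_tables(
#         {"vertical_strategy": "lines", "horizontal_strategy": "lines"})),
#     ("text", lambda p: p.extract_tables(
#         {"vertical_strategy": "text", "horizontal_strategy": "text"})),
#     ("ml", run_ml_detector),
# ]
```

Returning the name of the strategy that succeeded makes it easy to track how often each tier fires per document source, which is useful when deciding where to invest tuning effort.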

Scanned PDFs and OCR Integration

A significant portion of PDFs in business workflows are scanned documents — physical papers that were run through a scanner or photographed with a phone camera. These PDFs contain only images, with no extractable text content at all. Calling page.extract_text() returns an empty string.

The solution is OCR (Optical Character Recognition), but integrating OCR into a PDF processing pipeline introduces its own complexity layers:

Resolution directly determines accuracy. OCR accuracy depends heavily on image resolution. Scanned at 300 DPI, Tesseract achieves approximately 95% character accuracy on clean printed documents in English. At 150 DPI, accuracy drops to roughly 88%. At 72 DPI (common for web-optimized PDFs or screenshots), accuracy falls below 80%, which is unusable for reliable data extraction. Always check the effective DPI before running OCR and upsample the image if it falls short. Upscaling from 150 DPI to 300 DPI using bicubic interpolation recovers about half the accuracy loss; it cannot restore detail the scan never captured.

Pre-processing dramatically improves results. Raw scanned images often have skew (the page was slightly rotated in the scanner), noise (speckles from the scanner glass or compression artifacts), and uneven lighting (shadows from a phone camera or a book spine). A pre-processing pipeline of deskewing, denoising, and binarization (converting to clean black and white) can improve OCR accuracy by 5-15 percentage points:

import cv2
import numpy as np

def preprocess_for_ocr(image: np.ndarray) -> np.ndarray:
    """Deskew, denoise, and binarize a BGR page image before OCR."""
    # Convert to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Deskew: fit a rotated rectangle around the dark (text) pixels.
    # minAreaRect only accepts int32 or float32 points, so cast explicitly.
    coords = np.column_stack(np.where(gray < 128)).astype(np.float32)
    if len(coords) > 100:
        rect_angle = cv2.minAreaRect(coords)[-1]
        # Older OpenCV releases report this angle in [-90, 0), newer ones
        # in (0, 90]. Folding it into (-45, 45] handles both conventions.
        skew = rect_angle % 90
        if skew > 45:
            skew -= 90
        angle = -skew
        if abs(angle) > 0.5:  # Only deskew if angle is significant
            (h, w) = gray.shape[:2]
            center = (w // 2, h // 2)
            M = cv2.getRotationMatrix2D(center, angle, 1.0)
            gray = cv2.warpAffine(
                gray, M, (w, h),
                flags=cv2.INTER_CUBIC,
                borderMode=cv2.BORDER_REPLICATE
            )

    # Denoise using non-local means
    gray = cv2.fastNlMeansDenoising(gray, h=10)

    # Binarize using adaptive threshold for uneven lighting
    # (block size 11, constant 2 subtracted from the local mean)
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 11, 2
    )

    return binary

Mixed PDFs are common and must be detected. A single PDF might have some pages with embedded text (digitally generated pages) and other pages that are scanned images (a signed contract page, an appended scan of a handwritten note, or a photographed appendix). Your pipeline needs to detect which pages need OCR and which already have extractable text. The heuristic is straightforward: if a page has fewer than 10 extractable text characters but has image content, treat it as a scanned page and run OCR. Apply this check per-page, not per-document, because mixed PDFs are surprisingly common in business workflows where people append scanned pages to digital documents.
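The per-page check and the DPI arithmetic from this section fit in a few lines. This is a sketch under stated assumptions: the PageInfo fields are hypothetical and would be filled from whatever library you use, and effective_dpi assumes the scanned image spans the full page width, which holds for typical scans but not for pages with small inline images.

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72.0  # PDF user space is 72 points per inch by default


@dataclass
class PageInfo:
    text_chars: int       # number of extractable characters on the page
    image_px_width: int   # pixel width of the largest embedded image, 0 if none
    page_width_pt: float  # page width in PDF points


def needs_ocr(page: PageInfo, min_chars: int = 10) -> bool:
    """Treat a page as scanned if it has almost no text but does have an image."""
    return page.text_chars < min_chars and page.image_px_width > 0


def effective_dpi(page: PageInfo) -> float:
    """Pixels per inch of the image as placed, assuming it fills the page width."""
    return page.image_px_width / (page.page_width_pt / POINTS_PER_INCH)


def upscale_factor(page: PageInfo, target_dpi: float = 300.0) -> float:
    """Scale factor needed to reach target_dpi; never downscales."""
    dpi = effective_dpi(page)
    if dpi <= 0:
        raise ValueError("page has no image to measure")
    return max(1.0, target_dpi / dpi)


# A US Letter page is 612 points (8.5 inches) wide, so a 1275-pixel-wide
# scan works out to 150 DPI and needs a 2x upscale to reach 300 DPI.
scan = PageInfo(text_chars=0, image_px_width=1275, page_width_pt=612)
if needs_ocr(scan):
    print(effective_dpi(scan), upscale_factor(scan))
```

Running this check for every page, rather than once per document, is what lets a pipeline send a scanned signature page through OCR while the surrounding digital pages take the fast text-extraction path.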

Performance and Memory Considerations at Scale

Processing a single PDF is fast. Processing 100,000 PDFs per day introduces performance challenges that are easy to underestimate when testing with a handful of documents:

Memory management. Some PDF libraries load the entire document into memory, including embedded images. A 200-page PDF with high-resolution embedded images can consume 500MB or more of RAM. Processing multiple such documents concurrently on a worker with 4GB of memory will cause out-of-memory kills. Use libraries that support streaming or page-by-page processing, and set explicit memory limits per worker process. We use Python’s resource module on Linux to set hard memory limits that trigger a clean error instead of an OOM kill.

Malicious and malformed PDFs. PDF is a complex format with a long history of security vulnerabilities. In a production pipeline accepting documents from external sources, you will encounter malicious or corrupted PDFs. A PDF can contain JavaScript (yes, really), embedded executables, infinite loops in the page tree structure, and decompression bombs (a 1KB PDF that expands to 1GB of raw content when its embedded streams are decompressed). Always process untrusted PDFs in a sandboxed environment with strict resource limits on memory, CPU time, and wall-clock time.

Parallelism strategy. PDF processing is primarily CPU-bound (text extraction, image rendering for OCR) and sometimes GPU-bound (ML-based table detection, neural OCR). We use a task queue (Celery with Redis broker) with separate queues for lightweight tasks (text extraction from native PDFs, typically under 100ms per page) and heavyweight tasks (OCR plus ML-based layout analysis, typically 2-5 seconds per page). Lightweight tasks run on 8-core CPU workers; heavyweight tasks run on GPU-enabled workers. This prevents slow OCR jobs from blocking fast text extraction tasks and allows each worker pool to be scaled independently based on queue depth.
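One way to enforce the memory, CPU-time, and wall-clock limits discussed above is to run each document in a child process with limits applied before it starts. The sketch below is a variation on the in-worker approach, assuming Linux. The run_limited helper is a hypothetical name, the limits shown are arbitrary, and preexec_fn is not safe in multithreaded parents, so a real deployment might prefer containers or cgroups.

```python
import resource
import subprocess
import sys


def run_limited(cmd, max_mem_bytes, cpu_seconds, wall_seconds):
    """Run cmd in a child process with hard memory, CPU, and time limits."""

    def apply_limits():
        # Runs in the child just before exec, so the parent is unaffected.
        resource.setrlimit(resource.RLIMIT_AS, (max_mem_bytes, max_mem_bytes))
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))

    # subprocess.run kills the child and raises TimeoutExpired if the
    # wall-clock limit is exceeded.
    return subprocess.run(
        cmd,
        preexec_fn=apply_limits,
        capture_output=True,
        text=True,
        timeout=wall_seconds,
    )


if __name__ == "__main__":
    # Stand-in for a real extraction command: try to allocate 4 GiB
    # under a 1 GiB address-space cap.
    result = run_limited(
        [sys.executable, "-c", "data = bytearray(4 * 1024**3)"],
        max_mem_bytes=1024**3,
        cpu_seconds=5,
        wall_seconds=30,
    )
    print("exit code:", result.returncode)
```

With the address-space cap in place, an oversized allocation inside a Python child fails with MemoryError and a non-zero exit code, which the parent can log and quarantine, rather than triggering the kernel's OOM killer.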

Choosing the Right Library

After extensive testing across thousands of real-world documents from procurement, legal, and financial domains, here is our library recommendation matrix:

pdfplumber: Best for table extraction from native PDFs. Built on pdfminer.six with a significantly cleaner API. Excellent for procurement documents with line-item tables. Our default choice when tables are the primary extraction target.
PyMuPDF (fitz): Fastest text extraction — roughly 10x faster than pdfplumber for raw text. Best for high-volume pipelines where you need raw text quickly without table structure. Also excellent for rendering PDF pages to images for OCR or ML processing.
pdfminer.six: Most detailed text extraction — gives you individual character positions, font information, and drawing operators. Use when you need fine-grained control over text reconstruction algorithms or when you need to build custom layout analysis.
Camelot: Table extraction specialist with both “stream” (text-alignment) and “lattice” (line-based) modes. Sometimes more accurate than pdfplumber for complex tables with merged cells, but significantly slower.
Tabula (via tabula-py): Java-based table extraction with good accuracy. The JVM requirement adds deployment complexity and startup latency that makes it less suitable for serverless or container-based deployments.

For OCR, the landscape breaks down by cost-accuracy trade-off:

Tesseract 5.x: Open-source standard. Good accuracy on clean printed text, no per-page cost. The --psm flag is critical: use mode 6 for uniform blocks of text, mode 3 for fully automatic page segmentation, mode 11 for sparse text.
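AWS Textract: Best accuracy for English documents in our benchmarks. Includes native table extraction and form parsing that work better than Tesseract plus post-processing. List pricing is $1.50 per 1,000 pages for text detection and $15 per 1,000 pages for table extraction; form extraction is billed separately at a higher rate, so check the current pricing page before budgeting.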
Google Document AI: Strong multi-language support, particularly good for Asian languages. Includes specialized processors for invoices, receipts, and contracts that combine OCR with semantic extraction in a single API call.
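Since the page segmentation mode matters so much for Tesseract, it helps to choose it explicitly per document type rather than rely on the default. A minimal sketch that builds the command line: the layout labels and the tesseract_command helper are illustrative names, while the mode numbers are the ones listed above.

```python
import subprocess

# Page segmentation modes named in the text above.
PSM_BY_LAYOUT = {
    "auto": 3,     # fully automatic page segmentation
    "block": 6,    # a single uniform block of text
    "sparse": 11,  # sparse text in no particular order
}


def tesseract_command(image_path, layout="auto", lang="eng"):
    """Build a tesseract CLI invocation that writes recognized text to stdout."""
    if layout not in PSM_BY_LAYOUT:
        raise ValueError(f"unknown layout: {layout!r}")
    return [
        "tesseract", image_path, "stdout",
        "-l", lang,
        "--psm", str(PSM_BY_LAYOUT[layout]),
    ]


def ocr_image(image_path, layout="auto", lang="eng"):
    """Run tesseract (must be installed and on PATH) and return the text."""
    completed = subprocess.run(
        tesseract_command(image_path, layout, lang),
        capture_output=True, text=True, check=True,
    )
    return completed.stdout
```

For example, ocr_image("page.png", layout="block") would suit a cropped table cell or a single paragraph, while full scanned pages are usually better served by the automatic mode.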

Conclusion

PDF processing is one of those problems that looks trivially simple until you try to solve it reliably at scale. The gap between “works on my test file” and “works on 100,000 files from 200 different sources with zero human intervention” is enormous. Every assumption you make about PDF structure — that text is stored as complete words, that columns are explicitly marked, that tables have visible borders, that fonts use standard encodings — will be violated by some real-world document sitting in your pipeline right now.

The path to a robust PDF processing pipeline is layered defenses: multiple extraction strategies with graceful fallbacks, aggressive error handling that quarantines failures without stopping the pipeline, resource limits that protect your infrastructure from malicious or malformed inputs, and continuous monitoring of accuracy metrics across different document types and sources. Accept that some documents will defeat your automated pipeline and build the human review path from day one, not as an afterthought when accuracy problems surface in production. The best PDF processing systems are the ones that know their own limits and handle gracefully the cases they cannot solve.
