
How OCR works: the technology behind text extraction explained

From scan lines and template matching to convolutional networks — what happens between pixels and searchable text.


Optical character recognition (OCR) is the bridge between documents humans read and data machines can index. Every searchable scan, every invoice pulled into a database, and every court filing run through discovery search depends on it. This article walks through how OCR systems are built, why accuracy varies so widely by document type, and what you can control before a file ever reaches the engine.

What OCR is — and a brief history

OCR converts images of text into encoded character sequences. The problem sounds simple; the implementation is not. Early commercial systems in the 1970s targeted specific fonts — OCR-A and OCR-B were literally designed so machines could read them on checks and government forms. Those systems used hardware-assisted scanning and rigid templates: each glyph had a known shape, and recognition was closer to pattern matching than understanding language.

Desktop scanning in the 1990s spread OCR to offices; the 2000s improved statistical models and layout analysis. Deep learning — CNNs and sequence models that learn from data — now dominates recognition on noisy scans, skewed pages, and varied fonts while keeping the same pipeline shape as earlier systems.

On production sites such as way2pdf OCR, that pipeline runs server-side on uploaded PDFs: pages are rasterized, processed, and written back as a searchable PDF plus optional plain text export.

The core OCR pipeline

Although implementations differ, most OCR systems follow the same sequence. Treating these stages separately helps when you debug poor output — the failure is usually identifiable by stage.

1. Image preprocessing

Raw page images are rarely ideal. Preprocessing typically includes:

  • Deskewing — estimating page rotation and rotating back toward horizontal.
  • Binarization — converting grayscale to black text on white background (Otsu thresholding and adaptive methods are common).
  • Noise removal — despeckling salt-and-pepper artifacts from scanning or faxing.
  • Contrast normalization — compensating for yellowed paper or uneven lighting in phone photos.

Damage from skipped or weak preprocessing cannot be fully undone later. A recognizer cannot infer characters that are visually merged because the threshold was wrong.
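
As a minimal sketch of the binarization step, the following computes an Otsu threshold from a histogram in plain Python. It assumes an 8-bit grayscale page flattened into a list of integers; the function names `otsu_threshold` and `binarize` are illustrative and do not come from any particular OCR engine.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that maximizes between-class variance.

    Pixels <= t are treated as one class (ink), pixels > t as the other (paper).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels at or below the candidate threshold
    sum_dark = 0  # sum of gray levels at or below the candidate threshold
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue          # no dark class yet
        w_light = total - w_dark
        if w_light == 0:
            break             # everything is in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between_var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels, t):
    """Map each pixel to 0 (ink) or 255 (background) using threshold t."""
    return [0 if p <= t else 255 for p in pixels]


# Toy "page": dark ink around 20-30, light paper around 200-220.
page = [20] * 50 + [30] * 50 + [200] * 100 + [220] * 100
threshold = otsu_threshold(page)
binary = binarize(page, threshold)
```

A single global threshold like this struggles with uneven lighting, which is why adaptive methods compute a separate threshold per neighborhood.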

2. Character segmentation

Classical OCR often attempted to locate individual character bounding boxes using column projection, connected-component analysis, and splitting of touching characters. Segmentation errors cascade: one “rn” merged and misread as “m” poisons the word.

Many modern systems use line- or word-level recognition with implicit segmentation inside the neural network, or attention mechanisms that align image columns to character sequences without an explicit per-glyph cut. Layout analysis (blocks, columns, tables) still runs as a separate step in document-heavy workflows.
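
The column-projection idea from classical segmentation can be sketched in a few lines. This is an illustration under simplifying assumptions (a single binarized text line stored as rows of 0/1 values, with 1 meaning ink); `column_profile` and `split_on_gaps` are hypothetical helper names.

```python
def column_profile(image):
    """Count ink pixels in each column of a binary image (list of rows)."""
    return [sum(column) for column in zip(*image)]


def split_on_gaps(profile, min_gap=1):
    """Return (start, end) column spans, end exclusive, separated by blank gaps.

    A gap must be at least min_gap empty columns wide to split two glyphs.
    """
    spans = []
    start = None
    gap = 0
    for x, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                spans.append((start, x - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # The line ended inside a glyph or inside a too-short gap.
        spans.append((start, len(profile) - gap))
    return spans


line = [
    [0, 1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1, 0],
]
glyph_spans = split_on_gaps(column_profile(line))
```

Two glyphs that touch leave no empty column between them, so this method returns them as one span. That is the failure mode behind “rn” being read as “m”.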

3. Feature extraction

In older systems, engineers hand-crafted features: stroke density, aspect ratio, loop counts, intersection topology. Classifiers (nearest neighbor, SVMs) mapped feature vectors to character classes.

In CNN-based OCR, the network learns hierarchical features — edges, curves, ascenders — from labeled training images. The “feature extraction” layer is not human-readable, but it serves the same role: a compact representation optimized for discrimination between similar glyphs (e.g., “O” vs “0”).
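
For contrast with learned features, here is a toy version of the hand-crafted approach: two features (ink density and aspect ratio of the ink bounding box) fed to a nearest-neighbor classifier. The glyph bitmaps, the feature choice, and the names `features` and `nearest_label` are assumptions made for illustration; real systems used many more features.

```python
def features(glyph):
    """Return (ink_density, aspect_ratio) for a binary glyph (rows of 0/1).

    Both values are measured on the bounding box of the ink pixels.
    Assumes the glyph contains at least one ink pixel.
    """
    ink = [(y, x) for y, row in enumerate(glyph) for x, v in enumerate(row) if v]
    ys = [y for y, _ in ink]
    xs = [x for _, x in ink]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    return (len(ink) / (width * height), width / height)


def nearest_label(vector, prototypes):
    """Return the label whose prototype vector is closest (squared Euclidean)."""
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(prototypes, key=lambda label: dist2(vector, prototypes[label]))


# One clean example per class acts as the "training set".
GLYPH_I = [[1]] * 5
GLYPH_O = [[1, 1, 1], [1, 0, 1], [1, 0, 1], [1, 0, 1], [1, 1, 1]]
GLYPH_DASH = [[1, 1, 1]]
prototypes = {
    "I": features(GLYPH_I),
    "O": features(GLYPH_O),
    "-": features(GLYPH_DASH),
}

# An "O" with one pixel lost to noise still lands nearest the "O" prototype.
noisy_o = [[1, 1, 1], [1, 0, 1], [1, 0, 0], [1, 0, 1], [1, 1, 1]]
guess = nearest_label(features(noisy_o), prototypes)
```

The weakness is visible in the feature list itself: anything the engineer did not think to measure, such as the difference between “O” and “0”, is invisible to the classifier.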

4. Character recognition

The recognizer outputs a hypothesis per glyph, line, or sequence. Outputs include confidence scores. Low-confidence regions are prime candidates for human review in high-stakes workflows (legal, medical).
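
Routing low-confidence output to a reviewer can be as simple as a threshold. The sketch below assumes the recognizer returns (text, confidence) pairs on a 0–100 scale; the record shape, the sample values, and the cutoff of 80 are illustrative assumptions, and the right cutoff has to be tuned on your own documents.

```python
def flag_for_review(words, threshold=80.0):
    """Return the indexes of words whose confidence is below the threshold."""
    return [i for i, (_text, conf) in enumerate(words) if conf < threshold]


# Hypothetical recognizer output for one line of a medical form.
line = [("Dosage:", 96.1), ("5O", 61.4), ("mg", 93.0), ("daily", 88.7)]
needs_review = flag_for_review(line)
for i in needs_review:
    print(f"review word {i}: {line[i][0]!r} (confidence {line[i][1]})")
```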

Sequence models (CTC loss, attention-based encoders/decoders) treat a line image as input and emit a string directly, which handles variable-width fonts without fixed cell grids.
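
To make the CTC idea concrete, here is a sketch of greedy (best-path) decoding: take the highest-scoring label at each time step, collapse consecutive repeats, then drop the blank label. The label list and the score matrix are made up for illustration; a real model emits one score vector per image column or feature frame.

```python
def ctc_greedy_decode(frames, labels, blank=0):
    """Decode per-frame label scores into a string with the CTC collapse rule.

    frames: list of score lists, one per time step.
    labels: labels[i] is the character for class i; labels[blank] is unused.
    """
    best_path = [max(range(len(f)), key=f.__getitem__) for f in frames]
    decoded = []
    previous = blank
    for index in best_path:
        # Emit only when the label changes and is not the blank.
        if index != blank and index != previous:
            decoded.append(labels[index])
        previous = index
    return "".join(decoded)


LABELS = ["", "c", "a", "t"]  # class 0 is the CTC blank

# Seven frames whose best path is c c _ a a _ t
frames = [
    [0.1, 0.8, 0.05, 0.05],
    [0.2, 0.7, 0.05, 0.05],
    [0.9, 0.05, 0.03, 0.02],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.6, 0.2],
    [0.8, 0.1, 0.05, 0.05],
    [0.1, 0.1, 0.1, 0.7],
]
text = ctc_greedy_decode(frames, LABELS)
```

The blank is what lets the model emit a genuine double letter: the path `a _ a` decodes to “aa”, while `a a` collapses to a single “a”.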

5. Language model post-processing

Raw character hypotheses are noisy. Post-processing applies dictionaries, n-gram language models, and sometimes domain-specific vocabularies (medical codes, legal citations) to choose plausible words over visually similar alternatives. This stage is where “l” vs “1” and “O” vs “0” disambiguation often happens — not in the recognizer alone.

How CNNs changed OCR vs pattern matching

Template and feature-based OCR degrades when fonts, ink bleed, or scan blur depart from training assumptions. A CNN trained on millions of labeled glyph crops generalizes across typefaces and mild distortions because it learns invariances (slight rotation, weight variation) statistically.

Practical impact on printed English documents at 300 DPI: legacy engines might plateau in the low 90s percent character accuracy on clean pages; strong neural pipelines on the same material routinely reach the high 90s, with error rates dominated by layout and scan quality rather than font novelty. The gain is smaller on already-easy input and larger on degraded faxes, photocopies, and mobile photos — exactly where rule-based systems used to fail first.

Trade-off: neural models need compute and curated training data. They can also be overconfident on out-of-distribution inputs (stylized logos mistaken for letters, reverse text in watermarks).

Why handwriting is fundamentally harder than print

Printed text has discrete, repeatable glyph shapes with consistent baselines and spacing. Handwriting introduces:

  • High intra-writer variance — the same letter “a” may look different twice in one sentence.
  • Inter-writer variance — cursive connections differ across people.
  • Context-dependent letter forms — shape depends on neighboring letters.
  • Pressure and stroke width variation — binarization is less stable.

Handwriting recognition (ICR) often uses separate models, sometimes with language models tuned for word-level constraints. Expect materially higher word-error rates than print unless the input is constrained (forms with boxes, numeric fields). Unconstrained cursive remains an active research problem, not a solved one.

The DPI problem: pixels per character

Resolution is not abstract: it determines how many pixels each stroke occupies. A useful rule is to aim for an x-height of roughly 20–30 pixels in body text (the height of a lowercase “x”, not including ascenders).

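A 10 point font has a nominal size of about 3.5 mm (10/72 inch), and the x-height is typically around half of that, depending on the typeface. At 300 DPI, the nominal size spans roughly 42 pixels and the x-height roughly 20 pixels, which is comfortable. At 150 DPI, the x-height falls to about 10 pixels, which is marginal; thin strokes may disappear after binarization. At 72 DPI, it is roughly 5 pixels, and recognition error rates climb because serifs and the holes in “e” and “a” collapse.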

Downscaling a 600 DPI archive to save space can be fine if the effective resolution after scaling still preserves stroke topology. Upsampling a blurry 72 DPI fax does not recover lost information — interpolation invents pixels, not detail.
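
The resolution arithmetic is a unit conversion, sketched below so you can plug in your own point size and scan resolution. The default `x_height_ratio` of 0.5 is an assumption: the x-height is around half the nominal font size in many text typefaces but varies by font, so measure yours if it matters.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72.0


def points_to_mm(points):
    """Convert a size in typographic points to millimetres."""
    return points / POINTS_PER_INCH * MM_PER_INCH


def pixels_for_height(height_mm, dpi):
    """Number of pixels that a physical height occupies at a scan resolution."""
    return height_mm / MM_PER_INCH * dpi


def x_height_pixels(point_size, dpi, x_height_ratio=0.5):
    """Approximate x-height in pixels; the ratio is a typeface-dependent guess."""
    return pixels_for_height(points_to_mm(point_size), dpi) * x_height_ratio


for dpi in (300, 150, 72):
    nominal = pixels_for_height(points_to_mm(10), dpi)
    print(f"{dpi} DPI: nominal size {nominal:.1f} px, "
          f"x-height about {x_height_pixels(10, dpi):.1f} px")
```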

Language model correction in practice

When the recognizer returns “1nvoice” with similar scores for “i”, “l”, and “1” in the first position, the language model prefers “invoice” because neither “1nvoice” nor “lnvoice” is a word. Domain lexicons strengthen this: a legal vocabulary favors “statute” over a digit-substituted reading such as “5tatute”.
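
A stripped-down version of this correction step: expand each ambiguous character into its visually confusable alternatives, then keep the candidate the lexicon considers most frequent. The confusion groups, the tiny lexicon, and its counts are invented for illustration; a production language model would score whole sequences rather than single words.

```python
from itertools import product

# Groups of characters that printed text often confuses (illustrative only).
CONFUSION_GROUPS = ["1lIi", "0Oo", "5Ss"]
CONFUSABLE = {ch: group for group in CONFUSION_GROUPS for ch in group}


def correct(token, lexicon):
    """Return the most frequent in-lexicon variant of token, or token unchanged.

    lexicon maps known words to a frequency count. Candidate generation is
    exponential in the number of ambiguous characters, so keep tokens short.
    """
    options = [CONFUSABLE.get(ch, ch) for ch in token]
    candidates = ("".join(chars) for chars in product(*options))
    known = [c for c in candidates if c in lexicon]
    if not known:
        return token  # out-of-vocabulary: leave the raw reading alone
    return max(known, key=lambda c: lexicon[c])


LEXICON = {"invoice": 120, "statute": 45, "total": 300}

fixed = [correct(tok, LEXICON) for tok in ["1nvoice", "5tatute", "X9-114"]]
```

Returning the token unchanged when nothing matches is a deliberate choice here: it keeps identifiers such as “X9-114” from being forced into a dictionary word.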

Limitations matter: rare proper names, SKU codes, and alphanumeric IDs are often wrong after correction because they are out-of-vocabulary. Disabling the dictionary for those fields, or using charset whitelists (digits only in account number zones), is standard in form-processing pipelines.
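
For a digits-only zone, the whitelist idea can be sketched as a translation table plus a validity check. The substitution map and the function name `read_digit_field` are assumptions for illustration; which letters are safe to coerce depends on the fonts and forms you process.

```python
# Letters commonly misread in place of digits (illustrative, not exhaustive).
TO_DIGIT = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"})


def read_digit_field(raw):
    """Coerce look-alike letters to digits; return None if non-digits remain.

    Returning None instead of a best guess sends the field to manual review.
    """
    cleaned = raw.translate(TO_DIGIT).replace(" ", "")
    if cleaned.isascii() and cleaned.isdigit():
        return cleaned
    return None


account = read_digit_field("4O1 2l5S")   # look-alikes are coerced
rejected = read_digit_field("12A4")      # "A" has no digit look-alike
```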

Accuracy benchmarks by document type

Published benchmarks vary by dataset and metric (character error rate vs word error rate). Representative ranges for production systems:

  • Clean printed English, 300 DPI, single column — character accuracy often 98–99.5%; word accuracy slightly lower.
  • Newspaper or magazine columns — word accuracy drops with reading-order errors even when per-character scores look good.
  • Photocopies and fax — 90–97% character accuracy depending on generation loss.
  • Historical or stained documents — highly variable; preprocessing dominates outcomes.
  • Hand-printed block letters — moderate; cursive — substantially worse without specialized models.

Always measure on your own corpus. A vendor number on a marketing slide is not comparable to a batch of freight bills.
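
Measuring on your own corpus needs only a ground-truth transcription and an edit-distance routine. The sketch below computes character error rate (CER) and word error rate (WER) with a standard Levenshtein distance; the sample strings are invented, and both functions assume a non-empty reference.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        current = [i]
        for j, h in enumerate(hyp, 1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (r != h),  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


truth = "total amount due"
ocr_output = "tota1 amount due"
print(f"CER {cer(truth, ocr_output):.3f}, WER {wer(truth, ocr_output):.3f}")
```

One wrong character in a three-word line costs about 6% at the character level but a third of the words, which is why word accuracy trails character accuracy.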

Limitations engineers plan around

Multi-column and complex layout

Reading order is not OCR alone — it is layout analysis plus OCR. Two-column PDFs may interleave lines left-right if column detection fails. Tables need cell detection; otherwise values slide into adjacent rows in the text export.
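
The interleaving failure is easy to reproduce and, in the simplest case, easy to fix. This sketch assumes line boxes of the form (x0, y0, x1, y1, text) and at most two columns separated by one clear vertical gutter; `find_gutter` and `reading_order` are hypothetical helpers, and real layout analysis handles headings, figures, and more columns.

```python
def find_gutter(boxes):
    """Return the x midpoint of the widest horizontal gap between boxes, or None."""
    intervals = sorted((box[0], box[2]) for box in boxes)
    widest, gutter = 0, None
    reach = intervals[0][1]  # rightmost x covered so far
    for x0, x1 in intervals[1:]:
        if x0 - reach > widest:
            widest = x0 - reach
            gutter = (x0 + reach) / 2
        reach = max(reach, x1)
    return gutter


def reading_order(boxes):
    """Order text top to bottom within the left column, then the right column."""
    gutter = find_gutter(boxes)

    def key(box):
        column = 0 if gutter is None or box[0] < gutter else 1
        return (column, box[1])

    return [box[4] for box in sorted(boxes, key=key)]


page = [
    (50, 100, 280, 120, "L1"), (320, 100, 550, 120, "R1"),
    (50, 130, 280, 150, "L2"), (320, 130, 550, 150, "R2"),
]
naive = [b[4] for b in sorted(page, key=lambda b: (b[1], b[0]))]  # interleaved
ordered = reading_order(page)                                      # column-wise
```

A full-width heading covers the gutter, so this sketch would fall back to plain top-to-bottom order on such a page. That is one reason production systems detect blocks before ordering lines.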

Mathematical formulas

Equations use symbols outside standard Latin alphabets, vertical fractions, and spatial positioning. General OCR flattens them into garbled Unicode. Use specialized math OCR or keep equations as vector graphics in the PDF.

Non-Latin scripts

Engines are trained per script or language pack. Arabic contextual shaping, Indic conjuncts, and CJK dense character sets need appropriate models — running an English-trained pack on Japanese yields nonsense. Mixed-script lines require language identification upstream.

Practical takeaway: maximizing OCR accuracy

Before upload, control what the pipeline can use:

  1. Scan at 300 DPI for archival text; 200 DPI minimum for 10–12 pt body copy.
  2. Use grayscale or color at 300 DPI for lightly colored paper; avoid aggressive compression (JPEG artifacts create ringing around character edges).
  3. Deskew and crop borders — dark scanner edges confuse binarization.
  4. Prefer single-column exports when possible; split spreads into pages.
  5. For downstream editing, run OCR first, then PDF to Word on the searchable output — not on a raw image-only scan.
  6. Spot-check identifiers (amounts, dates, policy numbers); language models fix words, not always numbers.

way2pdf runs Tesseract-based OCR on uploaded PDFs and returns a searchable PDF plus a text file. Files are removed within about one hour; no account is required. For a step-by-step workflow, see how to extract text from a scanned PDF.



