Here's a scenario I've encountered more than once: someone digs out a contract or agreement from five years ago, scanned, filed, forgotten, and now needs to find a specific clause. They open the PDF in their browser and press Ctrl+F. Nothing. The search bar doesn't find the word they're looking for because the document is a photograph, not text.

That's the problem OCR solves. This guide explains what it actually does, why it works brilliantly on some documents and poorly on others, and how to get the best results when you need to extract text from a scanned PDF.

What Is OCR?

OCR stands for Optical Character Recognition. It is the technology that looks at an image of text and converts it into actual, machine-readable characters that a computer can store, search, and process.

When you scan a physical document, your scanner captures a photograph of the page. A photograph is just a grid of colored pixels; it has no idea that some of those pixels form the letter "A" or the word "invoice." OCR analyzes those pixels and decides, "This cluster of dark pixels looks like an 'A', this one looks like a 'B'," and so on, until it has reconstructed the text that was visible in the image.

Digital PDF vs. Scanned PDF: The Critical Difference

Not all PDFs need OCR. There are two fundamentally different types:

Digital PDFs (text-layer PDFs)

A digital PDF was generated by software: exported from Word, Excel, a web browser, or any application that "prints to PDF." The text is stored as Unicode characters in the PDF file itself. You can select, copy, search, and convert this text without any OCR.

Test: Open the PDF in any viewer and try to click and drag to select some text. If it highlights individual words, it's digital.

Image-Only PDFs (scanned PDFs)

An image-only PDF was created by scanning a physical document. Each page is stored as a raster image (JPEG or PNG pixels). There is no text layer; the "text" you see is just ink-colored pixels. OCR is required to extract any text from these files.

Test: Try to select text. If you can't select anything, or if clicking and dragging selects the entire page as one block, it's image-only.

Mixed PDFs

Some PDFs are a mix: digital pages combined with scanned pages. This is common in legal documents, where some pages are typed and others are handwritten or photocopied exhibits. OCR will process the image pages and leave the digital pages unchanged.
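The selection test above can also be automated. Here is a minimal sketch of the decision logic only, assuming you have already pulled each page's extractable text and an "is there a page-sized image?" flag out of the file with whatever PDF library you use. The function names and the "empty" label are my own, not part of any tool.

```python
from typing import List


def classify_page(extracted_text: str, has_full_page_image: bool) -> str:
    """Label one page as 'digital', 'image-only', or 'empty'.

    Both arguments are assumed to come from a PDF library of your choice.
    """
    if extracted_text.strip():
        return "digital"      # selectable text exists, so no OCR is needed
    if has_full_page_image:
        return "image-only"   # pixels but no text layer, so OCR is required
    return "empty"            # blank page: nothing to recognize


def classify_document(page_labels: List[str]) -> str:
    """Summarize per-page labels into one label for the whole file."""
    kinds = {label for label in page_labels if label != "empty"}
    if kinds == {"digital"}:
        return "digital"
    if kinds == {"image-only"}:
        return "image-only"
    if kinds == {"digital", "image-only"}:
        return "mixed"
    return "empty"


pages = [
    classify_page("This Agreement is made on...", False),  # typed page
    classify_page("", True),                                # scanned exhibit
]
print(classify_document(pages))
```

A file like the legal example above, with one typed page and one scanned exhibit, comes out as "mixed", which tells you only some pages need OCR.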

How OCR Technology Works

Modern OCR engines use a multi-stage pipeline:

Stage 1: Preprocessing

Before recognizing characters, the engine prepares the image. This includes deskewing (straightening a page that was scanned at an angle), despeckling (removing noise and scanning artifacts), binarization (converting to pure black and white to maximize contrast), and detecting page orientation.
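To make binarization concrete, here is a small sketch using Otsu's method, a classic way to pick the black/white cutoff automatically. I'm using it as an illustration; it is not necessarily what any particular OCR engine does. The image is assumed to be a plain list of rows of grayscale values from 0 (black) to 255 (white).

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Pick the gray level (0-255) that best separates dark from light pixels."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += hist[level]
        sum_dark += level * hist[level]
        weight_light = total - weight_dark
        if weight_dark == 0:
            continue          # no dark pixels yet at this cutoff
        if weight_light == 0:
            break             # every pixel is already on the dark side
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance: large when the two groups are far apart.
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(image: List[List[int]]) -> List[List[int]]:
    """Return a same-sized grid with 1 for ink and 0 for background."""
    threshold = otsu_threshold([p for row in image for p in row])
    return [[1 if p <= threshold else 0 for p in row] for row in image]


scan = [
    [250, 30, 245],
    [240, 20, 250],
]
print(binarize(scan))
```

The middle column of dark values is marked as ink and the light paper on either side becomes background, which is the clean input the later stages want.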

Stage 2: Layout Analysis

The engine identifies the structure of the page: where the text blocks are, where the images are, whether there are multiple columns, and where tables begin and end. This determines the correct reading order of the content.
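Reading order is the part that trips people up, so here is a deliberately simple sketch of it. It assumes the text blocks have already been found and are given as bounding boxes in pixels; the Block type and the grouping rule (blocks that overlap horizontally share a column) are my own simplification, and real layout analysis handles far messier pages.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    name: str
    left: int
    top: int
    right: int
    bottom: int


def reading_order(blocks: List[Block]) -> List[str]:
    """Group blocks into columns, then read each column top to bottom."""
    columns: List[List[Block]] = []
    for block in sorted(blocks, key=lambda b: b.left):
        for column in columns:
            col_left = min(b.left for b in column)
            col_right = max(b.right for b in column)
            # Horizontal overlap with an existing column: same column.
            if block.left < col_right and block.right > col_left:
                column.append(block)
                break
        else:
            columns.append([block])  # no overlap: start a new column

    ordered: List[Block] = []
    for column in columns:           # columns were created left to right
        ordered.extend(sorted(column, key=lambda b: b.top))
    return [b.name for b in ordered]


# A two-column page, with the blocks deliberately listed out of order.
page = [
    Block("right-bottom", 560, 550, 1000, 900),
    Block("left-bottom", 50, 450, 500, 900),
    Block("right-top", 550, 100, 1000, 500),
    Block("left-top", 50, 100, 500, 400),
]
print(reading_order(page))
```

Note the limitation: a full-width heading overlaps both columns and would merge them into one, which is exactly the kind of case that makes layout analysis hard in practice.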

Stage 3: Character Segmentation

Text regions are segmented into individual lines, then words, then characters. This is harder than it sounds: characters that touch each other (common with low-resolution scans or bold text) must be separated, and broken characters (where ink is missing) must be connected.
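One classic way to split a line into characters is a vertical projection profile: count the ink in each pixel column and cut wherever a column is empty. The sketch below assumes a binarized line image where 1 means ink; it is an illustration of the idea, not a description of how any specific engine segments text.

```python
from typing import List, Optional, Tuple


def segment_columns(line: List[List[int]]) -> List[Tuple[int, int]]:
    """Return (start, end) pixel-column ranges that contain ink (end exclusive)."""
    width = len(line[0])
    ink_per_column = [sum(row[x] for row in line) for x in range(width)]

    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for x, ink in enumerate(ink_per_column):
        if ink > 0 and start is None:
            start = x                      # entering a run of ink
        elif ink == 0 and start is not None:
            segments.append((start, x))    # hit a blank column: close the run
            start = None
    if start is not None:
        segments.append((start, width))    # run reaches the right edge
    return segments


# Three tiny "characters" separated by blank columns.
line_image = [
    [1, 1, 0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 1],
]
print(segment_columns(line_image))
```

This also shows why touching characters are a problem: if two letters share even one inked column, the gap disappears and they come out as a single segment.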

Stage 4: Character Recognition

Each segmented character is compared against a trained model. Modern OCR engines use deep neural networks trained on millions of document images. The network outputs a probability distribution over all possible characters and selects the most likely match.
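The "probability distribution" step is easier to picture with numbers. In this sketch the raw scores are made up for illustration; in a real engine they would come out of the neural network. A softmax turns them into probabilities that sum to 1, and the highest one wins.

```python
import math
from typing import Dict, Tuple


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {ch: math.exp(s - peak) for ch, s in scores.items()}
    total = sum(exps.values())
    return {ch: e / total for ch, e in exps.items()}


def most_likely(scores: Dict[str, float]) -> Tuple[str, float]:
    """Return the best-scoring character and its probability."""
    probs = softmax(scores)
    best = max(probs, key=lambda ch: probs[ch])
    return best, probs[best]


# Hypothetical network output for one smudged glyph.
raw_scores = {"h": 2.0, "b": 1.5, "k": -1.0}
char, confidence = most_likely(raw_scores)
print(char, round(confidence, 2))
```

Here "h" wins, but "b" is close behind. That kind of near-tie is where the next stage earns its keep.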

Stage 5: Language Model Post-Processing

Raw character recognition is corrected using a language model. If the character recognizer outputs "tbe" but the language model knows "the" is far more common, it will correct the output. This significantly reduces errors on common words.
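The "tbe" to "the" correction can be sketched with nothing more than a word-frequency table. This is a toy stand-in for a real language model: the counts are invented, and the only repair it tries is swapping a single character.

```python
import string
from typing import Dict

# Hypothetical word frequencies; a real model learns these from large text corpora.
WORD_COUNTS: Dict[str, int] = {"the": 5000, "tie": 40, "toe": 25, "be": 900}


def correct_word(word: str, counts: Dict[str, int]) -> str:
    """Replace an unknown word with the most frequent known word one substitution away."""
    if word in counts:
        return word  # already a known word: leave it alone

    candidates = set()
    for i in range(len(word)):
        for letter in string.ascii_lowercase:
            candidates.add(word[:i] + letter + word[i + 1:])

    known = [c for c in candidates if c in counts]
    if not known:
        return word  # nothing plausible nearby: keep the raw output
    return max(known, key=lambda c: counts[c])


print(correct_word("tbe", WORD_COUNTS))
```

"tbe" is one substitution away from "the", "tie", and "toe", and "the" wins on frequency. The same logic explains why post-processing helps less with names, part numbers, and other strings that no frequency table knows about.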

Factors That Affect OCR Accuracy

OCR is not magic; its accuracy depends heavily on the quality of the input image. Here are the biggest factors:

Scan Resolution

Resolution is measured in DPI (dots per inch). Higher DPI means more detail. The minimum for acceptable OCR is 200 DPI, but 300 DPI is the standard recommendation for clean, accurate results. At 150 DPI, small fonts and complex characters become ambiguous. At 600 DPI, quality is excellent but files are unnecessarily large.
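A bit of arithmetic shows why these numbers matter. A point is 1/72 of an inch, so the pixel height available to a font is its point size times the DPI, divided by 72. The sketch below works that out, along with the page dimensions; the helper names are mine, and the point size is the nominal font height, so the actual letter shapes are somewhat smaller.

```python
from typing import Tuple


def font_height_px(point_size: float, dpi: int) -> float:
    """Nominal font height in pixels (1 point = 1/72 inch)."""
    return point_size * dpi / 72


def page_size_px(width_in: float, height_in: float, dpi: int) -> Tuple[int, int]:
    """Pixel dimensions of a scanned page."""
    return round(width_in * dpi), round(height_in * dpi)


for dpi in (150, 200, 300, 600):
    width, height = page_size_px(8.5, 11, dpi)  # US Letter
    print(
        f"{dpi} DPI: 8pt text is about {font_height_px(8, dpi):.0f} px tall, "
        f"page is {width} x {height} px"
    )
```

Going from 300 to 600 DPI doubles both dimensions, so the page holds four times as many pixels. That is where the extra file size comes from.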

Image Contrast

High contrast between text and background is essential. Black text on white paper is ideal. Problems arise with:

Faded ink or pencil marks
Colored paper or textured backgrounds
Colored text (especially light yellow, light gray, or watermarks)
Ink from the reverse side showing through thin paper (bleed-through)

Page Skew

A page tilted more than 5° from horizontal significantly degrades OCR accuracy. Modern OCR engines auto-correct minor skew, but extreme tilt (from placing a document crookedly in a scanner) can still cause errors. Always align documents carefully when scanning.
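To get a feel for what a few degrees of tilt does, you can work out how far a line of text drifts vertically by the time it crosses the page: page width in pixels times the tangent of the skew angle. This is just geometry, sketched with a helper name of my own.

```python
import math


def baseline_drift_px(angle_degrees: float, page_width_in: float, dpi: int) -> float:
    """Vertical drift of a text line across the full page width, in pixels."""
    return page_width_in * dpi * math.tan(math.radians(angle_degrees))


for angle in (1, 2, 5):
    drift = baseline_drift_px(angle, 8.5, 300)  # US Letter width at 300 DPI
    print(f"{angle} degree skew: line drifts about {drift:.0f} px")
```

At 300 DPI a 12pt font is nominally 50 pixels tall, so at 5 degrees a single line wanders by more than four times its own height from one edge of the page to the other. That is why the engine has to straighten the page before it can find the lines at all.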

Font Type and Size

Standard serif and sans-serif fonts (Times New Roman, Arial, Calibri) are recognized at near-perfect accuracy. Decorative fonts, handwriting-style fonts, and very small fonts (below 8pt) are harder to recognize. Uppercase and well-spaced text is more reliably recognized than dense, tightly kerned lowercase.

Document Condition

Physical damage reduces OCR quality. Wrinkled pages create shadows, coffee stains obscure characters, yellowed old documents reduce contrast, and pages with holes (punched or torn) are obvious problems. If scanning damaged documents, clean and flatten them as much as possible first.

Practical Tips for Better OCR Results

Scan at 300 DPI

Set your scanner to 300 DPI for standard documents. Use 600 DPI only for archival copies or very small text you need to zoom into later.

Use Black and White Mode for Text Documents

Scanning in grayscale or black-and-white mode (rather than color) produces sharper, higher-contrast images and smaller files. Use color only if the document contains color illustrations that matter.

Scan Straight

Use the guide rails in your scanner to align pages consistently. Scans tilted more than 2–3 degrees will have visibly reduced accuracy on small text.

Compress the PDF Before OCR if Needed

If your scanned PDF is very large (over 50 MB), run it through our Compress PDF tool to bring it within the upload limit. Compression downsamples the page images, so check that they stay at or above roughly 200 DPI, the minimum for acceptable OCR; at 150 DPI, small fonts start to become ambiguous.

Process Page by Page for Problem Documents

If a multi-page document has some pages with poor OCR results, use our Split PDF tool to isolate those pages, re-scan them at higher quality, and re-run OCR on those pages separately. Then merge the results back together.

OCR Output Formats

Our OCR tool returns a searchable PDF: a PDF that now has an invisible text layer placed over the original page images. This means:

The document looks exactly the same as the original scan
You can now select, copy, and search all the text
The file can be indexed by search engines and document management systems
You can now convert the PDF to Word, Excel, or plain text using our Convert tool

When OCR Reaches Its Limits

Some content cannot be reliably extracted by any OCR engine:

Handwriting: especially cursive. Handwriting recognition is a distinct, harder problem. Printed block letters are easier but still error-prone.
Mathematical equations: symbols, fractions, and notation are often misread or output as garbled text.
Highly stylized or decorative fonts: artistic lettering, hand-drawn text, or fonts designed to look like graffiti.
Very small text: footnotes or legal fine print below 6pt in a 200 DPI scan may produce unreliable output.
Right-to-left scripts in complex layouts: Arabic, Hebrew, and Persian in multi-column layouts can produce reordering issues.

Scan won't search?

Got a scan that won't search? Upload it to the OCR tool. Most typed documents finish in under half a minute.


OCR Explained: How to Extract Text from Scanned PDFs