What Is OCR?
OCR stands for Optical Character Recognition. It is the technology that looks at an image of text and converts it into actual, machine-readable characters that a computer can store, search, and process.
Think of it this way: when you scan a physical document, your scanner captures a photograph of the page. A photograph is just a grid of colored pixels — it has no idea that some of those pixels form the letter "A" or the word "invoice." OCR is the process that analyzes those pixels and says, "This cluster looks like an 'A', this looks like a 'B'..." until it has re-created the text that was in the original document.
Digital PDF vs. Scanned PDF: The Critical Difference
Not all PDFs need OCR. There are two fundamentally different types:
Digital PDFs (text-layer PDFs)
A digital PDF was generated by software — exported from Word, Excel, a web browser, or any application that "prints to PDF." The text is stored as Unicode characters in the PDF file itself. You can select, copy, search, and convert this text without any OCR.
Test: Open the PDF in any viewer and try to click and drag to select some text. If it highlights individual words, it's digital.
Image-Only PDFs (scanned PDFs)
An image-only PDF was created by scanning a physical document. Each page is stored as a raster image (JPEG or PNG pixels). There is no text layer — the "text" you see is just ink-colored pixels. OCR is required to extract any text from these files.
Test: Try to select text. If you can't select anything, or if clicking and dragging selects the entire page as one block, it's image-only.
Mixed PDFs
Some PDFs are a mix — digital pages combined with scanned pages (common in legal documents where some pages are typed and others are handwritten or photocopied exhibits). OCR will process the image pages and leave the digital pages unchanged.
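The select-text test above can be approximated in code. The sketch below assumes you have already extracted the text of each page with a PDF library of your choice (that step is not shown); the function names and the 20-character cutoff are illustrative assumptions, not part of any specific tool.

```python
from typing import List


def page_has_text(text: str, min_chars: int = 20) -> bool:
    """Treat a page as digital if it yields at least min_chars of text.

    min_chars is an assumed cutoff so a stray character or two
    does not count as a real text layer.
    """
    return len(text.strip()) >= min_chars


def classify_pdf(page_texts: List[str], min_chars: int = 20) -> str:
    """Return 'digital', 'image-only', or 'mixed' for a list of page texts."""
    if not page_texts:
        raise ValueError("the document has no pages")
    flags = [page_has_text(t, min_chars) for t in page_texts]
    if all(flags):
        return "digital"
    if not any(flags):
        return "image-only"
    return "mixed"


def pages_needing_ocr(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return 1-based page numbers that appear to have no text layer."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not page_has_text(text, min_chars)
    ]


# Hypothetical extraction result: page 2 is a scanned exhibit with no text.
pages = ["This agreement is made between the parties listed below.", ""]
print(classify_pdf(pages), pages_needing_ocr(pages))
```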
How OCR Technology Works
Modern OCR engines use a multi-stage pipeline:
Stage 1: Preprocessing
Before recognizing characters, the engine prepares the image. This includes deskewing (straightening a page that was scanned at an angle), despeckling (removing noise and scanning artifacts), binarization (converting to pure black and white to maximize contrast), and detecting page orientation.
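To make binarization concrete, here is a minimal sketch using Otsu's method, one common way to pick a black/white threshold automatically. It is an illustration in plain Python on a tiny grayscale grid, not a description of how any particular OCR engine does it; real engines often use adaptive, per-region thresholds.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Pick the 0-255 threshold that maximizes between-class variance."""
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    dark_count, dark_sum = 0, 0
    for level in range(256):
        dark_count += hist[level]
        dark_sum += level * hist[level]
        light_count = total - dark_count
        if dark_count == 0:
            continue  # no pixels at or below this level yet
        if light_count == 0:
            break  # every pixel is already in the dark class
        dark_mean = dark_sum / dark_count
        light_mean = (sum_all - dark_sum) / light_count
        variance = dark_count * light_count * (dark_mean - light_mean) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(image: List[List[int]]) -> List[List[int]]:
    """Map each pixel to 0 (ink) or 255 (paper) using the Otsu threshold."""
    flat = [value for row in image for value in row]
    threshold = otsu_threshold(flat)
    return [[0 if value <= threshold else 255 for value in row] for row in image]


# A tiny grayscale patch: values near 200 are paper, values near 40 are ink.
patch = [[200, 210, 40], [205, 35, 198]]
print(binarize(patch))
```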
Stage 2: Layout Analysis
The engine identifies the structure of the page — where the text blocks are, where images are, whether there are multiple columns, where tables begin and end. This determines the correct reading order of the content.
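As a rough illustration of how reading order might be derived from detected text blocks, the sketch below groups bounding boxes into columns by horizontal overlap, then reads each column top to bottom. The Block type and the grouping rule are assumptions made for this example; it assumes a left-to-right script and would not handle full-width headings or tables, which real layout analysis must.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    """A detected text block and its bounding box in pixels (hypothetical)."""
    label: str
    left: int
    top: int
    right: int
    bottom: int


def reading_order(blocks: List[Block]) -> List[str]:
    """Order blocks column by column (left to right), top to bottom within each."""
    columns: List[List[Block]] = []
    # Visiting blocks by left edge means columns are created left to right.
    for block in sorted(blocks, key=lambda b: b.left):
        for column in columns:
            col_left = min(b.left for b in column)
            col_right = max(b.right for b in column)
            # Horizontal overlap with an existing column: same column.
            if block.left < col_right and block.right > col_left:
                column.append(block)
                break
        else:
            columns.append([block])

    ordered: List[Block] = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b.top))
    return [b.label for b in ordered]


# Two-column page: A and C on the left, B and D on the right.
page = [
    Block("A", 50, 100, 500, 400),
    Block("B", 550, 100, 1000, 300),
    Block("C", 50, 450, 500, 800),
    Block("D", 560, 350, 1000, 800),
]
print(reading_order(page))
```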
Stage 3: Character Segmentation
Text regions are segmented into individual lines, then words, then characters. This is harder than it sounds — characters that touch each other (common with low-resolution scans or bold text) must be separated, and broken characters (where ink is missing) must be connected.
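One classic way to split a text line into characters is a vertical projection profile: count the ink pixels in each column and cut wherever a column is empty. The sketch below shows that idea on a toy binarized line ('#' is ink, '.' is paper). It is a simplified illustration; touching characters would come back as a single span, which is exactly the hard case described above.

```python
from typing import List, Tuple


def segment_columns(line: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) column spans that contain ink; end is exclusive."""
    width = len(line[0])
    ink_per_column = [sum(row[x] == "#" for row in line) for x in range(width)]

    spans: List[Tuple[int, int]] = []
    start = None
    for x, ink in enumerate(ink_per_column):
        if ink and start is None:
            start = x  # a character begins at the first inked column
        elif not ink and start is not None:
            spans.append((start, x))  # an empty column ends the character
            start = None
    if start is not None:
        spans.append((start, width))  # character runs to the right edge
    return spans


# A toy line image containing the letters "H" and "I".
LINE = [
    "#..#.###",
    "#..#..#.",
    "####..#.",
    "#..#..#.",
    "#..#.###",
]
print(segment_columns(LINE))
```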
Stage 4: Character Recognition
Each segmented character is compared against a trained model. Modern OCR engines use deep neural networks trained on millions of document images. The network outputs a probability distribution over all possible characters and selects the most likely match.
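The "probability distribution over all possible characters" is typically produced by applying a softmax to the network's raw scores. The sketch below shows only that final step; the scores are made-up numbers for one character image, not output from a real model.

```python
import math
from typing import Dict, Tuple


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the peak for numerical stability
    exps = {ch: math.exp(score - peak) for ch, score in scores.items()}
    total = sum(exps.values())
    return {ch: value / total for ch, value in exps.items()}


def best_match(scores: Dict[str, float]) -> Tuple[str, float]:
    """Return the most likely character and its probability."""
    probs = softmax(scores)
    char = max(probs, key=probs.get)
    return char, probs[char]


# Hypothetical raw scores for a glyph that could be 'h', 'b', or 'n'.
raw_scores = {"h": 2.0, "b": 1.5, "n": 0.1}
char, confidence = best_match(raw_scores)
print(f"{char} ({confidence:.2f})")
```

When the top two probabilities are close, as with 'h' and 'b' here, the recognizer's choice is less certain, which is where the next stage helps.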
Stage 5: Language Model Post-Processing
Raw character recognition is corrected using a language model. If the character recognizer outputs "tbe" but the language model knows "the" is far more common, it will correct the output. This significantly reduces errors on common words.
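The "tbe" to "the" correction can be sketched with nothing more than a word-frequency table. This is a deliberately tiny stand-in for a language model: the word counts are invented for the example, and only single-character substitutions are considered.

```python
from typing import Dict

# Invented word counts standing in for a real language model.
WORD_COUNTS: Dict[str, int] = {"the": 5000, "toe": 40, "tie": 60, "be": 900}


def one_substitution_apart(a: str, b: str) -> bool:
    """True if the words have equal length and differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def correct(word: str, counts: Dict[str, int]) -> str:
    """Replace an unknown word with its most frequent one-substitution neighbor."""
    if word in counts:
        return word  # already a known word, leave it alone
    candidates = [w for w in counts if one_substitution_apart(word, w)]
    if not candidates:
        return word  # nothing plausible, keep the raw recognizer output
    return max(candidates, key=counts.get)


print(correct("tbe", WORD_COUNTS))
```

Here "the", "toe", and "tie" are all one substitution away from "tbe", and "the" wins because it is by far the most frequent.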
Factors That Affect OCR Accuracy
OCR is not magic — its accuracy depends heavily on the quality of the input image. Here are the biggest factors:
Scan Resolution
Resolution is measured in DPI (dots per inch). Higher DPI means more detail. The minimum for acceptable OCR is 200 DPI, but 300 DPI is the standard recommendation for clean, accurate results. At 150 DPI, small fonts and complex characters become ambiguous. At 600 DPI, quality is excellent but files are unnecessarily large.
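The arithmetic behind these numbers is simple: a point is 1/72 of an inch, so a font's nominal height in pixels is its point size divided by 72, times the DPI. The sketch below works this out for a few resolutions and also shows why 600 DPI files are so much larger. The US Letter page size is an assumption for the example, and the pixel count is for an uncompressed page.

```python
POINTS_PER_INCH = 72


def glyph_height_px(font_pt: float, dpi: int) -> float:
    """Nominal font height in pixels (the full em height, not the x-height)."""
    return font_pt / POINTS_PER_INCH * dpi


def page_pixels(dpi: int, width_in: float = 8.5, height_in: float = 11.0) -> int:
    """Total pixels in one scanned page (US Letter assumed by default)."""
    return round(width_in * dpi) * round(height_in * dpi)


for dpi in (150, 200, 300, 600):
    height = glyph_height_px(8, dpi)
    megapixels = page_pixels(dpi) / 1_000_000
    print(f"{dpi} DPI: 8pt text is about {height:.1f} px tall, page is {megapixels:.1f} MP")
```

Doubling the DPI doubles the pixel height of each character but quadruples the pixel count of the page, which is the trade-off between 300 and 600 DPI.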
Image Contrast
High contrast between text and background is essential. Black text on white paper is ideal. Problems arise with:
- Faded ink or pencil marks
- Colored paper or textured backgrounds
- Colored text (especially light yellow, light gray, or watermarks)
- Text from the reverse side showing through thin paper (bleed-through)
Page Skew
A page tilted more than 5° from horizontal significantly degrades OCR accuracy. Modern OCR engines auto-correct minor skew, but extreme tilt (from placing a document crookedly in a scanner) can still cause errors. Always align documents carefully when scanning.
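A quick calculation shows why a few degrees matter. On a tilted page, a line of text drifts vertically by the line width times the tangent of the skew angle. The sketch below assumes a 6.5-inch text line scanned at 300 DPI; both figures are illustrative choices, not values from any specific scanner.

```python
import math


def baseline_drift_px(angle_deg: float, line_width_in: float, dpi: int) -> float:
    """Vertical drift, in pixels, from one end of a text line to the other."""
    return math.tan(math.radians(angle_deg)) * line_width_in * dpi


for angle in (1, 3, 5):
    drift = baseline_drift_px(angle, line_width_in=6.5, dpi=300)
    print(f"{angle} degree skew: baseline drifts about {drift:.0f} px across the line")
```

For comparison, 8pt text at 300 DPI is only about 33 pixels tall, so at 5 degrees a single line crosses the vertical space of several neighboring lines, which makes line segmentation unreliable unless the page is deskewed first.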
Font Type and Size
Standard serif and sans-serif fonts (Times New Roman, Arial, Calibri) are recognized at near-perfect accuracy. Decorative fonts, handwriting-style fonts, and very small fonts (below 8pt) are harder to recognize. Uppercase and well-spaced text is more reliably recognized than dense, tightly kerned lowercase.
Document Condition
Physical damage reduces OCR quality. Wrinkled pages create shadows, coffee stains obscure characters, yellowing reduces contrast, and holes (punched or torn) remove characters outright. If you are scanning damaged documents, clean and flatten them as much as possible first.
Practical Tips for Better OCR Results
Scan at 300 DPI
Set your scanner to 300 DPI for standard documents. Use 600 DPI only for archival copies or very small text you need to zoom into later.
Use Grayscale or Black-and-White Mode for Text Documents
Scanning in grayscale or black-and-white mode (rather than color) produces sharper, higher-contrast images and smaller files. Use color only if the document contains color illustrations that matter.
Scan Straight
Use the guide rails in your scanner to align pages consistently. Scans tilted more than 2–3 degrees will have visibly reduced accuracy on small text.
Compress the PDF Before OCR if Needed
If your scanned PDF is very large (over 50 MB), run it through our Compress PDF tool to bring it within the upload limit. Compression downsamples images, so check that the result stays at about 200 DPI or higher: small fonts become ambiguous at 150 DPI, and 200 DPI is the minimum for acceptable OCR.
Process Page by Page for Problem Documents
If a multi-page document has some pages with poor OCR results, use our Split PDF tool to isolate those pages, re-scan them at higher quality, and re-run OCR on those pages separately. Then merge the results back together.
OCR Output Formats
Our OCR tool returns a searchable PDF — a PDF that now has an invisible text layer placed over the original page images. This means:
- The document looks exactly the same as the original scan
- You can now select, copy, and search all the text
- The file can be indexed by search engines and document management systems
- You can now convert the PDF to Word, Excel, or plain text using our Convert tool
When OCR Reaches Its Limits
Some content cannot be reliably extracted by any OCR engine:
- Handwriting — especially cursive. Handwriting recognition is a distinct, harder problem. Printed block letters are easier but still error-prone.
- Mathematical equations — symbols, fractions, and notation are often misread or output as garbled text.
- Highly stylized or decorative fonts — artistic lettering, hand-drawn text, or fonts designed to look like graffiti.
- Very small text — footnotes or legal fine print at below 6pt in a 200 DPI scan may produce unreliable output.
- Right-to-left scripts in complex layouts — Arabic, Hebrew, and Persian in multi-column layouts can produce reordering issues.
Try OCR on Your Document
Ready to extract text from your scanned PDF? Our free OCR tool processes most documents in under 30 seconds.