How to OCR a Scanned PDF: Step-by-Step Guide

A practical guide to extracting selectable, searchable text from scanned PDFs — including how OCR works, what affects accuracy, and how to fix common problems.

8 min read · way2pdf Team · 2026

What Does "OCR a Scanned PDF" Actually Mean?

When you scan a paper document, your scanner creates a photograph of the page. The resulting PDF contains an image — not text. You can see the words, but you can't click on them, copy them, search for them, or edit them. The file is essentially a picture file inside a PDF wrapper.

OCR (Optical Character Recognition) is the process of analysing those images, identifying the characters in them, and generating real text from them. In a searchable PDF, that text is stored as a layer that sits behind the image. After OCR, the content becomes searchable and copyable: you can select individual words, press Ctrl+F to find phrases, and copy content into other documents.

Quick start: Go to way2pdf.com/ocr, upload your scanned PDF, click Extract Text, and download the extracted text. The rest of this guide explains what to expect and how to get the best results.

Step 1 — Check Whether Your PDF Is Already Searchable

Before running OCR, verify that your PDF actually needs it. Open the PDF in any viewer and try to select text with your cursor:

  • If you can select and copy text — the PDF already has a text layer. OCR is not needed. Use the Convert tool to export to Word or TXT instead.
  • If your cursor turns into a crosshair and nothing selects — it's a scanned (image-only) PDF. Proceed to OCR.
  • If you can select some text but not other parts — the PDF is partially OCR'd, possibly from a mixed document (some pages scanned, some digital). Running OCR will fix the image pages.
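
The same three-way check can be sketched in code. This is a minimal illustration, not part of the way2pdf tool: it assumes you have already extracted the text of each page with a PDF library of your choice (for example, pypdf's `page.extract_text()`), and the function name `classify_pdf` and the `min_chars` cut-off are hypothetical choices made for this example.

```python
from typing import List


def classify_pdf(page_texts: List[str], min_chars: int = 20) -> str:
    """Classify a PDF from the text extracted from each of its pages.

    page_texts holds one string per page (empty for image-only pages).
    min_chars is an arbitrary illustrative cut-off: pages with fewer
    non-whitespace-trimmed characters are treated as having no text layer.
    """
    has_text = [len(text.strip()) >= min_chars for text in page_texts]
    if not has_text:
        return "empty"        # no pages at all
    if all(has_text):
        return "searchable"   # text layer present, OCR not needed
    if not any(has_text):
        return "image-only"   # scanned PDF, proceed to OCR
    return "mixed"            # some pages scanned, OCR the image pages


# Hypothetical extracted text for a three-page document
pages = ["Invoice 4471, issued to Example Ltd for services", "", "   "]
print(classify_pdf(pages))
```

A page that contains only a page number or a stamp would fall below the cut-off and be treated as image-only, so adjust `min_chars` to suit your documents.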

Step 2 — Prepare Your Scan for Best OCR Accuracy

OCR accuracy depends heavily on the quality of the input image. If you're scanning a physical document, these settings make a significant difference:

Resolution: Use 300 DPI Minimum

Scan at 300 DPI (dots per inch) or higher. This is the industry standard for OCR. At 200 DPI, recognition accuracy drops noticeably — especially for smaller fonts. At 150 DPI or below, expect many errors. Most flatbed scanners default to 300 DPI; phone camera scans are typically equivalent to 200–250 DPI at normal distances.
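
To relate DPI to actual pixel counts, the arithmetic is simply pixels divided by inches. The sketch below is illustrative only; the function names are made up for this example, and it assumes the image covers the whole page with no margins cropped.

```python
from typing import Tuple


def pixels_needed(page_width_in: float, page_height_in: float,
                  dpi: int = 300) -> Tuple[int, int]:
    """Pixel dimensions required to capture a page at the given DPI."""
    return round(page_width_in * dpi), round(page_height_in * dpi)


def effective_dpi(pixels_wide: int, pixels_high: int,
                  page_width_in: float, page_height_in: float) -> float:
    """Effective DPI of an image of a page (the lower of the two axes)."""
    return min(pixels_wide / page_width_in, pixels_high / page_height_in)


# US Letter is 8.5 x 11 inches
print(pixels_needed(8.5, 11))              # (2550, 3300)
print(effective_dpi(1700, 2200, 8.5, 11))  # 200.0
```

In other words, a Letter-size page needs roughly 2550 × 3300 pixels to reach 300 DPI, and an image of 1700 × 2200 pixels of the same page is only 200 DPI.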

Colour Mode: Greyscale or Black & White

For text documents, greyscale produces the best OCR results — it preserves contrast without the file size of full colour. Black and white (binary) mode works well for clean, high-contrast documents like invoices and forms but struggles with faint text or grey backgrounds. Avoid colour scans for text-only documents as they create larger files without improving accuracy.
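
The reason black and white mode struggles with faint text can be shown with a toy example. The sketch below applies a single fixed threshold to greyscale pixel values (0 is black, 255 is white). It is a simplification: the threshold of 128 is an assumption for illustration, and real scanners and OCR engines may use more adaptive methods.

```python
from typing import List


def binarize(gray_rows: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Convert greyscale pixels to pure black (0) or white (255)."""
    return [[0 if px < threshold else 255 for px in row] for row in gray_rows]


# One row of pixels: background (240), dark ink (30), faint ink (170)
row = [[240, 30, 240, 170, 240]]
print(binarize(row))  # [[255, 0, 255, 255, 255]]
```

The dark ink survives as black, but the faint ink at 170 is brighter than the threshold and turns white, so it disappears from the image before OCR ever sees it. Greyscale keeps that information.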

Alignment: Keep Pages Straight

Pages rotated more than 5–10 degrees cause significant recognition errors. Most OCR engines include deskewing (automatic straightening), but it works best when the rotation is minor. When scanning, align pages carefully with the scanner bed edge.

Contrast: Avoid Shadows and Folds

Book spine shadows and page fold creases create dark gradients that confuse OCR engines. Flatten the document as much as possible. For bound books, press the spine firmly against the scanner glass or use a book scanner that photographs from above with even lighting.

Step 3 — Run OCR on way2pdf

  1. Go to way2pdf.com/ocr.
  2. Click Upload PDF and select your scanned file (up to 50 MB).
  3. Click Extract Text. Processing time is typically 5–30 seconds depending on page count and complexity.
  4. Download the result. You'll receive a text file containing all the recognised text, which you can search or copy into any document.

Large files: If your scanned PDF is over 50 MB, compress it first. Reducing the file size before OCR speeds up processing and usually doesn't reduce recognition accuracy, provided the compression does not downsample the page images well below 300 DPI.
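
If you process many files, a quick size check before uploading saves a failed attempt. This is a small sketch with assumptions stated: `needs_compression` is a hypothetical helper, and it treats "50 MB" as 50 × 1024 × 1024 bytes, which may not be exactly how the upload limit is measured.

```python
import os

# Assumption: "50 MB" is interpreted as 50 MiB. The actual limit may be
# measured as 50,000,000 bytes instead.
MAX_UPLOAD_BYTES = 50 * 1024 * 1024


def needs_compression(path: str, limit: int = MAX_UPLOAD_BYTES) -> bool:
    """Return True if the file at path is larger than the upload limit."""
    return os.path.getsize(path) > limit
```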

Step 4 — Review and Clean Up the OCR Output

Even high-quality OCR is rarely 100% perfect. Common issues to look for:

Character Substitution Errors

The most common OCR errors involve visually similar characters: l (lowercase L) mistaken for 1 (one) or I (uppercase i), 0 (zero) confused with O (letter O), and rn read as m. These errors are especially common in older documents with worn type or non-standard fonts.
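
One rough way to find candidates for review is to flag words that mix letters and digits, since substitutions such as 0 for O or 1 for l often produce them. The sketch below is a heuristic only, with a made-up function name. It will also flag legitimate tokens such as "A4" or "3rd", and it cannot catch errors like rn read as m or l read as I, which still produce all-letter words.

```python
import re
from typing import List

# A whole word that contains at least one letter and at least one digit
MIXED_TOKEN = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")


def suspicious_tokens(text: str) -> List[str]:
    """Return words mixing letters and digits, for manual review."""
    return MIXED_TOKEN.findall(text)


print(suspicious_tokens("The t0tal on 1nvoice 4471 is due in 30 days."))
# ['t0tal', '1nvoice']
```

Pure numbers such as 4471 and 30 are left alone, so the list stays focused on words that are likely to contain a misread character.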

Hyphenation and Line Breaks

OCR engines often preserve hard line breaks where the original document used them for layout, creating fragmented text. Words that were hyphenated across two lines also come out split in two. In a word processor, use Find & Replace to replace lone line breaks with spaces, but only where they appear mid-sentence, and rejoin any words that were split by an end-of-line hyphen.
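
For longer documents, the same clean-up can be done with two regular expressions. This is a minimal sketch under a few assumptions: line endings are `\n`, paragraphs are separated by a blank line, and every end-of-line hyphen is a layout hyphen. It will therefore also merge genuinely hyphenated words (such as "well-" followed by "known" on the next line) and will join headings or list items that are separated by a single line break, so review the result.

```python
import re


def unwrap_lines(text: str) -> str:
    """Rejoin hyphenated words and unwrap single line breaks."""
    # "recog-\nnition" -> "recognition"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # A single line break (not next to another one) becomes a space;
    # blank-line paragraph breaks are left untouched.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text


sample = "OCR engines often pre-\nserve hard line\nbreaks.\n\nNew paragraph."
print(unwrap_lines(sample))
```

The sample comes back as "OCR engines often preserve hard line breaks." followed by a blank line and "New paragraph.", with the paragraph break intact.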

Table Extraction

OCR extracts table content as text, but the column alignment is lost. If you need tabular data from a scanned document in structured form, paste the OCR output into a spreadsheet app and split it into columns there. If a digital (non-scanned) version of the PDF is available, use the Convert to Excel tool on that version instead.

Step 5 — Make the PDF Itself Searchable (Optional)

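The way2pdf OCR tool extracts the text content and returns it as a text file. If you specifically need a searchable PDF (a PDF with an invisible text layer behind the original scanned images), that is a different kind of output. It is sometimes confused with PDF/A, but PDF/A is an archival standard, not a synonym for a searchable PDF. You can produce a searchable PDF by:

  • Using Adobe Acrobat's "Recognize Text" feature (paid software)
  • Using a desktop OCR tool such as ABBYY FineReader (paid) or the open-source Tesseract engine (command line)
  • Using Google Drive: upload the scanned PDF → right-click → Open with Google Docs → Google automatically OCRs it and creates an editable doc containing the recognised text → export back to PDF. The exported PDF is built from the recognised text rather than the original page images, so its layout may differ from the scan.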
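
For the Tesseract route, the command line accepts an image, an output base name, and the `pdf` output option, and writes a PDF with the page image plus a text layer. The sketch below wraps that command in Python. It assumes Tesseract is installed and on your PATH, and that you have already exported each PDF page as an image (Tesseract reads image files such as PNG or TIFF, not PDFs). The helper names are hypothetical.

```python
import subprocess
from typing import List


def tesseract_pdf_command(image_path: str, output_base: str,
                          lang: str = "eng") -> List[str]:
    """Build the Tesseract command that writes <output_base>.pdf."""
    return ["tesseract", image_path, output_base, "-l", lang, "pdf"]


def run_ocr(image_path: str, output_base: str, lang: str = "eng") -> None:
    """Run Tesseract; raises CalledProcessError if it exits with an error."""
    subprocess.run(tesseract_pdf_command(image_path, output_base, lang),
                   check=True)


print(" ".join(tesseract_pdf_command("page1.png", "page1")))
# tesseract page1.png page1 -l eng pdf
```

Calling `run_ocr("page1.png", "page1")` would then produce `page1.pdf` in the current directory. Multi-page documents need one run per page image, or a wrapper tool that handles the splitting and merging for you.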

For most use cases — copying text, searching content, pasting into other documents — the plain text output from way2pdf is sufficient and faster.

OCR Accuracy by Document Type

Document Type | Expected Accuracy | Notes
Printed letter (modern, clean) | 95–99% | Near-perfect on good scans
Printed book (clear type) | 92–98% | Spine shadows reduce accuracy
Typed document (typewriter) | 85–95% | Non-standard fonts cause errors
Printed form (filled by hand) | 70–85% | Handwriting within forms is harder
Handwritten text | 40–70% | Standard OCR is not designed for cursive
Low-res scan (< 200 DPI) | 50–80% | Resolution is the biggest accuracy factor
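
To get a feel for what these percentages mean in practice, it helps to convert them into errors per page. The sketch below assumes the figures are character-level accuracy and that a page holds about 2,000 characters. Both are assumptions made for illustration; the table does not say how accuracy is measured.

```python
def expected_errors(char_count: int, accuracy_pct: float) -> int:
    """Approximate number of wrong characters at a given accuracy."""
    return round(char_count * (100 - accuracy_pct) / 100)


CHARS_PER_PAGE = 2000  # illustrative assumption

for label, accuracy in [("clean printed letter", 99),
                        ("typewriter", 90),
                        ("handwriting", 55)]:
    print(label, expected_errors(CHARS_PER_PAGE, accuracy))
# clean printed letter 20
# typewriter 200
# handwriting 900
```

Even 99% accuracy leaves around 20 characters to fix on a page of this size, which is why the review step is worth doing on documents that matter.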

When OCR Won't Work Well

OCR has fundamental limitations you should be aware of:

  • Handwritten documents — Standard OCR is trained on printed fonts. Cursive or informal handwriting produces very low accuracy. Specialised handwriting recognition tools (like Microsoft Azure's Computer Vision or Google Vision API) do better but are not free.
  • Non-Latin scripts with complex shaping — Arabic, Persian, Devanagari, and some East Asian scripts require specialised OCR engines. Standard tools designed for Latin characters will produce garbage output on these scripts.
  • Severely degraded documents — Documents that are torn, heavily stained, faded, or printed on patterned backgrounds may produce accuracy too low to be useful.
  • Mathematical equations and chemical formulas — Special notation is rarely correctly recognised by general-purpose OCR engines.
