
Invoice Digitization: Scan, OCR, and Export to Excel

A complete, practical workflow for converting paper invoices into structured spreadsheets — covering the exact steps, the settings that matter, and the mistakes that waste an afternoon.

12 min read  ·  Updated May 17, 2026


Why paper invoices are still a problem in 2026

Despite the rise of e-invoicing, a substantial portion of supplier invoices still arrive as paper documents, faxes, or photographs — especially from small suppliers, international vendors, and industries like construction, food service, and manufacturing where paper is the norm. An accounts payable team processing 50–500 such invoices a month faces the same bottleneck: someone has to type numbers from paper into a spreadsheet, and manual entry at scale produces errors, delays, and audit risk.

The correct fix is a three-step pipeline: scan → OCR → extract to spreadsheet. Each step is distinct and each matters. Skipping or rushing any one of them produces problems downstream that take longer to fix than doing it right from the start.

What this workflow requires

You need three things: a scanner or a phone with a document-scanning app, a way to run OCR on the scanned PDF, and a PDF-to-Excel converter. The OCR and conversion steps can be done with way2pdf's free tools, without uploading to a permanent cloud drive: files are deleted automatically within one hour.

Before you start: This workflow is for scanned paper invoices. If your invoices arrive as email attachments that are already digital PDFs (from accounting software like Xero, QuickBooks, or SAP), skip to Step 3 — they already have a text layer and do not need OCR.

Step 1: Scan the paper invoice to PDF

Scanning settings that affect OCR accuracy

Resolution: 300 DPI minimum, 400 DPI recommended. This is the single most important setting. OCR software reads the pixel representation of each character. At 150 DPI (the default on many scanner apps), small text — 8pt or smaller, which is common on invoice line items — becomes too blurry to read reliably. At 300 DPI, standard invoice text is clean. At 400 DPI, even thermal-printer receipts with faint characters become legible to OCR. Do not exceed 600 DPI unless the document has extremely fine print — higher resolution increases file size without improving OCR results on typical invoice text.
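Counting pixels shows why resolution matters. The following is a rough sketch, assuming the standard definition of 1 point = 1/72 inch; the function name is illustrative, and real glyphs (lowercase letters in particular) fill only part of the nominal height.

```python
def text_height_px(point_size: float, dpi: int) -> float:
    """Nominal height in pixels of text at a given scan resolution.

    Assumes 1 pt = 1/72 inch. Real glyphs occupy only part of this
    nominal height, so the usable detail is somewhat lower.
    """
    return point_size / 72 * dpi


for dpi in (150, 300, 400, 600):
    print(f"8pt text at {dpi} DPI: about {text_height_px(8, dpi):.0f} px tall")
```

At 150 DPI, 8pt text gets roughly 17 pixels of nominal height; at 300 DPI it gets roughly 33, which leaves far more detail for telling a 0 from an O.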

Colour mode: Greyscale or black-and-white. Colour scanning roughly triples the uncompressed image size (24 bits per pixel versus 8 for greyscale) and does not improve OCR accuracy on text documents. Use greyscale for most invoices. Use black-and-white (1-bit) only for invoices with no greyscale content, meaning pure black text on white paper, since it produces the smallest file sizes. Avoid black-and-white for invoices with logos, shaded tables, or light-grey background text.
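The size difference between colour modes follows from bits per pixel. This is a back-of-the-envelope sketch for an A4 page; it assumes uncompressed images, so real scan files (which are compressed) will be smaller and the ratios will differ somewhat. The function and dictionary names are illustrative.

```python
BITS_PER_PIXEL = {"colour": 24, "greyscale": 8, "black-and-white": 1}


def raw_page_bytes(width_mm: float, height_mm: float, dpi: int, mode: str) -> int:
    """Uncompressed size in bytes of one scanned page."""
    width_px = round(width_mm / 25.4 * dpi)   # 25.4 mm per inch
    height_px = round(height_mm / 25.4 * dpi)
    return width_px * height_px * BITS_PER_PIXEL[mode] // 8


for mode in BITS_PER_PIXEL:
    size_mb = raw_page_bytes(210, 297, 300, mode) / 1_000_000
    print(f"A4 at 300 DPI, {mode}: {size_mb:.1f} MB uncompressed")
```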

File format: PDF. Scan directly to PDF rather than JPEG or PNG. PDF preserves document structure (page size, orientation), and the OCR result can be overlaid as a text layer on top of the image — this makes the final PDF both human-readable and machine-readable without changing how it looks.

Page alignment: Crooked pages reduce OCR accuracy significantly. Most modern scanners have auto-deskew, but if yours does not, align the invoice squarely on the scanner bed. way2pdf's Rotate PDF tool can fix orientation after scanning, but deskew (fixing the angle) requires a scanner with auto-deskew or dedicated software.

Phone scanning vs flatbed scanner

Phone scanning apps (Microsoft Lens, Apple Notes scan feature, Google Drive scan) are adequate for occasional use. They introduce perspective distortion, which the apps' correction algorithms only partially fix. For batch processing (20+ invoices at a time), a flatbed or document-feed scanner produces more consistent results and requires less post-processing. A document-feed scanner lets you process a stack of invoices in minutes; a phone requires a separate photo of each page.

Step 2: Run OCR

OCR (Optical Character Recognition) converts the page image in your scanned PDF into actual text characters. Without this step, the scanned PDF is a photograph — the PDF-to-Excel converter in Step 3 cannot find any data to extract from an image alone.

Upload to way2pdf OCR

Go to way2pdf OCR, upload the scanned PDF, and click run. The tool uses Tesseract OCR, an open-source engine whose development was sponsored by Google for many years, configured for English and common accounting character sets. Processing time is typically under 30 seconds for a standard 1–5 page invoice.

Download the result and open it in your PDF viewer before proceeding. Select a few numbers with your mouse cursor — if you can highlight individual digits, the OCR worked. If clicking selects nothing, the OCR failed (see troubleshooting below) and Step 3 will produce an empty spreadsheet.
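The same check can be automated when you have many files. The sketch below is a minimal, library-free helper: it takes the text already extracted from each page and decides whether a usable text layer exists. It assumes the list is built elsewhere, for example with pdfplumber's `page.extract_text()` called on every page of `pdfplumber.open(path).pages`; the function name and the `min_chars` threshold are hypothetical choices, not part of way2pdf.

```python
def has_text_layer(page_texts, min_chars: int = 20) -> bool:
    """Return True if the extracted text looks like a real text layer.

    page_texts: one entry per page, each a string or None.
    min_chars: hypothetical threshold; a handful of stray characters
    is treated as a failed OCR run.
    """
    total = sum(len((text or "").strip()) for text in page_texts)
    return total >= min_chars


print(has_text_layer([None, ""]))                                 # image-only scan
print(has_text_layer(["Invoice INV-1001\nTotal due 1,234.56"]))   # OCR'd page
```

A `False` result means the file should go back through OCR before Step 3.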

OCR quality signals to check

  • Numbers misread as letters: 0 read as O, 1 read as l, 5 read as S. These are the most critical errors in invoice data. Re-scan at higher resolution if this occurs frequently.
  • Decimal points lost: 1,234.56 becoming 123456 or 1,23456. Caused by low resolution or ink bleed on thermal paper.
  • Column data running together: table columns merged into a single text run. Usually caused by faint or absent table border lines. The PDF-to-Excel converter handles this better when column borders are visible.
  • Completely blank output: often means the invoice was scanned in landscape but saved as portrait, or the image DPI was below 200. Re-scan and check orientation.
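The first two signals in the list above can be screened for mechanically. This is a hedged sketch, assuming amounts are written in the 1,234.56 style with exactly two decimal places; the look-alike table contains only the three substitutions named above, and all names are illustrative.

```python
import re

# Letters that OCR may produce in place of digits (from the list above).
LOOKALIKES = str.maketrans({"O": "0", "l": "1", "S": "5"})
# Accepts 1,234.56 (grouped thousands) or 1234.56 (ungrouped).
AMOUNT_PATTERN = re.compile(r"^\d{1,3}(,\d{3})*\.\d{2}$|^\d+\.\d{2}$")


def check_amount(raw: str):
    """Return (status, value) for a string that should be a money amount.

    status is "ok", "repaired" (look-alike letters swapped for digits),
    or "suspect" (still not shaped like 1,234.56 after the swap).
    """
    text = raw.strip()
    if AMOUNT_PATTERN.match(text):
        return "ok", text
    repaired = text.translate(LOOKALIKES)
    if AMOUNT_PATTERN.match(repaired):
        return "repaired", repaired
    return "suspect", text


for value in ("1,234.56", "1,2S4.56", "123456", "1,23456"):
    print(value, "->", check_amount(value))
```

A "repaired" value is a guess, not a fact: compare it against the paper invoice before relying on it.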

Step 3: Convert the OCR'd PDF to Excel

Upload the OCR-processed PDF to way2pdf PDF-to-Excel. The converter uses pdfplumber — a Python library that parses PDF structure to find table regions — to extract rows and columns into an .xlsx spreadsheet. Download the file and open it.
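To make the shape of the extracted data concrete, here is a small sketch of what happens after table detection. It assumes a table in the form pdfplumber's `page.extract_table()` returns: a list of rows, each a list of cell strings or None. How way2pdf itself calls pdfplumber is not documented here, the sample rows are invented, and CSV is used instead of .xlsx only to keep the example within the Python standard library.

```python
import csv
import io


def table_to_csv(table) -> str:
    """Serialise one extracted table to CSV text.

    table: list of rows, each a list of cell strings or None.
    Empty (None) cells are written as empty strings.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        writer.writerow([(cell or "").strip() for cell in row])
    return buffer.getvalue()


sample = [
    ["Description", "Qty", "Unit price", "Total"],
    ["Copper pipe 15mm", "10", "4.20", "42.00"],
    ["Delivery", None, None, "15.00"],
]
print(table_to_csv(sample))
```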

What typically extracts well

  • Line items with clear column boundaries (item description, quantity, unit price, total)
  • Date, invoice number, and vendor name in the header region
  • Subtotal, tax, and grand total rows at the bottom
  • Any table that has visible grid lines or consistent column alignment

What requires manual cleanup

  • Multi-line item descriptions — a line item that wraps onto two rows in the PDF may appear as two rows in Excel. Merge them manually.
  • Numbers in wrong columns — when borders are faint, the parser may miscalculate column boundaries. Compare the totals column in Excel against the PDF visually.
  • Currency symbols — $ and £ are usually stripped. Add them as cell formatting rather than text values so Excel treats the numbers as numeric.
  • Header region — invoice number, vendor name, and PO number often appear above the table and extract as text rows. Move them to a standard header area or reference cells.
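If you export the sheet for scripted cleanup, two of these fixes are easy to sketch. The code below assumes rows of four string cells (description, quantity, unit price, total) and treats a row with a description but no numbers as the wrapped second line of the item above it. That is a heuristic, so a section heading inside the table would be merged by mistake; both function names are illustrative.

```python
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Convert '$1,234.56' or '£15.00' to a Decimal, dropping symbols and commas."""
    cleaned = text.strip().lstrip("$£").replace(",", "")
    return Decimal(cleaned)


def merge_wrapped_rows(rows):
    """Fold continuation rows into the line item above them.

    rows: lists of [description, qty, unit_price, total] strings.
    A row with a description and no other values is assumed to be a
    wrapped description line.
    """
    merged = []
    for description, *numbers in rows:
        if merged and description and not any(numbers):
            merged[-1][0] = f"{merged[-1][0]} {description}"
        else:
            merged.append([description, *numbers])
    return merged


rows = [
    ["Copper pipe 15mm,", "10", "4.20", "42.00"],
    ["type B, 3m lengths", "", "", ""],
    ["Delivery", "1", "15.00", "15.00"],
]
print(merge_wrapped_rows(rows))
print(parse_amount("$1,234.56"))
```

Storing amounts as `Decimal` rather than text mirrors the advice above: keep the value numeric and apply the currency symbol as formatting.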

Step 4: Verify the extracted data

This step is not optional. OCR and table extraction are pattern-matching processes; they are accurate on clean, well-structured invoices and less accurate on unusual layouts. A systematic check takes two minutes and prevents costly payment errors:

  1. Verify the invoice total in Excel matches the total in the PDF.
  2. Check line-item counts — the number of rows in Excel should equal the number of line items on the invoice.
  3. Check that unit prices and quantities produce the correct extended prices when multiplied.
  4. Confirm vendor name and invoice number match your purchase order records.
  5. Flag any cell containing a letter where you expect only a number — these are usually OCR misreads.
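The arithmetic parts of this checklist (items 1, 2, 3, and 5) can be scripted; matching against purchase order records still needs your own data. The sketch below is one possible shape, assuming the stated total equals the sum of the extended prices with no separate tax row and no rounding differences. The function name and argument layout are hypothetical.

```python
from decimal import Decimal, InvalidOperation


def verify_invoice(line_items, stated_total, expected_count):
    """Return a list of problems; an empty list means every check passed.

    line_items: (qty, unit_price, extended) string triples from the sheet.
    stated_total: the total printed on the PDF, as a string.
    expected_count: number of line items counted on the paper invoice.
    """
    problems = []
    if len(line_items) != expected_count:
        problems.append(f"expected {expected_count} line items, found {len(line_items)}")
    running_total = Decimal("0")
    for row_number, cells in enumerate(line_items, start=1):
        try:
            qty, unit_price, extended = (Decimal(c.replace(",", "")) for c in cells)
        except InvalidOperation:
            # A letter where a number belongs is usually an OCR misread.
            problems.append(f"row {row_number}: non-numeric value in {cells}")
            continue
        if qty * unit_price != extended:
            problems.append(f"row {row_number}: {qty} x {unit_price} != {extended}")
        running_total += extended
    if running_total != Decimal(stated_total.replace(",", "")):
        problems.append(f"line items sum to {running_total}, invoice states {stated_total}")
    return problems


print(verify_invoice([("10", "4.20", "42.00"), ("1", "15.00", "15.00")], "57.00", 2))
print(verify_invoice([("1O", "4.20", "42.00")], "42.00", 1))
```

`Decimal` is used rather than `float` so that values such as 4.20 compare exactly.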

Batch processing: handling high volumes efficiently

For teams processing 100+ invoices monthly, doing this one-at-a-time becomes its own bottleneck. Practical improvements for higher volumes:

  • Use a document-feed scanner with a stack feeder to scan 20–50 pages in a single pass. Most office all-in-ones have this feature.
  • Consistent supplier formatting — work with frequent suppliers to standardise their invoice layout. A supplier who always puts the total in cell D12 of a fixed-format invoice can be processed much faster than one who varies their layout.
  • Template matching — once you have processed several invoices from the same supplier, note which cells always contain which data. Create a collection template in Excel that pulls those cells automatically.
  • Archive originals — keep the original scanned PDFs in a named folder system (by year → supplier → invoice date). Audits sometimes require the source document, and the Excel file alone is not sufficient evidence of what was received.
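The year → supplier → invoice date folder scheme from the last point is easy to apply consistently with a small helper. This is a sketch under stated assumptions: the root folder name, the file-naming pattern, and the supplier "slug" rule are all illustrative choices, and supplier names written in non-Latin scripts would need a different rule.

```python
import re
from datetime import date
from pathlib import PurePosixPath


def archive_path(root: str, supplier: str, invoice_date: date, invoice_number: str) -> PurePosixPath:
    """Build a year/supplier/date_number.pdf path for the original scan."""
    # Lowercase the name and collapse anything that is not a-z or 0-9 into a hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", supplier.lower()).strip("-")
    filename = f"{invoice_date.isoformat()}_{invoice_number}.pdf"
    return PurePosixPath(root) / str(invoice_date.year) / slug / filename


print(archive_path("invoices", "Acme Supplies Ltd.", date(2026, 5, 17), "INV-1001"))
```

Using ISO dates (year-month-day) in the file name means an alphabetical sort of the folder is also a chronological one.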

Limitations of this workflow

This pipeline works well for standard text-based invoices. It has inherent limitations that are worth knowing before choosing whether to invest in more specialised software:

  • Handwritten invoices — Tesseract OCR accuracy on cursive handwriting is significantly lower than on printed text. Expect 60–75% accuracy rather than 95%+. Handwritten invoices may require more manual review than they save.
  • Thermal paper — receipts and thermal invoices fade over time and often have very light printing from the start. Scan them promptly (ideally the same day received) at 400 DPI.
  • Complex multi-page invoices — invoices that span 5+ pages with subtotals per section and a master total on the final page require careful verification of which subtotal maps to which section.
  • Non-English text — way2pdf's OCR supports multiple languages but configuration defaults to English. Invoices in other scripts (Arabic, Chinese, Cyrillic) may need a dedicated multilingual OCR configuration.

When to consider dedicated AP automation software

This three-step workflow is the right approach for teams processing up to roughly 200 invoices a month with some tolerance for manual verification. Above that volume — or in regulated industries where a complete audit trail is required — dedicated accounts payable automation tools (Dext, Hubdoc, Rossum, ABBYY FlexiCapture) provide automated field extraction, PO matching, approval workflows, and direct accounting software integration that the manual workflow cannot replicate efficiently. This guide assumes you are not yet at that scale or budget.
