PDF Glossary

120 PDF terms explained in plain English, from everyday concepts to deep technical detail.

The PDF file format is one of the most complex document standards in common use. Whether you're a developer, a power user, or just trying to understand why your file won't open, this glossary covers the terms you'll encounter.

AcroForm

The original interactive form specification built into PDF. An AcroForm document contains fields (text boxes, checkboxes, radio buttons, dropdowns, signatures) that users can fill in with a PDF viewer. AcroForms are defined entirely within the PDF file itself using PDF dictionary objects; no external scripts or web server are needed. They are distinct from XFA forms, which use an XML-based approach that Adobe introduced later. Most PDF viewers support AcroForms; XFA support is more limited.

Annotation

An overlay element added to a PDF page without modifying the underlying page content. Common annotation types include: text comments (sticky notes), highlight/underline/strikethrough markup, freehand ink drawings, links, stamps, and signature fields. Annotations are stored separately from page content and can be shown or hidden independently. When you "comment" on a PDF in Adobe Reader or similar tools, you're adding annotations.

Appearance Stream (AP)

A PDF content stream that defines exactly how an interactive element (form field, annotation) should be drawn on screen. PDF viewers use the appearance stream to render the element without needing to interpret its semantics from scratch. Form fields that have been filled in typically have their filled appearance stored as an AP stream, allowing the filled state to be visible even in viewers that don't support form editing.

Artifact

In PDF/UA (the accessibility standard) and PDF/A, an artifact is content that appears in a page's visual output but is not part of the document's logical content: for example, running headers, page numbers, decorative borders, and background patterns. Marking content as an artifact tells assistive technologies (screen readers) to skip it.

Bates Numbering

A sequential numbering system applied to documents in legal and medical contexts to provide a unique, consistent identifier for every page. A Bates number typically combines a prefix, a zero-padded number, and sometimes a suffix, for example DEPO-000147. In PDF tools, Bates numbering is a specific variant of adding page numbers with a custom prefix and suffix format. The term comes from the Bates Manufacturing Company, which made the mechanical stamping device historically used for this purpose.
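The prefix, zero-padded number, and optional suffix described above can be sketched as follows. The function names and the six-digit default width are illustrative assumptions, not a standard.

```python
def bates_number(prefix: str, n: int, width: int = 6, suffix: str = "") -> str:
    """Format one Bates identifier: prefix, zero-padded number, optional suffix."""
    return f"{prefix}-{n:0{width}d}{suffix}"


def bates_range(prefix: str, start: int, page_count: int, width: int = 6) -> list[str]:
    """Return one consecutive identifier per page, starting at `start`."""
    return [bates_number(prefix, start + i, width) for i in range(page_count)]


print(bates_number("DEPO", 147))   # DEPO-000147
print(bates_range("DEPO", 1, 3))
```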

Blend Mode

A mathematical function that determines how a graphic element's colour combines with the colour of elements behind it. PDF defines a set of blend modes that closely mirrors those in Adobe Photoshop: Multiply, Screen, Overlay, Darken, Lighten, and others. Blend modes are part of PDF's transparency model (introduced in PDF 1.4) and are most commonly encountered in PDFs created from design applications like InDesign or Illustrator.
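As a rough illustration, four of the simpler blend functions can be written per colour component, with values in the range 0.0 to 1.0 (additive colour such as RGB). The function names here are illustrative; a real renderer also handles alpha compositing, which this sketch omits.

```python
def multiply(backdrop: float, source: float) -> float:
    return backdrop * source


def screen(backdrop: float, source: float) -> float:
    return backdrop + source - backdrop * source


def darken(backdrop: float, source: float) -> float:
    return min(backdrop, source)


def lighten(backdrop: float, source: float) -> float:
    return max(backdrop, source)


def blend_rgb(mode, backdrop: tuple, source: tuple) -> tuple:
    """Apply a per-component blend function to two RGB triples."""
    return tuple(mode(b, s) for b, s in zip(backdrop, source))


print(blend_rgb(multiply, (1.0, 0.5, 0.0), (0.5, 0.5, 0.5)))
print(blend_rgb(screen, (1.0, 0.5, 0.0), (0.5, 0.5, 0.5)))
```

Multiply never produces a lighter result than either input, and Screen never produces a darker one, which is why they are common for shadows and highlights respectively.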

Bookmark (Outline)

A named navigation entry in the PDF outline tree, visible as a clickable item in the Bookmarks panel of PDF viewers. Bookmarks link to specific pages or locations within the PDF and can be nested hierarchically. They are stored in an outline dictionary referenced from the document catalog. PDF bookmarks are unrelated to browser bookmarks: they are navigation aids embedded within the PDF file itself.

CIDFont

A font type used in PDF for embedding large character set fonts, particularly East Asian fonts (Chinese, Japanese, Korean) which contain thousands of characters. CID stands for Character Identifier. A CIDFont maps character identifiers to glyph outlines, allowing the PDF to include only the specific characters used in the document rather than the entire font. Usually used in conjunction with CMaps (character maps), which translate character codes to CIDs and, via a ToUnicode CMap, support text extraction and copy-paste.

Colour Space

Defines how colour values in a PDF are interpreted. Common colour spaces in PDF: DeviceRGB (red-green-blue, for screen display), DeviceCMYK (cyan-magenta-yellow-black, for print), DeviceGray (single channel grayscale), CalRGB/CalGray (calibrated colour spaces with a defined white point), and ICCBased (colour space defined by an ICC colour profile for accurate colour management across devices).

Compliance Level

PDF/A, PDF/X, PDF/UA, and PDF/E each define multiple parts and conformance levels with different restrictions. For example: PDF/A-1b requires visual appearance preservation; PDF/A-1a adds logical structure and accessibility requirements; PDF/A-2, a later part of the standard, adds JPEG 2000 compression and digital signature support. Within one part of a family, a document that meets a stricter conformance level (such as PDF/A-1a) also meets the requirements of the less strict levels (such as PDF/A-1b).

Content Stream

The sequence of PDF operators and their operands that describes the visual content of a PDF page. A content stream is a stream object containing instructions like: move to this position, set this font, draw this text, set this fill colour, draw this path. Content streams are usually compressed (typically with Flate/zlib) and must be decoded before they can be parsed. Understanding content streams is fundamental to PDF editing at a low level.

Cross-Reference Table (xref)

A critical internal structure in a PDF file that records the byte offset of every indirect object within the file. When a PDF viewer opens a file, it reads the xref table first to find where all objects are located; this allows efficient random access without reading the entire file sequentially. A corrupted xref table is a common cause of a PDF appearing broken or unreadable. PDF repair tools typically work by scanning the raw file bytes and reconstructing a valid xref table.

Also called: xref table, cross-reference stream (in newer PDFs)
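To make the structure concrete, here is a minimal sketch that parses a classic (uncompressed) xref section into a lookup of object number to byte offset. The sample data and the function name are made up for illustration; cross-reference streams in newer PDFs use a different, binary layout that this does not handle.

```python
SAMPLE_XREF = """xref
0 3
0000000000 65535 f
0000000017 00000 n
0000000081 00000 n
trailer
"""


def parse_xref_section(text: str) -> dict[int, tuple[int, int, bool]]:
    """Map object number -> (byte offset, generation, in_use)."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    if not lines or lines[0] != "xref":
        raise ValueError("not a classic xref section")
    entries = {}
    i = 1
    while i < len(lines):
        parts = lines[i].split()
        if len(parts) != 2:          # reached 'trailer' or other content
            break
        first, count = int(parts[0]), int(parts[1])   # subsection header
        for k in range(count):
            offset, gen, kind = lines[i + 1 + k].split()
            entries[first + k] = (int(offset), int(gen), kind == "n")
        i += 1 + count
    return entries


table = parse_xref_section(SAMPLE_XREF)
print(table[1])   # offset, generation, and in-use flag of object 1
```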

DPI (Dots Per Inch)

A measure of the resolution of raster images within a PDF. Images scanned at 600 DPI contain four times as many pixels as the same image at 300 DPI, making the file much larger. For screen viewing, 96–150 DPI is generally sufficient. For professional print output, 300 DPI is the standard minimum; high-quality photo printing may use 450–600 DPI. PDF itself does not use DPI as a native unit; it uses points (1/72 inch). The effective resolution of an embedded image, however, is described in DPI terms.
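The relationship between pixels, points, and effective DPI is simple arithmetic. A small sketch, with illustrative function names:

```python
def effective_dpi(pixel_width: int, placed_width_pt: float) -> float:
    """Effective resolution of an image drawn `placed_width_pt` points wide."""
    placed_width_inches = placed_width_pt / 72.0   # 72 points per inch
    return pixel_width / placed_width_inches


def pixel_count_ratio(dpi_a: float, dpi_b: float) -> float:
    """How many times more pixels a scan at dpi_a has than one at dpi_b."""
    return (dpi_a / dpi_b) ** 2    # resolution scales in both dimensions


# A 2550-pixel-wide scan placed across a US Letter page (612 pt = 8.5 in)
print(effective_dpi(2550, 612))        # 300.0
print(pixel_count_ratio(600, 300))     # 4.0
```

The same image placed at half the width on the page would have twice the effective DPI, which is why resolution in a PDF depends on placement as well as on the image data.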

Digital Signature

A cryptographic mechanism embedded in a PDF that proves the document was signed by a specific entity and has not been modified since signing. PDF digital signatures use public key cryptography: the signing software computes a hash (digest) of the document content and signs it with the signer's private key, and any recipient can verify the result using the signer's public certificate. Unlike a handwritten signature image (which is just a picture), a proper digital signature is tamper-evident and verifiable. PDF/A-2 and later versions support digital signatures.

Document Catalog

The root object of a PDF file's logical structure. Every PDF has exactly one document catalog (object type /Catalog) that contains references to all top-level document structures: the page tree, outline (bookmarks), named destinations, AcroForm dictionary, metadata, document-level JavaScript, and more. PDF viewers start reading a document by locating the catalog via the trailer dictionary.

Embedded Font

A font whose glyph data is included within the PDF file. Embedding a font guarantees that the document will render correctly on any device, even if the font is not installed on that system. There are two embedding strategies: full embedding (the complete font is included) and subsetting (only the glyphs actually used in the document are included, reducing file size). PDF/A requires all fonts to be embedded. If a font is not embedded and is not installed on the viewing system, the viewer substitutes another font, often causing layout changes.

Encryption (PDF)

PDF supports two kinds of password protection: a user password (required to open the file) and an owner password (controls permissions: printing, copying, editing). The encryption algorithm has evolved across PDF versions: RC4 40-bit (very weak, breakable in seconds), RC4 128-bit (PDF 1.4+), AES 128-bit (PDF 1.6+), and AES 256-bit (PDF 1.7 extension, widely supported). PDF encryption encrypts page content streams and other resources but leaves the xref table and document structure metadata readable.

Extract Pages

The operation of pulling one or more pages out of a PDF to create a new, smaller PDF. Technically, this involves copying the selected page objects (content streams, resources, annotations) from the source document into a new PDF with its own xref table and document catalog. A correctly implemented extraction preserves all content on the extracted pages (images, fonts, interactive elements) while discarding pages not in the selection.

Fast Web View (Linearization)

A PDF optimisation where the file is restructured so the first page can be displayed before the entire file is downloaded. In a standard PDF, the xref table is at the end of the file, so the viewer must download everything before rendering anything. A linearized PDF places a special linearization dictionary and the first page's content at the beginning of the file. Important for large PDFs served over the web.

Also called: linearized PDF, web-optimised PDF

Flate Compression

The most common compression algorithm used for PDF content streams and other data. Flate is the same algorithm used in ZIP files and PNG images (based on the DEFLATE algorithm). It provides lossless compression: the decompressed data is identical to the original. Most PDF streams (page content, font programs, ICC profiles) are Flate-compressed. An uncompressed PDF (useful for debugging) can be produced by decoding the streams and removing their Flate filters.

Also called: FlateDecode (the PDF filter name)
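Python's standard zlib module reads and writes the same zlib-format data that FlateDecode streams contain, so the lossless round trip can be demonstrated directly. The sample content below is invented for illustration.

```python
import zlib

# A made-up, repetitive page content stream (text-drawing operators)
content = b"BT /F1 12 Tf 72 720 Td (Hello, PDF) Tj ET\n" * 50

compressed = zlib.compress(content)      # what a writer would store
restored = zlib.decompress(compressed)   # what a reader recovers

print(len(content), "bytes before,", len(compressed), "bytes after")
print("lossless:", restored == content)
```

Repetitive operator sequences like this compress very well, which is one reason page content streams are almost always stored compressed.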

Font Subset

A version of a font embedded in a PDF that contains only the characters (glyphs) actually used in the document, rather than the complete font. Subsetting reduces file size significantly: a full Unicode font might be several MB, while the subset for a document using only basic Latin characters might be 20–40 KB. Font subsets are identified in PDF by a tag of six uppercase letters and a plus sign before the font name (e.g., ABCDEF+Helvetica). The tag is arbitrary, and it signals to viewers that the font is a subset.
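Detecting the subset prefix is a simple pattern match on the font name. A minimal sketch, with an illustrative function name:

```python
import re

# Six uppercase letters, a plus sign, then the real font name
SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_name(base_font: str) -> tuple:
    """Return (tag, font_name); tag is None when the font is not a subset."""
    match = SUBSET_TAG.match(base_font)
    if match:
        return match.group(1), match.group(2)
    return None, base_font


print(split_subset_name("ABCDEF+Helvetica"))
print(split_subset_name("Helvetica"))
```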

Form Fields

Interactive elements in a PDF (AcroForm) that accept user input. Types include: TextField (text input), CheckBox, RadioButton, ListBox, ComboBox (dropdown), PushButton, and Signature. Form field data can be submitted to a web server (via PDF's submit action), exported to FDF/XFDF format, or saved within the PDF itself. Field values can also be connected by JavaScript to create calculated fields and dynamic validation.

Glyph

The specific visual representation of a character in a particular font. A single character (e.g., the letter "a") may have multiple glyphs: the regular "a", italic "a", and bold "a" are all different glyphs. In PDF, text is encoded as a sequence of character codes that are mapped to glyph outlines in the embedded font program. The distinction between characters and glyphs matters for text extraction: a character has semantic meaning; a glyph is a visual shape.

Graphics State

The collection of parameters that control how graphical elements are rendered in a PDF page: current transformation matrix, line width, line cap style, line join style, fill colour, stroke colour, font and font size, rendering intent, opacity, blend mode, and clipping path. The graphics state can be saved to a stack (operator q) and restored (operator Q), allowing complex nested graphics operations. Understanding the graphics state is essential for implementing PDF rendering.
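The save/restore behaviour of q and Q can be modelled as a stack of state snapshots. This is a simplified sketch with only two parameters; the class name and dictionary keys are illustrative assumptions, and a real renderer tracks the full set listed above.

```python
import copy


class GraphicsStateStack:
    """Toy model of the PDF graphics state with q (save) and Q (restore)."""

    def __init__(self):
        self.current = {"line_width": 1.0, "fill_colour": (0.0, 0.0, 0.0)}
        self._stack = []

    def save(self):       # operator q
        self._stack.append(copy.deepcopy(self.current))

    def restore(self):    # operator Q
        if not self._stack:
            raise ValueError("Q without a matching q")
        self.current = self._stack.pop()


gs = GraphicsStateStack()
gs.save()                              # q
gs.current["line_width"] = 4.0         # e.g. the effect of '4 w'
gs.current["fill_colour"] = (1.0, 0.0, 0.0)
gs.restore()                           # Q
print(gs.current)                      # back to the saved values
```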

Header (PDF File Header)

The first line of a PDF file, which identifies the file as a PDF and specifies the version: %PDF-1.7 or %PDF-2.0. The second line typically contains four bytes with values above 127 (e.g., %âãÏÓ), which signals to file transfer programs that the file is binary data and should not be treated as text. A missing or malformed header is one reason a PDF viewer may refuse to open a file.
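A header check along these lines can be sketched in a few lines. The function names are illustrative, and the sketch assumes the header sits at byte 0 with newline line endings; real-world readers are usually more tolerant.

```python
import re

HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_version(data: bytes):
    """Return (major, minor) from the header, or None if it is missing."""
    match = HEADER.match(data)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def has_binary_marker(data: bytes) -> bool:
    """True if the second line contains at least four bytes above 127."""
    lines = data.split(b"\n", 2)
    if len(lines) < 2:
        return False
    return sum(byte > 127 for byte in lines[1]) >= 4


sample = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"
print(pdf_version(sample), has_binary_marker(sample))
```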

Hyperlink (Link Annotation)

A clickable area in a PDF that navigates to a URL, a specific page/location within the same PDF, or another PDF file. Implemented as a Link annotation with an action (URI action for external links, GoTo action for internal navigation). Hyperlinks in PDFs are always explicit: unlike HTML, where any element can be made clickable, PDF links require a defined rectangular clickable region.

ICC Colour Profile

A standardised data file that characterises the colour reproduction of a device or colour space, in a format defined by the International Color Consortium (ICC). In PDFs intended for professional print, images are often tagged with an ICC profile (e.g., ISO Coated v2 for European CMYK printing) to ensure accurate colour reproduction on press. PDF/X mandates specific ICC profile requirements. Including ICC profiles increases PDF file size slightly.

Incremental Update

A PDF editing mechanism where changes are appended to the end of the existing file rather than rewriting the entire file. The original content is preserved and the new xref table at the end of the file takes precedence. Incremental updates are used by digital signatures (so the signed content is not modified) and by many PDF editors. The downside: repeated incremental updates accumulate revision history, increasing file size. PDF optimisation (linearization or "save as copy") removes this history.

Indirect Object

A PDF object that is stored at a specific location in the file and referenced by other objects using an object number and generation number (e.g., 12 0 R means object 12, generation 0). Most substantial PDF data (pages, images, fonts, content streams) is stored as indirect objects so it can be referenced from multiple places without duplication and located via the xref table. Contrast with direct objects (embedded inline within another object).
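References of the form "12 0 R" can be spotted with a regular expression. This is only a rough sketch with an illustrative function name: it scans raw bytes and could also match text inside strings or binary stream data, which a real parser would skip.

```python
import re

# object number, generation number, then the keyword R
REFERENCE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def find_references(data: bytes) -> list:
    """Return (object number, generation) pairs for each 'N G R' found."""
    return [(int(num), int(gen)) for num, gen in REFERENCE.findall(data)]


page_dict = b"<< /Type /Page /Parent 3 0 R /Contents 12 0 R >>"
print(find_references(page_dict))
```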

JavaScript (in PDF)

PDF supports an embedded JavaScript engine (based on ECMAScript) that can respond to document events: opening, closing, printing, form field changes, button clicks. Common uses: calculating field values automatically (e.g., totals in a form), validating input, showing/hiding fields conditionally, and submitting form data. PDF JavaScript is a security concern because malicious PDFs can exploit JavaScript vulnerabilities in viewers. Most PDF viewers allow JavaScript to be disabled in settings.

JPEG Compression (DCT)

The lossy compression algorithm most commonly used for photographic images within PDFs. In PDF terminology, JPEG compression is implemented via the DCTDecode filter. JPEG reduces image file size by discarding fine detail that human vision is less sensitive to, at a configurable quality level. Critically, re-compressing an already-JPEG-compressed image (as happens during PDF compression) introduces additional quality loss at each generation. For high-quality archival, lossless formats are preferred.

JPEG 2000

An advanced image compression standard that supports both lossy and lossless modes, better quality-to-size ratios than JPEG at high compression, and progressive decoding. In PDF, JPEG 2000 is supported from PDF 1.5 onwards (via the JPXDecode filter) and is permitted in PDF/A-2 and PDF/A-3, though not in PDF/A-1. Despite technical advantages, JPEG 2000 is less universally supported than standard JPEG in web and consumer contexts.

Layer (Optional Content Group)

A mechanism that allows different content to be shown or hidden in a PDF viewer. Layers in PDF are called Optional Content Groups (OCGs). They are commonly used in engineering drawings (to show/hide dimensions, annotations, or different design layers), maps (to show/hide labels, terrain, or infrastructure), and multilingual documents (to show text in different languages on the same page). Layers are defined in the document catalog and can be toggled by the user in supporting viewers.

Also called: OCG (Optional Content Group)

Linearization

See Fast Web View.

Marked Content

Tagged regions of PDF content that carry semantic meaning, identified by a tag name and optional properties. In accessible documents (PDF/UA), marked content identifies headings, paragraphs, list items, figures, and table cells so that screen readers and other assistive technologies can interpret document structure. A PDF whose marked content is tied into a complete structure tree is called a "tagged PDF." Creating properly tagged PDFs requires careful authoring; many PDFs produced from print-focused workflows are untagged.

Metadata (PDF)

Descriptive information about the PDF document, stored separately from page content. PDF metadata exists in two forms: DocInfo dictionary (the older format, containing Title, Author, Subject, Keywords, Creator, Producer, creation and modification dates) and XMP metadata (an XML-based format embedded as a stream, used in PDF 1.4+). XMP is more extensible and is required by PDF/A. Metadata is readable without opening the full document and is indexed by operating systems and search engines.

MediaBox

A rectangle that defines the full physical size of a PDF page, including areas outside the visible/printable region. The other page boxes (CropBox, BleedBox, TrimBox, ArtBox) must be equal to or smaller than the MediaBox. The MediaBox is the outermost page boundary and is required for every PDF page. It is defined as an array of four numbers: [x_min, y_min, x_max, y_max] in points (1 point = 1/72 inch).

Merge (PDF Merge)

The operation of combining two or more PDF files into a single PDF document. During a merge, page objects from each source document are combined into a single page tree, fonts and images are deduplicated where possible, and a new document catalog and xref table are generated. A quality merge operation also consolidates bookmarks from all source documents. Page order in the merged document matches the input order, which is typically configurable.

Named Destination

A bookmark or link target that is referenced by name rather than by page number. For example, a hyperlink might point to a named destination "section-3-intro" rather than page 47. Named destinations are stored in the document catalog's Names dictionary. They are useful for cross-document links that should remain valid even if page numbers change due to editing. PDF viewers typically support navigating to named destinations via command-line arguments or URL fragment identifiers.

Object

The fundamental data unit in a PDF file. PDF objects can be: Boolean (true/false), Integer, Real number, String, Name, Array, Dictionary, Stream, or Null. Most meaningful PDF data is stored as dictionaries (key-value pairs) or streams (binary data with a dictionary header). Objects are either direct (embedded inline) or indirect (stored at a file offset, referenced by object number). The entire structure of a PDF (pages, fonts, images, form fields) is built from these object types.

OCR (Optical Character Recognition)

The technology that converts raster images of text (such as scanned documents) into machine-readable text. In the PDF context, OCR is used to add a searchable text layer to scanned PDFs. The process involves: image preprocessing (deskew, denoise, binarize), character segmentation, pattern recognition, and post-processing. Modern OCR engines like Tesseract use neural networks for recognition. OCR accuracy depends heavily on scan quality, font, and language. The resulting text layer is typically invisible but selectable and searchable.

Operator (PDF Operator)

A keyword in a PDF content stream that performs a specific drawing or state-changing action. Examples: BT/ET (begin/end text block), Tf (set font), Tj (show text string), m/l/h (move/line/close path), f/S (fill/stroke path), cm (modify current transformation matrix), q/Q (save/restore graphics state). PDF content streams are a sequence of operands followed by operators, similar in structure to PostScript but not identical.
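The operands-then-operator (postfix) layout can be seen by grouping tokens until an operator keyword appears. This is a deliberately small sketch with illustrative names: it handles only numbers, names, and simple literal strings, and ignores arrays, dictionaries, hex strings, comments, inline images, and keywords such as true or null.

```python
import re

TOKEN = re.compile(rb"""
    \((?:\\.|[^\\()])*\)          # literal string without nested parentheses
  | /[^\s/()<>\[\]]*              # name, e.g. /F1
  | [+-]?(?:\d+\.?\d*|\.\d+)      # integer or real number
  | [A-Za-z'"][A-Za-z0-9*]*       # operator keyword, e.g. Tf, T*, d0
""", re.VERBOSE)


def parse_operations(stream: bytes) -> list:
    """Return a list of (operator, [operands]) in stream order."""
    operations, operands = [], []
    for match in TOKEN.finditer(stream):
        token = match.group(0)
        first = token[:1]
        if first in (b"(", b"/") or first in b"+-.0123456789":
            operands.append(token.decode("latin-1"))
        else:
            operations.append((token.decode("ascii"), operands))
            operands = []
    return operations


for op, args in parse_operations(b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"):
    print(op, args)
```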

Optimize / Optimise (PDF Optimisation)

A broad term for operations that reduce PDF file size or improve performance without compromising content. Optimisation steps may include: downsampling and recompressing images, removing duplicate resources (fonts, images used multiple times), compressing object streams, removing unused objects and revision history (from incremental updates), linearizing for fast web view, and stripping unnecessary metadata. PDF optimisation is lossless with respect to text and vector content, but may degrade image quality depending on settings.

Page Tree

The hierarchical structure within a PDF that organises all pages. The document catalog references a root Pages node, which contains references to individual page objects or to intermediate Pages nodes. This tree structure allows PDF viewers to efficiently navigate documents with thousands of pages without loading all page objects at once. The page tree is a required component of every valid PDF file.

PDF (Portable Document Format)

A file format created by Adobe Systems in 1993 and standardised as ISO 32000 in 2008. PDF was designed to represent documents in a manner independent of application software, hardware, and operating system: a document should look identical on any device. PDF supports text, vector graphics, raster images, hyperlinks, forms, digital signatures, 3D content, video, and audio. The format is now open: ISO 32000-2 (PDF 2.0) was published in 2017 without any proprietary elements.

PDF/A

A family of ISO standards (ISO 19005) for long-term archival of PDF documents. PDF/A restricts features that might make a document unrenderable in the future: no encryption, no external content dependencies, all fonts embedded, colour spaces fully specified with ICC profiles, no JavaScript, no audio/video. Versions: PDF/A-1 (based on PDF 1.4), PDF/A-2 (based on PDF 1.7, adds JPEG 2000 and digital signatures), PDF/A-3 (allows embedding of arbitrary file formats as attachments). Government archives, legal systems, and libraries commonly require PDF/A format.

PDF/UA

ISO 14289, the PDF accessibility standard ("UA" stands for Universal Accessibility). PDF/UA defines requirements for creating PDFs that work correctly with assistive technologies (screen readers). Requirements include: all content tagged with correct structure tags, reading order specified, images with alt text, no flickering content, document language specified, and bookmarks present for long documents. PDF/UA compliance is increasingly required for public-sector and government documents under accessibility legislation.

PDF/X

A family of ISO standards for reliable PDF exchange in the graphic arts and printing industry. PDF/X eliminates features that cause inconsistent print output: it requires embedded fonts, prohibits encryption, and enforces specific colour space handling. Key versions: PDF/X-1a (CMYK and spot colours only, all fonts embedded), PDF/X-3 (allows calibrated RGB and ICC-managed colour), PDF/X-4 (supports transparency natively). Professional print workflows (magazines, packaging, large-format print) typically require PDF/X compliance.

PDF 2.0

The current major version of the PDF specification, published as ISO 32000-2 in 2017 (revised in 2020). PDF 2.0 adds: improved handling of page-level output intents, better unassociated annotation support, new encryption algorithms (AES-256), improvements to digital signatures, new page boundary definitions, better support for geospatial data, and removal of deprecated features from older versions. It also formally removes proprietary Adobe-only features that had crept into earlier specifications.

Point (pt)

The base unit of measurement in PDF. One point equals 1/72 of an inch (approximately 0.353 mm). All page dimensions, font sizes, and coordinates in a PDF are expressed in points. An A4 page (210 × 297 mm) is 595.28 × 841.89 points. A US Letter page (8.5 × 11 inches) is 612 × 792 points. This unit comes from traditional typography, where a "point" was a standard unit for expressing font sizes.
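The page sizes quoted above follow directly from the 1/72 inch definition. A short sketch with illustrative function names:

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72.0


def mm_to_pt(mm: float) -> float:
    return mm / MM_PER_INCH * POINTS_PER_INCH


def inch_to_pt(inches: float) -> float:
    return inches * POINTS_PER_INCH


def pt_to_mm(points: float) -> float:
    return points / POINTS_PER_INCH * MM_PER_INCH


print(round(mm_to_pt(210), 2), round(mm_to_pt(297), 2))   # A4: 595.28 841.89
print(inch_to_pt(8.5), inch_to_pt(11))                    # US Letter: 612.0 792.0
```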

PostScript

Adobe's page description language that PDF was originally derived from. PostScript is a full Turing-complete programming language interpreted by printers to produce output. PDF simplified PostScript by removing its programmatic nature (loops, conditionals, procedures) in favour of a static, page-based model. Many PDF creation workflows still involve an intermediate PostScript step (e.g., printing to a PostScript printer driver, then distilling to PDF). PostScript files typically have the .ps extension.

Producer (PDF Producer)

A metadata field in a PDF's document information dictionary that identifies the software library or component that generated the final PDF file (as distinct from the Creator, which identifies the application that authored it). For example, a document created in Microsoft Word and saved as PDF might show Creator: "Microsoft Word" and Producer: "Microsoft: Print To PDF". Tools like Ghostscript, iText, ReportLab, and PyMuPDF each identify themselves in the Producer field.

Redaction

The permanent removal of sensitive content from a PDF. True redaction deletes the underlying data and marks the area with a solid filled rectangle (typically black). It is not sufficient to draw a black rectangle on top of content: the original text and images remain in the document and can be exposed by removing the rectangle or by examining the content stream. Proper PDF redaction removes the redacted text and image data from the page content itself. Many court systems and governments require proper redaction of filed documents.

Resource Dictionary

A dictionary associated with a PDF page or content stream that lists all named resources used by that content: fonts, images (XObjects), colour spaces, patterns, shadings, and graphics state parameter dictionaries. A content stream cannot reference any resource that is not listed in its resource dictionary. Resources can be defined at the page level or inherited from a parent node in the page tree, allowing shared resources to be listed once for multiple pages.

RGB (DeviceRGB)

A colour model that represents colours as combinations of red, green, and blue light. In PDF, DeviceRGB is a device-dependent colour space: the actual colours produced depend on the display or printer. For accurate cross-device colour, a calibrated colour space (such as sRGB, expressed as an ICCBased colour space) should be used instead. RGB is the standard colour model for screen display; CMYK is standard for print. Converting between RGB and CMYK introduces colour shift that cannot be perfectly reversed.

Signature Field

A form field in a PDF that is intended to hold a digital signature. A signature field appears as an empty rectangular area until signed. When a user signs the field, the PDF viewer creates a digital signature using the user's certificate and private key, embeds it in the field, and makes any later modification of the signed content detectable. Multiple signature fields can exist in one document (for multi-party signing workflows). A visual appearance (a scanned handwritten signature image, for example) can be added to the field's appearance stream.

Split (PDF Split)

The operation of dividing a PDF document into multiple smaller PDFs. Common split modes: by individual pages (each page becomes a separate PDF), by page range (pages 1–10 in one file, 11–20 in another), by file size, or by blank page separators. Each resulting PDF must be a complete, valid document with its own document catalog, page tree, and xref table. Fonts and images referenced by the split pages must be copied into each resulting file.

Stream

A PDF object consisting of a dictionary (describing the stream's properties, including length and compression filters) followed by binary data enclosed between stream and endstream keywords. Streams are used for all variable-length binary data in PDF: page content, image data, font programs, ICC profiles, metadata, and embedded files. Streams are usually compressed with FlateDecode (zlib) and sometimes additionally encoded with ASCII85 or other filters.

Structure Tree

A hierarchical organisation of a PDF's logical content into semantic elements: document, section, paragraph, heading, list, list item, table, table row, table cell, figure, etc. The structure tree is the backbone of accessible (tagged) PDFs and is required by PDF/UA. It maps logical structure elements to their corresponding marked content in page content streams, allowing screen readers to present the document in a meaningful reading order independent of the visual layout.

Tagged PDF

A PDF that contains a structure tree marking up the logical content with semantic tags. Tagged PDFs are required for accessibility compliance (PDF/UA) and are recommended for any PDF that will be read by people using assistive technology. Creating a tagged PDF from scratch requires careful authoring; many PDFs produced from design or print-focused workflows are untagged. Some PDF tools (Adobe Acrobat, for example) can auto-tag PDFs, but the results require review and correction; checkers such as PAC 2024 help with that review.

Tesseract

An open-source OCR engine originally developed at HP in the 1980s and 1990s, open-sourced in 2005, and later developed with Google's sponsorship. Tesseract is the most widely used open-source OCR engine and supports over 100 languages. It has used LSTM (long short-term memory) neural networks for character recognition since version 4.0. Tesseract is the OCR engine used by way2pdf's text extraction feature. Performance varies by document quality: clean printed text on white paper gives excellent results; handwriting and decorative fonts give poor results.

Transformation Matrix (CTM)

A 3×3 matrix in PDF's graphics model, written in PDF as six numbers [a b c d e f], that maps coordinates from user space to device space. The Current Transformation Matrix (CTM) controls scaling, rotation, translation, and skewing of all graphical elements on a page. Operators like cm (concatenate matrix) modify the CTM. The CTM is part of the graphics state and can be saved and restored with q/Q. Understanding transformation matrices is essential for correctly rendering and extracting positioned content from PDFs.
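A small sketch of how such a matrix is applied and how cm concatenates, using the six-number form [a b c d e f] in which PDF writes matrices. The function names are illustrative. A point maps as x' = a·x + c·y + e and y' = b·x + d·y + f, and cm multiplies the new matrix in front of the existing CTM.

```python
IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)


def concat(m: tuple, ctm: tuple) -> tuple:
    """Return m x ctm: the CTM after a 'cm' operator with operands m."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = ctm
    return (
        a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2,
    )


def apply(ctm: tuple, x: float, y: float) -> tuple:
    """Transform the point (x, y) by the matrix."""
    a, b, c, d, e, f = ctm
    return (a * x + c * y + e, b * x + d * y + f)


# Equivalent of the content stream: 1 0 0 1 100 50 cm   2 0 0 2 0 0 cm
ctm = concat((1.0, 0.0, 0.0, 1.0, 100.0, 50.0), IDENTITY)   # translate
ctm = concat((2.0, 0.0, 0.0, 2.0, 0.0, 0.0), ctm)           # then scale
print(apply(ctm, 10.0, 10.0))   # (120.0, 70.0)
```

The point is scaled first and translated second, because the most recently concatenated matrix is the first one applied to coordinates.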

Trailer

The final section of a PDF file, located at the very end. It consists of the trailer dictionary, the startxref keyword with the byte offset of the xref table, and the %%EOF marker. The trailer dictionary contains the total number of xref entries and a reference to the document catalog. When a PDF viewer opens a file, it reads from the end backwards: find %%EOF, read the startxref offset, locate the xref table and trailer dictionary, locate the document catalog, then begin rendering. A missing or corrupt trailer is a common cause of a PDF being unreadable.
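The first step of that end-of-file reading can be sketched as a search for the startxref keyword, which holds the xref byte offset just before %%EOF. The function name, the 1024-byte tail size, and the sample bytes are assumptions for illustration.

```python
import re


def find_startxref(data: bytes, tail_size: int = 1024) -> int:
    """Return the xref byte offset recorded before the last %%EOF marker."""
    tail = data[-tail_size:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref/%%EOF pair found near end of file")
    return int(matches[-1])   # the last one wins after incremental updates


sample = (
    b"%PDF-1.7\n"
    b"...objects and xref table omitted...\n"
    b"trailer\n<< /Size 3 /Root 1 0 R >>\n"
    b"startxref\n116\n%%EOF\n"
)
print(find_startxref(sample))   # 116
```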

TrimBox

A page box that defines the intended finished dimensions of a printed page after trimming. In a print production PDF (PDF/X), the TrimBox defines where the printer's guillotine will cut. Content outside the TrimBox is in the bleed area (defined by BleedBox) and will be trimmed off. The TrimBox is always equal to or smaller than the MediaBox. Required by most PDF/X standards. Not relevant for PDFs not intended for physical print production.

Unicode

The international standard for text encoding that assigns a unique code point to every character in every writing system. In PDF, proper Unicode mapping of text is critical for: copy-paste of text from PDFs (without garbled characters), text search, screen reader accessibility, and OCR. A PDF font can embed a ToUnicode CMap that maps glyph IDs to Unicode code points. PDFs without proper ToUnicode mappings, common in PDFs created from design software, often produce garbled text when you try to copy-paste from them.

User Space

The coordinate system in which PDF content is specified. In user space, the default unit is the point (1/72 inch), the origin (0,0) is at the bottom-left corner of the page, x increases to the right, and y increases upward (unlike screen coordinate systems where y increases downward). User space coordinates are transformed to device space (actual pixels or printer dots) via the Current Transformation Matrix. Content can be placed anywhere in user space, including outside page boundaries.
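Converting from user space to a top-left-origin pixel grid therefore needs a vertical flip as well as scaling. A minimal sketch, assuming an unrotated page whose MediaBox starts at (0, 0); the function name is illustrative.

```python
def user_to_pixel(x_pt: float, y_pt: float, page_height_pt: float,
                  dpi: float = 72.0) -> tuple:
    """Map a user-space point to (column, row) pixels with y growing downward."""
    scale = dpi / 72.0                          # pixels per point
    return (x_pt * scale, (page_height_pt - y_pt) * scale)


# One inch in from the left and one inch down from the top of a US Letter page
print(user_to_pixel(72, 720, page_height_pt=792, dpi=144))   # (144.0, 144.0)
```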

Version (PDF Version)

PDF has gone through multiple versions since its introduction in 1993: 1.0 (1993), 1.1, 1.2, 1.3, 1.4 (transparency model), 1.5 (compressed object streams, JPEG 2000), 1.6 (AES encryption, 3D), 1.7 (became ISO 32000-1 in 2008), and 2.0 (ISO 32000-2, published 2017). The version is declared in the PDF header (%PDF-1.7) and affects which features are valid. Most modern viewers support at least PDF 1.7. Some older viewers may reject PDF 2.0 files.

Watermark

Text or an image overlaid semi-transparently on PDF pages, typically the word "DRAFT", "CONFIDENTIAL", or a company logo. In PDF, watermarks are usually implemented as either a page content stream addition (permanently burned in) or as an XObject (a reusable graphic) stamped onto pages. Some PDF tools add watermarks as annotations (which can be removed). True watermarks that are part of the page content stream cannot be removed without modifying the content stream directly.

Web-Optimised PDF

See Fast Web View (Linearization).

XFA (XML Forms Architecture)

An Adobe-proprietary XML-based form format that can be embedded within a PDF file. XFA forms are more powerful than AcroForms: they support dynamic layouts that reflow based on content length, complex conditional logic, and rich styling. However, XFA is defined in a separate Adobe specification rather than in the ISO PDF standard itself, and it is supported mainly by Adobe Reader and Adobe Acrobat (not by most other PDF viewers, including Chrome's built-in PDF viewer). XFA was deprecated in PDF 2.0 (ISO 32000-2).

XMP (Extensible Metadata Platform)

An XML-based metadata standard developed by Adobe and now an ISO standard (ISO 16684). In PDF, XMP metadata is stored as an XML stream attached to the document catalog (and optionally to individual objects). XMP can store far richer metadata than the older DocInfo dictionary, including copyright information, rights management, custom schemas, and synchronisation with Dublin Core, IPTC, and EXIF metadata. PDF/A requires XMP metadata; the DocInfo dictionary alone is insufficient.

XObject

A reusable PDF content element that is defined once and referenced multiple times. Types: Form XObject (a chunk of PDF content, not a form field, that can be stamped anywhere, used for watermarks, headers/footers, and repeated graphics) and Image XObject (a raster image resource). XObjects allow efficient storage when the same image or graphic appears on multiple pages: it is stored once in the file and referenced many times, rather than duplicated.

xref (Cross-Reference Table)

See Cross-Reference Table.

ZIP Compression

In the PDF context, "ZIP compression" informally refers to Flate (DEFLATE) compression, which is the same algorithm used inside ZIP archives. PDF uses Flate compression (via the FlateDecode filter) for content streams, font programs, and other data. It provides lossless compression: compressed data decompresses to the exact original byte sequence. Technically, PDF stores zlib-format DEFLATE streams rather than using the ZIP container format, but the underlying compression algorithm is identical.
