v17

v17.6.0

  • OCRmyPDF now validates and repairs malformed page-boundary boxes (/MediaBox, /CropBox, /TrimBox, /ArtBox, /BleedBox) in its input, following the PDF 2.0 specification. Coordinates written in invalid exponential notation are reinterpreted (#1398); rectangles whose corners are given in reversed order are normalized, which previously crashed with NegativeDimensionError (#1526); and a crop/trim/art/bleed box that falls outside the MediaBox is clamped to their intersection, or discarded when that intersection is empty, which previously produced an output with a zero-height effective page that some viewers refused to open (#1400). When a box is discarded, clamped, or reinterpreted, OCRmyPDF logs a warning recommending visual inspection of the output. Thanks @ajdlinux for the initial fix in PR #1691.

  • OCRmyPDF now discards an embedded Adobe full-text search index (/Root/PieceInfo/SearchIndex) from its output. This proprietary index, produced by Acrobat’s “Embed Index” feature, is read only by Adobe Acrobat; other viewers ignore it and search the text on the fly. Because any change to a PDF invalidates the index, retaining it after OCRmyPDF rewrites the document would leave a stale index that returns incorrect search results in Acrobat. Modern viewers rebuild a search index on demand, so there is no loss of search capability.

  • OCRmyPDF now discards embedded per-page thumbnail images (the optional /Thumb image XObject on a page) from its output. OCRmyPDF alters page appearance (deskew, clean, rasterize, re-render) and plugins may edit pages arbitrarily, so a retained thumbnail would be stale and no longer match its page. Embedded thumbnails are a navigation aid that modern viewers generate on demand, so there is no loss of functionality.

  • Fixed a regression in OCR quality for PDFs that paint a 1-bit image mask (stencil) with a gray or colored fill color. Previously such pages were rasterized as 1-bit black-and-white before OCR, so Ghostscript dithered mid-tone text into an unreadable stipple and Tesseract failed to recognize it. The rasterizer now inspects the fill color used to paint a mask and promotes the page to grayscale or full color as needed, so the distinction is preserved for the OCR engine. This applies to both the Ghostscript and pypdfium rasterizers. #1688

  • The default 1-bit raster device for Ghostscript is now pngmonod (error-diffusion) instead of pngmono (ordered dithering). It produces better input for OCR on faint or anti-aliased scans at negligible cost and no change to output file size, since the rasterized image is an intermediate that is discarded after OCR.
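
The page-box repair described in the first v17.6.0 item can be illustrated with a small sketch. This is not OCRmyPDF's implementation: the helper names normalize_box and clamp_to_mediabox are hypothetical, boxes are plain (x0, y0, x1, y1) tuples, and an intersection with zero width or height is assumed to count as empty.

```python
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def normalize_box(box: Box) -> Box:
    """Reorder corners so (x0, y0) is lower-left and (x1, y1) is upper-right."""
    x0, y0, x1, y1 = box
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


def clamp_to_mediabox(box: Box, mediabox: Box) -> Optional[Box]:
    """Return the intersection of box and mediabox, or None if it is empty."""
    bx0, by0, bx1, by1 = normalize_box(box)
    mx0, my0, mx1, my1 = normalize_box(mediabox)
    x0, y0 = max(bx0, mx0), max(by0, my0)
    x1, y1 = min(bx1, mx1), min(by1, my1)
    if x1 <= x0 or y1 <= y0:
        # Assumed policy: the caller discards the box and logs a warning.
        return None
    return (x0, y0, x1, y1)
```

For a US Letter MediaBox of (0, 0, 612, 792), a CropBox of (-10, -10, 700, 800) clamps to the MediaBox itself, while a CropBox lying entirely outside it yields None and would be discarded.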

v17.5.0

  • Added support for the end alias in --pages, denoting the last page of the document. For example, --pages 3-end OCRs from page 3 through the final page. #1615

  • Added --ghostscript-jpeg-quality and --ghostscript-jpeg-maxdpi advanced options for tuning Ghostscript’s PDF/A output. The optimizer’s --jpeg-quality remains the recommended file-size control.

  • Fixed pypdfium2 rasterizer clipping content when the CropBox was smaller than the MediaBox (e.g. JSTOR or cropped PDFs). #1685

  • Fixed Form XObject cycle detection in the optimizer’s image xref scan. Self-referential or DAG-shaped Form graphs (notably from PowerPoint exports) previously produced floods of recursion warnings and could hang for minutes. #1321

  • Tesseract config errors are now surfaced as TesseractConfigError with actionable guidance, instead of crashing later with a confusing FileNotFoundError on the missing hOCR output. #1687

  • Refreshed the Chinese README translation. Thanks @cislunarspace.

  • Internal refactoring of the _exec and subprocess modules to separate probing from execution.

  • CI dependency updates.
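
As a rough illustration of the end alias in --pages, the sketch below expands a page specification into 1-based page numbers. The function parse_pages is hypothetical and is not OCRmyPDF's own parser; it assumes comma-separated items, lowercase end, and inclusive ranges.

```python
def parse_pages(spec: str, page_count: int) -> list[int]:
    """Expand a spec such as '1,3-end' into sorted 1-based page numbers."""

    def resolve(token: str) -> int:
        token = token.strip()
        # 'end' stands for the last page of the document.
        return page_count if token == "end" else int(token)

    pages: set[int] = set()
    for part in spec.split(","):
        first, sep, last = part.partition("-")
        start = resolve(first)
        stop = resolve(last) if sep else start
        if not 1 <= start <= stop <= page_count:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, stop + 1))
    return sorted(pages)
```

With a five-page document, parse_pages("3-end", 5) selects pages 3, 4, and 5.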

v17.4.2

  • Fixed Python API unconditionally overriding PIL.Image.MAX_IMAGE_PIXELS when the caller did not explicitly set max_image_mpixels. Host applications (e.g. Paperless-NGX) that configure the PIL limit before invoking ocrmypdf.ocr() now have their setting respected. The CLI default of 250 megapixels is unchanged. #1665

  • Updated uv.lock to avoid pinning a vulnerable version of Pillow. #1666
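
The MAX_IMAGE_PIXELS fix amounts to distinguishing "not set" from an explicit value. A minimal sketch of that pattern follows; resolve_pixel_limit is a hypothetical helper, host_limit stands in for whatever PIL.Image.MAX_IMAGE_PIXELS currently holds, and any special values the real option may accept are not modelled.

```python
from typing import Optional


def resolve_pixel_limit(
    max_image_mpixels: Optional[float], host_limit: Optional[int]
) -> Optional[int]:
    """Pick the pixel limit to apply.

    max_image_mpixels: value the caller passed explicitly, or None if unset.
    host_limit: limit the host application already configured.
    """
    if max_image_mpixels is None:
        # Caller did not ask for a limit: leave the host's setting alone.
        return host_limit
    return int(max_image_mpixels * 1_000_000)
```

Under this scheme a host that configured its own limit before calling the API keeps it, and only an explicit max_image_mpixels replaces it.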

v17.4.1

  • Fixed RTL text extraction order in the fpdf2 renderer. Arabic lam-alef ligatures and other multi-character CMap entries were garbled by the bidi algorithm during text extraction. #1655

  • Fixed work_folder not being set in PdfContext options when using the Python API. Thanks @bluebox-steven. #1613

  • Updated Ghostscript JPEG corruption warning to include the detected version number, confirming the bug persists in Ghostscript 10.7.0.

  • Internal refactoring.

  • CI dependency updates.

v17.4.0

  • Added --no-overwrite / -n option to prevent overwriting output files. If the destination file already exists, OCRmyPDF exits with code 5 (OutputFileAccessError). #1642

  • Fixed text layer stretching in the fpdf2 renderer for widely-spaced words. The horizontal scaling (Tz) was incorrectly stretched to fill inter-word gaps instead of relying on Td positioning, causing text selection to highlight far beyond the actual word boundaries. #1635

  • Fixed optimize=2 or optimize=3 crash when using the Python API without explicitly setting jpg_quality or png_quality. #1641

  • Fixed verapdf availability check crashing with NotADirectoryError on some platforms. #1638
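
The --no-overwrite behavior can be pictured with the following sketch. It is an assumption-laden illustration rather than OCRmyPDF's code: check_output_path and the constant name are hypothetical, and only the exit code 5 comes from the note above.

```python
import sys
from pathlib import Path

EXIT_OUTPUT_FILE_ACCESS = 5  # exit code named in the release note


def check_output_path(output: Path, no_overwrite: bool) -> int:
    """Return 0 if writing may proceed, or the error exit code if it may not."""
    if no_overwrite and output.exists():
        print(f"{output} already exists; refusing to overwrite", file=sys.stderr)
        return EXIT_OUTPUT_FILE_ACCESS
    return 0
```

A check followed by a later write leaves a small race window; a program that needs stronger guarantees could instead open the destination with mode "xb", which fails if the file already exists.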

v17.3.0

  • Fixed Python API ignoring the language parameter, always defaulting to eng. The API now correctly maps language to OcrOptions languages and splits +-separated codes (e.g. eng+deu) to match CLI behavior. #1640

  • Fixed Python API producing empty OCR output because tesseract_timeout defaulted to 0, causing Tesseract to time out immediately. The default is now None, falling back to the plugin’s 180-second timeout. #1636

  • Fixed OCR text layer displacement on PDFs with non-zero MediaBox origins (e.g. JSTOR or cropped PDFs). The coordinate transformation matrix is now always computed, not skipped when rotation is zero. #1630

  • Restored image overlay support (--image) for the hocrtransform tool, enabling sandwich PDF output with the fpdf2 renderer. #1634

  • Docker: updated Alpine base image to 3.23.

  • Documentation restructured into per-major-version release notes files.

  • Release process improvements.
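
The two Python API fixes for language and tesseract_timeout can be sketched as below. Both helpers are hypothetical and only mirror the behavior described above; the eng fallback when no language is given is an assumption, and the 180-second default is taken from the note.

```python
from typing import Optional, Sequence, Union

PLUGIN_DEFAULT_TIMEOUT = 180.0  # seconds, per the release note


def normalize_languages(language: Union[str, Sequence[str], None]) -> list[str]:
    """Split '+'-separated codes so 'eng+deu' and ['eng', 'deu'] are equivalent."""
    if language is None:
        return ["eng"]  # assumed fallback when no language is given
    items = [language] if isinstance(language, str) else list(language)
    return [code for item in items for code in item.split("+") if code]


def effective_timeout(tesseract_timeout: Optional[float]) -> float:
    """None means 'not set'; only then fall back to the plugin default."""
    return PLUGIN_DEFAULT_TIMEOUT if tesseract_timeout is None else tesseract_timeout
```

The key design point is that None, not 0, represents "unset", so an explicit 0 keeps whatever meaning the engine gives it.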

v17.2.0

  • Fixed incorrect word spacing in poppler-based PDF viewers and tools (Evince, pdftotext, and others) where words on the same line appeared separated by double newlines. This works around a poppler bug where Tz (horizontal scaling) is not carried across BT/ET boundaries. #1632

  • Fixed OCR text layer being visible instead of invisible due to incorrect fpdf2 text rendering mode attribute. This caused OCR text to appear when images were removed from the PDF. #1631

  • Fixed OCR text layer misalignment with non-zero mediabox origins, which affected cropped PDFs and JSTOR PDFs generated by iText. The --redo-ocr mode would shift text vertically on these files. #1630

  • Fixed Ghostscript rasterization failure with very low DPI values (below 10). OCRmyPDF now renders at a minimum of 10 DPI and resizes the output to match the originally requested dimensions. #1612
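
The low-DPI workaround in the last item can be sketched as a small planning step: render at no less than 10 DPI, then resize to the pixel size the requested DPI would have produced. The function plan_render is hypothetical, and the rounding and one-pixel minimum are assumptions made for the example.

```python
MIN_RENDER_DPI = 10  # floor taken from the release note


def plan_render(width_pt: float, height_pt: float, dpi: float):
    """Return (render_dpi, target_size_px) for a page measured in PDF points.

    The page is rendered at no less than MIN_RENDER_DPI; the caller then
    resizes the rendered image to target_size_px, the size the originally
    requested DPI would have produced.
    """
    render_dpi = max(dpi, MIN_RENDER_DPI)
    target_size = (
        max(1, round(width_pt / 72 * dpi)),  # 72 points per inch
        max(1, round(height_pt / 72 * dpi)),
    )
    return render_dpi, target_size
```

For a US Letter page (612 x 792 points) requested at 4 DPI, this renders at 10 DPI and resizes to 34 x 44 pixels.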

v17.1.0

  • Added the --tagged-pdf-mode option, which allows the TaggedPDF error message to be skipped.

  • Fixed an issue where deflated JPEGs (FlateDecode + DCTDecode) were counted as lossless images for the purpose of determining whether to compress to JPEG, causing file size inflation with some workflows (--mode force in particular).
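
To illustrate the deflated-JPEG case, the sketch below classifies an image by its whole filter chain rather than by the outermost filter alone. The helper is_lossy_image is hypothetical, filter names are given as plain strings, and only /DCTDecode is treated as lossy for simplicity.

```python
LOSSY_FILTERS = {"/DCTDecode"}


def is_lossy_image(filters: list[str]) -> bool:
    """Treat an image as lossy if any filter in its chain is a lossy codec.

    A 'deflated JPEG' has the chain ['/FlateDecode', '/DCTDecode']: the outer
    Flate layer is lossless, but the payload underneath is still a JPEG.
    """
    return any(name in LOSSY_FILTERS for name in filters)
```

Looking only at the first filter would misclassify ['/FlateDecode', '/DCTDecode'] as lossless, which is the kind of miscount the fix addresses.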

v17.0.1

  • Fixed output file size inflation when using the pypdfium rasterizer together with --mode force (formerly --force-ocr).

v17.0.0

Breaking changes

  • Plugin interface migration: Plugin hooks now receive OcrOptions objects instead of argparse.Namespace objects. Most plugins will continue working due to duck-typing compatibility, but plugin developers should update their type hints from Namespace to OcrOptions.

  • Built-in plugins no longer modify options in-place, improving immutability and code clarity.

  • Lossy JBIG2 removed: Lossy JBIG2 compression has been removed due to well-documented risks of character substitution errors. The --jbig2-lossy and --jbig2-page-group-size options are deprecated and emit a warning if used. Only lossless JBIG2 compression is supported.

  • PDF/A output behavior change: If neither Ghostscript nor verapdf is installed, --output-type auto (the new default) will produce a standard PDF instead of PDF/A. This is a change from previous versions where Ghostscript was required and PDF/A was always produced. This configuration is rare but users should be aware of the change.
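
The PDF/A behavior change can be summarized as a small decision function. This is a sketch under stated assumptions: resolve_auto_output_type and detect_tools are hypothetical, the return values "pdfa" and "pdf" are illustrative labels, and the executable names gs and verapdf are assumed and may differ by platform.

```python
import shutil


def resolve_auto_output_type(have_ghostscript: bool, have_verapdf: bool) -> str:
    """Return 'pdfa' when some PDF/A path is available, else plain 'pdf'."""
    return "pdfa" if (have_ghostscript or have_verapdf) else "pdf"


def detect_tools() -> tuple[bool, bool]:
    """Probe PATH for the two executables (names assumed: 'gs', 'verapdf')."""
    return shutil.which("gs") is not None, shutil.which("verapdf") is not None
```

Calling resolve_auto_output_type(*detect_tools()) on a machine with neither tool installed returns "pdf", matching the new fallback described above.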

New features

  • pypdfium2 rasterizer: Added optional pypdfium2-based PDF rasterization plugin as an alternative to Ghostscript for page rendering. Use --rasterizer pypdfium to enable (requires pip install pypdfium2). The default --rasterizer auto prefers pypdfium when available and falls back to Ghostscript.

  • Pluggable OCR engines: New --ocr-engine option allows selecting OCR engines:

    • auto (default): Uses Tesseract

    • tesseract: Explicit Tesseract selection

    • none: Skip OCR entirely for PDF processing-only workflows

    This prepares the foundation for future third-party OCR engine plugins.

  • Smart PDF/A conversion: New --output-type auto (now the default) produces best-effort PDF/A output without requiring Ghostscript when the verapdf validator is available. Falls back to traditional Ghostscript conversion when needed.

  • verapdf integration: Added optional verapdf validation for fast PDF/A conversion. When available, OCRmyPDF attempts speculative PDF/A conversion using pikepdf, validates with verapdf, and skips Ghostscript if validation passes.

  • Optional Ghostscript: As a consequence of the changes above, Ghostscript is now an optional dependency rather than a required one.

  • fpdf2 text renderer: Replaced legacy hOCR text renderer with new fpdf2-based implementation, providing better multilingual support and more accurate text positioning.

  • Improved Occulta glyphless font: The new Occulta font provides better handling of zero-width markers and double-width CJK characters for accurate text layer positioning.

  • Expanded multilingual font support: Added FontProvider infrastructure with language-aware font selection for Devanagari (Hindi, Sanskrit, Marathi, Nepali), CJK (Chinese, Japanese, Korean), Arabic script, and many other scripts. System font discovery reduces package size.

  • Simplified mode selection: New --mode (-m) argument consolidates processing options:

    • default: Error if text is found (standard behavior)

    • force: Rasterize all content and run OCR (replaces --force-ocr)

    • skip: Skip pages with existing text (replaces --skip-text)

    • redo: Re-OCR pages, stripping old text layer (replaces --redo-ocr)

    Legacy flags remain as silent aliases for backward compatibility.
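
The relationship between --mode and the legacy flags can be sketched with argparse. This is an illustration only, not OCRmyPDF's argument parser: the mapping table and function names are hypothetical, and what happens when a legacy flag and a conflicting --mode are both given is not modelled.

```python
import argparse

# Legacy flag (as an argparse attribute name) -> equivalent --mode value.
LEGACY_FLAG_TO_MODE = {
    "force_ocr": "force",
    "skip_text": "skip",
    "redo_ocr": "redo",
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="mode-demo")
    parser.add_argument(
        "-m", "--mode", choices=["default", "force", "skip", "redo"], default="default"
    )
    for flag in LEGACY_FLAG_TO_MODE:
        parser.add_argument("--" + flag.replace("_", "-"), action="store_true")
    return parser


def resolve_mode(argv: list[str]) -> str:
    args = build_parser().parse_args(argv)
    for flag, mode in LEGACY_FLAG_TO_MODE.items():
        if getattr(args, flag):
            return mode  # a legacy flag silently selects the matching mode
    return args.mode
```

In this sketch, resolve_mode(["--force-ocr"]) and resolve_mode(["-m", "force"]) both return "force".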

API improvements

  • Centralized validation logic in the OcrOptions Pydantic model

  • Removed scattered option mutation throughout the codebase

  • Better type safety for plugin development

  • Simplified plugin option handling

  • New OcrElement, OcrClass, and BoundingBox exports for OCR engine plugin developers

  • Extended OcrEngine ABC with generate_ocr() method for direct OCR tree output, removing the need to translate a modern engine’s output to hOCR or to write directly to PDF.

Bug fixes

  • Fixed double-compression of already-deflated JPEGs.

  • Fixed tesseract_cache plugin to properly handle cache misses.

  • Fixed handling of PDF page boxes (ArtBox, BleedBox) which were not being processed correctly.

  • Added thread safety lock to pypdfium plugin for concurrent operations.

  • Improved pdfminer.six compatibility with explicit word spacing.

Documentation

  • Updated cookbook to replace deprecated --tesseract-timeout 0 with --ocr-engine none.

  • Added comprehensive plugin documentation for new OCR engine framework.

Dependency changes

  • Requires: one of pypdfium2 or ghostscript for PDF rasterization (PDF to image)

    • Preferred: both

  • Requires: one of verapdf or ghostscript for PDF/A generation

    • Preferred: both

  • Recommended: pypdfium2 for PDF rasterization (new dependency)

  • Recommended: ghostscript (used to be Required)

  • Recommended: Noto fonts for improved OCR text positioning

  • Optional: verapdf for fast PDF/A validation (new dependency)

  • Requires: fpdf2 for text layer rendering (new dependency)

  • Recommended: replace typer with cyclopts in misc scripts (new dependency)

  • See docs/maintainers.md for details.

Migration guide for plugin developers

  • Update imports: from ocrmypdf._options import OcrOptions

  • Update type hints: def check_options(options: OcrOptions) instead of options: Namespace

  • Attribute access remains unchanged: options.languages, options.output_type, etc.

  • Remove any in-place option modifications; compute values at the point of use instead

  • Most existing plugins will continue working without changes due to duck-typing
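
The advice to compute values at the point of use, rather than modifying options in place, can be shown with a stand-in object. DemoOptions below is hypothetical and is not the real OcrOptions class; a frozen dataclass is used only so that accidental mutation fails loudly in the demonstration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DemoOptions:
    """Stand-in for an options object; NOT the real OcrOptions class."""

    languages: tuple[str, ...] = ("eng",)
    output_type: str = "auto"


def tesseract_lang_arg(options: DemoOptions) -> str:
    """Derive the value where it is needed instead of storing it on options."""
    return "+".join(options.languages)
```

A hook written in this style reads options.languages and returns a derived value, leaving the options object untouched for every other plugin that receives it.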