v17

v17.5.0

  • Added support for the end alias in --pages, denoting the last page of the document. For example, --pages 3-end OCRs from page 3 through the final page. #1615

  • Added --ghostscript-jpeg-quality and --ghostscript-jpeg-maxdpi advanced options for tuning Ghostscript’s PDF/A output. The optimizer’s --jpeg-quality remains the recommended file-size control.

  • Fixed pypdfium2 rasterizer clipping content when the CropBox was smaller than the MediaBox (e.g. JSTOR or cropped PDFs). #1685

  • Fixed Form XObject cycle detection in the optimizer’s image xref scan. Self-referential or DAG-shaped Form graphs (notably from PowerPoint exports) previously produced floods of recursion warnings and could hang for minutes. #1321

  • Tesseract config errors are now surfaced as TesseractConfigError with actionable guidance, instead of crashing later with a confusing FileNotFoundError on the missing hOCR output. #1687

  • Refreshed the Chinese README translation. Thanks @cislunarspace.

  • Internal refactoring of the _exec and subprocess modules to separate probing from execution.

  • CI dependency updates.
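As an illustration of the end alias in --pages, here is a minimal sketch of how a page specification could be expanded. The helper parse_pages is hypothetical and written only for this note; it is not OCRmyPDF's own parser, and the real option may accept forms this sketch does not.

```python
def parse_pages(spec: str, page_count: int) -> list[int]:
    """Expand a page spec such as "1,3-end" into sorted 1-based page numbers.

    Hypothetical helper for illustration; not OCRmyPDF's actual parser.
    """

    def resolve(token: str) -> int:
        token = token.strip()
        # "end" is an alias for the last page of the document.
        return page_count if token == "end" else int(token)

    pages: set[int] = set()
    for part in spec.split(","):
        if "-" in part:
            start, stop = (resolve(t) for t in part.split("-", 1))
        else:
            start = stop = resolve(part)
        if not 1 <= start <= stop <= page_count:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, stop + 1))
    return sorted(pages)


# For a six-page document, "3-end" selects pages 3 through 6.
print(parse_pages("3-end", page_count=6))
```

On the command line the equivalent selection is simply --pages 3-end, as in the example above.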

v17.4.2

  • Fixed Python API unconditionally overriding PIL.Image.MAX_IMAGE_PIXELS when the caller did not explicitly set max_image_mpixels. Host applications (e.g. Paperless-NGX) that configure the PIL limit before invoking ocrmypdf.ocr() now have their setting respected. The CLI default of 250 megapixels is unchanged. #1665

  • Updated uv.lock to avoid pinning a vulnerable version of Pillow. #1666
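The max_image_mpixels fix can be pictured as a small decision rule. This is a sketch of the behavior described above, not the library's code: resolve_pixel_limit is a hypothetical name, and the conversion of megapixels to pixels by a factor of 1,000,000 is an assumption.

```python
from typing import Optional


def resolve_pixel_limit(
    host_limit: Optional[int], max_image_mpixels: Optional[float] = None
) -> Optional[int]:
    """Return the pixel limit that should be in force during OCR.

    host_limit stands for whatever the host application already stored in
    PIL.Image.MAX_IMAGE_PIXELS. Hypothetical helper for illustration only.
    """
    if max_image_mpixels is None:
        # Caller did not set the option: leave the host's setting alone.
        return host_limit
    # Caller set it explicitly: that value wins (assumed 1 MP = 1,000,000 px).
    return int(max_image_mpixels * 1_000_000)


print(resolve_pixel_limit(500_000_000))       # host setting respected
print(resolve_pixel_limit(500_000_000, 250))  # explicit value takes over
```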

v17.4.1

  • Fixed RTL text extraction order in the fpdf2 renderer. Arabic lam-alef ligatures and other multi-character CMap entries were garbled by the bidi algorithm during text extraction. #1655

  • Fixed work_folder not being set in PdfContext options when using the Python API. Thanks @bluebox-steven. #1613

  • Updated Ghostscript JPEG corruption warning to include the detected version number, confirming the bug persists in Ghostscript 10.7.0.

  • Internal refactoring.

  • CI dependency updates.

v17.4.0

  • Added --no-overwrite / -n option to prevent overwriting output files. If the destination file already exists, OCRmyPDF exits with code 5 (OutputFileAccessError). #1642

  • Fixed text layer stretching in the fpdf2 renderer for widely-spaced words. The horizontal scaling (Tz) was incorrectly stretched to fill inter-word gaps instead of relying on Td positioning, causing text selection to highlight far beyond the actual word boundaries. #1635

  • Fixed optimize=2 or optimize=3 crash when using the Python API without explicitly setting jpg_quality or png_quality. #1641

  • Fixed verapdf availability check crashing with NotADirectoryError on some platforms. #1638
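The --no-overwrite behavior amounts to a check on the destination before any output is written. The sketch below assumes that shape; exit_code_for_output is a hypothetical helper, and only the exit code 5 (OutputFileAccessError) comes from the note above.

```python
from pathlib import Path

# Exit code reported for OutputFileAccessError, per the release note.
EXIT_OUTPUT_FILE_ACCESS_ERROR = 5


def exit_code_for_output(output_file: Path, no_overwrite: bool) -> int:
    """Return 0 if writing may proceed, or 5 if it would clobber a file.

    Hypothetical helper for illustration; not OCRmyPDF's implementation.
    """
    if no_overwrite and output_file.exists():
        return EXIT_OUTPUT_FILE_ACCESS_ERROR
    return 0


# A destination that does not exist yet is always acceptable.
print(exit_code_for_output(Path("does-not-exist.pdf"), no_overwrite=True))
```

A wrapper script that invokes the CLI with --no-overwrite could treat a return code of 5 as "output already present" and skip the file rather than report a failure.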

v17.3.0

  • Fixed Python API ignoring the language parameter, always defaulting to eng. The API now correctly maps language to OcrOptions languages and splits +-separated codes (e.g. eng+deu) to match CLI behavior. #1640

  • Fixed Python API producing empty OCR output because tesseract_timeout defaulted to 0, causing Tesseract to time out immediately. The default is now None, falling back to the plugin’s 180-second timeout. #1636

  • Fixed OCR text layer displacement on PDFs with non-zero MediaBox origins (e.g. JSTOR or cropped PDFs). The coordinate transformation matrix is now always computed, not skipped when rotation is zero. #1630

  • Restored image overlay support (--image) for the hocrtransform tool, enabling sandwich PDF output with the fpdf2 renderer. #1634

  • Docker: updated Alpine base image to 3.23.

  • Documentation restructured into per-major-version release notes files.

  • Release process improvements.
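The language fix above describes splitting +-separated codes so the Python API matches the CLI. A minimal sketch of that normalization follows; split_languages is a hypothetical helper, not the function OCRmyPDF uses internally.

```python
from typing import Iterable, Union


def split_languages(language: Union[str, Iterable[str]]) -> list[str]:
    """Flatten "eng+deu"-style values into a list of individual codes.

    Hypothetical helper for illustration only.
    """
    items = [language] if isinstance(language, str) else list(language)
    codes: list[str] = []
    for item in items:
        # Each entry may itself bundle several codes with "+".
        codes.extend(code for code in item.split("+") if code)
    return codes


print(split_languages("eng+deu"))
print(split_languages(["eng+deu", "fra"]))
```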

v17.2.0

  • Fixed incorrect word spacing in poppler-based PDF viewers and tools (Evince, pdftotext, and others) where words on the same line appeared separated by double newlines. This works around a poppler bug where Tz (horizontal scaling) is not carried across BT/ET boundaries. #1632

  • Fixed OCR text layer being visible instead of invisible due to incorrect fpdf2 text rendering mode attribute. This caused OCR text to appear when images were removed from the PDF. #1631

  • Fixed OCR text layer misalignment with non-zero mediabox origins, which affected cropped PDFs and JSTOR PDFs generated by iText. The --redo-ocr mode would shift text vertically on these files. #1630

  • Fixed Ghostscript rasterization failure with very low DPI values (below 10). OCRmyPDF now renders at a minimum of 10 DPI and resizes the output to match the originally requested dimensions. #1612
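The low-DPI workaround can be sketched as two numbers: the DPI actually used for rendering and the pixel size the image is resized to afterwards. The function name and the rounding rule below are assumptions for illustration; only the 10 DPI floor comes from the note above.

```python
MIN_RENDER_DPI = 10  # floor stated in the release note


def plan_rasterization(
    width_in: float, height_in: float, requested_dpi: float
) -> tuple[float, tuple[int, int]]:
    """Return (render_dpi, target_size_px) for a page measured in inches.

    Hypothetical helper: render at no less than MIN_RENDER_DPI, then resize
    to the pixel dimensions the requested DPI would have produced.
    """
    render_dpi = max(requested_dpi, MIN_RENDER_DPI)
    target_size = (
        max(1, round(width_in * requested_dpi)),
        max(1, round(height_in * requested_dpi)),
    )
    return render_dpi, target_size


# A US Letter page requested at 4 DPI is rendered at 10 DPI,
# then resized to 34 x 44 pixels.
print(plan_rasterization(8.5, 11, 4))
```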

v17.1.0

  • Added --tagged-pdf-mode to allow skipping the TaggedPDF error message, if desired.

  • Fixed an issue where deflated JPEGs (FlateDecode + DCTDecode) were counted as lossless images for the purpose of determining whether to compress to JPEG, causing file size inflation with some workflows (--mode force in particular).

v17.0.1

  • Fixed output file size inflation when using the pypdfium rasterizer in force-ocr mode.

v17.0.0

Breaking changes

  • Plugin interface migration: Plugin hooks now receive OcrOptions objects instead of argparse.Namespace objects. Most plugins will continue working due to duck-typing compatibility, but plugin developers should update their type hints from Namespace to OcrOptions.

  • Built-in plugins no longer modify options in-place, improving immutability and code clarity.

  • Lossy JBIG2 removed: Lossy JBIG2 compression has been removed due to well-documented risks of character substitution errors. The --jbig2-lossy and --jbig2-page-group-size options are deprecated and emit a warning if used. Only lossless JBIG2 compression is supported.

  • PDF/A output behavior change: If neither Ghostscript nor verapdf is installed, --output-type auto (the new default) will produce a standard PDF instead of PDF/A. This is a change from previous versions where Ghostscript was required and PDF/A was always produced. This configuration is rare but users should be aware of the change.
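To make the --output-type auto change concrete, the sketch below lists which PDF/A routes are available for each combination of installed tools, using the verapdf and Ghostscript behavior described in these notes. The function and the strategy labels are hypothetical. The notes do not say what happens when verapdf validation fails and Ghostscript is absent, so the sketch does not model that case.

```python
def pdfa_strategies(has_verapdf: bool, has_ghostscript: bool) -> list[str]:
    """List the PDF/A routes that could be tried, in order.

    An empty list means no PDF/A route exists, in which case
    --output-type auto yields a standard PDF. Hypothetical helper.
    """
    strategies: list[str] = []
    if has_verapdf:
        # Speculative conversion checked by the verapdf validator.
        strategies.append("speculative+verapdf")
    if has_ghostscript:
        # Traditional Ghostscript conversion, used as the fallback.
        strategies.append("ghostscript")
    return strategies


for verapdf in (True, False):
    for gs in (True, False):
        plan = pdfa_strategies(verapdf, gs) or ["standard PDF"]
        print(f"verapdf={verapdf}, ghostscript={gs}: {plan}")
```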

New features

  • pypdfium2 rasterizer: Added optional pypdfium2-based PDF rasterization plugin as an alternative to Ghostscript for page rendering. Use --rasterizer pypdfium to enable (requires pip install pypdfium2). The default --rasterizer auto prefers pypdfium when available and falls back to Ghostscript.

  • Pluggable OCR engines: New --ocr-engine option allows selecting OCR engines:

    • auto (default): Uses Tesseract

    • tesseract: Explicit Tesseract selection

    • none: Skip OCR entirely for PDF processing-only workflows

    This prepares the foundation for future third-party OCR engine plugins.

  • Smart PDF/A conversion: New --output-type auto (now the default) produces best-effort PDF/A output without requiring Ghostscript when the verapdf validator is available. Falls back to traditional Ghostscript conversion when needed.

  • verapdf integration: Added optional verapdf validation for fast PDF/A conversion. When available, OCRmyPDF attempts speculative PDF/A conversion using pikepdf, validates with verapdf, and skips Ghostscript if validation passes.

  • Optional Ghostscript: As a consequence of the changes above, Ghostscript is now an optional dependency rather than a required one.

  • fpdf2 text renderer: Replaced legacy hOCR text renderer with new fpdf2-based implementation, providing better multilingual support and more accurate text positioning.

  • Improved Occulta glyphless font: The new Occulta font provides better handling of zero-width markers and double-width CJK characters for accurate text layer positioning.

  • Expanded multilingual font support: Added FontProvider infrastructure with language-aware font selection for Devanagari (Hindi, Sanskrit, Marathi, Nepali), CJK (Chinese, Japanese, Korean), Arabic script, and many other scripts. System font discovery reduces package size.

  • Simplified mode selection: New --mode (-m) argument consolidates processing options:

    • default: Error if text is found (standard behavior)

    • force: Rasterize all content and run OCR (replaces --force-ocr)

    • skip: Skip pages with existing text (replaces --skip-text)

    • redo: Re-OCR pages, stripping old text layer (replaces --redo-ocr)

    Legacy flags remain as silent aliases for backward compatibility.
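Scripts that still pass the legacy flags keep working, but they can be migrated mechanically. The sketch below rewrites an argument list using the flag-to-mode mapping given above; modernize_args is a hypothetical helper for wrapper scripts and not part of OCRmyPDF.

```python
# Mapping taken from the mode list above.
LEGACY_FLAG_TO_MODE = {
    "--force-ocr": "force",
    "--skip-text": "skip",
    "--redo-ocr": "redo",
}


def modernize_args(argv: list[str]) -> list[str]:
    """Replace legacy mode flags with the equivalent --mode argument."""
    result: list[str] = []
    for arg in argv:
        if arg in LEGACY_FLAG_TO_MODE:
            result += ["--mode", LEGACY_FLAG_TO_MODE[arg]]
        else:
            result.append(arg)
    return result


print(modernize_args(["ocrmypdf", "--force-ocr", "in.pdf", "out.pdf"]))
```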

API improvements

  • Centralized validation logic in the OcrOptions Pydantic model

  • Removed scattered option mutation throughout the codebase

  • Better type safety for plugin development

  • Simplified plugin option handling

  • New OcrElement, OcrClass, and BoundingBox exports for OCR engine plugin developers

  • Extended OcrEngine ABC with generate_ocr() method for direct OCR tree output, removing the need to translate a modern engine’s output to hOCR or to write directly to PDF.

Bug fixes

  • Fixed double-compression of already-deflated JPEGs.

  • Fixed tesseract_cache plugin to properly handle cache misses.

  • Fixed handling of PDF page boxes (ArtBox, BleedBox) which were not being processed correctly.

  • Added thread safety lock to pypdfium plugin for concurrent operations.

  • Improved pdfminer.six compatibility with explicit word spacing.

Documentation

  • Updated cookbook to replace deprecated --tesseract-timeout 0 with --ocr-engine none.

  • Added comprehensive plugin documentation for new OCR engine framework.

Dependency changes

  • Requires: one of pypdfium2 or ghostscript for PDF rasterization (PDF to image)

    • Preferred: both

  • Requires: one of verapdf or ghostscript for PDF/A generation

    • Preferred: both

  • Recommended: pypdfium2 for PDF rasterization (new dependency)

  • Recommended: ghostscript (used to be Required)

  • Recommended: Noto fonts for improved OCR text positioning

  • Optional: verapdf for fast PDF/A validation (new dependency)

  • Requires: fpdf2 for text layer rendering (new dependency)

  • Recommended: cyclopts, which replaces typer in misc scripts (new dependency)

  • See docs/maintainers.md for details.

Migration guide for plugin developers

  • Update imports: from ocrmypdf._options import OcrOptions

  • Update type hints: def check_options(options: OcrOptions) instead of options: Namespace

  • Attribute access remains unchanged: options.languages, options.output_type, etc.

  • Remove any in-place option modifications; compute values at the point of use instead

  • Most existing plugins will continue working without changes due to duck-typing
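The following sketch illustrates the "compute at point of use" advice. So that it runs without OCRmyPDF installed, it uses a stand-in class, StandInOptions, in place of OcrOptions; the supported-language set and the helper names are hypothetical. In a real plugin the hook would be decorated with OCRmyPDF's hookimpl marker and annotated with OcrOptions as described above.

```python
from dataclasses import dataclass, field


@dataclass
class StandInOptions:
    """Stand-in for OcrOptions (assumption: only these two attributes)."""

    languages: list[str] = field(default_factory=lambda: ["eng"])
    output_type: str = "auto"


# Languages a hypothetical engine plugin can handle.
SUPPORTED_LANGUAGES = {"eng", "deu"}


def unsupported_languages(options) -> list[str]:
    """Derive the value when needed instead of storing it on options."""
    return [lang for lang in options.languages if lang not in SUPPORTED_LANGUAGES]


def check_options(options) -> None:
    """Validate options without modifying them in place."""
    missing = unsupported_languages(options)
    if missing:
        raise ValueError(f"unsupported languages: {', '.join(missing)}")


opts = StandInOptions(languages=["eng", "deu"])
check_options(opts)    # passes; opts is left untouched
print(opts.languages)
```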