v17
v17.5.0
Added support for the end alias in --pages, denoting the last page of the document. For example, --pages 3-end OCRs from page 3 through the final page. #1615
Added --ghostscript-jpeg-quality and --ghostscript-jpeg-maxdpi advanced options for tuning Ghostscript’s PDF/A output. The optimizer’s --jpeg-quality remains the recommended file-size control.
Fixed pypdfium2 rasterizer clipping content when the CropBox was smaller than the MediaBox (e.g. JSTOR or cropped PDFs). #1685
Fixed Form XObject cycle detection in the optimizer’s image xref scan. Self-referential or DAG-shaped Form graphs (notably from PowerPoint exports) previously produced floods of recursion warnings and could hang for minutes. #1321
Tesseract config errors are now surfaced as TesseractConfigError with actionable guidance, instead of crashing later with a confusing FileNotFoundError on the missing hOCR output. #1687
Refreshed the Chinese README translation. Thanks @cislunarspace.
Internal refactoring of the _exec and subprocess modules to separate probing from execution.
CI dependency updates.
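The Form XObject fix above comes down to visiting each node of the reference graph at most once. The following is a minimal sketch of that idea, not OCRmyPDF's actual code: the `collect_images` helper and the dictionary-based graph are hypothetical stand-ins for PDF objects.

```python
from typing import Dict, List, Set


def collect_images(graph: Dict[str, List[str]], images: Set[str], root: str) -> Set[str]:
    """Return every image reachable from root, scanning each node at most once."""
    found: Set[str] = set()
    visited: Set[str] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in visited:
            # Already scanned: this breaks cycles and avoids re-walking
            # subtrees shared by several parents (the DAG case).
            continue
        visited.add(node)
        if node in images:
            found.add(node)
        stack.extend(graph.get(node, []))
    return found


# Hypothetical graph: form "A" references itself; forms "B" and "C" share "D".
graph = {
    "page": ["A", "B", "C"],
    "A": ["A", "img1"],
    "B": ["D"],
    "C": ["D"],
    "D": ["img2"],
}
images = {"img1", "img2"}
result = collect_images(graph, images, "page")
```

An explicit stack with a visited set keeps the scan linear in the number of references, whereas a naive recursive walk can revisit a shared subtree once per path that reaches it.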
v17.4.2
Fixed Python API unconditionally overriding PIL.Image.MAX_IMAGE_PIXELS when the caller did not explicitly set max_image_mpixels. Host applications (e.g. Paperless-NGX) that configure the PIL limit before invoking ocrmypdf.ocr() now have their setting respected. The CLI default of 250 megapixels is unchanged. #1665
Updated uv.lock to avoid pinning a vulnerable version of Pillow. #1666
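The MAX_IMAGE_PIXELS fix can be read as "only override the host's limit when the caller asked for one". Here is a rough sketch of that rule. It assumes a hypothetical helper, `resolve_max_image_pixels`, and models the PIL limit as a plain integer so that it runs without Pillow; the real implementation may differ.

```python
from typing import Optional


def resolve_max_image_pixels(
    host_limit: Optional[int], max_image_mpixels: Optional[float]
) -> Optional[int]:
    """Pick the pixel limit to apply (hypothetical helper, not library code).

    host_limit stands in for whatever PIL.Image.MAX_IMAGE_PIXELS was set to
    before the API call; max_image_mpixels is the caller's explicit argument.
    """
    if max_image_mpixels is None:
        # Caller did not ask for a limit: leave the host's setting alone.
        return host_limit
    # Caller was explicit: convert megapixels to pixels.
    return int(max_image_mpixels * 1_000_000)


respected = resolve_max_image_pixels(50_000_000, None)
overridden = resolve_max_image_pixels(50_000_000, 250)
```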
v17.4.1
Fixed RTL text extraction order in the fpdf2 renderer. Arabic lam-alef ligatures and other multi-character CMap entries were garbled by the bidi algorithm during text extraction. #1655
Fixed work_folder not being set in PdfContext options when using the Python API. Thanks @bluebox-steven. #1613
Updated Ghostscript JPEG corruption warning to include the detected version number, confirming the bug persists in Ghostscript 10.7.0.
Internal refactoring.
CI dependency updates.
v17.4.0
Added --no-overwrite/-n option to prevent overwriting output files. If the destination file already exists, OCRmyPDF exits with code 5 (OutputFileAccessError). #1642
Fixed text layer stretching in the fpdf2 renderer for widely-spaced words. The horizontal scaling (Tz) was incorrectly stretched to fill inter-word gaps instead of relying on Td positioning, causing text selection to highlight far beyond the actual word boundaries. #1635
Fixed optimize=2 or optimize=3 crash when using the Python API without explicitly setting jpg_quality or png_quality. #1641
Fixed verapdf availability check crashing with NotADirectoryError on some platforms. #1638
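To illustrate the --no-overwrite behavior, here is a small sketch of the check it implies. The `check_destination` helper and the constant names are hypothetical; only the exit code 5 comes from the note above.

```python
import tempfile
from pathlib import Path

EXIT_OK = 0
EXIT_OUTPUT_FILE_ACCESS_ERROR = 5  # exit code cited in the release note


def check_destination(output: Path, no_overwrite: bool) -> int:
    """Return the exit code a wrapper would use before writing (illustrative)."""
    if no_overwrite and output.exists():
        return EXIT_OUTPUT_FILE_ACCESS_ERROR
    return EXIT_OK


with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "out.pdf"
    existing.write_bytes(b"%PDF-1.7\n")
    missing = Path(tmp) / "new.pdf"
    codes = (
        check_destination(existing, no_overwrite=True),   # refuses to overwrite
        check_destination(existing, no_overwrite=False),  # overwriting allowed
        check_destination(missing, no_overwrite=True),    # nothing to overwrite
    )
```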
v17.3.0
Fixed Python API ignoring the language parameter, always defaulting to eng. The API now correctly maps language to OcrOptions languages and splits +-separated codes (e.g. eng+deu) to match CLI behavior. #1640
Fixed Python API producing empty OCR output because tesseract_timeout defaulted to 0, causing Tesseract to time out immediately. The default is now None, falling back to the plugin’s 180-second timeout. #1636
Fixed OCR text layer displacement on PDFs with non-zero MediaBox origins (e.g. JSTOR or cropped PDFs). The coordinate transformation matrix is now always computed, not skipped when rotation is zero. #1630
Restored image overlay support (--image) for the hocrtransform tool, enabling sandwich PDF output with the fpdf2 renderer. #1634
Docker: updated Alpine base image to 3.23.
Documentation restructured into per-major-version release notes files.
Release process improvements.
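The language and tesseract_timeout fixes in this release both concern how API arguments are normalized. The sketch below shows one way such normalization could look. The helper names are hypothetical, and defaulting to eng when no language is given is an assumption; the 180-second fallback is the figure quoted above.

```python
from typing import List, Optional, Union

PLUGIN_DEFAULT_TIMEOUT = 180.0  # seconds, per the release note


def normalize_languages(language: Union[str, List[str], None]) -> List[str]:
    """Split '+'-separated codes into a flat list (hypothetical helper)."""
    if language is None:
        return ["eng"]  # assumed default when nothing is specified
    items = [language] if isinstance(language, str) else list(language)
    codes: List[str] = []
    for item in items:
        codes.extend(code for code in item.split("+") if code)
    return codes


def effective_timeout(tesseract_timeout: Optional[float]) -> float:
    """None means 'use the plugin default'; an explicit 0 is preserved."""
    if tesseract_timeout is None:
        return PLUGIN_DEFAULT_TIMEOUT
    return tesseract_timeout


langs = normalize_languages("eng+deu")
timeout = effective_timeout(None)
```

Treating None and 0 as distinct values is the point of the fix: None defers to the plugin, while 0 remains an explicit request.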
v17.2.0
Fixed incorrect word spacing in poppler-based PDF viewers and tools (Evince, pdftotext, and others) where words on the same line appeared separated by double newlines. This works around a poppler bug where Tz (horizontal scaling) is not carried across BT/ET boundaries. #1632
Fixed OCR text layer being visible instead of invisible due to incorrect fpdf2 text rendering mode attribute. This caused OCR text to appear when images were removed from the PDF. #1631
Fixed OCR text layer misalignment with non-zero mediabox origins, which affected cropped PDFs and JSTOR PDFs generated by iText. The --redo-ocr mode would shift text vertically on these files. #1630
Fixed Ghostscript rasterization failure with very low DPI values (below 10). OCRmyPDF now renders at a minimum of 10 DPI and resizes the output to match the originally requested dimensions. #1612
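The low-DPI workaround can be pictured as two steps: clamp the render resolution, then resize to the size that was actually requested. The following sketch only computes those numbers. `plan_render` is a hypothetical helper, and the rounding rule is an assumption.

```python
from typing import Tuple

MIN_RENDER_DPI = 10.0  # minimum render resolution named in the release note


def plan_render(
    width_pt: float, height_pt: float, requested_dpi: float
) -> Tuple[float, Tuple[int, int]]:
    """Return (dpi to render at, pixel size to resize the result to)."""
    render_dpi = max(requested_dpi, MIN_RENDER_DPI)
    # 72 PDF points per inch; the target size still follows the requested DPI.
    target_px = (
        max(1, round(width_pt / 72 * requested_dpi)),
        max(1, round(height_pt / 72 * requested_dpi)),
    )
    return render_dpi, target_px


# A 10 x 20 inch page requested at 5 DPI: render at 10 DPI, resize to 50 x 100.
low = plan_render(720, 1440, 5)
normal = plan_render(720, 1440, 300)
```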
v17.1.0
Added --tagged-pdf-mode to allow skipping the TaggedPDF error message, if desired.
Fixed an issue where deflated JPEGs (FlateDecode + DCTDecode) were counted as lossless images for the purpose of determining whether to compress to JPEG, causing file size inflation with some workflows (--mode force in particular).
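The deflated-JPEG fix depends on looking at the whole filter chain rather than only the outermost filter. Here is a minimal sketch of that classification, with a hypothetical `is_lossy_jpeg` helper; the filter names are the standard PDF ones, but the optimizer's real logic is not shown here.

```python
from typing import Sequence


def is_lossy_jpeg(filters: Sequence[str]) -> bool:
    """True if the stream is JPEG data, even when wrapped in FlateDecode."""
    return "/DCTDecode" in filters


deflated_jpeg = is_lossy_jpeg(["/FlateDecode", "/DCTDecode"])
plain_flate = is_lossy_jpeg(["/FlateDecode"])
```

Checking only the first filter would label the deflated JPEG as lossless, which is the misclassification described above.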
v17.0.1
Fixed output file size inflation when using pypdfium as rasterizer and force-ocr mode.
v17.0.0
Breaking changes
Plugin interface migration: Plugin hooks now receive OcrOptions objects instead of argparse.Namespace objects. Most plugins will continue working due to duck-typing compatibility, but plugin developers should update their type hints from Namespace to OcrOptions.
Built-in plugins no longer modify options in-place, improving immutability and code clarity.
Lossy JBIG2 removed: The --jbig2-lossy and --jbig2-page-group-size options have been removed due to well-documented risks of character substitution errors. These options are now deprecated and will emit warnings if used. Only lossless JBIG2 compression is supported.
PDF/A output behavior change: If neither Ghostscript nor verapdf is installed, --output-type auto (the new default) will produce a standard PDF instead of PDF/A. This is a change from previous versions where Ghostscript was required and PDF/A was always produced. This configuration is rare but users should be aware of the change.
New features
pypdfium2 rasterizer: Added optional pypdfium2-based PDF rasterization plugin as an alternative to Ghostscript for page rendering. Use --rasterizer pypdfium to enable (requires pip install pypdfium2). The default --rasterizer auto prefers pypdfium when available and falls back to Ghostscript.
Pluggable OCR engines: New --ocr-engine option allows selecting OCR engines:
auto (default): Uses Tesseract
tesseract: Explicit Tesseract selection
none: Skip OCR entirely for PDF processing-only workflows
This prepares the foundation for future third-party OCR engine plugins.
Smart PDF/A conversion: New --output-type auto (now the default) produces best-effort PDF/A output without requiring Ghostscript when the verapdf validator is available. Falls back to traditional Ghostscript conversion when needed.
verapdf integration: Added optional verapdf validation for fast PDF/A conversion. When available, OCRmyPDF attempts speculative PDF/A conversion using pikepdf, validates with verapdf, and skips Ghostscript if validation passes.
Optional Ghostscript: As a consequence of the changes above, Ghostscript is no longer a required dependency. It is optional.
fpdf2 text renderer: Replaced legacy hOCR text renderer with new fpdf2-based implementation, providing better multilingual support and more accurate text positioning.
Improved Occulta glyphless font: The new Occulta font provides better handling of zero-width markers and double-width CJK characters for accurate text layer positioning.
Expanded multilingual font support: Added FontProvider infrastructure with language-aware font selection for Devanagari (Hindi, Sanskrit, Marathi, Nepali), CJK (Chinese, Japanese, Korean), Arabic script, and many other scripts. System font discovery reduces package size.
Simplified mode selection: New --mode (-m) argument consolidates processing options:
default: Error if text is found (standard behavior)
force: Rasterize all content and run OCR (replaces --force-ocr)
skip: Skip pages with existing text (replaces --skip-text)
redo: Re-OCR pages, stripping old text layer (replaces --redo-ocr)
Legacy flags remain as silent aliases for backward compatibility.
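As an illustration of how a consolidated --mode option can coexist with silent legacy aliases, here is a small argparse sketch. It is not OCRmyPDF's parser: `build_parser` and `resolve_mode` are hypothetical, and letting a legacy flag win when both are given is an assumption.

```python
import argparse

# Legacy flag destinations and the modes they map to, per the list above.
LEGACY_FLAG_TO_MODE = {"force_ocr": "force", "skip_text": "skip", "redo_ocr": "redo"}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ocr-sketch")
    parser.add_argument(
        "-m", "--mode", choices=["default", "force", "skip", "redo"], default="default"
    )
    for dest in LEGACY_FLAG_TO_MODE:
        # SUPPRESS hides the alias from --help, keeping it "silent".
        parser.add_argument(
            "--" + dest.replace("_", "-"), action="store_true", help=argparse.SUPPRESS
        )
    return parser


def resolve_mode(args: argparse.Namespace) -> str:
    """Map a legacy flag to its mode, else use --mode (assumed precedence)."""
    for dest, mode in LEGACY_FLAG_TO_MODE.items():
        if getattr(args, dest):
            return mode
    return args.mode


parser = build_parser()
from_legacy = resolve_mode(parser.parse_args(["--force-ocr"]))
from_mode = resolve_mode(parser.parse_args(["-m", "skip"]))
from_default = resolve_mode(parser.parse_args([]))
```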
API improvements
Centralized validation logic in the OcrOptions Pydantic model
Removed scattered option mutation throughout the codebase
Better type safety for plugin development
Simplified plugin option handling
New OcrElement, OcrClass, and BoundingBox exports for OCR engine plugin developers
Extended OcrEngine ABC with generate_ocr() method for direct OCR tree output, removing the need to translate a modern engine’s output to hOCR or to write directly to PDF.
Bug fixes
Fixed double-compression of already-deflated JPEGs.
Fixed tesseract_cache plugin to properly handle cache misses.
Fixed handling of PDF page boxes (ArtBox, BleedBox) which were not being processed correctly.
Added thread safety lock to pypdfium plugin for concurrent operations.
Improved pdfminer.six compatibility with explicit word spacing.
Documentation
Updated cookbook to replace deprecated --tesseract-timeout 0 with --ocr-engine none.
Added comprehensive plugin documentation for new OCR engine framework.
Dependency changes
Requires: one of pypdfium2 or ghostscript for PDF rasterization (PDF to image)
Preferred: both
Requires: one of verapdf or ghostscript for PDF/A generation
Preferred: both
Recommended: pypdfium2 for PDF rasterization (new dependency)
Recommended: ghostscript (used to be Required)
Recommended: Noto fonts for improved OCR text positioning
Optional: verapdf for fast PDF/A validation (new dependency)
Requires: fpdf2 for text layer rendering (new dependency)
Recommended: replace typer with cyclopts in misc scripts (new dependency)
See docs/maintainers.md for details.
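The dependency rules above, together with the auto defaults described under New features, can be summarized as a small decision sketch. The function names and the strategy labels are hypothetical; only the preference order (pypdfium before Ghostscript, verapdf-validated conversion before Ghostscript, standard PDF when neither PDF/A tool exists) is taken from these notes.

```python
from typing import List


def choose_rasterizer(has_pypdfium: bool, has_ghostscript: bool) -> str:
    """Mimic '--rasterizer auto': prefer pypdfium, fall back to Ghostscript."""
    if has_pypdfium:
        return "pypdfium"
    if has_ghostscript:
        return "ghostscript"
    raise RuntimeError("one of pypdfium2 or ghostscript is required for rasterization")


def pdfa_attempts(has_verapdf: bool, has_ghostscript: bool) -> List[str]:
    """Ordered strategies '--output-type auto' might try (illustrative labels)."""
    attempts: List[str] = []
    if has_verapdf:
        attempts.append("pikepdf+verapdf")  # speculative conversion, then validate
    if has_ghostscript:
        attempts.append("ghostscript")  # traditional conversion as fallback
    if not attempts:
        attempts.append("standard-pdf")  # no PDF/A tool installed
    return attempts


rasterizer = choose_rasterizer(has_pypdfium=False, has_ghostscript=True)
attempts = pdfa_attempts(has_verapdf=True, has_ghostscript=True)
```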
Migration guide for plugin developers
Update imports: from ocrmypdf._options import OcrOptions
Update type hints: def check_options(options: OcrOptions) instead of options: Namespace
Attribute access remains unchanged: options.languages, options.output_type, etc.
Remove any in-place option modifications - compute values at point of use instead
Most existing plugins will continue working without changes due to duck-typing
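The duck-typing point can be shown without installing OCRmyPDF. In the sketch below, `OptionsStandIn` is a hypothetical frozen dataclass standing in for OcrOptions (the real class is a Pydantic model with many more fields), and `effective_output_type` illustrates computing a value at the point of use instead of mutating the options.

```python
from argparse import Namespace
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class OptionsStandIn:
    """Hypothetical stand-in for OcrOptions; frozen to mimic 'no in-place edits'."""

    languages: List[str] = field(default_factory=lambda: ["eng"])
    output_type: str = "auto"


def describe(options) -> str:
    """Works with either object because it relies only on attribute access."""
    return f"{'+'.join(options.languages)} -> {options.output_type}"


def effective_output_type(options, pdfa_tools_available: bool) -> str:
    """Derive the value where it is needed rather than writing it back."""
    if options.output_type == "auto":
        return "pdfa" if pdfa_tools_available else "pdf"
    return options.output_type


old_style = describe(Namespace(languages=["eng", "deu"], output_type="pdfa"))
new_style = describe(OptionsStandIn())
resolved = effective_output_type(OptionsStandIn(), pdfa_tools_available=False)
```

A hook written this way needs only its type hint changed when moving from Namespace to OcrOptions.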