% SPDX-FileCopyrightText: 2022 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0

# v17

## v17.5.0

- Added support for the ``end`` alias in ``--pages``, denoting the last page of
  the document. For example, ``--pages 3-end`` OCRs from page 3 through the
  final page. {issue}`1615`
- Added ``--ghostscript-jpeg-quality`` and ``--ghostscript-jpeg-maxdpi`` advanced
  options for tuning Ghostscript's PDF/A output. The optimizer's
  ``--jpeg-quality`` remains the recommended file-size control.
- Fixed pypdfium2 rasterizer clipping content when the CropBox was smaller than
  the MediaBox (e.g. JSTOR or cropped PDFs). {issue}`1685`
- Fixed Form XObject cycle detection in the optimizer's image xref scan.
  Self-referential or DAG-shaped Form graphs (notably from PowerPoint exports)
  previously produced floods of recursion warnings and could hang for minutes.
  {issue}`1321`
- Tesseract config errors are now surfaced as ``TesseractConfigError`` with
  actionable guidance, instead of crashing later with a confusing
  ``FileNotFoundError`` on the missing hOCR output. {issue}`1687`
- Refreshed the Chinese README translation. Thanks @cislunarspace.
- Internal refactoring of the ``_exec`` and ``subprocess`` modules to separate
  probing from execution.
- CI dependency updates.

## v17.4.2

- Fixed Python API unconditionally overriding ``PIL.Image.MAX_IMAGE_PIXELS``
  when the caller did not explicitly set ``max_image_mpixels``. Host
  applications (e.g. Paperless-NGX) that configure the PIL limit before invoking
  ``ocrmypdf.ocr()`` now have their setting respected. The CLI default of 250
  megapixels is unchanged. {issue}`1665`
- Updated uv.lock to avoid pinning a vulnerable version of Pillow. {issue}`1666`

## v17.4.1

- Fixed RTL text extraction order in the fpdf2 renderer. Arabic lam-alef
  ligatures and other multi-character CMap entries were garbled by the bidi
  algorithm during text extraction. {issue}`1655`
- Fixed ``work_folder`` not being set in ``PdfContext`` options when using the
  Python API.
  Thanks @bluebox-steven. {issue}`1613`
- Updated Ghostscript JPEG corruption warning to include the detected version
  number, confirming the bug persists in Ghostscript 10.7.0.
- Internal refactoring.
- CI dependency updates.

## v17.4.0

- Added ``--no-overwrite`` / ``-n`` option to prevent overwriting output files.
  If the destination file already exists, OCRmyPDF exits with code 5
  (``OutputFileAccessError``). {issue}`1642`
- Fixed text layer stretching in the fpdf2 renderer for widely-spaced words. The
  horizontal scaling (Tz) was incorrectly stretched to fill inter-word gaps
  instead of relying on Td positioning, causing text selection to highlight far
  beyond the actual word boundaries. {issue}`1635`
- Fixed ``optimize=2`` or ``optimize=3`` crash when using the Python API without
  explicitly setting ``jpg_quality`` or ``png_quality``. {issue}`1641`
- Fixed ``verapdf`` availability check crashing with ``NotADirectoryError`` on
  some platforms. {issue}`1638`

## v17.3.0

- Fixed Python API ignoring the ``language`` parameter, always defaulting to
  ``eng``. The API now correctly maps ``language`` to OcrOptions ``languages``
  and splits ``+``-separated codes (e.g. ``eng+deu``) to match CLI behavior.
  {issue}`1640`
- Fixed Python API producing empty OCR output because ``tesseract_timeout``
  defaulted to 0, causing Tesseract to time out immediately. The default is now
  ``None``, falling back to the plugin's 180-second timeout. {issue}`1636`
- Fixed OCR text layer displacement on PDFs with non-zero MediaBox origins (e.g.
  JSTOR or cropped PDFs). The coordinate transformation matrix is now always
  computed, not skipped when rotation is zero. {issue}`1630`
- Restored image overlay support (``--image``) for the hocrtransform tool,
  enabling sandwich PDF output with the fpdf2 renderer. {issue}`1634`
- Docker: updated Alpine base image to 3.23.
- Documentation restructured into per-major-version release notes files.
- Release process improvements.

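
The two Python API fixes in v17.3.0 (splitting ``+``-separated language codes and the ``None`` timeout default) can be illustrated with a minimal sketch. The names ``split_languages``, ``effective_timeout`` and ``PLUGIN_DEFAULT_TIMEOUT`` are hypothetical and are not OCRmyPDF's own code; only the ``+`` separator and the 180-second figure are taken from the notes above.

```python
from __future__ import annotations

from typing import Iterable, Optional, Union

# Assumption: mirrors the "180-second timeout" mentioned in the release note.
PLUGIN_DEFAULT_TIMEOUT = 180.0


def split_languages(language: Union[str, Iterable[str], None]) -> list[str]:
    """Split '+'-separated language codes into a flat, de-duplicated list.

    Hypothetical helper: accepts a single string such as "eng+deu" or an
    iterable of such strings, and preserves first-seen order.
    """
    if language is None:
        return []
    items = [language] if isinstance(language, str) else list(language)
    codes: list[str] = []
    for item in items:
        for code in item.split("+"):
            code = code.strip()
            if code and code not in codes:
                codes.append(code)
    return codes


def effective_timeout(tesseract_timeout: Optional[float]) -> float:
    """Return the timeout to use; None means "fall back to the plugin default".

    An explicit 0 is passed through unchanged, which is why a default of 0
    would make OCR time out immediately.
    """
    if tesseract_timeout is None:
        return PLUGIN_DEFAULT_TIMEOUT
    return float(tesseract_timeout)


if __name__ == "__main__":
    print(split_languages("eng+deu"))
    print(effective_timeout(None))
```

The point of the sketch is the distinction between ``None`` ("not set, use the default") and ``0`` ("time out immediately"), which is the difference the fix for {issue}`1636` turns on.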
## v17.2.0

- Fixed incorrect word spacing in poppler-based PDF viewers and tools (Evince,
  pdftotext, and others) where words on the same line appeared separated by
  double newlines. This works around a poppler bug where Tz (horizontal scaling)
  is not carried across BT/ET boundaries. {issue}`1632`
- Fixed OCR text layer being visible instead of invisible due to incorrect fpdf2
  text rendering mode attribute. This caused OCR text to appear when images were
  removed from the PDF. {issue}`1631`
- Fixed OCR text layer misalignment with non-zero MediaBox origins, which
  affected cropped PDFs and JSTOR PDFs generated by iText. The ``--redo-ocr``
  mode would shift text vertically on these files. {issue}`1630`
- Fixed Ghostscript rasterization failure with very low DPI values (below 10).
  OCRmyPDF now renders at a minimum of 10 DPI and resizes the output to match
  the originally requested dimensions. {issue}`1612`

## v17.1.0

- Added `--tagged-pdf-mode` to allow skipping the TaggedPDF error message, if
  desired.
- Fixed an issue where deflated JPEGs (FlateDecode + DCTDecode) were counted as
  lossless images for the purpose of determining whether to compress to JPEG,
  causing file size inflation with some workflows (`--mode force` in
  particular).

## v17.0.1

- Fixed output file size inflation when using pypdfium as rasterizer and
  force-ocr mode.

## v17.0.0

**Breaking changes**

- **Plugin interface migration**: Plugin hooks now receive `OcrOptions` objects
  instead of `argparse.Namespace` objects. Most plugins will continue working
  due to duck-typing compatibility, but plugin developers should update their
  type hints from `Namespace` to `OcrOptions`.
- Built-in plugins no longer modify options in-place, improving immutability and
  code clarity.
- **Lossy JBIG2 removed**: The `--jbig2-lossy` and `--jbig2-page-group-size`
  options have been removed due to well-documented risks of character
  substitution errors. These options are now deprecated and will emit warnings
  if used.
  Only lossless JBIG2 compression is supported.
- **PDF/A output behavior change**: If neither Ghostscript nor verapdf is
  installed, `--output-type auto` (the new default) will produce a standard PDF
  instead of PDF/A. This is a change from previous versions where Ghostscript
  was required and PDF/A was always produced. This configuration is rare but
  users should be aware of the change.

**New features**

- **pypdfium2 rasterizer**: Added optional pypdfium2-based PDF rasterization
  plugin as an alternative to Ghostscript for page rendering. Use
  `--rasterizer pypdfium` to enable (requires `pip install pypdfium2`). The
  default `--rasterizer auto` prefers pypdfium when available and falls back to
  Ghostscript.
- **Pluggable OCR engines**: New `--ocr-engine` option allows selecting OCR
  engines:
  - `auto` (default): Uses Tesseract
  - `tesseract`: Explicit Tesseract selection
  - `none`: Skip OCR entirely for PDF processing-only workflows

  This prepares the foundation for future third-party OCR engine plugins.
- **Smart PDF/A conversion**: New `--output-type auto` (now the default)
  produces best-effort PDF/A output without requiring Ghostscript when the
  verapdf validator is available. Falls back to traditional Ghostscript
  conversion when needed.
- **verapdf integration**: Added optional verapdf validation for fast PDF/A
  conversion. When available, OCRmyPDF attempts speculative PDF/A conversion
  using pikepdf, validates with verapdf, and skips Ghostscript if validation
  passes.
- **Optional Ghostscript**: As a consequence of the changes above, Ghostscript
  is no longer a required dependency. It is optional.
- **fpdf2 text renderer**: Replaced legacy hOCR text renderer with new
  fpdf2-based implementation, providing better multilingual support and more
  accurate text positioning.
- **Improved Occulta glyphless font**: The new Occulta font provides better
  handling of zero-width markers and double-width CJK characters for accurate
  text layer positioning.
- **Expanded multilingual font support**: Added FontProvider infrastructure with
  language-aware font selection for Devanagari (Hindi, Sanskrit, Marathi,
  Nepali), CJK (Chinese, Japanese, Korean), Arabic script, and many other
  scripts. System font discovery reduces package size.
- **Simplified mode selection**: New `--mode` (`-m`) argument consolidates
  processing options:
  - `default`: Error if text is found (standard behavior)
  - `force`: Rasterize all content and run OCR (replaces `--force-ocr`)
  - `skip`: Skip pages with existing text (replaces `--skip-text`)
  - `redo`: Re-OCR pages, stripping old text layer (replaces `--redo-ocr`)

  Legacy flags remain as silent aliases for backward compatibility.

**API improvements**

- Centralized validation logic in the `OcrOptions` Pydantic model
- Removed scattered option mutation throughout the codebase
- Better type safety for plugin development
- Simplified plugin option handling
- New `OcrElement`, `OcrClass`, and `BoundingBox` exports for OCR engine plugin
  developers
- Extended `OcrEngine` ABC with `generate_ocr()` method for direct OCR tree
  output, removing the need to translate a modern engine's output to hOCR or
  to write directly to PDF.

**Bug fixes**

- Fixed double-compression of already-deflated JPEGs.
- Fixed tesseract_cache plugin to properly handle cache misses.
- Fixed handling of PDF page boxes (ArtBox, BleedBox) which were not being
  processed correctly.
- Added thread safety lock to pypdfium plugin for concurrent operations.
- Improved pdfminer.six compatibility with explicit word spacing.

**Documentation**

- Updated cookbook to replace deprecated `--tesseract-timeout 0` with
  `--ocr-engine none`.
- Added comprehensive plugin documentation for new OCR engine framework.

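
The `--mode` consolidation listed under **New features** can be sketched with the standard library's `argparse`. This is an illustration of the alias pattern only, not OCRmyPDF's actual parser: the `LEGACY_FLAGS` table and `build_parser` function are hypothetical names, and the flag-to-mode mapping is taken from the "replaces" notes in the list above.

```python
import argparse

# Mapping taken from the release note; the table name itself is hypothetical.
LEGACY_FLAGS = {
    "--force-ocr": "force",
    "--skip-text": "skip",
    "--redo-ocr": "redo",
}


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser where legacy flags write into the same `mode` slot."""
    parser = argparse.ArgumentParser(prog="mode-sketch")
    parser.add_argument(
        "-m",
        "--mode",
        choices=["default", "force", "skip", "redo"],
        default="default",
        help="how to treat pages that already contain text",
    )
    for flag, mode in LEGACY_FLAGS.items():
        # store_const with a shared dest makes the old flag an alias for
        # `--mode <mode>`; SUPPRESS keeps it from supplying its own default.
        parser.add_argument(
            flag,
            dest="mode",
            action="store_const",
            const=mode,
            default=argparse.SUPPRESS,
        )
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args(["--force-ocr"]).mode)
```

In this sketch, if both `--mode` and a legacy flag are given, the one that appears later on the command line wins. How OCRmyPDF itself resolves such a conflict is not stated in these notes.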

**Dependency changes**

- Requires: one of `pypdfium2` or `ghostscript` for PDF rasterization (PDF to
  image)
  - Preferred: both
- Requires: one of `verapdf` or `ghostscript` for PDF/A generation
  - Preferred: both
- Recommended: `pypdfium2` for PDF rasterization (new dependency)
- Recommended: `ghostscript` (used to be Required)
- Recommended: Noto fonts for improved OCR text positioning
- Optional: `verapdf` for fast PDF/A validation (new dependency)
- Requires: `fpdf2` for text layer rendering (new dependency)
- Recommended: replace `typer` with `cyclopts` in misc scripts (new dependency)
  - See docs/maintainers.md for details.

**Migration guide for plugin developers**

- Update imports: `from ocrmypdf._options import OcrOptions`
- Update type hints: `def check_options(options: OcrOptions)` instead of
  `options: Namespace`
- Attribute access remains unchanged: `options.languages`,
  `options.output_type`, etc.
- Remove any in-place option modifications - compute values at point of use
  instead
- Most existing plugins will continue working without changes due to duck-typing
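
The duck-typing and "compute at point of use" advice in the migration guide can be illustrated without importing OCRmyPDF at all. In the sketch below, `StandInOptions` is a hypothetical stand-in for `OcrOptions` (the real class is a Pydantic model with many more fields), and `tesseract_lang_arg` is an invented helper. The frozen dataclass is an assumption made for illustration, so that an accidental in-place modification fails loudly.

```python
from __future__ import annotations

from argparse import Namespace
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class StandInOptions:
    """Hypothetical stand-in for OcrOptions; NOT the real class."""

    languages: tuple[str, ...] = ("eng",)
    output_type: str = "auto"


def tesseract_lang_arg(options) -> str:
    """Derive a value where it is needed instead of storing it on `options`.

    Only attribute access is used, so an old-style Namespace and the
    stand-in options object both work (duck typing).
    """
    return "+".join(options.languages)


def try_mutation(options) -> bool:
    """Return True if an in-place modification was accepted."""
    try:
        options.output_type = "pdf"
    except FrozenInstanceError:
        return False
    return True


if __name__ == "__main__":
    old_style = Namespace(languages=["eng", "deu"], output_type="pdfa")
    new_style = StandInOptions(languages=("eng", "deu"))
    print(tesseract_lang_arg(old_style), tesseract_lang_arg(new_style))
    print(try_mutation(old_style), try_mutation(new_style))
```

A plugin written in the style of `tesseract_lang_arg` needs only the type-hint change described above, while one written in the style of `try_mutation` is the kind the guide suggests reworking.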