% SPDX-FileCopyrightText: 2022 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0

# v17

## v17.5.0

- Added support for the ``end`` alias in ``--pages``, denoting the last page
  of the document. For example, ``--pages 3-end`` OCRs from page 3 through
  the final page. {issue}`1615`
- Added ``--ghostscript-jpeg-quality`` and ``--ghostscript-jpeg-maxdpi``
  advanced options for tuning Ghostscript's PDF/A output. The optimizer's
  ``--jpeg-quality`` remains the recommended file-size control.
- Fixed pypdfium2 rasterizer clipping content when the CropBox was smaller
  than the MediaBox (e.g. JSTOR or cropped PDFs). {issue}`1685`
- Fixed Form XObject cycle detection in the optimizer's image xref scan.
  Self-referential or DAG-shaped Form graphs (notably from PowerPoint
  exports) previously produced floods of recursion warnings and could hang
  for minutes. {issue}`1321`
- Tesseract config errors are now surfaced as ``TesseractConfigError`` with
  actionable guidance, instead of crashing later with a confusing
  ``FileNotFoundError`` on the missing hOCR output. {issue}`1687`
- Refreshed the Chinese README translation. Thanks @cislunarspace.
- Internal refactoring of the ``_exec`` and ``subprocess`` modules to
  separate probing from execution.
- CI dependency updates.

## v17.4.2

- Fixed Python API unconditionally overriding ``PIL.Image.MAX_IMAGE_PIXELS``
  when the caller did not explicitly set ``max_image_mpixels``. Host
  applications (e.g. Paperless-NGX) that configure the PIL limit before
  invoking ``ocrmypdf.ocr()`` now have their setting respected. The CLI
  default of 250 megapixels is unchanged. {issue}`1665`
- Updated uv.lock to avoid pinning a vulnerable version of Pillow. {issue}`1666`

## v17.4.1

- Fixed RTL text extraction order in the fpdf2 renderer. Arabic lam-alef
  ligatures and other multi-character CMap entries were garbled by the bidi
  algorithm during text extraction. {issue}`1655`
- Fixed ``work_folder`` not being set in ``PdfContext`` options when using
  the Python API. Thanks @bluebox-steven. {issue}`1613`
- Updated Ghostscript JPEG corruption warning to include the detected version
  number, confirming the bug persists in Ghostscript 10.7.0.
- Internal refactoring.
- CI dependency updates.

## v17.4.0

- Added ``--no-overwrite`` / ``-n`` option to prevent overwriting output files.
  If the destination file already exists, OCRmyPDF exits with code 5
  (``OutputFileAccessError``). {issue}`1642`
- Fixed text layer stretching in the fpdf2 renderer for widely-spaced words.
  The horizontal scaling (Tz) was incorrectly stretched to fill inter-word gaps
  instead of relying on Td positioning, causing text selection to highlight far
  beyond the actual word boundaries. {issue}`1635`
- Fixed ``optimize=2`` or ``optimize=3`` crash when using the Python API without
  explicitly setting ``jpg_quality`` or ``png_quality``. {issue}`1641`
- Fixed ``verapdf`` availability check crashing with ``NotADirectoryError`` on
  some platforms. {issue}`1638`

## v17.3.0

- Fixed Python API ignoring the ``language`` parameter, always defaulting to
  ``eng``. The API now correctly maps ``language`` to OcrOptions ``languages``
  and splits ``+``-separated codes (e.g. ``eng+deu``) to match CLI behavior.
  {issue}`1640`
- Fixed Python API producing empty OCR output because ``tesseract_timeout``
  defaulted to 0, causing Tesseract to time out immediately. The default is
  now ``None``, falling back to the plugin's 180-second timeout. {issue}`1636`
- Fixed OCR text layer displacement on PDFs with non-zero MediaBox origins
  (e.g. JSTOR or cropped PDFs). The coordinate transformation matrix is now
  always computed, not skipped when rotation is zero. {issue}`1630`
- Restored image overlay support (``--image``) for the hocrtransform tool,
  enabling sandwich PDF output with the fpdf2 renderer. {issue}`1634`
- Docker: updated Alpine base image to 3.23.
- Documentation restructured into per-major-version release notes files.
- Release process improvements.

## v17.2.0

- Fixed incorrect word spacing in poppler-based PDF viewers and tools (Evince,
  pdftotext, and others) where words on the same line appeared separated by
  double newlines. This works around a poppler bug where Tz (horizontal scaling)
  is not carried across BT/ET boundaries. {issue}`1632`
- Fixed OCR text layer being visible instead of invisible due to incorrect fpdf2
  text rendering mode attribute. This caused OCR text to appear when images were
  removed from the PDF. {issue}`1631`
- Fixed OCR text layer misalignment with non-zero mediabox origins, which
  affected cropped PDFs and JSTOR PDFs generated by iText. The ``--redo-ocr``
  mode would shift text vertically on these files. {issue}`1630`
- Fixed Ghostscript rasterization failure with very low DPI values (below 10).
  OCRmyPDF now renders at a minimum of 10 DPI and resizes the output to match
  the originally requested dimensions. {issue}`1612`

## v17.1.0

- Added `--tagged-pdf-mode` to allow skipping the TaggedPDF error message, if desired.
- Fixed an issue where deflated JPEGs (FlateDecode + DCTDecode) were counted as
  lossless images for the purpose of determining whether to compress to JPEG,
  causing file size inflation with some workflows (`--mode force` in particular).

## v17.0.1

- Fixed output file size inflation when using pypdfium as rasterizer and force-ocr
  mode.

## v17.0.0

**Breaking changes**

- **Plugin interface migration**: Plugin hooks now receive `OcrOptions` objects instead of
  `argparse.Namespace` objects. Most plugins will continue working due to duck-typing
  compatibility, but plugin developers should update their type hints from `Namespace`
  to `OcrOptions`.
- Built-in plugins no longer modify options in-place, improving immutability and
  code clarity.
- **Lossy JBIG2 removed**: The `--jbig2-lossy` and `--jbig2-page-group-size` options have been
  removed due to well-documented risks of character substitution errors. These options are now
  deprecated and will emit warnings if used. Only lossless JBIG2 compression is supported.
- **PDF/A output behavior change**: If neither Ghostscript nor verapdf is installed,
  `--output-type auto` (the new default) will produce a standard PDF instead of PDF/A. This is
  a change from previous versions where Ghostscript was required and PDF/A was always produced.
  This configuration is rare but users should be aware of the change.

**New features**

- **pypdfium2 rasterizer**: Added optional pypdfium2-based PDF rasterization plugin as an
  alternative to Ghostscript for page rendering. Use `--rasterizer pypdfium` to enable
  (requires `pip install pypdfium2`). The default `--rasterizer auto` prefers pypdfium when
  available and falls back to Ghostscript.
- **Pluggable OCR engines**: New `--ocr-engine` option allows selecting OCR engines:
  - `auto` (default): Uses Tesseract
  - `tesseract`: Explicit Tesseract selection
  - `none`: Skip OCR entirely for PDF processing-only workflows

  This prepares the foundation for future third-party OCR engine plugins.
- **Smart PDF/A conversion**: New `--output-type auto` (now the default) produces best-effort
  PDF/A output without requiring Ghostscript when the verapdf validator is available. Falls back
  to traditional Ghostscript conversion when needed.
- **verapdf integration**: Added optional verapdf validation for fast PDF/A conversion. When
  available, OCRmyPDF attempts speculative PDF/A conversion using pikepdf, validates with verapdf,
  and skips Ghostscript if validation passes.
- **Optional Ghostscript**: As a consequence of the changes above, Ghostscript is no longer a required dependency. It is optional.
- **fpdf2 text renderer**: Replaced legacy hOCR text renderer with new fpdf2-based implementation,
  providing better multilingual support and more accurate text positioning.
- **Improved Occulta glyphless font**: The new Occulta font provides better handling of
  zero-width markers and double-width CJK characters for accurate text layer positioning.
- **Expanded multilingual font support**: Added FontProvider infrastructure with language-aware
  font selection for Devanagari (Hindi, Sanskrit, Marathi, Nepali), CJK (Chinese, Japanese,
  Korean), Arabic script, and many other scripts. System font discovery reduces package size.
- **Simplified mode selection**: New `--mode` (`-m`) argument consolidates processing options:
  - `default`: Error if text is found (standard behavior)
  - `force`: Rasterize all content and run OCR (replaces `--force-ocr`)
  - `skip`: Skip pages with existing text (replaces `--skip-text`)
  - `redo`: Re-OCR pages, stripping old text layer (replaces `--redo-ocr`)

  Legacy flags remain as silent aliases for backward compatibility.

**API improvements**

- Centralized validation logic in the `OcrOptions` Pydantic model
- Removed scattered option mutation throughout the codebase
- Better type safety for plugin development
- Simplified plugin option handling
- New `OcrElement`, `OcrClass`, and `BoundingBox` exports for OCR engine plugin developers
- Extended `OcrEngine` ABC with `generate_ocr()` method for direct OCR tree output, eliding the need to translate a modern engine's output to hOCR or directly write to PDF.

**Bug fixes**

- Fixed double-compression of already-deflated JPEGs.
- Fixed tesseract_cache plugin to properly handle cache misses.
- Fixed handling of PDF page boxes (ArtBox, BleedBox) which were not being processed correctly.
- Added thread safety lock to pypdfium plugin for concurrent operations.
- Improved pdfminer.six compatibility with explicit word spacing.

**Documentation**

- Updated cookbook to replace deprecated `--tesseract-timeout 0` with `--ocr-engine none`.
- Added comprehensive plugin documentation for new OCR engine framework.

**Dependency changes**

- Requires: one of `pypdfium2` or `ghostscript` for PDF rasterization (PDF to image)
  - Preferred: both
- Requires: one of `verapdf` or `ghostscript` for PDF/A generation
  - Preferred: both
- Recommended: `pypdfium2` for PDF rasterization (new dependency)
- Recommended: `ghostscript` (used to be Required)
- Recommended: Noto fonts for improved OCR text positioning
- Optional: `verapdf` for fast PDF/A validation (new dependency)
- Requires: `fpdf2` for text layer rendering (new dependency)
- Recommended: replace `typer` with `cyclopts` in misc scripts (new dependency)
- See docs/maintainers.md for details.

**Migration guide for plugin developers**

- Update imports: `from ocrmypdf._options import OcrOptions`
- Update type hints: `def check_options(options: OcrOptions)` instead of `options: Namespace`
- Attribute access remains unchanged: `options.languages`, `options.output_type`, etc.
- Remove any in-place option modifications - compute values at point of use instead
- Most existing plugins will continue working without changes due to duck-typing