% SPDX-FileCopyrightText: 2022 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0

# v5

## v5.7.0

- Fixed an issue that caused poor CPU utilization on machines with more than 4 cores when running Tesseract 4. (Related to {issue}`217`.)
- The 'hocr' renderer has been improved. The 'sandwich' and 'tesseract' renderers are still better for most use cases, but 'hocr' may be useful for people who work with the PDF.js renderer in English/ASCII languages. ({issue}`225`)
  - It now formats text in a manner that makes it easier for certain PDF viewers to select, copy and paste text. This should help macOS Preview and PDF.js in particular.
  - The appearance of selected text and the behavior of selecting text are improved.
  - The PDF content stream now uses relative moves, making it more compact and easier for viewers to determine when two words are on the same line.
  - It can now deal with text on a skewed baseline.
  - Thanks to @cforcey for the pull request, @jbreiden for many helpful suggestions, @ctbarbour for another round of improvements, and @acaloiaro for an independent review.

## v5.6.3

- Suppress two debug messages that were too verbose

## v5.6.2

- Development branch accidentally tagged as release. Do not use.

## v5.6.1

- Fixed {issue}`219`: change how the final output file is created to avoid triggering permission errors when the output is a special file such as `/dev/null`
- Fixed test suite failures due to a qpdf 8.0.0 regression and Python 3.5's handling of symlinks
- The "encrypted PDF" error message was different depending on the type of PDF encryption. Now a single clear message appears for all types of PDF encryption.
- ocrmypdf is now in Homebrew. Homebrew users are advised to use the version of ocrmypdf in the official homebrew-core formulas rather than the private tap.
- Some linting

## v5.6.0

- Fixed {issue}`216`: preserve "text as curves" PDFs without rasterizing the file
- Related to the above, messages about rasterizing are more consistent
- For consistency, minor releases will now get the trailing .0 they always should have had.

## v5.5

- Add new argument `--max-image-mpixels`. Pillow 5.0 now raises an exception when images may be decompression bombs. This argument can be used to override the limit Pillow sets.
- Fixed the output page being cropped when using the sandwich renderer and OCR is skipped on a rotated and image-processed page
- A warning is now issued when old versions of Ghostscript are used in cases known to cause issues with non-Latin characters
- Fixed a few parameter validation checks for `--output-type pdfa-1` and `pdfa-2`

## v5.4.4

- Fixed {issue}`181`: fix final merge failure for PDFs with more pages than the system file handle limit (`ulimit -n`)
- Fixed {issue}`200`: an uncommon syntax for formatting decimal numbers in a PDF would cause qpdf to issue a warning, which ocrmypdf treated as an error. Now the warning is relayed.
- Fixed an issue where intermediate PDFs would be created at version 1.3 instead of the version of the original file. It's possible but unlikely this had side effects.
- A warning is now issued when older versions of qpdf are used since issues like {issue}`200` cause qpdf to infinite-loop
- Address issue {issue}`140`: if Tesseract outputs invalid UTF-8, escape it and print its message instead of aborting with a Unicode error
- Added a previously unlisted setup requirement, pytest-runner
- Update documentation: fix an error in the example script for Synology with Docker images, improved security guidance, advised `pip install --user`

## v5.4.3

- If a subprocess fails to report its version when queried, exit cleanly with an error instead of throwing an exception
- Added test to confirm that the system locale is Unicode-aware and fail early if it's not
- Clarified some copyright information
- Updated pinned requirements.txt so the homebrew formula captures more recent versions

## v5.4.2

- Fixed a regression from v5.4.1 that caused sidecar files to be created as empty files

## v5.4.1

- Add workaround for Tesseract v4.00alpha crash when trying to obtain orientation and the latest language packs are installed

## v5.4

- Change wording of a deprecation warning to improve clarity
- Added option to generate PDF/A-1b output if desired (`--output-type pdfa-1`); default remains PDF/A-2b generation
- Update documentation

## v5.3.3

- Fixed missing error message that should occur when trying to force `--pdf-renderer sandwich` on old versions of Tesseract
- Update copyright information in test files
- Set system `LANG` to UTF-8 in Dockerfiles to avoid UTF-8 encoding errors

## v5.3.2

- Fixed a broken test case related to language packs

## v5.3.1

- Fixed wrong return code given for missing Tesseract language packs
- Fixed "brew audit" crashing on Travis when trying to auto-brew

## v5.3

- Added `--user-words` and `--user-patterns` arguments, which are forwarded to Tesseract OCR as words and regular expressions, respectively, to guide OCR. Supplying a list of subject-domain words should assist Tesseract with resolving words.
  ({issue}`165`)
- Using a non Latin-1 language with the "hocr" renderer now warns about possible OCR quality problems and recommends workarounds ({issue}`176`)
- Output file path added to error message when that location is not writable ({issue}`175`)
- Otherwise valid PDFs with leading whitespace at the beginning of the file are now accepted

## v5.2

- When using Tesseract 3.05.01 or newer, OCRmyPDF will select the "sandwich" PDF renderer by default, unless another PDF renderer is specified with the `--pdf-renderer` argument. The previous behavior was to select `--pdf-renderer=hocr`.
- The "tesseract" PDF renderer is now deprecated, since it can cause problems with Ghostscript on Tesseract 3.05.00
- The "tess4" PDF renderer has been renamed to "sandwich". "tess4" is now a deprecated alias for "sandwich".

## v5.1

- Files with pages larger than 200" (5080 mm) in either dimension are now supported with `--output-type=pdf` with the page size preserved (in the PDF specification this feature is called UserUnit scaling). Due to Ghostscript limitations this is not available in conjunction with PDF/A output.

## v5.0.1

- Fixed {issue}`169`, exception due to failure to create sidecar text files on some versions of Tesseract 3.04, including the jbarlow83/ocrmypdf Docker image

## v5.0

- Backward incompatible changes

  > - Support for Python 3.4 dropped. Python 3.5 is now required.
  > - Support for Tesseract 3.02 and 3.03 dropped. Tesseract 3.04 or
  >   newer is required. Tesseract 4.00 (alpha) is supported.
  > - The OCRmyPDF.sh script was removed.

- Add a new feature, `--sidecar`, which allows creating "sidecar" text files which contain the OCR results in plain text. This OCR text is more reliable than text extracted from PDFs. Closes {issue}`126`.
- New feature: `--pdfa-image-compression`, which allows overriding Ghostscript's lossy-or-lossless image encoding heuristic and making all images JPEG encoded or lossless encoded as desired. Fixes {issue}`163`.
- Fixed {issue}`143`, added `--quiet` to suppress "INFO" messages
- Fixed {issue}`164`, a typo
- Removed the command line parameters `-n` and `--just-print` since they have not worked for some time (reported as Ubuntu bug [#1687308](https://bugs.launchpad.net/ubuntu/+source/ocrmypdf/+bug/1687308))
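
As a rough illustration of the v5.0 command line options listed above, the following Python sketch assembles an `ocrmypdf` argument list using `--sidecar`, `--quiet` and `--pdfa-image-compression`. The helper name `build_ocrmypdf_args` and the file names are hypothetical and not part of OCRmyPDF. The values `jpeg` and `lossless` for `--pdfa-image-compression` are assumed from the description above; check `ocrmypdf --help` for the exact values your version accepts.

```python
import shlex


def build_ocrmypdf_args(input_pdf, output_pdf, sidecar=None, quiet=False,
                        pdfa_image_compression=None):
    """Assemble an argv list for ocrmypdf.

    Hypothetical helper for illustration only; it is not part of OCRmyPDF.
    """
    args = ["ocrmypdf"]
    if sidecar is not None:
        # --sidecar writes the OCR results to a plain text file
        args += ["--sidecar", sidecar]
    if quiet:
        # --quiet suppresses "INFO" messages
        args.append("--quiet")
    if pdfa_image_compression is not None:
        # Assumed value names, inferred from "JPEG encoded or lossless encoded"
        if pdfa_image_compression not in ("jpeg", "lossless"):
            raise ValueError("expected 'jpeg' or 'lossless'")
        args += ["--pdfa-image-compression", pdfa_image_compression]
    args += [input_pdf, output_pdf]
    return args


# Example file names are placeholders.
args = build_ocrmypdf_args(
    "scan.pdf",
    "scan_ocr.pdf",
    sidecar="scan.txt",
    quiet=True,
    pdfa_image_compression="lossless",
)

# Print the command in shell-quoted form for inspection instead of running it.
print(shlex.join(args))
```

On a machine with OCRmyPDF installed, the resulting list could be passed to `subprocess.run(args, check=True)`. Building the arguments as a list rather than a single string avoids shell quoting problems with file names that contain spaces.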