v6
v6.2.5
Disable a failing test due to Tesseract 4.0rc1 behavior change. Previously, Tesseract would exit with an error message if its configuration was invalid, and OCRmyPDF would intercept this message. Now Tesseract issues a warning, which OCRmyPDF v6.2.5 may relay or ignore. (In v7.x, OCRmyPDF will respond to the warning.)
This release branch no longer supports using the optional PyMuPDF installation, since it was removed in v7.x.
This release branch no longer supports macOS. macOS users should upgrade to v7.x.
v6.2.4
Backport Ghostscript 9.25 compatibility fixes, which remove support for setting Unicode metadata
Backport blacklisting Ghostscript 9.24
Older versions of Ghostscript are still supported
v6.2.3
Fixed compatibility with img2pdf >= 0.3.0 by rejecting input images that have an alpha channel
This version will be included in Ubuntu 18.10
v6.2.2
Backport compatibility fixes for Python 3.7 and ruffus 2.7.0 from v7.0.0
Backport fix to ignore masks when deciding what colors are on a page
Backport some minor improvements from v7.0.0: better argument validation and warnings about the Tesseract 4.0.0 --user-words regression
v6.2.1
Fixed recent versions of Tesseract (after 4.0.0-beta1) not being detected as supporting the sandwich renderer (#271).
v6.2.0
Docker: The Docker image ocrmypdf-tess4 has been removed. The main Docker images, ocrmypdf and ocrmypdf-polyglot, now use Ubuntu 18.04 as a base image, and as such Tesseract 4.0.0-beta1 is now the Tesseract version they use. There is no Docker image based on Tesseract 3.05 anymore.
Creation of PDF/A-3 is now supported. However, there is no ability to attach files to PDF/A-3.
Lists more reasons why the file size might grow.
Fixed #262, --remove-background error on PDFs that contained colormapped (paletted) images.
Fixed another XMP metadata validation issue, in cases where the input file’s creation date has no timezone and the creation date is not overridden.
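To illustrate the kind of timestamp involved in the XMP fix above, here is a minimal sketch of formatting a date with full seconds and an explicit UTC offset. It is not OCRmyPDF's code: the function name xmp_date is hypothetical, and treating a timezone-less timestamp as UTC is an assumption made only for this example.

```python
from datetime import datetime, timedelta, timezone


def xmp_date(dt: datetime) -> str:
    """Format a datetime as an ISO 8601 string with whole seconds and a UTC offset."""
    if dt.tzinfo is None:
        # Assumption for this sketch: a naive timestamp is treated as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    # Drop microseconds so the seconds field is always written in full.
    return dt.replace(microsecond=0).isoformat()


# A creation date with no timezone
print(xmp_date(datetime(2018, 3, 20, 14, 5, 9)))
# prints 2018-03-20T14:05:09+00:00

# A creation date that already carries an offset
print(xmp_date(datetime(2018, 3, 20, 14, 5, 9, tzinfo=timezone(timedelta(hours=-7)))))
# prints 2018-03-20T14:05:09-07:00
```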
v6.1.5
Fixed #253, a possible division by zero when using the hocr renderer.
Fixed incorrectly formatted <xmp:ModifyDate> field inside XMP metadata for PDF/As. veraPDF flags this as a PDF/A validation failure. The error caused the timezone and the final digit of the seconds of the modification time to be omitted, so at worst the modification time stamp is rounded to the nearest 10 seconds.
v6.1.4
Fixed #248, where the --clean argument may remove OCR from the left column of text on certain documents. We now set --layout none to suppress this.
The test cache was updated to reflect the change above.
Change test suite to accommodate Ghostscript 9.23’s new ability to insert JPEGs into PDFs without transcoding.
XMP metadata in PDFs is now examined using defusedxml for safety.
If an external process exits with a signal when asked to report its version, we now print the system error message instead of suppressing it. This occurred when the required executable was found but was missing a shared library.
qpdf 7.0.0 or newer is now required as the test suite can no longer pass without it.
Notes
An apparent regression in Ghostscript 9.23 will cause some ocrmypdf output files to become invalid in rare cases; the workaround for the moment is to set --force-ocr.
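The v6.1.4 change about external processes that exit with a signal can be pictured with the following sketch. It is not OCRmyPDF's implementation; describe_exit is a hypothetical helper, and the sketch relies only on the documented behavior of Python's subprocess module on POSIX, where a negative return code -N means the child was terminated by signal N.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a human-readable description."""
    if returncode < 0:
        # On POSIX, subprocess reports termination by signal N as -N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"terminated by {name}"
    return f"exited with status {returncode}"


# Run a child process that exits normally with status 3.
proc = subprocess.run([sys.executable, "-c", "import sys; sys.exit(3)"])
print(describe_exit(proc.returncode))
# prints: exited with status 3
```

A program that cannot load a shared library may fail in different ways depending on the platform, so a real tool would report whatever the operating system provides rather than only the signal name.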
v6.1.3
Fixed #247, /CreationDate metadata not copied from input to output.
A warning is now issued when Python 3.5 is used on files with a large page count, as this case is known to regress to single core performance. The cause of this problem is unknown.
v6.1.2
Upgrade to PyMuPDF v1.12.5 which includes a more complete fix to #239.
Add defusedxml dependency.
v6.1.1
Fixed text being reported as found on all pages if PyMuPDF is not installed.
v6.1.0
PyMuPDF is now an optional but recommended dependency, to alleviate installation difficulties on platforms that have less access to PyMuPDF than the author anticipated. (For version 6.x only) install OCRmyPDF with pip install ocrmypdf[fitz] to use it to its full potential.
Fixed FileExistsError that could occur if OCR timed out while it was generating the output file. (#218)
Fixed table of contents/bookmarks all being redirected to page 1 when generating a PDF/A (with PyMuPDF). (Without PyMuPDF the table of contents is removed in PDF/A mode.)
Fixed “RuntimeError: invalid key in dict” when table of contents/bookmarks titles contained the character ). (#239)
Added a new argument --skip-repair to skip the initial PDF repair step if the PDF is already well-formed (because another program repaired it).
v6.0.0
The software license has been changed to GPLv3 [it has since changed again]. Test resource files and some individual sources may have other licenses.
OCRmyPDF now depends on PyMuPDF. Including PyMuPDF is the primary reason for the change to GPLv3.
Other backward incompatible changes
The OCRMYPDF_TESSERACT, OCRMYPDF_QPDF, OCRMYPDF_GS and OCRMYPDF_UNPAPER environment variables are no longer used. Change PATH if you need to override the external programs OCRmyPDF uses.
The ocrmypdf package has been moved to src/ocrmypdf to avoid issues with accidental import.
The function ocrmypdf.exec.get_program was removed.
The deprecated module ocrmypdf.pageinfo was removed.
The --pdf-renderer tess4 alias for sandwich was removed.
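Since overriding external programs now goes through PATH, the sketch below shows how directory order in PATH decides which executable is found. It is an illustration only, not OCRmyPDF's lookup code: resolve_program is a hypothetical helper and /opt/tesseract-4.0/bin is a made-up directory.

```python
import os
import shutil
from typing import Optional


def resolve_program(name: str, preferred_dir: Optional[str] = None) -> Optional[str]:
    """Return the full path of `name`, searching preferred_dir before the rest of PATH."""
    search_path = os.environ.get("PATH", "")
    if preferred_dir:
        # Directories earlier in PATH win, so prepend the preferred one.
        search_path = preferred_dir + os.pathsep + search_path
    return shutil.which(name, path=search_path)


# Hypothetical directory holding an alternative Tesseract build.
# Prints the resolved path, or None if no tesseract executable is found.
print(resolve_program("tesseract", "/opt/tesseract-4.0/bin"))
```

The same effect applies to any program launched with a modified PATH in its environment: the first matching executable in the search order is the one used.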
Fixed an issue where OCRmyPDF failed to detect existing text on pages, depending on how the text and fonts were encoded within the PDF. (#233, #232)
Fixed an issue that caused dramatic inflation of file sizes when --skip-text --output-type pdf was used. OCRmyPDF now removes duplicate resources such as fonts, images and other objects that it generates. (#237)
Improved performance of the initial page splitting step. Originally this step was not believed to be expensive and ran in a single process. Large file testing revealed it to be a bottleneck, so it is now parallelized. On a 700 page file with a quad core machine, this change saves about 2 minutes. (#234)
The test suite now includes a cache that can be used to speed up test runs across platforms. This also does not require computing checksums, so it’s faster. (#217)