v3

v3.2.1

Changes

  • Fixed #47 “convert() got an unexpected keyword argument ‘dpi’” by upgrading to img2pdf 0.2

  • Tweaked the Dockerfiles

v3.2

New features

  • Lossless reconstruction: when possible, OCRmyPDF will inject text layers without otherwise manipulating the content and layout of a PDF page. For example, a PDF containing a mix of vector and raster content would see the vector content preserved. Images may still be transcoded during PDF/A conversion. (--deskew and --clean-final disable this mode, necessarily.)

  • New argument --tesseract-pagesegmode allows you to pass page segmentation arguments to Tesseract OCR. This helps for two column text and other situations that confuse Tesseract.

  • Added a new “polyglot” version of the Docker image that builds Tesseract with all language packs installed, for the polyglots among us. It is much larger.

Changes

  • JPEG transcoding quality is now 95 instead of the default 75. This produces larger files with less degradation.

v3.1.1

Changes

  • Fixed bug that caused incorrect page size and DPI calculations on documents with mixed page sizes

v3.1

Changes

  • Default output format is now PDF/A-2b instead of PDF/A-1b

  • Python 3.5 and macOS El Capitan are now supported platforms - no changes were needed to implement support

  • Improved some error messages related to missing input files

  • Fixed #20: uppercase .PDF extension not accepted

  • Fixed an issue where OCRmyPDF failed to detect that certain pages contained previously OCR’ed text, such as OCR text produced by Tesseract 3.04

  • Inserts /Creator tag into PDFs so that errors can be traced back to this project

  • Added new option --pdf-renderer=auto, to let OCRmyPDF pick the best PDF renderer. Currently it always chooses the ‘hocrtransform’ renderer but that behavior may change.

  • Set up Travis CI automatic integration testing

v3.0

New features

  • Easier installation with a Docker container or Python’s pip package manager

  • Eliminated many external dependencies, so it’s easier to set up

  • Now installs ocrmypdf to /usr/local/bin or equivalent for system-wide access and easier typing

  • Improved command line syntax and usage help (--help)

  • Tesseract 3.03+ PDF page rendering can be used instead for better positioning of recognized text (--pdf-renderer tesseract)

  • PDF metadata (title, author, keywords) is now transferred to the output PDF

  • PDF metadata can also be set from the command line (--title, etc.)

  • Automatically repairs malformed input PDFs if possible

  • Added test cases to confirm everything is working

  • Added option to skip extremely large pages that take too long to OCR and are often not OCRable (e.g. large scanned maps or diagrams); other pages are still processed (--skip-big)

  • Added option to kill Tesseract OCR process if it seems to be taking too long on a page, while still processing other pages (--tesseract-timeout)

  • Less common colorspaces (CMYK, palette) are now supported by conversion to RGB

  • Multiple images on the same PDF page are now supported
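The per-page timeout described above (--tesseract-timeout) can be illustrated with a minimal Python sketch. This is not OCRmyPDF's actual implementation: the helper names run_with_timeout and process_pages are hypothetical, and a short Python child process stands in for the Tesseract call. It only shows the general pattern of killing one slow job while the remaining pages are still processed.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd and return its stdout, or None if it exceeded the timeout.

    Hypothetical helper: subprocess.run() kills the child process when the
    timeout expires and then raises subprocess.TimeoutExpired.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout


def process_pages(page_commands, timeout_seconds):
    """Run one command per page; a timed-out page does not stop the others."""
    return {
        page: run_with_timeout(cmd, timeout_seconds)
        for page, cmd in page_commands.items()
    }


# Stand-ins for per-page OCR jobs: page 1 finishes quickly, page 2 hangs.
page_commands = {
    1: [sys.executable, "-c", "print('ok')"],
    2: [sys.executable, "-c", "import time; time.sleep(30)"],
}
results = process_pages(page_commands, timeout_seconds=3)
```

In this sketch a page that times out is recorded as None, so a caller could emit that page without a text layer and carry on with the rest.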

Changes

  • New, robust rewrite in Python 3.4+ with ruffus pipelines

  • Now uses Ghostscript 9.14’s improved color conversion model to preserve PDF colors

  • OCR text is now rendered in the PDF as invisible text. Previous versions of OCRmyPDF incorrectly rendered visible text with an image on top.

  • All “tasks” in the pipeline can be executed in parallel on any available CPUs, increasing performance

  • The -o DPI argument has been phased out, in favor of --oversample DPI, in case we need -o OUTPUTFILE in the future

  • Removed several dependencies, so it’s easier to install. We no longer use:

  • Some new external dependencies are required or optional, compared to v2.x:

    • Ghostscript 9.14+

    • qpdf 5.0.0+

    • Unpaper 6.1 (optional)

    • some automatically managed Python packages

Release candidates

  • rc9:

    • Fix #118: report error if ghostscript iccprofiles are missing

    • fixed another issue related to #111: PDF rasterized to palette file

    • add support for image files with a palette

    • don’t try to validate PDF file after an exception occurs

  • rc8:

    • Fix #111: exception thrown if PDF is missing DocumentInfo dictionary

  • rc7:

    • fix error when installing direct from pip, “no such file ‘requirements.txt’”

  • rc6:

    • dropped libxml2 (Python lxml) since Python 3’s internal XML parser is sufficient

    • set up Docker container

    • fix Unicode errors if recognized text contains Unicode characters and system locale is not UTF-8

  • rc5:

    • dropped Java and JHOVE in favour of qpdf

    • improved command line error output

    • additional tests and bug fixes

    • tested on Ubuntu 14.04 LTS

  • rc4:

    • dropped MuPDF in favour of qpdf

    • fixed some installer issues and errors in installation instructions

    • improve performance: run Ghostscript with multithreaded rendering

    • improve performance: use multiple cores by default

    • bug fix: checking for wrong exception on process timeout

  • rc3: skipping version number intentionally to avoid confusion with Tesseract

  • rc2: first release for public testing to test-PyPI, GitHub

  • rc1: testing release process

Compatibility notes

  • ./OCRmyPDF.sh script is still available for now

  • Stacking the verbosity option like -vvv is no longer supported

  • The configuration file config.sh has been removed. Instead, you can feed a file to the arguments for common settings:

ocrmypdf input.pdf output.pdf @settings.txt

where settings.txt contains one argument per line, for example:

-l
deu
--author
A. Merkel
--pdf-renderer
tesseract
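The @settings.txt syntax matches the behaviour of Python's standard argparse module when fromfile_prefix_chars is set. That OCRmyPDF relies on this mechanism is an assumption here, not something these notes state. The sketch below uses a hypothetical parser with a handful of the options named above, to show why each argument, including a value containing a space such as "A. Merkel", goes on its own line.

```python
import argparse
import os
import tempfile


def build_parser():
    """Hypothetical parser with a few of the options named in these notes."""
    # fromfile_prefix_chars="@" makes argparse replace "@file" with the
    # arguments read from that file, one argument per line.
    parser = argparse.ArgumentParser(prog="demo", fromfile_prefix_chars="@")
    parser.add_argument("input_file")
    parser.add_argument("output_file")
    parser.add_argument("-l", "--language", default="eng")
    parser.add_argument("--author")
    parser.add_argument("--pdf-renderer", default="auto")
    return parser


def parse_with_settings(settings_lines):
    """Write the lines to a temporary settings file and parse via @file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "settings.txt")
        with open(path, "w") as f:
            f.write("\n".join(settings_lines) + "\n")
        return build_parser().parse_args(["input.pdf", "output.pdf", "@" + path])


args = parse_with_settings(
    ["-l", "deu", "--author", "A. Merkel", "--pdf-renderer", "tesseract"]
)
print(args.language, args.author, args.pdf_renderer)
```

Because each line is treated as exactly one argument, no shell-style quoting is needed inside the settings file.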

Fixes

  • Handling of filenames containing spaces: fixed

Notes and known issues

  • Some dependencies may work with versions older than those tested. If a version requirement is “in the way”, try overriding it to see whether the older version works.

  • --pdf-renderer tesseract will output files with an incorrect page size in Tesseract 3.03, due to a bug in Tesseract.

  • PDF files containing “inline images” are not supported and won’t be for the 3.0 release. Scanned documents almost never contain inline images.