PDF optimization

OCRmyPDF includes an image-oriented PDF optimizer. By default, the optimizer runs with safe settings that aim to improve compression with no loss of quality. At higher optimization levels, lossy optimizations may be applied and tuned. Optimization occurs after OCR, and only if OCR succeeded. The optimizer does not perform other possible optimizations such as deduplicating resources, consolidating fonts, or simplifying vector drawings.

Optimization levels range from -O0 through -O3, where 0 disables optimization and 3 enables all optimizations. Level 1, the default, performs only safe and lossless optimizations. (This is similar to GCC’s optimization parameter.) The exact optimizations performed at each level may change over time.

Certain optimizations require optional third-party tools. If these are not installed or cannot be found by OCRmyPDF, the optimizations that depend on them are skipped and the output file may be larger than it otherwise would be.
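As a quick way to see whether the optional tools are reachable, the following minimal sketch looks them up on the PATH. The executable names are assumptions: jbig2enc commonly installs a binary named jbig2, and pngquant a binary named pngquant. OCRmyPDF's own lookup logic may differ from this check.

```python
import shutil

# Assumed executable names: jbig2enc commonly installs a binary called
# "jbig2", and pngquant installs "pngquant". Adjust for your system.
OPTIONAL_TOOLS = {
    "jbig2": "JBIG2 encoding of monochrome images",
    "pngquant": "lossy quantization of paletted images",
}


def find_optional_tools(tools=OPTIONAL_TOOLS):
    """Map each tool name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in tools}


if __name__ == "__main__":
    for name, path in find_optional_tools().items():
        status = path if path else "not found"
        print(f"{name}: {status} ({OPTIONAL_TOOLS[name]})")
```

A tool reported as "not found" here only means it is not on the PATH of the current environment.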

Optimizations that always occur

OCRmyPDF will automatically replace obsolete or inferior compression schemes such as RLE or LZW with superior schemes such as Deflate, and will convert monochrome images to CCITT G4. Since this is harmless, it always occurs and there is no way to disable it. Non-image compressed objects are recompressed as well.
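To illustrate why recompressing with Deflate is considered harmless, the sketch below compresses a synthetic byte buffer with Python's zlib module (Deflate is the algorithm behind PDF's FlateDecode filter) and confirms that the original bytes come back unchanged. The buffer is a made-up stand-in for an image stream, not data taken from a real PDF, and this is not OCRmyPDF's implementation.

```python
import zlib

# Synthetic stand-in for an uncompressed image stream: 200 rows of
# 100 bytes each, mostly white (0xFF) with a short dark run per row.
row = b"\xff" * 90 + b"\x00" * 10
raw = row * 200

# Compress at the highest Deflate level, then decompress again.
compressed = zlib.compress(raw, 9)
restored = zlib.decompress(compressed)

print(len(raw), "bytes raw")
print(len(compressed), "bytes after Deflate")
print("lossless round trip:", restored == raw)
```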

Fast web view

OCRmyPDF automatically optimizes PDFs for “fast web view” in Adobe Acrobat’s parlance, or equivalently, linearizes PDFs so that the resources they reference are presented in the order a viewer needs them for sequential display. This reduces the latency of viewing a PDF both online and from local storage. Linearization slightly increases the file size.

To disable this optimization and all others, use ocrmypdf --optimize 0 ... or the shorthand -O0.

Lossless optimizations

At optimization level -O1 (the default), OCRmyPDF will also attempt lossless image optimization.

If a JBIG2 encoder is available, then monochrome images will be converted to JBIG2, with the potential for huge savings on large black and white images, since JBIG2 is far more efficient than any other monochrome (bi-level) compression. (All known US patents related to JBIG2 have probably expired, but it remains the responsibility of the user to supply a JBIG2 encoder such as jbig2enc. OCRmyPDF does not implement JBIG2 encoding on its own.)

OCRmyPDF currently does not attempt to recompress losslessly compressed objects more aggressively.

Lossy optimizations

At optimization levels -O2 and -O3, OCRmyPDF will attempt some lossy image optimization.

If pngquant is installed, OCRmyPDF will use it to quantize paletted images to reduce their size.
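For readers unfamiliar with quantization, the sketch below shows the basic idea with simple uniform quantization: each color channel is snapped to one of a few evenly spaced values, so the image needs far fewer distinct colors. This is only an illustration of the concept; pngquant uses a more sophisticated algorithm, and the function names and pixel data here are hypothetical.

```python
def quantize_channel(value, levels):
    """Snap an 8-bit value (0-255) to the nearest of `levels` evenly spaced values."""
    step = 255 / (levels - 1)
    return round(round(value / step) * step)


def quantize_pixels(pixels, levels=4):
    """Uniformly quantize a list of (r, g, b) tuples; `levels` must be at least 2."""
    return [tuple(quantize_channel(c, levels) for c in px) for px in pixels]


# Synthetic gradient with 256 distinct colors.
pixels = [(i, 255 - i, i // 2) for i in range(256)]
reduced = quantize_pixels(pixels, levels=4)

print("distinct colors before:", len(set(pixels)))
print("distinct colors after:", len(set(reduced)))
```

Fewer distinct colors means a smaller palette and fewer bits per pixel, at the cost of some color fidelity, which is why this counts as a lossy optimization.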

The quality of JPEGs may be lowered, on the assumption that a lower quality image may be suitable for storage after OCR.

It is not possible to optimize all image types. Uncommon image types may be skipped by the optimizer.

OCRmyPDF provides lossy mode JBIG2 as an advanced feature that additionally requires the argument --jbig2-lossy.
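The options described in this section can be assembled programmatically. The following is a minimal sketch that builds an ocrmypdf command line using only the --optimize and --jbig2-lossy arguments named above. The helper function and the file names are hypothetical and are not part of OCRmyPDF.

```python
import shlex


def build_ocrmypdf_command(input_pdf, output_pdf, optimize=1, jbig2_lossy=False):
    """Return an argument list for invoking the ocrmypdf command line tool.

    Only --optimize and --jbig2-lossy, both named in this section, are used.
    """
    if optimize not in (0, 1, 2, 3):
        raise ValueError("optimize must be 0, 1, 2, or 3")
    cmd = ["ocrmypdf", "--optimize", str(optimize)]
    if jbig2_lossy:
        cmd.append("--jbig2-lossy")
    cmd += [input_pdf, output_pdf]
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_ocrmypdf_command("input.pdf", "output.pdf", optimize=3, jbig2_lossy=True)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a system where ocrmypdf is installed.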