OCRmyPDF includes an image-oriented PDF optimizer. By default, the optimizer runs with safe settings with the goal of improving compression at no loss of quality. At higher optimization levels, lossy optimizations may be applied and tuned. Optimization occurs after OCR, and only if OCR succeeded. It does not perform other possible optimizations such as deduplicating resources, consolidating fonts, simplifying vector drawings, or anything of that nature.
The exact type of optimizations performed will vary over time, and depend on the availability of third-party tools.
Despite optimizations, OCRmyPDF might still increase the overall file size, since it must embed information about the recognized text, and depending on the settings chosen, may not be able to represent the output file as compactly as the input file.
Optimizations that always occurs¶
OCRmyPDF will automatically replace obsolete or inferior compression schemes such as RLE or LZW with superior schemes such as Deflate, and convert monochrome images to CCITT G4. Since this is lossless, it always occurs and there is no way to disable it. Other non-image compressed objects are compressed as well.
Fast web view¶
OCRmyPDF automatically optimizes PDFs for “fast web view” in Adobe Acrobat’s parlance, or equivalently, linearizes PDFs so that the resources they reference are presented in the order a viewer needs them for sequential display. This reduces the latency of viewing a PDF both online and from local storage, in exchange for a slight increase in file size.
To disable this optimization and all others, use
ocrmypdf --optimize 0 ...
or the shorthand
Adobe Acrobat might not report the file as being “fast web view”.
At optimization level
-O1 (the default), OCRmyPDF will also attempt lossless
If a JBIG2 encoder is available, then monochrome images will be converted to JBIG2, with the potential for huge savings on large black and white images, since JBIG2 is far more efficient than any other monochrome (bi-level) compression. (All known US patents related to JBIG2 have probably expired, but it remains the responsibility of the user to supply a JBIG2 encoder such as jbig2enc. OCRmyPDF does not implement JBIG2 encoding on its own.)
OCRmyPDF currently does not attempt to recompress losslessly compressed objects more aggressively.
At optimization level
-O3, OCRmyPDF will some attempt lossy
pngquant is installed, OCRmyPDF will use it to perform quantize paletted
images to reduce their size.
The quality of JPEGs may be lowered, on the assumption that a lower quality image may be suitable for storage after OCR.
It is not possible to optimize all image types. Uncommon image types may be skipped by the optimizer.
OCRmyPDF provides lossy mode JBIG2 as an advanced feature
that additional requires the argument