PDF optimization

OCRmyPDF includes an image-oriented PDF optimizer. By default, the optimizer uses safe settings that aim to improve compression with no loss of quality. At higher optimization levels, lossy optimizations may be applied and tuned. Optimization occurs after OCR, and only if OCR succeeded. The optimizer does not perform other possible optimizations such as deduplicating resources, consolidating fonts, or simplifying vector drawings.

OCRmyPDF optimization settings
Optimization level	Shorthand	Description
`--optimize 0`	`-O0`	Disable most optimizations.
`--optimize 1` (default)	`-O1`	Enables lossless optimizations, such as transcoding images to more efficient formats. Also compresses other uncompressed objects in the PDF and enables the more efficient “object streams” within the PDF.
`--optimize 2`	`-O2`	All of the above, and enables lossy optimizations and color quantization.
`--optimize 3`	`-O3`	All of the above, and enables more aggressive optimizations and targets lower image quality.

The exact optimizations performed will vary over time and depend on which third-party tools are installed.
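Because the available optimizations depend on installed tools, it can help to check which optional helpers are on the `PATH` before running. The following is a minimal sketch, not OCRmyPDF's own detection logic. The executable names are assumptions: jbig2enc is commonly installed as `jbig2`, and pngquant as `pngquant`. The function name `find_optional_tools` is hypothetical.

```python
import shutil

# Assumed executable names (not taken from this document):
# jbig2enc is commonly installed as "jbig2", pngquant as "pngquant".
OPTIONAL_TOOLS = {
    "JBIG2 encoder": "jbig2",
    "pngquant": "pngquant",
}


def find_optional_tools(tools=OPTIONAL_TOOLS):
    """Map each tool label to its full path on PATH, or None if not found."""
    return {label: shutil.which(exe) for label, exe in tools.items()}


if __name__ == "__main__":
    for label, path in find_optional_tools().items():
        print(f"{label}: {path or 'not found'}")
```

A tool reported as "not found" only means it is not on the current `PATH`; OCRmyPDF may still run, just without the corresponding optimization.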

Despite optimizations, OCRmyPDF might still increase the overall file size, since it must embed information about the recognized text, and depending on the settings chosen, may not be able to represent the output file as compactly as the input file.
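Since the output may be larger than the input, a quick size comparison after a run shows whether the chosen settings paid off. This is a small illustrative sketch; the helper name `size_change` is hypothetical and not part of OCRmyPDF.

```python
import os


def size_change(input_pdf, output_pdf):
    """Return (delta_bytes, ratio) of the output file relative to the input.

    A positive delta (ratio above 1.0) means the output grew.
    """
    before = os.path.getsize(input_pdf)
    after = os.path.getsize(output_pdf)
    # Guard against an empty input file to avoid division by zero.
    ratio = after / before if before else float("inf")
    return after - before, ratio
```

For example, `size_change("input.pdf", "output.pdf")` could be called after OCRmyPDF finishes, with a ratio above 1.0 suggesting that a higher optimization level is worth trying.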

Optimizations that always occur

OCRmyPDF will automatically replace obsolete or inferior compression schemes such as RLE or LZW with superior schemes such as Deflate, and convert monochrome images to CCITT G4. Since this is lossless, it always occurs and there is no way to disable it. Other uncompressed non-image objects are compressed as well.

Fast web view

OCRmyPDF automatically optimizes PDFs for “fast web view” in Adobe Acrobat’s parlance, or equivalently, linearizes PDFs so that the resources they reference are presented in the order a viewer needs them for sequential display. This reduces the latency of viewing a PDF both online and from local storage, in exchange for a slight increase in file size.

To disable this optimization and all others, use ocrmypdf --optimize 0 ... or the shorthand -O0.

Adobe Acrobat might not report the file as being “fast web view”.
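Since a viewer's report may be unreliable, one rough way to check for linearization is to look for the linearization dictionary, which the PDF specification places near the start of the file. The sketch below is a heuristic only: it does not validate hint tables or detect files that were modified after linearization. The function name `looks_linearized` and the 1024-byte probe size are choices made for this example.

```python
def looks_linearized(path, probe_bytes=1024):
    """Heuristic: True if a /Linearized key appears near the start of the file."""
    with open(path, "rb") as f:
        head = f.read(probe_bytes)
    # A linearized PDF carries a dictionary with a /Linearized key
    # shortly after the %PDF- header.
    return b"%PDF-" in head and b"/Linearized" in head
```

A `True` result suggests the file was written in linearized form; a `False` result is a strong sign that it was not.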

Lossless optimizations

At optimization level -O1 (the default), OCRmyPDF will also attempt lossless image optimization.

If a JBIG2 encoder is available, then monochrome images will be converted to JBIG2, with the potential for huge savings on large black and white images, since JBIG2 is far more efficient than any other monochrome (bi-level) compression. (All known US patents related to JBIG2 have probably expired, but it remains the responsibility of the user to supply a JBIG2 encoder such as jbig2enc. OCRmyPDF does not implement JBIG2 encoding on its own.)

OCRmyPDF currently does not attempt to recompress losslessly compressed objects more aggressively.

Lossy optimizations

At optimization levels -O2 and -O3, OCRmyPDF will attempt some lossy image optimization.

If Ghostscript is used to create a PDF/A (the default), Ghostscript will optimize some images by converting them to JPEG, which is lossy. If --output-type pdf is used, there are no lossy optimizations. Ghostscript’s JPEG conversion is quite safe.

If pngquant is installed, OCRmyPDF will use it to quantize paletted images to reduce their size.

The quality of JPEGs may be lowered, on the assumption that a lower quality image may be suitable for storage after OCR. Use --jpeg-quality to control the optimizer’s JPEG quality target. The optimizer is the recommended way to reduce JPEG image sizes: it applies consistently regardless of whether Ghostscript was used to produce a PDF/A.
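As an illustration of combining these options, the sketch below assembles an `ocrmypdf` command line from the flags described in this section (`--optimize`, `--jpeg-quality`, `--output-type`). The helper `build_command`, the file names, and the quality value of 50 are hypothetical choices for the example, not recommendations.

```python
import shlex


def build_command(input_pdf, output_pdf, optimize=1,
                  jpeg_quality=None, output_type=None):
    """Build an ocrmypdf argument list for the given optimization settings."""
    # Levels 0 through 3 are the ones listed in the settings table.
    if optimize not in (0, 1, 2, 3):
        raise ValueError("optimize must be 0, 1, 2, or 3")
    cmd = ["ocrmypdf", "--optimize", str(optimize)]
    if jpeg_quality is not None:
        cmd += ["--jpeg-quality", str(jpeg_quality)]
    if output_type is not None:
        cmd += ["--output-type", output_type]
    cmd += [input_pdf, output_pdf]
    return cmd


if __name__ == "__main__":
    # Illustrative values only: aggressive optimization with a lower JPEG target.
    cmd = build_command("input.pdf", "output.pdf", optimize=3, jpeg_quality=50)
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a system where OCRmyPDF is installed. Building the command as a list avoids shell quoting problems with file names that contain spaces.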

If you specifically need to tune Ghostscript’s own PDF/A image handling (for example, to force a hard DPI cap), see Advanced Ghostscript tuning for the separate --ghostscript-jpeg-quality and --ghostscript-jpeg-maxdpi options.

It is not possible to optimize all image types. Uncommon image types may be skipped by the optimizer.