Plugins

You can use plugins to customize the behavior of OCRmyPDF at certain points of interest.

Currently, it is possible to:

  • add new command line arguments
  • override the decision for whether or not to perform OCR on a particular file
  • modify the image that is about to be sent for OCR
  • modify the page image before it is converted to PDF

OCRmyPDF plugins are based on the Python pluggy package and conform to its conventions. Note that plugins installed as setuptools entry points are currently not checked, because OCRmyPDF assumes you may not want to enable plugins for all files. Also, plugins must be functions, not classes.

How plugins are imported

Plugins are imported on demand, by the OCRmyPDF worker process that needs to use them. As such, plugins cannot share state with other plugins, cannot rely on their module’s or the interpreter’s global state, and should expect asynchronous copies of themselves to be running. Plugins can write intermediate files to the folder specified in options.work_folder.

Plugins should work whether executed in threads or processes.

Script plugins

Script plugins may be called from the command line, by specifying the name of a file.

ocrmypdf --plugin example_plugin.py input.pdf output.pdf

Multiple plugins may be called by issuing the --plugin argument multiple times.

Packaged plugins

Packaged plugins may be installed into the same virtual environment as OCRmyPDF. They may be invoked using standard Python module naming.

ocrmypdf --plugin ocrmypdf_fancypants.pockets.contents input.pdf output.pdf

OCRmyPDF does not automatically import plugins, because the assumption is that plugins affect different files differently and you may not want them activated all the time. The command line or ocrmypdf.ocr(plugin='...') must call for them.

Third parties that wish to distribute plugins for OCRmyPDF should package them as packaged plugins. The module names should begin with ocrmypdf_, similar to pytest plugins such as pytest-cov (the package) and pytest_cov (the module).

Plugin hooks

A plugin may provide the following hooks. Hooks should be decorated with ocrmypdf.hookimpl, for example:

from ocrmypdf import hookimpl

@hookimpl
def prepare(options):
    pass

The following is a complete list of hooks that may be installed and when they are called.

class ocrmypdf.pluginspec.OcrEngine
static creator_tag(options: argparse.Namespace) → str

Returns the creator tag to identify this software’s role in creating the PDF.

static generate_hocr(input_file: pathlib.Path, output_hocr: pathlib.Path, output_text: pathlib.Path, options: argparse.Namespace) → None

Called to produce a hOCR file.

static generate_pdf(input_file: pathlib.Path, output_pdf: pathlib.Path, output_text: pathlib.Path, options: argparse.Namespace) → None

Called to produce a text only PDF (no image, invisible text).

static get_orientation(input_file: pathlib.Path, options: argparse.Namespace) → ocrmypdf.pluginspec.OrientationConfidence

Returns the orientation of the image.

static languages(options: argparse.Namespace) → AbstractSet[str]

Returns set of languages that are supported.

static version() → str

Returns the version of the OCR engine.

class ocrmypdf.pluginspec.OrientationConfidence(angle, confidence)
angle

Alias for field number 0

confidence

Alias for field number 1
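
The "Alias for field number" entries indicate that OrientationConfidence is a named tuple, so its fields can be read by name or by position. The sketch below uses a local stand-in class rather than importing the real one, and the field types (int and float) are assumptions made for illustration.

```python
from typing import NamedTuple


class OrientationConfidence(NamedTuple):
    """Local stand-in that mirrors the documented fields (not the real class)."""

    angle: int  # assumed type: rotation in degrees
    confidence: float  # assumed type: how sure the engine is


# Illustrative values only.
result = OrientationConfidence(angle=90, confidence=12.5)

# Fields are available by name, by index, and through tuple unpacking.
angle, confidence = result
print(result.angle, result[0], angle)
print(result.confidence, result[1], confidence)
```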

ocrmypdf.pluginspec.add_options(parser: argparse.ArgumentParser) → None

Allows the plugin to add its own command line arguments.

Even if you do not intend to use plugins in a command line context, you should use this function to create your options.
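
As a minimal sketch of this hook, the plugin below registers one option with standard argparse calls. The option name --fancy-threshold and its default are hypothetical, and the fallback definition of hookimpl exists only so that the snippet can be run without OCRmyPDF installed.

```python
import argparse

try:
    from ocrmypdf import hookimpl
except ImportError:
    # Fallback so this sketch runs without OCRmyPDF installed.
    def hookimpl(func):
        return func


@hookimpl
def add_options(parser: argparse.ArgumentParser) -> None:
    # --fancy-threshold is a hypothetical option used only for illustration.
    parser.add_argument(
        "--fancy-threshold",
        type=int,
        default=128,
        help="hypothetical binarization threshold (0-255)",
    )
```

Because the option is declared through the parser, its default value ends up in the options namespace whether or not the user passes the argument on the command line.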

ocrmypdf.pluginspec.check_options(options: argparse.Namespace) → None

Called to ask the plugin to check all of its options.

The plugin may modify the options. All objects that are in options must be picklable so they can be marshalled to child worker processes.
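
A possible shape for this hook is sketched below. It assumes a hypothetical fancy_threshold option registered by the same plugin, normalizes it, and confirms that the options still pickle. ValueError is used only to keep the sketch self-contained; a real plugin may prefer one of OCRmyPDF's own exception types.

```python
import argparse
import pickle

try:
    from ocrmypdf import hookimpl
except ImportError:
    # Fallback so this sketch runs without OCRmyPDF installed.
    def hookimpl(func):
        return func


@hookimpl
def check_options(options: argparse.Namespace) -> None:
    # fancy_threshold is a hypothetical option, assumed to have been
    # registered by this plugin's add_options hook.
    threshold = getattr(options, "fancy_threshold", 128)
    if not 0 <= threshold <= 255:
        raise ValueError(f"--fancy-threshold must be in 0..255, got {threshold}")

    # The plugin may modify options; store only plain, picklable values.
    options.fancy_threshold = int(threshold)

    # Fail early if something unpicklable was placed in options.
    pickle.dumps(options)
```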

ocrmypdf.pluginspec.filter_ocr_image(page: PageContext, image: PIL.Image.Image) → PIL.Image.Image

Called to filter the image before it is sent to OCR.

This is the image that OCR sees, not what the user sees when they view the PDF.
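
For illustration, a filter might convert the image to grayscale before OCR. This sketch assumes that image is a Pillow image object, so that its convert() method is available; the fallback hookimpl is there only so the snippet runs without OCRmyPDF installed.

```python
try:
    from ocrmypdf import hookimpl
except ImportError:
    # Fallback so this sketch runs without OCRmyPDF installed.
    def hookimpl(func):
        return func


@hookimpl
def filter_ocr_image(page, image):
    # "L" is Pillow's 8-bit grayscale mode. Only the OCR engine sees this
    # image; the page shown to the user is not affected by this hook.
    return image.convert("L")
```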

ocrmypdf.pluginspec.filter_page_image(page: PageContext, image_filename: pathlib.Path) → pathlib.Path

Called to filter the whole page before it is inserted into the PDF.

A whole page image is only produced when preprocessing command line arguments are issued or when --force-ocr is issued. If no whole page image is produced for a given page, this function will not be called. This is not the image that will be shown to OCR.

ocrmypdf will create the PDF page based on the image format used. If you convert the image to a JPEG, the output page will be created as a JPEG, etc. Note that the ocrmypdf image optimization stage may ultimately choose a different format.

ocrmypdf.pluginspec.generate_pdfa(pdf_pages: List[pathlib.Path], pdfmark: pathlib.Path, output_file: pathlib.Path, compression: str, pdf_version: str, pdfa_part: str) → pathlib.Path

Generate a PDF/A.

The pdf_pages, a list of files, will be merged into output_file. One or more PDF files may be merged. The pdfmark file is a PostScript (.ps) file that provides Ghostscript with details on how to perform the PDF/A conversion. By default we pick PDF/A-2b, but this also works for PDF/A-1 or PDF/A-3.

compression can be ‘jpeg’, ‘lossless’, or an empty string. In ‘jpeg’, Ghostscript is instructed to convert color and grayscale images to DCT (JPEG encoding). In ‘lossless’ Ghostscript is told to convert images to Flate (lossless/PNG). If the parameter is omitted Ghostscript is left to make its own decisions about how to encode images; it appears to use a heuristic to decide how to encode images. As of Ghostscript 9.25, we support passthrough JPEG which allows Ghostscript to avoid transcoding images entirely. (The feature was added in 9.23 but broken, and the 9.24 release of Ghostscript had regressions, so we don’t support it until 9.25.)

Returns:

output_file

ocrmypdf.pluginspec.rasterize_pdf_page(input_file: pathlib.Path, output_file: pathlib.Path, raster_device: str, raster_dpi: ocrmypdf.helpers.Resolution, pageno: int, page_dpi: Optional[ocrmypdf.helpers.Resolution] = None, rotation: Optional[int] = None, filter_vector: bool = False) → pathlib.Path

Rasterize one page of a PDF at resolution raster_dpi in canvas units.

The image is sized to match the integer pixel dimensions implied by raster_dpi, even if those numbers are noninteger. The image’s DPI will be overridden with the values in page_dpi.

Parameters:
  • raster_device – type of image to produce at output_file
  • raster_dpi – resolution at which to rasterize page
  • pageno – page number to rasterize (beginning at page 1)
  • page_dpi – resolution, overriding output image DPI
  • rotation – cardinal angle, clockwise, to rotate page
  • filter_vector – if True, remove vector graphics objects
Returns:

output_file
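
The relationship between page size, resolution, and pixel dimensions can be sketched with a little arithmetic. The helper below is hypothetical and is not OCRmyPDF's implementation: it assumes page dimensions are given in PDF points (1/72 inch) and that fractional pixel counts are rounded to the nearest integer, which the text above does not specify.

```python
def pixel_size(width_pt: float, height_pt: float, dpi_x: float, dpi_y: float):
    """Estimate raster dimensions in whole pixels (hypothetical helper)."""
    # One PDF point is 1/72 inch, so inches = points / 72.
    # Rounding to the nearest integer is an assumption of this sketch.
    width_px = round(width_pt / 72 * dpi_x)
    height_px = round(height_pt / 72 * dpi_y)
    return width_px, height_px


# US Letter (612 x 792 pt) at 300 DPI divides evenly.
print(pixel_size(612, 792, 300, 300))

# A4 (595 x 842 pt) at 300 DPI implies fractional pixels, which get rounded.
print(pixel_size(595, 842, 300, 300))
```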

ocrmypdf.pluginspec.validate(pdfinfo: PdfInfo, options: argparse.Namespace) → None

Called to give a plugin an opportunity to review options and pdfinfo.

options contains the “work order” to process a particular file. pdfinfo contains information about the input file obtained after loading and parsing. The plugin may modify the options. For example, you could decide that a certain type of file should be treated with options.force_ocr = True based on information in its pdfinfo.

The plugin may raise ocrmypdf.exceptions.InputFileError or any ocrmypdf.exceptions.ExitCodeException to request normal termination. ocrmypdf will hold the plugin responsible for raising exceptions of any other type.

The return value is ignored. To abort processing, raise an ExitCodeException.
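
The force_ocr example above could look roughly like the following sketch. The rule it applies (single-page inputs always get OCR) is hypothetical, and it assumes that len(pdfinfo) yields the number of pages; the fallback hookimpl is there only so the snippet runs without OCRmyPDF installed.

```python
import argparse

try:
    from ocrmypdf import hookimpl
except ImportError:
    # Fallback so this sketch runs without OCRmyPDF installed.
    def hookimpl(func):
        return func


@hookimpl
def validate(pdfinfo, options: argparse.Namespace) -> None:
    # Hypothetical rule: always OCR single-page inputs.
    # Assumption: len(pdfinfo) gives the number of pages in the input file.
    if len(pdfinfo) == 1:
        options.force_ocr = True
    # Nothing is returned, since the return value is ignored.
```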