Plugins

You can use plugins to customize the behavior of OCRmyPDF at certain points of interest.

Currently, it is possible to:

add new command line arguments
override the decision for whether or not to perform OCR on a particular file
modify the image that is about to be sent for OCR
modify the page image before it is converted to PDF

OCRmyPDF plugins are based on the Python pluggy package and conform to its conventions. Note that plugins installed as setuptools entry points are not currently checked, because OCRmyPDF assumes you may not want to enable plugins for all files. Also, plugins must be functions, not classes.

How plugins are imported

Plugins are imported on demand, by the OCRmyPDF worker process that needs to use them. As such, plugins cannot share state with other plugins, cannot rely on their module’s or the interpreter’s global state, and should expect asynchronous copies of themselves to be running. Plugins can write intermediate files to the folder specified in options.work_folder.

Plugins should work whether executed in threads or processes.
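
As a rough illustration of these constraints, the sketch below keeps no module-level state and writes its scratch data under options.work_folder. The helper name save_intermediate and the file name are hypothetical; only options.work_folder comes from the text above, and an argparse.Namespace stands in for the options object OCRmyPDF would supply.

```python
import argparse
import tempfile
from pathlib import Path


def save_intermediate(options: argparse.Namespace, name: str, data: bytes) -> Path:
    """Write scratch data under options.work_folder and return its path.

    Hypothetical helper: nothing is cached in module globals, so it behaves
    the same whether the caller runs in a thread or in a separate process.
    """
    target = Path(options.work_folder) / name
    target.write_bytes(data)
    return target


# Stand-in for the options object that OCRmyPDF would pass to a hook.
with tempfile.TemporaryDirectory() as folder:
    demo_options = argparse.Namespace(work_folder=folder)
    written = save_intermediate(demo_options, "example_plugin_notes.bin", b"scratch")
    demo_contents = written.read_bytes()
```

Because several copies of a plugin may run at once, a real plugin would probably include something page-specific in the file name to avoid collisions.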

Script plugins

Script plugins may be called from the command line, by specifying the name of a file.

ocrmypdf --plugin example_plugin.py input.pdf output.pdf

Multiple plugins may be called by issuing the --plugin argument multiple times.

Packaged plugins

Packaged plugins may be installed into the same virtual environment as OCRmyPDF. They may be invoked using standard Python module naming.

ocrmypdf --plugin ocrmypdf_fancypants.pockets.contents input.pdf output.pdf

OCRmyPDF does not automatically import plugins, because the assumption is that plugins affect different files differently and you may not want them activated all the time. The command line or ocrmypdf.ocr(plugin='...') must call for them.

Third parties that wish to distribute packages for ocrmypdf should package them as packaged plugins, and these modules should begin with the name ocrmypdf_ similar to pytest packages such as pytest-cov (the package) and pytest_cov (the module).

Plugin hooks

A plugin may provide the following hooks. Hooks should be decorated with ocrmypdf.hookimpl, for example:

from ocrmypdf import hookimpl

@hookimpl
def prepare(options):
    pass

The following is a complete list of hooks that may be installed and when they are called.

class ocrmypdf.pluginspec.OcrEngine

static creator_tag(options: argparse.Namespace) → str: Returns the creator tag to identify this software’s role in creating the PDF.

static generate_hocr(input_file: pathlib.Path, output_hocr: pathlib.Path, output_text: pathlib.Path, options: argparse.Namespace) → None: Called to produce a hOCR file.

static generate_pdf(input_file: pathlib.Path, output_pdf: pathlib.Path, output_text: pathlib.Path, options: argparse.Namespace) → None: Called to produce a text only PDF (no image, invisible text).

static get_orientation(input_file: pathlib.Path, options: argparse.Namespace) → ocrmypdf.pluginspec.OrientationConfidence: Returns the orientation of the image.

static languages(options: argparse.Namespace) → AbstractSet[str]: Returns set of languages that are supported.

static version() → str: Returns the version of the OCR engine.

class ocrmypdf.pluginspec.OrientationConfidence(angle, confidence)

angle: Alias for field number 0

confidence: Alias for field number 1
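
The field aliases suggest a named tuple. The stand-in below mirrors that two-field shape using only the standard library so it can be run without OCRmyPDF; it is not the library's own definition, and the field types and sample values are assumptions.

```python
from typing import NamedTuple


class OrientationConfidence(NamedTuple):
    """Local stand-in with the same two fields as the documented class."""

    angle: int         # field number 0 (type is an assumption)
    confidence: float  # field number 1 (type is an assumption)


# Sample values are arbitrary and only demonstrate field access.
result = OrientationConfidence(angle=90, confidence=12.5)
angle, confidence = result  # unpacks in field order
```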

ocrmypdf.pluginspec.add_options(parser: argparse.ArgumentParser) → None

Allows the plugin to add its own command line arguments.

Even if you do not intend to use plugins in a command line context, you should use this function to create your options.
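
A minimal sketch of this hook might look like the following. The option name --example-threshold is hypothetical, and the try/except fallback for hookimpl is an assumption added only so the sketch runs where OCRmyPDF is not installed; a real plugin would import hookimpl directly.

```python
import argparse

try:
    from ocrmypdf import hookimpl
except ImportError:
    # Fallback (assumption) so the sketch runs without OCRmyPDF installed.
    def hookimpl(func):
        return func


@hookimpl
def add_options(parser):
    # "--example-threshold" is a hypothetical option used for illustration.
    parser.add_argument(
        "--example-threshold",
        type=int,
        default=128,
        help="Hypothetical threshold used by this example plugin",
    )


# Exercise the hook with a plain parser, as a command line invocation would.
demo_parser = argparse.ArgumentParser()
add_options(demo_parser)
demo_args = demo_parser.parse_args(["--example-threshold", "200"])
```

Declaring the option here gives it a default value, so later hooks can read options.example_threshold even when the plugin is driven from Python rather than the command line.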

ocrmypdf.pluginspec.check_options(options: argparse.Namespace) → None

Called to ask the plugin to check all of its options.

The plugin may modify the options. All objects that are in options must be picklable so they can be marshalled to child worker processes.
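
The sketch below shows one way a plugin might normalize an option and confirm that the result can still be pickled. The option example_threshold, its 0 to 255 range, and the helper is_picklable are hypothetical; the hookimpl fallback is an assumption that lets the sketch run without OCRmyPDF installed.

```python
import argparse
import pickle

try:
    from ocrmypdf import hookimpl
except ImportError:
    # Fallback (assumption) so the sketch runs without OCRmyPDF installed.
    def hookimpl(func):
        return func


@hookimpl
def check_options(options):
    # "example_threshold" is a hypothetical option owned by this plugin.
    threshold = getattr(options, "example_threshold", 128)
    # The hook may modify options: coerce to int and clamp to 0..255.
    options.example_threshold = max(0, min(255, int(threshold)))


def is_picklable(options):
    """Return True if options can be marshalled with pickle."""
    try:
        pickle.dumps(options)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True


demo_options = argparse.Namespace(example_threshold="300")
check_options(demo_options)
```

Storing plain values such as numbers, strings, and paths in options keeps them picklable; objects such as lambdas or open file handles would not survive the trip to a worker process.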

ocrmypdf.pluginspec.filter_ocr_image(page: PageContext, image: PIL.Image.Image) → PIL.Image.Image

Called to filter the image before it is sent to OCR.

This is the image that OCR sees, not what the user sees when they view the PDF.
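
For instance, a plugin could hand the OCR engine a grayscale copy of the page. This is a sketch under the assumption that image is a Pillow image object, whose convert("L") method produces 8-bit grayscale; the hookimpl fallback exists only so the sketch runs without OCRmyPDF installed.

```python
try:
    from ocrmypdf import hookimpl
except ImportError:
    # Fallback (assumption) so the sketch runs without OCRmyPDF installed.
    def hookimpl(func):
        return func


@hookimpl
def filter_ocr_image(page, image):
    # Convert to 8-bit grayscale using Pillow's Image.convert("L").
    # Only the OCR engine sees this image; the visible page is unaffected.
    return image.convert("L")
```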

ocrmypdf.pluginspec.filter_page_image(page: PageContext, image_filename: pathlib.Path) → pathlib.Path

Called to filter the whole page before it is inserted into the PDF.

A whole page image is only produced when preprocessing command line arguments are issued or when --force-ocr is issued. If no whole page image is produced for a given page, this function will not be called. This is not the image that will be shown to OCR.

ocrmypdf will create the PDF page based on the image format used. If you convert the image to a JPEG, the output page will be created as a JPEG, etc. Note that the ocrmypdf image optimization stage may ultimately choose a different format.

ocrmypdf.pluginspec.generate_pdfa(pdf_pages: List[pathlib.Path], pdfmark: pathlib.Path, output_file: pathlib.Path, compression: str, pdf_version: str, pdfa_part: str) → pathlib.Path

Generate a PDF/A.

The pdf_pages, a list of files, will be merged into output_file. One or more PDF files may be merged. The pdfmark file is a PostScript .ps file that provides Ghostscript with details on how to perform the PDF/A conversion. By default we pick PDF/A-2b, but this also works for PDF/A-1 or PDF/A-3.

compression can be ‘jpeg’, ‘lossless’, or an empty string. In ‘jpeg’, Ghostscript is instructed to convert color and grayscale images to DCT (JPEG encoding). In ‘lossless’ Ghostscript is told to convert images to Flate (lossless/PNG). If the parameter is omitted Ghostscript is left to make its own decisions about how to encode images; it appears to use a heuristic to decide how to encode images. As of Ghostscript 9.25, we support passthrough JPEG which allows Ghostscript to avoid transcoding images entirely. (The feature was added in 9.23 but broken, and the 9.24 release of Ghostscript had regressions, so we don’t support it until 9.25.)

Returns:	output_file

ocrmypdf.pluginspec.rasterize_pdf_page(input_file: pathlib.Path, output_file: pathlib.Path, raster_device: str, raster_dpi: ocrmypdf.helpers.Resolution, pageno: int, page_dpi: Optional[ocrmypdf.helpers.Resolution] = None, rotation: Optional[int] = None, filter_vector: bool = False) → pathlib.Path

Rasterize one page of a PDF at resolution raster_dpi in canvas units.

The image is sized to match the integer pixel dimensions implied by raster_dpi, even if the implied dimensions are not integers. The image’s DPI will be overridden with the values in page_dpi.

Parameters:
raster_device – type of image to produce at output_file
raster_dpi – resolution at which to rasterize page
pageno – page number to rasterize (beginning at page 1)
page_dpi – resolution, overriding output image DPI
rotation – cardinal angle, clockwise, to rotate page
filter_vector – if True, remove vector graphics objects
Returns:	output_file
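
To see how non-integer pixel dimensions can arise, the arithmetic can be sketched as follows. This assumes the page size is expressed in PDF points (1/72 inch) and that fractional results are rounded to the nearest pixel; the rounding rule OCRmyPDF actually applies is not stated here, and the function name implied_pixel_size is hypothetical.

```python
def implied_pixel_size(width_pt, height_pt, dpi_x, dpi_y):
    """Return (exact, rounded) pixel dimensions for a page.

    Assumes the page size is given in PDF points (1/72 inch) and that
    non-integer results are rounded to the nearest pixel. The rounding
    rule is an assumption made for this illustration.
    """
    exact = (width_pt / 72.0 * dpi_x, height_pt / 72.0 * dpi_y)
    rounded = (round(exact[0]), round(exact[1]))
    return exact, rounded


# US Letter (612 x 792 points) at 300 DPI maps to whole pixels...
letter_exact, letter_pixels = implied_pixel_size(612, 792, 300, 300)
# ...while A4 (approximately 595 x 842 points) at 300 DPI does not.
a4_exact, a4_pixels = implied_pixel_size(595, 842, 300, 300)
```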

ocrmypdf.pluginspec.validate(pdfinfo: PdfInfo, options: argparse.Namespace) → None

Called to give a plugin an opportunity to review options and pdfinfo.

options contains the “work order” to process a particular file. pdfinfo contains information about the input file obtained after loading and parsing. The plugin may modify the options. For example, you could decide that a certain type of file should be treated with options.force_ocr = True based on information in its pdfinfo.

The plugin may raise ocrmypdf.exceptions.InputFileError or any ocrmypdf.exceptions.ExitCodeException to request normal termination. ocrmypdf will hold the plugin responsible for raising exceptions of any other type.

The return value is ignored. To abort processing, raise an ExitCodeException.
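
A sketch of the force_ocr example described above might look like this. It assumes that len(pdfinfo) gives the number of pages in the input file, and the single-page rule is a hypothetical policy chosen for illustration; the hookimpl fallback exists only so the sketch runs without OCRmyPDF installed.

```python
import argparse

try:
    from ocrmypdf import hookimpl
except ImportError:
    # Fallback (assumption) so the sketch runs without OCRmyPDF installed.
    def hookimpl(func):
        return func


@hookimpl
def validate(pdfinfo, options):
    # Assumption: len(pdfinfo) is the page count of the input file.
    # Hypothetical policy: always force OCR on single-page documents.
    if len(pdfinfo) == 1:
        options.force_ocr = True
    # The return value is ignored, so nothing is returned.


demo_options = argparse.Namespace(force_ocr=False)
validate(["page-1"], demo_options)  # a one-item list stands in for PdfInfo
```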