Plugins

You can use plugins to customize the behavior of OCRmyPDF at certain points of interest.

Currently, it is possible to:

  • add new command line arguments
  • override the decision for whether or not to perform OCR on a particular file
  • modify the image that is about to be sent for OCR
  • modify the page image before it is converted to PDF

OCRmyPDF plugins are based on the Python pluggy package and conform to its conventions. Note that plugins installed as setuptools entry points are not currently checked, because OCRmyPDF assumes you may not want to enable plugins for all files. Also, plugins must be functions, not classes.

How plugins are imported

Plugins are imported on demand, by the OCRmyPDF worker process that needs to use them. As such, plugins cannot share state with other plugins, cannot rely on their module’s or the interpreter’s global state, and should expect asynchronous copies of themselves to be running. Plugins can write intermediate files to the folder specified in options.work_folder.

Plugins should work whether executed in threads or processes.
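
As a minimal sketch of the guidance above, the helpers below keep per-page scratch data in a file under options.work_folder instead of in a module-level variable. The names scratch_path and remember, and the file naming scheme, are hypothetical and exist only for illustration; only options.work_folder comes from the text above.

```python
from argparse import Namespace
from pathlib import Path


def scratch_path(options: Namespace, pageno: int, suffix: str = ".txt") -> Path:
    """Return a per-page scratch file path inside options.work_folder.

    Hypothetical helper: the "myplugin_" naming scheme is an illustration.
    """
    # Include the page number so that concurrent workers handling
    # different pages do not write to the same file.
    return Path(options.work_folder) / f"myplugin_{pageno:06d}{suffix}"


def remember(options: Namespace, pageno: int, note: str) -> Path:
    """Persist a note on disk rather than in interpreter-global state."""
    path = scratch_path(options, pageno)
    path.write_text(note, encoding="utf-8")
    return path
```

Because the data lives on disk, this approach behaves the same whether the plugin runs in a thread or in a separate process.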

Script plugins

Script plugins may be called from the command line, by specifying the name of a file.

ocrmypdf --plugin example_plugin.py input.pdf output.pdf

Multiple plugins may be called by issuing the --plugin argument multiple times.

Packaged plugins

Packaged plugins may be installed into the same virtual environment as OCRmyPDF. They may be invoked using standard Python module naming.

ocrmypdf --plugin ocrmypdf_fancypants.pockets.contents input.pdf output.pdf

OCRmyPDF does not automatically import plugins, because the assumption is that plugins affect different files differently and you may not want them activated all the time. The command line or ocrmypdf.ocr(plugin='...') must call for them.

Third parties that wish to distribute plugins for OCRmyPDF should package them as packaged plugins, and these modules should begin with the name ocrmypdf_, similar to pytest packages such as pytest-cov (the package) and pytest_cov (the module).

Plugin hooks

A plugin may provide the following hooks. Hooks should be decorated with ocrmypdf.hookimpl, for example:

from ocrmypdf import hookimpl

@hookimpl
def add_options(parser):
    pass

The following is a complete list of hooks that may be installed and when they are called.

class ocrmypdf.pluginspec.OcrEngine
static creator_tag(options: argparse.Namespace) → str

Returns the creator tag to identify this software’s role in creating the PDF.

static generate_hocr(input_file: pathlib.Path, output_hocr: pathlib.Path, output_text: pathlib.Path, options: argparse.Namespace) → None

Called to produce a hOCR file and sidecar text file.

static generate_pdf(input_file: pathlib.Path, output_pdf: pathlib.Path, output_text: pathlib.Path, options: argparse.Namespace) → None

Called to produce a text only PDF.

Parameters:
  • input_file – A page image on which to perform OCR.
  • output_pdf – The expected name of the output PDF, which must be a single page PDF with no visible content of any kind, sized to the dimensions implied by the input_file’s width, height and DPI. The image will be grafted onto the input PDF page.
static get_orientation(input_file: pathlib.Path, options: argparse.Namespace) → ocrmypdf.pluginspec.OrientationConfidence

Returns the orientation of the image.

static languages(options: argparse.Namespace) → AbstractSet[str]

Returns the set of all languages that are supported by the engine.

Languages are typically given in 3-letter ISO 639-2 codes, but actually can be any value understood by the OCR engine.

static version() → str

Returns the version of the OCR engine.

class ocrmypdf.pluginspec.OrientationConfidence(angle, confidence)
angle

Alias for field number 0

confidence

Alias for field number 1

ocrmypdf.pluginspec.add_options(parser: argparse.ArgumentParser) → None

Allows the plugin to add its own command line and API arguments.

OCRmyPDF converts command line arguments to API arguments, so adding arguments here will cause new arguments to be processed for API calls to ocrmypdf.ocr, or when invoked on the command line.

Note

This hook will be called from the main process, and may modify global state before child worker processes are forked.

ocrmypdf.pluginspec.check_options(options: argparse.Namespace) → None

Called to ask the plugin to check all of the options.

The plugin may check if options that it added are valid.

Warnings or other messages may be passed to the user by creating a logger object using log = logging.getLogger(__name__) and logging to this.

The plugin may also modify the options. All objects that are in options must be picklable so they can be marshalled to child worker processes.

Raises: ocrmypdf.exceptions.ExitCodeException – If options are not acceptable and the application should terminate gracefully with an informative message and error code.

Note

This hook will be called from the main process, and may modify global state before child worker processes are forked.
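
The two hooks above are often used together. The following is a minimal sketch, not a definitive implementation: the --fancy-threshold option is hypothetical, and the try/except fallback for hookimpl is only there so the sketch can be exercised without OCRmyPDF installed (a real plugin would simply import hookimpl).

```python
import argparse
import logging

try:
    from ocrmypdf import hookimpl
except ImportError:
    # Stand-in so this sketch runs without OCRmyPDF installed.
    def hookimpl(func):
        return func

log = logging.getLogger(__name__)


@hookimpl
def add_options(parser: argparse.ArgumentParser) -> None:
    # --fancy-threshold is a hypothetical option used only for illustration.
    parser.add_argument("--fancy-threshold", type=int, default=128)


@hookimpl
def check_options(options: argparse.Namespace) -> None:
    # Repair an out-of-range value and tell the user about it.
    # The stored value is a plain int, so options stays picklable.
    if not 0 <= options.fancy_threshold <= 255:
        log.warning("--fancy-threshold must be in 0..255; using 128")
        options.fancy_threshold = 128
```

This sketch chooses to repair a bad value and log a warning. A plugin that considers the problem fatal would raise ocrmypdf.exceptions.ExitCodeException instead, as described above.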

ocrmypdf.pluginspec.filter_ocr_image(page: PageContext, image: PIL.Image.Image) → PIL.Image.Image

Called to filter the image before it is sent to OCR.

This is the image that OCR sees, not what the user sees when they view the PDF.

Note

This hook will be called from child processes. Modifying global state will not affect the main process or other child processes.
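
As an illustration, a filter_ocr_image hook that hands the OCR engine a grayscale version of the page might look like the sketch below. It assumes image is a Pillow image, whose convert("L") method returns a new 8-bit grayscale image; whether grayscale actually helps recognition depends on the input. The hookimpl fallback is only there so the sketch runs without OCRmyPDF installed.

```python
try:
    from ocrmypdf import hookimpl
except ImportError:
    # Stand-in so this sketch runs without OCRmyPDF installed.
    def hookimpl(func):
        return func


@hookimpl
def filter_ocr_image(page, image):
    # Assumption: image is a PIL.Image.Image. convert("L") returns a new
    # grayscale image and leaves the original object untouched.
    return image.convert("L")
```

The hook keeps no state between calls, so it behaves the same in every worker.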

ocrmypdf.pluginspec.filter_page_image(page: PageContext, image_filename: pathlib.Path) → pathlib.Path

Called to filter the whole page before it is inserted into the PDF.

A whole page image is only produced when preprocessing command line arguments are issued or when --force-ocr is issued. If no whole page image is produced for a given page, this function will not be called. This is not the image that will be shown to OCR.

ocrmypdf will create the PDF page based on the image format used. If you convert the image to a JPEG, the output page will be created as a JPEG, etc. Note that the ocrmypdf image optimization stage may ultimately choose a different format.

Note

This hook will be called from child processes. Modifying global state will not affect the main process or other child processes.

ocrmypdf.pluginspec.generate_pdfa(pdf_pages: List[pathlib.Path], pdfmark: pathlib.Path, output_file: pathlib.Path, compression: str, pdf_version: str, pdfa_part: str) → pathlib.Path

Generate a PDF/A.

This API strongly assumes a PDF/A generator with Ghostscript’s semantics.

OCRmyPDF will modify the metadata and possibly linearize the PDF/A after it is generated.

Parameters:
  • pdf_pages – A list of one or more filenames, will be merged into output_file.
  • pdfmark – A PostScript file intended for Ghostscript with details on how to perform the PDF/A conversion.
  • output_file – The name of the desired output file.
  • compression – One of 'jpeg', 'lossless', ''. For 'jpeg', the PDF/A generator should convert all images to JPEG encoding where possible. For 'lossless', all images should be converted to FlateEncode (lossless PNG). If an empty string, the PDF generator should make its own decisions about how to encode images.
  • pdf_version – The minimum PDF version that the output file should be. At its own discretion, the PDF/A generator may raise the version, but should not lower it.
  • pdfa_part – The desired PDF/A compliance level, such as '2B'.
Returns:

If successful, the hook should return output_file.

Return type:

pathlib.Path
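
To make the compression parameter concrete, here is one way a Ghostscript-based generator might translate it into command line arguments. This is a sketch under stated assumptions: the helper name compression_args is hypothetical, and the arguments are Ghostscript distiller parameters for color and grayscale images, not necessarily the exact set OCRmyPDF itself passes.

```python
from typing import List


def compression_args(compression: str) -> List[str]:
    """Map the hook's compression value to Ghostscript arguments.

    Hypothetical helper for illustration only.
    """
    if compression == "":
        # Empty string: let the PDF generator choose encodings itself.
        return []
    if compression == "jpeg":
        image_filter = "DCTEncode"
    elif compression == "lossless":
        image_filter = "FlateEncode"
    else:
        raise ValueError(f"unsupported compression: {compression!r}")
    # Turn off automatic filter selection, then force the chosen filter
    # for both color and grayscale images.
    return [
        "-dAutoFilterColorImages=false",
        f"-dColorImageFilter=/{image_filter}",
        "-dAutoFilterGrayImages=false",
        f"-dGrayImageFilter=/{image_filter}",
    ]
```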

ocrmypdf.pluginspec.get_ocr_engine() → ocrmypdf.pluginspec.OcrEngine

Returns an OcrEngine to use for processing this file.

The OcrEngine may be instantiated multiple times, by both the main process and child processes. As such, it must store and obtain any state through options or some other common location.

ocrmypdf.pluginspec.rasterize_pdf_page(input_file: pathlib.Path, output_file: pathlib.Path, raster_device: str, raster_dpi: ocrmypdf.helpers.Resolution, pageno: int, page_dpi: Optional[ocrmypdf.helpers.Resolution] = None, rotation: Optional[int] = None, filter_vector: bool = False) → pathlib.Path

Rasterize one page of a PDF at resolution raster_dpi in canvas units.

The image is sized to match the integer pixel dimensions implied by raster_dpi, even if the exact dimensions are non-integer. The image’s DPI will be overridden with the values in page_dpi.

Parameters:
  • input_file – The PDF to rasterize.
  • output_file – The desired name of the rasterized image.
  • raster_device – Type of image to produce at output_file
  • raster_dpi – Resolution at which to rasterize page
  • pageno – Page number to rasterize (beginning at page 1)
  • page_dpi – Resolution, overriding output image DPI
  • rotation – Cardinal angle, clockwise, to rotate page
  • filter_vector – If True, remove vector graphics objects
Returns:

output_file

Note

This hook will be called from child processes. Modifying global state will not affect the main process or other child processes.
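
The arithmetic behind "integer pixel dimensions implied by raster_dpi" can be sketched as follows. The sketch assumes the page size is known in PDF points (1/72 inch) and that fractional results are rounded to the nearest integer; the rounding mode a given rasterizer uses is not stated above, so treat that as an assumption. The function name pixel_dimensions is hypothetical.

```python
from typing import Tuple


def pixel_dimensions(
    width_pt: float, height_pt: float, dpi_x: float, dpi_y: float
) -> Tuple[int, int]:
    """Convert a page size in PDF points to whole pixels at a resolution."""
    # 72 points per inch, so inches = points / 72 and pixels = inches * dpi.
    # Assumption: round to the nearest integer pixel.
    width_px = round(width_pt / 72 * dpi_x)
    height_px = round(height_pt / 72 * dpi_y)
    return width_px, height_px
```

For example, a US Letter page of 612 by 792 points rasterized at 300 DPI in both directions comes out at 2550 by 3300 pixels.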

ocrmypdf.pluginspec.validate(pdfinfo: PdfInfo, options: argparse.Namespace) → None

Called to give a plugin an opportunity to review options and pdfinfo.

options contains the “work order” to process a particular file. pdfinfo contains information about the input file obtained after loading and parsing. The plugin may modify the options. For example, you could decide that a certain type of file should be treated with options.force_ocr = True based on information in its pdfinfo.

Raises: ocrmypdf.exceptions.ExitCodeException – If options or pdfinfo are not acceptable and the application should terminate gracefully with an informative message and error code.

Note

This hook will be called from the main process, and may modify global state before child worker processes are forked.
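
The force_ocr example mentioned above could be sketched like this. The page-count rule and the SHORT_DOCUMENT_PAGES constant are hypothetical, and the sketch assumes that len(pdfinfo) gives the number of pages. The hookimpl fallback is only there so the sketch runs without OCRmyPDF installed.

```python
import argparse

try:
    from ocrmypdf import hookimpl
except ImportError:
    # Stand-in so this sketch runs without OCRmyPDF installed.
    def hookimpl(func):
        return func

# Hypothetical threshold, chosen only for illustration.
SHORT_DOCUMENT_PAGES = 10


@hookimpl
def validate(pdfinfo, options: argparse.Namespace) -> None:
    # Assumption: len(pdfinfo) is the page count of the input file.
    if len(pdfinfo) <= SHORT_DOCUMENT_PAGES:
        # Treat short documents as scans that should always be re-OCRed.
        options.force_ocr = True
```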