Plugins¶
The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.
You can use plugins to customize the behavior of OCRmyPDF at certain points of interest.
Currently, it is possible to:
- add new command line arguments
- override the decision for whether or not to perform OCR on a particular file
- modify the image that is about to be sent for OCR
- modify the page image before it is converted to PDF
- replace the Tesseract OCR with another OCR engine that has similar behavior
- replace Ghostscript with another PDF to image converter (rasterizer) or PDF/A generator
OCRmyPDF plugins are based on the Python pluggy
package and conform to its
conventions. Note that plugins installed as setuptools entry points are
not checked currently, because OCRmyPDF assumes you may not want to enable
plugins for all files.
Script plugins¶
Script plugins may be called from the command line by specifying the name of a file. Script plugins may be convenient for informal or “one-off” plugins, for example when a certain batch of files needs a special processing step.
ocrmypdf --plugin ocrmypdf_example_plugin.py input.pdf output.pdf
Multiple plugins may be installed by issuing the --plugin
argument multiple times.
Packaged plugins¶
Packaged plugins may be installed into the same virtual environment as OCRmyPDF. They may be invoked using standard Python module naming. If you intend to distribute a plugin, please package it.
ocrmypdf --plugin ocrmypdf_fancypants.pockets.contents input.pdf output.pdf
OCRmyPDF does not automatically import plugins, because the assumption is that
plugins affect different files differently and you may not want them activated
all the time. The command line or ocrmypdf.ocr(plugin='...')
must call
for them.
Third parties that wish to distribute packages for ocrmypdf should package them
as packaged plugins, and these modules should begin with the name ocrmypdf_
similar to pytest
packages such as pytest-cov
(the package) and
pytest_cov
(the module).
Note
We strongly recommend plugin authors name their plugins with the prefix
ocrmypdf-
(for the package name on PyPI) and ocrmypdf_
(for the
module), just like pytest plugins.
Plugin requirements¶
OCRmyPDF generally uses multiple worker processes. When a new worker is started, Python will import all plugins again, including all plugins that were imported earlier. This means that the global state of a plugin in one worker will not be shared with other workers. As such, plugin hook implementations should be stateless, relying only on their inputs. Hook implementations may use their input parameters to obtain a reference to shared state prepared by another hook implementation. Plugins must expect that other instances of the plugin will be running simultaneously.
The context
object that is passed to many hooks can be used to share information
about a file being worked on. Plugins must write private, plugin-specific data to
a subfolder named {options.work_folder}/ocrmypdf-plugin-name
. Plugins MAY
read and write files in options.work_folder
, but should be aware that their
semantics are subject to change.
OCRmyPDF will delete options.work_folder
when it has finished OCRing
a file, unless invoked with --keep-temporary-files
.
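The private-subfolder convention above could be wrapped in a small helper. This is only a sketch: the helper name plugin_work_dir and the folder name ocrmypdf-plugin-name are placeholders, and options is assumed to be the argparse.Namespace passed to hooks, carrying work_folder.

```python
from argparse import Namespace
from pathlib import Path


def plugin_work_dir(options: Namespace, plugin_name: str = "ocrmypdf-plugin-name") -> Path:
    """Return this plugin's private subfolder, creating it if needed.

    plugin_name is a placeholder; substitute your own plugin's name.
    """
    folder = Path(options.work_folder) / plugin_name
    # exist_ok=True because other workers may create the folder concurrently
    folder.mkdir(parents=True, exist_ok=True)
    return folder
```

Because the folder lives under options.work_folder, it would be removed together with the other temporary files unless --keep-temporary-files is given.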
The documentation for some plugin hooks contains a detailed description of the execution context in which they will be called.
Plugins should be prepared to work whether executed in worker threads or worker processes. Generally, OCRmyPDF uses processes, but has a semi-hidden threaded argument that simplifies debugging.
Plugin hooks¶
A plugin may provide the following hooks. Hooks must be decorated with
ocrmypdf.hookimpl
, for example:
from ocrmypdf import hookimpl

@hookimpl
def add_options(parser):
    pass
The following is a complete list of hooks that are available, and when they are called.
Note on firstresult hooks
If multiple plugins install implementations for this hook, they will be called in
the reverse of the order in which they are installed (i.e., last plugin wins).
When each hook implementation is called in order, the first implementation that
returns a value other than None
will “win” and prevent execution of all other
hooks. As such, you cannot “chain” a series of plugin filters together in this
way. Instead, a single hook implementation should be responsible for any such
chaining operations.
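Since only the first non-None result of a firstresult hook is used, a plugin that needs several filtering steps could compose them inside one implementation. The sketch below shows one way to do that; chain_filters and the FilterStep type are illustrative names, not part of OCRmyPDF, and any value can stand in for the image.

```python
from typing import Any, Callable, Iterable, Optional

# Each step takes (page, image) and returns a new image, or None for "no change".
FilterStep = Callable[[Any, Any], Optional[Any]]


def chain_filters(steps: Iterable[FilterStep], page: Any, image: Any) -> Any:
    """Apply every step in order, feeding each result into the next step."""
    for step in steps:
        result = step(page, image)
        if result is not None:
            image = result
    return image
```

A single hook implementation would then return chain_filters(my_steps, page, image), so the ordering of steps is decided in one place rather than by plugin installation order.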
Custom command line arguments¶
-
ocrmypdf.pluginspec.
add_options
(parser: argparse.ArgumentParser) → None¶ Allows the plugin to add its own command line and API arguments.
OCRmyPDF converts command line arguments to API arguments, so adding arguments here will cause new arguments to be processed for API calls to
ocrmypdf.ocr
, or when invoked on the command line.
Note
This hook will be called from the main process, and may modify global state before child worker processes are forked.
-
ocrmypdf.pluginspec.
check_options
(options: argparse.Namespace) → None¶ Called to ask the plugin to check all of the options.
The plugin may check if options that it added are valid.
Warnings or other messages may be passed to the user by creating a logger object using
log = logging.getLogger(__name__)
and logging to this.
The plugin may also modify the options. All objects that are in options must be picklable so they can be marshalled to child worker processes.
Raises: ocrmypdf.exceptions.ExitCodeException
– If options are not acceptable and the application should terminate gracefully with an informative message and error code.
Note
This hook will be called from the main process, and may modify global state before child worker processes are forked.
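As a rough illustration of how these two hooks might work together, the sketch below adds one option and then repairs a bad value in check_options. The option --example-scale is invented for this example. The try/except fallback is there only so the snippet can be run without OCRmyPDF installed.

```python
import argparse
import logging

try:
    from ocrmypdf import hookimpl
except ImportError:  # fallback so the sketch runs without OCRmyPDF installed
    def hookimpl(func):
        return func

log = logging.getLogger(__name__)


@hookimpl
def add_options(parser: argparse.ArgumentParser) -> None:
    # "--example-scale" is a made-up option used only for illustration.
    parser.add_argument("--example-scale", type=float, default=1.0)


@hookimpl
def check_options(options: argparse.Namespace) -> None:
    # Repair an unusable value and tell the user through the logger.
    if options.example_scale <= 0:
        log.warning("--example-scale must be positive; using 1.0")
        options.example_scale = 1.0
```

A plugin that cannot sensibly repair a value would instead raise an exception derived from ocrmypdf.exceptions.ExitCodeException, as described above.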
Applying special behavior before processing¶
-
ocrmypdf.pluginspec.
validate
(pdfinfo: PdfInfo, options: argparse.Namespace) → None¶ Called to give a plugin an opportunity to review options and pdfinfo.
options contains the “work order” to process a particular file. pdfinfo contains information about the input file obtained after loading and parsing. The plugin may modify the options. For example, you could decide that a certain type of file should be treated with
options.force_ocr = True
based on information in its pdfinfo.
Raises: ocrmypdf.exceptions.ExitCodeException
– If options or pdfinfo are not acceptable and the application should terminate gracefully with an informative message and error code.
Note
This hook will be called from the main process, and may modify global state before child worker processes are forked.
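A validate hook that adjusts the work order might look like the sketch below. It assumes that len(pdfinfo) yields the number of pages, which is not stated in this document. The page-count policy and MAX_PAGES_FOR_FORCE_OCR are invented for illustration.

```python
import argparse

try:
    from ocrmypdf import hookimpl
except ImportError:  # fallback so the sketch runs without OCRmyPDF installed
    def hookimpl(func):
        return func

MAX_PAGES_FOR_FORCE_OCR = 5  # hypothetical policy threshold


@hookimpl
def validate(pdfinfo, options: argparse.Namespace) -> None:
    # Assumption: len(pdfinfo) gives the number of pages in the input file.
    if len(pdfinfo) <= MAX_PAGES_FOR_FORCE_OCR:
        options.force_ocr = True
```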
PDF page to image¶
-
ocrmypdf.pluginspec.
rasterize_pdf_page
(input_file: pathlib.Path, output_file: pathlib.Path, raster_device: str, raster_dpi: ocrmypdf.helpers.Resolution, pageno: int, page_dpi: Optional[ocrmypdf.helpers.Resolution] = None, rotation: Optional[int] = None, filter_vector: bool = False) → pathlib.Path¶ Rasterize one page of a PDF at resolution raster_dpi in canvas units.
The image is sized to match the integer pixel dimensions implied by raster_dpi even if those numbers are noninteger. The image’s DPI will be overridden with the values in page_dpi.
Parameters: - input_file – The PDF to rasterize.
- output_file – The desired name of the rasterized image.
- raster_device – Type of image to produce at output_file
- raster_dpi – Resolution at which to rasterize page
- pageno – Page number to rasterize (beginning at page 1)
- page_dpi – Resolution, overriding output image DPI
- rotation – Cardinal angle, clockwise, to rotate page
- filter_vector – If True, remove vector graphics objects
Returns: Path – output_file if successful
Note
This hook will be called from child processes. Modifying global state will not affect the main process or other child processes.
Note
This is a firstresult hook.
Modifying intermediate images¶
-
ocrmypdf.pluginspec.
filter_ocr_image
(page: PageContext, image: Image) → Image¶ Called to filter the image before it is sent to OCR.
This is the image that OCR sees, not what the user sees when they view the PDF. If
redo_ocr
is enabled, portions of the image will be masked so they are not shown to OCR. The main use of this hook is expected to be hiding content from OCR.
The input image may be color, grayscale, or monochrome, and the output image may differ. The pixel width and height of the output image must be identical to the input image, or misalignment between the OCR text layer and visual position of the text will occur. Likewise, the output must be a faithful representation of the input, or alignment errors may occur.
Tesseract OCR only deals with monochrome images, and internally converts non-monochrome images to monochrome before OCR.
Note
This hook will be called from child processes. Modifying global state will not affect the main process or other child processes.
Note
This is a firstresult hook.
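The sketch below shows one way to hide a fixed region from OCR while keeping the pixel dimensions unchanged. It assumes image is a Pillow image, whose copy() and paste() methods are used, and that paste() accepts a color name. HIDDEN_BOX is a made-up region. The try/except fallback is there only so the snippet can be run without OCRmyPDF installed.

```python
try:
    from ocrmypdf import hookimpl
except ImportError:  # fallback so the sketch runs without OCRmyPDF installed
    def hookimpl(func):
        return func

# Hypothetical region to hide, in pixels: (left, top, right, bottom).
HIDDEN_BOX = (0, 0, 200, 100)


@hookimpl
def filter_ocr_image(page, image):
    # Work on a copy so the input image is left untouched.
    masked = image.copy()
    # Clip the box to the image bounds; the image size itself never changes.
    width, height = masked.size
    left, top, right, bottom = HIDDEN_BOX
    box = (min(left, width), min(top, height), min(right, width), min(bottom, height))
    # Assumption: Pillow's Image.paste fills the box when given a color name.
    masked.paste("white", box)
    return masked
```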
-
ocrmypdf.pluginspec.
filter_page_image
(page: PageContext, image_filename: pathlib.Path) → pathlib.Path¶ Called to filter the whole page before it is inserted into the PDF.
A whole page image is only produced when preprocessing command line arguments are issued or when
--force-ocr
is issued. If no whole page image is produced for a given page, this function will not be called. This is not the image that will be shown to OCR.
If the function does not want to modify the image, it should return
image_filename
. The hook may overwrite
image_filename
with a new file.
The output image should preserve the same physical dimensions, that is (width / dpi_x, height / dpi_y). In other words, if the image is resized, the DPI must be scaled by the same factor. If this is not preserved, the PDF page will be resized and the OCR layer misaligned. OCRmyPDF does nothing to enforce these constraints; it is up to the plugin to do sensible things.
OCRmyPDF will create the PDF page based on the image format used. If you convert the image to a JPEG, the output page will be created as a JPEG, etc. If you change the colorspace, that change will be kept. Note that the OCRmyPDF image optimization stage, if enabled, may ultimately choose a different format.
If the return value is a file that does not exist,
FileNotFoundError
will occur. The return value should be a path to a file in the same folder as
image_filename
.
Implementation detail: If the value returned is falsy, OCRmyPDF will ignore the return value and assume the input file was unmodified. This is deprecated. To leave the image unmodified,
image_filename
should be returned.
Note
This hook will be called from child processes. Modifying global state will not affect the main process or other child processes.
Note
This is a firstresult hook.
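The size constraint is simple arithmetic: the physical size in inches is the pixel count divided by the DPI, so a resized image needs its DPI scaled by the same factor as its pixel dimensions. The helper below is a sketch of that calculation; scaled_dpi is an illustrative name, not an OCRmyPDF function.

```python
from typing import Tuple


def scaled_dpi(
    old_size: Tuple[int, int],
    new_size: Tuple[int, int],
    old_dpi: Tuple[float, float],
) -> Tuple[float, float]:
    """Return the DPI that keeps the physical page size after a resize."""
    # physical size (inches) = pixels / dpi, so dpi must scale with the pixels
    return (
        old_dpi[0] * new_size[0] / old_size[0],
        old_dpi[1] * new_size[1] / old_size[1],
    )
```

For instance, halving a 2550 × 3300 pixel image scanned at 300 DPI would call for saving the result at 150 DPI, keeping the page at 8.5 × 11 inches.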
OCR engine¶
-
ocrmypdf.pluginspec.
get_ocr_engine
() → ocrmypdf.pluginspec.OcrEngine¶ Returns an OcrEngine to use for processing this file.
The OcrEngine may be instantiated multiple times, by both the main process and child processes. As such, it must store any state in
options
or some common location.
Note
This is a firstresult hook.
-
class
ocrmypdf.pluginspec.
OcrEngine
¶ A class representing an OCR engine with capabilities similar to Tesseract OCR.
This could be used to create a plugin for another OCR engine instead of Tesseract OCR.
-
__str__
()¶ Returns name of OCR engine and version.
This is used when OCRmyPDF wants to mention the name of the OCR engine to the user, usually in an error message.
-
static
creator_tag
(options: argparse.Namespace) → str¶ Returns the creator tag to identify this software’s role in creating the PDF.
This tag will be inserted in the XMP metadata and DocumentInfo dictionary as appropriate. Ideally you should include the name of the OCR engine and its version. The text should not contain line breaks. This is to help developers like yourself identify the software that produced this file.
OCRmyPDF will always prepend its name to this value.
-
static
generate_hocr
(input_file: pathlib.Path, output_hocr: pathlib.Path, output_text: pathlib.Path, options: argparse.Namespace) → None¶ Called to produce a hOCR file and sidecar text file.
-
static
generate_pdf
(input_file: pathlib.Path, output_pdf: pathlib.Path, output_text: pathlib.Path, options: argparse.Namespace) → None¶ Called to produce a text only PDF.
Parameters: - input_file – A page image on which to perform OCR.
- output_pdf – The expected name of the output PDF, which must be a single page PDF with no visible content of any kind, sized to the dimensions implied by the input_file’s width, height and DPI. The image will be grafted onto the input PDF page.
-
static
get_orientation
(input_file: pathlib.Path, options: argparse.Namespace) → ocrmypdf.pluginspec.OrientationConfidence¶ Returns the orientation of the image.
-
static
languages
(options: argparse.Namespace) → AbstractSet[str]¶ Returns the set of all languages that are supported by the engine.
Languages are typically given as 3-letter ISO 639-2 codes, but can actually be any value understood by the OCR engine.
-
static
version
() → str¶ Returns the version of the OCR engine.
-
class
ocrmypdf.pluginspec.
OrientationConfidence
(angle, confidence)¶ Expresses an OCR engine’s confidence in page rotation.
-
angle
¶ The clockwise angle (0, 90, 180, 270) that the page should be rotated. 0 means no rotation.
Type: int
-
confidence
¶ How confident the OCR engine is that this is the correct rotation. 0 is not confident, 15 is very confident. Arbitrary units.
Type: float
PDF/A production¶
-
ocrmypdf.pluginspec.
generate_pdfa
(pdf_pages: List[pathlib.Path], pdfmark: pathlib.Path, output_file: pathlib.Path, compression: str, pdf_version: str, pdfa_part: str) → pathlib.Path¶ Generate a PDF/A.
This API strongly assumes a PDF/A generator with Ghostscript’s semantics.
OCRmyPDF will modify the metadata and possibly linearize the PDF/A after it is generated.
Parameters: - pdf_pages – A list of one or more filenames, will be merged into output_file.
- pdfmark – A PostScript file intended for Ghostscript with details on how to perform the PDF/A conversion.
- output_file – The name of the desired output file.
- compression – One of 'jpeg', 'lossless', ''. For 'jpeg', the PDF/A generator should convert all images to JPEG encoding where possible. For 'lossless', all images should be converted to FlateEncode (lossless PNG). If an empty string, the PDF generator should make its own decisions about how to encode images.
- pdf_version – The minimum PDF version that the output file should be. At its own discretion, the PDF/A generator may raise the version, but should not lower it.
- pdfa_part – The desired PDF/A compliance level, such as
'2B'
.
Returns: Path – If successful, the hook should return
output_file
.
Note
This is a firstresult hook.