Plugins
You can use plugins to customize the behavior of OCRmyPDF at certain points of interest.
Currently, it is possible to:
- add new command line arguments
- override the decision for whether or not to perform OCR on a particular file
- modify the image that is about to be sent for OCR
- modify the page image before it is converted to PDF
OCRmyPDF plugins are based on the Python pluggy package and conform to its
conventions. Note that plugins installed as setuptools entry points are
currently not checked, because OCRmyPDF assumes you may not want to enable
plugins for all files. Also, plugins must be functions, not classes.
How plugins are imported
Plugins are imported on demand, by the OCRmyPDF worker process that needs to use
them. As such, plugins cannot share state with other plugins, cannot rely on
their module’s or the interpreter’s global state, and should expect asynchronous
copies of themselves to be running. Plugins can write intermediate files to the
folder specified in options.work_folder.
Plugins should work whether executed in threads or processes.
Script plugins
Script plugins may be called from the command line, by specifying the name of a file.
ocrmypdf --plugin example_plugin.py input.pdf output.pdf
Multiple plugins may be called by issuing the --plugin argument multiple times.
Packaged plugins
Plugins may also be installed as packages into the same virtual environment as OCRmyPDF. They are invoked using standard Python module naming:
ocrmypdf --plugin ocrmypdf_fancypants.pockets.contents input.pdf output.pdf
OCRmyPDF does not automatically import plugins, because the assumption is that
plugins affect different files differently and you may not want them activated
all the time. The command line or ocrmypdf.ocr(plugin='...') must call for them.
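Because plugins are only loaded when requested, any wrapper script has to name each plugin explicitly. The following is a minimal sketch of assembling such a command line; `build_ocrmypdf_command` is a hypothetical helper (not part of OCRmyPDF) and relies only on the `--plugin` argument described above.

```python
from typing import List, Sequence


def build_ocrmypdf_command(
    input_pdf: str, output_pdf: str, plugins: Sequence[str] = ()
) -> List[str]:
    """Return an ocrmypdf command line that names each plugin explicitly.

    Hypothetical helper for illustration; not part of OCRmyPDF.
    """
    command = ["ocrmypdf"]
    for plugin in plugins:
        # Each entry may be a script file name or a dotted module name.
        command += ["--plugin", plugin]
    command += [input_pdf, output_pdf]
    return command


if __name__ == "__main__":
    cmd = build_ocrmypdf_command(
        "input.pdf",
        "output.pdf",
        plugins=["example_plugin.py", "ocrmypdf_fancypants.pockets.contents"],
    )
    print(" ".join(cmd))
    # To execute it, pass the list to subprocess.run(cmd, check=True);
    # that requires ocrmypdf to be installed and on PATH.
```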
Third parties that wish to distribute plugins for OCRmyPDF should package them
as packaged plugins, and these modules should begin with the name ocrmypdf_,
similar to pytest packages such as pytest-cov (the package) and pytest_cov
(the module).
Plugin hooks
A plugin may provide the following hooks. Hooks should be decorated with
ocrmypdf.hookimpl, for example:
from ocrmypdf import hookimpl

@hookimpl
def add_options(parser):
    pass
The following is a complete list of hooks that may be installed and when they are called.
class ocrmypdf.pluginspec.OcrEngine

    static creator_tag(options: argparse.Namespace) → str
        Returns the creator tag to identify this software’s role in creating the PDF.

    static generate_hocr(input_file: pathlib.Path, output_hocr: pathlib.Path, output_text: pathlib.Path, options: argparse.Namespace) → None
        Called to produce a hOCR file and sidecar text file.

    static generate_pdf(input_file: pathlib.Path, output_pdf: pathlib.Path, output_text: pathlib.Path, options: argparse.Namespace) → None
        Called to produce a text only PDF.
        Parameters:
        - input_file – A page image on which to perform OCR.
        - output_pdf – The expected name of the output PDF, which must be a single page PDF with no visible content of any kind, sized to the dimensions implied by the input_file’s width, height and DPI. The image will be grafted onto the input PDF page.

    static get_orientation(input_file: pathlib.Path, options: argparse.Namespace) → ocrmypdf.pluginspec.OrientationConfidence
        Returns the orientation of the image.

    static languages(options: argparse.Namespace) → AbstractSet[str]
        Returns the set of all languages that are supported by the engine.
        Languages are typically given as 3-letter ISO 639-2 codes, but can actually be any value understood by the OCR engine.

    static version() → str
        Returns the version of the OCR engine.

class ocrmypdf.pluginspec.OrientationConfidence(angle, confidence)
    angle
        Alias for field number 0
    confidence
        Alias for field number 1

ocrmypdf.pluginspec.add_options(parser: argparse.ArgumentParser) → None
    Allows the plugin to add its own command line and API arguments.
    OCRmyPDF converts command line arguments to API arguments, so adding arguments here will cause new arguments to be processed for API calls to ocrmypdf.ocr, or when invoked on the command line.
    Note
    This hook will be called from the main process, and may modify global state before child worker processes are forked.

ocrmypdf.pluginspec.check_options(options: argparse.Namespace) → None
    Called to ask the plugin to check all of the options.
    The plugin may check if options that it added are valid.
    Warnings or other messages may be passed to the user by creating a logger object using log = logging.getLogger(__name__) and logging to this.
    The plugin may also modify the options. All objects that are in options must be picklable so they can be marshalled to child worker processes.
    Raises: ocrmypdf.exceptions.ExitCodeException – If options are not acceptable and the application should terminate gracefully with an informative message and error code.
    Note
    This hook will be called from the main process, and may modify global state before child worker processes are forked.
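
The two hooks above are typically used together: one registers an argument, the other validates it. The following is a minimal sketch under stated assumptions: `--ocr-contrast` is a made-up option used only for illustration, and the `ImportError` fallback is a stand-in so the sketch can be exercised without OCRmyPDF installed.

```python
import argparse
import logging

try:
    from ocrmypdf import hookimpl
except ImportError:
    # Stand-in decorator so the sketch runs without OCRmyPDF installed.
    def hookimpl(func):
        return func

log = logging.getLogger(__name__)


@hookimpl
def add_options(parser):
    # --ocr-contrast is a hypothetical option, not a real OCRmyPDF argument.
    parser.add_argument("--ocr-contrast", type=float, default=1.0)


@hookimpl
def check_options(options):
    # Replace an out-of-range value and warn, rather than aborting.
    if options.ocr_contrast <= 0:
        log.warning("--ocr-contrast must be positive; using 1.0 instead")
        options.ocr_contrast = 1.0


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    add_options(parser)
    options = parser.parse_args(["--ocr-contrast=-2"])
    check_options(options)
    print(options.ocr_contrast)
```

A plugin that considers the value fatal could instead raise ocrmypdf.exceptions.ExitCodeException, as described above. The stored value is a plain float, so it remains picklable.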

ocrmypdf.pluginspec.filter_ocr_image(page: PageContext, image: PIL.Image) → PIL.Image
    Called to filter the image before it is sent to OCR.
    This is the image that OCR sees, not what the user sees when they view the PDF.
    Note
    This hook will be called from child processes. Modifying global state will not affect the main process or other child processes.
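
As an illustration of this hook, the sketch below converts the OCR image to grayscale when a plugin-defined option is set. It assumes that the page context exposes the job options as `page.options` and that `grayscale_ocr` is an option the plugin itself registered through add_options; both names are assumptions for this example. `convert("L")` is Pillow's standard call for producing an 8-bit grayscale image.

```python
try:
    from ocrmypdf import hookimpl
except ImportError:
    # Stand-in decorator so the sketch runs without OCRmyPDF installed.
    def hookimpl(func):
        return func


@hookimpl
def filter_ocr_image(page, image):
    # grayscale_ocr is a hypothetical option; default to "off" if absent.
    if getattr(page.options, "grayscale_ocr", False):
        return image.convert("L")  # Pillow: 8-bit grayscale copy
    # Returning the input unchanged leaves the OCR image as it was.
    return image
```

Only the image handed to the OCR engine is affected; the page the user sees in the output PDF is not.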

ocrmypdf.pluginspec.filter_page_image(page: PageContext, image_filename: pathlib.Path) → pathlib.Path
    Called to filter the whole page before it is inserted into the PDF.
    A whole page image is only produced when preprocessing command line arguments are issued or when --force-ocr is issued. If no whole page image is produced for a given page, this function will not be called. This is not the image that will be shown to OCR.
    ocrmypdf will create the PDF page based on the image format used. If you convert the image to a JPEG, the output page will be created as a JPEG, etc. Note that the ocrmypdf image optimization stage may ultimately choose a different format.
    Note
    This hook will be called from child processes. Modifying global state will not affect the main process or other child processes.

ocrmypdf.pluginspec.generate_pdfa(pdf_pages: List[pathlib.Path], pdfmark: pathlib.Path, output_file: pathlib.Path, compression: str, pdf_version: str, pdfa_part: str) → pathlib.Path
    Generate a PDF/A.
    This API strongly assumes a PDF/A generator with Ghostscript’s semantics.
    OCRmyPDF will modify the metadata and possibly linearize the PDF/A after it is generated.
    Parameters:
    - pdf_pages – A list of one or more filenames, which will be merged into output_file.
    - pdfmark – A PostScript file intended for Ghostscript with details on how to perform the PDF/A conversion.
    - output_file – The name of the desired output file.
    - compression – One of 'jpeg', 'lossless', ''. For 'jpeg', the PDF/A generator should convert all images to JPEG encoding where possible. For 'lossless', all images should be converted to FlateEncode (lossless PNG). If an empty string, the PDF/A generator should make its own decisions about how to encode images.
    - pdf_version – The minimum PDF version that the output file should be. At its own discretion, the PDF/A generator may raise the version, but should not lower it.
    - pdfa_part – The desired PDF/A compliance level, such as '2B'.
    Returns: If successful, the hook should return output_file.
    Return type: pathlib.Path
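
For a generator that wraps Ghostscript, the compression parameter could be translated into distiller parameters roughly as sketched below. `ghostscript_compression_args` is a hypothetical helper, not an OCRmyPDF function; the switches it emits (AutoFilterColorImages, ColorImageFilter, and their Gray counterparts) are standard Ghostscript pdfwrite parameters, but this is an assumption about one reasonable mapping rather than a statement of what OCRmyPDF itself does.

```python
from typing import List


def ghostscript_compression_args(compression: str) -> List[str]:
    """Map the hook's compression value to Ghostscript switches (sketch)."""
    if compression == "":
        # Let the generator choose an encoding for each image.
        return ["-dAutoFilterColorImages=true", "-dAutoFilterGrayImages=true"]
    if compression == "jpeg":
        image_filter = "/DCTEncode"
    elif compression == "lossless":
        image_filter = "/FlateEncode"
    else:
        raise ValueError(f"unsupported compression value: {compression!r}")
    # Disable automatic selection and force one filter for color and gray.
    return [
        "-dAutoFilterColorImages=false",
        f"-dColorImageFilter={image_filter}",
        "-dAutoFilterGrayImages=false",
        f"-dGrayImageFilter={image_filter}",
    ]


if __name__ == "__main__":
    for value in ("jpeg", "lossless", ""):
        print(repr(value), ghostscript_compression_args(value))
```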

ocrmypdf.pluginspec.get_ocr_engine() → ocrmypdf.pluginspec.OcrEngine
    Returns an OcrEngine to use for processing this file.
    The OcrEngine may be instantiated multiple times, by both the main process and child processes. As such, it must obtain or store any state in options or some common location, rather than in the instance itself.
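
To show how this hook and the OcrEngine interface fit together, here is a skeleton engine that covers the static methods listed for OcrEngine above. `NullOcrEngine`, its version string and its language set are invented for illustration, and the two generate methods are left unimplemented because a real engine must write actual hOCR or PDF output there. The `ImportError` fallback defines stand-ins so the skeleton can be exercised without OCRmyPDF installed.

```python
from collections import namedtuple

try:
    from ocrmypdf import hookimpl
    from ocrmypdf.pluginspec import OcrEngine, OrientationConfidence
except ImportError:
    # Stand-ins so the skeleton runs without OCRmyPDF installed.
    def hookimpl(func):
        return func

    class OcrEngine:
        pass

    OrientationConfidence = namedtuple("OrientationConfidence", "angle confidence")


class NullOcrEngine(OcrEngine):
    """Hypothetical engine skeleton; it keeps no state on the instance."""

    @staticmethod
    def version():
        return "0.1"

    @staticmethod
    def creator_tag(options):
        return "NullOCR " + NullOcrEngine.version()

    @staticmethod
    def languages(options):
        return {"eng"}

    @staticmethod
    def get_orientation(input_file, options):
        # Report angle 0 with confidence 0.0 (fields 0 and 1 of the tuple).
        return OrientationConfidence(angle=0, confidence=0.0)

    @staticmethod
    def generate_hocr(input_file, output_hocr, output_text, options):
        raise NotImplementedError("write output_hocr and output_text here")

    @staticmethod
    def generate_pdf(input_file, output_pdf, output_text, options):
        raise NotImplementedError("write a text-only, single-page PDF here")


@hookimpl
def get_ocr_engine():
    # A fresh instance per call is fine because the engine is stateless.
    return NullOcrEngine()
```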

ocrmypdf.pluginspec.rasterize_pdf_page(input_file: pathlib.Path, output_file: pathlib.Path, raster_device: str, raster_dpi: ocrmypdf.helpers.Resolution, pageno: int, page_dpi: Optional[ocrmypdf.helpers.Resolution] = None, rotation: Optional[int] = None, filter_vector: bool = False) → pathlib.Path
    Rasterize one page of a PDF at resolution raster_dpi in canvas units.
    The image is sized to match the integer pixel dimensions implied by raster_dpi even if those numbers are noninteger. The image’s DPI will be overridden with the values in page_dpi.
    Parameters:
    - input_file – The PDF to rasterize.
    - output_file – The desired name of the rasterized image.
    - raster_device – Type of image to produce at output_file
    - raster_dpi – Resolution at which to rasterize page
    - pageno – Page number to rasterize (beginning at page 1)
    - page_dpi – Resolution, overriding output image DPI
    - rotation – Cardinal angle, clockwise, to rotate page
    - filter_vector – If True, remove vector graphics objects
    Returns: output_file
    Note
    This hook will be called from child processes. Modifying global state will not affect the main process or other child processes.

ocrmypdf.pluginspec.validate(pdfinfo: PdfInfo, options: argparse.Namespace) → None
    Called to give a plugin an opportunity to review options and pdfinfo.
    options contains the “work order” to process a particular file. pdfinfo contains information about the input file obtained after loading and parsing. The plugin may modify the options. For example, you could decide that a certain type of file should be treated with options.force_ocr = True based on information in its pdfinfo.
    Raises: ocrmypdf.exceptions.ExitCodeException – If options or pdfinfo are not acceptable and the application should terminate gracefully with an informative message and error code.
    Note
    This hook will be called from the main process, and may modify global state before child worker processes are forked.
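
The force_ocr example mentioned above might look like the following sketch. It assumes that `len(pdfinfo)` yields the page count of the input file, and the threshold `SHORT_DOCUMENT_PAGES` is an arbitrary value chosen purely for illustration; a real plugin would base the decision on whatever property of the file matters to it.

```python
import argparse

try:
    from ocrmypdf import hookimpl
except ImportError:
    # Stand-in decorator so the sketch runs without OCRmyPDF installed.
    def hookimpl(func):
        return func

# Arbitrary illustrative threshold, not an OCRmyPDF setting.
SHORT_DOCUMENT_PAGES = 2


@hookimpl
def validate(pdfinfo, options):
    # Assumption: len(pdfinfo) is the number of pages in the input file.
    if len(pdfinfo) <= SHORT_DOCUMENT_PAGES:
        options.force_ocr = True


if __name__ == "__main__":
    options = argparse.Namespace(force_ocr=False)
    validate(["page 1"], options)  # a list stands in for a one-page PdfInfo
    print(options.force_ocr)
```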