Plugins

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

You can use plugins to customize the behavior of OCRmyPDF at certain points of interest.

Currently, it is possible to:

add new command line arguments
override the decision for whether or not to perform OCR on a particular file
modify the image that is about to be sent for OCR
modify the page image before it is converted to PDF
replace the Tesseract OCR with another OCR engine that has similar behavior
replace Ghostscript with another PDF to image converter (rasterizer) or PDF/A generator

OCRmyPDF plugins are based on the Python pluggy package and conform to its conventions. Note that plugins installed as setuptools entrypoints are not checked currently, because OCRmyPDF assumes you may not want to enable plugins for all files.

See [OCRmyPDF-EasyOCR](https://github.com/ocrmypdf/OCRmyPDF-EasyOCR) for an example of a straightforward, fully working plugin.

Script plugins

Script plugins may be called from the command line by specifying the name of a file. Script plugins may be convenient for informal or “one-off” plugins, for example when a certain batch of files needs a special processing step.

ocrmypdf --plugin ocrmypdf_example_plugin.py input.pdf output.pdf

Multiple plugins may be installed by issuing the --plugin argument multiple times.

Packaged plugins

Packaged plugins may be installed into the same virtual environment as OCRmyPDF. They may be invoked using standard Python module naming. If you intend to distribute a plugin, please package it.

ocrmypdf --plugin ocrmypdf_fancypants.pockets.contents input.pdf output.pdf

OCRmyPDF does not automatically import plugins, because the assumption is that plugins affect different files differently and you may not want them activated all the time. The command line or ocrmypdf.ocr(plugin='...') must call for them.

Third parties that wish to distribute packages for ocrmypdf should package them as packaged plugins, and these modules should begin with the name ocrmypdf_ similar to pytest packages such as pytest-cov (the package) and pytest_cov (the module).

Note

We recommend plugin authors name their plugins with the prefix ocrmypdf- (for the package name on PyPI) and ocrmypdf_ (for the module), just like pytest plugins. At the same time, please make it clear that your package is not official.

Entrypoint plugins

You can also create a plugin that OCRmyPDF will always automatically load if both are installed in the same virtual environment, using a project entrypoint. OCRmyPDF uses the entrypoint namespace “ocrmypdf”.

For example, pyproject.toml would need to contain the following, for a plugin named ocrmypdf-exampleplugin:

[project]
name = "ocrmypdf-exampleplugin"

[project.entry-points."ocrmypdf"]
exampleplugin = "exampleplugin.pluginmodule"

Plugin requirements

OCRmyPDF generally uses multiple worker processes. When a new worker is started, Python will import all plugins again, including all plugins that were imported earlier. This means that the global state of a plugin in one worker will not be shared with other workers. As such, plugin hook implementations should be stateless, relying only on their inputs. Hook implementations may use their input parameters to obtain a reference to shared state prepared by another hook implementation. Plugins must expect that other instances of the plugin will be running simultaneously.

The context object that is passed to many hooks can be used to share information about a file being worked on. Plugins must write private, plugin-specific data to a subfolder named {options.work_folder}/ocrmypdf-plugin-name. Plugins MAY read and write files in options.work_folder, but should be aware that their semantics are subject to change.

OCRmyPDF will delete options.work_folder when it has finished OCRing a file, unless invoked with --keep-temporary-files.
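As an illustration of the work folder rule, the following sketch builds a private subfolder from options.work_folder and uses a file in it to pass state between hook calls. The plugin name ocrmypdf-exampleplugin and the helper private_folder are hypothetical, and a SimpleNamespace stands in for the real options object.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace

PLUGIN_NAME = "ocrmypdf-exampleplugin"  # hypothetical plugin name


def private_folder(options) -> Path:
    """Return (and create if needed) this plugin's private subfolder."""
    folder = Path(options.work_folder) / PLUGIN_NAME
    # exist_ok: several workers may try to create the folder concurrently
    folder.mkdir(parents=True, exist_ok=True)
    return folder


# Stand-in for the options object that OCRmyPDF passes to hooks
with tempfile.TemporaryDirectory() as tmp:
    options = SimpleNamespace(work_folder=tmp)
    folder = private_folder(options)
    (folder / "state.txt").write_text("shared between hooks")
    folder_name = folder.name
    # A later hook call rebuilds the same path from its own inputs
    state = (private_folder(options) / "state.txt").read_text()
```

Because the path is derived only from the hook's inputs, the helper gives the same answer in every worker without relying on module-level globals.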

The documentation for some plugin hooks contains a detailed description of the execution context in which they will be called.

Plugins should be prepared to work whether executed in worker threads or worker processes. Generally, OCRmyPDF uses processes, but has a semi-hidden threaded argument that simplifies debugging.

Plugin hooks

A plugin may provide the following hooks. Hooks must be decorated with ocrmypdf.hookimpl, for example:

from ocrmypdf import hookimpl

@hookimpl
def add_options(parser):
    pass

The following is a complete list of hooks that are available, and when they are called.

Note on firstresult hooks

If multiple plugins install implementations for this hook, they will be called in the reverse of the order in which they are installed (i.e., last plugin wins). When each hook implementation is called in order, the first implementation that returns a value other than None will “win” and prevent execution of all other hooks. As such, you cannot “chain” a series of plugin filters together in this way. Instead, a single hook implementation should be responsible for any such chaining operations.
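The dispatch rule described above can be simulated in a few lines of plain Python. This is only a model of the behavior, not pluggy's implementation. The function call_firstresult and the toy plugins are hypothetical, and strings stand in for images.

```python
def call_firstresult(implementations, *args):
    """Model of firstresult dispatch.

    The most recently registered implementation runs first, and the first
    result that is not None wins.
    """
    for impl in reversed(implementations):
        result = impl(*args)
        if result is not None:
            return result
    return None


def denoise(image):  # hypothetical filter step
    return image + "+denoise"


def sharpen(image):  # hypothetical filter step
    return image + "+sharpen"


def plugin_sharpen(image):
    return sharpen(image)


def plugin_declines(image):
    return None  # defers to plugins registered earlier


def plugin_chain(image):
    # Chaining must happen inside a single implementation
    return sharpen(denoise(image))


declined = call_firstresult([plugin_sharpen, plugin_declines], "img")
chained = call_firstresult([plugin_sharpen, plugin_chain], "img")
```

In the second call, plugin_sharpen never runs because plugin_chain was registered later and returned a value.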

Examples

OCRmyPDF’s test suite contains several plugins that are used to simulate certain test conditions.
ocrmypdf-papermerge is a production plugin that integrates OCRmyPDF and the Papermerge document management system.

Suppressing or overriding other plugins

ocrmypdf.pluginspec.initialize(plugin_manager: PluginManager) → None

Called when this plugin is first loaded into OCRmyPDF.

The primary intended use of this is for plugins to check compatibility with other plugins and possibly block other plugins. For example, a plugin that wishes to block ocrmypdf’s built-in optimize plugin could do:

plugin_manager.set_blocked('ocrmypdf.builtin_plugins.optimize')

It would also be reasonable for a plugin implementation to check if it is unable to proceed, for example, because a required dependency is missing. (If the plugin’s ability to proceed depends on options and arguments, use validate instead.)

Raises: ocrmypdf.exceptions.ExitCodeException – If options are not acceptable and the application should terminate gracefully with an informative message and error code.

Note

This hook will be called from the main process, and may modify global state before child worker processes are forked.
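A minimal sketch of an initialize hook that combines both uses might look like the following. The dependency name is a placeholder (json stands in for a real third-party module), and the fallback definitions exist only so the sketch can be read and run without OCRmyPDF installed.

```python
import importlib.util

try:
    from ocrmypdf import hookimpl
    from ocrmypdf.exceptions import ExitCodeException
except ImportError:  # fallbacks so the sketch runs without OCRmyPDF
    def hookimpl(func):
        return func

    class ExitCodeException(Exception):
        pass

REQUIRED_MODULE = "json"  # placeholder for a real third-party dependency


@hookimpl
def initialize(plugin_manager):
    # Refuse to load if the dependency this plugin needs is missing
    if importlib.util.find_spec(REQUIRED_MODULE) is None:
        raise ExitCodeException(
            f"The example plugin requires the module {REQUIRED_MODULE!r}"
        )
    # This hypothetical plugin supplies its own optimizer, so block the
    # built-in one
    plugin_manager.set_blocked("ocrmypdf.builtin_plugins.optimize")
```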

Custom command line arguments

ocrmypdf.pluginspec.add_options(parser: ArgumentParser) → None

Allows the plugin to add its own command line and API arguments.

OCRmyPDF converts command line arguments to API arguments, so adding arguments here will cause new arguments to be processed for API calls to ocrmypdf.ocr, or when invoked on the command line.

Note

This hook will be called from the main process, and may modify global state before child worker processes are forked.
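As a sketch, a plugin could register one option with standard argparse calls. The option name --example-threshold and its group are hypothetical. The fallback hookimpl exists only so the sketch runs without OCRmyPDF installed.

```python
import argparse

try:
    from ocrmypdf import hookimpl
except ImportError:  # fallback so the sketch runs without OCRmyPDF
    def hookimpl(func):
        return func


@hookimpl
def add_options(parser):
    # Grouping keeps the plugin's options together in --help output
    group = parser.add_argument_group(
        "Example plugin", "Options for a hypothetical example plugin"
    )
    group.add_argument(
        "--example-threshold",
        type=float,
        default=0.5,
        help="Hypothetical threshold between 0 and 1",
    )


# Stand-in for the parser that OCRmyPDF would pass to the hook
parser = argparse.ArgumentParser()
add_options(parser)
args = parser.parse_args(["--example-threshold", "0.8"])
```

argparse converts the dashes to underscores, so the value is available as args.example_threshold.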

ocrmypdf.pluginspec.check_options(options: OcrOptions) → None

Called to ask the plugin to check all of the options.

The plugin may check if options that it added are valid.

Warnings or other messages may be passed to the user by creating a logger object using log = logging.getLogger(__name__) and logging to this.

The plugin may also modify the options. All objects that are in options must be picklable so they can be marshalled to child worker processes.

Raises: ocrmypdf.exceptions.ExitCodeException – If options are not acceptable and the application should terminate gracefully with an informative message and error code.

Note

This hook will be called from the main process, and may modify global state before child worker processes are forked.
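A check_options implementation might validate a value, warn through a logger, and raise ExitCodeException when the value is unacceptable. In this sketch the option example_threshold is hypothetical, and the fallback definitions exist only so the code runs without OCRmyPDF installed.

```python
import logging
from types import SimpleNamespace

try:
    from ocrmypdf import hookimpl
    from ocrmypdf.exceptions import ExitCodeException
except ImportError:  # fallbacks so the sketch runs without OCRmyPDF
    def hookimpl(func):
        return func

    class ExitCodeException(Exception):
        pass

log = logging.getLogger(__name__)


@hookimpl
def check_options(options):
    # 'example_threshold' is a hypothetical option added by this plugin
    threshold = getattr(options, "example_threshold", None)
    if threshold is None:
        return
    if not 0.0 <= threshold <= 1.0:
        raise ExitCodeException("--example-threshold must be between 0 and 1")
    if threshold == 0.0:
        log.warning("--example-threshold 0 disables the example filter")


# Stand-in for the options object; an acceptable value passes silently
check_options(SimpleNamespace(example_threshold=0.5))
```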

Plugin option models

Plugins can define their own option models using Pydantic. This allows plugins to:

Define type-safe option structures with validation
Add CLI arguments that map to their option model fields
Access options via nested namespaces (e.g., options.tesseract.timeout)

ocrmypdf.pluginspec.register_options() → dict[str, type[BaseModel]]

Return plugin’s option models keyed by namespace.

This hook allows plugins to register their option models with the plugin option registry. The returned dictionary should map namespace strings to Pydantic model classes.

Returns: Dictionary mapping namespace strings to BaseModel classes

Example

@hookimpl
def register_options():
    return {'tesseract': TesseractOptions}

Note

This hook will be called from the main process during plugin infrastructure setup, before child worker processes are forked.

Plugin options can be accessed in two ways:

Flat access (backward compatible): options.tesseract_timeout
Nested access: options.tesseract.timeout

Both access patterns are equivalent and return the same values.

Note

Plugin Interface Change: Starting in OCRmyPDF v17.0.0, plugin hooks receive OcrOptions objects instead of argparse.Namespace objects. Most plugins will continue working due to duck-typing compatibility, but plugin developers should update their type hints accordingly.

Migration guide for plugin developers

Added in version 17.0.0.

Update imports:

from ocrmypdf._options import OcrOptions

Update type hints:

# Before (v16 and earlier)
def check_options(options: argparse.Namespace) -> None:
    ...

# After (v17+)
def check_options(options: OcrOptions) -> None:
    ...

Attribute access unchanged:

# These work exactly as before
options.languages
options.output_type
options.tesseract_timeout

Remove in-place modifications:

# Before (v16 pattern - no longer recommended)
def check_options(options):
    options.some_computed_value = compute_value(options)

# After (v17 pattern - compute at point of use)
def some_function(options):
    computed = compute_value(options)
    use_computed(computed)

Execution and progress reporting

class ocrmypdf.pluginspec.ProgressBar(*, total: int | float | None, desc: str | None, unit: str | None, disable: bool = False, **kwargs)

The protocol that OCRmyPDF expects progress bar classes to be compatible with.

In practice this could be used for any type of monitoring, not just a progress bar.

Calling the class should return a new progress bar object, which is activated with __enter__ and terminated with __exit__. An update method is called whenever the progress bar is updated. Progress bar objects will not be reused; a new one will be created for each group of tasks.

The progress bar is held in the main process/thread and not updated by child process/threads. When a child notifies the parent of completed work, the parent updates the progress bar. Progress bars should never write to sys.stdout, or they will corrupt the output if OCRmyPDF writes a PDF to standard output.

Note

The type of events that OCRmyPDF reports to a progress bar may change in minor releases.

Parameters:

total (int | float | None) – The total number of work units expected. If None, the total is unknown. For example, if you are processing pages, this might be the number of pages, or if you are measuring overall progress in percent, this might be 100.
desc (str | None) – A brief description of the current step (e.g. “Scanning contents”, “OCR”, “PDF/A conversion”). OCRmyPDF updates this before each major step.
unit (str | None) – A short label for the type of work being tracked (e.g. “page”, “%”, “image”).
disable (bool) – If True, progress updates are suppressed (no output). Defaults to False.
**kwargs – Future or extra parameters that OCRmyPDF might pass. Implementations should accept and ignore unrecognized keywords gracefully.

Example

A simple plugin implementation could look like this:

import sys

from ocrmypdf import hookimpl
from ocrmypdf.pluginspec import ProgressBar

class ConsoleProgressBar(ProgressBar):
    def __init__(self, *, total=None, desc=None, unit=None, disable=False,
                 **kwargs):
        self.total = total
        self.desc = desc
        self.unit = unit
        self.disable = disable
        self.current = 0

    def _report(self, message):
        # Write to stderr: stdout may carry the output PDF
        if not self.disable:
            print(message, file=sys.stderr)

    def __enter__(self):
        self._report(
            f"Starting {self.desc or 'an OCR task'} "
            f"(total={self.total} {self.unit})"
        )
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        if exc_type is None:
            self._report("Completed successfully.")
        else:
            self._report(f"Task ended with error: {exc_value}")
        return False  # Let OCRmyPDF raise any exceptions

    def update(self, n=1, *, completed=None):
        if completed is not None:
            # An absolute position was reported; use it directly
            self.current = completed
        else:
            # Otherwise, increment by 'n'
            self.current += n
        if self.total:
            percent = (self.current / self.total) * 100
            self._report(
                f"{self.desc}: {self.current}"
                f"/{self.total} ({percent:.1f}%)"
            )
        else:
            self._report(f"{self.desc}: {self.current} units done")

@hookimpl
def get_progressbar_class():
    return ConsoleProgressBar

__enter__(): Enter a progress bar context.

__exit__(*args): Exit a progress bar context.

__init__(*, total: int | float | None, desc: str | None, unit: str | None, disable: bool = False, **kwargs)

Initialize a progress bar.

This is called once before any work is done. OCRmyPDF supplies the total number of units (or None if unknown), a description of the work, and the type of units. The disable parameter can be used to turn off progress reporting. Unrecognized keyword arguments should be ignored.

Parameters:

total (int | float | None) – The total amount of work. If None, the total is unknown.
desc (str | None) – A description of the current task. May change for different stages.
unit (str | None) – A short label for the unit of work.
disable (bool) – If True, no output or logging should be displayed.
**kwargs – Extra parameters that may be passed by OCRmyPDF in future versions.

update(n: float = 1, *, completed: float | None = None)

Increment the progress bar by n units, or set an absolute completion.

OCRmyPDF calls this method repeatedly while processing pages or other tasks. If your total is known and you track it, you might do something like:

self.current += n
percent = (self.current / self.total) * 100

The completed argument can indicate an absolute position, which is particularly helpful if you’re tracking a percentage of work (e.g., 0 to 100) and want precise updates. In contrast, the incremental parameter n is often more useful for page-based increments.

Parameters:

n (float, optional) – The amount to increment the progress by. Defaults to 1. May be fractional if OCRmyPDF performs partial steps. If you are tracking pages, this is typically how many pages have been processed in the most recent step.
completed (float | None, optional) – The absolute amount of work completed so far. This can override or supplement the simple increment logic. It’s particularly useful for percentage-based tracking (e.g., when total is 100).
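The two update styles can be compared with a small in-memory tracker. This is a sketch: the class name ProgressTracker is hypothetical, and treating completed as replacing the running count is one possible interpretation of “override or supplement”.

```python
class ProgressTracker:
    """Silent progress object that only records how far the work has come."""

    def __init__(self, *, total=None, desc=None, unit=None, disable=False,
                 **kwargs):
        self.total = total
        self.current = 0.0

    def __enter__(self):
        return self

    def __exit__(self, *args):
        return False  # never swallow exceptions

    def update(self, n=1, *, completed=None):
        if completed is not None:
            self.current = completed  # absolute position
        else:
            self.current += n  # relative increment

    @property
    def fraction(self):
        # None when the total is unknown (or zero)
        return self.current / self.total if self.total else None


# Page-based tracking uses increments
with ProgressTracker(total=4, unit="page") as pages:
    pages.update()
    pages.update(2)
    page_fraction = pages.fraction

# Percentage-based tracking uses absolute positions
with ProgressTracker(total=100, unit="%") as percent:
    percent.update(completed=40)
    percent.update(completed=65)
    percent_fraction = percent.fraction
```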

class ocrmypdf.pluginspec.Executor(*, pbar_class=None)

Abstract concurrent executor.

__call__(*, use_threads: bool, max_workers: int, progress_kwargs: dict, worker_initializer: Callable | None = None, task: Callable[[...], T] | None = None, task_arguments: Iterable | None = None, task_finished: Callable[[T, ProgressBar], None] | None = None) → None

Set up parallel execution and progress reporting.

Parameters:

use_threads – If False, the workload is the sort that will benefit from running in a multiprocessing context (for example, it uses Python heavily, and parallelizing it with threads is not expected to be performant).
max_workers – The maximum number of workers that should be run.
progress_kwargs – Arguments to set up the progress bar.
worker_initializer – Called when a worker is initialized, in the worker’s execution context. If the child workers are processes, it must be possible to marshall/pickle the worker initializer. functools.partial can be used to bind parameters.
task – Called when the worker starts a new task, in the worker’s execution context. Must be possible to marshall to the worker.
task_finished – Called when a worker finishes a task, in the parent’s context.
task_arguments – An iterable that generates a group of parameters for each task. This runs in the parent’s context, but the parameters must be marshallable to the worker.

pbar_class: alias of NullProgressBar

ocrmypdf.pluginspec.get_logging_console() → Handler

Returns a custom logging handler.

Generally this is necessary when logging output and a progress bar are both outputting to sys.stderr.

Note

This is a firstresult hook.
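The shape of this hook can be sketched with a plain standard library handler. A real implementation would normally return a handler that coordinates with its progress bar; the StreamHandler here only illustrates the return type. The fallback hookimpl exists only so the sketch runs without OCRmyPDF installed.

```python
import logging
import sys

try:
    from ocrmypdf import hookimpl
except ImportError:  # fallback so the sketch runs without OCRmyPDF
    def hookimpl(func):
        return func


@hookimpl
def get_logging_console():
    # Log to stderr so that stdout stays free for PDF output
    handler = logging.StreamHandler(stream=sys.stderr)
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    return handler
```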

ocrmypdf.pluginspec.get_executor(progressbar_class: type[ProgressBar]) → Executor

Called to obtain an object that manages parallel execution.

This may be used to replace OCRmyPDF’s default parallel execution system with a third party alternative. For example, you could make OCRmyPDF run in a distributed environment.

OCRmyPDF’s executors are analogous to the standard Python executors in concurrent.futures, but they do not work the same way. Executors may be reused for different, unrelated batch operations, since all of the context for a given job is passed to Executor.__call__().

The return value should be of type Executor or otherwise conform to the protocol of that class.

Parameters: progressbar_class – A progress bar class, which will be created when

Note

This hook will be called from the main process, and may modify global state before child worker processes are forked.

Note

This is a firstresult hook.

ocrmypdf.pluginspec.get_progressbar_class() → type[ProgressBar]

Called to obtain a class that can be used to monitor progress.

OCRmyPDF will call this function when it wants to display a progress bar. The class returned by this function must be compatible with the ProgressBar protocol.

Example

Here is how OCRmyPDF will use the progress bar:

pbar_class = pm.hook.get_progressbar_class()
with pbar_class(**progress_kwargs) as pbar:
    ... # do some work
    pbar.update(1)

Applying special behavior before processing

ocrmypdf.pluginspec.validate(pdfinfo: PdfInfo, options: OcrOptions) → None

Called to give a plugin an opportunity to review options and pdfinfo.

options contains the “work order” to process a particular file. pdfinfo contains information about the input file obtained after loading and parsing. The plugin may modify the options. For example, you could decide that a certain type of file should be treated with options.force_ocr = True based on information in its pdfinfo.

Raises: ocrmypdf.exceptions.ExitCodeException – If options or pdfinfo are not acceptable and the application should terminate gracefully with an informative message and error code.

Note

This hook will be called from the main process, and may modify global state before child worker processes are forked.
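A validate hook that adjusts options based on the input file could be sketched as follows. The rule itself (force OCR on single-page files) is hypothetical, and the sketch assumes that len(pdfinfo) gives the number of pages. Plain Python objects stand in for PdfInfo and the options.

```python
from types import SimpleNamespace

try:
    from ocrmypdf import hookimpl
except ImportError:  # fallback so the sketch runs without OCRmyPDF
    def hookimpl(func):
        return func


@hookimpl
def validate(pdfinfo, options):
    # Hypothetical rule: single-page inputs are assumed to be scans that
    # should always be rasterized and OCRed.
    # Assumption: len(pdfinfo) is the page count.
    if len(pdfinfo) == 1:
        options.force_ocr = True


# Stand-ins: a list plays the role of PdfInfo
single = SimpleNamespace(force_ocr=False)
validate(["page 1"], single)

multi = SimpleNamespace(force_ocr=False)
validate(["page 1", "page 2"], multi)
```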

PDF page to image

ocrmypdf.pluginspec.rasterize_pdf_page(input_file: Path, output_file: Path, raster_device: GhostscriptRasterDevice, raster_dpi: Resolution, pageno: int, page_dpi: Resolution | None, rotation: int | None, filter_vector: bool, stop_on_soft_error: bool, options: OcrOptions | None, use_cropbox: bool) → Path

Rasterize one page of a PDF at resolution raster_dpi in canvas units.

The image is sized to the integer pixel dimensions implied by raster_dpi, even if the exact computed dimensions are not integers. The image’s DPI will be overridden with the values in page_dpi.

Parameters:

input_file – The PDF to rasterize.
output_file – The desired name of the rasterized image.
raster_device – Type of image to produce at output_file.
raster_dpi – Resolution in dots per inch at which to rasterize page.
pageno – Page number to rasterize (beginning at page 1).
page_dpi – Resolution, overriding output image DPI.
rotation – Cardinal angle, clockwise, to rotate page.
filter_vector – If True, remove vector graphics objects.
stop_on_soft_error – If True and there is a “soft error” such that PDF page image generation can proceed but may visually differ from the original, the implementer of this hook should raise a detailed exception. If False, continue processing and report the error by logging it. If the hook cannot proceed, it should always raise an exception, regardless of this setting. One example of a “soft error” is a missing font that is required to properly rasterize the PDF.
options – OCRmyPDF options. Plugins may use this to check settings like options.rasterizer to determine whether they should handle the request or defer to another plugin. Introduced in version 17.0.
use_cropbox – If True, rasterize the page’s CropBox instead of the MediaBox. Default is False (use MediaBox) for consistency with Ghostscript’s default behavior.

Returns:

Path – output_file if successful

Note

This hook will be called from child processes. Modifying global state will not affect the main process or other child processes.

Note

This is a firstresult hook.

Modifying intermediate images

ocrmypdf.pluginspec.filter_ocr_image(page: PageContext, image: Image.Image) → Image.Image

Called to filter the image before it is sent to OCR.

This is the image that OCR sees, not what the user sees when they view the PDF. In certain modes such as --redo-ocr, portions of the image may be masked out to hide them from OCR.

The main uses of this hook are expected to be hiding content from OCR, conditioning images to OCR better with filters, and adjusting images to match any constraints imposed by the OCR engine.

The input image may be color, grayscale, or monochrome, and the output image may differ. For example, if you know that a custom OCR engine does not care about the color of the text, you could convert the image to grayscale or monochrome.

Generally speaking, the output image should be a faithful representation of the input image. You may change the pixel width and height of the input image, but you must not change the aspect ratio, and you must calculate the DPI of the output image based on the new pixel width and height or the OCR text layer will be misaligned with the visual position.

The built-in Tesseract OCR engine uses this hook itself to downsample very large images to fit its constraints.
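The DPI bookkeeping for a downsampled image can be sketched with plain arithmetic. The helper downsample_geometry and the pixel limit are hypothetical. The sketch derives the new DPI from the rounded pixel counts so that the physical size of the page stays the same.

```python
def downsample_geometry(width, height, dpi_x, dpi_y, max_side):
    """Return ((new_width, new_height), (new_dpi_x, new_dpi_y)).

    The longest side is limited to max_side pixels and the aspect ratio is
    preserved up to integer rounding.
    """
    longest = max(width, height)
    if longest <= max_side:
        return (width, height), (dpi_x, dpi_y)  # nothing to do
    scale = max_side / longest
    new_width = max(1, round(width * scale))
    new_height = max(1, round(height * scale))
    # Base the DPI on the pixel counts actually used, so that
    # pixels / DPI (the physical size) is unchanged
    new_dpi = (dpi_x * new_width / width, dpi_y * new_height / height)
    return (new_width, new_height), new_dpi


size, dpi = downsample_geometry(4000, 2000, 600, 600, max_side=2000)
```

Halving the pixel dimensions halves the DPI, so the page still measures the same number of inches in each direction.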

Note

This hook will be called from child processes. Modifying global state will not affect the main process or other child processes.

Note

This is a firstresult hook.

ocrmypdf.pluginspec.filter_page_image(page: PageContext, image_filename: Path) → Path

Called to filter the whole page before it is inserted into the PDF.

A whole page image is only produced when preprocessing command line arguments are issued or when --force-ocr is issued. If no whole page image is produced for a given page, this function will not be called. This is not the image that will be shown to OCR.

If the function does not want to modify the image, it should return image_filename. The hook may overwrite image_filename with a new file.

The output image should preserve the same physical dimensions, that is (width / dpi_x, height / dpi_y). That is, if the image is resized, the DPI must be scaled by the same factor. If this is not preserved, the PDF page will be resized and the OCR layer misaligned. OCRmyPDF does nothing to enforce these constraints; it is up to the plugin to do sensible things.

OCRmyPDF will create the PDF page based on the image format used (unless the hook is overridden). If you convert the image to a JPEG, the output page will be created as a JPEG, etc. If you change the colorspace, that change will be kept. Note that the OCRmyPDF image optimization stage, if enabled, may ultimately choose a different format.

If the return value is a file that does not exist, FileNotFoundError will occur. The return value should be a path to a file in the same folder as image_filename.
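The return value rules can be sketched with a hook that either passes the image through or writes a new file next to it. The option example_filter is hypothetical, and the file copy is only a placeholder for real image processing. The fallback hookimpl exists only so the sketch runs without OCRmyPDF installed.

```python
import shutil
import tempfile
from pathlib import Path
from types import SimpleNamespace

try:
    from ocrmypdf import hookimpl
except ImportError:  # fallback so the sketch runs without OCRmyPDF
    def hookimpl(func):
        return func


@hookimpl
def filter_page_image(page, image_filename):
    # 'example_filter' is a hypothetical option added by this plugin
    if not getattr(page.options, "example_filter", False):
        return image_filename  # unchanged: hand back the original path
    # Derive the name from the input so that pages processed in parallel
    # do not collide, and keep the file in the same folder
    output = image_filename.with_name(
        f"{image_filename.stem}.example{image_filename.suffix}"
    )
    shutil.copyfile(image_filename, output)  # placeholder for real filtering
    return output


# Stand-ins for the page context and a page image
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "000001_page.png"
    source.write_bytes(b"not a real image")
    off = SimpleNamespace(options=SimpleNamespace(example_filter=False))
    on = SimpleNamespace(options=SimpleNamespace(example_filter=True))
    passthrough_ok = filter_page_image(off, source) == source
    filtered = filter_page_image(on, source)
    same_folder = filtered.parent == source.parent
    created = filtered.exists()
    filtered_name = filtered.name
```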

Note

This hook will be called from child processes. Modifying global state will not affect the main process or other child processes.

Note

This is a firstresult hook.

ocrmypdf.pluginspec.filter_pdf_page(page: PageContext, image_filename: Path, output_pdf: Path) → Path

Called to convert a filtered whole page image into a PDF.

A whole page image is only produced when preprocessing command line arguments are issued or when --force-ocr is issued. If no whole page image is produced for a given page, this function will not be called. This is not the image that will be shown to OCR. The whole page image is filtered in the hook above, filter_page_image, then this function is called for PDF conversion.

This function will only be called when OCRmyPDF runs in a mode such as “force OCR” mode where rasterizing of all content is performed.

Clever things could be done at this stage such as segmenting the page image into color regions or vector equivalents.

The provider of the hook implementation is responsible for ensuring that the OCR text layer is aligned with the PDF produced here, or text misalignment will result.

Currently this function must produce a single page PDF or the pipeline will fail. If the intent is to remove the PDF, then create a single page empty PDF.

Parameters:

page – Context for this page.
image_filename – Filename of the input image used to create output_pdf, for “reference” if recreating the output_pdf entirely.
output_pdf – The previously created output_pdf.

Returns:

output_pdf

Note

This hook will be called from child processes. Modifying global state will not affect the main process or other child processes.

Note

This is a firstresult hook.

OCR engine

ocrmypdf.pluginspec.get_ocr_engine(options: OcrOptions | None) → OcrEngine

Returns an OcrEngine to use for processing this file.

The OcrEngine may be instantiated multiple times, by both the main process and child processes.

When multiple OCR engine plugins are installed, plugins should check options.ocr_engine and return None if they are not the selected engine. The hook caller will then try the next plugin.

Parameters: options – The current OcrOptions, used to determine which engine to select. May be None for backward compatibility with external plugins.

Note

This is a firstresult hook.
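The selection logic described above might be sketched like this. The engine name example and the class ExampleEngine are hypothetical. A real plugin would subclass ocrmypdf.pluginspec.OcrEngine and implement its abstract methods. Returning the engine when options is None is an assumption about how to handle the backward-compatible case.

```python
from types import SimpleNamespace

try:
    from ocrmypdf import hookimpl
except ImportError:  # fallback so the sketch runs without OCRmyPDF
    def hookimpl(func):
        return func

ENGINE_NAME = "example"  # hypothetical value for options.ocr_engine


class ExampleEngine:
    """Stand-in; a real engine would subclass OcrEngine."""

    def __str__(self):
        return "Example OCR 0.1"


@hookimpl
def get_ocr_engine(options):
    # Decline when another engine was selected so that the hook caller
    # can try the next plugin
    if options is not None:
        if getattr(options, "ocr_engine", None) != ENGINE_NAME:
            return None
    return ExampleEngine()


declined = get_ocr_engine(SimpleNamespace(ocr_engine="tesseract"))
selected = get_ocr_engine(SimpleNamespace(ocr_engine="example"))
```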

class ocrmypdf.pluginspec.OcrEngine

A class representing an OCR engine with capabilities similar to Tesseract OCR.

This could be used to create a plugin for another OCR engine instead of Tesseract OCR.

abstractmethod __str__() → str

Returns name of OCR engine and version.

This is used when OCRmyPDF wants to mention the name of the OCR engine to the user, usually in an error message.

abstractmethod static creator_tag(options: OcrOptions) → str

Returns the creator tag to identify this software’s role in creating the PDF.

This tag will be inserted in the XMP metadata and DocumentInfo dictionary as appropriate. Ideally you should include the name of the OCR engine and its version. The text should not contain line breaks. This is to help developers like yourself identify the software that produced this file.

OCRmyPDF will always prepend its name to this value.

abstractmethod static generate_hocr(input_file: Path, output_hocr: Path, output_text: Path, options: OcrOptions) → None

Called to produce a hOCR file from a page image and sidecar text file.

A hOCR file is an HTML-like file that describes the position of text on a page. OCRmyPDF can create a text only PDF from the hOCR file and graft it onto the output PDF.

This function executes in a worker thread or worker process. OCRmyPDF automatically parallelizes OCR over pages. The OCR engine should not introduce more parallelism.

Parameters:

input_file – A page image on which to perform OCR.
output_hocr – The expected name of the output hOCR file.
output_text – The expected name of a text file containing the recognized text.
options – The command line options.

static generate_ocr(input_file: Path, options: OcrOptions, page_number: int = 0) → tuple[OcrElement, str]

Generate OCR results as an OcrElement tree.

This is the modern API for OCR engines. Engines implementing this method can return structured OCR results directly without intermediate file formats.

This function executes in a worker thread or worker process. OCRmyPDF automatically parallelizes OCR over pages. The OCR engine should not introduce more parallelism.

Parameters:

input_file – A page image on which to perform OCR.
options – The command line options.
page_number – Zero-indexed page number (for multi-page context).

Returns:

A tuple of (OcrElement tree for the page, plain text content). The OcrElement should have ocr_class=OcrClass.PAGE as its root.

Note

This method is optional. Engines that don’t implement it should leave the default implementation, and the pipeline will fall back to generate_hocr() or generate_pdf().

abstractmethod static generate_pdf(input_file: Path, output_pdf: Path, output_text: Path, options: OcrOptions) → None

Called to produce a text only PDF from a page image.

A text only PDF should contain no visible material of any kind, as it will be grafted onto the input PDF page. It must be sized to the exact dimensions of the input image.

This function executes in a worker thread or worker process. OCRmyPDF automatically parallelizes OCR over pages. The OCR engine should not introduce more parallelism.

Parameters:

input_file – A page image on which to perform OCR.
output_pdf – The expected name of the output PDF.
output_text – The expected name of a text file containing the recognized text.
options – The command line options.

static get_deskew(input_file: Path, options: OcrOptions) → float: Returns the deskew angle of the image, in degrees.

abstractmethod static get_orientation(input_file: Path, options: OcrOptions) → OrientationConfidence: Returns the orientation of the image.

abstractmethod static languages(options: OcrOptions) → Set[str]

Returns the set of all languages that are supported by the engine.

Languages are typically given as 3-letter ISO 639-2 codes, but can actually be any value understood by the OCR engine.

static supports_generate_ocr() → bool

Return True if this engine supports the generate_ocr() API.

The pipeline uses this to determine whether to call generate_ocr() or fall back to generate_hocr().

Returns: False by default. Engines implementing generate_ocr() should override this to return True.

abstractmethod static version() → str: Returns the version of the OCR engine.

class ocrmypdf.pluginspec.OrientationConfidence(angle: int, confidence: float)

Expresses an OCR engine’s confidence in page rotation.

angle

The clockwise angle (0, 90, 180, 270) that the page should be rotated. 0 means no rotation.

Type: int

confidence

How confident the OCR engine is that this is the correct rotation. 0 is not confident, 15 is very confident. Arbitrary units.

Type: float

PDF/A production

ocrmypdf.pluginspec.generate_pdfa(pdf_pages: list[Path], pdfmark: Path, output_file: Path, context: PdfContext, pdf_version: str, pdfa_part: str, progressbar_class: type[ProgressBar] | None, stop_on_soft_error: bool) → Path

Generate a PDF/A.

This API strongly assumes a PDF/A generator with Ghostscript’s semantics.

OCRmyPDF will modify the metadata and possibly linearize the PDF/A after it is generated.

Parameters:

pdf_pages – A list of one or more filenames, which will be merged into output_file.
pdfmark – A PostScript file intended for Ghostscript with details on how to perform the PDF/A conversion.
output_file – The name of the desired output file.
context – The current context.
pdf_version – The minimum PDF version that the output file should be. At its own discretion, the PDF/A generator may raise the version, but should not lower it.
pdfa_part – The desired PDF/A compliance level, such as '2b'.
progressbar_class – The class of a progress bar, which must implement the ProgressBar protocol. If None, no progress is reported.
stop_on_soft_error – If True and there is a “soft error” such that PDF/A generation can proceed, but the output may be invalid or may not visually resemble the original, the implementer of this hook should raise a detailed exception. If False, continue processing and report the error by logging it. If the hook cannot proceed, it should always raise an exception, regardless of this setting.

Returns:

Path – If successful, the hook should return output_file.

Note

This is a firstresult hook.

Note

Before version 15.0.0, the context was not provided and compression was provided instead. Plugins should now read the context object to determine if compression is requested.
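
The stop_on_soft_error behavior described above could be handled along these lines inside a generate_pdfa implementation. This is a minimal sketch: the exception class, the logger name, and the report_soft_error helper are all hypothetical names introduced for illustration, not part of OCRmyPDF.

```python
import logging

# Hypothetical logger name for an imaginary PDF/A plugin.
log = logging.getLogger("my_pdfa_plugin")


class PdfaSoftError(Exception):
    """Hypothetical exception for problems that do not stop conversion."""


def report_soft_error(message: str, stop_on_soft_error: bool) -> None:
    """Raise if the caller asked to stop on soft errors, otherwise log."""
    if stop_on_soft_error:
        raise PdfaSoftError(message)
    log.warning("PDF/A soft error, continuing: %s", message)


# With stop_on_soft_error=True the problem becomes an exception.
try:
    report_soft_error("a font could not be embedded", stop_on_soft_error=True)
    raised = False
except PdfaSoftError:
    raised = True
```

Errors that prevent the hook from proceeding at all would bypass such a helper and raise unconditionally.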

PDF optimization

ocrmypdf.pluginspec.optimize_pdf(input_pdf: Path, output_pdf: Path, context: PdfContext, executor: Executor, linearize: bool) → tuple[Path, Sequence[str]]

Optimize a PDF after image, OCR and metadata processing.

If the input_pdf is a PDF/A, the plugin should modify input_pdf in a way that preserves the PDF/A status, or report to the user when this is not possible.

If the implementation fails to produce a smaller file than the input file, it should return input_pdf instead.

A plugin that implements a new optimizer may need to suppress the built-in optimizer by implementing an initialize hook.

Parameters:

input_pdf – The input PDF, which has OCR added.
output_pdf – The requested filename of the output PDF which should be created by this optimization hook.
context – The current context.
executor – An initialized executor which may be used during optimization, to distribute optimization tasks.
linearize – If True, OCRmyPDF requires optimize_pdf to return a linearized PDF, also known as a fast web view PDF.

Returns:

Path – If optimization is successful, the hook should return output_pdf. If optimization does not produce a smaller file, the hook should return input_pdf.
Sequence[str] – Any comments that the plugin wishes to report to the user, especially reasons it was not able to further optimize the file. For example, the plugin could report that a required third-party program was not installed, so a specific optimization was not attempted.

Note

This is a firstresult hook.
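
The return contract above (the smaller file plus any comments) might be implemented with a helper like the one below. This is a sketch under stated assumptions: choose_optimized_result is a hypothetical helper, not an OCRmyPDF function, and the demonstration uses dummy byte files rather than real PDFs.

```python
import tempfile
from pathlib import Path


def choose_optimized_result(
    input_pdf: Path, output_pdf: Path
) -> tuple[Path, list[str]]:
    """Hypothetical helper: pick which file an optimize_pdf hook returns."""
    messages: list[str] = []
    if not output_pdf.exists():
        messages.append("Optimizer produced no output; keeping the input file.")
        return input_pdf, messages
    if output_pdf.stat().st_size >= input_pdf.stat().st_size:
        messages.append("Optimized file was not smaller; keeping the input file.")
        return input_pdf, messages
    return output_pdf, messages


# Demonstration with dummy files standing in for PDFs.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "in.pdf"
    dst = Path(tmp) / "out.pdf"
    src.write_bytes(b"x" * 100)

    dst.write_bytes(b"x" * 40)  # "optimized" file is smaller
    smaller_path, smaller_messages = choose_optimized_result(src, dst)
    smaller_name = smaller_path.name

    dst.write_bytes(b"x" * 200)  # "optimized" file is larger
    larger_path, larger_messages = choose_optimized_result(src, dst)
    larger_name = larger_path.name
```

A real hook would also need to honor the linearize argument and preserve PDF/A status, which this sketch does not attempt.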

ocrmypdf.pluginspec.is_optimization_enabled(context: PdfContext) → bool

For a given PdfContext, OCRmyPDF asks the plugin if optimization is enabled.

An optimization plugin might be installed and active but could be disabled by user settings.

If this returns False, OCRmyPDF will take certain actions to finalize the PDF.

Returns: True if the plugin’s optimization is enabled.

Note

This is a firstresult hook.
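
A plugin might implement this hook by consulting the user's options, as in the sketch below. It assumes the optimization level is exposed as context.options.optimize, with 0 meaning that optimization is off; the SimpleNamespace objects are stand-ins for a PdfContext so the snippet runs without OCRmyPDF installed. In a real plugin the function would also be decorated with OCRmyPDF's hookimpl marker.

```python
from types import SimpleNamespace


def is_optimization_enabled(context) -> bool:
    """Report whether this (hypothetical) optimizer plugin should run."""
    # Assumption: the optimization level is stored at context.options.optimize
    # and a level of 0 means "do not optimize".
    return getattr(context.options, "optimize", 0) != 0


# Stand-in contexts, used only for demonstration.
enabled = is_optimization_enabled(
    SimpleNamespace(options=SimpleNamespace(optimize=2))
)
disabled = is_optimization_enabled(
    SimpleNamespace(options=SimpleNamespace(optimize=0))
)
```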

Working with OcrElement trees

Added in version 17.0.0.

OCRmyPDF v17 introduces the OcrElement dataclass for representing OCR output in an engine-agnostic format. This enables plugins to work with OCR results without parsing hOCR XML.

Key classes:

from ocrmypdf import OcrElement, OcrClass, BoundingBox

# OcrElement - represents any OCR structural unit
page = OcrElement(
    ocr_class=OcrClass.PAGE,
    bbox=BoundingBox(0, 0, 612, 792),
    children=[...]
)

# BoundingBox - axis-aligned bounding box (left, top, right, bottom)
bbox = BoundingBox(left=100, top=50, right=300, bottom=80)

# OcrClass - constants for element types
OcrClass.PAGE      # "ocr_page"
OcrClass.LINE      # "ocr_line"
OcrClass.WORD      # "ocrx_word"
OcrClass.PARAGRAPH # "ocr_par"

Navigating the tree:

# Get all words in a page
words = page.words  # Returns list[OcrElement]

# Get all lines
lines = page.lines

# Get combined text
text = page.get_text_recursive()

# Iterate by class
for para in page.paragraphs:
    print(para.get_text_recursive())
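
To make the idea of class-based navigation concrete, here is a simplified, self-contained sketch of a recursive walk over a tree of this shape. The Node dataclass and find_by_class function are hypothetical stand-ins written for this example; they are not OcrElement or its actual implementation, and only the class name strings are taken from the OcrClass constants listed above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for an OCR tree element (not OcrElement)."""

    ocr_class: str
    text: str = ""
    children: list[Node] = field(default_factory=list)


def find_by_class(node: Node, ocr_class: str) -> list[Node]:
    """Collect all descendants of the given class, in document order."""
    found: list[Node] = []
    for child in node.children:
        if child.ocr_class == ocr_class:
            found.append(child)
        found.extend(find_by_class(child, ocr_class))
    return found


page = Node(
    "ocr_page",
    children=[
        Node(
            "ocr_line",
            children=[
                Node("ocrx_word", text="Hello"),
                Node("ocrx_word", text="world"),
            ],
        ),
    ],
)

words = find_by_class(page, "ocrx_word")
joined = " ".join(word.text for word in words)
```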

OCR engine plugins:

Plugins implementing custom OCR engines can now output OcrElement trees directly via the generate_ocr() method, bypassing hOCR entirely:

from pathlib import Path
from ocrmypdf.pluginspec import OcrEngine
from ocrmypdf import OcrElement, OcrClass, BoundingBox

class MyOcrEngine(OcrEngine):
    def generate_ocr(
        self,
        input_file: Path,
        options,
        context,
    ) -> OcrElement:
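        # Perform OCR and return the OcrElement tree directly;
        # there is no need to generate hOCR XML.
        # Placeholder page size in pixels (8.5 x 11 in at 300 DPI); a real
        # engine would determine the dimensions from input_file.
        width, height = 2550, 3300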
        return OcrElement(
            ocr_class=OcrClass.PAGE,
            bbox=BoundingBox(0, 0, width, height),
            dpi=300,
            children=[
                OcrElement(
                    ocr_class=OcrClass.LINE,
                    bbox=BoundingBox(100, 50, 500, 80),
                    children=[
                        OcrElement(
                            ocr_class=OcrClass.WORD,
                            bbox=BoundingBox(100, 50, 200, 80),
                            text="Hello",
                        ),
                        # ... more words
                    ]
                ),
                # ... more lines
            ]
        )

    def supports_generate_ocr(self) -> bool:
        return True  # Indicate this engine uses generate_ocr()

This approach is simpler than generating hOCR and allows modern OCR engines to integrate more naturally with OCRmyPDF.