API reference

This page summarizes the rest of the public API. It is mainly of interest to plugin developers.

ocrmypdf.api

Python API for OCRmyPDF.

This module provides the main Python API for OCRmyPDF, allowing you to perform OCR operations programmatically without using the command line interface.

Main Functions:

ocr(): The primary function for OCR processing. Takes an input PDF or image file and produces an OCR’d PDF with searchable text.
configure_logging(): Set up logging to match the command line interface behavior, with support for progress bars and colored output.

Experimental Functions:

_pdf_to_hocr(): Extract text from PDF pages and save as hOCR files for manual editing before final PDF generation.
_hocr_to_ocr_pdf(): Convert hOCR files back to a searchable PDF after manual text corrections.

The API maintains thread safety through internal locking since OCRmyPDF uses global state for plugins. Only one OCR operation can run per Python process at a time. For parallel processing, use multiple Python processes.

Example

import ocrmypdf

# Configure logging (optional)
ocrmypdf.configure_logging(ocrmypdf.Verbosity.default)

# Perform OCR
ocrmypdf.ocr('input.pdf', 'output.pdf', language='eng')

For detailed parameter documentation, see the ocr() function docstring and the equivalent command line parameters in the OCRmyPDF documentation.

class ocrmypdf.api.PageNumberFilter(name='')

Insert the number of the PDF page that emitted a log message into the log record.

filter(record)

Determine if the specified record is to be logged.

Returns True if the record should be logged, or False otherwise. If deemed appropriate, the record may be modified in-place.
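A filter of this kind can be sketched with the standard library alone. The class below is a hypothetical stand-in, not OCRmyPDF's implementation: the class name, the pageno attribute, and the way the page number is supplied are all assumptions made for illustration.

```python
import logging


class PageTagFilter(logging.Filter):
    """Hypothetical stand-in: attach a page number to every log record."""

    def __init__(self, pageno: int, name: str = ""):
        super().__init__(name)
        self.pageno = pageno

    def filter(self, record: logging.LogRecord) -> bool:
        # Modify the record in place, then return True so it is still logged.
        record.pageno = self.pageno
        return True


def tag_record(message: str, pageno: int) -> logging.LogRecord:
    """Build a record and pass it through the filter, as a logger would."""
    record = logging.LogRecord(
        name="example",
        level=logging.INFO,
        pathname="example.py",
        lineno=0,
        msg=message,
        args=(),
        exc_info=None,
    )
    PageTagFilter(pageno).filter(record)
    return record
```

A formatter could then refer to the injected attribute, for example with a format string containing %(pageno)s.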

class ocrmypdf.api.Verbosity(*values)

Verbosity level for configure_logging.

debug = 1: Output ocrmypdf debug messages

debug_all = 2: More detailed debugging from ocrmypdf and dependent modules

default = 0: Default level of logging

quiet = -1: Suppress most messages

ocrmypdf.api.check_options(options: OcrOptions, plugin_manager: OcrmypdfPluginManager) → None

Check options for validity and consistency.

This function coordinates validation across the entire system:

1. Core validation (platform, files, preprocessing)
2. Plugin external dependency validation
3. Plugin-specific validation (handled by plugin models)
4. Cross-cutting validation (handled by validation coordinator)

ocrmypdf.api.configure_logging(verbosity: Verbosity, *, progress_bar_friendly: bool = True, manage_root_logger: bool = False, plugin_manager: OcrmypdfPluginManager | None = None)

Set up logging.

Before calling ocrmypdf.ocr(), you can use this function to configure logging if you want ocrmypdf’s output to look like the ocrmypdf command line interface. It will register log handlers, log filters, and formatters, configure color logging to standard error, and adjust the log levels of third party libraries. Details of this are fine-tuned and subject to change. The verbosity argument is equivalent to the argument --verbose and applies those settings. If you have a wrapper script for ocrmypdf and you want it to be very similar to ocrmypdf, use this function; if you are using ocrmypdf as part of an application that manages its own logging, you probably do not want this function.

If this function is not called, ocrmypdf will not configure logging, and it is up to the caller of ocrmypdf.ocr() to set up logging as it wishes using the Python standard library’s logging module. If this function is called, the caller may of course make further adjustments to logging.

Regardless of whether this function is called, ocrmypdf will perform all of its logging under the "ocrmypdf" logging namespace. In addition, ocrmypdf imports pdfminer, which logs under "pdfminer". A library user may wish to configure both; note that pdfminer is extremely chatty at the log level logging.INFO.
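For an application that manages its own logging, the two namespaces can be configured with the standard library. This is a minimal sketch: the function name, the format string, and the choice of WARNING for pdfminer are assumptions, not OCRmyPDF defaults.

```python
import logging


def configure_library_logging(level: int = logging.INFO) -> logging.Logger:
    """Attach a stderr handler to the "ocrmypdf" namespace and quiet pdfminer."""
    log = logging.getLogger("ocrmypdf")
    log.setLevel(level)
    if not log.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()  # writes to sys.stderr by default
        handler.setFormatter(
            logging.Formatter("%(name)s %(levelname)s: %(message)s")
        )
        log.addHandler(handler)
    # pdfminer is very chatty at INFO, so only let warnings and above through.
    logging.getLogger("pdfminer").setLevel(logging.WARNING)
    return log
```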

This function does not set up the debug.log log file that the command line interface does at certain verbosity levels. Applications should configure their own debug logging.

Parameters:

verbosity – Verbosity level.
progress_bar_friendly – If True (the default), install a custom log handler that is compatible with progress bars and colored output.
manage_root_logger – Configure the process’s root logger.
plugin_manager – The plugin manager, used for obtaining the custom log handler.

Returns:

The toplevel logger for ocrmypdf (or the root logger, if we are managing it).

ocrmypdf.api.configure_stdout_protection() → bool

Protect the process’s real standard output from corruption.

When OCRmyPDF writes its final PDF to standard output (output_file='-'), the bytes on stdout must be exactly the PDF and nothing else. By default OCRmyPDF relies on no in-process code – third party libraries, plugins, or stray print() calls – ever writing to stdout. This function makes that guarantee real: it redirects file descriptor 1 to standard error and preserves a private copy of the real stdout, so that any accidental write to stdout lands harmlessly on stderr while OCRmyPDF still emits its final PDF to the preserved descriptor.

This is the same protection the ocrmypdf command line program installs. It is optional for API users and works like configure_logging(): call it before ocr() if you want command-line-like behavior. It must be called once, early – before any plugins are loaded or any worker process/thread is started – so that they inherit the redirected descriptor.

Because it mutates process-global file descriptors and affects the entire process, applications that manage their own standard output (for example, a long-lived service that calls ocr() in-process) should not call this function.
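The descriptor technique described above (keep a private duplicate of the real destination, then repoint the public descriptor at a sink) can be sketched with os.dup and os.dup2. This illustrates the general mechanism only and is not OCRmyPDF's code. The function name redirect_fd is hypothetical, and the demo uses pipes in place of the real stdout and stderr so that it has no side effects on the process.

```python
import os


def redirect_fd(fd: int, sink_fd: int) -> int:
    """Point fd at sink_fd and return a private duplicate of the original fd."""
    saved_fd = os.dup(fd)   # private copy of the original destination
    os.dup2(sink_fd, fd)    # later writes to fd now land on sink_fd
    return saved_fd


def demo() -> tuple[bytes, bytes]:
    """Show that stray writes are diverted while the saved copy still works."""
    real_r, real_w = os.pipe()   # stands in for the real stdout
    sink_r, sink_w = os.pipe()   # stands in for stderr
    saved_fd = redirect_fd(real_w, sink_w)
    os.write(real_w, b"stray print")   # accidental write: goes to the sink
    os.write(saved_fd, b"%PDF-1.7")    # intended output: reaches the real target
    result = os.read(real_r, 64), os.read(sink_r, 64)
    for fd in (real_r, real_w, sink_r, sink_w, saved_fd):
        os.close(fd)
    return result
```

In the demo, the bytes written through the public descriptor arrive on the sink pipe, and only the bytes written through the saved duplicate arrive on the original pipe.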

Returns: True if protection was installed (or was already active). False if stdout is not backed by a real operating system file descriptor, in which case nothing is changed.

Construct an options object from the input/output files and keyword arguments.

Parameters:

input_file – Input file path or file object.
output_file – Output file path or file object.
parser – ArgumentParser object (kept for compatibility, may be used for plugin validation).
**kwargs – Keyword arguments.

Returns:

OcrOptions – An options object containing the parsed arguments.

Raises:

TypeError – If the type of a keyword argument is not supported.

ocrmypdf.api.get_parser(): Get the main CLI parser.

ocrmypdf.api.ocr(options: OcrOptions, /, *, plugins: Iterable[Path | str] | None = None, plugin_manager: OcrmypdfPluginManager | None = None) → ExitCode

Run OCRmyPDF on one PDF or image.

This function supports two calling conventions:

New style (recommended):

>>> from ocrmypdf import ocr
>>> from ocrmypdf._options import OcrOptions
>>> options = OcrOptions(
...     input_file="input.pdf",
...     output_file="output.pdf",
...     languages=["eng"],
... )
>>> ocr(options)

Old style:

>>> ocr("input.pdf", "output.pdf", language=["eng"])

For most arguments, see documentation for the equivalent command line parameter.

This API takes a threading lock, because OCRmyPDF uses global state in particular for the plugin system. The jobs parameter will be used to create a pool of worker threads or processes at different times, subject to change. A Python process can only run one OCRmyPDF task at a time.

To run OCRmyPDF instances in parallel, use separate Python processes to scale horizontally. As a starting point, set jobs=sqrt(cpu_count) and run sqrt(cpu_count) processes. If you have files with a high page count, run fewer processes and more jobs per process. If you have a lot of short files, run more processes and fewer jobs per process.
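The square-root starting point can be computed as follows. This is a sketch: the helper name is hypothetical, and rounding down with math.isqrt is an assumption, since the text does not say how to round.

```python
import math
import os
from typing import Optional


def suggest_parallelism(cpu_count: Optional[int] = None) -> tuple[int, int]:
    """Return (processes, jobs_per_process), each about sqrt(cpu_count)."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
    side = max(1, math.isqrt(cpu_count))  # integer square root, rounded down
    return side, side
```

On a 16-core machine this suggests 4 processes with jobs=4 each. Shift the balance toward more jobs for long documents, or toward more processes for many short ones.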

A few specific arguments are discussed here:

Parameters:

input_file_or_options – Either an OcrOptions object containing all settings, or a path/stream for the input file (old-style API).
output_file (old-style API) – Output file path or stream. Required when using the old-style API with input_file as the first argument. Must be None when passing OcrOptions.
use_threads – Use worker threads instead of processes. This reduces performance but may make debugging easier since it is easier to set breakpoints.
plugins – List of plugin paths to load. Can be passed alongside OcrOptions.
plugin_manager – Pre-configured plugin manager. Can be passed alongside OcrOptions.
input_file (old-style API) – If a pathlib.Path, str or bytes, this is interpreted as a file system path to the input file. If the object appears to be a readable stream (with methods such as .read() and .seek()), the object will be read in its entirety and saved to a temporary file. If input_file is "-", standard input will be read.
output_file – If a pathlib.Path, str or bytes, this is interpreted as file system path to the output file. If the object appears to be a writable stream (with methods such as .write() and .seek()), the output will be written to this stream. If output_file is "-", the output will be written to sys.stdout (provided that standard output does not seem to be a terminal device). When a stream is used as output, whether via a writable object or "-", some final validation steps are not performed (we do not read back the stream after it is written).

Raises:

ocrmypdf.MissingDependencyError – If a required dependency program is missing or was not found on PATH.
ocrmypdf.UnsupportedImageFormatError – If the input file type was an image that could not be read, or some other file type that is not a PDF.
ocrmypdf.DpiError – If the input file is an image, but the resolution of the image is not credible (allowing it to proceed would cause poor OCR).
ocrmypdf.OutputFileAccessError – If an attempt to write to the intended output file failed.
ocrmypdf.PriorOcrFoundError – If the input PDF seems to have OCR or digital text already, and settings did not tell us to proceed.
ocrmypdf.InputFileError – Any other problem with the input file.
ocrmypdf.SubprocessOutputError – Any error related to executing a subprocess.
ocrmypdf.EncryptedPdfError – If the input PDF is encrypted (password protected). OCRmyPDF does not remove passwords.
ocrmypdf.TesseractConfigError – If Tesseract reported its configuration was not valid.
ValueError – If OcrOptions is passed along with other OCR parameters, or if both plugins and plugin_manager are provided.
TypeError – If output_file is missing when using the old-style API.

Returns:

ocrmypdf.ExitCode

ocrmypdf.api.run_pipeline(options: OcrOptions, *, plugin_manager: OcrmypdfPluginManager) → ExitCode

Run the OCR pipeline without command line exception handling.

Parameters:

options – The parsed OCR options.
plugin_manager – The plugin manager to use. If not provided, one will be created.

ocrmypdf.api.run_pipeline_cli(options: OcrOptions, *, plugin_manager: OcrmypdfPluginManager) → ExitCode

Run the OCR pipeline with command line exception handling.

Parameters:

options – The parsed OCR options.
plugin_manager – The plugin manager to use. If not provided, one will be created.

ocrmypdf.api.setup_plugin_infrastructure(plugins: Sequence[Path | str] | None = None, plugin_manager: OcrmypdfPluginManager | None = None) → OcrmypdfPluginManager

Set up plugin infrastructure with proper initialization.

This function handles:

1. Creating or validating the plugin manager
2. Calling plugin initialization hooks
3. Setting up plugin option registry

Parameters:

plugins – List of plugin paths/names to load
plugin_manager – Existing plugin manager (if any)

Returns:

Properly initialized plugin manager

Raises:

ValueError – If both plugins and plugin_manager are provided

ocrmypdf._options

Internal options model for OCRmyPDF.

class ocrmypdf._options.OcrOptions(*, input_file: BinaryIO | IOBase | Path | str | bytes, output_file: BinaryIO | IOBase | Path | str | bytes, sidecar: BinaryIO | IOBase | Path | str | bytes | None = None, output_folder: Path | None = None, work_folder: Path | None = None, languages: list[str] = <factory>, output_type: str = 'auto', mode: ProcessingMode = ProcessingMode.default, jobs: int | None = None, use_threads: bool = True, progress_bar: bool = True, quiet: bool = False, verbose: int = 0, keep_temporary_files: bool = False, image_dpi: int | None = None, deskew: bool = False, clean: bool = False, clean_final: bool = False, rotate_pages: bool = False, remove_background: bool = False, remove_vectors: bool = False, oversample: int = 0, unpaper_args: list[str] | None = None, skip_big: float | None = None, pages: str | set[int] | None = None, invalidate_digital_signatures: bool = False, tagged_pdf_mode: TaggedPdfMode = TaggedPdfMode.default, title: str | None = None, author: str | None = None, subject: str | None = None, keywords: str | None = None, optimize: int = 1, jpg_quality: int | None = None, png_quality: int | None = None, jbig2_threshold: float = 0.85, no_overwrite: bool = False, max_image_mpixels: float | None = None, pdf_renderer: str = 'auto', ocr_engine: str = 'auto', rasterizer: str = 'auto', rotate_pages_threshold: float = 14.0, user_words: PathLike | None = None, user_patterns: PathLike | None = None, fast_web_view: float = 1.0, continue_on_soft_render_error: bool | None = None, tesseract_config: list[str] = [], tesseract_pagesegmode: int | None = None, tesseract_oem: int | None = None, tesseract_thresholding: int | None = None, tesseract_timeout: float | None = None, tesseract_non_ocr_timeout: float | None = None, tesseract_downsample_above: int = 32767, tesseract_downsample_large_images: bool | None = None, pdfa_image_compression: str | None = None, color_conversion_strategy: str = 'LeaveColorUnchanged', plugins: Sequence[Path | str] | None = None, 
_extra_attrs: dict[str, Any] = <factory>)

Internal options model that can masquerade as argparse.Namespace.

This model provides proper typing and validation while maintaining compatibility with existing code that expects argparse.Namespace behavior.

property force_ocr: bool: Backward compatibility alias for mode == ProcessingMode.force.

classmethod handle_special_cases(data): Handle special cases for API compatibility and legacy options.

property jpeg_quality: Compatibility alias for jpg_quality.

property lossless_reconstruction: Determine lossless_reconstruction based on other options.

model_config = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'validate_assignment': True}: Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

model_dump_json_safe() → str: Serialize to JSON with special handling for non-serializable types.

classmethod model_validate_json_safe(json_str: str) → OcrOptions: Reconstruct from JSON with special handling for non-serializable types.

property redo_ocr: bool: Backward compatibility alias for mode == ProcessingMode.redo.

classmethod register_plugin_models(models: dict[str, type]) → None

Register plugin option model classes for nested access.

Parameters: models – Dictionary mapping namespace to model class

property skip_text: bool: Backward compatibility alias for mode == ProcessingMode.skip.

classmethod validate_clean_final(v, info): If clean_final is True, also set clean to True.

classmethod validate_jobs(v): Validate jobs is a reasonable number.

classmethod validate_languages(v): Ensure languages list is not empty.

classmethod validate_max_image_mpixels(v): Validate max image megapixels.

classmethod validate_metadata_unicode(v): Validate metadata strings don’t contain unsupported Unicode characters.

classmethod validate_output_type(v): Validate output type is one of the allowed values.

validate_output_type_compatibility(): Validate output type is compatible with output file.

classmethod validate_oversample(v): Validate oversample DPI.

classmethod validate_pages_format(v)

Convert page ranges string to set of page numbers.

If the string uses the end alias, the original string is preserved so that resolution can happen later, once the document’s page count is known.
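The conversion from a page range string to a set can be sketched as below. This is not the library's validator. The comma-and-hyphen syntax ("1, 3-5") is an assumption, the helper name is hypothetical, the numbers are returned exactly as written, and the end alias is deliberately not handled (it would raise ValueError here).

```python
def parse_page_ranges(spec: str) -> set[int]:
    """Expand a string such as "1, 3-5" into {1, 3, 4, 5} (assumed syntax)."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        start, sep, end = part.partition("-")
        first = int(start)
        last = int(end) if sep else first  # a single number is a one-page range
        if first < 1 or last < first:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(first, last + 1))
    return pages
```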

classmethod validate_pdf_renderer(v): Validate PDF renderer is one of the allowed values.

classmethod validate_rasterizer(v): Validate rasterizer is one of the allowed values.

validate_redo_ocr_options(): Validate options compatible with redo mode.

classmethod validate_rotate_pages_threshold(v): Validate rotate pages threshold.

classmethod validate_unpaper_args(v): Normalize unpaper_args from string to list and validate security.

classmethod validate_verbose(v): Validate verbose level.

ocrmypdf.exceptions

OCRmyPDF’s exceptions.

exception ocrmypdf.exceptions.BadArgsError

Invalid arguments on the command line or API.

exit_code = 1

exception ocrmypdf.exceptions.ColorConversionNeededError(color_conversion_strategy: str = 'LeaveColorUnchanged')

PDF needs color conversion to a standard color space.

Ghostscript reported a DeviceN colorspace with an inappropriate alternate. The resulting PDF/A is liable to render incorrectly (often blank) in some viewers such as Adobe Reader, so the colorspace must be normalized to a common one. RGB, CMYK, and Gray are known to work; LeaveColorUnchanged performs no conversion and UseDeviceIndependentColor does not resolve the problem (see https://github.com/ocrmypdf/OCRmyPDF/issues/1187).

exception ocrmypdf.exceptions.DigitalSignatureError

PDF has a digital signature.

message = 'Input PDF has a digital signature. OCR would alter the document,\ninvalidating the signature.\n'

exception ocrmypdf.exceptions.DpiError

Missing information about input image DPI.

exit_code = 2

exception ocrmypdf.exceptions.EncryptedPdfError

Input PDF is encrypted.

exit_code = 8

message = "Input PDF is encrypted. The encryption must be removed to\nperform OCR.\n\nFor information about this PDF's security use\n qpdf --show-encryption infilename\n\nYou can remove the encryption using\n qpdf --decrypt [--password=[password]] infilename\n"

class ocrmypdf.exceptions.ExitCode(*values)

OCRmyPDF’s exit codes.

already_done_ocr = 6

bad_args = 1

child_process_error = 7

ctrl_c = 130

encrypted_pdf = 8

file_access_error = 5

input_file = 2

invalid_config = 9

invalid_output_pdf = 4

missing_dependency = 3

ok = 0

other_error = 15

pdfa_conversion_failed = 10

exception ocrmypdf.exceptions.ExitCodeException

An exception which should return an exit code with sys.exit().

exit_code = 15

message = ''
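The pattern of carrying an exit code on the exception class can be illustrated with stand-in classes. Everything below is hypothetical and defined locally so the sketch runs without OCRmyPDF installed. Only the numeric values (15 for the base class, 8 for an encrypted PDF, 0 for success) are taken from this page.

```python
class SketchExitCodeError(Exception):
    """Stand-in for an exception type that carries a process exit code."""

    exit_code = 15  # generic "other error" value
    message = ""


class SketchEncryptedPdfError(SketchExitCodeError):
    """Stand-in for a more specific error with its own exit code."""

    exit_code = 8
    message = "Input PDF is encrypted."


def run_with_exit_code(task) -> int:
    """Run task() and translate a raised error into its exit code."""
    try:
        task()
    except SketchExitCodeError as exc:
        return exc.exit_code  # subclasses override the class attribute
    return 0  # success
```

A command line wrapper would typically end with sys.exit(run_with_exit_code(main)), where main is the application's own entry point.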

exception ocrmypdf.exceptions.InputFileError

Something is wrong with the input file.

exit_code = 2

exception ocrmypdf.exceptions.MissingDependencyError

A third-party dependency is missing.

exit_code = 3

exception ocrmypdf.exceptions.NonEmbeddedFontsError(fonts: set[str])

Input has non-embedded CID fonts that PDF/A conversion would corrupt.

PDF/A requires all fonts to be embedded. Ghostscript substitutes and embeds a replacement for non-embedded CID (CJK) fonts, which corrupts the character-to-Unicode mapping and silently destroys an existing text layer (commonly an Adobe Acrobat CJK OCR layer). OCRmyPDF refuses to produce such output rather than damage the user’s data (see https://github.com/ocrmypdf/OCRmyPDF/issues/1561).

exception ocrmypdf.exceptions.OutputFileAccessError

Cannot access the intended output file path.

exit_code = 5

exception ocrmypdf.exceptions.PriorOcrFoundError

This file already has OCR.

exit_code = 6

exception ocrmypdf.exceptions.SubprocessOutputError

A subprocess returned an unexpected error.

exit_code = 7

exception ocrmypdf.exceptions.TaggedPDFError

PDF is tagged.

message = 'This PDF is marked as a Tagged PDF. This often indicates\nthat the PDF was generated from an office document and does\nnot need OCR. Use --force-ocr, --skip-text or --redo-ocr to\noverride this error.\n'

exception ocrmypdf.exceptions.TesseractConfigError

Tesseract config can’t be parsed.

exit_code = 9

message = 'Error occurred while parsing a Tesseract configuration file'

exception ocrmypdf.exceptions.UnsupportedImageFormatError

The image format is not supported.

exit_code = 2

ocrmypdf.helpers

Support functions.

class ocrmypdf.helpers.Resolution(x: T, y: T)

The number of pixels per inch in each 2D direction.

Resolution objects are considered “equal” for == purposes if they are equal to a reasonable tolerance.

flip_axis() → Resolution[T]: Return a new Resolution object with x and y swapped.

property is_finite: bool: True if both x and y are finite numbers.

property is_square: bool: True if the resolution is square (x == y).

round(ndigits: int) → Resolution: Round to ndigits after the decimal point.

take_max(vals: Iterable[Any], yvals: Iterable[Any] | None = None) → Resolution: Return a new Resolution object with the maximum resolution of inputs.

take_min(vals: Iterable[Any], yvals: Iterable[Any] | None = None) → Resolution: Return a new Resolution object with the minimum resolution of inputs.

to_int() → Resolution[int]: Round to nearest integer.

to_scalar() → float

Return the harmonic mean of x and y as a 1D approximation.

In most cases, Resolution is 2D, but typically it is “square” (x == y) and can be approximated as a single number. When not square, the harmonic mean is used to approximate the 2D resolution as a single number.
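The harmonic mean of two values is 2xy / (x + y). The standalone function below is a sketch of that formula with a hypothetical name. It is not the Resolution method itself.

```python
def harmonic_mean_resolution(x: float, y: float) -> float:
    """Approximate a 2D resolution (x, y) as one number via the harmonic mean."""
    if x <= 0 or y <= 0:
        raise ValueError("resolution components must be positive")
    return 2 * x * y / (x + y)
```

For a square resolution the result equals either component. For a non-square one, the harmonic mean sits closer to the smaller component than the arithmetic mean does.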

ocrmypdf.helpers.available_cpu_count() → int: Returns number of CPUs in the system.

ocrmypdf.helpers.check_pdf(input_file: Path) → bool

Check if a PDF complies with the PDF specification.

Checks for proper formatting and proper linearization. Uses pikepdf (which in turn, uses QPDF) to perform the checks.

ocrmypdf.helpers.clamp(n: T, smallest: T, largest: T) → T: Clamps the value of n to between smallest and largest.

ocrmypdf.helpers.is_file_writable(test_file: PathLike) → bool

Intentionally racy test if target is writable.

We intend to write to the output file if and only if we succeed and can replace it atomically. Before doing the OCR work, make sure the location is writable.

ocrmypdf.helpers.is_iterable_notstr(thing: Any) → bool: Is this an iterable type, other than a string?

ocrmypdf.helpers.monotonic(seq: Sequence) → bool: Does this sequence increase monotonically?

ocrmypdf.helpers.page_number(input_file: PathLike) → int: Get one-based page number implied by filename (000002.pdf -> 2).
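The filename convention in the example (000002.pdf -> 2) could be read back as sketched below. The helper name is hypothetical, and taking the leading run of digits from the base name is an assumption about the convention, not a description of how the library parses it.

```python
import re
from pathlib import Path


def page_number_from_name(filename) -> int:
    """Return the page number encoded by the leading digits of a file name."""
    match = re.match(r"\d+", Path(filename).name)
    if match is None:
        raise ValueError(f"no leading page number in {filename!r}")
    return int(match.group())  # int() discards the zero padding
```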

ocrmypdf.helpers.pikepdf_enable_mmap() → None: Enable pikepdf memory mapping.

ocrmypdf.helpers.remove_all_log_handlers(logger: Logger) → None

Remove all log handlers, usually used in a child process.

The child process inherits the log handlers from the parent process when a fork occurs. Typically we want to remove all log handlers in the child process so that the child process can set up a single queue handler to forward log messages to the parent process.
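The handler reset described above can be sketched with the standard library. The function name is hypothetical, and this is an illustration of the pattern, not OCRmyPDF's code. In a real child process the queue would be a multiprocessing queue shared with the parent.

```python
import logging
import logging.handlers


def reset_to_queue_handler(logger: logging.Logger, log_queue) -> None:
    """Drop inherited handlers and forward all records to log_queue."""
    # Iterate over a copy because removeHandler mutates logger.handlers.
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
```

On the parent side, a logging.handlers.QueueListener reading from the same queue can pass the records on to the real handlers.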

ocrmypdf.helpers.running_in_docker() → bool: Returns True if we seem to be running in a Docker container.

ocrmypdf.helpers.running_in_snap() → bool: Returns True if we seem to be running in a Snap container.

ocrmypdf.helpers.safe_symlink(input_file: PathLike, soft_link_name: PathLike) → None

Create a symbolic link at soft_link_name, which references input_file.

Think of this as copying input_file to soft_link_name with less overhead.

Use symlinks safely. Self-linking loops are prevented. On Windows, file copy is used since symlinks may require administrator privileges. An existing link at the destination is removed.

ocrmypdf.helpers.samefile(file1: PathLike, file2: PathLike) → bool

Return True if two files are the same file.

Attempts to account for different relative paths to the same file.

ocrmypdf.hocrtransform

Transform OCR output to text-only PDFs.

This package provides tools for:

1. Parsing OCR output (hOCR format) into generic OcrElement structures
2. Rendering OcrElement structures to searchable PDF text layers

The architecture separates parsing from rendering, allowing:

- Support for multiple OCR input formats (hOCR, ALTO, custom engines)
- Independent improvements to text rendering
- Reuse of the OcrElement data model for other purposes

Main components:

- OcrElement: Generic dataclass representing OCR output structure
- HocrParser: Parses hOCR files into OcrElement trees
- Fpdf2PdfRenderer: Renders OcrElement trees to PDF text layers (via fpdf2)

For PDF rendering, use the fpdf2_renderer module:

from ocrmypdf.fpdf_renderer import Fpdf2PdfRenderer, DebugRenderOptions

class ocrmypdf.hocrtransform.Baseline(slope: float = 0.0, intercept: float = 0.0)

Text baseline information.

The baseline is represented as a linear equation: y = slope * x + intercept. This describes the line along which text characters sit, relative to the bottom-left corner of the line’s bounding box.

In hOCR, the baseline is specified relative to the bottom of the line’s bbox, with the intercept being the vertical offset from the bottom and the slope representing rotation (positive = ascending left-to-right).

slope

Slope of the baseline (rise over run)

Type: float

intercept

Y-intercept of the baseline (vertical offset from bbox bottom)

Type: float
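The linear equation can be evaluated directly. The dataclass below is a local stand-in with the same two fields. The y_at method is hypothetical and is not claimed to exist on the real class.

```python
from dataclasses import dataclass


@dataclass
class SketchBaseline:
    """Stand-in mirroring the slope and intercept fields described above."""

    slope: float = 0.0
    intercept: float = 0.0

    def y_at(self, x: float) -> float:
        # y = slope * x + intercept, measured from the line's bounding box
        return self.slope * x + self.intercept
```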

class ocrmypdf.hocrtransform.BoundingBox(left: float, top: float, right: float, bottom: float)

An axis-aligned bounding box in pixel coordinates.

Coordinates use top-left origin (standard for images and hOCR).

left

Left edge x-coordinate

Type: float

top

Top edge y-coordinate

Type: float

right

Right edge x-coordinate

Type: float

bottom

Bottom edge y-coordinate

Type: float

property height: float: Height of the bounding box.

property width: float: Width of the bounding box.

class ocrmypdf.hocrtransform.FontInfo(name: str | None = None, size: float | None = None, bold: bool = False, italic: bool = False, monospace: bool = False, serif: bool = False, smallcaps: bool = False, underline: bool = False)

Font information for text rendering.

name

Font family name (e.g., “Times New Roman”)

Type: str | None

size

Font size in points

Type: float | None

bold

Whether the font is bold

Type: bool

italic

Whether the font is italic

Type: bool

monospace

Whether the font is monospace

Type: bool

serif

Whether the font is serif (vs sans-serif)

Type: bool

smallcaps

Whether the font uses small caps

Type: bool

underline

Whether the text is underlined

Type: bool

exception ocrmypdf.hocrtransform.HocrParseError: Error while parsing hOCR file.

class ocrmypdf.hocrtransform.HocrParser(hocr_file: str | Path)

Parser for hOCR format files.

Converts hOCR XML/HTML files into OcrElement trees.

The hOCR format uses HTML with special class attributes (ocr_page, ocr_line, ocrx_word, etc.) and a title attribute containing properties like bbox, baseline, and confidence scores.

parse() → OcrElement

Parse the hOCR file and return an OcrElement tree.

Returns: The root OcrElement (ocr_page) containing the document structure
Raises: HocrParseError – If no ocr_page element is found
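The title attribute mentioned above is a semicolon-separated list of properties, each a name followed by values (for example "bbox 10 20 110 60; baseline 0.01 -5; x_wconf 93"). A minimal sketch of splitting it follows. The function name is hypothetical, this is not HocrParser's implementation, and non-numeric properties are simply skipped.

```python
def parse_hocr_title(title: str) -> dict[str, list[float]]:
    """Split an hOCR title attribute into {property: [numeric values]}."""
    props: dict[str, list[float]] = {}
    for chunk in title.split(";"):
        parts = chunk.split()
        if len(parts) < 2:
            continue  # empty chunk or a property without values
        key, values = parts[0], parts[1:]
        try:
            props[key] = [float(v) for v in values]
        except ValueError:
            continue  # non-numeric property, e.g. an image file name
    return props
```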

class ocrmypdf.hocrtransform.OcrClass: Constants for common OCR element classes.

class ocrmypdf.hocrtransform.OcrElement(ocr_class: str, bbox: BoundingBox | None = None, poly: list[tuple[float, float]] | None=None, text: str = '', confidence: float | None = None, children: list[OcrElement] = <factory>, direction: Literal['ltr', 'rtl'] | None=None, language: str | None = None, baseline: Baseline | None = None, textangle: float | None = None, font: FontInfo | None = None, dpi: float | None = None, page_number: int | None = None, logical_page_number: int | None = None)

A generic OCR element representing any structural unit of OCR output.

OcrElements form a tree structure where pages contain paragraphs, paragraphs contain lines, lines contain words, etc. The specific hierarchy depends on the OCR engine, but this dataclass can represent any of these levels.

The ocr_class field uses hOCR naming conventions (ocr_page, ocr_par, ocr_line, ocrx_word, etc.) as a common vocabulary, but elements from other sources can map to these classes.

Common hOCR classes:

ocr_page: The root element for a page
ocr_carea: A content/column area
ocr_par: A paragraph
ocr_line: A line of text
ocr_header: A header line
ocr_footer: A footer line
ocr_caption: A caption line
ocr_textfloat: A floating text element
ocrx_word: A single word

ocr_class

The element type (e.g., “ocr_page”, “ocr_line”, “ocrx_word”)

Type: str

bbox

Axis-aligned bounding box in source pixel coordinates (top-left origin)

Type: ocrmypdf.models.ocr_element.BoundingBox | None

poly

Polygon vertices for oriented/non-rectangular bounds

Type: list[tuple[float, float]] | None

text

Text content (primarily for leaf nodes like words)

Type: str

confidence

OCR confidence score (0.0-1.0)

Type: float | None

children

Child elements (hierarchical structure)

Type: list[ocrmypdf.models.ocr_element.OcrElement]

direction

Text direction (“ltr” or “rtl”)

Type: Literal['ltr', 'rtl'] | None

language

Language code (e.g., “eng”, “deu”, “chi_sim”)

Type: str | None

baseline

Text baseline information (slope and intercept)

Type: ocrmypdf.models.ocr_element.Baseline | None

textangle

Text rotation angle in degrees (counter-clockwise from horizontal)

Type: float | None

font

Font information (name, size, style)

Type: ocrmypdf.models.ocr_element.FontInfo | None

dpi

Image resolution in dots per inch (typically for page-level)

Type: float | None

page_number

Physical page number (0-indexed)

Type: int | None

logical_page_number

Logical page number (as printed on the page)

Type: int | None

find_by_class(*ocr_classes: str) → OcrElement | None

Find the first descendant matching the given class(es).

Parameters: *ocr_classes – One or more ocr_class values to match
Returns: The first matching element, or None if not found

get_text_recursive() → str

Get the combined text of this element and all descendants.

Returns: Combined text content, with words separated by spaces

iter_by_class(*ocr_classes: str) → list[OcrElement]

Iterate over all descendants matching the given class(es).

Parameters: *ocr_classes – One or more ocr_class values to match
Returns: List of all matching descendant elements (depth-first order)

property lines: list[OcrElement]: Get all line elements in this element’s subtree.

property paragraphs: list[OcrElement]: Get all paragraph elements (ocr_par) in this element’s subtree.

property words: list[OcrElement]: Get all word elements (ocrx_word) in this element’s subtree.
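The tree structure and the depth-first class lookup can be illustrated with a small stand-in. SketchElement is a hypothetical local class with only three of the fields. It mirrors the described behavior of iter_by_class and find_by_class but is not the library's implementation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchElement:
    """Stand-in node: a class name, optional text, and child nodes."""

    ocr_class: str
    text: str = ""
    children: list["SketchElement"] = field(default_factory=list)

    def iter_by_class(self, *ocr_classes: str) -> list["SketchElement"]:
        """Collect matching descendants in depth-first order."""
        found: list["SketchElement"] = []
        for child in self.children:
            if child.ocr_class in ocr_classes:
                found.append(child)
            found.extend(child.iter_by_class(*ocr_classes))
        return found

    def find_by_class(self, *ocr_classes: str) -> Optional["SketchElement"]:
        """Return the first matching descendant, or None."""
        matches = self.iter_by_class(*ocr_classes)
        return matches[0] if matches else None


def build_sample_page() -> SketchElement:
    """Build a page with one line that holds two words."""
    words = [SketchElement("ocrx_word", "Hello"), SketchElement("ocrx_word", "world")]
    line = SketchElement("ocr_line", children=words)
    return SketchElement("ocr_page", children=[line])
```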

ocrmypdf.pdfa

Utilities for PDF/A production and confirmation with Ghostscript.

ocrmypdf.pdfa.add_pdfa_metadata(pdf: pikepdf.Pdf, part: str, conformance: str) → None

Add PDF/A XMP metadata declaration to a PDF.

Parameters:

pdf – An open pikepdf.Pdf object
part – PDF/A part number (‘1’, ‘2’, or ‘3’)
conformance – Conformance level (‘A’, ‘B’, or ‘U’)

ocrmypdf.pdfa.add_srgb_output_intent(pdf: pikepdf.Pdf) → None

Add sRGB ICC profile as OutputIntent to PDF catalog.

This creates the required PDF/A OutputIntent structure by:

- Adding an ICC profile stream containing the sRGB profile
- Adding an OutputIntent dictionary pointing to that profile
- Updating the Catalog’s OutputIntents array

Parameters: pdf – An open pikepdf.Pdf object

ocrmypdf.pdfa.file_claims_pdfa(filename: Path)

Determines if the file claims to be PDF/A compliant.

This only checks if the XMP metadata contains a PDF/A marker. It does not do full PDF/A validation.
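A PDF/A claim in XMP is expressed with the pdfaid:part and pdfaid:conformance properties in the http://www.aiim.org/pdfa/ns/id/ namespace. The sketch below looks for them in an XMP packet given as a string. It is not the library's implementation, the function name is hypothetical, and extracting the XMP packet from the PDF is left out.

```python
import xml.etree.ElementTree as ET

PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"

SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
      xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/"
      pdfaid:part="2" pdfaid:conformance="B"/>
 </rdf:RDF>
</x:xmpmeta>"""


def xmp_pdfa_claim(xmp_xml: str) -> dict[str, str]:
    """Return the pdfaid part and conformance found in an XMP packet, if any."""
    root = ET.fromstring(xmp_xml)
    claim: dict[str, str] = {}
    for elem in root.iter():
        for name in ("part", "conformance"):
            qname = f"{{{PDFAID_NS}}}{name}"
            # XMP may store a property as an attribute or as a child element.
            if qname in elem.attrib:
                claim[name] = elem.attrib[qname]
            elif elem.tag == qname and elem.text:
                claim[name] = elem.text.strip()
    return claim
```

An empty result means the metadata makes no PDF/A claim. A non-empty result only shows that the file claims conformance, which is the same limitation noted above.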

ocrmypdf.pdfa.find_nonembedded_cid_fonts(pdf: pikepdf.Pdf) → set[str]

Find CID-keyed (Type0) fonts that lack embedded glyph data.

PDF/A requires every font to be embedded. When Ghostscript converts a PDF to PDF/A it must substitute and embed a replacement for any non-embedded font. For CID-keyed fonts – which is how CJK text is encoded, including the OCR text layers produced by Adobe Acrobat – this substitution routinely corrupts the character-to-Unicode mapping, silently destroying the searchable text. Detecting these fonts lets the caller refuse PDF/A conversion rather than emit corrupted output.

Simple (non-CID) non-embedded fonts are not reported: Ghostscript substitutes standard encodings for them without corrupting the text, and they are far too common to treat as conversion blockers.

Parameters:: pdf – An open pikepdf.Pdf to scan.
Returns:: The set of BaseFont names of non-embedded CID fonts found.
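The distinction between embedded and non-embedded CID fonts can be sketched on a simplified model. In the example below, plain Python dicts stand in for PDF font dictionaries (an assumption made so the example runs without pikepdf), and nonembedded_cid_fonts is a hypothetical name. It relies on the PDF convention that an embedded font program is stored under FontFile, FontFile2, or FontFile3 in the font descriptor; the real function may traverse the document differently.

```python
# Keys that hold an embedded font program in a PDF FontDescriptor.
EMBEDDED_KEYS = ("/FontFile", "/FontFile2", "/FontFile3")


def nonembedded_cid_fonts(fonts: list[dict]) -> set[str]:
    """Return BaseFont names of Type0 fonts whose descendant is not embedded.

    `fonts` holds plain dicts modelling PDF font dictionaries. This is a
    simplified stand-in, not the pikepdf-based implementation.
    """
    found: set[str] = set()
    for font in fonts:
        if font.get("/Subtype") != "/Type0":
            continue  # simple (non-CID) fonts are deliberately not reported
        for descendant in font.get("/DescendantFonts", []):
            descriptor = descendant.get("/FontDescriptor", {})
            if not any(key in descriptor for key in EMBEDDED_KEYS):
                found.add(str(font.get("/BaseFont", "")))
    return found


# Made-up font dictionaries for illustration.
fonts = [
    {   # CID font with no font program: should be reported
        "/Subtype": "/Type0",
        "/BaseFont": "/MSung-Light",
        "/DescendantFonts": [
            {"/Subtype": "/CIDFontType0", "/FontDescriptor": {}},
        ],
    },
    {   # CID font with an embedded TrueType program: fine
        "/Subtype": "/Type0",
        "/BaseFont": "/ABCDEF+ExampleCJK",
        "/DescendantFonts": [
            {"/Subtype": "/CIDFontType2",
             "/FontDescriptor": {"/FontFile2": b"font data"}},
        ],
    },
    {   # simple non-embedded font: ignored
        "/Subtype": "/Type1",
        "/BaseFont": "/Helvetica",
    },
]

print(nonembedded_cid_fonts(fonts))
```

A caller would typically treat a non-empty result as a reason to skip Ghostscript PDF/A conversion for that file.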

ocrmypdf.pdfa.generate_pdfa_ps(target_filename: Path, icc: str = 'sRGB')

Create a Postscript PDFMARK file for Ghostscript PDF/A conversion.

pdfmark is an extension to the PostScript language that describes some PDF features like bookmarks and annotations. It was originally specified for Adobe Distiller, for PostScript to PDF conversion.

Ghostscript uses pdfmark for PDF to PDF/A conversion as well. To use Ghostscript to create a PDF/A, we need to create a pdfmark file with the necessary metadata.

This function takes care of the many version-specific bugs and peculiarities in Ghostscript’s handling of pdfmark.

The only information we put in specifies that we want the file to be a PDF/A, and that we want Ghostscript to convert objects to the sRGB colorspace if it runs into any object that it decides must be converted.

Parameters:

target_filename – filename to save
icc – ICC identifier such as ‘sRGB’

References

Adobe PDFMARK Reference: https://opensource.adobe.com/dc-acrobat-sdk-docs/library/pdfmark/

ocrmypdf.pdfa.speculative_pdfa_conversion(input_file: Path, output_file: Path, output_type: str) → Path

Attempt to convert a PDF to PDF/A by adding required structures.

This function creates a copy of the input PDF and adds:

1. sRGB ICC profile as OutputIntent
2. XMP metadata declaring PDF/A conformance

This approach works for PDFs that are already mostly PDF/A compliant but lack the formal declarations. It does NOT perform color conversion, font embedding, or other transformations that Ghostscript does.

Parameters:

input_file – Path to input PDF
output_file – Path where output PDF should be written
output_type – One of ‘pdfa’, ‘pdfa-1’, ‘pdfa-2’, ‘pdfa-3’

Returns:

Path to the output file

Raises:

pikepdf.PdfError – If the PDF cannot be opened or modified

ocrmypdf.quality

Utilities to measure OCR quality.

class ocrmypdf.quality.OcrQualityDictionary(*, wordlist: Iterable[str])

Manages a dictionary for simple OCR quality checks.

measure_words_matched(ocr_text: str) → float

Check how many unique words in the OCR text match a dictionary.

Words with mixed capitalization are only considered a match if the test word matches that capitalization.

Returns:: The fraction of unique words in the OCR text that match the dictionary
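The idea of a match ratio can be sketched in a few lines. The function below is a hypothetical standalone approximation, not the OcrQualityDictionary implementation: the tokenization rule (runs of letters) and the name words_matched_ratio are assumptions. It follows the capitalization rule described above, so a dictionary entry such as "Paris" is matched only by "Paris", while a lowercase entry such as "the" is also matched by "The".

```python
import re
from collections.abc import Iterable


def words_matched_ratio(ocr_text: str, wordlist: Iterable[str]) -> float:
    """Return matched unique words divided by all unique words in the text."""
    dictionary = set(wordlist)
    # Assumption: a word is a run of letters; digits and punctuation split words.
    unique_words = set(re.findall(r"[^\W\d_]+", ocr_text))
    if not unique_words:
        return 0.0
    matches = sum(
        1
        for word in unique_words
        # Exact match first, then a lowercase fallback so that "The" matches
        # "the". A mixed-case entry like "Paris" is never matched by "paris".
        if word in dictionary or word.lower() in dictionary
    )
    return matches / len(unique_words)


wordlist = ["the", "quick", "fox", "Paris"]
text = "The quick brown fox visited paris and Paris."
print(words_matched_ratio(text, wordlist))  # 4 of 8 unique words match: 0.5
```

A low ratio on a page that should contain ordinary prose is a hint that OCR produced poor text, although the score depends heavily on the word list used.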

ocrmypdf.subprocess

Wrappers to manage subprocess calls.

This package is split into three private submodules by concern:

ocrmypdf.subprocess._run - low-level execution wrappers (run, run_polling_stderr) that add OCRmyPDF-aware logging and Windows PATH resolution. Useful as drop-in replacements for subprocess.run().
ocrmypdf.subprocess._version - version probing (get_version).
ocrmypdf.subprocess._check - startup validation (check_external_program) with platform-aware error messages.

The names below are the stable public API. Importing from the private submodules directly is not supported for external code.

ocrmypdf.subprocess.check_external_program(*, program: str, package: str, version_checker: Callable[[], Version], need_version: str | Version, required_for: str | None = None, recommended: bool = False, version_parser: type[Version] = Version) → None

Check for required version of external program and raise exception if not.

Parameters:

program – The name of the program to test.
package – The name of a software package that typically supplies this program. Usually the same as program.
version_checker – A callable without arguments that retrieves the installed version of program.
need_version – The minimum required version.
required_for – The name of an argument or feature that requires this program.
recommended – If this external program is recommended, instead of raising an exception, log a warning and allow execution to continue.
version_parser – A class that should be used to parse and compare version numbers. Used when version numbers do not follow standard conventions.
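The interaction between need_version and recommended can be sketched as follows. This is a simplified model under stated assumptions, not the library's code: check_program, parse_version, and ProgramTooOldError are hypothetical names, versions are passed as strings, and a tuple of integers stands in for packaging.version.Version.

```python
import logging
from collections.abc import Callable

log = logging.getLogger(__name__)


class ProgramTooOldError(RuntimeError):
    """Hypothetical exception used only in this sketch."""


def parse_version(text: str) -> tuple[int, ...]:
    """Parse '5.3.1' into (5, 3, 1); a simplified stand-in for Version."""
    parts = [int(part) for part in text.split(".")]
    # Drop trailing zeros so that '5.3.0' and '5.3' compare equal.
    while len(parts) > 1 and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def check_program(
    *,
    program: str,
    version_checker: Callable[[], str],
    need_version: str,
    recommended: bool = False,
) -> None:
    """Raise if the program is too old, or only warn if it is merely recommended."""
    found = version_checker()
    if parse_version(found) >= parse_version(need_version):
        return
    message = f"{program} {found} is older than the required {need_version}"
    if recommended:
        log.warning(message)  # execution is allowed to continue
        return
    raise ProgramTooOldError(message)


# Passes silently: 3.12.1 satisfies the 3.10 minimum.
check_program(program="python", version_checker=lambda: "3.12.1", need_version="3.10")
```

Passing the version lookup as a callable keeps the check itself independent of how each program reports its version.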

ocrmypdf.subprocess.get_version(program: str, *, version_arg: str = '--version', regex='(\\d+(\\.\\d+)*)', env: Mapping[str, str] | _Environ | None = None) → str

Get the version of the specified program.

Parameters:

program – The program to version check.
version_arg – The argument needed to ask for its version, e.g. --version.
regex – A regular expression to parse the program’s output and obtain the version.
env – Custom os.environ in which to run program.
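The general technique of version probing, running the program with its version argument and applying a regular expression to the output, can be sketched with the standard library alone. probe_version is a hypothetical name and this is not the get_version implementation; the regex default is the one shown in the signature above, and the Python interpreter is used as the probed program only so that the example runs anywhere.

```python
import re
import subprocess
import sys


def probe_version(
    program: str,
    *,
    version_arg: str = "--version",
    regex: str = r"(\d+(\.\d+)*)",
) -> str:
    """Run `program version_arg` and return the first version-like match."""
    proc = subprocess.run(
        [program, version_arg],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some programs print their version on stderr
        text=True,
        check=True,
    )
    match = re.search(regex, proc.stdout)
    if match is None:
        raise ValueError(f"no version found in output of {program}")
    return match.group(1)


print(probe_version(sys.executable))
```

Programs that report their version in an unusual format are the reason both version_arg and regex are parameters.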

ocrmypdf.subprocess.run(args: Sequence[Path | str], *, env: Mapping[str, str] | _Environ | None = None, logs_errors_to_stdout: bool = False, check: bool = False, **kwargs) → CompletedProcess

Wrapper around subprocess.run().

The main purpose of this wrapper is to log subprocess output in an orderly fashion that identifies the responsible subprocess. An additional task is that this function goes to greater lengths to find possible Windows locations of our dependencies when they are not on the system PATH.

Arguments should be identical to subprocess.run, except for the following:

Parameters:

args – Positional arguments to pass to subprocess.run.
env – A set of environment variables. If None, the OS environment is used.
logs_errors_to_stdout – If True, indicates that the process writes its error messages to stdout rather than stderr, so stdout should be logged if there is an error. If False, stderr is logged. Could be used with stderr=STDOUT, stdout=PIPE for example.
check – If True, raise an exception if the process exits with a non-zero status code. If False, the return value will indicate success or failure.
kwargs – Additional arguments to pass to subprocess.run.
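The role of logs_errors_to_stdout can be illustrated with a small wrapper around the standard subprocess.run. This is a hypothetical sketch (logged_run is not part of OCRmyPDF, and the real wrapper may log at different levels and handle more cases): on a non-zero exit it logs the captured stream, prefixed with the program name, and chooses stdout or stderr according to the flag.

```python
import logging
import subprocess
import sys
from pathlib import Path

log = logging.getLogger("subprocess_sketch")


def logged_run(args, *, logs_errors_to_stdout=False, check=False, **kwargs):
    """Run a process and log its error output, tagged with the program name."""
    program = Path(str(args[0])).name
    proc = subprocess.run(args, check=False, **kwargs)
    # Pick the stream where this program is known to write its errors.
    stream = proc.stdout if logs_errors_to_stdout else proc.stderr
    if proc.returncode != 0 and stream:
        text = stream.decode(errors="replace") if isinstance(stream, bytes) else stream
        for line in text.splitlines():
            log.error("[%s] %s", program, line)
    if check:
        proc.check_returncode()  # raises CalledProcessError on non-zero exit
    return proc


# A child process that fails after writing to stderr.
failing = [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(3)"]
result = logged_run(failing, stderr=subprocess.PIPE)
print(result.returncode)
```

Nothing is logged unless the caller captures the relevant stream, for example with stderr=subprocess.PIPE as in the usage above.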

ocrmypdf.subprocess.run_polling_stderr(args: Sequence[Path | str], *, callback: Callable[[str], None], check: bool = False, env: Mapping[str, str] | _Environ | None = None, **kwargs) → CompletedProcess

Run a process like ocrmypdf.subprocess.run, and poll stderr.

Every line written to stderr will be forwarded to the callback function. The intended use is monitoring progress of subprocesses that output their own progress indicators. In addition, each line will be logged if debug logging is enabled.

Requires stderr to be opened in text mode for ease of handling errors. In addition, the expected encoding= and errors= arguments should be set. Note that if stdout is already set up, it need not be binary.
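The polling pattern itself can be sketched with subprocess.Popen: open stderr as a text-mode pipe with explicit encoding= and errors=, then hand each line to the callback as it arrives. poll_stderr is a hypothetical name and this sketch omits the logging and check handling that the real function provides.

```python
import subprocess
import sys
from collections.abc import Callable, Sequence


def poll_stderr(args: Sequence[str], *, callback: Callable[[str], None]) -> int:
    """Run a process, forward each stderr line to callback, return the exit code."""
    with subprocess.Popen(
        args,
        stderr=subprocess.PIPE,
        text=True,            # text mode, so lines arrive as str
        encoding="utf-8",
        errors="replace",     # undecodable bytes do not abort the loop
    ) as proc:
        assert proc.stderr is not None
        for line in proc.stderr:
            callback(line.rstrip("\n"))
    # Leaving the with-block waits for the process to finish.
    return proc.returncode


# A child process that reports progress on stderr.
child = [
    sys.executable,
    "-c",
    "import sys\nfor i in range(3): print(f'step {i}', file=sys.stderr)",
]
lines: list[str] = []
exit_code = poll_stderr(child, callback=lines.append)
print(lines, exit_code)
```

Here the callback simply collects lines; a progress bar update would be a more typical use.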