API reference

This page summarizes the rest of the public API. It is mainly of interest to plugin developers.

ocrmypdf.api

Python API for OCRmyPDF.

This module provides the main Python API for OCRmyPDF, allowing you to perform OCR operations programmatically without using the command line interface.

Main Functions:
  • ocr(): The primary function for OCR processing. Takes an input PDF or image file and produces an OCR’d PDF with searchable text.

  • configure_logging(): Set up logging to match the command line interface behavior, with support for progress bars and colored output.

Experimental Functions:
  • _pdf_to_hocr(): Extract text from PDF pages and save as hOCR files for manual editing before final PDF generation.

  • _hocr_to_ocr_pdf(): Convert hOCR files back to a searchable PDF after manual text corrections.

The API maintains thread safety through internal locking since OCRmyPDF uses global state for plugins. Only one OCR operation can run per Python process at a time. For parallel processing, use multiple Python processes.

Example

import ocrmypdf

# Configure logging (optional)
ocrmypdf.configure_logging(ocrmypdf.Verbosity.default)

# Perform OCR
ocrmypdf.ocr('input.pdf', 'output.pdf', language='eng')

For detailed parameter documentation, see the ocr() function docstring and the equivalent command line parameters in the OCRmyPDF documentation.

class ocrmypdf.api.PageNumberFilter(name='')

Insert the number of the PDF page that emitted a log message into the log record.

filter(record)

Determine if the specified record is to be logged.

Returns True if the record should be logged, or False otherwise. If deemed appropriate, the record may be modified in-place.

class ocrmypdf.api.Verbosity(*values)

Verbosity level for configure_logging.

debug = 1

Output ocrmypdf debug messages

debug_all = 2

More detailed debugging from ocrmypdf and dependent modules

default = 0

Default level of logging

quiet = -1

Suppress most messages

ocrmypdf.api.check_options(options: OcrOptions, plugin_manager: OcrmypdfPluginManager) None

Check options for validity and consistency.

This function coordinates validation across the entire system:
  1. Core validation (platform, files, preprocessing)

  2. Plugin external dependency validation

  3. Plugin-specific validation (handled by plugin models)

  4. Cross-cutting validation (handled by validation coordinator)

ocrmypdf.api.configure_logging(verbosity: Verbosity, *, progress_bar_friendly: bool = True, manage_root_logger: bool = False, plugin_manager: OcrmypdfPluginManager | None = None)

Set up logging.

Before calling ocrmypdf.ocr(), you can use this function to configure logging if you want ocrmypdf’s output to look like the ocrmypdf command line interface. It will register log handlers, log filters, and formatters, configure color logging to standard error, and adjust the log levels of third party libraries. Details of this are fine-tuned and subject to change. The verbosity argument is equivalent to the argument --verbose and applies those settings. If you have a wrapper script for ocrmypdf and you want it to be very similar to ocrmypdf, use this function; if you are using ocrmypdf as part of an application that manages its own logging, you probably do not want this function.

If this function is not called, ocrmypdf will not configure logging, and it is up to the caller of ocrmypdf.ocr() to set up logging as it wishes using the Python standard library’s logging module. If this function is called, the caller may of course make further adjustments to logging.

Regardless of whether this function is called, ocrmypdf will perform all of its logging under the "ocrmypdf" logging namespace. In addition, ocrmypdf imports pdfminer, which logs under "pdfminer". A library user may wish to configure both; note that pdfminer is extremely chatty at the log level logging.INFO.

This function does not set up the debug.log log file that the command line interface does at certain verbosity levels. Applications should configure their own debug logging.

Parameters:
  • verbosity – Verbosity level.

  • progress_bar_friendly – If True (the default), install a custom log handler that is compatible with progress bars and colored output.

  • manage_root_logger – Configure the process’s root logger.

  • plugin_manager – The plugin manager, used for obtaining the custom log handler.

Returns:

The toplevel logger for ocrmypdf (or the root logger, if we are managing it).
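
For an application that manages its own logging rather than calling configure_logging(), the two logger namespaces described above can be set up with the standard library alone. The following is a minimal sketch, not OCRmyPDF's own configuration; the function name setup_app_logging and the format string are illustrative assumptions.

```python
import logging


def setup_app_logging(level=logging.INFO):
    """Attach one stderr handler to the "ocrmypdf" logger namespace.

    Hypothetical helper for illustration; not part of ocrmypdf.
    """
    handler = logging.StreamHandler()  # writes to standard error by default
    handler.setFormatter(logging.Formatter("%(name)s %(levelname)s: %(message)s"))

    ocr_log = logging.getLogger("ocrmypdf")
    ocr_log.setLevel(level)
    ocr_log.addHandler(handler)

    # pdfminer is very chatty at INFO, so only let warnings and errors through.
    logging.getLogger("pdfminer").setLevel(logging.WARNING)
    return ocr_log


log = setup_app_logging()
log.info("logging configured")
```

Because child loggers inherit the level of "ocrmypdf", any logger named under that namespace follows the same setting unless it is adjusted separately.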

ocrmypdf.api.create_options(*, input_file: BinaryIO | Path | str | bytes, output_file: BinaryIO | Path | str | bytes, parser: ArgumentParser, **kwargs) OcrOptions

Construct an options object from the input/output files and keyword arguments.

Parameters:
  • input_file – Input file path or file object.

  • output_file – Output file path or file object.

  • parser – ArgumentParser object (kept for compatibility, may be used for plugin validation).

  • **kwargs – Keyword arguments.

Returns:

OcrOptions – An options object containing the parsed arguments.

Raises:

TypeError – If the type of a keyword argument is not supported.

ocrmypdf.api.get_parser()

Get the main CLI parser.

ocrmypdf.api.ocr(options: OcrOptions, /, *, plugins: Iterable[Path | str] | None = None, plugin_manager: OcrmypdfPluginManager | None = None) ExitCode
ocrmypdf.api.ocr(input_file_or_options: BinaryIO | Path | str | bytes, output_file: BinaryIO | Path | str | bytes, *, language: Iterable[str] | None = None, image_dpi: int | None = None, output_type: str | None = None, sidecar: BinaryIO | Path | str | bytes | None = None, jobs: int | None = None, use_threads: bool | None = None, title: str | None = None, author: str | None = None, subject: str | None = None, keywords: str | None = None, rotate_pages: bool | None = None, remove_background: bool | None = None, deskew: bool | None = None, clean: bool | None = None, clean_final: bool | None = None, unpaper_args: str | None = None, oversample: int | None = None, remove_vectors: bool | None = None, mode: str | None = None, force_ocr: bool | None = None, skip_text: bool | None = None, redo_ocr: bool | None = None, skip_big: float | None = None, optimize: int | None = None, jpg_quality: int | None = None, png_quality: int | None = None, jbig2_lossy: bool | None = None, jbig2_page_group_size: int | None = None, jbig2_threshold: float | None = None, pages: str | None = None, max_image_mpixels: float | None = None, tesseract_config: Iterable[str] | None = None, tesseract_pagesegmode: int | None = None, tesseract_oem: int | None = None, tesseract_thresholding: int | None = None, pdf_renderer: str | None = None, rasterizer: str | None = None, tesseract_timeout: float | None = None, tesseract_non_ocr_timeout: float | None = None, tesseract_downsample_above: int | None = None, tesseract_downsample_large_images: bool | None = None, rotate_pages_threshold: float | None = None, pdfa_image_compression: str | None = None, color_conversion_strategy: str | None = None, user_words: PathLike | None = None, user_patterns: PathLike | None = None, fast_web_view: float | None = None, continue_on_soft_render_error: bool | None = None, invalidate_digital_signatures: bool | None = None, tagged_pdf_mode: str | None = None, no_overwrite: bool | None = None, plugins: Iterable[Path | str] | None = 
None, plugin_manager: OcrmypdfPluginManager | None = None, keep_temporary_files: bool | None = None, progress_bar: bool | None = None, **kwargs) ExitCode

Run OCRmyPDF on one PDF or image.

This function supports two calling conventions:

New style (recommended):
>>> from ocrmypdf import ocr
>>> from ocrmypdf._options import OcrOptions
>>> options = OcrOptions(
...     input_file="input.pdf",
...     output_file="output.pdf",
...     languages=["eng"],
... )
>>> ocr(options)

Old style:
>>> ocr("input.pdf", "output.pdf", language=["eng"])

For most arguments, see documentation for the equivalent command line parameter.

This API takes a threading lock, because OCRmyPDF uses global state in particular for the plugin system. The jobs parameter will be used to create a pool of worker threads or processes at different times, subject to change. A Python process can only run one OCRmyPDF task at a time.

To run OCRmyPDF instances in parallel, use separate Python processes to scale horizontally. As a starting point, set jobs=sqrt(cpu_count) and run sqrt(cpu_count) processes. If you have files with a high page count, run fewer processes and more jobs per process. If you have a lot of short files, run more processes and fewer jobs per process.
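
The starting point described above can be sketched with the standard library's process pool. This is an illustration under stated assumptions, not a recommended production setup: the helper names plan_parallelism, ocr_one and ocr_many are hypothetical, and the square root is rounded down to an integer.

```python
import math
import os
from concurrent.futures import ProcessPoolExecutor


def plan_parallelism(cpu_count):
    """Return (processes, jobs_per_process), each about sqrt(cpu_count)."""
    n = max(1, math.isqrt(max(1, cpu_count)))
    return n, n


def ocr_one(task):
    """Run one OCR task; executed inside a worker process."""
    src, dst, jobs = task
    import ocrmypdf  # imported here so each worker process loads it itself

    # Exceptions raised by ocr() propagate back through pool.map().
    return ocrmypdf.ocr(src, dst, jobs=jobs)


def ocr_many(pairs):
    """OCR (input, output) path pairs using several Python processes."""
    processes, jobs = plan_parallelism(os.cpu_count() or 1)
    tasks = [(src, dst, jobs) for src, dst in pairs]
    with ProcessPoolExecutor(max_workers=processes) as pool:
        return list(pool.map(ocr_one, tasks))


if __name__ == "__main__":
    # Only report the plan here; call ocr_many() with real file pairs to run OCR.
    print(plan_parallelism(os.cpu_count() or 1))
```

For a batch dominated by long documents, the same sketch can be skewed toward fewer processes and a larger jobs value, as the paragraph above suggests.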

A few specific arguments are discussed here:

Parameters:
  • input_file_or_options – Either an OcrOptions object containing all settings, or a path/stream for the input file (old-style API).

  • output_file – Output file path or stream. Required when using the old-style API with an input file as the first argument. Must be None when passing OcrOptions.

  • use_threads – Use worker threads instead of processes. This reduces performance but may make debugging easier since it is easier to set breakpoints.

  • plugins – List of plugin paths to load. Can be passed alongside OcrOptions.

  • plugin_manager – Pre-configured plugin manager. Can be passed alongside OcrOptions.

  • input_file – In the old-style API, the input file passed as the first argument. If a pathlib.Path, str or bytes, this is interpreted as a file system path to the input file. If the object appears to be a readable stream (with methods such as .read() and .seek()), the object will be read in its entirety and saved to a temporary file. If input_file is "-", standard input will be read.

  • output_file – If a pathlib.Path, str or bytes, this is interpreted as a file system path to the output file. If the object appears to be a writable stream (with methods such as .write() and .seek()), the output will be written to this stream. If output_file is "-", the output will be written to sys.stdout (provided that standard output does not seem to be a terminal device). When a stream is used as output, whether via a writable object or "-", some final validation steps are not performed (we do not read back the stream after it is written).

Raises:
  • ocrmypdf.MissingDependencyError – If a required dependency program is missing or was not found on PATH.

  • ocrmypdf.UnsupportedImageFormatError – If the input file type was an image that could not be read, or some other file type that is not a PDF.

  • ocrmypdf.DpiError – If the input file is an image, but the resolution of the image is not credible (allowing it to proceed would cause poor OCR).

  • ocrmypdf.OutputFileAccessError – If an attempt to write to the intended output file failed.

  • ocrmypdf.PriorOcrFoundError – If the input PDF seems to have OCR or digital text already, and settings did not tell us to proceed.

  • ocrmypdf.InputFileError – Any other problem with the input file.

  • ocrmypdf.SubprocessOutputError – Any error related to executing a subprocess.

  • ocrmypdf.EncryptedPdfError – If the input PDF is encrypted (password protected). OCRmyPDF does not remove passwords.

  • ocrmypdf.TesseractConfigError – If Tesseract reported its configuration was not valid.

  • ValueError – If OcrOptions is passed along with other OCR parameters, or if both plugins and plugin_manager are provided.

  • TypeError – If output_file is missing when using the old-style API.

Returns:

ocrmypdf.ExitCode

ocrmypdf.api.run_pipeline(options: OcrOptions, *, plugin_manager: OcrmypdfPluginManager) ExitCode

Run the OCR pipeline without command line exception handling.

Parameters:
  • options – The parsed OCR options.

  • plugin_manager – The plugin manager to use. If not provided, one will be created.

ocrmypdf.api.run_pipeline_cli(options: OcrOptions, *, plugin_manager: OcrmypdfPluginManager) ExitCode

Run the OCR pipeline with command line exception handling.

Parameters:
  • options – The parsed OCR options.

  • plugin_manager – The plugin manager to use. If not provided, one will be created.

ocrmypdf.api.setup_plugin_infrastructure(plugins: Sequence[Path | str] | None = None, plugin_manager: OcrmypdfPluginManager | None = None) OcrmypdfPluginManager

Set up plugin infrastructure with proper initialization.

This function handles:
  1. Creating or validating the plugin manager

  2. Calling plugin initialization hooks

  3. Setting up plugin option registry

Parameters:
  • plugins – List of plugin paths/names to load

  • plugin_manager – Existing plugin manager (if any)

Returns:

Properly initialized plugin manager

Raises:

ValueError – If both plugins and plugin_manager are provided

ocrmypdf._options

Internal options model for OCRmyPDF.

class ocrmypdf._options.OcrOptions(*, input_file: BinaryIO | IOBase | Path | str | bytes, output_file: BinaryIO | IOBase | Path | str | bytes, sidecar: BinaryIO | IOBase | Path | str | bytes | None = None, output_folder: Path | None = None, work_folder: Path | None = None, languages: list[str] = <factory>, output_type: str = 'auto', mode: ProcessingMode = ProcessingMode.default, jobs: int | None = None, use_threads: bool = True, progress_bar: bool = True, quiet: bool = False, verbose: int = 0, keep_temporary_files: bool = False, image_dpi: int | None = None, deskew: bool = False, clean: bool = False, clean_final: bool = False, rotate_pages: bool = False, remove_background: bool = False, remove_vectors: bool = False, oversample: int = 0, unpaper_args: list[str] | None = None, skip_big: float | None = None, pages: str | set[int] | None = None, invalidate_digital_signatures: bool = False, tagged_pdf_mode: TaggedPdfMode = TaggedPdfMode.default, title: str | None = None, author: str | None = None, subject: str | None = None, keywords: str | None = None, optimize: int = 1, jpg_quality: int | None = None, png_quality: int | None = None, jbig2_threshold: float = 0.85, no_overwrite: bool = False, max_image_mpixels: float | None = None, pdf_renderer: str = 'auto', ocr_engine: str = 'auto', rasterizer: str = 'auto', rotate_pages_threshold: float = 14.0, user_words: PathLike | None = None, user_patterns: PathLike | None = None, fast_web_view: float = 1.0, continue_on_soft_render_error: bool | None = None, tesseract_config: list[str] = [], tesseract_pagesegmode: int | None = None, tesseract_oem: int | None = None, tesseract_thresholding: int | None = None, tesseract_timeout: float | None = None, tesseract_non_ocr_timeout: float | None = None, tesseract_downsample_above: int = 32767, tesseract_downsample_large_images: bool | None = None, pdfa_image_compression: str | None = None, color_conversion_strategy: str = 'LeaveColorUnchanged', plugins: Sequence[Path | str] | None = None, 
_extra_attrs: dict[str, ~typing.Any]=<factory>)

Internal options model that can masquerade as argparse.Namespace.

This model provides proper typing and validation while maintaining compatibility with existing code that expects argparse.Namespace behavior.

property force_ocr: bool

Backward compatibility alias for mode == ProcessingMode.force.

classmethod handle_special_cases(data)

Handle special cases for API compatibility and legacy options.

property jpeg_quality

Compatibility alias for jpg_quality.

property lossless_reconstruction

Determine lossless_reconstruction based on other options.

model_config = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'validate_assignment': True}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

model_dump_json_safe() str

Serialize to JSON with special handling for non-serializable types.

classmethod model_validate_json_safe(json_str: str) OcrOptions

Reconstruct from JSON with special handling for non-serializable types.

property redo_ocr: bool

Backward compatibility alias for mode == ProcessingMode.redo.

classmethod register_plugin_models(models: dict[str, type]) None

Register plugin option model classes for nested access.

Parameters:

models – Dictionary mapping namespace to model class

property skip_text: bool

Backward compatibility alias for mode == ProcessingMode.skip.

classmethod validate_clean_final(v, info)

If clean_final is True, also set clean to True.

classmethod validate_jobs(v)

Validate jobs is a reasonable number.

classmethod validate_languages(v)

Ensure languages list is not empty.

classmethod validate_max_image_mpixels(v)

Validate max image megapixels.

classmethod validate_metadata_unicode(v)

Validate metadata strings don’t contain unsupported Unicode characters.

classmethod validate_output_type(v)

Validate output type is one of the allowed values.

validate_output_type_compatibility()

Validate output type is compatible with output file.

classmethod validate_oversample(v)

Validate oversample DPI.

classmethod validate_pages_format(v)

Convert page ranges string to set of page numbers.

classmethod validate_pdf_renderer(v)

Validate PDF renderer is one of the allowed values.

classmethod validate_rasterizer(v)

Validate rasterizer is one of the allowed values.

validate_redo_ocr_options()

Validate options compatible with redo mode.

classmethod validate_rotate_pages_threshold(v)

Validate rotate pages threshold.

classmethod validate_unpaper_args(v)

Normalize unpaper_args from string to list and validate security.

classmethod validate_verbose(v)

Validate verbose level.

ocrmypdf.exceptions

OCRmyPDF’s exceptions.

exception ocrmypdf.exceptions.BadArgsError

Invalid arguments on the command line or API.

exit_code = 1
exception ocrmypdf.exceptions.ColorConversionNeededError

PDF needs color conversion.

message = 'The input PDF has an unusual color space. Use\n--color-conversion-strategy to convert to a common color space\nsuch as RGB, or use --output-type pdf to skip PDF/A conversion\nand retain the original color space.\n'
exception ocrmypdf.exceptions.DigitalSignatureError

PDF has a digital signature.

message = 'Input PDF has a digital signature. OCR would alter the document,\ninvalidating the signature.\n'
exception ocrmypdf.exceptions.DpiError

Missing information about input image DPI.

exit_code = 2
exception ocrmypdf.exceptions.EncryptedPdfError

Input PDF is encrypted.

exit_code = 8
message = "Input PDF is encrypted. The encryption must be removed to\nperform OCR.\n\nFor information about this PDF's security use\n    qpdf --show-encryption infilename\n\nYou can remove the encryption using\n    qpdf --decrypt [--password=[password]] infilename\n"
class ocrmypdf.exceptions.ExitCode(*values)

OCRmyPDF’s exit codes.

already_done_ocr = 6
bad_args = 1
child_process_error = 7
ctrl_c = 130
encrypted_pdf = 8
file_access_error = 5
input_file = 2
invalid_config = 9
invalid_output_pdf = 4
missing_dependency = 3
ok = 0
other_error = 15
pdfa_conversion_failed = 10
exception ocrmypdf.exceptions.ExitCodeException

An exception which should return an exit code with sys.exit().

exit_code = 15
message = ''
exception ocrmypdf.exceptions.InputFileError

Something is wrong with the input file.

exit_code = 2
exception ocrmypdf.exceptions.MissingDependencyError

A third-party dependency is missing.

exit_code = 3
exception ocrmypdf.exceptions.OutputFileAccessError

Cannot access the intended output file path.

exit_code = 5
exception ocrmypdf.exceptions.PriorOcrFoundError

This file already has OCR.

exit_code = 6
exception ocrmypdf.exceptions.SubprocessOutputError

A subprocess returned an unexpected error.

exit_code = 7
exception ocrmypdf.exceptions.TaggedPDFError

PDF is tagged.

message = 'This PDF is marked as a Tagged PDF. This often indicates\nthat the PDF was generated from an office document and does\nnot need OCR. Use --force-ocr, --skip-text or --redo-ocr to\noverride this error.\n'
exception ocrmypdf.exceptions.TesseractConfigError

Tesseract config can’t be parsed.

exit_code = 9
message = 'Error occurred while parsing a Tesseract configuration file'
exception ocrmypdf.exceptions.UnsupportedImageFormatError

The image format is not supported.

exit_code = 2
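
A wrapper script that launches the command line tool as a subprocess receives only an integer return code. The sketch below turns such a code back into one of the names listed for ExitCode above. DocumentedExitCode is a stand-in that copies the documented values so the example runs without ocrmypdf installed; real code would normally import ocrmypdf.exceptions.ExitCode instead. The helper describe_exit is hypothetical.

```python
from enum import IntEnum


class DocumentedExitCode(IntEnum):
    """Stand-in mirroring the values listed for ocrmypdf.exceptions.ExitCode."""

    ok = 0
    bad_args = 1
    input_file = 2
    missing_dependency = 3
    invalid_output_pdf = 4
    file_access_error = 5
    already_done_ocr = 6
    child_process_error = 7
    encrypted_pdf = 8
    invalid_config = 9
    pdfa_conversion_failed = 10
    other_error = 15
    ctrl_c = 130


def describe_exit(returncode):
    """Map a process return code to a documented name, or "unknown"."""
    try:
        return DocumentedExitCode(returncode).name
    except ValueError:
        return "unknown"


print(describe_exit(6))   # already_done_ocr
print(describe_exit(42))  # unknown
```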

ocrmypdf.helpers

Support functions.

class ocrmypdf.helpers.Resolution(x: T, y: T)

The number of pixels per inch in each 2D direction.

Resolution objects are considered “equal” for == purposes if they are equal to a reasonable tolerance.

flip_axis() Resolution[T]

Return a new Resolution object with x and y swapped.

property is_finite: bool

True if both x and y are finite numbers.

property is_square: bool

True if the resolution is square (x == y).

round(ndigits: int) Resolution

Round to ndigits after the decimal point.

take_max(vals: Iterable[Any], yvals: Iterable[Any] | None = None) Resolution

Return a new Resolution object with the maximum resolution of inputs.

take_min(vals: Iterable[Any], yvals: Iterable[Any] | None = None) Resolution

Return a new Resolution object with the minimum resolution of inputs.

to_int() Resolution[int]

Round to nearest integer.

to_scalar() float

Return the harmonic mean of x and y as a 1D approximation.

In most cases, Resolution is 2D, but typically it is “square” (x == y) and can be approximated as a single number. When not square, the harmonic mean is used to approximate the 2D resolution as a single number.
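
As a worked example of the harmonic mean mentioned above, the sketch below computes it for a square and a non-square resolution. It shows only the arithmetic and is not the library's implementation; the function name harmonic_mean is an assumption.

```python
def harmonic_mean(x, y):
    """Harmonic mean of two positive numbers: 2xy / (x + y)."""
    return 2 * x * y / (x + y)


print(harmonic_mean(300, 300))  # 300.0
print(harmonic_mean(200, 300))  # 240.0
```

For a square resolution the result equals the common value, and for a non-square one it lands closer to the smaller axis than the arithmetic mean would.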

ocrmypdf.helpers.available_cpu_count() int

Returns number of CPUs in the system.

ocrmypdf.helpers.check_pdf(input_file: Path) bool

Check if a PDF complies with the PDF specification.

Checks for proper formatting and proper linearization. Uses pikepdf (which, in turn, uses QPDF) to perform the checks.

ocrmypdf.helpers.clamp(n: T, smallest: T, largest: T) T

Clamps the value of n to between smallest and largest.

ocrmypdf.helpers.is_file_writable(test_file: PathLike) bool

Intentionally racy test if target is writable.

We intend to write to the output file if and only if we succeed and can replace it atomically. Before doing the OCR work, make sure the location is writable.

ocrmypdf.helpers.is_iterable_notstr(thing: Any) bool

Is this an iterable type, other than a string?

ocrmypdf.helpers.monotonic(seq: Sequence) bool

Does this sequence increase monotonically?

ocrmypdf.helpers.page_number(input_file: PathLike) int

Get one-based page number implied by filename (000002.pdf -> 2).
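
The filename convention in the example above (000002.pdf -> 2) can be illustrated as follows. This is a sketch assuming the digits before the first dot are the page number; the function name page_number_from_name is hypothetical and the library's own parsing may differ.

```python
from pathlib import Path


def page_number_from_name(filename):
    """Read the digits before the first dot as a one-based page number."""
    leading = Path(filename).name.split(".", 1)[0]
    return int(leading)


print(page_number_from_name("000002.pdf"))  # 2
```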

ocrmypdf.helpers.pikepdf_enable_mmap() None

Enable pikepdf memory mapping.

ocrmypdf.helpers.remove_all_log_handlers(logger: Logger) None

Remove all log handlers, usually used in a child process.

The child process inherits the log handlers from the parent process when a fork occurs. Typically we want to remove all log handlers in the child process so that the child process can set up a single queue handler to forward log messages to the parent process.
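
The pattern described above, dropping inherited handlers and installing a single queue handler, can be sketched with the standard library. The helper forward_logs_to_queue is hypothetical and is not how ocrmypdf wires up its workers; a plain queue.Queue stands in for the inter-process queue so the example runs in one process.

```python
import logging
import logging.handlers
import queue


def forward_logs_to_queue(logger, log_queue):
    """Drop inherited handlers and forward all records to log_queue."""
    for handler in list(logger.handlers):  # copy: the list is mutated below
        logger.removeHandler(handler)
    logger.addHandler(logging.handlers.QueueHandler(log_queue))


# Single-process demonstration of what a child process would do at startup.
demo_logger = logging.getLogger("demo.child")
demo_logger.addHandler(logging.StreamHandler())  # pretend this was inherited
demo_queue = queue.Queue()
forward_logs_to_queue(demo_logger, demo_queue)

demo_logger.warning("hello from the child")
print(demo_queue.get_nowait().getMessage())
```

In a real parent/child arrangement the parent would read records from the shared queue, for example with logging.handlers.QueueListener, and pass them to its own handlers.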

ocrmypdf.helpers.running_in_docker() bool

Returns True if we seem to be running in a Docker container.

ocrmypdf.helpers.running_in_snap() bool

Returns True if we seem to be running in a Snap container.

ocrmypdf.helpers.safe_symlink(input_file: PathLike, soft_link_name: PathLike) None

Create a symbolic link at soft_link_name, which references input_file.

Think of this as copying input_file to soft_link_name with less overhead.

Use symlinks safely. Self-linking loops are prevented. On Windows, file copy is used since symlinks may require administrator privileges. An existing link at the destination is removed.

ocrmypdf.helpers.samefile(file1: PathLike, file2: PathLike) bool

Return True if two files are the same file.

Attempts to account for different relative paths to the same file.

ocrmypdf.hocrtransform

Transform OCR output to text-only PDFs.

This package provides tools for:
  1. Parsing OCR output (hOCR format) into generic OcrElement structures

  2. Rendering OcrElement structures to searchable PDF text layers

The architecture separates parsing from rendering, allowing:
  • Support for multiple OCR input formats (hOCR, ALTO, custom engines)

  • Independent improvements to text rendering

  • Reuse of the OcrElement data model for other purposes

Main components:
  • OcrElement: Generic dataclass representing OCR output structure

  • HocrParser: Parses hOCR files into OcrElement trees

  • Fpdf2PdfRenderer: Renders OcrElement trees to PDF text layers (via fpdf2)

For PDF rendering, use the fpdf_renderer module:

from ocrmypdf.fpdf_renderer import Fpdf2PdfRenderer, DebugRenderOptions

class ocrmypdf.hocrtransform.Baseline(slope: float = 0.0, intercept: float = 0.0)

Text baseline information.

The baseline is represented as a linear equation: y = slope * x + intercept. This describes the line along which text characters sit, relative to the bottom-left corner of the line’s bounding box.

In hOCR, the baseline is specified relative to the bottom of the line’s bbox, with the intercept being the vertical offset from the bottom and the slope representing rotation (positive = ascending left-to-right).

slope

Slope of the baseline (rise over run)

Type:

float

intercept

Y-intercept of the baseline (vertical offset from bbox bottom)

Type:

float
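
To make the linear form y = slope * x + intercept concrete, the sketch below evaluates it at a given horizontal offset. BaselineSketch is a stand-in with the same two fields as the documented class, and the y_at method is an illustrative assumption that the documented Baseline does not list.

```python
from dataclasses import dataclass


@dataclass
class BaselineSketch:
    """Stand-in with the same two fields as the documented Baseline."""

    slope: float = 0.0
    intercept: float = 0.0

    def y_at(self, x):
        """Evaluate y = slope * x + intercept at horizontal offset x."""
        return self.slope * x + self.intercept


flat = BaselineSketch()
tilted = BaselineSketch(slope=0.01, intercept=-5.0)
print(flat.y_at(250))    # 0.0
print(tilted.y_at(100))  # one unit of rise over 100 units of run, offset by -5
```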

class ocrmypdf.hocrtransform.BoundingBox(left: float, top: float, right: float, bottom: float)

An axis-aligned bounding box in pixel coordinates.

Coordinates use top-left origin (standard for images and hOCR).

left

Left edge x-coordinate

Type:

float

top

Top edge y-coordinate

Type:

float

right

Right edge x-coordinate

Type:

float

bottom

Bottom edge y-coordinate

Type:

float

property height: float

Height of the bounding box.

property width: float

Width of the bounding box.
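
With a top-left origin, y grows downward, so the height is the bottom coordinate minus the top coordinate. The stand-in below illustrates this with the four documented fields; BoxSketch is a hypothetical name and the property bodies are an assumption about the obvious arithmetic, not a copy of the library's code.

```python
from dataclasses import dataclass


@dataclass
class BoxSketch:
    """Stand-in with the same four fields as the documented BoundingBox."""

    left: float
    top: float
    right: float
    bottom: float

    @property
    def width(self):
        return self.right - self.left

    @property
    def height(self):
        # Top-left origin: y grows downward, so bottom >= top.
        return self.bottom - self.top


box = BoxSketch(left=10, top=20, right=110, bottom=45)
print(box.width, box.height)  # 100 25
```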

class ocrmypdf.hocrtransform.FontInfo(name: str | None = None, size: float | None = None, bold: bool = False, italic: bool = False, monospace: bool = False, serif: bool = False, smallcaps: bool = False, underline: bool = False)

Font information for text rendering.

name

Font family name (e.g., “Times New Roman”)

Type:

str | None

size

Font size in points

Type:

float | None

bold

Whether the font is bold

Type:

bool

italic

Whether the font is italic

Type:

bool

monospace

Whether the font is monospace

Type:

bool

serif

Whether the font is serif (vs sans-serif)

Type:

bool

smallcaps

Whether the font uses small caps

Type:

bool

underline

Whether the text is underlined

Type:

bool

exception ocrmypdf.hocrtransform.HocrParseError

Error while parsing hOCR file.

class ocrmypdf.hocrtransform.HocrParser(hocr_file: str | Path)

Parser for hOCR format files.

Converts hOCR XML/HTML files into OcrElement trees.

The hOCR format uses HTML with special class attributes (ocr_page, ocr_line, ocrx_word, etc.) and a title attribute containing properties like bbox, baseline, and confidence scores.

parse() OcrElement

Parse the hOCR file and return an OcrElement tree.

Returns:

The root OcrElement (ocr_page) containing the document structure

Raises:

HocrParseError – If no ocr_page element is found
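
The title attribute mentioned above packs several properties into one string, separated by semicolons. The sketch below splits such a string into named values. It is a simplified illustration, not HocrParser's implementation: parse_hocr_title is a hypothetical helper, the sample string is made up, and values containing spaces or semicolons (such as quoted file names) are not handled.

```python
def _to_number(token):
    """Convert a token to float when possible, else keep the raw string."""
    try:
        return float(token)
    except ValueError:
        return token


def parse_hocr_title(title):
    """Split an hOCR title attribute into {property_name: [values...]}."""
    props = {}
    for chunk in title.split(";"):
        parts = chunk.split()
        if not parts:
            continue  # skip empty segments such as a trailing ';'
        props[parts[0]] = [_to_number(v) for v in parts[1:]]
    return props


props = parse_hocr_title("bbox 10 20 110 45; baseline 0.01 -5; x_wconf 93")
print(props["bbox"])     # [10.0, 20.0, 110.0, 45.0]
print(props["x_wconf"])  # [93.0]
```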

class ocrmypdf.hocrtransform.OcrClass

Constants for common OCR element classes.

class ocrmypdf.hocrtransform.OcrElement(ocr_class: str, bbox: BoundingBox | None = None, poly: list[tuple[float, float]] | None=None, text: str = '', confidence: float | None = None, children: list[OcrElement] = <factory>, direction: Literal['ltr', 'rtl'] | None=None, language: str | None = None, baseline: Baseline | None = None, textangle: float | None = None, font: FontInfo | None = None, dpi: float | None = None, page_number: int | None = None, logical_page_number: int | None = None)

A generic OCR element representing any structural unit of OCR output.

OcrElements form a tree structure where pages contain paragraphs, paragraphs contain lines, lines contain words, etc. The specific hierarchy depends on the OCR engine, but this dataclass can represent any of these levels.

The ocr_class field uses hOCR naming conventions (ocr_page, ocr_par, ocr_line, ocrx_word, etc.) as a common vocabulary, but elements from other sources can map to these classes.

Common hOCR classes:
  • ocr_page: The root element for a page

  • ocr_carea: A content/column area

  • ocr_par: A paragraph

  • ocr_line: A line of text

  • ocr_header: A header line

  • ocr_footer: A footer line

  • ocr_caption: A caption line

  • ocr_textfloat: A floating text element

  • ocrx_word: A single word

ocr_class

The element type (e.g., “ocr_page”, “ocr_line”, “ocrx_word”)

Type:

str

bbox

Axis-aligned bounding box in source pixel coordinates (top-left origin)

Type:

ocrmypdf.models.ocr_element.BoundingBox | None

poly

Polygon vertices for oriented/non-rectangular bounds

Type:

list[tuple[float, float]] | None

text

Text content (primarily for leaf nodes like words)

Type:

str

confidence

OCR confidence score (0.0-1.0)

Type:

float | None

children

Child elements (hierarchical structure)

Type:

list[ocrmypdf.models.ocr_element.OcrElement]

direction

Text direction (“ltr” or “rtl”)

Type:

Literal['ltr', 'rtl'] | None

language

Language code (e.g., “eng”, “deu”, “chi_sim”)

Type:

str | None

baseline

Text baseline information (slope and intercept)

Type:

ocrmypdf.models.ocr_element.Baseline | None

textangle

Text rotation angle in degrees (counter-clockwise from horizontal)

Type:

float | None

font

Font information (name, size, style)

Type:

ocrmypdf.models.ocr_element.FontInfo | None

dpi

Image resolution in dots per inch (typically for page-level)

Type:

float | None

page_number

Physical page number (0-indexed)

Type:

int | None

logical_page_number

Logical page number (as printed on the page)

Type:

int | None

find_by_class(*ocr_classes: str) OcrElement | None

Find the first descendant matching the given class(es).

Parameters:

*ocr_classes – One or more ocr_class values to match

Returns:

The first matching element, or None if not found

get_text_recursive() str

Get the combined text of this element and all descendants.

Returns:

Combined text content, with words separated by spaces

iter_by_class(*ocr_classes: str) list[OcrElement]

Iterate over all descendants matching the given class(es).

Parameters:

*ocr_classes – One or more ocr_class values to match

Returns:

List of all matching descendant elements (depth-first order)

property lines: list[OcrElement]

Get all line elements in this element’s subtree.

property paragraphs: list[OcrElement]

Get all paragraph elements (ocr_par) in this element’s subtree.

property words: list[OcrElement]

Get all word elements (ocrx_word) in this element’s subtree.
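
The depth-first search that iter_by_class is documented to perform can be sketched on a small stand-in tree. Node copies only three of the documented OcrElement fields, and the free function below is an illustrative assumption about the traversal order, not the library's code.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Stand-in with a few of the documented OcrElement fields."""

    ocr_class: str
    text: str = ""
    children: list = field(default_factory=list)


def iter_by_class(node, *ocr_classes):
    """Return all descendants of node whose class matches, depth-first."""
    found = []
    for child in node.children:
        if child.ocr_class in ocr_classes:
            found.append(child)
        found.extend(iter_by_class(child, *ocr_classes))
    return found


page = Node("ocr_page", children=[
    Node("ocr_par", children=[
        Node("ocr_line", children=[
            Node("ocrx_word", "Hello"),
            Node("ocrx_word", "world"),
        ]),
        Node("ocr_line", children=[Node("ocrx_word", "again")]),
    ]),
])

words = iter_by_class(page, "ocrx_word")
print(" ".join(w.text for w in words))  # Hello world again
```

Joining the word texts in traversal order gives the same kind of space-separated result that get_text_recursive is described as returning.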

ocrmypdf.pdfa

Utilities for PDF/A production and confirmation with Ghostscript.

ocrmypdf.pdfa.add_pdfa_metadata(pdf: pikepdf.Pdf, part: str, conformance: str) None

Add PDF/A XMP metadata declaration to a PDF.

Parameters:
  • pdf – An open pikepdf.Pdf object

  • part – PDF/A part number (‘1’, ‘2’, or ‘3’)

  • conformance – Conformance level (‘A’, ‘B’, or ‘U’)

ocrmypdf.pdfa.add_srgb_output_intent(pdf: pikepdf.Pdf) None

Add sRGB ICC profile as OutputIntent to PDF catalog.

This creates the required PDF/A OutputIntent structure by:
  • Adding an ICC profile stream containing the sRGB profile

  • Adding an OutputIntent dictionary pointing to that profile

  • Updating the Catalog’s OutputIntents array

Parameters:

pdf – An open pikepdf.Pdf object

ocrmypdf.pdfa.file_claims_pdfa(filename: Path)

Determines if the file claims to be PDF/A compliant.

This only checks if the XMP metadata contains a PDF/A marker. It does not do full PDF/A validation.

ocrmypdf.pdfa.generate_pdfa_ps(target_filename: Path, icc: str = 'sRGB')

Create a Postscript PDFMARK file for Ghostscript PDF/A conversion.

pdfmark is an extension to the Postscript language that describes some PDF features like bookmarks and annotations. It was originally specified for Adobe Distiller, for Postscript to PDF conversion.

Ghostscript uses pdfmark for PDF to PDF/A conversion as well. To use Ghostscript to create a PDF/A, we need to create a pdfmark file with the necessary metadata.

This function takes care of the many version-specific bugs and peculiarities in Ghostscript’s handling of pdfmark.

The only information we put in specifies that we want the file to be a PDF/A, and that we want Ghostscript to convert objects to the sRGB colorspace if it runs into any object that it decides must be converted.

Parameters:
  • target_filename – filename to save

  • icc – ICC identifier such as ‘sRGB’

References

Adobe PDFMARK Reference: https://opensource.adobe.com/dc-acrobat-sdk-docs/library/pdfmark/

ocrmypdf.pdfa.speculative_pdfa_conversion(input_file: Path, output_file: Path, output_type: str) Path

Attempt to convert a PDF to PDF/A by adding required structures.

This function creates a copy of the input PDF and adds:

  1. sRGB ICC profile as OutputIntent

  2. XMP metadata declaring PDF/A conformance

This approach works for PDFs that are already mostly PDF/A compliant but lack the formal declarations. It does NOT perform color conversion, font embedding, or other transformations that Ghostscript does.

Parameters:
  • input_file – Path to input PDF

  • output_file – Path where output PDF should be written

  • output_type – One of ‘pdfa’, ‘pdfa-1’, ‘pdfa-2’, ‘pdfa-3’

Returns:

Path to the output file

Raises:

pikepdf.PdfError – If the PDF cannot be opened or modified

ocrmypdf.quality

Utilities to measure OCR quality.

class ocrmypdf.quality.OcrQualityDictionary(*, wordlist: Iterable[str])

Manages a dictionary for simple OCR quality checks.

measure_words_matched(ocr_text: str) float

Check how many unique words in the OCR text match a dictionary.

Words with mixed capitalization are only considered a match if the test word matches that capitalization.

Returns:

Number of unique words that match divided by the number of unique words in the OCR text.
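
The matching rule can be sketched in plain Python as follows. This is a stand-in written from the description above, not the OcrQualityDictionary source: the class name WordlistQuality is hypothetical, and the tokenization (runs of letters, with digits and punctuation acting as separators) is an assumption.

```python
import re
from collections.abc import Iterable


class WordlistQuality:
    """Hypothetical stand-in, not the OcrQualityDictionary implementation."""

    def __init__(self, *, wordlist: Iterable[str]) -> None:
        self.dictionary = set(wordlist)

    def measure_words_matched(self, ocr_text: str) -> float:
        # Assumption: a word is a run of letters; digits and punctuation split words.
        unique_words = set(re.findall(r"[^\W\d_]+", ocr_text))
        if not unique_words:
            return 0.0
        matches = sum(
            1
            for word in unique_words
            # Exact match first, then the lowercase form: "Quick" matches the
            # entry "quick", but "iphone" does not match the entry "iPhone".
            if word in self.dictionary or word.lower() in self.dictionary
        )
        return matches / len(unique_words)


quality = WordlistQuality(wordlist=["quick", "iPhone"])
ratio = quality.measure_words_matched("quick iPhone iphone zzqx")
```

Here two of the four unique words match, so the ratio is 0.5. A low ratio on a page suggests the OCR text is mostly noise or is in a language the word list does not cover.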

ocrmypdf.subprocess

Wrappers to manage subprocess calls.

This package is split into three private submodules by concern:

  • ocrmypdf.subprocess._run - low-level execution wrappers (run, run_polling_stderr) that add OCRmyPDF-aware logging and Windows PATH resolution. Useful as drop-in replacements for subprocess.run().

  • ocrmypdf.subprocess._version - version probing (get_version).

  • ocrmypdf.subprocess._check - startup validation (check_external_program) with platform-aware error messages.

The names below are the stable public API. Importing from the private submodules directly is not supported for external code.

ocrmypdf.subprocess.check_external_program(*, program: str, package: str, version_checker: Callable[[], Version], need_version: str | Version, required_for: str | None = None, recommended: bool = False, version_parser: type[Version] = Version) None

Check for required version of external program and raise exception if not.

Parameters:
  • program – The name of the program to test.

  • package – The name of a software package that typically supplies this program. Usually the same as program.

  • version_checker – A callable without arguments that retrieves the installed version of program.

  • need_version – The minimum required version.

  • required_for – The name of an argument or feature that requires this program.

  • recommended – If this external program is recommended, instead of raising an exception, log a warning and allow execution to continue.

  • version_parser – A class that should be used to parse and compare version numbers. Used when version numbers do not follow standard conventions.
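
The interplay of version_checker, need_version and recommended can be sketched with the standard library alone. Everything here is illustrative: check_program, parse_version and ProgramTooOldError are hypothetical names, the dotted-integer parser is a simplified stand-in for packaging.version.Version, and the exception OCRmyPDF actually raises is not specified above.

```python
import logging
from collections.abc import Callable
from typing import Optional

log = logging.getLogger(__name__)


class ProgramTooOldError(RuntimeError):
    """Hypothetical exception used only in this sketch."""


def parse_version(text: str) -> tuple[int, ...]:
    # Simplified stand-in for packaging.version.Version: dotted integers only.
    parts = [int(part) for part in text.split(".")]
    # Drop trailing zeros so that "5" and "5.0" compare as equal.
    while len(parts) > 1 and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def check_program(
    *,
    program: str,
    version_checker: Callable[[], str],
    need_version: str,
    required_for: Optional[str] = None,
    recommended: bool = False,
) -> None:
    found = version_checker()
    if parse_version(found) >= parse_version(need_version):
        return
    message = f"{program} {found} is older than the required {need_version}"
    if required_for:
        message += f" (needed for {required_for})"
    if recommended:
        # Recommended programs only produce a warning; execution continues.
        log.warning(message)
        return
    raise ProgramTooOldError(message)


# A fake checker stands in for probing a real executable.
check_program(program="exampletool", version_checker=lambda: "1.10", need_version="1.9")
```

Passing the checker as a callable defers the probe until the check actually runs, and comparing parsed versions avoids the string-comparison trap where "1.10" sorts before "1.9".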

ocrmypdf.subprocess.get_version(program: str, *, version_arg: str = '--version', regex='(\\d+(\\.\\d+)*)', env: Mapping[str, str] | _Environ | None = None) str

Get the version of the specified program.

Parameters:
  • program – The program to version check.

  • version_arg – The argument needed to ask for its version, e.g. --version.

  • regex – A regular expression to parse the program’s output and obtain the version.

  • env – Custom os.environ in which to run program.
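
As a rough illustration of how the default regex picks a version number out of a program's output, consider the sketch below. The helpers parse_version_output and probe_version are hypothetical and only approximate what get_version is described as doing; the sample strings in the usage are made up.

```python
import re
import subprocess

DEFAULT_REGEX = r"(\d+(\.\d+)*)"  # default shown in the get_version signature


def parse_version_output(output: str, regex: str = DEFAULT_REGEX) -> str:
    """Return group 1 of the first regex match in a program's version output."""
    match = re.search(regex, output)
    if match is None:
        raise ValueError(f"no version number found in {output!r}")
    return match.group(1)


def probe_version(program: str, version_arg: str = "--version") -> str:
    """Run the program with its version argument and parse what it prints."""
    proc = subprocess.run(
        [program, version_arg],
        stdout=subprocess.PIPE,
        # Merge stderr into stdout because some programs print the version there.
        stderr=subprocess.STDOUT,
        text=True,
        check=True,
    )
    return parse_version_output(proc.stdout)


# Made-up output from a hypothetical program.
sample = parse_version_output("exampletool 1.2.3 (build 77)")
```

The first run of dotted digits wins, so sample is "1.2.3" and the trailing build number is ignored. A program that prints a date or a copyright year before its version would need a custom regex.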

ocrmypdf.subprocess.run(args: Sequence[Path | str], *, env: Mapping[str, str] | _Environ | None = None, logs_errors_to_stdout: bool = False, check: bool = False, **kwargs) CompletedProcess

Wrapper around subprocess.run().

The main purpose of this wrapper is to log subprocess output in an orderly fashion that identifies the responsible subprocess. In addition, on Windows it goes to greater lengths to locate our dependencies when they are not on the system PATH.

Arguments should be identical to subprocess.run, except for the following:

Parameters:
  • args – Positional arguments to pass to subprocess.run.

  • env – A set of environment variables. If None, the OS environment is used.

  • logs_errors_to_stdout – If True, indicates that the process writes its error messages to stdout rather than stderr, so stdout should be logged if there is an error. If False, stderr is logged. Could be used with stderr=STDOUT, stdout=PIPE for example.

  • check – If True, raise an exception if the process exits with a non-zero status code. If False, the return value will indicate success or failure.

  • kwargs – Additional arguments to pass to subprocess.run.
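
The effect of logs_errors_to_stdout and check can be sketched with a small wrapper over subprocess.run. The function run_logged is hypothetical and much simpler than ocrmypdf.subprocess.run (it does no Windows PATH resolution); it only shows which stream gets logged when the child fails.

```python
import logging
import subprocess
import sys
from collections.abc import Sequence

log = logging.getLogger(__name__)


def run_logged(
    args: Sequence[str],
    *,
    logs_errors_to_stdout: bool = False,
    check: bool = False,
    **kwargs,
) -> subprocess.CompletedProcess:
    """Hypothetical wrapper: run a program and log the stream that carries its errors."""
    kwargs.setdefault("stdout", subprocess.PIPE)
    kwargs.setdefault("stderr", subprocess.PIPE)
    kwargs.setdefault("text", True)
    # Always run with check=False so the output can be logged before raising.
    proc = subprocess.run(args, check=False, **kwargs)
    if proc.returncode != 0:
        stream = proc.stdout if logs_errors_to_stdout else proc.stderr
        if stream:
            log.error("%s: %s", args[0], stream.strip())
        if check:
            raise subprocess.CalledProcessError(
                proc.returncode, args, output=proc.stdout, stderr=proc.stderr
            )
    return proc


# The running interpreter serves as a child process that is always available.
result = run_logged([sys.executable, "-c", "print('hello')"])
```

With check=False the caller inspects result.returncode; with check=True a failure surfaces as subprocess.CalledProcessError after the relevant stream has been logged.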

ocrmypdf.subprocess.run_polling_stderr(args: Sequence[Path | str], *, callback: Callable[[str], None], check: bool = False, env: Mapping[str, str] | _Environ | None = None, **kwargs) CompletedProcess

Run a process like ocrmypdf.subprocess.run, and poll stderr.

Every line produced on stderr will be forwarded to the callback function. The intended use is monitoring progress of subprocesses that output their own progress indicators. In addition, each line will be logged if debug logging is enabled.

Requires stderr to be opened in text mode for ease of handling errors. In addition the expected encoding= and errors= arguments should be set. Note that if stdout is already set up, it need not be binary.
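
The polling pattern described here can be sketched with subprocess.Popen. The helper poll_stderr is hypothetical and is not how run_polling_stderr is implemented; it simply opens stderr in text mode with explicit encoding and errors settings and hands each line to a callback.

```python
import subprocess
import sys
from collections.abc import Callable, Sequence


def poll_stderr(
    args: Sequence[str],
    *,
    callback: Callable[[str], None],
    encoding: str = "utf-8",
    errors: str = "replace",
) -> int:
    """Hypothetical helper: forward each stderr line to callback, return the exit code."""
    with subprocess.Popen(
        args,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True,
        encoding=encoding,
        errors=errors,
    ) as proc:
        assert proc.stderr is not None  # guaranteed by stderr=PIPE
        # Iterating the pipe yields lines as the child writes them.
        for line in proc.stderr:
            callback(line.rstrip("\n"))
        return proc.wait()


progress: list[str] = []
child = "import sys; print('50%', file=sys.stderr); print('100%', file=sys.stderr)"
exit_code = poll_stderr([sys.executable, "-c", child], callback=progress.append)
```

Setting errors="replace" keeps a stray undecodable byte in the child's output from aborting the loop, which is one reason to fix the encoding arguments up front.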