Using the OCRmyPDF API

OCRmyPDF originated as a command line program and still reflects that heritage, but parts of it can be imported and used in other Python applications.

Some applications may still prefer to run ocrmypdf through a subprocess call, because this isolates its activity from the calling process.
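As a minimal sketch of the subprocess approach, the following assembles a command line and runs it with the standard library. The helper names build_ocrmypdf_command and run_ocrmypdf are hypothetical, and the sketch assumes the ocrmypdf executable is on PATH; only the --deskew and -l flags are used.

```python
import subprocess
from typing import Sequence


def build_ocrmypdf_command(
    input_file: str,
    output_file: str,
    *,
    deskew: bool = False,
    languages: Sequence[str] = (),
) -> list[str]:
    """Assemble an argument list for the ocrmypdf command line program."""
    cmd = ["ocrmypdf"]  # assumption: the executable is on PATH
    if deskew:
        cmd.append("--deskew")
    if languages:
        # Multiple languages are joined with '+', e.g. "eng+deu".
        cmd += ["-l", "+".join(languages)]
    cmd += [input_file, output_file]
    return cmd


def run_ocrmypdf(cmd: Sequence[str]) -> int:
    """Run the command and return its exit code (0 means success)."""
    # check=False: the caller inspects the return code instead of
    # handling a CalledProcessError.
    completed = subprocess.run(cmd, check=False)
    return completed.returncode
```

A caller might use run_ocrmypdf(build_ocrmypdf_command('input.pdf', 'output.pdf', deskew=True)) and treat any nonzero return value as a failure.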

Example

OCRmyPDF provides one high-level function to run its main engine from an application.

Changed in version 17.0: The ocrmypdf.ocr() function now accepts an OcrOptions object as its first argument, providing a cleaner API with full type hints and validation. The previous positional argument style remains supported.

Modern API (recommended)

The recommended way to call ocrmypdf.ocr() is to construct an OcrOptions object with all settings, then pass it as the sole argument:

import ocrmypdf
from ocrmypdf import OcrOptions

if __name__ == '__main__':  # To ensure correct behavior on Windows and macOS
    options = OcrOptions(
        input_file='input.pdf',
        output_file='output.pdf',
        deskew=True,
        languages=['eng'],
    )
    ocrmypdf.ocr(options)

OcrOptions is a Pydantic model that provides:

Full type hints and IDE autocompletion
Validation of option values at construction time
Clear documentation of all available options

Added in version 17.0: The OcrOptions class is now exported from the top-level ocrmypdf module.

Legacy API

For compatibility with OCRmyPDF < v17, the traditional calling style with positional arguments is still fully supported:

import ocrmypdf

if __name__ == '__main__':  # To ensure correct behavior on Windows and macOS
    ocrmypdf.ocr('input.pdf', 'output.pdf', deskew=True)

With this style, all of the command line arguments are available and may be passed as equivalent keywords.

One difference is that verbose and quiet are not available; instead, manage output by configuring logging.

Parent process requirements

The ocrmypdf.ocr() function runs OCRmyPDF much as the command line program does. To do this, it will:

create worker processes or threads
manage the signal flags of its worker processes
execute other subprocesses (forking and executing other programs)

The Python process that calls ocrmypdf.ocr() must be sufficiently privileged to perform these actions.

There is currently no way to control how jobs are scheduled, other than the jobs= argument, which limits the number of worker processes.

Creating a child process to call ocrmypdf.ocr() is suggested. That way your application will survive and remain interactive even if OCRmyPDF fails for any reason. For example:

from multiprocessing import Process
import ocrmypdf
from ocrmypdf import OcrOptions

def ocrmypdf_process():
    options = OcrOptions(input_file='input.pdf', output_file='output.pdf')
    ocrmypdf.ocr(options)

def call_ocrmypdf_from_my_app():
    p = Process(target=ocrmypdf_process)
    p.start()
    p.join()

Programs that call ocrmypdf.ocr() should also install a SIGBUS signal handler (except on Windows), to raise an exception if access to a memory mapped file fails. OCRmyPDF may use memory mapping.

ocrmypdf.ocr() takes a threading lock to prevent multiple runs of itself in the same Python interpreter process. The function is not thread-safe, because of how OCRmyPDF’s plugins and Python’s library import system work. If you need to parallelize OCRmyPDF, use processes.

Warning

On Windows and macOS, the script that calls ocrmypdf.ocr() must be protected by an “ifmain” guard (if __name__ == '__main__'). Without this guard, process semantics will prevent OCRmyPDF from working correctly.

Logging

OCRmyPDF logs under loggers named ocrmypdf. It also imports pdfminer and PIL, which post log messages under the pdfminer and PIL logging namespaces, respectively.

You can configure the logging as desired for your application or call ocrmypdf.configure_logging() to configure logging the same way OCRmyPDF itself does. The command line parameters such as --quiet and --verbose have no equivalents in the API; you must use the provided configuration function or do configuration in a way that suits your use case.
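For applications that prefer their own configuration, a minimal sketch with the standard logging module might look like the following. The function name configure_app_logging, the message format, and the chosen levels are illustrative assumptions; only the logger names ocrmypdf, pdfminer and PIL come from the text above.

```python
import logging
import sys


def configure_app_logging(ocr_level: int = logging.INFO) -> None:
    """Route OCRmyPDF messages to stderr and quieten its dependencies."""
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(
        logging.Formatter("%(name)s %(levelname)s: %(message)s")
    )

    ocr_log = logging.getLogger("ocrmypdf")
    ocr_log.setLevel(ocr_level)
    ocr_log.addHandler(handler)
    # Avoid duplicate lines if the root logger also has a handler.
    ocr_log.propagate = False

    # Third-party libraries log under their own namespaces.
    for name in ("pdfminer", "PIL"):
        logging.getLogger(name).setLevel(logging.WARNING)
```

Passing logging.DEBUG or logging.ERROR as ocr_level roughly plays the role that --verbose and --quiet play on the command line.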

Progress monitoring

OCRmyPDF uses the rich package to implement its progress bars. ocrmypdf.configure_logging() will set up logging output to sys.stderr in a way that is compatible with the display of the progress bar. Use ocrmypdf.ocr(..., progress_bar=False) to disable the progress bar.

Standard output

OCRmyPDF is strict about not writing to standard output so that users can safely use it in a pipeline and produce a valid output file. A caller application will have to ensure it does not write to standard output either, if it wants to be compatible with this behavior and support piping to a file. Another benefit of running OCRmyPDF in a child process, as recommended above, is that it will not interfere with the parent process’s standard output.

Exceptions

OCRmyPDF may raise standard Python exceptions, ocrmypdf.exceptions.* exceptions, some exceptions related to multiprocessing, and KeyboardInterrupt. The parent process should provide an exception handler. OCRmyPDF will clean up its temporary files and worker processes automatically when an exception occurs.

When OCRmyPDF succeeds conditionally, it returns an integer exit code.
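The exception handler described above could be sketched as a small generic wrapper. The names run_guarded and myapp are hypothetical, and the wrapper catches the broad Exception base class on the assumption that the library's own exceptions derive from it; an application that needs finer distinctions would catch specific classes instead.

```python
import logging
from typing import Any, Callable, Optional, Tuple

log = logging.getLogger("myapp")  # hypothetical application logger


def run_guarded(
    run: Callable[..., Any], *args: Any, **kwargs: Any
) -> Tuple[bool, Optional[Any]]:
    """Call run(*args, **kwargs) and return (succeeded, result)."""
    try:
        return True, run(*args, **kwargs)
    except KeyboardInterrupt:
        # KeyboardInterrupt is not a subclass of Exception,
        # so it needs its own clause.
        log.warning("OCR interrupted by user")
        return False, None
    except Exception:
        # Assumption: library and multiprocessing errors derive
        # from Exception; log the traceback and keep running.
        log.exception("OCR failed")
        return False, None
```

In an application this might be called as run_guarded(ocrmypdf.ocr, options), with the second element of the returned tuple holding whatever ocrmypdf.ocr() returned.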

Plugin Development Changes

Changed in version 16.13: Plugin hooks now receive OcrOptions objects instead of argparse.Namespace.

OcrOptions provides the same attribute access as Namespace (duck-typing compatible)
Plugin developers should update type hints: from ocrmypdf import OcrOptions
Built-in plugins no longer modify options in-place for better immutability

Most existing plugins will continue working without modification due to the duck-typing compatibility between OcrOptions and Namespace.