Using the OCRmyPDF API

OCRmyPDF originated as a command line program and retains that character, but parts of it can be imported and used in other Python applications.

Some applications may want to consider running ocrmypdf from a subprocess call anyway, as this provides isolation of its activities.
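As a minimal sketch of the subprocess approach, the helper below builds an ocrmypdf command line and runs it with the standard library's subprocess module. The function names build_command and run_ocrmypdf_cli are hypothetical; the sketch assumes the ocrmypdf executable is on PATH and uses only its --deskew and --jobs options.

```python
import subprocess


def build_command(input_pdf, output_pdf, deskew=False, jobs=None):
    """Build an argument list for the ocrmypdf command line program."""
    cmd = ["ocrmypdf"]
    if deskew:
        cmd.append("--deskew")
    if jobs is not None:
        cmd.extend(["--jobs", str(jobs)])
    # Positional arguments come last: input file, then output file.
    cmd.extend([str(input_pdf), str(output_pdf)])
    return cmd


def run_ocrmypdf_cli(input_pdf, output_pdf, **options):
    """Run ocrmypdf in a separate process and return its exit status."""
    completed = subprocess.run(build_command(input_pdf, output_pdf, **options))
    return completed.returncode
```

Because the work happens in another process, a crash inside OCRmyPDF shows up as a nonzero return value from run_ocrmypdf_cli rather than taking down the caller.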

Example

OCRmyPDF provides one high-level function to run its main engine from an application. The parameters are symmetric to the command line arguments and largely have the same functions.

import ocrmypdf

ocrmypdf.ocr('input.pdf', 'output.pdf', deskew=True)

With a few exceptions, all of the command line arguments are available and may be passed as equivalent keywords.

One difference is that verbose and quiet are not available. Instead, output should be managed by configuring logging.

Parent process requirements

The ocrmypdf.ocr() function runs OCRmyPDF similar to command line execution. To do this, it will:

create a monitoring thread
create worker processes (forking itself)
manage the signal flags of worker processes
execute other subprocesses (forking and executing other programs)

The Python process that calls ocrmypdf.ocr() must be sufficiently privileged to perform these actions. If it is not, ocrmypdf.ocr() will fail.

There is currently no option to manage how jobs are scheduled other than the argument jobs=, which limits the number of worker processes.

Forking a child process to call ocrmypdf.ocr() is suggested. That way your application will survive and remain interactive even if OCRmyPDF does not.
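One way to get this isolation, sketched here, is to start a second Python interpreter instead of calling os.fork() directly. The helper run_in_child and the CHILD_CODE snippet are hypothetical names, and the snippet assumes the value returned by ocrmypdf.ocr() can be converted with int().

```python
import subprocess
import sys

# Code executed by the child interpreter; paths arrive through sys.argv.
CHILD_CODE = """
import sys
import ocrmypdf

if __name__ == '__main__':
    # Assumption: the return value of ocrmypdf.ocr() converts to an int.
    sys.exit(int(ocrmypdf.ocr(sys.argv[1], sys.argv[2], deskew=True)))
"""


def run_in_child(code, *args, timeout=None):
    """Run Python source in a separate interpreter and return its exit status."""
    completed = subprocess.run([sys.executable, "-c", code, *args], timeout=timeout)
    return completed.returncode
```

Calling run_in_child(CHILD_CODE, 'input.pdf', 'output.pdf') would then run the OCR job in the child; if the child crashes, the parent only sees a nonzero (or negative, when killed by a signal) return value.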

Programs that call ocrmypdf.ocr() should also install a SIGBUS signal handler (except on Windows), to raise an exception if access to a memory mapped file fails. OCRmyPDF may use memory mapping.

Warning

On Windows, the script that calls ocrmypdf.ocr() must be protected by an “ifmain” guard (if __name__ == '__main__') or you must use ocrmypdf.ocr(...use_threads=True). If you do not take at least one of these steps, Windows process semantics will prevent OCRmyPDF from working correctly.

Logging

OCRmyPDF will log under loggers named ocrmypdf. In addition, it imports pdfminer and PIL, both of which post log messages under those logging namespaces.

You can configure the logging as desired for your application or call ocrmypdf.configure_logging() to configure logging the same way OCRmyPDF itself does. The command line parameters such as --quiet and --verbose have no equivalents in the API; you must use the provided configuration function or do configuration in a way that suits your use case.
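For applications that prefer to configure logging themselves, a sketch using only the standard logging module could look like this. The function name configure_ocr_logging, the message format, and the default levels are illustrative assumptions, not OCRmyPDF's own settings.

```python
import logging
import sys


def configure_ocr_logging(level=logging.INFO, noisy_level=logging.WARNING):
    """Attach a stderr handler to the ocrmypdf logger and quiet its dependencies."""
    ocr_log = logging.getLogger("ocrmypdf")
    ocr_log.setLevel(level)
    if not ocr_log.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(name)s %(levelname)s: %(message)s")
        )
        ocr_log.addHandler(handler)
    # pdfminer and PIL log under their own namespaces.
    for name in ("pdfminer", "PIL"):
        logging.getLogger(name).setLevel(noisy_level)
    return ocr_log
```

Raising the level on the pdfminer and PIL loggers keeps their messages from drowning out those under the ocrmypdf namespace.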

Progress monitoring

OCRmyPDF uses the tqdm package to implement its progress bars. ocrmypdf.configure_logging() will set up logging output to sys.stderr in a way that is compatible with the display of the progress bar. Use ocrmypdf.ocr(...progress_bar=False) to disable the progress bar.

Exceptions

OCRmyPDF may throw standard Python exceptions, ocrmypdf.exceptions.* exceptions, some exceptions related to multiprocessing, and KeyboardInterrupt. The parent process should provide an exception handler. OCRmyPDF will clean up its temporary files and worker processes automatically when an exception occurs.

Programs that call OCRmyPDF should consider trapping KeyboardInterrupt so that OCR can be terminated without the whole program terminating.
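The exception handling described above can be sketched as a small wrapper. The name run_ocr_safely and the "myapp" logger are hypothetical; the OCR function is passed in as an argument so the wrapper itself does not depend on how OCRmyPDF is imported.

```python
import logging

log = logging.getLogger("myapp")  # hypothetical application logger


def run_ocr_safely(ocr_func, *args, **kwargs):
    """Call ocr_func and report (succeeded, result_or_exception)."""
    try:
        return True, ocr_func(*args, **kwargs)
    except KeyboardInterrupt as exc:
        # Ctrl+C stops the OCR job only; the caller keeps running.
        log.warning("OCR interrupted")
        return False, exc
    except Exception as exc:
        # Covers standard exceptions and library-specific ones alike.
        log.error("OCR failed: %s", exc)
        return False, exc
```

A call such as run_ocr_safely(ocrmypdf.ocr, 'input.pdf', 'output.pdf', deskew=True) would then return a pair instead of propagating the exception. KeyboardInterrupt needs its own clause because it does not derive from Exception.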

When OCRmyPDF succeeds conditionally, it returns an integer exit code.

Reference

ocrmypdf.ocr(input_file: Union[BinaryIO, os.PathLike, str, bytes], output_file: Union[BinaryIO, os.PathLike, str, bytes], *, language: Iterable[str] = None, image_dpi: int = None, output_type=None, sidecar: os.PathLike = None, jobs: int = None, use_threads: bool = None, title: str = None, author: str = None, subject: str = None, keywords: str = None, rotate_pages: bool = None, remove_background: bool = None, deskew: bool = None, clean: bool = None, clean_final: bool = None, unpaper_args: str = None, oversample: int = None, remove_vectors: bool = None, threshold: bool = None, force_ocr: bool = None, skip_text: bool = None, redo_ocr: bool = None, skip_big: float = None, optimize: int = None, jpg_quality: int = None, png_quality: int = None, jbig2_lossy: bool = None, jbig2_page_group_size: int = None, pages: str = None, max_image_mpixels: float = None, tesseract_config: Iterable[str] = None, tesseract_pagesegmode: int = None, tesseract_oem: int = None, pdf_renderer=None, tesseract_timeout: float = None, rotate_pages_threshold: float = None, pdfa_image_compression=None, user_words: os.PathLike = None, user_patterns: os.PathLike = None, fast_web_view: float = None, plugins: Iterable[str] = None, keep_temporary_files: bool = None, progress_bar: bool = None, **kwargs)

Run OCRmyPDF on one PDF or image.

For most arguments, see documentation for the equivalent command line parameter. A few specific arguments are discussed here:

Parameters:
use_threads – Use worker threads instead of processes. This reduces performance but may make debugging easier since it is easier to set breakpoints.
input_file – If a `pathlib.Path`, `str` or `bytes`, this is interpreted as file system path to the input file. If the object appears to be a readable stream (with methods such as `.read()` and `.seek()`), the object will be read in its entirety and saved to a temporary file. If `input_file` is `"-"`, standard input will be read.
output_file – If a `pathlib.Path`, `str` or `bytes`, this is interpreted as file system path to the output file. If the object appears to be a writable stream (with methods such as `.write()` and `.seek()`), the output will be written to this stream. If `output_file` is `"-"`, the output will be written to `sys.stdout` (provided that standard output does not seem to be a terminal device). When a stream is used as output, whether via a writable object or `"-"`, some final validation steps are not performed (we do not read back the stream after it is written).

Raises:
`ocrmypdf.PdfMergeFailedError` – If the input PDF is malformed, preventing merging with the OCR layer.
`ocrmypdf.MissingDependencyError` – If a required dependency program is missing or was not found on PATH.
`ocrmypdf.UnsupportedImageFormatError` – If the input file type was an image that could not be read, or some other file type that is not a PDF.
`ocrmypdf.DpiError` – If the input file is an image, but the resolution of the image is not credible (allowing it to proceed would cause poor OCR).
`ocrmypdf.OutputFileAccessError` – If an attempt to write to the intended output file failed.
`ocrmypdf.PriorOcrFoundError` – If the input PDF seems to have OCR or digital text already, and settings did not tell us to proceed.
`ocrmypdf.InputFileError` – Any other problem with the input file.
`ocrmypdf.SubprocessOutputError` – Any error related to executing a subprocess.
`ocrmypdf.EncryptedPdfError` – If the input PDF is encrypted (password protected). OCRmyPDF does not remove passwords.
`ocrmypdf.TesseractConfigError` – If Tesseract reported its configuration was not valid.

Returns:	`ocrmypdf.ExitCode`

class ocrmypdf.Verbosity

Verbosity level for configure_logging.

debug = 1: Output ocrmypdf debug messages

debug_all = 2: More detailed debugging from ocrmypdf and dependent modules

default = 0: Default level of logging

quiet = -1: Suppress most messages

ocrmypdf.configure_logging(verbosity: ocrmypdf.api.Verbosity, progress_bar_friendly: bool = True, manage_root_logger: bool = False)

Set up logging.

Library users may wish to use this function if they want their log output to be similar to that of the ocrmypdf command line interface. If it is not used, the external application should configure logging on its own.

ocrmypdf will perform all of its logging under the "ocrmypdf" logging namespace. In addition, ocrmypdf imports pdfminer, which logs under "pdfminer". A library user may wish to configure both; note that pdfminer is extremely chatty at the log level logging.INFO.

Library users may perform additional configuration afterwards.

Parameters:
verbosity (Verbosity) – Verbosity level.
progress_bar_friendly (bool) – Install the TqdmConsole log handler, which is compatible with the tqdm progress bar; without this, log messages will overwrite the progress bar.
manage_root_logger (bool) – Configure the process’s root logger, to ensure all log output is sent through

Returns:	The toplevel logger for ocrmypdf (or the root logger, if we are managing it).