Using the OCRmyPDF API
OCRmyPDF originated as a command line program and continues to have this legacy, but parts of it can be imported and used in other Python applications.
Some applications may want to consider running ocrmypdf from a subprocess call anyway, as this provides isolation of its activities.
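As a rough sketch of the subprocess approach, the helpers below launch a command in a separate process and report its exit code and captured standard error. The names `build_ocrmypdf_command` and `run_isolated` are illustrative and not part of OCRmyPDF; the sketch assumes OCRmyPDF is installed for the same interpreter, so that `python -m ocrmypdf` resolves.

```python
import subprocess
import sys


def build_ocrmypdf_command(input_file, output_file, *options):
    """Build an argument list that runs the ocrmypdf command line in a new interpreter.

    Hypothetical helper: assumes ocrmypdf is installed for sys.executable.
    """
    return [sys.executable, "-m", "ocrmypdf", *options, str(input_file), str(output_file)]


def run_isolated(command, timeout=None):
    """Run a command in a child process; return (exit code, captured stderr)."""
    completed = subprocess.run(
        command,
        capture_output=True,  # keep the child's output away from our own streams
        text=True,
        timeout=timeout,
        check=False,  # a failed OCR run should not raise in the caller
    )
    return completed.returncode, completed.stderr
```

For example, `run_isolated(build_ocrmypdf_command('input.pdf', 'output.pdf', '--deskew'))` would return a nonzero exit code if OCR fails, while the calling application keeps running.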
OCRmyPDF provides one high-level function to run its main engine from an application. The parameters are symmetric to the command line arguments and largely have the same functions.
    import ocrmypdf

    if __name__ == '__main__':  # To ensure correct behavior on Windows and macOS
        ocrmypdf.ocr('input.pdf', 'output.pdf', deskew=True)
With some exceptions, all of the command line arguments are available and may be passed as equivalent keywords.
A few differences are that verbose and quiet are not available.
Instead, output should be managed by configuring logging.
Parent process requirements
The ocrmypdf.ocr() function runs OCRmyPDF similarly to command line
execution. To do this, it will:
- create a monitoring thread
- create worker processes (on Linux, forking itself; on Windows and macOS, by spawning)
- manage the signal flags of its worker processes
- execute other subprocesses (forking and executing other programs)
The Python process that calls
ocrmypdf.ocr() must be sufficiently
privileged to perform these actions.
There is currently no option to manage how jobs are scheduled other
than the argument jobs=, which will limit the number of worker
processes.
Creating a child process to call
ocrmypdf.ocr() is suggested. That
way your application will survive and remain interactive even if
OCRmyPDF fails for any reason.
Programs that call
ocrmypdf.ocr() should also install a SIGBUS signal
handler (except on Windows), to raise an exception if access to a memory
mapped file fails. OCRmyPDF may use memory mapping.
ocrmypdf.ocr() will take a threading lock to prevent multiple runs of itself
in the same Python interpreter process. Running it concurrently from several
threads is not safe, because of how OCRmyPDF’s plugins and Python’s library
import system work. If you need to parallelize OCRmyPDF, use processes.
On Windows and macOS, the script that calls
ocrmypdf.ocr() must be
protected by an “ifmain” guard (
if __name__ == '__main__'). If you do
not take at least one of these steps, process semantics will prevent
OCRmyPDF from working correctly.
OCRmyPDF will log under loggers named
ocrmypdf. In addition, it imports pdfminer and
PIL, both of which post log messages under
those logging namespaces.
You can configure the logging as desired for your application or call
ocrmypdf.configure_logging() to configure logging the same way
OCRmyPDF itself does. The command line parameters such as
--verbose have no equivalents in the API; you must use the
provided configuration function or do configuration in a way that suits
your use case.
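
For an application that prefers to manage logging itself, a sketch using only the standard library might look like the following. The function name, the format string, and the chosen default levels are assumptions for illustration; the logger names are the `ocrmypdf` namespace plus `pdfminer` and `PIL`, the libraries that post under their own namespaces.

```python
import logging


def configure_ocr_logging(level=logging.INFO, dependency_level=logging.WARNING):
    """Attach one stderr handler to the 'ocrmypdf' logger and quiet its dependencies."""
    handler = logging.StreamHandler()  # writes to sys.stderr by default
    handler.setFormatter(logging.Formatter("%(name)s %(levelname)s: %(message)s"))

    ocr_log = logging.getLogger("ocrmypdf")
    ocr_log.setLevel(level)
    ocr_log.addHandler(handler)
    ocr_log.propagate = False  # avoid duplicates if the root logger also has handlers

    # Third party namespaces: keep them quieter than OCRmyPDF itself.
    for name in ("pdfminer", "PIL"):
        logging.getLogger(name).setLevel(dependency_level)
    return ocr_log
```

Because child loggers inherit from `ocrmypdf`, setting the level once on the parent is usually sufficient.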
OCRmyPDF uses the
tqdm package to implement its progress bars.
ocrmypdf.configure_logging() will set up logging output to
sys.stderr in a way that is compatible with the display of the
progress bar. Use
ocrmypdf.ocr(..., progress_bar=False) to disable
the progress bar.
OCRmyPDF may throw standard Python exceptions, ocrmypdf.exceptions.*
exceptions, some exceptions related to multiprocessing, and
KeyboardInterrupt. The parent process should provide an exception
handler. OCRmyPDF will clean up its temporary files and worker processes
automatically when an exception occurs.
Programs that call OCRmyPDF should consider trapping KeyboardInterrupt so that they allow OCR to terminate without the whole program terminating.
When OCRmyPDF succeeds conditionally, it returns an integer exit code.
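
One way to structure the exception handler described above is a small wrapper that turns each outcome into a status value. This is a sketch, not OCRmyPDF's own mechanism: `run_ocr_safely` and its status strings are hypothetical, and the OCR function is passed in as an argument (in practice `ocrmypdf.ocr`) so the wrapper can be exercised with a stand-in.

```python
def run_ocr_safely(ocr_func, *args, **kwargs):
    """Call ocr_func and report the outcome instead of letting the program die.

    ocr_func is expected to be ocrmypdf.ocr; it is passed in so the wrapper
    itself has no hard dependency. Returns a (status, detail) tuple.
    """
    try:
        exit_code = ocr_func(*args, **kwargs)
    except KeyboardInterrupt:
        # Ctrl+C stops this OCR job only; the application keeps running.
        return "interrupted", None
    except Exception as exc:  # broad on purpose: OCRmyPDF, multiprocessing, stdlib errors
        return "failed", exc
    return "ok", exit_code
```

A caller would then write something like `status, detail = run_ocr_safely(ocrmypdf.ocr, 'input.pdf', 'output.pdf')` and branch on `status`.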
ocr(
    input_file: PathOrIO,
    output_file: PathOrIO,
    *,
    language: Iterable[str] | None = None,
    image_dpi: int | None = None,
    output_type: str | None = None,
    sidecar: StrPath | None = None,
    jobs: int | None = None,
    use_threads: bool | None = None,
    title: str | None = None,
    author: str | None = None,
    subject: str | None = None,
    keywords: str | None = None,
    rotate_pages: bool | None = None,
    remove_background: bool | None = None,
    deskew: bool | None = None,
    clean: bool | None = None,
    clean_final: bool | None = None,
    unpaper_args: str | None = None,
    oversample: int | None = None,
    remove_vectors: bool | None = None,
    force_ocr: bool | None = None,
    skip_text: bool | None = None,
    redo_ocr: bool | None = None,
    skip_big: float | None = None,
    optimize: int | None = None,
    jpg_quality: int | None = None,
    png_quality: int | None = None,
    jbig2_lossy: bool | None = None,
    jbig2_page_group_size: int | None = None,
    pages: str | None = None,
    max_image_mpixels: float | None = None,
    tesseract_config: Iterable[str] | None = None,
    tesseract_pagesegmode: int | None = None,
    tesseract_oem: int | None = None,
    tesseract_thresholding: int | None = None,
    pdf_renderer: str | None = None,
    tesseract_timeout: float | None = None,
    tesseract_non_ocr_timeout: float | None = None,
    rotate_pages_threshold: float | None = None,
    pdfa_image_compression: str | None = None,
    user_words: os.PathLike | None = None,
    user_patterns: os.PathLike | None = None,
    fast_web_view: float | None = None,
    continue_on_soft_render_error: bool | None = None,
    plugins: Iterable[StrPath] | None = None,
    plugin_manager=None,
    keep_temporary_files: bool | None = None,
    progress_bar: bool | None = None,
    **kwargs,
)
Run OCRmyPDF on one PDF or image.
For most arguments, see documentation for the equivalent command line parameter.
This API takes a threading lock, because OCRmyPDF uses global state in particular for the plugin system. The jobs parameter will be used to create a pool of worker threads or processes at different times, subject to change. A Python process can only run one OCRmyPDF task at a time.
To parallelize instances of OCRmyPDF, use separate Python processes to scale horizontally. Generally speaking, you should set jobs=sqrt(cpu_count) and run sqrt(cpu_count) processes as a starting point. If you have files with a high page count, run fewer processes and more jobs per process. If you have a lot of short files, run more processes and fewer jobs per process.
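
The square-root rule of thumb can be turned into a small planning helper. The sketch below is an illustration only: `plan_parallelism`, the `favor` argument, and the choice to halve one side when skewing toward long or short documents are assumptions, not settings that OCRmyPDF provides.

```python
import math
import os


def plan_parallelism(cpu_count=None, favor="balanced"):
    """Return (processes, jobs_per_process) as a starting point for tuning.

    favor="long" suits high page counts; favor="short" suits many small files.
    """
    cpus = cpu_count or os.cpu_count() or 1
    root = max(1, math.isqrt(cpus))  # integer floor of sqrt(cpu_count)

    if favor == "balanced":
        return root, root
    if favor == "long":  # fewer processes, more jobs per process
        processes = max(1, root // 2)
        return processes, max(1, cpus // processes)
    if favor == "short":  # more processes, fewer jobs per process
        jobs = max(1, root // 2)
        return max(1, cpus // jobs), jobs
    raise ValueError(f"unknown favor value: {favor!r}")
```

With 16 CPUs this yields 4 processes of 4 jobs by default, 2 processes of 8 jobs for long documents, and 8 processes of 2 jobs for short ones. The second number is what each process would pass as jobs= to ocrmypdf.ocr().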
A few specific arguments are discussed here:
- use_threads – Use worker threads instead of processes. This reduces performance but may make debugging easier since it is easier to set breakpoints.
- input_file – If a pathlib.Path, str or bytes, this is interpreted as a file system path to the input file. If the object appears to be a readable stream (with methods such as .read() and .seek()), the object will be read in its entirety and saved to a temporary file. If "-", standard input will be read.
- output_file – If a pathlib.Path, str or bytes, this is interpreted as a file system path to the output file. If the object appears to be a writable stream (with methods such as .write() and .seek()), the output will be written to this stream. If "-", the output will be written to sys.stdout (provided that standard output does not seem to be a terminal device). When a stream is used as output, whether via a writable object or "-", some final validation steps are not performed (we do not read back the stream after it is written).
Raises:
- ocrmypdf.MissingDependencyError – If a required dependency program is missing or was not found on PATH.
- ocrmypdf.UnsupportedImageFormatError – If the input file type was an image that could not be read, or some other file type that is not a PDF.
- ocrmypdf.DpiError – If the input file is an image, but the resolution of the image is not credible (allowing it to proceed would cause poor OCR).
- ocrmypdf.OutputFileAccessError – If an attempt to write to the intended output file failed.
- ocrmypdf.PriorOcrFoundError – If the input PDF seems to have OCR or digital text already, and settings did not tell us to proceed.
- ocrmypdf.InputFileError – Any other problem with the input file.
- ocrmypdf.SubprocessOutputError – Any error related to executing a subprocess.
- ocrmypdf.EncryptedPdfError – If the input PDF is encrypted (password protected). OCRmyPDF does not remove passwords.
- ocrmypdf.TesseractConfigError – If Tesseract reported its configuration was not valid.
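
The stream handling of input_file and output_file can be sketched with in-memory buffers. The wrapper `ocr_bytes` below is hypothetical; it defaults to `ocrmypdf.ocr` through a deferred import and accepts a stand-in callable so that it can be tried without OCRmyPDF installed. As noted for stream outputs, the final validation steps are not performed on the result.

```python
import io


def ocr_bytes(pdf_bytes, ocr_func=None, **ocr_options):
    """Run OCR on an in-memory PDF and return the resulting bytes.

    ocr_func defaults to ocrmypdf.ocr (imported lazily); pass a stand-in
    callable to exercise this wrapper without OCRmyPDF installed.
    """
    if ocr_func is None:
        import ocrmypdf  # deferred so this module loads without OCRmyPDF

        ocr_func = ocrmypdf.ocr

    source = io.BytesIO(pdf_bytes)  # readable stream, read in full by the OCR function
    target = io.BytesIO()  # writable stream that receives the output PDF
    ocr_func(source, target, **ocr_options)
    return target.getvalue()
```

Any exception raised by the OCR function propagates to the caller, so this wrapper would normally sit inside the application's own exception handler.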
Verbosity
Verbosity level for configure_logging.
- debug – Output ocrmypdf debug messages
- debug_all – More detailed debugging from ocrmypdf and dependent modules
- default – Default level of logging
- quiet – Suppress most messages
configure_logging(verbosity: Verbosity, *, progress_bar_friendly: bool = True, manage_root_logger: bool = False, plugin_manager: pluggy.PluginManager | None = None)
Set up logging.
Before calling ocrmypdf.ocr(), you can use this function to configure logging if you want ocrmypdf’s output to look like the ocrmypdf command line interface. It will register log handlers, log filters, and formatters, configure color logging to standard error, and adjust the log levels of third party libraries. Details of this are fine-tuned and subject to change. The verbosity argument is equivalent to the argument --verbose and applies those settings. If you have a wrapper script for ocrmypdf and you want it to be very similar to ocrmypdf, use this function; if you are using ocrmypdf as part of an application that manages its own logging, you probably do not want this function.
If this function is not called, ocrmypdf will not configure logging, and it is up to the caller of ocrmypdf.ocr() to set up logging as it wishes using the Python standard library’s logging module. If this function is called, the caller may of course make further adjustments to logging.
Regardless of whether this function is called, ocrmypdf will perform all of its logging under the "ocrmypdf" logging namespace. In addition, ocrmypdf imports pdfminer, which logs under "pdfminer". A library user may wish to configure both; note that pdfminer is extremely chatty at the log level logging.INFO.
This function does not set up the debug.log log file that the command line interface does at certain verbosity levels. Applications should configure their own debug logging.
- verbosity – Verbosity level.
- progress_bar_friendly – If True (the default), install a custom log handler that is compatible with progress bars and colored output.
- manage_root_logger – Configure the process’s root logger.
- plugin_manager – The plugin manager, used for obtaining the custom log handler.
Returns the toplevel logger for ocrmypdf (or the root logger, if we are managing it).
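
Since the debug.log file is not created when OCRmyPDF is used as a library, an application that wants an equivalent can attach its own file handler. The following is a sketch using the standard library only; `add_debug_log_file` and its format string are assumptions, and the file it writes is not guaranteed to match what the command line interface produces.

```python
import logging


def add_debug_log_file(path, logger_name="ocrmypdf"):
    """Send DEBUG-level records from logger_name (and its children) to a file."""
    handler = logging.FileHandler(path, mode="w", encoding="utf-8")
    handler.setLevel(logging.DEBUG)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )

    log = logging.getLogger(logger_name)
    if log.getEffectiveLevel() > logging.DEBUG:
        # Otherwise DEBUG records are dropped before they reach any handler.
        log.setLevel(logging.DEBUG)
    log.addHandler(handler)
    return handler
```

Other handlers on the same logger should set their own levels if they are meant to stay quieter than the file. The returned handler can later be closed and removed with `logging.getLogger("ocrmypdf").removeHandler(handler)`.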