API reference
This page summarizes the rest of the public API. It is mainly of interest to plugin developers.
ocrmypdf.api
Functions for using ocrmypdf as an API.
- class ocrmypdf.api.PageNumberFilter(name='')
Insert PDF page number that emitted log message to log record.
- filter(record)
Determine if the specified record is to be logged.
Returns True if the record should be logged, or False otherwise. If deemed appropriate, the record may be modified in-place.
- class ocrmypdf.api.Verbosity(value)
Verbosity level for configure_logging.
- debug = 1
Output ocrmypdf debug messages
- debug_all = 2
More detailed debugging from ocrmypdf and dependent modules
- default = 0
Default level of logging
- quiet = -1
Suppress most messages
- ocrmypdf.api.configure_logging(verbosity: Verbosity, *, progress_bar_friendly: bool = True, manage_root_logger: bool = False, plugin_manager: PluginManager | None = None)
Set up logging.
Before calling ocrmypdf.ocr(), you can use this function to configure logging if you want ocrmypdf’s output to look like the ocrmypdf command line interface. It will register log handlers, log filters, and formatters, configure color logging to standard error, and adjust the log levels of third party libraries. Details of this are fine-tuned and subject to change. The verbosity argument is equivalent to the argument --verbose and applies those settings. If you have a wrapper script for ocrmypdf and you want it to be very similar to ocrmypdf, use this function; if you are using ocrmypdf as part of an application that manages its own logging, you probably do not want this function.
If this function is not called, ocrmypdf will not configure logging, and it is up to the caller of ocrmypdf.ocr() to set up logging as it wishes using the Python standard library’s logging module. If this function is called, the caller may of course make further adjustments to logging.
Regardless of whether this function is called, ocrmypdf will perform all of its logging under the "ocrmypdf" logging namespace. In addition, ocrmypdf imports pdfminer, which logs under "pdfminer". A library user may wish to configure both; note that pdfminer is extremely chatty at the log level logging.INFO.
This function does not set up the debug.log log file that the command line interface does at certain verbosity levels. Applications should configure their own debug logging.
- Parameters:
verbosity – Verbosity level.
progress_bar_friendly – If True (the default), install a custom log handler that is compatible with progress bars and colored output.
manage_root_logger – Configure the process’s root logger.
plugin_manager – The plugin manager, used for obtaining the custom log handler.
- Returns:
The toplevel logger for ocrmypdf (or the root logger, if we are managing it).
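For an application that manages its own logging and therefore does not call configure_logging, the two logging namespaces described above can be set up with the standard library alone. The following is a minimal sketch: the function name setup_library_logging, the stderr handler and the format string are illustrative choices, not part of ocrmypdf.

```python
import logging
import sys


def setup_library_logging(level: int = logging.INFO) -> logging.Logger:
    """Configure the "ocrmypdf" and "pdfminer" loggers without configure_logging.

    Hypothetical helper for illustration; not provided by ocrmypdf.
    """
    # OCRmyPDF performs all of its logging under the "ocrmypdf" namespace.
    log = logging.getLogger("ocrmypdf")
    log.setLevel(level)

    # Send messages to standard error; the format string is an arbitrary choice.
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(logging.Formatter("%(name)s %(levelname)s: %(message)s"))
    log.addHandler(handler)

    # pdfminer is very chatty at INFO, so only let warnings and above through.
    logging.getLogger("pdfminer").setLevel(logging.WARNING)
    return log
```

Call this once at start-up, before ocrmypdf.ocr(); the application remains free to attach its own handlers to either logger afterwards.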
- ocrmypdf.api.create_options(*, input_file: BinaryIO | Path | str | bytes, output_file: BinaryIO | Path | str | bytes, parser: ArgumentParser, **kwargs) Namespace
Construct an options object from the input/output files and keyword arguments.
- Parameters:
input_file – Input file path or file object.
output_file – Output file path or file object.
parser – ArgumentParser object.
**kwargs – Keyword arguments.
- Returns:
argparse.Namespace – A Namespace object containing the parsed arguments.
- Raises:
TypeError – If the type of a keyword argument is not supported.
- ocrmypdf.api.get_parser()
Get the main CLI parser.
- ocrmypdf.api.ocr(input_file: BinaryIO | Path | str | bytes, output_file: BinaryIO | Path | str | bytes, *, language: Iterable[str] | None = None, image_dpi: int | None = None, output_type: str | None = None, sidecar: BinaryIO | Path | str | bytes | None = None, jobs: int | None = None, use_threads: bool | None = None, title: str | None = None, author: str | None = None, subject: str | None = None, keywords: str | None = None, rotate_pages: bool | None = None, remove_background: bool | None = None, deskew: bool | None = None, clean: bool | None = None, clean_final: bool | None = None, unpaper_args: str | None = None, oversample: int | None = None, remove_vectors: bool | None = None, force_ocr: bool | None = None, skip_text: bool | None = None, redo_ocr: bool | None = None, skip_big: float | None = None, optimize: int | None = None, jpg_quality: int | None = None, png_quality: int | None = None, jbig2_lossy: bool | None = None, jbig2_page_group_size: int | None = None, jbig2_threshold: float | None = None, pages: str | None = None, max_image_mpixels: float | None = None, tesseract_config: Iterable[str] | None = None, tesseract_pagesegmode: int | None = None, tesseract_oem: int | None = None, tesseract_thresholding: int | None = None, pdf_renderer: str | None = None, tesseract_timeout: float | None = None, tesseract_non_ocr_timeout: float | None = None, tesseract_downsample_above: int | None = None, tesseract_downsample_large_images: bool | None = None, rotate_pages_threshold: float | None = None, pdfa_image_compression: str | None = None, color_conversion_strategy: str | None = None, user_words: PathLike | None = None, user_patterns: PathLike | None = None, fast_web_view: float | None = None, continue_on_soft_render_error: bool | None = None, invalidate_digital_signatures: bool | None = None, plugins: Iterable[Path | str] | None = None, plugin_manager=None, keep_temporary_files: bool | None = None, progress_bar: bool | None = None, **kwargs)
Run OCRmyPDF on one PDF or image.
For most arguments, see documentation for the equivalent command line parameter.
This API takes a threading lock, because OCRmyPDF uses global state, in particular for the plugin system. The jobs parameter will be used to create a pool of worker threads or processes at different times, subject to change. A Python process can only run one OCRmyPDF task at a time.
To run multiple OCRmyPDF instances in parallel, use separate Python processes to scale horizontally. As a starting point, set jobs=sqrt(cpu_count) and run sqrt(cpu_count) processes. If you have files with a high page count, run fewer processes and more jobs per process. If you have a lot of short files, run more processes and fewer jobs per process.
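The square-root starting point can be computed as follows. This is a sketch only; suggest_parallelism is a hypothetical helper name, and rounding down with an integer square root is an assumption, since the text does not say how to round.

```python
from __future__ import annotations

import math
import os


def suggest_parallelism(cpu_count: int | None = None) -> tuple[int, int]:
    """Return (processes, jobs_per_process) using the sqrt(cpu_count) rule."""
    if cpu_count is None:
        # os.cpu_count() may return None when the count cannot be determined.
        cpu_count = os.cpu_count() or 1
    # Integer square root (rounded down), but never fewer than one worker.
    root = max(1, math.isqrt(cpu_count))
    return root, root
```

On a 16-core machine this suggests 4 processes with jobs=4 each. Shift the balance toward more jobs per process for long documents, or toward more processes for many short files.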
A few specific arguments are discussed here:
- Parameters:
use_threads – Use worker threads instead of processes. This reduces performance but may make debugging easier since it is easier to set breakpoints.
input_file – If a pathlib.Path, str or bytes, this is interpreted as a file system path to the input file. If the object appears to be a readable stream (with methods such as .read() and .seek()), the object will be read in its entirety and saved to a temporary file. If input_file is "-", standard input will be read.
output_file – If a pathlib.Path, str or bytes, this is interpreted as a file system path to the output file. If the object appears to be a writable stream (with methods such as .write() and .seek()), the output will be written to this stream. If output_file is "-", the output will be written to sys.stdout (provided that standard output does not seem to be a terminal device). When a stream is used as output, whether via a writable object or "-", some final validation steps are not performed (we do not read back the stream after it is written).
- Raises:
ocrmypdf.MissingDependencyError – If a required dependency program is missing or was not found on PATH.
ocrmypdf.UnsupportedImageFormatError – If the input file type was an image that could not be read, or some other file type that is not a PDF.
ocrmypdf.DpiError – If the input file is an image, but the resolution of the image is not credible (allowing it to proceed would cause poor OCR).
ocrmypdf.OutputFileAccessError – If an attempt to write to the intended output file failed.
ocrmypdf.PriorOcrFoundError – If the input PDF seems to have OCR or digital text already, and settings did not tell us to proceed.
ocrmypdf.InputFileError – Any other problem with the input file.
ocrmypdf.SubprocessOutputError – Any error related to executing a subprocess.
ocrmypdf.EncryptedPdfError – If the input PDF is encrypted (password protected). OCRmyPDF does not remove passwords.
ocrmypdf.TesseractConfigError – If Tesseract reported its configuration was not valid.
- Returns:
ocrmypdf.ExitCode
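A call to ocrmypdf.ocr() that handles one of the exceptions listed above might look like the following sketch. The wrapper name ocr_one_file and the choice to treat existing text as a non-error are assumptions made for illustration; deskew corresponds to the keyword of the same name in the signature above.

```python
from pathlib import Path


def ocr_one_file(input_file: Path, output_file: Path):
    """Run OCR on one file; return the exit code, or None if text already exists.

    Hypothetical wrapper for illustration; not part of ocrmypdf.
    """
    # Imported lazily so that this module can be loaded without ocrmypdf installed.
    import ocrmypdf

    try:
        return ocrmypdf.ocr(input_file, output_file, deskew=True)
    except ocrmypdf.PriorOcrFoundError:
        # The input already has OCR or digital text and no option told
        # OCRmyPDF to proceed; this sketch treats that as "nothing to do".
        return None
```

Because a Python process can only run one OCRmyPDF task at a time, a wrapper like this is best called from a dedicated worker process when several files are handled in parallel.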
ocrmypdf.exceptions
OCRmyPDF’s exceptions.
- exception ocrmypdf.exceptions.BadArgsError
Invalid arguments on the command line or API.
- exit_code = 1
- exception ocrmypdf.exceptions.ColorConversionNeededError
PDF needs color conversion.
- message = 'The input PDF has an unusual color space. Use\n--color-conversion-strategy to convert to a common color space\nsuch as RGB, or use --output-type pdf to skip PDF/A conversion\nand retain the original color space.\n'
- exception ocrmypdf.exceptions.DigitalSignatureError
PDF has a digital signature.
- message = 'Input PDF has a digital signature. OCR would alter the document,\ninvalidating the signature.\n'
- exception ocrmypdf.exceptions.EncryptedPdfError
Input PDF is encrypted.
- exit_code = 8
- message = "Input PDF is encrypted. The encryption must be removed to\nperform OCR.\n\nFor information about this PDF's security use\n qpdf --show-encryption infilename\n\nYou can remove the encryption using\n qpdf --decrypt [--password=[password]] infilename\n"
- class ocrmypdf.exceptions.ExitCode(value)
OCRmyPDF’s exit codes.
- already_done_ocr = 6
- bad_args = 1
- child_process_error = 7
- ctrl_c = 130
- encrypted_pdf = 8
- file_access_error = 5
- input_file = 2
- invalid_config = 9
- invalid_output_pdf = 4
- missing_dependency = 3
- ok = 0
- other_error = 15
- pdfa_conversion_failed = 10
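A wrapper script that launches the ocrmypdf command line program as a child process can translate the numeric return code back into one of the names above. The sketch below uses a local mirror of the values listed in this section rather than importing ocrmypdf; CliExitCode and describe_exit are hypothetical names.

```python
from enum import IntEnum


class CliExitCode(IntEnum):
    """Local mirror of the exit codes listed above (not imported from ocrmypdf)."""

    ok = 0
    bad_args = 1
    input_file = 2
    missing_dependency = 3
    invalid_output_pdf = 4
    file_access_error = 5
    already_done_ocr = 6
    child_process_error = 7
    encrypted_pdf = 8
    invalid_config = 9
    pdfa_conversion_failed = 10
    other_error = 15
    ctrl_c = 130


def describe_exit(returncode: int) -> str:
    """Map a process return code to its symbolic name, if it is a listed one."""
    try:
        return CliExitCode(returncode).name
    except ValueError:
        # Any value that is not in the table above.
        return f"unknown ({returncode})"
```

When ocrmypdf is importable, prefer ocrmypdf.exceptions.ExitCode itself so that the values cannot drift from the installed version.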
- exception ocrmypdf.exceptions.ExitCodeException
An exception which should return an exit code with sys.exit().
- exit_code = 15
- message = ''
- exception ocrmypdf.exceptions.InputFileError
Something is wrong with the input file.
- exit_code = 2
- exception ocrmypdf.exceptions.MissingDependencyError
A third-party dependency is missing.
- exit_code = 3
- exception ocrmypdf.exceptions.OutputFileAccessError
Cannot access the intended output file path.
- exit_code = 5
- exception ocrmypdf.exceptions.SubprocessOutputError
A subprocess returned an unexpected error.
- exit_code = 7
- exception ocrmypdf.exceptions.TaggedPDFError
PDF is tagged.
- message = 'This PDF is marked as a Tagged PDF. This often indicates\nthat the PDF was generated from an office document and does\nnot need OCR. Use --force-ocr, --skip-text or --redo-ocr to\noverride this error.\n'
ocrmypdf.helpers
Support functions.
- @ocrmypdf.helpers.deprecated(deprecated_in=None, removed_in=None, current_version=None, details='')
Decorate a function to signify its deprecation.
This function wraps a method that will soon be removed and does two things:
The docstring of the method will be modified to include a notice about deprecation, e.g., “Deprecated since 0.9.11. Use foo instead.”
Raises a DeprecatedWarning via the warnings module, which is a subclass of the built-in DeprecationWarning. Note that built-in DeprecationWarnings are ignored by default, so for users to be informed of said warnings they will need to enable them; see the warnings module documentation for more details.
- Parameters:
deprecated_in – The version at which the decorated method is considered deprecated. This will usually be the next version to be released when the decorator is added. The default is None, which effectively means immediate deprecation. If this is not specified, then the removed_in and current_version arguments are ignored.
removed_in – The version or datetime.date when the decorated method will be removed. The default is None, specifying that the function is not currently planned to be removed. Note: This parameter cannot be set to a value if deprecated_in=None.
current_version – The source of version information for the currently running code. This will usually be a __version__ attribute on your library. The default is None. When current_version=None the automation to determine if the wrapped function is actually in a period of deprecation or time for removal does not work, causing a DeprecatedWarning to be raised in all cases.
details – Extra details to be added to the method docstring and warning. For example, the details may point users to a replacement method, such as “Use the foo_bar method instead”. By default there are no details.
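The two effects described above (a docstring notice and a warning at call time) can be illustrated with a simplified stand-in built on the standard library. This is not the implementation behind ocrmypdf.helpers.deprecated: simple_deprecated is a hypothetical name, it ignores removed_in and current_version, and it issues a plain DeprecationWarning.

```python
import functools
import warnings


def simple_deprecated(deprecated_in: str, details: str = ""):
    """Simplified stand-in: annotate the docstring and warn on every call."""
    notice = f"Deprecated since {deprecated_in}. {details}".strip()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # stacklevel=2 attributes the warning to the caller, not the wrapper.
            warnings.warn(notice, DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)

        # Append the deprecation notice to the wrapped function's docstring.
        wrapper.__doc__ = f"{func.__doc__ or ''}\n\n{notice}".strip()
        return wrapper

    return decorator


@simple_deprecated("0.9.11", details="Use foo instead.")
def old_function():
    """Do the old thing."""
    return 42


# DeprecationWarning is often hidden by the default filters, so enable it here.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", DeprecationWarning)
    value = old_function()
```

Outside of a test harness, the same effect is usually achieved by running Python with -W default or by calling warnings.simplefilter early in the program.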
- ocrmypdf.helpers.NeverRaise()
An exception that is never raised.
Deprecated since version 15.4.0.
- class ocrmypdf.helpers.Resolution(x: T, y: T)
The number of pixels per inch in each 2D direction.
Resolution objects are considered “equal” for == purposes if they are equal to a reasonable tolerance.
- flip_axis() Resolution[T]
Return a new Resolution object with x and y swapped.
- property is_finite: bool
True if both x and y are finite numbers.
- property is_square: bool
True if the resolution is square (x == y).
- round(ndigits: int) Resolution
Round to ndigits after the decimal point.
- take_max(vals: Iterable[Any], yvals: Iterable[Any] | None = None) Resolution
Return a new Resolution object with the maximum resolution of inputs.
- take_min(vals: Iterable[Any], yvals: Iterable[Any] | None = None) Resolution
Return a new Resolution object with the minimum resolution of inputs.
- to_int() Resolution[int]
Round to nearest integer.
- to_scalar() float
Return the harmonic mean of x and y as a 1D approximation.
In most cases, Resolution is 2D, but typically it is “square” (x == y) and can be approximated as a single number. When not square, the harmonic mean is used to approximate the 2D resolution as a single number.
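For reference, the harmonic mean of two positive numbers x and y is 2xy / (x + y). The sketch below shows only that formula; harmonic_mean_2d is a hypothetical helper and makes no claim about how Resolution.to_scalar() is implemented internally.

```python
def harmonic_mean_2d(x: float, y: float) -> float:
    """Harmonic mean of two positive numbers: 2xy / (x + y)."""
    return 2.0 * x * y / (x + y)


# A square resolution collapses to the shared value.
square = harmonic_mean_2d(300, 300)
# A non-square resolution lands between the two, closer to the smaller one.
non_square = harmonic_mean_2d(300, 600)
```

The harmonic mean is never larger than the arithmetic mean, so the single-number approximation leans toward the lower of the two resolutions.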
- ocrmypdf.helpers.available_cpu_count() int
Returns number of CPUs in the system.
- ocrmypdf.helpers.check_pdf(input_file: Path) bool
Check if a PDF complies with the PDF specification.
Checks for proper formatting and proper linearization. Uses pikepdf (which in turn, uses QPDF) to perform the checks.
- ocrmypdf.helpers.clamp(n: T, smallest: T, largest: T) T
Clamps the value of n to between smallest and largest.
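Clamping is commonly written as a min/max pair. The following sketch shows the idea with a hypothetical clamp_sketch function; it assumes smallest <= largest and is not necessarily how the helper is written.

```python
def clamp_sketch(n, smallest, largest):
    """Restrict n to the closed interval [smallest, largest]."""
    # min() caps the value at largest, then max() raises it to at least smallest.
    return max(smallest, min(n, largest))
```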
- ocrmypdf.helpers.is_file_writable(test_file: PathLike) bool
Intentionally racy test if target is writable.
We intend to write to the output file if and only if we succeed and can replace it atomically. Before doing the OCR work, make sure the location is writable.
- ocrmypdf.helpers.is_iterable_notstr(thing: Any) bool
Is this an iterable type, other than a string?
- ocrmypdf.helpers.page_number(input_file: PathLike) int
Get one-based page number implied by filename (000002.pdf -> 2).
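The mapping from file name to page number can be sketched as below. The name page_number_from_name is hypothetical, and reading the leading run of digits from the file name is an assumption based on the 000002.pdf example; the helper itself may parse the name differently.

```python
import re
from pathlib import Path


def page_number_from_name(input_file) -> int:
    """Return the page number encoded in the leading digits of a file name."""
    name = Path(input_file).name
    # Assumption: the file name starts with a zero-padded page number.
    match = re.match(r"\d+", name)
    if match is None:
        raise ValueError(f"no leading page number in {name!r}")
    return int(match.group())
```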
- ocrmypdf.helpers.pikepdf_enable_mmap() None
Enable pikepdf memory mapping.
- ocrmypdf.helpers.remove_all_log_handlers(logger: Logger) None
Remove all log handlers, usually used in a child process.
The child process inherits the log handlers from the parent process when a fork occurs. Typically we want to remove all log handlers in the child process so that the child process can set up a single queue handler to forward log messages to the parent process.
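The pattern described above, removing inherited handlers and installing a single queue handler, can be sketched with the standard library as follows. reset_handlers_for_child is a hypothetical name, and the demonstration runs in one process with an in-memory queue.Queue; a real child process would use a multiprocessing queue shared with the parent.

```python
import logging
import logging.handlers
import queue


def reset_handlers_for_child(logger: logging.Logger, log_queue) -> None:
    """Drop inherited handlers and forward records to log_queue instead."""
    # Iterate over a copy, because removeHandler mutates logger.handlers.
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
    # A single QueueHandler forwards every record to the parent's queue.
    logger.addHandler(logging.handlers.QueueHandler(log_queue))


# Demonstration in a single process, using an in-memory queue.
demo_logger = logging.getLogger("child_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False
demo_logger.addHandler(logging.StreamHandler())  # stands in for an inherited handler
demo_queue = queue.Queue()
reset_handlers_for_child(demo_logger, demo_queue)
demo_logger.info("hello from the child")
```

On the parent side, a logging.handlers.QueueListener can drain the queue and pass records to the real handlers.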
- ocrmypdf.helpers.running_in_docker() bool
Returns True if we seem to be running in a Docker container.
- ocrmypdf.helpers.running_in_snap() bool
Returns True if we seem to be running in a Snap container.
- ocrmypdf.helpers.safe_symlink(input_file: PathLike, soft_link_name: PathLike) None
Create a symbolic link at soft_link_name, which references input_file.
Think of this as copying input_file to soft_link_name with less overhead.
Use symlinks safely. Self-linking loops are prevented. On Windows, file copy is used since symlinks may require administrator privileges. An existing link at the destination is removed.
ocrmypdf.hocrtransform
Transform .hocr and page image to text PDF.
- class ocrmypdf.hocrtransform.DebugRenderOptions(render_paragraph_bbox: bool = False, render_baseline: bool = False, render_triangle: bool = False, render_line_bbox: bool = False, render_word_bbox: bool = False, render_space_bbox: bool = False)
A class for managing rendering options.
- class ocrmypdf.hocrtransform.HocrTransform(*, hocr_filename: str | Path, dpi: float, debug: bool = False, fontname: Name = ..., font: Font = ..., debug_render_options: DebugRenderOptions | None = None)
A class for converting documents from the hOCR format.
For details of the hOCR format, see: http://kba.github.io/hocr-spec/1.2/.
- classmethod element_coordinates(element: Element)
Get coordinates of the bounding box around an element.
- classmethod normalize_text(s: str) str
Normalize the given text using the NFKC normalization form.
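NFKC normalization is available in the standard library through unicodedata.normalize. The sketch below shows its effect on two typical inputs; normalize_nfkc is a hypothetical stand-alone function, not the classmethod itself.

```python
import unicodedata


def normalize_nfkc(s: str) -> str:
    """Apply Unicode NFKC normalization (compatibility decomposition, then composition)."""
    return unicodedata.normalize("NFKC", s)


# U+FB01 (LATIN SMALL LIGATURE FI) is expanded to the two letters "fi".
ligature = normalize_nfkc("\ufb01le")
# Fullwidth Latin letters are mapped to their ordinary ASCII forms.
fullwidth = normalize_nfkc("\uff21\uff22\uff23")
```

Normalizing OCR text this way makes ligatures and width variants searchable with plain keyboard input.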
- classmethod polyval(poly, x)
Calculate the value of a polynomial at a point.
- to_pdf(*, out_filename: Path, image_filename: Path | None = None, invisible_text: bool = True) None
Creates a PDF file with an image superimposed on top of the text.
Text is positioned according to the bounding box of the lines in the hOCR file. The image need not be identical to the image used to create the hOCR file. It can have a lower resolution, different color mode, etc.
- Parameters:
out_filename – Path of PDF to write.
image_filename – Image to use for this file. If omitted, the OCR text is shown.
invisible_text – If True, text is rendered invisible so that it is selectable but never drawn. If False, text is visible and may be seen if the image is skipped or deleted in Acrobat.
- exception ocrmypdf.hocrtransform.HocrTransformError
Error while applying hOCR transform.
ocrmypdf.pdfa
Utilities for PDF/A production and confirmation with Ghostscript.
- ocrmypdf.pdfa.file_claims_pdfa(filename: Path)
Determines if the file claims to be PDF/A compliant.
This only checks if the XMP metadata contains a PDF/A marker. It does not do full PDF/A validation.
- ocrmypdf.pdfa.generate_pdfa_ps(target_filename: Path, icc: str = 'sRGB')
Create a Postscript PDFMARK file for Ghostscript PDF/A conversion.
pdfmark is an extension to the Postscript language that describes some PDF features like bookmarks and annotations. It was originally specified for Adobe Distiller, for Postscript to PDF conversion.
Ghostscript uses pdfmark for PDF to PDF/A conversion as well. To use Ghostscript to create a PDF/A, we need to create a pdfmark file with the necessary metadata.
This function takes care of the many version-specific bugs and peculiarities in Ghostscript’s handling of pdfmark.
The only information we put in specifies that we want the file to be a PDF/A, and that we want Ghostscript to convert objects to the sRGB colorspace if it runs into any object that it decides must be converted.
- Parameters:
target_filename – filename to save
icc – ICC identifier such as ‘sRGB’
References
Adobe PDFMARK Reference: https://opensource.adobe.com/dc-acrobat-sdk-docs/library/pdfmark/
ocrmypdf.quality
Utilities to measure OCR quality.
ocrmypdf.subprocess
Wrappers to manage subprocess calls.
- ocrmypdf.subprocess.check_external_program(*, program: str, package: str, version_checker: Callable[[], Version], need_version: str | Version, required_for: str | None = None, recommended: bool = False, version_parser: type[Version] = Version) None
Check for required version of external program and raise exception if not.
- Parameters:
program – The name of the program to test.
package – The name of a software package that typically supplies this program. Usually the same as program.
version_checker – A callable without arguments that retrieves the installed version of program.
need_version – The minimum required version.
required_for – The name of an argument or feature that requires this program.
recommended – If this external program is recommended, instead of raising an exception, log a warning and allow execution to continue.
version_parser – A class that should be used to parse and compare version numbers. Used when version numbers do not follow standard conventions.
- ocrmypdf.subprocess.get_version(program: str, *, version_arg: str = '--version', regex='(\\d+(\\.\\d+)*)', env: _Environ | None = None) str
Get the version of the specified program.
- Parameters:
program – The program to version check.
version_arg – The argument needed to ask for its version, e.g. --version.
regex – A regular expression to parse the program’s output and obtain the version.
env – Custom os.environ in which to run program.
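The default regex shown in the signature captures the first dotted run of digits in the program's output. The sketch below applies it to sample strings; parse_version is a hypothetical helper and the sample outputs are made up for illustration.

```python
import re

# The default pattern from the get_version signature above.
DEFAULT_VERSION_REGEX = r"(\d+(\.\d+)*)"


def parse_version(output: str, regex: str = DEFAULT_VERSION_REGEX) -> str:
    """Extract the first version-like string from a program's output."""
    match = re.search(regex, output)
    if match is None:
        raise ValueError(f"no version found in {output!r}")
    # Group 1 is the full dotted version; group 2 only holds the last ".N" part.
    return match.group(1)
```

A custom regex is useful when a program prints other numbers before its version, since the default pattern stops at the first match.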
- ocrmypdf.subprocess.run(args: Sequence[Path | str], *, env: _Environ | None = None, logs_errors_to_stdout: bool = False, check: bool = False, **kwargs) CompletedProcess
Wrapper around subprocess.run().
The main purpose of this wrapper is to log subprocess output in an orderly fashion that identifies the responsible subprocess. An additional task is that this function goes to greater lengths to find possible Windows locations of our dependencies when they are not on the system PATH.
Arguments should be identical to subprocess.run, except for the following:
- Parameters:
args – Positional arguments to pass to subprocess.run.
env – A set of environment variables. If None, the OS environment is used.
logs_errors_to_stdout – If True, indicates that the process writes its error messages to stdout rather than stderr, so stdout should be logged if there is an error. If False, stderr is logged. Could be used with stderr=STDOUT, stdout=PIPE for example.
check – If True, raise an exception if the process exits with a non-zero status code. If False, the return value will indicate success or failure.
kwargs – Additional arguments to pass to subprocess.run.
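The idea of logging subprocess output so that it identifies the responsible program can be sketched with the standard library alone. run_logged and the "subprocess_demo" logger name are hypothetical, and this simplified version captures both streams and logs only stderr; it does not reproduce the Windows path lookup or the other behavior of ocrmypdf.subprocess.run.

```python
import logging
import subprocess
import sys

log = logging.getLogger("subprocess_demo")


def run_logged(args, *, check=False):
    """Run a command, then log each stderr line prefixed with the program name."""
    proc = subprocess.run(
        args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=check
    )
    program = str(args[0])
    # stderr is bytes here; decode leniently so that logging never fails.
    for line in proc.stderr.decode(errors="replace").splitlines():
        log.debug("%s: %s", program, line)
    return proc


# The child process is the current Python interpreter writing one line to stderr.
result = run_logged([sys.executable, "-c", "import sys; sys.stderr.write('warn\\n')"])
```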
- ocrmypdf.subprocess.run_polling_stderr(args: Sequence[Path | str], *, callback: Callable[[str], None], check: bool = False, env: _Environ | None = None, **kwargs) CompletedProcess
Run a process like ocrmypdf.subprocess.run, and poll stderr.
Every line produced on stderr will be forwarded to the callback function. The intended use is monitoring progress of subprocesses that output their own progress indicators. In addition, each line will be logged if debug logging is enabled.
Requires stderr to be opened in text mode for ease of handling errors. In addition the expected encoding= and errors= arguments should be set. Note that if stdout is already set up, it need not be binary.
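Line-by-line polling of stderr with a callback can be sketched using subprocess.Popen in text mode. poll_stderr is a hypothetical function written for illustration; it discards stdout and returns only the exit status, unlike run_polling_stderr, which returns a CompletedProcess.

```python
import subprocess
import sys


def poll_stderr(args, callback, *, encoding="utf-8", errors="replace"):
    """Run args and pass each stderr line to callback as it arrives."""
    with subprocess.Popen(
        args,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True,  # text mode, as required by the description above
        encoding=encoding,
        errors=errors,
    ) as proc:
        # Iterating over the pipe yields lines until the child closes stderr.
        for line in proc.stderr:
            callback(line.rstrip("\n"))
    # Leaving the with-block waits for the child, so returncode is set.
    return proc.returncode


lines = []
child_code = (
    "import sys; print('step 1', file=sys.stderr); print('step 2', file=sys.stderr)"
)
exit_status = poll_stderr([sys.executable, "-c", child_code], lines.append)
```

A progress bar could be driven by passing a callback that parses each line instead of lines.append.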