API Reference¶

This page summarizes the rest of the public API. Generally speaking, this should mainly be of interest to plugin developers.

ocrmypdf¶

ocrmypdf.exceptions¶

exception ocrmypdf.exceptions.BadArgsError¶

exit_code = 1¶

exception ocrmypdf.exceptions.DpiError¶

exit_code = 2¶

exception ocrmypdf.exceptions.EncryptedPdfError¶

exit_code = 8¶

message = "Input PDF is encrypted. The encryption must be removed to\nperform OCR.\n\nFor information about this PDF's security use\n qpdf --show-encryption infilename\n\nYou can remove the encryption using\n qpdf --decrypt [--password=[password]] infilename\n"¶

class ocrmypdf.exceptions.ExitCode(value)¶

An enumeration.

already_done_ocr = 6¶

bad_args = 1¶

child_process_error = 7¶

ctrl_c = 130¶

encrypted_pdf = 8¶

file_access_error = 5¶

input_file = 2¶

invalid_config = 9¶

invalid_output_pdf = 4¶

missing_dependency = 3¶

ok = 0¶

other_error = 15¶

pdfa_conversion_failed = 10¶

exception ocrmypdf.exceptions.ExitCodeException¶

exit_code = 15¶

message = ''¶

exception ocrmypdf.exceptions.InputFileError¶

exit_code = 2¶

exception ocrmypdf.exceptions.MissingDependencyError¶

exit_code = 3¶

exception ocrmypdf.exceptions.OutputFileAccessError¶

exit_code = 5¶

exception ocrmypdf.exceptions.PdfMergeFailedError¶

exit_code = 2¶

message = 'Failed to merge PDF image layer with OCR layer\n\nUsually this happens because the input PDF file is malformed and\nocrmypdf cannot automatically correct the problem on its own.\n\nTry using\n ocrmypdf --pdf-renderer sandwich [..other args..]\n'¶

exception ocrmypdf.exceptions.PriorOcrFoundError¶

exit_code = 6¶

exception ocrmypdf.exceptions.SubprocessOutputError¶

exit_code = 7¶

exception ocrmypdf.exceptions.TesseractConfigError¶

exit_code = 9¶

message = 'Error occurred while parsing a Tesseract configuration file'¶

exception ocrmypdf.exceptions.UnsupportedImageFormatError¶

exit_code = 2¶
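The exit codes listed in this module suggest a simple pattern for wrapper scripts: catch the base exception and pass its exit_code on as the process status. The sketch below is illustrative only. It uses stand-in classes (ExitCodeSketch, ExitCodeExceptionSketch and EncryptedPdfErrorSketch are hypothetical names) that mirror a few of the values documented here so that it runs without OCRmyPDF installed. Treating the enumeration as an IntEnum is an assumption.

```python
from enum import IntEnum


class ExitCodeSketch(IntEnum):
    """Stand-in mirroring a few of the ExitCode values listed above."""
    ok = 0
    bad_args = 1
    encrypted_pdf = 8
    other_error = 15


class ExitCodeExceptionSketch(Exception):
    """Stand-in base class; subclasses override exit_code and message."""
    exit_code = ExitCodeSketch.other_error
    message = ''


class EncryptedPdfErrorSketch(ExitCodeExceptionSketch):
    exit_code = ExitCodeSketch.encrypted_pdf
    message = "Input PDF is encrypted."


def run_and_get_status(job) -> int:
    """Run job() and translate a stand-in exception into an exit status."""
    try:
        job()
    except ExitCodeExceptionSketch as e:
        return int(e.exit_code)
    return int(ExitCodeSketch.ok)


def encrypted_job():
    raise EncryptedPdfErrorSketch()


status = run_and_get_status(encrypted_job)
```

With the real package, the same try/except shape would presumably catch ocrmypdf.exceptions.ExitCodeException instead of the stand-in.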

ocrmypdf.helpers¶

@ocrmypdf.helpers.deprecated¶: Warn that function is deprecated.

exception ocrmypdf.helpers.NeverRaise: An exception that is never raised

class ocrmypdf.helpers.Resolution(x, y)

The number of pixels per inch in each 2D direction.

Resolution objects are considered “equal” for == purposes if they are equal to a reasonable tolerance.
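To illustrate what equality "to a reasonable tolerance" can mean, here is a minimal sketch built on math.isclose. ResolutionSketch and its REL_TOL value are hypothetical; the tolerance OCRmyPDF actually applies is not stated on this page.

```python
import math


class ResolutionSketch:
    """Stand-in for a pixels-per-inch pair with tolerant equality."""

    REL_TOL = 1e-3  # hypothetical tolerance, not taken from OCRmyPDF

    def __init__(self, x: float, y: float):
        self.x = x
        self.y = y

    def __eq__(self, other):
        if not isinstance(other, ResolutionSketch):
            return NotImplemented
        return (math.isclose(self.x, other.x, rel_tol=self.REL_TOL)
                and math.isclose(self.y, other.y, rel_tol=self.REL_TOL))

    # Tolerant equality is not transitive, so instances are left unhashable.
    __hash__ = None


scanned = ResolutionSketch(300.0, 300.0)
computed = ResolutionSketch(300.0001, 299.9999)
```

Here scanned == computed holds even though the floating-point values differ slightly, which is the kind of comparison that rounding in DPI calculations tends to require.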

ocrmypdf.helpers.available_cpu_count() → int: Returns number of CPUs in the system.

ocrmypdf.helpers.check_pdf(input_file: pathlib.Path) → bool

Check if a PDF complies with the PDF specification.

Checks for proper formatting and proper linearization. Uses pikepdf (which in turn, uses QPDF) to perform the checks.

ocrmypdf.helpers.clamp(n, smallest, largest): Clamps the value of n to between smallest and largest.
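Clamping is commonly written as a nested min/max. The sketch below shows that idiom under the name clamp_sketch (hypothetical); it is not necessarily how the helper is implemented.

```python
def clamp_sketch(n, smallest, largest):
    """Limit n to the closed interval [smallest, largest]."""
    return max(smallest, min(n, largest))


inside = clamp_sketch(5, 0, 10)
below = clamp_sketch(-3, 0, 10)
above = clamp_sketch(42, 0, 10)
```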

ocrmypdf.helpers.deprecated(func): Warn that function is deprecated.

ocrmypdf.helpers.is_file_writable(test_file: os.PathLike) → bool

Intentionally racy test if target is writable.

We intend to write to the output file if and only if we succeed and can replace it atomically. Before doing the OCR work, make sure the location is writable.
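A writability probe of this kind might look like the following sketch. The function name is_writable_sketch and the exact checks are assumptions, not the library's code. It is "racy" in the sense that another process can change the file system between the check and the later write.

```python
import os
import tempfile
from pathlib import Path


def is_writable_sketch(target) -> bool:
    """Best-effort check that target could be written right now."""
    path = Path(target)
    if path.exists():
        # Only an existing regular file with write permission qualifies.
        return path.is_file() and os.access(path, os.W_OK)
    try:
        # Probe by creating and removing an empty file. The answer may be
        # stale by the time the real output is written (the race).
        path.touch()
        path.unlink()
    except OSError:
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    new_file_ok = is_writable_sketch(Path(tmp) / "out.pdf")
    directory_ok = is_writable_sketch(tmp)
    probe_left_behind = (Path(tmp) / "out.pdf").exists()
```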

ocrmypdf.helpers.is_iterable_notstr(thing: Any) → bool: Is this an iterable type, other than a string?
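A check like this is typically expressed with collections.abc.Iterable. The sketch below, under the hypothetical name is_iterable_notstr_sketch, shows the idea. Note that in this sketch bytes still counts as iterable, since only str is excluded.

```python
from collections.abc import Iterable
from typing import Any


def is_iterable_notstr_sketch(thing: Any) -> bool:
    """True for lists, tuples, generators, etc., but not for str."""
    return isinstance(thing, Iterable) and not isinstance(thing, str)


list_result = is_iterable_notstr_sketch([1, 2, 3])
str_result = is_iterable_notstr_sketch("1,2,3")
int_result = is_iterable_notstr_sketch(5)
```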

ocrmypdf.helpers.monotonic(L: Sequence) → bool: Does this sequence increase monotonically?
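One way to test for a monotonic increase is to compare each element with its successor. The sketch below assumes a strict increase (equal neighbours fail), which this page does not spell out; monotonic_sketch is a hypothetical name.

```python
from typing import Sequence


def monotonic_sketch(seq: Sequence) -> bool:
    """True if every element is strictly greater than its predecessor."""
    return all(b > a for a, b in zip(seq, seq[1:]))


increasing = monotonic_sketch([1, 2, 5])
repeated = monotonic_sketch([1, 1, 2])
empty = monotonic_sketch([])
```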

ocrmypdf.helpers.page_number(input_file: os.PathLike) → int: Get one-based page number implied by filename (000002.pdf -> 2)
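The filename convention in the example (000002.pdf -> 2) can be parsed by reading the leading digits of the base name. The following sketch assumes exactly that convention; page_number_sketch and the sample paths are hypothetical.

```python
import os
import re


def page_number_sketch(input_file) -> int:
    """Return the page number encoded in the leading digits of a filename."""
    name = os.path.basename(os.fspath(input_file))
    match = re.match(r"\d+", name)
    if match is None:
        raise ValueError(f"no leading page number in {name!r}")
    return int(match.group())


second_page = page_number_sketch("000002.pdf")
nested_page = page_number_sketch("/tmp/work/000013.pdf")
```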

ocrmypdf.helpers.remove_all_log_handlers(logger): Remove all log handlers, usually used in a child process.

ocrmypdf.helpers.safe_symlink(input_file: os.PathLike, soft_link_name: os.PathLike)

Create a symbolic link at soft_link_name, which references input_file.

Think of this as copying input_file to soft_link_name with less overhead.

Use symlinks safely. Self-linking loops are prevented. On Windows, file copy is used since symlinks may require administrator privileges. An existing link at the destination is removed.
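The behaviour described above (no self-links, copy on Windows, replace an existing link) could be sketched roughly as follows. This is an illustration under the hypothetical name link_or_copy_sketch, not the library's implementation, and it omits error handling for missing inputs.

```python
import os
import shutil
import tempfile
from pathlib import Path


def link_or_copy_sketch(input_file, soft_link_name) -> None:
    src = os.path.abspath(os.fspath(input_file))
    dst = os.path.abspath(os.fspath(soft_link_name))
    if src == dst:
        return  # refuse to create a self-referencing link
    if os.path.islink(dst):
        os.unlink(dst)  # replace an existing link at the destination
    if os.name == "nt":
        shutil.copyfile(src, dst)  # symlinks may need admin rights on Windows
    else:
        os.symlink(src, dst)


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "a.txt"
    second = Path(tmp) / "b.txt"
    link = Path(tmp) / "link.txt"
    first.write_text("first")
    second.write_text("second")
    link_or_copy_sketch(first, link)
    seen_first = link.read_text()
    link_or_copy_sketch(second, link)  # the existing link is replaced
    seen_second = link.read_text()
```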

ocrmypdf.hocrtransform¶

class ocrmypdf.hocrtransform.HocrTransform(*, hocr_filename: Union[str, pathlib.Path], dpi: float)¶

A class for converting documents from the hOCR format. For details of the hOCR format, see: http://kba.cloud/hocr-spec/

classmethod baseline(element: xml.etree.ElementTree.Element) → Tuple[float, float]¶: Returns a tuple containing the baseline slope and intercept.

classmethod element_coordinates(element: xml.etree.ElementTree.Element) → ocrmypdf.hocrtransform.Rect¶: Returns a tuple containing the coordinates of the bounding box around an element
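In hOCR, bounding boxes and baselines are stored in an element's title attribute, for example "bbox 36 92 580 122; baseline 0.015 -18". The sketch below shows one way such properties can be extracted with regular expressions. The function names, the patterns and the sample markup are assumptions for illustration, not the class's actual code.

```python
import re
import xml.etree.ElementTree as ET

BBOX = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")
BASELINE = re.compile(r"baseline\s+([-+]?\d*\.?\d+)\s+([-+]?\d*\.?\d+)")


def bbox_sketch(element: ET.Element):
    """Return (x1, y1, x2, y2) in pixels, or None if no bbox is present."""
    m = BBOX.search(element.get("title", ""))
    return tuple(float(v) for v in m.groups()) if m else None


def baseline_sketch(element: ET.Element):
    """Return (slope, intercept), or None if no baseline is present."""
    m = BASELINE.search(element.get("title", ""))
    return (float(m.group(1)), float(m.group(2))) if m else None


# Hypothetical hOCR line element.
line = ET.fromstring(
    '<span class="ocr_line" title="bbox 36 92 580 122; baseline 0.015 -18">'
    'Hello</span>'
)
bbox = bbox_sketch(line)
baseline = baseline_sketch(line)
```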

pt_from_pixel(pxl) → ocrmypdf.hocrtransform.Rect¶: Returns the quantity in PDF units (pt) given quantity in pixels

classmethod replace_unsupported_chars(s: str) → str¶

Given an input string, returns the corresponding string that:

* is available in the Helvetica typeface
* does not contain any ligature (to allow easy search in the PDF file)
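The ligature part of this replacement can be illustrated with a translation table that expands the Latin ligature code points into plain letters. This is a sketch under the hypothetical name expand_ligatures_sketch; the set of characters the real method handles may differ, and the Helvetica availability check is not covered here.

```python
# Latin ligatures from the Unicode Alphabetic Presentation Forms block.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures_sketch(s: str) -> str:
    """Replace each ligature character with its component letters."""
    return s.translate(_TABLE)


expanded = expand_ligatures_sketch("e\ufb03cient \ufb01le")
```

After expansion, a search for "file" in the extracted text matches, which it would not if the single ligature character were kept.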

to_pdf(*, out_filename: pathlib.Path, image_filename: Optional[pathlib.Path] = None, show_bounding_boxes: bool = False, fontname: str = 'Helvetica', invisible_text: bool = False, interword_spaces: bool = False) → None¶

Creates a PDF file with an image superimposed on top of the text. Text is positioned according to the bounding box of the lines in the hOCR file. The image need not be identical to the image used to create the hOCR file. It can have a lower resolution, different color mode, etc.

Parameters

out_filename – Path of PDF to write.
image_filename – Image to use for this file. If omitted, the OCR text is shown.
show_bounding_boxes – Show bounding boxes around various text regions, for debugging.
fontname – Name of font to use.
invisible_text – If True, text is rendered invisible so that it is selectable but never drawn. If False, text is visible and may be seen if the image is skipped or deleted in Acrobat.
interword_spaces – If True, insert spaces between words rather than drawing each word without spaces. Generally this improves text extraction.

exception ocrmypdf.hocrtransform.HocrTransformError¶

class ocrmypdf.hocrtransform.Rect(x1: Any, y1: Any, x2: Any, y2: Any)¶

A rectangle for managing PDF coordinates.

property x1¶: Alias for field number 0

property x2¶: Alias for field number 2

property y1¶: Alias for field number 1

property y2¶: Alias for field number 3
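The "Alias for field number" entries indicate a named tuple, and the pixel-to-point conversion described for pt_from_pixel follows from PDF's definition of 72 points per inch. A minimal sketch, with RectSketch and pt_from_pixel_sketch as hypothetical stand-ins, and with dpi passed explicitly rather than taken from an instance:

```python
from typing import NamedTuple

POINTS_PER_INCH = 72.0  # PDF user space units per inch


class RectSketch(NamedTuple):
    """Stand-in for a rectangle with fields in the order x1, y1, x2, y2."""
    x1: float
    y1: float
    x2: float
    y2: float


def pt_from_pixel_sketch(pxl: RectSketch, dpi: float) -> RectSketch:
    """Convert every coordinate from pixels to points at the given dpi."""
    return RectSketch._make(c / dpi * POINTS_PER_INCH for c in pxl)


box_px = RectSketch(300, 600, 900, 750)
box_pt = pt_from_pixel_sketch(box_px, dpi=300)
```

At 300 dpi, 300 pixels is one inch, so the left edge of box_pt lands at 72 pt.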

ocrmypdf.pdfa¶

Utilities for PDF/A production and confirmation with Ghostscript.

ocrmypdf.pdfa.file_claims_pdfa(filename: pathlib.Path)¶

Determines if the file claims to be PDF/A compliant.

This only checks if the XMP metadata contains a PDF/A marker. It does not do full PDF/A validation.
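The PDF/A marker in XMP metadata is the pdfaid:part property (optionally with pdfaid:conformance) from the PDF/A identification schema. The sketch below looks for it in an XMP packet that has already been extracted as a string. It uses only the standard library and a hypothetical function name and sample packet; it does not reproduce how the library reads metadata or what it returns.

```python
import xml.etree.ElementTree as ET

PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def claims_pdfa_sketch(xmp_xml: str) -> str:
    """Return the claimed level such as '2B', or '' if there is no claim."""
    root = ET.fromstring(xmp_xml)
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        # The properties may be written as attributes or as child elements.
        part = (desc.get(f"{{{PDFAID_NS}}}part")
                or desc.findtext(f"{{{PDFAID_NS}}}part"))
        if part:
            conformance = (desc.get(f"{{{PDFAID_NS}}}conformance")
                           or desc.findtext(f"{{{PDFAID_NS}}}conformance")
                           or "")
            return part + conformance
    return ""


# Hypothetical XMP packets for illustration.
CLAIMING = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
      <pdfaid:part>2</pdfaid:part>
      <pdfaid:conformance>B</pdfaid:conformance>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""

PLAIN = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""/>
  </rdf:RDF>
</x:xmpmeta>"""

claimed = claims_pdfa_sketch(CLAIMING)
unclaimed = claims_pdfa_sketch(PLAIN)
```

As the description above notes, finding the marker only shows that the file claims conformance; a validator is still needed to confirm it.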

ocrmypdf.pdfa.generate_pdfa_ps(target_filename: pathlib.Path, icc: str = 'sRGB')¶

Create a PostScript PDFMARK file for Ghostscript PDF/A conversion.

pdfmark is an extension to the PostScript language that describes some PDF features like bookmarks and annotations. It was originally specified for Adobe Distiller, for PostScript to PDF conversion.

Ghostscript uses pdfmark for PDF to PDF/A conversion as well. To use Ghostscript to create a PDF/A, we need to create a pdfmark file with the necessary metadata.

This function takes care of the many version-specific bugs and peculiarities in Ghostscript’s handling of pdfmark.

The only information we put in specifies that we want the file to be a PDF/A, and that we want Ghostscript to convert objects to the sRGB colorspace if it runs into any object that it decides must be converted.

Parameters

target_filename – filename to save
icc – ICC identifier such as ‘sRGB’

References

Adobe PDFMARK Reference: https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/pdfmark_reference.pdf