API Reference¶
This page summarizes the rest of the public API. Generally speaking, this should mainly be of interest to plugin developers.
ocrmypdf¶
-
class
ocrmypdf.
PageContext
(pdf_context: ocrmypdf._jobcontext.PdfContext, pageno)¶ Holds our context for a page.
Must be pickle-able, so stores only intrinsic/simple data elements or those capable of serializing themselves via __getstate__.
-
get_path
(name: str) → pathlib.Path¶ Generate a Path for a file that is part of processing this page.
The path will be based in a common temporary folder and have a prefix based on the page number.
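As a rough illustration of the naming scheme described above, the following standalone sketch builds per-page paths inside a shared temporary folder. The helper name `page_path` and the six-digit, one-based prefix format are assumptions made for this example; they are not taken from the OCRmyPDF source.

```python
import tempfile
from pathlib import Path


def page_path(work_folder: Path, pageno: int, name: str) -> Path:
    """Return a path inside work_folder whose name is prefixed by the page.

    Hypothetical helper: the six-digit, one-based prefix is an assumed format.
    """
    return work_folder / f"{pageno + 1:06d}_{name}"


with tempfile.TemporaryDirectory() as tmp:
    # pageno is zero-based, so page 0 gets the prefix 000001 in this sketch.
    example = page_path(Path(tmp), pageno=0, name="ocr.png")
    print(example.name)
```

Because every file for a page shares the same prefix, intermediate files from different pages can coexist in one folder without name collisions.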
-
options
= None¶ The specified options for processing this PDF.
-
origin
= None¶ The filename of the original input file.
-
pageinfo
= None¶ Information on this page.
-
pageno
= None¶ This page number (zero-based).
-
plugin_manager
= None¶ PluginManager for processing the current PDF.
-
-
class
ocrmypdf.
PdfContext
(options: argparse.Namespace, work_folder: pathlib.Path, origin: pathlib.Path, pdfinfo: ocrmypdf.pdfinfo.info.PdfInfo, plugin_manager)¶ Holds the context for a particular run of the pipeline.
-
get_page_contexts
() → Iterator[ocrmypdf._jobcontext.PageContext]¶ Get all PageContext for this PDF.
-
get_path
(name: str) → pathlib.Path¶ Generate a Path for an intermediate file involved in processing.
The path will be in a temporary folder that is common for all processing of this particular PDF.
-
options
= None¶ The specified options for processing this PDF.
-
origin
= None¶ The filename of the original input file.
-
pdfinfo
= None¶ Detailed data for this PDF.
-
plugin_manager
= None¶ PluginManager for processing the current PDF.
-
ocrmypdf.exceptions¶
OCRmyPDF’s exceptions.
-
exception
ocrmypdf.exceptions.
BadArgsError
¶ Invalid arguments on the command line or API.
-
exit_code
= 1¶
-
-
exception
ocrmypdf.exceptions.
DigitalSignatureError
¶ PDF has a digital signature.
-
exit_code
= 2¶
-
message
= 'Input PDF has a digital signature. OCR would alter the document,\ninvalidating the signature.\n'¶
-
-
exception
ocrmypdf.exceptions.
EncryptedPdfError
¶ Input PDF is encrypted.
-
exit_code
= 8¶
-
message
= "Input PDF is encrypted. The encryption must be removed to\nperform OCR.\n\nFor information about this PDF's security use\n qpdf --show-encryption infilename\n\nYou can remove the encryption using\n qpdf --decrypt [--password=[password]] infilename\n"¶
-
-
class
ocrmypdf.exceptions.
ExitCode
¶ OCRmyPDF’s exit codes.
-
already_done_ocr
= 6¶
-
bad_args
= 1¶
-
child_process_error
= 7¶
-
ctrl_c
= 130¶
-
encrypted_pdf
= 8¶
-
file_access_error
= 5¶
-
input_file
= 2¶
-
invalid_config
= 9¶
-
invalid_output_pdf
= 4¶
-
missing_dependency
= 3¶
-
ok
= 0¶
-
other_error
= 15¶
-
pdfa_conversion_failed
= 10¶
-
-
exception
ocrmypdf.exceptions.
ExitCodeException
¶ An exception which should return an exit code with sys.exit().
-
exit_code
= 15¶
-
message
= ''¶
-
-
exception
ocrmypdf.exceptions.
InputFileError
¶ Something is wrong with the input file.
-
exit_code
= 2¶
-
-
exception
ocrmypdf.exceptions.
MissingDependencyError
¶ A third-party dependency is missing.
-
exit_code
= 3¶
-
-
exception
ocrmypdf.exceptions.
OutputFileAccessError
¶ Cannot access the intended output file path.
-
exit_code
= 5¶
-
-
exception
ocrmypdf.exceptions.
SubprocessOutputError
¶ A subprocess returned an unexpected error.
-
exit_code
= 7¶
-
ocrmypdf.helpers¶
Support functions.
-
exception
ocrmypdf.helpers.
NeverRaise
An exception that is never raised.
-
class
ocrmypdf.helpers.
Resolution
(x: T, y: T) The number of pixels per inch in each 2D direction.
Resolution objects are considered “equal” for == purposes if they are equal to a reasonable tolerance.
-
flip_axis
() → ocrmypdf.helpers.Resolution[~T][T] Return a new Resolution object with x and y swapped.
-
is_finite
True if both x and y are finite numbers.
-
is_square
True if the resolution is square (x == y).
-
round
(ndigits: int) → ocrmypdf.helpers.Resolution Round to ndigits after the decimal point.
-
take_max
(vals: Iterable[Any], yvals: Iterable[Any] | None = None) → Resolution Return a new Resolution object with the maximum resolution of inputs.
-
take_min
(vals: Iterable[Any], yvals: Iterable[Any] | None = None) → Resolution Return a new Resolution object with the minimum resolution of inputs.
-
to_int
() → ocrmypdf.helpers.Resolution[int][int] Round to nearest integer.
-
to_scalar
() → float Return the harmonic mean of x and y as a 1D approximation.
In most cases, Resolution is 2D, but typically it is “square” (x == y) and can be approximated as a single number. When not square, the harmonic mean is used to approximate the 2D resolution as a single number.
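The harmonic mean mentioned above can be illustrated with the standard library. This sketch only demonstrates the formula as documented; the free function `to_scalar` here is a stand-in, and whether the library computes exactly this value is not confirmed by this page.

```python
from statistics import harmonic_mean


def to_scalar(x: float, y: float) -> float:
    """Collapse a 2D resolution into one number via the harmonic mean."""
    return harmonic_mean([x, y])


square = to_scalar(300, 300)      # a square resolution is unchanged
non_square = to_scalar(300, 600)  # 2 / (1/300 + 1/600)
```

The harmonic mean leans toward the smaller of the two values, so a strongly non-square resolution is approximated conservatively.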
-
-
ocrmypdf.helpers.
available_cpu_count
() → int Returns number of CPUs in the system.
-
ocrmypdf.helpers.
check_pdf
(input_file: pathlib.Path) → bool Check if a PDF complies with the PDF specification.
Checks for proper formatting and proper linearization. Uses pikepdf (which in turn, uses QPDF) to perform the checks.
-
ocrmypdf.helpers.
clamp
(n: T, smallest: T, largest: T) → T Clamps the value of n to between smallest and largest.
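A minimal sketch of clamping, assuming the bounds are inclusive and that `smallest <= largest`:

```python
def clamp(n, smallest, largest):
    """Limit n to the closed interval [smallest, largest]."""
    return max(smallest, min(n, largest))


print(clamp(5, 0, 3))
print(clamp(-1, 0, 3))
print(clamp(2, 0, 3))
```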
-
ocrmypdf.helpers.
is_file_writable
(test_file: os.PathLike) → bool Intentionally racy test if target is writable.
We intend to write to the output file if and only if we succeed and can replace it atomically. Before doing the OCR work, make sure the location is writable.
-
ocrmypdf.helpers.
is_iterable_notstr
(thing: Any) → bool Is this an iterable type, other than a string?
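One plausible way to express this check, written as a standalone sketch rather than a copy of the library's implementation:

```python
from collections.abc import Iterable
from typing import Any


def is_iterable_notstr(thing: Any) -> bool:
    """True for lists, tuples, generators, etc., but False for str."""
    return isinstance(thing, Iterable) and not isinstance(thing, str)


print(is_iterable_notstr(["a", "b"]))
print(is_iterable_notstr("ab"))
```

A check like this is useful when an argument may be either a single string or a collection of strings, since iterating over a bare string would yield its characters.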
-
ocrmypdf.helpers.
monotonic
(seq: Sequence) → bool Does this sequence increase monotonically?
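A short sketch of such a test, assuming "increase monotonically" means strictly increasing (equal neighbours fail):

```python
from typing import Sequence


def monotonic(seq: Sequence) -> bool:
    """True if every element is greater than the one before it."""
    return all(b > a for a, b in zip(seq, seq[1:]))


print(monotonic([1, 2, 3]))
print(monotonic([1, 1, 2]))
```

Under this definition an empty or single-element sequence counts as monotonic, because there are no adjacent pairs to violate the ordering.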
-
ocrmypdf.helpers.
page_number
(input_file: os.PathLike) → int Get one-based page number implied by filename (000002.pdf -> 2).
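The filename convention in the example above (000002.pdf -> 2) can be sketched like this. Parsing the leading run of digits is an assumption of this sketch; the library may instead rely on a fixed-width prefix.

```python
import re
from pathlib import Path


def page_number(input_file) -> int:
    """Parse the leading digits of the file name as a one-based page number."""
    name = Path(input_file).name
    match = re.match(r"\d+", name)
    if match is None:
        raise ValueError(f"no page number prefix in {name!r}")
    return int(match.group())


print(page_number("000002.pdf"))
```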
-
ocrmypdf.helpers.
pikepdf_enable_mmap
() → None Enable pikepdf mmap.
-
ocrmypdf.helpers.
remove_all_log_handlers
(logger: logging.Logger) → None Remove all log handlers, usually used in a child process.
The child process inherits the log handlers from the parent process when a fork occurs. Typically we want to remove all log handlers in the child process so that the child process can set up a single queue handler to forward log messages to the parent process.
-
ocrmypdf.helpers.
safe_symlink
(input_file: os.PathLike, soft_link_name: os.PathLike) → None Create a symbolic link at soft_link_name, which references input_file.
Think of this as copying input_file to soft_link_name with less overhead.
Use symlinks safely. Self-linking loops are prevented. On Windows, file copy is used since symlinks may require administrator privileges. An existing link at the destination is removed.
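The behaviour described above could be approximated as follows. This is a simplified sketch under stated assumptions: it treats "self-linking" as source and destination being the same absolute path and silently returns in that case, which may differ from how the library handles it.

```python
import os
import shutil
import tempfile
from pathlib import Path


def safe_symlink(input_file, soft_link_name) -> None:
    """Link soft_link_name to input_file, copying instead on Windows."""
    src = os.path.abspath(input_file)
    dst = os.path.abspath(soft_link_name)
    if src == dst:
        return  # linking a file to itself would create a loop
    if os.path.islink(dst):
        os.unlink(dst)  # replace an existing link at the destination
    if os.name == "nt":
        shutil.copyfile(src, dst)  # symlinks may need administrator rights
    else:
        os.symlink(src, dst)


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "input.pdf"
    source.write_bytes(b"%PDF-1.4\n")
    link = Path(tmp) / "link.pdf"
    safe_symlink(source, link)
    safe_symlink(source, link)  # a second call replaces the existing link
    linked_bytes = link.read_bytes()
```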
-
ocrmypdf.helpers.
samefile
(file1: os.PathLike, file2: os.PathLike) → bool Return True if two files are the same file.
Attempts to account for different relative paths to the same file.
ocrmypdf.hocrtransform¶
Transform .hocr and page image to text PDF.
-
class
ocrmypdf.hocrtransform.
HocrTransform
(*, hocr_filename: str | Path, dpi: float)¶ A class for converting documents from the hOCR format.
For details of the hOCR format, see: http://kba.cloud/hocr-spec/.
-
classmethod
baseline
(element: xml.etree.ElementTree.Element) → tuple¶ Get baseline’s slope and intercept.
-
classmethod
element_coordinates
(element: xml.etree.ElementTree.Element) → ocrmypdf.hocrtransform.Rect¶ Get coordinates of the bounding box around an element.
-
classmethod
polyval
(poly, x)¶ Calculate the value of a polynomial at a point.
-
pt_from_pixel
(pxl) → ocrmypdf.hocrtransform.Rect¶ Returns the quantity in PDF units (pt) given quantity in pixels.
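The pixel-to-point conversion rests on the fact that a PDF point is 1/72 inch. The sketch below shows the scalar arithmetic only; the documented method returns a Rect, and the free function with an explicit `dpi` parameter is an assumption of this example.

```python
POINTS_PER_INCH = 72.0


def pt_from_pixel(pixels: float, dpi: float) -> float:
    """Convert a length in pixels to PDF points at the given resolution."""
    return pixels / dpi * POINTS_PER_INCH


# 2550 pixels at 300 DPI is 8.5 inches, the width of a US Letter page.
letter_width_pt = pt_from_pixel(2550, dpi=300)
```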
-
classmethod
replace_unsupported_chars
(s: str) → str¶ Replaces characters with those available in the Helvetica typeface.
-
to_pdf
(*, out_filename: Path, image_filename: Path | None = None, show_bounding_boxes: bool = False, fontname: str = 'Helvetica', invisible_text: bool = False, interword_spaces: bool = False) → None¶ Creates a PDF file with an image superimposed on top of the text.
Text is positioned according to the bounding box of the lines in the hOCR file. The image need not be identical to the image used to create the hOCR file. It can have a lower resolution, different color mode, etc.
Parameters: - out_filename – Path of PDF to write.
- image_filename – Image to use for this file. If omitted, the OCR text is shown.
- show_bounding_boxes – Show bounding boxes around various text regions, for debugging.
- fontname – Name of font to use.
- invisible_text – If True, text is rendered invisible so that it is selectable but never drawn. If False, text is visible and may be seen if the image is skipped or deleted in Acrobat.
- interword_spaces – If True, insert spaces between words rather than drawing each word without spaces. Generally this improves text extraction.
-
-
exception
ocrmypdf.hocrtransform.
HocrTransformError
¶ Error while applying hOCR transform.
ocrmypdf.pdfa¶
Utilities for PDF/A production and confirmation with Ghostscript.
-
ocrmypdf.pdfa.
file_claims_pdfa
(filename: pathlib.Path)¶ Determines if the file claims to be PDF/A compliant.
This only checks if the XMP metadata contains a PDF/A marker. It does not do full PDF/A validation.
-
ocrmypdf.pdfa.
generate_pdfa_ps
(target_filename: pathlib.Path, icc: str = 'sRGB')¶ Create a Postscript PDFMARK file for Ghostscript PDF/A conversion.
pdfmark is an extension to the Postscript language that describes some PDF features like bookmarks and annotations. It was originally specified for Adobe Distiller, for Postscript to PDF conversion.
Ghostscript uses pdfmark for PDF to PDF/A conversion as well. To use Ghostscript to create a PDF/A, we need to create a pdfmark file with the necessary metadata.
This function takes care of the many version-specific bugs and peculiarities in Ghostscript’s handling of pdfmark.
The only information we put in specifies that we want the file to be a PDF/A, and we want Ghostscript to convert objects to the sRGB colorspace if it runs into any object that it decides must be converted.
Parameters: - target_filename – filename to save
- icc – ICC identifier such as ‘sRGB’
References
Adobe PDFMARK Reference: https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/pdfmark_reference.pdf
ocrmypdf.quality¶
Utilities to measure OCR quality.
-
class
ocrmypdf.quality.
OcrQualityDictionary
(*, wordlist: Iterable[str])¶ Manages a dictionary for simple OCR quality checks.
-
measure_words_matched
(ocr_text: str) → float¶ Check how many unique words in the OCR text match a dictionary.
Words with mixed capitalization are only considered a match if the test word matches that capitalization.
Returns: number of words that match / number
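A simplified sketch of a dictionary-based quality check in this spirit is shown below. The class name `WordlistChecker`, the ASCII-only tokenization, and the choice of unique words as the denominator are all assumptions of this example, not details confirmed by the documentation above.

```python
import re
from typing import Iterable


class WordlistChecker:
    """Hypothetical stand-in for a dictionary-based OCR quality check."""

    def __init__(self, *, wordlist: Iterable[str]):
        self.dictionary = set(wordlist)

    def _matches(self, word: str) -> bool:
        # An exact match always counts. A lowercase dictionary entry also
        # matches a capitalized test word, but a mixed-case entry such as
        # "Paris" does not match the lowercase test word "paris".
        return word in self.dictionary or word.lower() in self.dictionary

    def measure_words_matched(self, ocr_text: str) -> float:
        # Split on anything that is not an ASCII letter; keep unique words.
        words = {w for w in re.split(r"[^A-Za-z]+", ocr_text) if w}
        if not words:
            return 0.0
        matches = sum(1 for w in words if self._matches(w))
        return matches / len(words)


checker = WordlistChecker(wordlist=["the", "cat", "Paris"])
score = checker.measure_words_matched("The cat saw paris")
```

Here "The" and "cat" match, while "saw" is not in the word list and "paris" lacks the capitalization of the dictionary entry.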
-
ocrmypdf.subprocess¶
Wrappers to manage subprocess calls.
-
ocrmypdf.subprocess.
check_external_program
(*, program: str, package: str, version_checker: Callable[[], Version], need_version: str | Version, required_for: str | None = None, recommended: bool = False, version_parser: type[Version] = <class 'packaging.version.Version'>) → None¶ Check for required version of external program and raise exception if not.
Parameters: - program – The name of the program to test.
- package – The name of a software package that typically supplies this program. Usually the same as program.
- version_checker – A callable without arguments that retrieves the installed version of program.
- need_version – The minimum required version.
- required_for – The name of an argument or feature that requires this program.
- recommended – If this external program is recommended, instead of raising an exception, log a warning and allow execution to continue.
- version_parser – A class that should be used to parse and compare version numbers. Used when version numbers do not follow standard conventions.
-
ocrmypdf.subprocess.
get_version
(program: str, *, version_arg: str = '--version', regex='(\\d+(\\.\\d+)*)', env: OsEnviron | None = None) → str¶ Get the version of the specified program.
Parameters: - program – The program to version check.
- version_arg – The argument needed to ask for its version, e.g. --version.
- regex – A regular expression to parse the program’s output and obtain the version.
- env – Custom os.environ in which to run program.
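To show what the default regex in the signature above extracts, here is a standalone sketch that applies it to captured output. The helper `parse_version` and the sample output string are hypothetical; the real function also runs the program, which this example does not attempt.

```python
import re

# Default pattern from the signature above, written as a raw string.
VERSION_REGEX = r"(\d+(\.\d+)*)"


def parse_version(output: str, regex: str = VERSION_REGEX) -> str:
    """Return the first version-like token found in a program's output."""
    match = re.search(regex, output)
    if match is None:
        raise ValueError(f"no version found in {output!r}")
    return match.group(1)


sample_output = "exampletool 4.1.1\n built with libfoo 1.79"
version = parse_version(sample_output)
```

Because the search stops at the first match, a program that prints a dependency's version before its own would need a custom regex.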
-
ocrmypdf.subprocess.
run
(args: Args, *, env: OsEnviron | None = None, logs_errors_to_stdout: bool = False, check: bool = False, **kwargs) → CompletedProcess¶ Wrapper around subprocess.run().
The main purpose of this wrapper is to log subprocess output in an orderly fashion that identifies the responsible subprocess. An additional task is that this function goes to greater lengths to find possible Windows locations of our dependencies when they are not on the system PATH.
Arguments should be identical to subprocess.run, except for the following:
Parameters: - args – Positional arguments to pass to subprocess.run.
- env – A set of environment variables. If None, the OS environment is used.
- logs_errors_to_stdout – If True, indicates that the process writes its error messages to stdout rather than stderr, so stdout should be logged if there is an error. If False, stderr is logged. Could be used with stderr=STDOUT, stdout=PIPE for example.
- check – If True, raise an exception if the process exits with a non-zero status code. If False, the return value will indicate success or failure.
- kwargs – Additional arguments to pass to subprocess.run.
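The logging behaviour described for this wrapper can be approximated with the standard library alone. The sketch below is not the library's implementation: `run_logged` is a hypothetical name, and the Windows path lookup mentioned above is left out entirely.

```python
import logging
import os
import subprocess
import sys

log = logging.getLogger("subprocess_sketch")


def run_logged(args, *, env=None, logs_errors_to_stdout=False, check=False, **kwargs):
    """Run a command, then log the stream that carries its error messages."""
    program = os.path.basename(str(args[0]))
    proc = subprocess.run(args, env=env, **kwargs)
    stream = proc.stdout if logs_errors_to_stdout else proc.stderr
    if stream:
        # Captured output may be bytes or str depending on the caller's kwargs.
        text = stream if isinstance(stream, str) else stream.decode(errors="replace")
        for line in text.splitlines():
            log.debug("[%s] %s", program, line)  # tag lines with the program
    if check:
        proc.check_returncode()  # raises CalledProcessError on non-zero exit
    return proc


result = run_logged(
    [sys.executable, "-c", "import sys; sys.stderr.write('warming up\\n')"],
    stderr=subprocess.PIPE,
)
```

Prefixing each logged line with the program name keeps interleaved output from several external tools attributable.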
-
ocrmypdf.subprocess.
run_polling_stderr
(args: Args, *, callback: Callable[[str], None], check: bool = False, env: OsEnviron | None = None, **kwargs) → CompletedProcess¶ Run a process like ocrmypdf.subprocess.run, and poll stderr.
Every line produced by stderr will be forwarded to the callback function. The intended use is monitoring progress of subprocesses that output their own progress indicators. In addition, each line will be logged if debug logging is enabled.
Requires stderr to be opened in text mode for ease of handling errors. In addition the expected encoding= and errors= arguments should be set. Note that if stdout is already set up, it need not be binary.
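Polling stderr line by line can be sketched with `subprocess.Popen`. This is an illustrative approximation under stated assumptions: the function name `run_polling_stderr_sketch` is hypothetical, stdout is left uncaptured to avoid pipe deadlocks, and debug logging is omitted.

```python
import subprocess
import sys
from typing import Callable, List


def run_polling_stderr_sketch(args, *, callback: Callable[[str], None], check=False, env=None):
    """Run args, passing each stderr line to callback as it arrives."""
    lines: List[str] = []
    with subprocess.Popen(
        args,
        env=env,
        stderr=subprocess.PIPE,
        text=True,            # stderr must be opened in text mode
        encoding="utf-8",
        errors="replace",
    ) as proc:
        for line in proc.stderr:
            callback(line.rstrip("\n"))  # forward the line without its newline
            lines.append(line)
        returncode = proc.wait()
    completed = subprocess.CompletedProcess(args, returncode, None, "".join(lines))
    if check:
        completed.check_returncode()
    return completed


seen: List[str] = []
child_code = "import sys; print('page 1', file=sys.stderr); print('page 2', file=sys.stderr)"
result = run_polling_stderr_sketch([sys.executable, "-c", child_code], callback=seen.append)
```

A progress bar could be driven by replacing `seen.append` with a callback that parses each line for a page counter.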