Batch processing

This article provides information about running OCRmyPDF on multiple files or configuring it as a service triggered by file system events.

Batch jobs

Consider using the excellent GNU Parallel to apply OCRmyPDF to multiple files at once.

Both parallel and ocrmypdf will try to use all available processors. To maximize parallelism without overloading your system with processes, consider using parallel -j 2 to limit parallel to running two jobs at once.

This command will run all ocrmypdf all files named *.pdf in the current directory and write them to the previous created output/ folder. It will not search subdirectories.

The --tag argument tells parallel to print the filename as a prefix whenever a message is printed, so that one can trace any errors to the file that produced them.

parallel --tag -j 2 ocrmypdf '{}' 'output/{}' ::: *.pdf

OCRmyPDF automatically repairs PDFs before parsing and gathering information from them.

Directory trees

This will walk through a directory tree and run OCR on all files in place, printing the output in a way that makes

find . -printf '%p' -name '*.pdf' -exec ocrmypdf '{}' '{}' \;

Alternatively, with a docker container (mounts a volume to the container where the PDFs are stored):

find . -printf '%p' -name '*.pdf' -exec docker run --rm -v <host dir>:<container dir> jbarlow83/ocrmypdf '<container dir>/{}' '<container dir>/{}' \;

This only runs one ocrmypdf process at a time. This variation uses find to create a directory list and parallel to parallelize runs of ocrmypdf, again updating files in place.

find . -name '*.pdf' | parallel --tag -j 2 ocrmypdf '{}' '{}'

In a Windows batch file, use

for /r %%f in (*.pdf) do ocrmypdf %%f %%f

Sample script

This user contributed script also provides an example of batch processing.

misc/batch.py
#!/usr/bin/env python3
# Copyright 2016 findingorder: https://github.com/findingorder
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

# This script must be edited to meet your needs.

import logging
import os
import sys

import ocrmypdf

# pylint: disable=logging-format-interpolation
# pylint: disable=logging-not-lazy

script_dir = os.path.dirname(os.path.realpath(__file__))
print(script_dir + '/batch.py: Start')

if len(sys.argv) > 1:
    start_dir = sys.argv[1]
else:
    start_dir = '.'

if len(sys.argv) > 2:
    log_file = sys.argv[2]
else:
    log_file = script_dir + '/ocr-tree.log'

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(message)s',
    filename=log_file,
    filemode='w',
)

ocrmypdf.configure_logging(ocrmypdf.Verbosity.default)

for dir_name, subdirs, file_list in os.walk(start_dir):
    logging.info(dir_name + '\n')
    os.chdir(dir_name)
    for filename in file_list:
        file_ext = os.path.splitext(filename)[1]
        if file_ext == '.pdf':
            full_path = dir_name + '/' + filename
            print(full_path)
            result = ocrmypdf.ocr(filename, filename, deskew=True)
            if result == ocrmypdf.ExitCode.already_done_ocr:
                print("Skipped document because it already contained text")
            elif result == ocrmypdf.ExitCode.ok:
                print("OCR complete")
            logging.info(result)

Synology DiskStations

Synology DiskStations (Network Attached Storage devices) can run the Docker image of OCRmyPDF if the Synology Docker package is installed. Attached is a script to address particular quirks of using OCRmyPDF on one of these devices.

This is only possible for x86-based Synology products. Some Synology products use ARM or Power processors and do not support Docker. Further adjustments might be needed to deal with the Synology’s relatively limited CPU and RAM.

misc/synology.py - Sample script for Synology DiskStations
#!/bin/env python3
# Copyright 2017 github.com/Enantiomerie
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

# This script must be edited to meet your needs.

import logging
import os
import shutil
import subprocess
import sys
import time

# pylint: disable=logging-format-interpolation
# pylint: disable=logging-not-lazy

script_dir = os.path.dirname(os.path.realpath(__file__))
timestamp = time.strftime("%Y-%m-%d-%H%M_")
log_file = script_dir + '/' + timestamp + 'ocrmypdf.log'
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(message)s',
    filename=log_file,
    filemode='w',
)

if len(sys.argv) > 1:
    start_dir = sys.argv[1]
else:
    start_dir = '.'

for dir_name, subdirs, file_list in os.walk(start_dir):
    logging.info(dir_name)
    os.chdir(dir_name)
    for filename in file_list:
        file_stem, file_ext = os.path.splitext(filename)
        if file_ext != '.pdf':
            continue
        full_path = os.path.join(dir_name, filename)
        timestamp_ocr = time.strftime("%Y-%m-%d-%H%M_OCR_")
        filename_ocr = timestamp_ocr + file_stem + '.pdf'
        # create string for pdf processing
        # the script is processed as root user via chron
        cmd = [
            'docker',
            'run',
            '--rm',
            '-i',
            'jbarlow83/ocrmypdf',
            '--deskew',
            '-',
            '-',
        ]
        logging.info(cmd)
        full_path_ocr = os.path.join(dir_name, filename_ocr)
        with open(filename, 'rb') as input_file, open(
            full_path_ocr, 'wb'
        ) as output_file:
            proc = subprocess.run(
                cmd,
                stdin=input_file,
                stdout=output_file,
                stderr=subprocess.PIPE,
                check=False,
            )
        logging.info(proc.stderr.read())
        os.chmod(full_path_ocr, 0o664)
        os.chmod(full_path, 0o664)
        full_path_ocr_archive = sys.argv[2]
        full_path_archive = sys.argv[2] + '/no_ocr'
        shutil.move(full_path_ocr, full_path_ocr_archive)
        shutil.move(full_path, full_path_archive)
logging.info('Finished.\n')

Huge batch jobs

If you have thousands of files to work with, contact the author. Consulting work related to OCRmyPDF helps fund this open source project and all inquiries are appreciated.

Hot (watched) folders

Watched folders with watcher.py

OCRmyPDF has a folder watcher called watcher.py, which is currently included in source distributions but not part of the main program. It may be used natively or may run in a Docker container. Native instances tend to give better performance. watcher.py works on all platforms.

Users may need to customize the script to meet their requirements.

pip3 install -r requirements/watcher.txt

env OCR_INPUT_DIRECTORY=/mnt/input-pdfs \
    OCR_OUTPUT_DIRECTORY=/mnt/output-pdfs \
    OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1 \
    python3 watcher.py
watcher.py environment variables
Environment variable Description
OCR_INPUT_DIRECTORY Set input directory to monitor (recursive)
OCR_OUTPUT_DIRECTORY Set output directory (should not be under input)
OCR_ON_SUCCESS_DELETE This will delete the input file if the exit code is 0 (OK)
OCR_OUTPUT_DIRECTORY_YEAR_MONTH This will place files in the output in {output}/{year}/{month}/{filename}
OCR_DESKEW Apply deskew to crooked input PDFs
OCR_JSON_SETTINGS A JSON string specifying any other arguments for ocrmypdf.ocr, e.g. 'OCR_JSON_SETTINGS={"rotate_pages": true}'.
OCR_POLL_NEW_FILE_SECONDS Polling interval
OCR_LOGLEVEL Level of log messages to report

One could configure a networked scanner or scanning computer to drop files in the watched folder.

Watched folders with Docker

The watcher service is included in the OCRmyPDF Docker image. To run it:

docker run \
    -v <path to files to convert>:/input \
    -v <path to store results>:/output \
    -e OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1 \
    -e OCR_ON_SUCCESS_DELETE=1 \
    -e OCR_DESKEW=1 \
    -e PYTHONUNBUFFERED=1 \
    -it --entrypoint python3 \
    jbarlow83/ocrmypdf \
    watcher.py

This service will watch for a file that matches /input/\*.pdf and will convert it to a OCRed PDF in /output/. The parameters to this image are:

watcher.py parameters for Docker
Parameter Description
-v <path to files to convert>:/input Files placed in this location will be OCRed
-v <path to store results>:/output This is where OCRed files will be stored
-e OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1 Define environment variable OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1
-e OCR_ON_SUCCESS_DELETE=1 Define environment variable
-e OCR_DESKEW=1 Define environment variable
-e PYTHONBUFFERED=1 This will force STDOUT to be unbuffered and allow you to see messages in docker logs

This service relies on polling to check for changes to the filesystem. It may not be suitable for some environments, such as filesystems shared on a slow network.

A configuration manager such as Docker Compose could be used to ensure that the service is always available.

misc/docker-compose.example.yml
---
version: "3.3"
services:
  ocrmypdf:
    restart: always
    container_name: ocrmypdf
    image: jbarlow83/ocrmypdf
    volumes:
      - "/media/scan:/input"
      - "/mnt/scan:/output"
    environment:
      - OCR_OUTPUT_DIRECTORY_YEAR_MONT=0
    user: "<SET TO YOUR USER ID>:<SET TO YOUR GROUP ID>"
    entrypoint: python3
    command: watcher.py

Caveats

  • watchmedo may not work properly on a networked file system, depending on the capabilities of the file system client and server.
  • This simple recipe does not filter for the type of file system event, so file copies, deletes and moves, and directory operations, will all be sent to ocrmypdf, producing errors in several cases. Disable your watched folder if you are doing anything other than copying files to it.
  • If the source and destination directory are the same, watchmedo may create an infinite loop.
  • On BSD, FreeBSD and older versions of macOS, you may need to increase the number of file descriptors to monitor more files, using ulimit -n 1024 to watch a folder of up to 1024 files.

Alternatives

  • On Linux, systemd user services can be configured to automatically perform OCR on a collection of files.
  • Watchman is a more powerful alternative to watchmedo.

AWS Lambda is not viable

AWS Lambda and its equivalents have low limits on execution time and payload size, relative to OCRmyPDF’s needs. As of this writing, the request/response payload for AWS Lambda was 6 MB, which means many PDFs will not fit.

macOS Automator

You can use the Automator app with macOS, to create a Workflow or Quick Action. Use a Run Shell Script action in your workflow. In the context of Automator, the PATH may be set differently your Terminal’s PATH; you may need to explicitly set the PATH to include ocrmypdf. The following example may serve as a starting point:

Example macOS Automator workflow

You may customize the command sent to ocrmypdf.