% SPDX-FileCopyrightText: 2022 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0

# v3

## v3.2.1

Changes

- Fixed {issue}`47`
  "convert() got an unexpected keyword argument 'dpi'" by upgrading to
  img2pdf 0.2
- Tweaked the Dockerfiles

## v3.2

New features

- Lossless reconstruction: when possible, OCRmyPDF will inject text
  layers without otherwise manipulating the content and layout of a PDF
  page. For example, a PDF containing a mix of vector and raster
  content would see the vector content preserved. Images may still be
  transcoded during PDF/A conversion. (`--deskew` and
  `--clean-final` disable this mode, necessarily.)
- New argument `--tesseract-pagesegmode` allows you to pass page
  segmentation arguments to Tesseract OCR. This helps with two-column
  text and other layouts that confuse Tesseract.
- Added a new "polyglot" version of the Docker image that includes
  Tesseract with all language packs installed, for the polyglots among
  us. It is much larger.

Changes

- JPEG transcoding quality is now 95 instead of the default 75, trading
  larger file sizes for less degradation.

## v3.1.1

Changes

- Fixed bug that caused incorrect page size and DPI calculations on
  documents with mixed page sizes

## v3.1

Changes

- Default output format is now PDF/A-2b instead of PDF/A-1b
- Python 3.5 and macOS El Capitan are now supported platforms - no
  changes were needed to implement support
- Improved some error messages related to missing input files
- Fixed {issue}`20`: uppercase .PDF extension not accepted
- Fixed an issue where OCRmyPDF failed to detect that certain pages
  contained previously OCR'ed text, such as OCR text produced by
  Tesseract 3.04
- Inserts /Creator tag into PDFs so that errors can be traced back to
  this project
- Added new option `--pdf-renderer=auto`, to let OCRmyPDF pick the
  best PDF renderer. Currently it always chooses the 'hocrtransform'
  renderer but that behavior may change.
- Set up Travis CI automatic integration testing

## v3.0

New features

- Easier installation with a Docker container or Python's `pip`
  package manager
- Eliminated many external dependencies, so it's easier to set up
- Now installs `ocrmypdf` to `/usr/local/bin` or equivalent for
  system-wide access and easier typing
- Improved command line syntax and usage help (`--help`)
- Tesseract 3.03+ PDF page rendering can be used instead for better
  positioning of recognized text (`--pdf-renderer tesseract`)
- PDF metadata (title, author, keywords) are now transferred to the
  output PDF
- PDF metadata can also be set from the command line (`--title`,
  etc.)
- Automatically repairs malformed input PDFs if possible
- Added test cases to confirm everything is working
- Added option to skip extremely large pages that take too long to OCR
  and are often not OCRable (e.g. large scanned maps or diagrams);
  other pages are still processed (`--skip-big`)
- Added option to kill Tesseract OCR process if it seems to be taking
  too long on a page, while still processing other pages
  (`--tesseract-timeout`)
- Less common colorspaces (CMYK, palette) are now supported by
  conversion to RGB
- Multiple images on the same PDF page are now supported

Changes

- New, robust rewrite in Python 3.4+ with
  [ruffus](http://www.ruffus.org.uk/index.html) pipelines

- Now uses Ghostscript 9.14's improved color conversion model to
  preserve PDF colors

- OCR text is now rendered in the PDF as invisible text. Previous
  versions of OCRmyPDF incorrectly rendered visible text with an image
  on top.

- All "tasks" in the pipeline can be executed in parallel on any
  available CPUs, increasing performance

- The `-o DPI` argument has been phased out, in favor of
  `--oversample DPI`, in case we need `-o OUTPUTFILE` in the future

- Removed several dependencies, so it's easier to install. We no longer
  use:

  - GNU [parallel](https://www.gnu.org/software/parallel/)
  - [ImageMagick](http://www.imagemagick.org/script/index.php)
  - Python 2.7
  - Poppler
  - [MuPDF](http://mupdf.com/docs/) tools
  - shell scripts
  - Java and [JHOVE](http://jhove.sourceforge.net/)
  - libxml2

- Some new external dependencies are required or optional, compared to
  v2.x:

  - Ghostscript 9.14+
  - [qpdf](http://qpdf.sourceforge.net/) 5.0.0+
  - [Unpaper](https://github.com/Flameeyes/unpaper) 6.1 (optional)
  - some automatically managed Python packages

Release candidates

- rc9:

  - Fix
    {issue}`118`:
    report error if ghostscript iccprofiles are missing
  - fixed another issue related to
    {issue}`111`: PDF
    rasterized to palette file
  - add support for image files with a palette
  - don't try to validate PDF file after an exception occurs

- rc8:

  - Fix
    {issue}`111`:
    exception thrown if PDF is missing DocumentInfo dictionary

- rc7:

  - fix error when installing direct from pip, "no such file
    'requirements.txt'"

- rc6:

  - dropped libxml2 (Python lxml) since Python 3's internal XML parser
    is sufficient
  - set up Docker container
  - fix Unicode errors if recognized text contains Unicode characters
    and system locale is not UTF-8

- rc5:

  - dropped Java and JHOVE in favour of qpdf
  - improved command line error output
  - additional tests and bug fixes
  - tested on Ubuntu 14.04 LTS

- rc4:

  - dropped MuPDF in favour of qpdf
  - fixed some installer issues and errors in installation
    instructions
  - improve performance: run Ghostscript with multithreaded rendering
  - improve performance: use multiple cores by default
  - bug fix: checking for wrong exception on process timeout

- rc3: skipping version number intentionally to avoid confusion with
  Tesseract

- rc2: first release for public testing to test-PyPI, GitHub

- rc1: testing release process

## Compatibility notes

- `./OCRmyPDF.sh` script is still available for now
- Stacking the verbosity option like `-vvv` is no longer supported
- The configuration file `config.sh` has been removed. Instead, you
  can store common settings in a file and pass it as an argument:

```
ocrmypdf input.pdf output.pdf @settings.txt
```

where `settings.txt` contains *one argument per line*, for example:

```
-l
deu
--author
A. Merkel
--pdf-renderer
tesseract
```
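
As a rough illustration of how such an argument file is read, here is a minimal sketch that assumes the command line is parsed with Python's `argparse` and its `fromfile_prefix_chars` convention. The parser below is a hypothetical stand-in named `ocrmypdf-sketch`, not the real OCRmyPDF parser, and it defines only the options used in the example above.

```python
import argparse
import tempfile
from pathlib import Path

# Hypothetical stand-in parser (an assumption for illustration); the real
# ocrmypdf command defines many more options than these.
parser = argparse.ArgumentParser(prog="ocrmypdf-sketch", fromfile_prefix_chars="@")
parser.add_argument("input_file")
parser.add_argument("output_file")
parser.add_argument("-l", "--language")
parser.add_argument("--author")
parser.add_argument("--pdf-renderer")


def parse_with_settings(settings_text):
    """Write settings_text to a temporary settings.txt and parse it via @file."""
    with tempfile.TemporaryDirectory() as tmp:
        settings = Path(tmp) / "settings.txt"
        settings.write_text(settings_text, encoding="utf-8")
        # argparse replaces "@path" with the file's lines, one argument per line.
        return parser.parse_args(["input.pdf", "output.pdf", f"@{settings}"])


args = parse_with_settings("-l\ndeu\n--author\nA. Merkel\n--pdf-renderer\ntesseract\n")
print(args.language, args.author, args.pdf_renderer)
```

Because each line is taken as exactly one argument, a value containing spaces such as `A. Merkel` needs no quoting in the file, whereas writing `--author A. Merkel` on a single line would be read as one unrecognized argument.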

Fixes

- Fixed handling of filenames containing spaces

Notes and known issues

- Some dependencies may work with versions lower than those tested, so
  if a version requirement is "in the way", try overriding it to see
  whether the older version works.
- `--pdf-renderer tesseract` will output files with an incorrect page
  size in Tesseract 3.03, due to a bug in Tesseract.
- PDF files containing "inline images" are not supported and won't be
  for the 3.0 release. Scanned PDFs almost never contain inline
  images.