Converting a scanned document into a compressed, searchable PDF with redactions

Created: 2022-08-27T06:52:40-05:00

This card pertains to a resource available on the internet.

$ infile=scan.pdf
$ tmpfile=$(mktemp)
$ outfile=searchable-scan.pdf
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile="$tmpfile" "$infile"
$ ocrmypdf -l eng --deskew "$tmpfile" "$outfile"
$ rm "$tmpfile"
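
The commands above can be wrapped in a small function so that the temporary file is removed even when one of the steps fails. This is only a sketch: the name make_searchable is hypothetical and does not come from the article, and it assumes gs and ocrmypdf are on the PATH.

```shell
# make_searchable INFILE OUTFILE (hypothetical helper, not from the article)
# The body is wrapped in ( ) so it runs in a subshell and its variables
# do not leak into the calling shell.
make_searchable() (
    infile=$1
    outfile=$2

    # Fail early if the input is missing or unreadable.
    [ -r "$infile" ] || { echo "cannot read: $infile" >&2; return 1; }

    tmpfile=$(mktemp) || return 1

    # Step 1: recompress with Ghostscript (same flags as the commands above).
    # Step 2: run OCR, but only if step 1 succeeded.
    gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook \
       -dNOPAUSE -dQUIET -dBATCH -sOutputFile="$tmpfile" "$infile" &&
        ocrmypdf -l eng --deskew "$tmpfile" "$outfile"
    status=$?

    # Remove the intermediate file whether or not the steps succeeded.
    rm -f -- "$tmpfile"
    return "$status"
)

# Usage (not run here): make_searchable scan.pdf searchable-scan.pdf
```

The function returns the exit status of whichever step failed, so it can be chained with && in a larger script.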

The order of compression matters. The article's author found that running the gs optimization before OCRmyPDF shaved the file from 1.5 MB to 1 MB. Running only OCRmyPDF took the scanner's raw output from 7.9 MB to 2.7 MB.
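
To check figures like these on your own scans, a couple of helper functions can report how much smaller one file is than another. This is a minimal sketch; size_bytes and percent_saved are hypothetical names, not part of gs or OCRmyPDF.

```shell
# size_bytes FILE: print the size of FILE in bytes.
size_bytes() {
    # Some wc implementations pad the count with spaces; strip whitespace.
    wc -c < "$1" | tr -d '[:space:]'
}

# percent_saved BEFORE AFTER: print the whole-number percentage by which
# AFTER is smaller than BEFORE (integer division, so the result is truncated).
# The ( ) body runs in a subshell so the variables stay private.
percent_saved() (
    before=$(size_bytes "$1")
    after=$(size_bytes "$2")
    [ "$before" -gt 0 ] || { echo "empty file: $1" >&2; return 1; }
    echo $(( (before - after) * 100 / before ))
)

# Usage (not run here): percent_saved scan.pdf searchable-scan.pdf
```

By the same arithmetic, going from 7.9 MB to 2.7 MB is roughly a 66 percent reduction, and going from 1.5 MB to 1 MB is roughly 33 percent.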

jbig2enc is an aggressive compressor for purely black-and-white images.
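
As far as I know, OCRmyPDF picks up jbig2enc on its own when the encoder's binary (named jbig2) is installed, so it can be worth checking for it before a run. A small sketch, where have_tool is a hypothetical helper name:

```shell
# have_tool NAME: succeed if NAME resolves to a command in this shell.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# jbig2enc installs a binary called "jbig2" (assumption: default install name).
if have_tool jbig2; then
    echo "jbig2 found: black-and-white pages can use JBIG2 compression"
else
    echo "jbig2 not found: OCRmyPDF will run without JBIG2 compression"
fi
```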

OCRmyPDF