2007-10-17 PDF Manipulation

With my interest in RPG PDF publishing, I now own a significant number of PDF files. Usually these PDF files have background images, fancy borders, tons of art, and some are scans where the text you see is part of the scanned image. A bummer to print! Some people ask for PDF files that are “unlocked” (stupid DRM restrictions) with various layers that can be enabled and disabled. What I’d love is a command-line tool that takes a PDF file, splits it into pages, pipes these pages through some converters, and reassembles a PDF file. Even if I don’t print a PDF, I find it hard to read with background graphics and/or text that is not black. Apple’s Preview.app is too dumb to offer the appropriate switches while I try to read the PDF files. And Acrobat Reader has so many damn options I can’t even figure out whether it can do it or not. 👎

So I’m looking at pdftk, mbtPdfAsm, the power of a real shell, and I wonder what to do. The greatest thing to do would be to use Inkscape to create SVG files based on these scans, and use those! Then the scanned text would be zoomable without decaying into pixels.

pdftk

mbtPdfAsm

For now, however, I will try to split the PDF file into single pages and use ImageMagick to increase the contrast.

ImageMagick

I hate it when a Mac comes without `gs(1)`. Am I embarking on another endless compile, download, configure cycle?

Compiling and installing Ghostscript seems to have worked without a problem, but unfortunately I get tons of errors when I try to use `convert pg_0010.pdf new_pg_0010.pdf`. Damn!

Ghostscript

   **** Warning:  File has an invalid xref entry:  2.  Rebuilding xref table.
Error: /invalidfont in /findfont
Operand stack:
   --dict:9/18(L)--   F10   1   --dict:8/8(L)--   --dict:8/8(L)--   TimesNewRomanPSMT   --dict:13/13(L)--   Times-Roman   Times-Roman
Execution stack: […]
Dictionary stack: […]
Current allocation mode is local
Last OS error: 2
GPL Ghostscript 8.60: Unrecoverable error, exit code 1
convert: Postscript delegate failed `pg_0010.pdf'.
convert: missing an image filename `new_pg_0010.pdf'.

I wonder what this means. I don’t have Times-Roman on my Mac? I guess it’s true because all I have is Times *New* Roman according to my Font Book.app... Hm.

The strange thing is that the source PDF doesn’t even specify Times-Roman!

8 matches for "Times" in buffer: pg_0010.pdf
    631:<</FontName /TimesNewRomanPSMT
    647:<</BaseFont /TimesNewRomanPSMT
    712:<</FontName /TimesNewRomanPS-ItalicMT
    728:<</BaseFont /TimesNewRomanPS-ItalicMT
    739:<</FontName /TimesNewRomanPS-BoldMT
    755:<</BaseFont /TimesNewRomanPS-BoldMT
    766:<</FontName /TimesNewRomanPS-BoldItalicMT
    782:<</BaseFont /TimesNewRomanPS-BoldItalicMT
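The listing above comes from an editor search. A rough command-line equivalent might look like the sketch below. It assumes the PDF stores its font dictionaries uncompressed, as this page apparently does; fonts inside compressed object streams would not show up. The function name `pdf_font_names` is made up for illustration.

```sh
# Sketch: list the distinct font names a PDF mentions in /FontName and
# /BaseFont entries. Only works when those dictionaries are stored
# uncompressed. pdf_font_names is a hypothetical helper name.
pdf_font_names() {
    # -a: treat the binary PDF as text, -o: print only the matching part
    grep -a -o -E '/(FontName|BaseFont) */[A-Za-z0-9+,-]+' "$1" |
        sed 's#.*/##' |      # keep only what follows the last slash
        LC_ALL=C sort -u     # one line per distinct name
}
```

Calling `pdf_font_names pg_0010.pdf` should then print each referenced font once, which makes it easier to see that Times-Roman itself never appears in the file.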

So I’m assuming there’s a library somewhere that refers to the font...

Pyrobombus:/opt/local/lib/ImageMagick-6.2.7/config alex$ grep Times *
type-ghostscript.xml:    name="Times-Roman"
[…]

Aahhh... Maybe if I change the fonts in this file? Unfortunately, commenting out the suspicious element did nothing to improve the situation. Argh.

Hm. `pdf2ps(1)` is having the exact same problem. And this tool was installed with Ghostscript just now. Thus, this must be independent of my existing ImageMagick installation. Hm.

ImageMagick

Grrrr. One hour wasted. And on my iBook, I get a warning but it works. Argh!

**iBook**: AFPL Ghostscript 8.54 (2006-05-17)

**Mac Mini**: GPL Ghostscript 8.60

Ok, so I used pdftk to split the PDF into single pages, and I’m using `convert(1)` as follows:

convert -sigmoidal-contrast 10,70% pg_0010.pdf pg_0010n.pdf

This does in fact increase the contrast, but it turns out the fuzzy light gray the scanner adds to the letters is helping me read the text. The image with high contrast is in fact more difficult to read. Damn. I need some more experimenting.
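For reference, the whole split, convert, reassemble round trip could be wrapped in a small shell function. This is only a sketch under a few assumptions: `pdftk` and `convert` are on the PATH, the function names `page_out_name` and `enhance_pdf` are invented here, and the temporary work directory is just one way to keep the single pages out of the way.

```sh
# Sketch: split a PDF into pages, push each page through convert(1),
# and glue the results back together. Names are illustrative only.

# pg_0010.pdf -> pg_0010n.pdf, the naming scheme used above
page_out_name() {
    printf '%sn.pdf\n' "${1%.pdf}"
}

enhance_pdf() {
    in=$1
    out=$2
    work=$(mktemp -d) || return 1
    # burst writes one file per page: pg_0001.pdf, pg_0002.pdf, ...
    pdftk "$in" burst output "$work/pg_%04d.pdf" || return 1
    for page in "$work"/pg_[0-9][0-9][0-9][0-9].pdf; do
        convert -sigmoidal-contrast 10,70% "$page" "$(page_out_name "$page")" || return 1
    done
    # zero-padded names make the glob expand in page order
    pdftk "$work"/pg_*n.pdf cat output "$out"
    status=$?
    rm -rf "$work"
    return $status
}
```

Something like `enhance_pdf scan.pdf scan-contrast.pdf` would then run the whole loop, with `scan.pdf` standing in for whatever file is being cleaned up.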

convert -black-threshold 70% pg_0010.pdf pg_0010n.pdf

This gives even worse results.

I’m currently looking at `-contrast 6,80%` and the resulting PDF is not too bad. The only thing that has seriously deteriorated is readability. Now that I look at the pixels, the problem seems to be JPEG artifacts or something similar. As if the image had been decoded and recoded using terrible quality settings. And `-quality 100` seems to have no effect. Perhaps the artifacts are present in the source and exacerbate the problem?

I tried using `-gaussian-blur 1x1` but the result was terrible. `-despeckle` looks much better, but the resulting PDF still looks as if the image resolution deteriorated by a factor of two.

convert -sigmoidal-contrast 10,80% -resample 200 -despeckle pg_0010.pdf pg_0010n.pdf

Using `-resample 200` seems to improve the situation. It’s actually pretty close to what I *thought* would improve readability. But it turns out that *it doesn’t*.
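To compare these settings side by side, the attempts can be collected in one function that writes a separate file per variant. This is a sketch with invented names (`variant_name`, `try_variants`). The last variant is an untested guess rather than something tried above: `convert` rasterizes PDF pages through Ghostscript, by default at 72 dpi, so setting `-density` before the input file may be what addresses the apparent loss of resolution.

```sh
# Sketch: produce one output file per experiment so the results can be
# opened next to each other. Helper names are illustrative only.

# variant_name pg_0010.pdf sigmoid -> pg_0010-sigmoid.pdf
variant_name() {
    printf '%s-%s.pdf\n' "${1%.pdf}" "$2"
}

try_variants() {
    page=$1
    convert -sigmoidal-contrast 10,70% "$page" "$(variant_name "$page" sigmoid)"
    convert -black-threshold 70% "$page" "$(variant_name "$page" threshold)"
    convert -sigmoidal-contrast 10,80% -resample 200 -despeckle "$page" \
        "$(variant_name "$page" despeckle)"
    # Assumption: rasterize at 200 dpi instead of the default; -density
    # has to come before the input file to affect how the PDF is read.
    convert -density 200 "$page" -sigmoidal-contrast 10,80% \
        "$(variant_name "$page" density)"
}
```

Running `try_variants pg_0010.pdf` would leave four files such as `pg_0010-sigmoid.pdf` next to the original page.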

I’m stumped.

Out of ideas.

**Update**: It seems that I just need to install the Ghostscript fonts. And they are not distributed with the source package. In fact, there’s a link in the docs to an FTP server I cannot connect to. But Google leads me to a different SourceForge project: Ghostscript fonts.

SourceForge

Ghostscript fonts
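Once the fonts are unpacked somewhere, Ghostscript still has to find them. One way that should work is the `GS_FONTPATH` environment variable, which Ghostscript scans when a requested font is not found elsewhere. The sketch below assumes the archive contains Type 1 `.pfb` files and that it was unpacked into a directory of your choosing. The function names are made up.

```sh
# Sketch: point Ghostscript at a directory of unpacked fonts.
# has_type1_fonts and use_gs_fonts are hypothetical helper names.

# Succeeds if the directory holds at least one .pfb file
has_type1_fonts() {
    for f in "$1"/*.pfb; do
        [ -e "$f" ] && return 0
    done
    return 1
}

use_gs_fonts() {
    if has_type1_fonts "$1"; then
        GS_FONTPATH=$1
        export GS_FONTPATH
    else
        echo "no .pfb fonts found in $1" >&2
        return 1
    fi
}
```

With the fonts in, say, `$HOME/ghostscript-fonts` (a made-up location), `use_gs_fonts "$HOME/ghostscript-fonts" && pdf2ps pg_0010.pdf` would be a quick way to check whether the `/invalidfont` error goes away.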

#Software

Comments

(Please contact me if you want to remove your comment.)

⁂

Maybe something like pdftoipe could help. I haven’t used pdftoipe or Ipe, but pdftoipe converts a PDF file to an XML file readable by Ipe.

Ipe is a drawing editor.

Anyway, transforming a PDF to XML may be a good first step.

Another program to manipulate PDFs is pdfedit.

– deusmax 2007-11-09 22:15 UTC

---

Hm, that sounds interesting... Do you have a URL for it?

– Alex Schroeder 2007-11-09 22:58 UTC

Alex Schroeder