Once again I’ve decided I needed to work on the memoirs of my grandfather Roland Li-Marchetti. I had digitized 16 pages many years ago but his memoirs contain a total of 45 pages of typed text. The last time I worked on this, I was using the OCR software that came with my scanner (a cheap Canon LiDE 25) – but today the scanner was no longer recognized by the operating system. I faintly remember having experienced this before when I upgraded my system. Bit rot!
Anyway, I was in the mood to try something new. Free Software?
1. Tesseract
2. requires Leptonica
3. and I needed to install GNU Libtool because I was getting an error: “Libtool library used but `LIBTOOL' is undefined. The usual way to define `LIBTOOL' is to add `AC_PROG_LIBTOOL' to `configure.ac' and run `aclocal' and `autoconf' again. If `AC_PROG_LIBTOOL' is in `configure.ac', make sure its definition is in aclocal's search path.”
(While the stuff is compiling, I am in fact using a free online OCR service.)
Here is the original, taken with my Pentax K100D, loaded into GIMP, rotated, cropped, and with the levels auto-adjusted.
/pics/7703476934_8e4cde0f8b_z.jpg
The tesseract output is pretty cool:
que mes vingt prochaines années soient aussi riches d'aventures et de bonheur auprès des miens, main dans la main avec Agnès mon inséparable complice qui a beaucoup sacrifié et que j'espère pouvoir encore rendre heureuse.
(When I tried it on a direct photo of the page the result was far less pleasing.)
Yay!
for ((i=20; i<=46; i++)); do tesseract IMGP$((5210+$i)).JPG "page-$i" -l fra; done
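If you want the same loop to survive a missing photo, here is a rough sketch of how I might wrap it. It assumes, as in the one-liner above, that page N lives in IMGP(5210+N).JPG. The function names `page_image` and `ocr_pages` and the `OCR` variable are made up for this example; they are not part of Tesseract.

```shell
# Map a page number to the camera file name.
# Assumption: page N was photographed as IMGP(5210+N).JPG, as in the loop above.
page_image() {
  printf 'IMGP%d.JPG\n' "$((5210 + $1))"
}

# Run OCR on pages $1..$2, skipping any photo that is missing.
# OCR is a hypothetical override so the loop can be dry-run with e.g. OCR=echo;
# by default it calls the tesseract binary with the French language data.
ocr_pages() {
  i=$1
  last=$2
  while [ "$i" -le "$last" ]; do
    img=$(page_image "$i")
    if [ -f "$img" ]; then
      "${OCR:-tesseract}" "$img" "page-$i" -l fra
    else
      echo "skipping page $i: $img not found" >&2
    fi
    i=$((i + 1))
  done
}
```

Calling `ocr_pages 20 46` should then do the same as the one-liner, and `OCR=echo ocr_pages 20 46` just prints the commands it would run.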
#OCR #Software
⁂
Nice, there are multiple ring binders of my grandfather’s memoirs as well. One day I should digitize them too.
BTW, you might want to have a look at other OCR solutions (I guess most of what I’ve written there would apply to Mac as well).
– Andreas Gohr 2012-08-03 13:09 UTC
---
Excellent! Thank you very much. I feel relieved that I seem to have picked the best free option. 😄
– Alex Schroeder 2012-08-03 14:09 UTC