
Proofreading the Iliad

On my website, you'll see that I recently added a "books" section, where I've formatted a couple of nice (public domain) texts into HTML. One of these is a translation of the Iliad by Andrew Lang, Walter Leaf, and Ernest Myers from 1883, using the text from Project Gutenberg. I noticed a couple of sentences missing in this digitisation, but a scan of the physical book on the Internet Archive did have those sentences. I sent an email to the PG help address, and someone looked into it and realised that an abridged version of the translation was also published, so the digital copy must have been based on that. Together we agreed to work through the existing document and add in the missing bits.

the ebook in question

Having worked through some of it - I've started from book 13 - we both noticed that the existing digitisation isn't even the abridged version; it's in between the two, so presumably it was just badly proofread to begin with (eek).

Anyway, I've had a few thoughts about the best way to work through such a lot of text. The other volunteer working on this is more experienced, and their workflow is to read both copies at the same time, and add text when they encounter a difference between the two. This is nice because they also get to enjoy reading the epic! But it's a bit slow, and I found it difficult to keep track of both places as I read. So, I tried a word diff method:

Take the optical character recognition (OCR) scan of the original book and put every word on its own line;
Do the same with the Project Gutenberg copy, and also delete any HTML <tag>s with sed;
For both copies, delete blank lines;
Now make your way through the diff of the two files.
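
The steps above could be sketched in Python roughly as follows. This is not the sed-and-diff pipeline itself, just an equivalent built on the standard library's difflib; the function names (words, word_diff) and the sample strings are made up for illustration, and the tag-stripping regex is a crude assumption that only suits simple markup.

```python
import difflib
import re

# Crude HTML tag matcher (an assumption: fine for simple markup, not a real parser).
TAG_RE = re.compile(r"<[^>]+>")


def words(text, strip_html=False):
    """Return the words of `text` as a list, one word per item."""
    if strip_html:
        text = TAG_RE.sub(" ", text)
    # Splitting on whitespace also takes care of blank lines.
    return text.split()


def word_diff(ocr_text, pg_html, context=3):
    """Unified word-by-word diff from the Gutenberg copy to the OCR scan.

    Lines starting with "+" are words present in the scan but missing
    from the Gutenberg copy.
    """
    return list(difflib.unified_diff(
        words(pg_html, strip_html=True),
        words(ocr_text),
        fromfile="gutenberg",
        tofile="scan",
        n=context,
        lineterm="",
    ))


if __name__ == "__main__":
    # Made-up sample text, not taken from the translation.
    scan = "the quick brown fox jumps over the lazy dog"
    ebook = "<p>the quick jumps over the <i>lazy</i> dog</p>"
    for line in word_diff(scan, ebook):
        print(line)
```

Writing the result to a file instead of printing it gives something that can be opened in a text editor and trimmed down as corrections are made.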

This is much faster. I wrote the diff into a file which I can browse through with my text editor, which means that I can delete bits I've already corrected, and I also get some syntax highlighting, which is nice (when I get to a sentence of missing text there are about 20 lines of green). It's not perfect, though, mostly because of the inaccuracies in the OCR copy: words which don't have enough contrast against the page don't make it into the scan, some letters are jumbled, and at page breaks there are odd artifacts from the page numbers. Most of these are easy to spot because of the word diff.

I'm only doing the second half of the 24 books, and the word diff is just under 55,000 lines (unified diff with 3 lines of context). This is quite big - there are around 170,000 words in the whole translation, and the abridged edition is about 60 pages (8%) shorter than the revised edition. Maybe that's about right - half of 170,000 is 85,000, and 8% of that is 6,800, which multiplied by 6 for the context lines either side is about 41,000. Eh.
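
The same back-of-the-envelope estimate, written out as a few lines of Python. The variable names are made up, and the model (every missing word bringing its own 3 lines of context on each side) is just the rough assumption used in the paragraph above, not a property of how diff actually groups changes.

```python
total_words = 170_000   # words in the whole translation (approximate)
shorter_percent = 8     # abridged edition is about 8% shorter
context_lines = 3       # unified diff context on each side

half_words = total_words // 2                        # only books 13-24
missing_words = half_words * shorter_percent // 100  # words expected to be missing
estimated_lines = missing_words * 2 * context_lines  # rough diff size

print(missing_words, estimated_lines)  # 6800 40800
```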

This method feels quicker, but I don't get to read the text in a flow. I haven't read this translation before, and there's no rush to get the corrections done since the ebook has been up since 2001. Maybe I would enjoy the first method more. I'm new to proofreading!

written 2022-01-07
