💾 Archived View for stack.tilde.cafe › gemlog › 2022-08-18.scraping.gmi captured on 2024-07-08 at 23:58:41. Gemini links have been rewritten to link to archived content


Scraping for wood

A couple of days ago I went off on a tangent. I noticed that a bunch of SpellBinding words ending in -WOOD were missing from the dictionary. I put together a list of 87 words ending in WOOD, a list that expanded rapidly to 207 words.

gemini://gemini.ctrl-c.club/~stack/gemlog/2022-08-16.wood.gmi

The process was painful: there are no reverse dictionaries on the net! The ones that do exist are semantically reverse, not alphabetically reverse from the last letter to the first. It took me hours to collect all the WOOD words.

Throughout the fiasco, I kept thinking that, since Merriam-Webster is my dictionary of choice, it would be sensible to know what words are in it, and therefore to be able to find all the words ending in -WOOD, for instance.

It's been years since I've scraped anything, and it seemed daunting to scrape the entire Merriam-Webster dictionary. I've thought about it before, and looked at the API (which does not quite do it for me). But in reality there isn't that much data, since I don't really care about the definitions -- just the words themselves... Scraping the index pages is enough.

And so I wrote some Lisp this morning, and sucked in about 1100 HTML pages containing 338729 words/phrases and the URLs to the definitions. I was trying to be respectful, so I spread it out over a few hours.
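
For illustration only, here is a minimal sketch of a throttled download loop, in Python rather than the Lisp used for the real thing. The function name fetch_all, the injected fetch callable and the 10-second delay are all assumptions made for this example; at 10 seconds per page, 1100 pages take about three hours, which is in the spirit of "a few hours" but is not the figure actually used.

```python
import time

def fetch_all(urls, fetch, delay_seconds=10.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests to go easy on the server.

    fetch is any callable taking a URL and returning the page body; it is
    injected so the loop itself makes no assumption about the HTTP library.
    delay_seconds is an illustrative value, not one taken from the post.
    """
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # no pause needed before the first request
        pages[url] = fetch(url)
    return pages
```

Passing the fetcher in as an argument keeps the pacing logic separate from the network code, so the loop can be tried out with a fake fetcher before pointing it at a real site.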

The data is really interesting, because related words (such as "run", "runs", "ran", "running") all point at the same URL ("run" in this case). This provides quite a bit of metadata, allowing me, for instance, to eliminate all third-person -S words immediately!
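
A small sketch of how that shared-URL metadata could be used, again in Python and not the original Lisp. The sample pairs, the URL layout and the helper names (headword, group_by_headword, drop_s_inflections) are hypothetical; the only thing taken from the scrape is the idea that inflected forms point at the URL of their base word.

```python
from collections import defaultdict

# Hypothetical sample of scraped (word, url) pairs; the URL layout is made up.
entries = [
    ("run", "/dictionary/run"),
    ("runs", "/dictionary/run"),
    ("ran", "/dictionary/run"),
    ("running", "/dictionary/run"),
    ("bus", "/dictionary/bus"),
    ("buses", "/dictionary/bus"),
]

def headword(url):
    """Assume the last path segment of the URL is the base word."""
    return url.rstrip("/").rsplit("/", 1)[-1]

def group_by_headword(entries):
    """Collect every listed form under the headword its URL points at."""
    groups = defaultdict(list)
    for word, url in entries:
        groups[headword(url)].append(word)
    return dict(groups)

def drop_s_inflections(entries):
    """Keep a word unless it ends in -s and is filed under a different headword."""
    return [w for w, url in entries
            if not (w.endswith("s") and w != headword(url))]
```

With the sample data, "runs" and "buses" are dropped while "bus" survives, because "bus" is its own headword. Real data would surely have edge cases this simple rule misses.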

In fact, it took me only a few minutes to filter the dictionary for SpellBinding-compatible words (removing third-person -S and plural -S, and all short words, and all words with more than 7 distinct letters, and all capitalized words, and all phrases), leaving around 90000 words (double my present dictionary). However, this dictionary is not too clean, as it has many ridiculous Scottish words and British-only words, etc.
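
The per-word part of that filter can be sketched as a single predicate (Python, illustrative only). The minimum length of 4 is an assumption, since no cutoff for "short" is given here, and rejecting anything that is not plain lowercase a-z (hyphens, apostrophes, accents) is also an assumption that goes a bit beyond "phrases". The -S removal needs the URL metadata and is not covered by this function.

```python
def is_candidate(word, min_length=4, max_distinct=7):
    """Return True if a word passes the length, case and letter-count rules.

    min_length is an assumed cutoff for "short words"; max_distinct=7 is the
    limit on distinct letters mentioned in the text.
    """
    # Phrases contain spaces; this also rejects hyphens, apostrophes, accents.
    if not (word.isascii() and word.isalpha()):
        return False
    # Capitalized entries (proper nouns and the like) are excluded.
    if not word.islower():
        return False
    if len(word) < min_length:
        return False
    return len(set(word)) <= max_distinct
```

For example, "driftwood" passes because it has exactly 7 distinct letters, while "blackwood" has 8 and is rejected.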

But it would be interesting for instance to run it against the SpellBinding dictionary and see what SpellBinding words are not in Merriam-Webster. Maybe I will find a few errors that way.
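
That comparison is essentially a set difference. A minimal Python sketch, where the function name missing_from and the case-insensitive matching are choices made for this example:

```python
def missing_from(reference, candidates):
    """Return the candidate words that do not appear in the reference list.

    Comparison is case-insensitive (an assumption); output is sorted so the
    result is stable and easy to scan by eye.
    """
    known = {w.lower() for w in reference}
    return sorted(w for w in set(candidates) if w.lower() not in known)
```

Calling it with the Merriam-Webster list as the reference and the SpellBinding list as the candidates would give the SpellBinding words worth a second look.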

I also have the word index, which would allow me to build a reverse dictionary, or a soundex-based inquiry system for the words. I could build a simple Gemini dictionary gateway, forward or reverse, for instance.
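
One common way to get an alphabetically reverse dictionary is to store every word spelled backwards and keep that list sorted, so that a suffix query becomes a prefix query. The sketch below (Python, with hypothetical function names) shows the idea; it is not the gateway itself.

```python
import bisect

def build_reverse_index(words):
    """Sort the words by their reversed spelling (duplicates removed)."""
    return sorted(w[::-1] for w in set(words))

def words_ending_in(index, suffix):
    """Look up all words with the given suffix using a binary search.

    A suffix of the word is a prefix of the reversed word, and all entries
    sharing a prefix sit next to each other in the sorted index.
    """
    key = suffix[::-1]
    i = bisect.bisect_left(index, key)
    found = []
    while i < len(index) and index[i].startswith(key):
        found.append(index[i][::-1])  # flip back to normal spelling
        i += 1
    return sorted(found)
```

For instance, with index = build_reverse_index(["wood", "driftwood", "woods", "good", "firewood", "wooden"]), words_ending_in(index, "wood") returns ["driftwood", "firewood", "wood"]. For a one-off query a plain scan with endswith is just as good; the sorted index only pays off when many lookups are served, as a gateway would.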

Funny how some things I dread turn out to be really easy... Reminds me of equipping a compiler with polymorphic caches. I thought it would be really hard, but then I did it and it was much easier than so many things I've done before...

And there it was. I pulled 191 -WOOD words that work with SpellBinding with a simple line of Lisp code, and feel like I am done with this for now.
