(not the Python structure)
This is specific to France and completely out of the blue, but wtf is wrong with the availability of dictionaries? You're constantly ordered to have perfect spelling unless you desire to be mocked, and lacking vocabulary is considered a mark of stupidity. Yet I just can't find public domain lexicographical data that is properly kept up to date by disinterested parties, and I have been outraged about that for a few minutes.
You may wonder where this sudden obsession for bookends comes from. Lost on Wikipedia, I found the term "épouse poitrinaire", where épouse means spouse, and I did not know the definition of poitrinaire. Poitrine means chest but can also refer to a woman's breasts, so I started to wonder if Dostoevsky had some kind of special relationship with a particularly voluptuous person. I typed the word into DuckDuckGo to dispel the mystery, then clicked on the Wiktionary result: the actual meaning here was that the spouse was tubercular.
This simple search quickly left me confused: why does looking up a word mean sending a request to a search engine, then deciphering URLs to find familiar websites? What are the ways to find the definition of a word?
Now what exactly does "downloading a fine dictionary" entail?
Totally oblivious to the fact that there are different dictionary types for different purposes, I start looking for some french.db on the Web. Quickly it becomes apparent that focusing on a specific type is required.
Most dictionaries are the property of private companies (Larousse, Robert, Hachette). I can get why it makes some sense from the purely logistical perspective of ensuring formatting, printing and distribution, and hiring people to keep up with the new entries in our vocabulary such as *checks notes* "lol" and "clavardage". I don't care what companies think about words and their definitions though, except maybe the Littré.
Le Littré is a dictionary written during the 19th century and now in the public domain, but good luck finding a decent data source for it. Wandering in the ghost town of Google Code, I stumbled into "dictionnaire-le-littre", a GPLv3 program which sounded promising, but alas the actual dictionary data is stored as completely opaque binary blobs. Its acknowledgments mention "XMLittré" and a website, though. Of course the website is dead. Fortunately the Wayback Machine (WBM) saves the day and lets us download a StarDict dictionary. StarDict is another piece of FOSS, but one that is still maintained, which leads me to… Debian packages, where a stardict-xmlittre package created in 2006 is still available for Debian Buster; I'll come back to that later.
There are dictionaries built by various public institutions as well, funded by the public: le dictionnaire de l'Académie Française and le Trésor de la Langue Française.
I could not care less about what the Académie Française has to say about French. If you do not know who they are, it's a bunch of extremely expensive turbo-boomers cosplaying as nobles and having naps in sumptuous historical buildings. They wander around with swords (but not the women, with a few exceptions) and call themselves "The Immortals". They manage to embody absolutely all the wrong approaches to language possible. They also explicitly forbid scraping in their user agreements. Idiots.
If you want to judge how they look…
The other one is more interesting: le Trésor de la Langue Française, literally the Treasure of the French Language. It is a massive dictionary created roughly 50 years ago, and it does not really try to pinpoint definitions but rather provides extensive references on the different usages of words, which is a way of seeing language that I like more. In the early 90's, research groups spent some time typing its content into computers, named the result TLFi, put horrible Web forms around it to make it public, and for a very brief time even delivered it on expensive CDs as well.
2004 paper "Le TLFi ou Trésor de la Langue Française informatisé"
Both websites make you wonder if your Web browser suddenly routed your traffic through the Wayback Machine. The CNRTL website has been left untouched for almost 10 years now, and the ATILF website dumps Windows paths if you start tinkering with its URLs; it feels like a miracle that both are still accessible. They both deliver the exact same content though, so there probably is a common database behind them. This entire project is the work of publicly funded research groups, so I expected to be able to get some kind of access to the data, and sent a mail to politely request it, in any format they saw fit.
A few hours later I received a cordial response kindly asking me to get lost as the data is not public. I refrained from asking how he dared. It was too late anyway: inspired by the prospect of pirating a literal treasure, I had written a scraper and started to dump all lexical forms and their definitions, getting temporarily IP-banned in the process, thinking I would deliver them on Gemini later. As it is almost 100% illegal and I'm not sure I want to bring this shit here so soon, I kinda cooled down but let the little Pythons amass vast amounts of <span>s for the sake of hoarding^W^W^W^Warchiving public goods.
At some point I ended up on Debian Packages and saw the "similar packages" list on the side. It turns out that Debian has several French dictionaries that can be used for other things than constructing bruteforce wordlists. Here are a few, I'm probably missing some:
One of them has a peculiar name: "le-dico-de-rene-cougnenc", literally "the-dict-of-rene-cougnenc". Who names packages like that? The package information tells us that it contains a lot of French words (no definitions though), but also that it is a collective effort:
This list has been carefully elaborated by a team of French BBS users and put in the public domain in accented ASCII format either using the IBM MS/DOS charset or the ISO-8859-1 charset for other systems.
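The choice of charset matters in practice: the same bytes decode to different letters depending on which one you pick. Here is a minimal sketch in Python, assuming a one-word-per-line file. The sample bytes are made up for illustration and not taken from the package, and cp437 is only my guess at what "the IBM MS/DOS charset" means (it could just as well be cp850).

```python
def load_wordlist(raw: bytes, encoding: str = "iso-8859-1") -> set[str]:
    """Decode a one-word-per-line list and return the words as a set."""
    text = raw.decode(encoding)
    # strip() also removes a trailing "\r" left over from DOS line endings
    return {line.strip() for line in text.split("\n") if line.strip()}


# Hypothetical sample standing in for the real file (ISO-8859-1 bytes).
sample_latin1 = "épouse\npoitrinaire\ntrésor\n".encode("iso-8859-1")
words = load_wordlist(sample_latin1)
print("poitrinaire" in words)

# The same words stored with a DOS code page (assumed here to be cp437)
# need the matching codec, otherwise the accents come out wrong.
sample_dos = "épouse\r\ntrésor\r\n".encode("cp437")
dos_words = load_wordlist(sample_dos, encoding="cp437")
wrongly_decoded = load_wordlist(sample_dos, encoding="iso-8859-1")
print(dos_words == {"épouse", "trésor"}, "épouse" in wrongly_decoded)
```

With a real file you would pass something like `Path("/some/wordlist").read_bytes()` (a placeholder path), then test membership with `in`, which is about all a definition-less list is good for.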
Turns out René Cougnenc was a nerd from the very early days of the Web, with a taste for engaging in the various Usenet battles of yore. Sadly, or fortunately, he died in 1996 so could not see what the Web would become. I stumbled on a page in his memory, introduced with a quote that will make all French Geminauts smile:
Linux, Usenet et quelques autres lui doivent beaucoup. René avait traduit les livres sur Linux de Welch et Kirch et gérait le BBS renux longtemps avant qu'Internet devienne un sujet à la mode. Il a toujours fermement défendu Usenet lors des incursions des cyber-blaireaux et il le faisait avec humour et vigueur (c'est un euphémisme). — Stéphane Bortzmeyer
À la mémoire de René Cougnenc
Tiens, salut Stéphane !
Translated by me:
Linux, Usenet and some others owe him much. René had translated Welch and Kirch's books on Linux and managed the renux BBS a long time before the Internet went mainstream. He always firmly defended Usenet against cyber-blockheads and he did it with humor and vigor (that's an understatement).
This page is a fantastic window into what the early people of the Web were thinking about the possible future evolutions of their platform: something cool or an utter disgrace. He probably would have been very interested to see Gemini come up roughly 25 years later, because we sure as hell know the answer. I will try to find the time to translate the whole page later because it has value for people around here.
What the hell have I been doing in the last 24 hours? Getting back to my senses, I download StarDict and the XMLittré package from APT. It works great. Maybe Wiktionary is the way to go for online research?
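For the curious, a StarDict dictionary is not an opaque blob. As far as I understand the format, the .idx file is a sequence of entries, each a NUL-terminated UTF-8 headword followed by two big-endian 32-bit integers: the offset and the size of the definition in the .dict file. Below is a rough sketch under several assumptions: 32-bit offsets, an uncompressed .dict (real packages often ship a compressed .dict.dz) and plain-text definitions. The index and data are synthetic, built in the code itself, not read from the XMLittré package.

```python
import struct


def parse_idx(data: bytes) -> dict[str, tuple[int, int]]:
    """Map each headword to its (offset, size) in the .dict data."""
    entries = {}
    pos = 0
    while pos < len(data):
        end = data.index(b"\x00", pos)           # headword ends at the NUL byte
        word = data[pos:end].decode("utf-8")
        # two unsigned 32-bit big-endian integers follow the NUL byte
        offset, size = struct.unpack_from(">II", data, end + 1)
        entries[word] = (offset, size)
        pos = end + 1 + 8                        # skip NUL + 2 * 4 bytes
    return entries


def lookup(word: str, index: dict[str, tuple[int, int]], dict_data: bytes) -> str:
    """Slice the definition out of the .dict data (assumed plain UTF-8 text)."""
    offset, size = index[word]
    return dict_data[offset:offset + size].decode("utf-8")


# Synthetic two-entry dictionary, for illustration only.
dict_data = b"tubercular" + b"treasure"
idx_data = (
    "poitrinaire".encode("utf-8") + b"\x00" + struct.pack(">II", 0, 10)
    + "trésor".encode("utf-8") + b"\x00" + struct.pack(">II", 10, 8)
)

index = parse_idx(idx_data)
print(lookup("poitrinaire", index, dict_data))
```

The companion .ifo file says how to interpret each definition, so a real reader should check it before assuming plain text.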
Anyway, thanks for reading ♥
I translated the René Cougnenc memorial page for Geminauts!
Laërte and Mjollna gave me a few additional interesting resources. Other dictionaries:
Grand dictionnaire universel du XIXe siècle
Nenufar, Le Petit Larousse Illustré
Dictionnaire électronique des mots
The first link is another old public domain dictionary on Wikisource, but is still incomplete.
The second is nice and talks about free licences; sadly they say that XML sources will be provided "soon" on Ortolang, which in academic slang means in ten thousand years. Should send them a smol mail…
Ortolang is a messy dump of resources. Hope some thick CC0 XMLs can get dropped there one day!
The DEM is a nice dictionary available in different formats. It is great work, but mind that definitions are very laconic.
Yes this is a thing now, just… hit me up. 🤫