My wife is telling me about politics because we have upcoming elections here in Switzerland. The anti-vaxxers are being used by the far right to sabotage the media support law.
In German, we smash words together: Dorf (village) + Jugend (youth) = Dorfjugend (the young people in the village). Software then adds the fj ligature.
I had never thought about this before, but in this context the ligature between f and j is a bit weird. It’s not Dor + fjugend. This ligature only makes sense for words like Fjord (the village fjord: Dorffjord), in my mind. I wonder where this comes from: is this a font issue? Should my authoring environment allow me to pick this? Should the environment use a dictionary? 🤔
As it turns out, after a bit of back and forth on Mastodon, my hunch is correct: the ligature does not belong there.
Ligatures crossing the morpheme boundary of a composite word are sometimes considered incorrect, especially in official German orthography – Ligature, on Wikipedia
Where two components of a compound word meet … [these] must not be joined by ligatures in German. The same applies to inflectional endings such as -lich, -lisch, -los, -lein, -te, -ten etc., which are always attached without a ligature. In all other languages there are no such restrictions. – Ligaturen (translated from German)
The solution seems to be to manually (!) insert “U+200C ZERO WIDTH NON-JOINER”. I’m not sure how this interacts with automatic hyphenation. I hope the hyphenation doesn’t get confused by the non-joiner.
OK. I’m sure this can be scripted.
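For a single word, the manual step boils down to something like the following minimal Perl sketch. The helper name `insert_zwnj` is made up for illustration, and it assumes I already know where the morpheme boundary is (which is the hard part).

```perl
use strict;
use warnings;

# U+200C ZERO WIDTH NON-JOINER
my $zwnj = "\x{200C}";

# insert_zwnj is a hypothetical helper: if $word starts with $prefix,
# put the non-joiner right after the prefix; otherwise leave $word alone.
sub insert_zwnj {
    my ($word, $prefix) = @_;
    return $word unless index($word, $prefix) == 0;
    return $prefix . $zwnj . substr($word, length($prefix));
}

my $fixed = insert_zwnj("Dorfjugend", "Dorf");
printf "%d characters, ZWNJ at index %d\n",
    length($fixed), index($fixed, $zwnj);
# prints "11 characters, ZWNJ at index 4"
```

The word looks the same on screen but is one character longer, which is exactly why I wonder what hyphenation makes of it.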
(Edit: see keine-ligaturen for my current solution! The rest of this page is only for entertainment… and Perl.)
I thought, wouldn’t it be nice to write a Cosmopolitan C program to do this?
Cosmopolitan makes C a build-once run-anywhere language, similar to Java, except it doesn’t require interpreters or virtual machines be installed beforehand. Cosmo provides the same portability benefits as high-level languages like Go and Rust, but it doesn’t invent a new language and you won’t need to configure a CI system to build separate binaries for each operating system. – Cosmopolitan
OK, there are some interesting limitations when writing such programs, but sadly my C is near zero, so that doesn’t help. I gave up and wrote some Perl code instead.
This inserts a zero width non-joiner (ZWNJ) between “Dorf” and whatever follows, provided the next letter is f, l, i, or j.
    #!/usr/bin/env perl
    use Modern::Perl;
    use open ':std', ':encoding(UTF-8)';
    use utf8;

    my $zwnj = "\x{200c}";    # U+200C ZERO WIDTH NON-JOINER
    # one pattern per line after __DATA__, each with two capture groups
    my @re = map { chomp; $_ } <DATA>;
    while (my $line = <STDIN>) {
      for my $re (@re) {
        $line =~ s/\b$re/$1$zwnj$2/gi;
      }
      print $line;
    }
    __DATA__
    (dorf)([flij])
It works if I try it like this:
    echo Dorfjugend | ./keine-ligaturen > test.txt
OK, so now I need to create the data… Another script to process a dictionary!
This is the output I want:
    ./keine-ligaturen-liste < /usr/share/dict/swiss
    (Ablauf)(folge)
    (Ablauf)(folgen)
    (Ablauf)(fähigkeit)
    (Ablauf)(fähigkeiten)
    (Ablauf)(leitung)
    (Ablauf)(leitungen)
    …
Well, I guess other ways of doing it would be possible, too… But let’s start with this as our requirements.
    #!/usr/bin/env perl
    use Modern::Perl;
    use open ':std', ':encoding(UTF-8)';
    use utf8;

    # takes a wordlist on STDIN, e.g. /usr/share/dict/swiss
    my @words;
    my @ends_in_f;
    while (<STDIN>) {
      chomp;
      push(@words, $_);
      push(@ends_in_f, $_) if /f$/;
    }
    # quadratic: every word ending in f is tried as a prefix of every word
    for my $first_word (@ends_in_f) {
      for (@words) {
        # \Q…\E keeps any regex metacharacters in the word list literal
        say "($first_word)($1)" if /^\Q$first_word\E([flij]..+)/;
      }
    }
Now I can use the second script to append the necessary data to the first script. It’s not fast, that’s for sure…
    ./keine-ligaturen-liste < /usr/share/dict/swiss >> keine-ligaturen
Can confirm that the fans on this laptop are working!! 🌬️ 💨 😬
Of course it turns out that the wordlist does not contain “Dorfjugend”. I am disappointed.
Also, I’m wondering about these rules… Perhaps replacing the second word with `[flij]..` would be enough?
Much faster, too!
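    #!/usr/bin/env perl
    use Modern::Perl;
    use open ':std', ':encoding(UTF-8)';
    use utf8;

    # takes a sorted wordlist on STDIN, e.g. /usr/share/dict/swiss
    my %seen;
    my $last = "\0";
    while (<STDIN>) {
      chomp;
      # remember the most recent word ending in f
      $last = $_ if /f$/;
      # \Q…\E stops Perl from reading $last[flij] as an array element
      # and keeps metacharacters in the word literal
      $seen{$last} = 1 if /^\Q$last\E[flij]../;
    }
    say "($_)([flij]..)" for sort keys %seen;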
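To see what these shorter patterns do once they are appended to the first script, here is a small sketch. The pattern list is a hand-written sample and purely an assumption: I have not checked that “Dorf” or “Ablauf” end up in the generated output. The name `break_ligatures` is made up for illustration as well.

```perl
use strict;
use warnings;

my $zwnj = "\x{200C}";    # U+200C ZERO WIDTH NON-JOINER

# Hypothetical sample of generated pattern lines (not real script output).
my @patterns = ('(Dorf)([flij]..)', '(Ablauf)([flij]..)');

# Apply every pattern the same way the first script does.
sub break_ligatures {
    my ($text) = @_;
    for my $re (@patterns) {
        $text =~ s/\b$re/$1$zwnj$2/gi;
    }
    return $text;
}

# Make the invisible non-joiner visible as "|" for printing.
(my $visible = break_ligatures("Dorfjugend und Dorffest")) =~ s/$zwnj/|/g;
print "$visible\n";
# prints "Dorf|jugend und Dorf|fest"
```

With a generic second group, a compound only needs its first part to show up in the pattern list, so a word missing from the dictionary can still be handled as long as some other compound with the same first part is in there.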
I’m still disappointed in the word list, though. No Dorffest?
#Perl #German
⁂
Alternatives: the selnolig German patterns, which need conversion… And rmligs doesn’t seem to know about auftauchen, auftreten, auftragen, and … Dorfjugend.
– Alex 2022-01-30 21:36 UTC
---
I’m using selnolig now!
– Alex 2022-01-31 23:05 UTC