2022-01-30 German has no ligatures, sometimes

My wife is telling me about politics because we have upcoming elections here in Switzerland. The anti-vaxxers are being used by the far right to sabotage the media support law.

In German, we smash words together, like Dorf (village) + Jugend (youth) = Dorfjugend (the young people in the village). Software then adds the fj ligature:

Dorfjugend with fj ligature

I had never thought about this before, but in this context the ligature between f and j is a bit weird. It’s not Dor + fjugend. This ligature only makes sense for words like Fjord (the village fjord: Dorffjord), in my mind. I wonder where this comes from: is this a font issue? Should my authoring environment allow me to pick this? Should the environment use a dictionary? 🤔

As it turns out, after a bit of back and forth on Mastodon, my hunch is correct: the ligature doesn’t belong there.

Ligatures crossing the morpheme boundary of a composite word are sometimes considered incorrect, especially in official German orthography – Ligature, on Wikipedia
Where two components of a compound word meet … [they] must not be joined by ligatures in German. The same applies to inflectional endings such as -lich, -lisch, -los, -lein, -te, -ten etc., which are always attached without a ligature. In all other languages there are no such restrictions. – Ligaturen (translated from German)

Ligature, on Wikipedia

Ligaturen

The solution seems to be to manually (!) insert “U+200C ZERO WIDTH NON-JOINER”. I’m not sure how this interacts with automatic hyphenation. I hope the hyphenation doesn’t get confused by the non-joiner.

OK. I’m sure this can be scripted.
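For the suffix rule quoted above, a minimal sketch could look like the following. The helper name `split_suffix_ligatures`, the choice of suffixes (only -lich, -lisch, -los and -lein, since -te and -ten would also hit words like Hefte) and the optional endings -e, -em, -en, -er, -es are all my assumptions, not an established rule set.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use feature 'say';
use open ':std', ':encoding(UTF-8)';

my $zwnj = "\x{200c}";

# Hypothetical helper: put a ZWNJ between an f and a following suffix
# -lich, -lisch, -los or -lein (optionally inflected) at the end of a word.
sub split_suffix_ligatures {
  my ($text) = @_;
  # the lookahead only checks the suffix, it does not consume it
  $text =~ s/f(?=(?:lich|lisch|los|lein)(?:e[mnrs]?)?\b)/f$zwnj/g;
  return $text;
}

# Pflicht is left alone because "licht" is not one of the suffixes
say split_suffix_ligatures("höflich, hilflosen, Schäflein, Pflicht");
```

This only covers suffixes, not compound words such as Dorfjugend, which need a word list.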

(Edit: see keine-ligaturen for my current solution! The rest of this page is only for entertainment… and Perl.)

keine-ligaturen

I thought, wouldn’t it be nice to write a Cosmopolitan C program to do this?

Cosmopolitan makes C a build-once run-anywhere language, similar to Java, except it doesn’t require interpreters or virtual machines be installed beforehand. Cosmo provides the same portability benefits as high-level languages like Go and Rust, but it doesn’t invent a new language and you won’t need to configure a CI system to build separate binaries for each operating system. – Cosmopolitan

Cosmopolitan

OK, there are some interesting limitations when writing such programs, but sadly my C skills are near zero, so that doesn’t help. I gave up and wrote some Perl code instead.

This would insert the zero width non-joiner (ZWNJ) between “Dorf” and anything that follows if it starts with f, l, i, j.

#!/usr/bin/env perl
use Modern::Perl;
use open ':std', ':encoding(UTF-8)';
use utf8;

my $zwnj = "\x{200c}";

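# one regex per non-empty line of the DATA section, each with two capture groups
my @re = grep { length } map { chomp; $_ } <DATA>;
# read line by line and put the ZWNJ between the two groups wherever a regex matches
while (my $line = <STDIN>) {
  for my $re (@re) {
    $line =~ s/\b$re/$1$zwnj$2/gi;
  }
  print $line;
}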

__DATA__
(dorf)([flij])

It works if I try it like this:

echo Dorfjugend | ./keine-ligaturen > test.txt
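Since the ZWNJ is invisible, it’s hard to tell from looking at test.txt whether anything happened. A small sketch to make it visible, assuming a hypothetical helper called `show_zwnj` and a marker text of my own choosing:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
use open ':std', ':encoding(UTF-8)';

# Hypothetical helper: replace every ZWNJ with a visible marker.
sub show_zwnj {
  my ($text) = @_;
  $text =~ s/\x{200c}/<ZWNJ>/g;
  return $text;
}

say show_zwnj("Dorf\x{200c}jugend");  # prints Dorf<ZWNJ>jugend
```

To check a file, the same substitution could be applied to every line read from STDIN, e.g. `print show_zwnj($_) while <STDIN>;` at the end of the script.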

OK, so now I need to create the data… Another script to process a dictionary!

This is the output I want:

./keine-ligaturen-liste < /usr/share/dict/swiss
(Ablauf)(folge)
(Ablauf)(folgen)
(Ablauf)(fähigkeit)
(Ablauf)(fähigkeiten)
(Ablauf)(leitung)
(Ablauf)(leitungen)
…

Well, I guess other ways of doing it would be possible, too… But let’s start with this as our requirement.

#!/usr/bin/env perl
use Modern::Perl;
use open ':std', ':encoding(UTF-8)';
use utf8;

# takes a wordlist on STDIN, e.g. /usr/share/dict/swiss

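my @words;
my @ends_in_f;
while (<STDIN>) {
  chomp;
  push(@words, $_);
  push(@ends_in_f, $_) if /f$/;
}

# for every word ending in f, find the longer words that continue with f, l, i or j
for my $first_word (@ends_in_f) {
  for (@words) {
    # \Q and \E match the first word literally, even if it contains regex metacharacters
    say "($first_word)($1)" if /^\Q$first_word\E([flij]..+)/;
  }
}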

Now I can use the second script to append the necessary data to the first script. It’s not fast, that’s for sure…

./keine-ligaturen-liste < /usr/share/dict/swiss >> keine-ligaturen

Can confirm that the fans on this laptop are working!! 🌬️ 💨 😬
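The slowness comes from comparing every word ending in f with every other word. A sketch of a faster variant, assuming it’s OK to keep the whole word list in a hash: for each word, look at every f followed by f, l, i or j, and check whether the part up to that f is itself a word. The helper name `compound_pairs` is made up for this example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: takes a list of words and returns "(first)(rest)" pairs
# where first is itself in the list and ends in f, and rest starts with
# f, l, i or j and is at least three characters long.
sub compound_pairs {
  my @words = @_;
  my %is_word = map { $_ => 1 } @words;
  my @pairs;
  for my $word (@words) {
    # visit every f that is followed by f, l, i or j and two more characters
    while ($word =~ /f(?=[flij]..)/g) {
      my $first = substr($word, 0, pos($word));
      my $rest = substr($word, pos($word));
      push @pairs, "($first)($rest)" if $is_word{$first};
    }
  }
  return @pairs;
}

say for compound_pairs(qw(Ablauf Ablauffolge Ablaufleitung Dorf Dorfplatz));
```

With the five sample words this prints (Ablauf)(folge) and (Ablauf)(leitung). For a real word list, the words would be read from STDIN and chomped first.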

Of course it turns out that the wordlist does not contain “Dorfjugend”. I am disappointed.

Also, I’m wondering about these rules… Perhaps replacing the second word with `[flij]..` would be enough?

Much faster, too!

#!/usr/bin/env perl
use Modern::Perl;
use open ':std', ':encoding(UTF-8)';
use utf8;

# takes a sorted wordlist on STDIN, e.g. /usr/share/dict/swiss

my %seen;
my $last = "\0";

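while (<STDIN>) {
  chomp;
  # remember the most recent word ending in f
  $last = $_ if /f$/;
  # \Q and \E match the word literally and make sure [flij] is read as a
  # character class, not as an index into an array called @last
  $seen{$last} = 1 if /^\Q$last\E[flij]../;
}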

say "($_)([flij]..)" for sort keys %seen;
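With patterns of this shape, the first script wouldn’t have to loop over thousands of separate regular expressions for every line. A sketch of the alternative, assuming all first words are combined into a single alternation; `build_regex` and `add_zwnj` are names I made up for this example, and I haven’t measured how much it helps.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';

my $zwnj = "\x{200c}";

# Hypothetical helper: build one case-insensitive regex from a list of first words.
sub build_regex {
  my @first_words = @_;
  # longer words first, so they are tried before shorter ones with the same start
  my $alternation = join '|', map { quotemeta }
    sort { length($b) <=> length($a) } @first_words;
  # the lookahead checks for f, l, i or j plus two more characters without consuming them
  return qr/\b($alternation)(?=[flij]..)/i;
}

# Hypothetical helper: insert a ZWNJ after every first word matched by the regex.
sub add_zwnj {
  my ($text, $re) = @_;
  $text =~ s/$re/$1$zwnj/g;
  return $text;
}

my $re = build_regex(qw(Dorf Ablauf));
say add_zwnj("Dorfjugend", $re) eq "Dorf${zwnj}jugend" ? "ok" : "not ok";  # prints ok
```

The first words would come from the DATA section or a separate file instead of the hard-coded list.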

I’m still disappointed in the word list, though. No Dorffest?

​#Perl ​#German

Comments


Alternatives: selnonig has German patterns, but they need conversion… rmligs doesn’t seem to know about auftauchen, auftreten, auftragen, and… Dorfjugend.

selnonig

rmligs

– Alex 2022-01-30 21:36 UTC

---

I’m using selnonig, now!

using selnonig

– Alex 2022-01-31 23:05 UTC