2019-01-27 Digraphs and Name Generation

I have a name generator that works like the old Elite name generator: it generates a string of digraphs (usually syllables). Take a look at the help page.

But now I wonder: is there a list of these for English? Is there an easy way to generate them from some text? Perhaps split the text into groups of consonants + vowels, leading vowels, and trailing consonants, and rank the groups by count? @gwmngilfen suggested starting with a, e, i, o, u, but since my first language is German, I knew that this wasn’t going to cut it.

Why is there no Unicode property to indicate vowels? For my own uses, I might go with all the vowels in Latin-1, I guess? @bkhl said “there are properties for vowels in particular writing systems, where that’s unambiguous, such as some Indic ones, but not for the Latin family, where the same character can be a vowel, a consonant, or both, depending on the language.”

OK, so given the list of Unicode names, how would I do it? With Emacs Lisp:

(let (names vowels)
  (map-charset-chars
   (lambda (range arg)
     (let ((from (car range))
	   (to (cdr range)))
       (dotimes (i (1+ (- to from)))     ; TO is inclusive
         (let ((c (+ from i)))           ; the code point itself, not the offset
           (setq names (cons (cons c (get-char-code-property c 'name)) names))))))
   'unicode)
  (dolist (name names)
    (when (and (cdr name) (string-match "^latin small .*\\b[aeiouy]\\b" (cdr name)))
      (setq vowels (cons (car name) vowels))))
  (mapconcat 'char-to-string vowels ""))

Result: a e i o u y à á â ã ä å è é ê ë ì í î ï ò ó ô õ ö ø ù ú û ü ý ÿ ā ă ą ē ĕ ė ę ě ĩ ī ĭ į ı ō ŏ ő ũ ū ŭ ů ű ų ŷ ơ ư ƴ ǎ ǐ ǒ ǔ ǖ ǘ ǚ ǜ ǝ ǟ ǡ ǫ ǭ ǻ ǿ ȁ ȃ ȅ ȇ ȉ ȋ ȍ ȏ ȕ ȗ ȧ ȩ ȫ ȭ ȯ ȱ ȳ ɇ ɏ ɐ ɔ ɘ ɛ ɜ ɝ ɞ ɨ ɵ ʉ ʎ ʚ ᴈ ᴉ ᴑ ᴒ ᴓ ᴖ ᴗ ᴝ ᴞ ᵻ ᵾ ᶏ ᶒ ᶓ ᶔ ᶖ ᶗ ᶙ ḁ ḕ ḗ ḙ ḛ ḝ ḭ ḯ ṍ ṏ ṑ ṓ ṳ ṵ ṷ ṹ ṻ ẏ ẙ ẚ ạ ả ấ ầ ẩ ẫ ậ ắ ằ ẳ ẵ ặ ẹ ẻ ẽ ế ề ể ễ ệ ỉ ị ọ ỏ ố ồ ổ ỗ ộ ớ ờ ở ỡ ợ ụ ủ ứ ừ ử ữ ự ỳ ỵ ỷ ỹ ỿ ⱥ ⱸ ⱺ ꝋ ꝍ ꬱ ꬲ ꬳ ꬴ ꬽ ꬾ ꬿ ꭃ ꭄ ꭎ ꭏ ꭒ ꭚ ꭡ

(let (names consonants)
  (map-charset-chars
   (lambda (range arg)
     (let ((from (car range))
	   (to (cdr range)))
       (dotimes (i (1+ (- to from)))     ; TO is inclusive
         (let ((c (+ from i)))           ; the code point itself, not the offset
           (setq names (cons (cons c (get-char-code-property c 'name)) names))))))
   'unicode)
  (dolist (name names)
    (when (and (cdr name) (string-match "^latin small .*\\b[^aeiouy ]\\b" (cdr name)))
      (setq consonants (cons (car name) consonants))))
  (mapconcat 'char-to-string consonants ""))

Result: b c d f g h j k l m n p q r s t v w x z ß ç ñ ć ĉ ċ č ď đ ĝ ğ ġ ģ ĥ ħ ĵ ķ ĺ ļ ľ ŀ ł ń ņ ň ŉ ŕ ŗ ř ś ŝ ş š ţ ť ŧ ŵ ź ż ž ſ ƀ ƃ ƈ ƌ ƒ ƙ ƚ ƞ ƥ ƫ ƭ ƶ ǥ ǧ ǩ ǰ ǵ ǹ ȑ ȓ ș ț ȟ ȡ ȥ ȴ ȵ ȶ ȷ ȼ ȿ ɀ ɉ ɋ ɍ ɓ ɕ ɖ ɗ ɟ ɠ ɡ ɥ ɦ ɫ ɬ ɭ ɯ ɰ ɱ ɲ ɳ ɹ ɺ ɻ ɼ ɽ ɾ ɿ ʂ ʄ ʇ ʈ ʋ ʌ ʍ ʐ ʑ ʝ ʞ ʠ ʮ ʯ ᴟ ᵬ ᵭ ᵮ ᵯ ᵰ ᵱ ᵲ ᵳ ᵴ ᵵ ᵶ ᵷ ᵹ ᵽ ᶀ ᶁ ᶂ ᶃ ᶄ ᶅ ᶆ ᶇ ᶈ ᶉ ᶊ ᶌ ᶍ ᶎ ᶑ ḃ ḅ ḇ ḉ ḋ ḍ ḏ ḑ ḓ ḟ ḡ ḣ ḥ ḧ ḩ ḫ ḱ ḳ ḵ ḷ ḹ ḻ ḽ ḿ ṁ ṃ ṅ ṇ ṉ ṋ ṕ ṗ ṙ ṛ ṝ ṟ ṡ ṣ ṥ ṧ ṩ ṫ ṭ ṯ ṱ ṽ ṿ ẁ ẃ ẅ ẇ ẉ ẋ ẍ ẑ ẓ ẕ ẖ ẗ ẘ ẛ ẜ ẝ ỻ ỽ ↄ ⱡ ⱦ ⱨ ⱪ ⱬ ⱱ ⱳ ⱴ ⱶ ⱹ ꜿ ꝁ ꝃ ꝅ ꝇ ꝉ ꝑ ꝓ ꝕ ꝗ ꝙ ꝛ ꝟ ꝣ ꝺ ꝼ ꝿ ꞁ ꞃ ꞅ ꞇ ꞎ ꞑ ꞓ ꞔ ꞕ ꞗ ꞙ ꞡ ꞣ ꞥ ꞧ ꞩ ꬱ ꬵ ꬶ ꬷ ꬸ ꬹ ꬺ ꬻ ꬼ ꭃ ꭄ ꭅ ꭇ ꭈ ꭉ ꭊ ꭋ ꭌ ꭖ ꭗ ꭘ ꭙ ﬅ
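The same name-based lookup can be sketched in Perl, since the core `charnames` module knows the Unicode names. This is only a sketch, restricted to Latin-1 (the range I said I might settle for). The `classify` helper is a made-up name, and the two regular expressions are the ones from the Emacs Lisp above, upper-cased because `charnames::viacode` returns upper-case names.

```perl
use strict;
use warnings;
use charnames ();    # core module; provides charnames::viacode

# Hypothetical helper: classify the code points $from..$to by their Unicode name.
sub classify {
    my ($from, $to) = @_;
    my (@vowels, @consonants);
    for my $cp ($from .. $to) {
        my $name = charnames::viacode($cp);
        next unless defined $name and $name =~ /^LATIN SMALL /;
        # A single-letter word in the name decides the class.
        push @vowels,     chr $cp if $name =~ /\b[AEIOUY]\b/;
        push @consonants, chr $cp if $name =~ /\b[^AEIOUY ]\b/;
    }
    return (\@vowels, \@consonants);
}

binmode STDOUT, ':encoding(UTF-8)';
my ($vowels, $consonants) = classify(0x00, 0xFF);
print "vowels:     @$vowels\n";
print "consonants: @$consonants\n";
```

For Latin-1 this should give the first 32 vowels and the first 23 consonants of the lists above. Widening the range to the whole of Unicode is a matter of changing the two arguments, at the cost of a much longer loop.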

And then some Perl:

#!/usr/bin/perl
# Copyright (C) 2009-2017  Alex Schroeder <alex@gnu.org>
#
# This program is free software: you can redistribute it and/or modify it under
# the terms of the GNU General Public License as published by the Free Software
# Foundation, either version 3 of the License, or (at your option) any later
# version.
#
# This program is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
# FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along with
# this program. If not, see <http://www.gnu.org/licenses/>.

use Modern::Perl;
use utf8;
use open qw/:std :utf8/;

# (let (names vowels)
#   (map-charset-chars
#    (lambda (range arg)
#      (let ((from (car range))
# 	   (to (cdr range)))
#        (dotimes (i (1+ (- to from)))     ; TO is inclusive
#          (let ((c (+ from i)))           ; the code point itself, not the offset
#            (setq names (cons (cons c (get-char-code-property c 'name)) names))))))
#    'unicode)
#   (dolist (name names)
#     (when (and (cdr name) (string-match "^latin small .*\\b[aeiouy]\\b" (cdr name)))
#       (setq vowels (cons (car name) vowels))))
#   (mapconcat 'char-to-string vowels ""))

my $vocals = "aeiouyàáâãäåèéêëìíîïòóôõöøùúûüýÿāăąēĕėęěĩīĭįıōŏőũūŭůűųŷơưƴǎǐǒǔǖǘǚǜǝǟǡǫǭǻǿȁȃȅȇȉȋȍȏȕȗȧȩȫȭȯȱȳɇɏɐɔɘɛɜɝɞɨɵʉʎʚᴈᴉᴑᴒᴓᴖᴗᴝᴞᵻᵾᶏᶒᶓᶔᶖᶗᶙḁḕḗḙḛḝḭḯṍṏṑṓṳṵṷṹṻẏẙẚạảấầẩẫậắằẳẵặẹẻẽếềểễệỉịọỏốồổỗộớờởỡợụủứừửữựỳỵỷỹỿⱥⱸⱺꝋꝍꬱꬲꬳꬴꬽꬾꬿꭃꭄꭎꭏꭒꭚꭡ";

# (let (names consonants)
#   (map-charset-chars
#    (lambda (range arg)
#      (let ((from (car range))
# 	   (to (cdr range)))
#        (dotimes (i (1+ (- to from)))     ; TO is inclusive
#          (let ((c (+ from i)))           ; the code point itself, not the offset
#            (setq names (cons (cons c (get-char-code-property c 'name)) names))))))
#    'unicode)
#   (dolist (name names)
#     (when (and (cdr name) (string-match "^latin small .*\\b[^aeiouy ]\\b" (cdr name)))
#       (setq consonants (cons (car name) consonants))))
#   (mapconcat 'char-to-string consonants ""))

my $consonants = 'bcdfghjklmnpqrstvwxzßçñćĉċčďđĝğġģĥħĵķĺļľŀłńņňŉŕŗřśŝşšţťŧŵźżžſƀƃƈƌƒƙƚƞƥƫƭƶǥǧǩǰǵǹȑȓșțȟȡȥȴȵȶȷȼȿɀɉɋɍɓɕɖɗɟɠɡɥɦɫɬɭɯɰɱɲɳɹɺɻɼɽɾɿʂʄʇʈʋʌʍʐʑʝʞʠʮʯᴟᵬᵭᵮᵯᵰᵱᵲᵳᵴᵵᵶᵷᵹᵽᶀᶁᶂᶃᶄᶅᶆᶇᶈᶉᶊᶌᶍᶎᶑḃḅḇḉḋḍḏḑḓḟḡḣḥḧḩḫḱḳḵḷḹḻḽḿṁṃṅṇṉṋṕṗṙṛṝṟṡṣṥṧṩṫṭṯṱṽṿẁẃẅẇẉẋẍẑẓẕẖẗẘẛẜẝỻỽↄⱡⱦⱨⱪⱬⱱⱳⱴⱶⱹꜿꝁꝃꝅꝇꝉꝑꝓꝕꝗꝙꝛꝟꝣꝺꝼꝿꞁꞃꞅꞇꞎꞑꞓꞔꞕꞗꞙꞡꞣꞥꞧꞩꬱꬵꬶꬷꬸꬹꬺꬻꬼꭃꭄꭅꭇꭈꭉꭊꭋꭌꭖꭗꭘꭙﬅ';

sub digraphs {
  my $word = shift;
  my @digraphs = $word =~ /([$consonants]+[$vocals]*|[$vocals]+)/g;
  # local $" = "•";
  # warn "@digraphs";
  for (@digraphs) {
    if (/^([$vocals]+)/ or /([$consonants]+)([$vocals]*)/) {
      my ($one, $two) = ($1, $2);
      $one = "." if length($one) == 0;
      $two = "." if not $two or length($two) == 0;
      $one = "[$one]" if length($one) > 1;
      $two = "[$two]" if length($two) > 1;
      $_ = "$one$two";
      # warn $_;
    } else {
      warn "WTF is $_\n";
    }
  }
  return \@digraphs;
}

sub words {
  my $text = shift;
  my @words = $text =~ /(\w+)/g;
  # local $" = "•";
  # warn "@words";
  return \@words;
}

sub process {
  my $text = shift;
  my $words = words($text);
  # local $" = "•";
  # warn "@$words";
  my %digraphs;
  for my $digraphs (map { digraphs($_) } @$words) {
    # warn "@$digraphs";
    for my $digraph (@$digraphs) {
      # warn $digraph;
      $digraphs{$digraph}++;
    }
  }
  # sort descending by number of occurrences
  my @keys = sort { $digraphs{$b} <=> $digraphs{$a} } keys %digraphs;
  # print the best
  for my $digraph (@keys[0..49]) {
    print "$digraph" if $digraph;
  }
  print "\n";
}

local $/;
process(lc(<STDIN>));
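To see what the script does to a single word, here is a stripped-down sketch of the same splitting, using plain ASCII letter sets so that it stands on its own. `split_word` is a made-up name; the first regular expression and the `.` and `[...]` conventions are taken from `digraphs` above.

```perl
use strict;
use warnings;

my $vowels     = 'aeiouy';
my $consonants = 'bcdfghjklmnpqrstvwxz';

# Split a lower-case word into chunks of consonants followed by vowels
# (or leading vowels alone) and encode each chunk as a two-part digraph.
sub split_word {
    my $word = shift;
    my @digraphs;
    for my $chunk ($word =~ /([$consonants]+[$vowels]*|[$vowels]+)/g) {
        my ($one, $two) = $chunk =~ /^([$consonants]+|[$vowels]+)([$vowels]*)$/;
        $two = '.' if $two eq '';               # "." marks an empty half
        $one = "[$one]" if length($one) > 1;    # brackets group several letters
        $two = "[$two]" if length($two) > 1;
        push @digraphs, "$one$two";
    }
    return @digraphs;
}

print join(' ', split_word($_)), "\n" for qw(romeo and thou);
# ro m[eo]
# a. [nd].
# [th][ou]
```

Note that a chunk of consonants without a vowel, like the "nd" in "and", becomes a digraph of its own.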

And then use it to parse Romeo & Juliet:

$ ./digraph-extraction.pl < ~/Documents/Romeo\ and\ Juliet.txt
a.r.t.s.i.n.[th]eo.[nd].d.reme[ll].vetof.wi[th].m.no[ng].[you].e.besetehene[th]amafo[th]ihacow.l.myro[st].[th][ou]casololihiu.lewemode

And feed that to the name generator:

Mehitove
Nomond
Llng
Cofa
Ewif
Lolif
Limethithi
Eveto
Llmost
Teuhe
Remor
Thiliso
Afomem
Ngbemy
Ngfo
Istmo
Ngth
Hawref
Hengtfo
Rwili
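For reference, here is roughly how I picture a digraph string turning into names. This is a sketch and not the actual generator: it assumes the string is read two tokens at a time, that `.` stands for nothing, and that `[...]` groups several letters into one token, which is what the output above suggests (“Llng” looks like `[ll].` followed by `[ng].`). `parse_digraphs` and `random_name` are made-up names.

```perl
use strict;
use warnings;

# Turn "a.r.[th]eo." into the list ("a", "r", "the", "o").
sub parse_digraphs {
    my $string = shift;
    my @tokens = $string =~ /(\[[^\]]+\]|.)/g;    # "[th]" or a single character
    my @pairs;
    while (my ($one, $two) = splice(@tokens, 0, 2)) {
        my $pair = $one . ($two // '');
        $pair =~ tr/.[]//d;                       # drop placeholders and brackets
        push @pairs, $pair;
    }
    return @pairs;
}

# Glue two to four randomly chosen pairs together and capitalise the result.
sub random_name {
    my @pairs = @_;
    my $count = 2 + int rand 3;
    return ucfirst join '', map { $pairs[rand @pairs] } 1 .. $count;
}

my @pairs = parse_digraphs('a.r.t.s.i.n.[th]eo.[nd].d.reme[ll].veto');
print random_name(@pairs), "\n" for 1 .. 5;
```

With vowel-less pairs such as "ll", "ng" and "nd" in the pool, nothing stops a generator like this from stringing them together.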

Hm. It’s not very impressive. 😔

#Programming #Name Generator

Comments

(Please contact me if you want to remove your comment.)

⁂

This needs Markov chains, or at least an indicator for terminal sounds. Looking at “Ngfo” for example: Sure, many words end with a last consonant like “ng”. But that doesn’t mean you can put “ng” anywhere in a word. So actually what we need to do looks like a Markov chain based on syllables, I think? I guess you will then never get any variations on “John”. Perhaps we could implement a sort of “word2vec” for characters: we’d find that generally speaking “o” often occurs in the context of “j”, “h” and “n” and perhaps we’d find that “u” often occurs in the same context (“June”) and so if we get the “nearest” characters to “o” given the context of “j”, “h” and “n” we might get back “u”, leading to the “new” word “Juhn”.
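
A minimal sketch of such a syllable-based Markov chain, under loud assumptions: the words arrive already split into syllables (the hand-split names below are made up for the example), `^` and `$` are hypothetical start and end markers, and only the immediately preceding syllable is taken into account.

```perl
use strict;
use warnings;

# Hypothetical training data: names split into syllables by hand.
my @words = ([qw(ro me o)], [qw(ro sa line)], [qw(ju li et)],
             [qw(ben vo li o)], [qw(mer cu ti o)], [qw(ca pu let)]);

# For every syllable, record which syllables (or the end marker) followed it.
sub train {
    my %next;
    for my $word (@_) {
        my @chain = ('^', @$word, '$');
        push @{ $next{ $chain[$_] } }, $chain[$_ + 1] for 0 .. $#chain - 1;
    }
    return \%next;
}

# Walk the chain from the start marker until the end marker is drawn
# or $max syllables have been collected.
sub generate {
    my ($next, $max) = @_;
    my ($state, @out) = ('^');
    while (@out < $max) {
        my $options = $next->{$state};
        $state = $options->[rand @$options];
        last if $state eq '$';
        push @out, $state;
    }
    return ucfirst join '', @out;
}

my $next = train(@words);
print generate($next, 6), "\n" for 1 .. 5;
```

Because "et" and "line" are only ever followed by the end marker, they cannot turn up in the middle of a name, and nothing can start a name unless it started one in the training data. The price is that the chain only recombines syllables it has seen: "Julio" and "Benvoliet" are possible here, "Juhn" is not.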

– Alex Schroeder 2019-01-30 07:35 UTC

---

Good to see Perl use.

Ngfo, have to use that one. And make up the dialect of a language that has a silent Ng. 😊

– Blue Tyson 2019-01-30 10:12 UTC

---

Heh. 😊

– Alex Schroeder 2019-01-30 13:01 UTC