💾 Archived View for thrig.me › blog › 2023 › 04 › 25 › strings-to-numbers.gmi captured on 2024-12-17 at 09:56:36. Gemini links have been rewritten to link to archived content

Strings To Numbers

Suppose we wish to convert strings to numbers. There are various reasons for this; a trivial one is to colorize IRC nicknames or hostnames according to some algorithm. Lagrange apparently does this to colorize gemini capsules, so you might cheat and look at how it does that. Hostnames are already numbers, or rather sequences of them, though we may want to color particular hosts in particular ways; maybe hosts under .com should be similar to one another. Another and more typical way would be to feed the sequence of numbers into a hash function, a digital equivalent of sausage.

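    #!/usr/bin/env python3
    import hashlib
    # SHAKE-256 has a variable-length digest: one byte for the TLD plus two
    # for the rest gives six hex digits, which would suit an RGB color
    tld  = hashlib.shake_256('com'.encode()).hexdigest(1)
    rest = hashlib.shake_256('foo'.encode()).hexdigest(2)
    print(tld + rest)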

Grouping Hostnames Somehow

Let's try the first method, where similar hostnames should have similar numbers. This implies that the tail of the hostname, the com in example.com, is more significant than whatever is going on at the beginning. Thus foo.com and bar.com should be closer to one another than foo.xyz is to foo.com. So we need some number for com, and then maybe to add foo or bar or example to that. Maybe this will go character by character (m, o, c, dot, o, o, f) or maybe we can chunk the components of the hostname up (com, foo) and then devise some yet-to-be-invented (by me) algorithm to produce a suitable number. If com is more important, maybe it's in the millions place, while foo or example live down in the thousands place? Something like that.
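The millions-and-thousands idea might look something like the following sketch. The names label_code and place_value are made up for illustration, and summing character codes modulo 1000 is only a placeholder for the yet-to-be-invented part; among other flaws it cannot tell anagrams apart.

```python
def label_code(label: str, modulus: int = 1000) -> int:
    """Placeholder: squash a label into 0..modulus-1 by summing character codes."""
    return sum(ord(c) for c in label) % modulus


def place_value(hostname: str) -> int:
    """Last label goes in the millions place, the one before it in the thousands."""
    labels = hostname.lower().rstrip(".").split(".")
    tld = labels[-1]
    name = labels[-2] if len(labels) > 1 else ""
    return label_code(tld) * 1_000_000 + label_code(name) * 1_000


for host in ("foo.com", "bar.com", "foo.xyz"):
    print(host, place_value(host))
```

Under these assumptions foo.com and bar.com differ only in the thousands place, while foo.xyz lands millions away from both.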

A good idea might be to set the bounds on the numbers a hostname can produce, especially if you want numbers practical for computers to deal with. That is to say, not too big. A naive algorithm could easily produce big numbers. What if we multiply the character code by something to some power, and increment the power for each character? Speaking of naive algorithms,

    (defun numberize (string &aux (number 0))
      (loop for power from 1
            for char  across string do
            (incf number (* (char-code char) (expt 10 power))))
      number)

This has the nice property that foo.com and bar.com are pretty close to one another, but the bad property that very long hostnames produce very large numbers. If we are colorizing the hostnames, 0xFFFFFF might be a reasonable maximum, or 16777215, or maybe some (much) lower value if we want to use visually distinct colors for the hosts. Or maybe we want to restrict the numbers to two to the 53rd power, past which a double-precision float can no longer represent every integer? The numberize algorithm fails that for long enough hostnames.

    SBCL> (NUMBERIZE "foo.com")
    1211483120
    SBCL> (NUMBERIZE "bar.com")
    1211484680
    SBCL> (NUMBERIZE
          "llanfairpwllgwyngyllgogerychwyrndrobwyll-llantysiliogogogoch.com")
    1211475125252526978380893701400352274052352525012534050114558408880

Some languages will not support a number that large, or may become terribly slow (or even more so than usual), or may silently convert the number to something like 1.21147512525253e+66, none of which may be ideal. The test cases are also bad: "foo" and "bar" are of the same length, so consider how "aaa" and "aaaaa" numberize. A human might consider these two strings as being pretty similar; this algorithm, not so much.

    SBCL> (NUMBERIZE "aaa")
    107670
    SBCL> (NUMBERIZE "aaaaa")
    10777670
    SBCL> (FORMAT NIL "~R" (NUMBERIZE "aaa"))
    "one hundred seven thousand six hundred seventy"
    SBCL> (FORMAT NIL "~R" (NUMBERIZE "aaaaa"))
    "ten million seven hundred seventy-seven thousand six hundred seventy"
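For those without a Lisp handy, here is a rough Python port of numberize, along with checks against the limits mentioned above. MAX_COLOR and MAX_EXACT are names invented for this sketch, and the float() call merely stands in for what a language with only floating point numbers would do on its own.

```python
def numberize(string: str) -> int:
    """Port of the Lisp NUMBERIZE: char code times 10**position, positions from 1."""
    return sum(ord(ch) * 10 ** power for power, ch in enumerate(string, start=1))


MAX_COLOR = 0xFFFFFF   # 16777215, the 24-bit color ceiling
MAX_EXACT = 2 ** 53    # past this a double cannot hold every integer

long_host = "llanfairpwllgwyngyllgogerychwyrndrobwyll-llantysiliogogogoch.com"
for host in ("foo.com", "bar.com", long_host):
    n = numberize(host)
    print(host, n, "fits color:", n <= MAX_COLOR, "fits 2**53:", n <= MAX_EXACT)

# Round-trip the big number through a float to see whether anything was lost.
big = numberize(long_host)
as_float = float(big)
print(as_float, "survived intact:", int(as_float) == big)

# Two extra characters cost about two orders of magnitude.
print(numberize("aaaaa") / numberize("aaa"))
```

Note that even foo.com overshoots 0xFFFFFF here, so a color mapping would need some reduction step regardless of hostname length.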

One idea might be to truncate long hostnames or hostname components so that the numbers do not grow without bound. Another idea would be to pad short hostnames to the same length so that the numbers are all of the same magnitude. And to use base 2 instead of base 10 so the numbers stay smaller? And maybe to reduce runs of "aaaaa" to just "a"? But that's a lot of arbitrary distortions to the data, few of which may produce good or even acceptable results. And how customizable is the algorithm? Maybe someone wants .com to be near purple instead of the default green, and someone else finds the blues unreadable, so how would you omit those? (I set xterm*colorMode:false by default to avoid unreadable colors.)
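To see how those distortions play out, here is a sketch that piles all of them on at once. The width of 8, the pad character "a", and the function names are arbitrary choices for illustration, and the stated bound assumes ASCII input.

```python
import re


def normalize(label: str, width: int = 8, pad: str = "a") -> str:
    """Collapse runs of a repeated character, then truncate or pad to a fixed width."""
    collapsed = re.sub(r"(.)\1+", r"\1", label.lower())
    return collapsed[:width].ljust(width, pad)


def bounded_numberize(label: str, width: int = 8) -> int:
    """Base 2 over a fixed-width label; for ASCII at most 127 * (2**(width+1) - 2)."""
    text = normalize(label, width)
    return sum(ord(ch) * 2 ** power for power, ch in enumerate(text, start=1))


for label in ("aaa", "aaaaa", "foo", "fo", "llanfairpwllgwyngyll"):
    print(label, normalize(label), bounded_numberize(label))
```

The numbers now stay small, but "aaa" and "aaaaa" become identical, as do "foo" and "fo", which is the sort of distortion that may or may not be acceptable.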

A pool of usable colors arranged into a shallow tree may work; there might be high-level "greenish" or "purplish" buckets, and under those various suitable colors that fit the category. Then .com could be assigned, randomly or otherwise, to a top-level bucket, and .com hostnames would pick some suitable color from the bucket. Probably by a hash of the hostname, unless someone can think up a clever and efficient way to make similar hostnames use similar colors from a limited pool of choices. Simpler would be to hash the whole hostname, pick a random suitable color, and to have a knob to turn the colorization off.
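A shallow tree of this sort could be sketched as follows. The palette, the OVERRIDES table, and the ENABLED knob are all hypothetical, and SHA-256 is used only because it is handy; any stable hash would do. Leaving out a bluish bucket is how this sketch omits the unreadable blues.

```python
import hashlib
from typing import Optional

# Hypothetical palette: top-level buckets, each holding a few RGB colors.
PALETTE = {
    "greenish": [0x2E8B57, 0x3CB371, 0x66CDAA],
    "purplish": [0x800080, 0x9370DB, 0xBA55D3],
    "orangish": [0xD2691E, 0xFF8C00, 0xF4A460],
}
OVERRIDES = {"com": "purplish"}  # someone wants .com near purple
ENABLED = True                   # the knob to turn colorization off


def stable_index(text: str, size: int) -> int:
    """Map text to 0..size-1 the same way on every run."""
    digest = hashlib.sha256(text.encode()).digest()
    return int.from_bytes(digest[:4], "big") % size


def color_for(hostname: str) -> Optional[int]:
    """Pick a bucket from the last label, then a color in it from the whole name."""
    if not ENABLED:
        return None
    host = hostname.lower().rstrip(".")
    tld = host.rsplit(".", 1)[-1]
    buckets = sorted(PALETTE)
    bucket = OVERRIDES.get(tld) or buckets[stable_index(tld, len(buckets))]
    colors = PALETTE[bucket]
    return colors[stable_index(host, len(colors))]


for host in ("foo.com", "bar.com", "foo.xyz"):
    print(host, format(color_for(host), "06X"))
```

Hosts under the same last label share a bucket, so they come out in related colors, though similar names within a bucket are no closer to each other than dissimilar ones.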

The Actual Motivation for these Ruminations

gemini://raek.se/orbits/space-elevator/

Orbits of hosts are a thing: usually a singly or doubly linked list of hosts, often with an option to teleport somewhere, or what used to be called webrings. Another idea is that hosts would be in an orbit around something, maybe with the complication that the hosts rotate, so only certain other hosts are visible at certain times, depending on where one is on the host, or there's only so much transmission power so one can only jump so far. Probably not very practical, but what is?

Maybe the host . would be the equivalent of the Sun or a galactic center, depending on the scale of your ambitions, and then .com a planet, foo.com a satellite about com. One would probably stop this at some depth, maybe three or four, so there are not too many different things in orbits of orbits? Also maybe with a limit on the total number of objects to keep the calculations somewhat sane. And yes, . is a host, or set of hosts, often visited but not often seen.
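The nesting could be sketched by walking a hostname from the right. The function name orbit_path and the default depth of three are assumptions, and what to do with labels past the depth limit (here they are simply dropped) is left open.

```python
def orbit_path(hostname: str, max_depth: int = 3) -> list[str]:
    """Chain of bodies from the root "." outward, at most max_depth below the root."""
    labels = [part for part in hostname.lower().rstrip(".").split(".") if part]
    path = ["."]
    name = ""
    for label in reversed(labels):
        if len(path) > max_depth:
            break  # too deep; stop adding orbits of orbits
        name = label if not name else f"{label}.{name}"
        path.append(name)
    return path


for host in (".", "foo.com", "a.b.foo.com"):
    print(host, orbit_path(host))
```

Here foo.com yields the Sun, the planet com, and the satellite foo.com, while a.b.foo.com stops at b.foo.com under the default limit.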

    $ host .
    $ host -t NS .
    . name server b.root-servers.net.
    . name server i.root-servers.net.
    . name server m.root-servers.net.
    . name server h.root-servers.net.
    . name server c.root-servers.net.
    . name server k.root-servers.net.
    . name server f.root-servers.net.
    . name server g.root-servers.net.
    . name server j.root-servers.net.
    . name server e.root-servers.net.
    . name server l.root-servers.net.
    . name server d.root-servers.net.
    . name server a.root-servers.net.

Hosts with a lot of subhosts may need some thought, as flounder.online in particular contains about 30% of the lupa-capsules.txt hostnames. That might involve . for the Sun, online a planet, flounder a moon, and then seven hundred plus satellites swarming that one moon. Maybe it could work out? I am not a rocket scientist, and this is not rocket science. Keeping the orbits circular would help simplify the calculations. The equation of the center is reputed to diverge for orbits that are too eccentric, and simulating everything might burn hilarious amounts of CPU.
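With circular orbits a position is just trigonometry, no equation of the center required. A minimal sketch, where the radii, periods, and function names are invented and no actual gravity is involved:

```python
import math


def circular_position(radius: float, period: float, t: float,
                      phase: float = 0.0) -> tuple[float, float]:
    """Offset from the parent body at time t for a circular orbit."""
    angle = phase + 2.0 * math.pi * (t / period)
    return radius * math.cos(angle), radius * math.sin(angle)


def nested_position(chain: list[tuple[float, float]], t: float) -> tuple[float, float]:
    """Sum the offsets of nested orbits, such as planet about Sun, moon about planet."""
    x = y = 0.0
    for radius, period in chain:
        dx, dy = circular_position(radius, period, t)
        x += dx
        y += dy
    return x, y


# online about ".", flounder about online, one satellite about flounder
chain = [(100.0, 365.0), (10.0, 30.0), (1.0, 1.0)]
for t in (0.0, 0.25, 0.5):
    print(t, nested_position(chain, t))
```

Seven hundred satellites about one moon could share a radius and differ only in phase, or be spread over several radii; either way each position costs a handful of sines and cosines per level.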

Another way would be to limit sign-ups (as is traditional for orbits) and to put the hosts into a solar system according to some algorithm. Maybe there could be a longer-range link to other little solar systems, a warp gate or plot actuating device, something like that, if need be.