💾 Archived View for tilde.team › ~aprilnightk › gemlog › 2022 › 05 › 31-quasidomain-new.gmi captured on 2023-01-29 at 17:22:56. Gemini links have been rewritten to link to archived content
⬅️ Previous capture (2023-01-29)
-=-=-=-=-=-=-
(NOTE: This is a verbatim copy of the corresponding page from Hafnium Tour at hfnp://hafnium.D4GREATu/tour/quasidomain so it could be available on here too.)
You may have noticed that this Hafnium site is named hafnium.D4GREATu. This is unusual because it's not a normal domain name, it's a HFNP quasidomain name.
Quasidomain system is technically not dependent on HFNP: at its' core it's just a particular (and arguably convoluted) way to encode host/port information.
A quasidomain always has the same form: [sitename].[appendix], where appendix is a sequence of characters (usually of length between 8-15, depending on a set of conditions).
A crucial thing to understand here is that the quasidomain system works on top of the already existing system of addresses. It is merely a wrapper.
Each quasidomain resolves to a hostname and a port number, where a hostname can be either an IPv4 address, or an actual domain name.
[sitename].[appendix] resolves into [hostname]:[port]
A sitename here is precisely that - an arbitrary name associated with this quasidomain. It doesn't have to be the same as hostname - it can be anything.
An appendix contains - in a pretty compressed way - all the information needed to resolve the quasidomain.
For each combination of hostname and port number, there are numerous possible quasidomains (1028 for an IP address, 3084 in case of a domain name) - so a site owner can choose the one that he feels looks better. So, one [hostname]:[port] pair can have many quasidomains, but each quasidomain only resolves to one [hostname]:[port] pair (many-to-one correspondence).
Following are some examples. (Note that 9777 and 9778 are recommended HFNP port numbers. These numbers yield quasidomains that are a bit (technically - a byte) shorter).
mylocalsite.DABZ6MZR --> 127.0.0.1:9777 mylocalsite.CH+t8Qo$ --> 127.0.0.1:9777 myhfnsite.WqRemA& --> myhfnsite.io:9778 myhfnsite.XNc65yA$ --> myhfnsite.io:9778 myhfnsite.rNFaDPU@zZ2HPbJmPw& --> myinternetsite.com:9779 google.com.zNjzZBG7%dw$ --> google.com:9777
This system had several key concerns in mind that were observed in the process of development. Here they are:
alice.AH9uNAo$ --> 127.0.0.1:9777 bob.AH9uNAo$ --> 127.0.0.1:9777
Following is an in-depth description of the process of generating a quasidomain appdendix out of a desired sitename, and ip/hostname, and a port number. You can safely omit reading this if you are not interested in technicalities.
Note that this algorithm is trivially reversible, so the reverse algorithm is not described here.
A quasidomain appendix as such is a (customly) base64 encoded sequence of bytes.
The first byte is always a control byte which contains information about conditions that were held for this particular quasidomain. Following is the scheme of the control byte:
1st bit \___ scenario indicator (4 possible values) 2nd bit / 3rd bit \___ port type (4 possible values) 4th bit / 5th bit ---- name mixin (2 possible values) 6th bit ---- extended variant indicator (2 possible values) 7th bit \___ nonextended variant OR 2 highest bits of an extended variant (4 possible values) 8th bit /
Scenario indicator shows what scenario was used. There are 4 scenarios:
Port type indicates whether the appendix can be shortened based on the portnumber.
Name mixin bit indicates whether the sitename was "mixed" into the appendix generation.
Extended variant bit indicates whether the variant number of this quasidomain is greater than 3. If it is greater than 3, then an additional byte (called auxilliary variant byte) must follow the control byte, to carry the variant number. Due to this, first 4 variants sometimes provide shorter quasidomain appendices due to ommitting the auxiliary byte.
Final two bits indicate:
Scenario 0
In this scenario, the control (and auxiliary, if present) bytes are followed by 2 bytes (or 1, or none at all, depending on the control byte) representing the port number, and 4 bytes representing the IP address.
Scenario 1
In this scenario, the control (and auxiliary, if present) bytes are followed by 2 bytes (or 1, or none at all, depending on the control byte) representing the port number, and the SDNE (see Appendix 1) encoded list of top-level domains. For example, if the desired sitename is "mysite" and the domain name is "mysite.org.uk", then we need to only encode ["org", "uk"].
Scenario 2
In this scenario, the control (and auxiliary, if present) bytes are followed by 2 bytes (or 1, or none at all, depending on the control byte) representing the port number, and the SDNE (see Appendix 1) encoded hostname, including the list of top-level domains. For example, if the desired sitename is "mysite" and the domain name is "differentsite.org.uk", then we need to encode "differentsite" and ["org", "uk"].
Scenario 3
In this scenario, the control (and auxiliary, if present) bytes are followed by 2 bytes (or 1, or none at all, depending on the control byte) representing the port number, and the SDNE (see Appendix 1) encoded hostname postfix, followed by the list of top-level domains. For example, if the desired sitename is "mysite-hafnium" and the domain name is "mysite.org.uk", then we need to encode "-hafnium" and ["org", "uk"].
To make multiple variants possible, the variant number (represented as 1 byte if less than 255, or 2 bytes otherwise) is SHA-256 hashed. Then, the first N bytes of this hash are taken, where N is the length of our current resultant bytes minus 2, and all the bytes except for the first two are XORred with this hash. Like this:
VARIANTHASH 01001001 10010001 ... APPENDIXBYTES 01001100 10100100 10010010 00100000 ... RESULT 01001100 10100100 11011011 10110001 ...
The reason for excluding first two bytes from XORring is that they need to stay intact, for those are our control byte and (potentially) auxiliary variant byte. While resolving, we will need to know the variant number before un-xorring the value.
(If the 5th bit of the control byte is ommited, then this stage is bypassed.)
SHA-256 hash of the desired sitename is taken, and is xorred with the appendix bytes, except for the first byte (similar to the previous step - because it's the control byte, and we need to know its' 5th bit before performing the reverse of this step).
We now have our appendix bytes, and they need to be base64 encoded. The working of base64 is tweaked here, though, and this tweaked algorithm can be referred to as base64q.
Differences of base64q as compared to base64:
== replaced with & = replaced with $ 0 replaced with # I replaced with @ O replaced with < l replaced with !
The reasoning behind choosing these characters is that = and ? can't be used so as not to confuse the URI parser, the 0, I, O and l should not be used due to be visually confusing, and the characters that are[ used need to be present on the standart keyboard.
SDNE is short for "site domain name encoding" and is a homebrew algorithm of turning strings into bytes with some specialized compression.
As an input, this algorithm takes a string and, separately, a list of strings (each of which must be a valid top-level domain name).
Input strings must satisfy the following condition: all characters must be either:
The idea behind SDNE is that we need to turn strings into as short byte sequences as possible.
Just turning each character into a byte is too wasteful: there are only 38 possible input characters, and a byte can encode up to 256. Five bits, unfortunately, aren't enough, because 11111 base 2 = 31. At the same time, there are ways to do better than a standart 6-bit encoding.
English language is predominant in URLs, so SDNE is based upon observations about English language. SDNE makes use of three lists:
SDNE_1: A list of 21 most frequently used English letters, plus all 10 digits (abcdefghiklmnoprstuwy0123456789). That is 31 values, while 32nd value is reserved.
SDNE_2: A list of 2041 most frequently used trigrams, followed by least frequently used English letters (z, q, j, x, v), a full-stop and a dash.
SDNE_3: A list of 2048 top-level domain names as published by ICANN. There are currently fewer domain names, so the free spots are reserved.
The exact lists for SDNE_2 and SDNE_3 are provided with the quasidomain library.
The encoding rules are these:
This algorithm provides compression based on the facts that frequent trigrams, occupy only 12 bits instead of 18, and the TLD names occupy only 12 bits too, while some of them are in fact pretty long.
- - - - - - -
Keith Aprilnight (aprilnightk@tilde.team)