This is a gemlog about project ATHN, a new competitor to the web, so this post is written from the perspective of developing that project. This post is about compression.
You can check out the ATHN homepage here (still on the web, how ironic)
ATHN markup is very lightweight. Translating from HTML to ATHN is already a bit like compressing it. Depending on how much extra crap you stuffed into your HTML, ATHN markup can be wildly more efficient. As an example, the HTML version of the ATHN homepage is 4176 bytes long. The ATHN markup version is about 1.5 times smaller at 2791 bytes. And that website literally only has 10 lines of CSS. This is the absolute worst case scenario for ATHN markup. A longer document I made, which was hand translated to HTML, is 7.24 **times** smaller in ATHN markup. And I'm sure a site like CNN would give even crazier results.
But if you have a lot of actual content, things can still get kinda big. And a lot of people around the world have slooooow internet connections. Compression is not as necessary for ATHN as it is for the web, but it can definitely still make things faster. So is it worth the extra complexity?
Well I think we'll have to wait and see, but in the rest of this post I'll assume that the answer to that question is yes and play around with some fun ways of doing it.
An easy way to get compression almost for free would be to use https or some other protocol that has compression built in. And I'm not here to say that https is bad... but: if I make the hello world app with the reqwest https library for rust, it will install 104 dependencies and take 14.17s (as opposed to 0.29s without) to compile on my 8-core Ryzen 6800H. And that's not reqwest's fault, it's actually a pretty great library, it's https' fault.
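For context, this is roughly the kind of hello world I'm talking about (assuming reqwest with its blocking feature enabled, your exact numbers will vary):

```rust
// Fetch a page and print it, that's the whole "app".
// Assumes reqwest with the "blocking" feature in Cargo.toml.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = reqwest::blocking::get("https://example.com")?.text()?;
    println!("{}", body);
    Ok(())
}
```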
Look, my point is that https is bloated for what we're trying to do here. We just wanna transfer some pretty small plain text documents with a very restrictive syntax, and https is over here validating X.509 certificate chains, negotiating ciphersuites, setting up resumable downloads and, yes, compressing on the fly. We don't need 58 different headers, we need 1. https is overkill for ATHN; it has tons of features that I'm sure are useful, just not for us. And sure, you can just use a library like reqwest, but we should be using the simplest possible solution that meets our needs, a protocol more like gemini. Of course gemini doesn't really support compression, and it also uses TLS for encryption, which is anything but simple.
Maybe we'll be able to find that perfect match: a protocol that does just what we need and not much more, one that's really simple, and maybe even one that has built-in compression. Or we might have to make our own, so let's explore that next.
To make ATHN markup as machine readable as possible, the format is pretty restrictive. Which means that it's not too hard for humans to fuck it up just a little without noticing (trust me, I've been there), and nobody likes a malformed document, that's just a bad time. So the server first has to verify that every document it sends is compliant, ideally as soon as possible so that it can be corrected in time. Of course we can't expect every server to be perfect, so the client also has to verify the documents it receives to some degree. So here's a little diagram of the journey of an ATHN document.
Starting at the server: Verify -> Compress -> Encrypt => Send to client => Decrypt -> Decompress -> Deserialize (and check) -> Render
Deserializing means taking data from a network stream or a file on disk, in this case plain text, and converting it into a data structure, like a python class or a rust struct, that you can actually work with in your program.
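Just as a toy illustration (this is not the real ATHN format, the fields here are made up), deserializing could look something like this in rust:

```rust
// Hypothetical sketch: turn plain text into a struct the renderer can work with.
// The "title is the first line" rule is invented for the example.
struct Document {
    title: String,
    body: Vec<String>,
}

fn deserialize(text: &str) -> Document {
    let mut lines = text.lines();
    Document {
        title: lines.next().unwrap_or("").to_string(),
        body: lines.map(|l| l.to_string()).collect(),
    }
}

fn main() {
    let doc = deserialize("My document\nSome content\nSome more content");
    println!("{} ({} lines of body)", doc.title, doc.body.len());
}
```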
It seems a bit wasteful to have to verify the document twice, right? Well, it's kind of a necessity if things have to be stable. Just hold this little journey diagram in your head for a minute though, we'll touch on it again later.
So if we're making our own protocol we have complete freedom to compress the data in any way we like. An obvious choice would be to use one of the many compression libraries out there. We would need to find one that works well on text, since that's what we're compressing. It would also need to be efficient on both small and medium-sized files. Many compression libraries actually make small files larger when compressed because they're just not designed for those file sizes. But many ATHN documents would probably be quite small, and those certainly shouldn't be made any bigger. Also, we're not trying to cram every last byte into some microcontroller's tiny memory, we're doing this to make things faster. We want compression that's as fast as possible given some average internet speed, not necessarily the library with the best compression ratio. Some compression libraries are very complex pieces of code, but with the advantage that the code is already written. Picking the right compression library is no small feat, but we could also go another way.
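To make the library route concrete, here's a minimal sketch using the flate2 crate as a stand-in for "some compression library", not an endorsement of gzip in particular:

```rust
// A minimal sketch of the "just use a library" route.
// Assumes flate2 = "1" in Cargo.toml.
use std::io::Write;
use flate2::{write::GzEncoder, Compression};

fn compress(document: &str) -> std::io::Result<Vec<u8>> {
    let mut encoder = GzEncoder::new(Vec::new(), Compression::fast());
    encoder.write_all(document.as_bytes())?;
    encoder.finish()
}

fn main() -> std::io::Result<()> {
    let doc = "A small ATHN document";
    let compressed = compress(doc)?;
    // For tiny inputs the "compressed" version can easily come out larger than the original.
    println!("{} bytes -> {} bytes", doc.len(), compressed.len());
    Ok(())
}
```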
We could also make a compression algorithm of our own. Again, owing to the fact that ATHN markup has some quite strict rules, we could make quite a few assumptions that might end up saving a lot of bytes transferred and CPU cycles used. But from a code complexity standpoint there's another really interesting option. We could combine some of the steps in that diagram from before.
Instead of verifying the text file and then compressing it, we could encode it into a binary format custom made for this exact purpose, validating *and* compressing it in the process. We could design the format in such a way that it would be impossible to encode an uncompliant document into that binary format, meaning that the document would be validated simply by being in that format. And if compression is part of the conversion process, the binary version of a document would be smaller than the plain text one.
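Here's a very rough sketch of that idea. The line types and tag bytes are completely made up, but it shows the point: if the encoder can only ever represent valid constructs, its output is compliant by construction.

```rust
// Hypothetical line types, not the actual ATHN markup elements.
enum Line {
    Heading(String),
    Text(String),
}

fn encode(document: &[Line]) -> Vec<u8> {
    let mut out = Vec::new();
    for line in document {
        // One tag byte plus a length prefix per line, instead of textual markers.
        let (tag, text) = match line {
            Line::Heading(s) => (0x01u8, s),
            Line::Text(s) => (0x02u8, s),
        };
        out.push(tag);
        out.extend_from_slice(&(text.len() as u32).to_le_bytes());
        out.extend_from_slice(text.as_bytes());
    }
    out
}

fn main() {
    let doc = vec![
        Line::Heading("Hello".into()),
        Line::Text("Some content".into()),
    ];
    println!("{} bytes encoded", encode(&doc).len());
}
```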
But hang on a second, did you notice what we just did? We just removed the need to validate the document on the other end. "The document would be validated simply by being in that format" means that the client can be pretty damn sure that the document is valid if it receives it in that binary format. Also, we could design the format so that it's as easy as possible to deserialize, even easier than the original ATHN markup text format, because this format doesn't need to make sense to humans. And on the client side we could also combine the steps of decompression and deserialization.
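And the matching decode step could be just as dumb: no text parsing, no separate validation pass, just reading tags and length-prefixed strings back out (again, same made up format as the sketch above):

```rust
// Same invented tag bytes as the encoding sketch above.
enum Line { Heading(String), Text(String) }

fn decode(bytes: &[u8]) -> Option<Vec<Line>> {
    let mut lines = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let tag = bytes[i];
        let len = u32::from_le_bytes(bytes.get(i + 1..i + 5)?.try_into().ok()?) as usize;
        let text = String::from_utf8(bytes.get(i + 5..i + 5 + len)?.to_vec()).ok()?;
        lines.push(match tag {
            0x01 => Line::Heading(text),
            0x02 => Line::Text(text),
            _ => return None, // unknown tag: reject the document instead of guessing
        });
        i += 5 + len;
    }
    Some(lines)
}

fn main() {
    // A single Text line: tag 0x02, length 2, then "hi".
    let ok = decode(&[0x02, 2, 0, 0, 0, b'h', b'i']).is_some();
    println!("decoded ok: {}", ok);
}
```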
And now the diagram from before suddenly looks like this:
Starting at the server: Encode -> Encrypt => Send to client => Decrypt -> Deserialize -> Render
Static servers could even do the encoding step when they first receive the document and just save the encoded version to disk instead of the original plain text one. That would accomplish 3 things:
This way we avoid double work and it looks quite a bit cleaner. You might object and say that all of this encoding and extra stuff is a lot more work than just using ready-made libraries, and you'd be right. But it would mean better performance (probably), less overall code (although we might at first have to write more new code), and it would make it much easier for people to understand the system deeply.
Those were my initial thoughts on compression in project ATHN. I'm not saying that this is *the* answer, just some ideas and considerations. So I hope you enjoyed reading them and that I gave you some food for thought.