💾 Archived View for bvnf.space › blog › 003_c_complexity.gmi captured on 2022-03-01 at 15:09:53. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
I'm sure you're aware of the comparison of GNU's C code to, well, almost every other codebase. The notable example is the various implementations of yes(1), the utility to print "y\n" or another string in an infinite loop. Many people have noted this before, but for reference, here are some line counts:
GNU's yes is much faster than other implementations but this is really unnecessary optimisation. In general, GNU core utilities are POSIX + extensions + locale stuff + GNU-specific buildchain stuff, and even then the code is badly written and often unclear. BSDs have fully-featured utilities, although not often extending POSIX, written in much more understandable code, and it's safe (error checking and sandboxing etc.).
I came across k9core a while ago, and I was struck by the extreme simplicity of the utilities; most of them are just wrappers around C library functions, with almost no other code. Take its rm(1) implementation:
#include <errno.h> #include <stdio.h> #include <string.h> int main(int argc, char *argv[]) { if(argc == 1) { printf("Specify files to remove.\n"); return 1; } for(int i = 1; i < argc; i++) { int fd = remove(argv[i]); if(fd == -1) { fprintf(stderr, "Error removing file: %i = %s", errno, strerror(errno)); } } return 0; }
Obviously this is missing functionality and isn't POSIX, and you probably wouldn't want to use it as your everyday rm(1), but the code is totally understandable and readable. The underlying mechanism of our most-used utilities doesn't have to be some arcane black box.
k9core is missing a lot of functionality, and I tried to add some (and fix a few segfaults) while sticking to the same style of simplicity.
The fork is also mirrored on GitHub, but why would you want to go there.
After a few weeks of fiddling with k9core, I decided to start from scratch with my own core utilities. Part of the reason to start afresh was to improve the structure of the code, and partly so that I could write aiming purely for a POSIX implementation, and for a better opportunity to practice my C, but in particular I wanted the code I wrote to be under a more permissive license; namely, under the "unlicense", which does its best to release the code into the public domain. A license discussion is for another post.
bore (my basic core utilities)
"bore", as I've called the collection of tools, has been progessing well, albeit slowly. At the time of writing, I have the makings of 15 utilities, 7 of which are completed. I'm trying to stick to writing super-understandable code, like k9core's, although adding more complex functionality inevitably requires more complex code.
In addition, there is always a conflict between code simplicity, which leads to comprehensibility but verbosity, and code cleanliness, which is the aesthetic property of the code, which may include (but not necessarily) an improvement in performance. For example, consider the problem of writing the mode string for ls(1)'s long output. You've got the file st_mode from stat(3), which is a big octal number where each digit represents things like user, group, other permissions, if the file is executable, and so on. What's the best way to parse this into letters like "-rwxr-xr-x" or "drwxrwxrwx" or "-rw-r--r--" ?
/var/gemini/blog: -rw-r--r-- 1 git _vger 2339 Oct 27 14:16 001_setting_up_vger.gmi -rw-r--r-- 1 git _vger 598 Oct 30 02:03 002_custom_404s.gmi -rw-r--r-- 1 git _vger 3641 Nov 16 11:18 003_c_complexity.gmi
Suckless's sbase holds a buffer of the modestring, which starts as "----------", and then a run of if blocks testing with macros like S_IXUSR modifies single chars in the string. The order of the if blocks is organised so that the last check to modify a char provides the correct information - so if the file is executable by the owner and set-user-ID mode is set, that will first cause the fourth char to be 'x', but then a later check modifies it to s. After all the possibilities have been checked, the modestring is printed.
Arriving at this functionality in bore, I had previously been printing information as it was obtained; no buffers holding strings but rather more calls to putchar(3) between each section. Therefore, sticking to this method, I chose to loop over the bit in the st_mode, and print the appropriate character then and there. Doing this requires a less obvious algorithm: to examine each bit, we need to OR the st_mode with a 4, 2, or 1 in the appropriate bitfield, and then printing an 'r', 'w', 'x', or '-'. This becomes more complicated when checking for things like SsTt.
/* loop over 0400, 0200, 0100, 040, 020, etc and print the relevant char */ char *perms = "rwx"; for (int p = 2; p >= 0; p--) { for (int i = 04; i >= 1; i /= 2) { /* the following is a way of doing j = 010 to the power p */ int j = 01; for (int k = 0; k < p; k++) j *= 010; if (mode & (i * j)) printf("%c", perms[2 - i/2]); else putchar('-'); } }
This code is also using a trick with integer division to get the correct index of the perms string: 4/2 is 2; 2/2 is 1; 1/2 is 0.
So I have a choice between simpler, more verbose, less "clever" code like that of sbase, or something which is in my opinion more interesting to write but less clear. bore is a personal project of mine, so maybe I'll end up with messier code with more interesting algorithms, rather than something clean and useful for others as a reference. It's not difficult to avoid the type of complexity found in GNU code; most of it is a result of trying to optimize, and still catering for every instruction set and locale under the sun. That's difficult to do, and I'm not smart enough to try, so hopefully I will continue to avoid such problems. My questions are rather concerning the specific implementations of basic functionality.
One feature that I haven't implemented in any tool (yet) is directory recursion, like `ls -R` or `mkdir -p`. This single feature is used by so many tools and it's a situation where being UNIXy and piping the output of find(1) or something would be cleaner. However, I want to keep each utility as self-contained elements - no util.h or libbore, and no exec(3)ing something like "| sort". So it stands to reason that if I want to get to POSIX, recusion is coming.
I'll leave this post here for today, but I'd like to come back to some of the specifics of bore another time.
--
written 2021-11-17