The previous posting alluded to a problem with a split function, in that
```
my @words = split scalar readline *STDIN;
```
may consume lots of memory. If you are in a hurry and the number of words per line is not too high, then split is quick and easy. If not, memory use can be restricted by lexing the line and matching on individual words, which takes a bit more writing than the above:
```
#!/usr/bin/env perl
use 5.36.0;

{
    my $buf = '';

    sub getaword {
        goto TOKEN if length $buf;    # already working on a line
      READLINE:
        $buf = readline *STDIN;       # get a new line
        defined $buf or die "EOF";
        $buf =~ m/^\W*/cg;            # skip any leading nonsense
      TOKEN:
        if ( $buf =~ m/\G(\w+)\W*/cg ) {    # did we get a word?
            my $word = $1;
            my $eol  = $buf =~ m/\G$/cg;    # tag when end-of-line
            return $word, $eol;
        }
        goto READLINE;    # did not find a word, try next line
    }
}

while (1) { say for getaword; }
```
The "end of line" marker is so the caller knows when a full line of user input has been consumed, and can maybe prompt before the next call to getaword.
But what if a single line is too big? Yes! The line itself may be too long to fit into memory, or could consume "too much" memory. In this case, either wrap the input prior to reading it, fail (hopefully without the Linux out-of-memory killer killing another random process), or memory map or chunk-read the file, but those last options are even more fiddly and difficult. What happens when otherwise good input straddles one or more chunk boundaries? At that point you probably want a compiled language for the speeds, and maybe someone has a library for the task. Various modern software uses the quick and easy "split" route, and then some folks wonder why so much memory is being consumed…
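A minimal sketch of the chunk-read option, not code from any real library: read fixed-size chunks so memory stays bounded, and hold back a possibly-incomplete word at the end of each chunk in case it continues in the next one. The chunk size and variable names are arbitrary.

```
#!/usr/bin/env perl
use 5.36.0;

my $CHUNKSIZE = 8192;    # arbitrary
my $carry     = '';      # partial word held back from the previous chunk

while ( read *STDIN, my $chunk, $CHUNKSIZE ) {
    $chunk = $carry . $chunk;
    # the last word may continue in the next chunk, so hold it back
    $carry = ( $chunk =~ s/(\w+)\z// ) ? $1 : '';
    say $1 while $chunk =~ m/(\w+)/g;
}
say $carry if length $carry;    # whatever was left when the input ran out
```

A single word longer than the chunk size will still pile up in $carry, which is the limits problem that comes up again below.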
Years ago users complained about their gene parsing code; it turned out that they were passing duplicates of the array of genes around to lots of function calls. So some of this is simply not knowing about better options that reduce memory use. Another problem can be giving developers machines that are too fancy, as the resulting code may not work well, or at all, on more typical consumer devices. A complication here is that developers may be able to "shop around" for a job that will give them a gonzo desktop system with all the memories. Someone was pondering legislation to restrict computers to some limited specification, but that would probably only drive computers underground, as one might suspect happened somewhere in the "Dune" universe.
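Back to the gene parsing, the duplication problem in Perl looks something like the following made-up illustration (not their actual code); the first sub copies the whole list on every call, the second only passes a single reference around:

```
#!/usr/bin/env perl
use 5.36.0;

my @genes = map { "gene$_" } 1 .. 1_000_000;

sub count_copy (@list) { return scalar @list }     # whole list copied into @list
sub count_ref ($genes) { return scalar @$genes }   # only one scalar passed

say count_copy(@genes);      # duplicates a million elements, every call
say count_ref( \@genes );    # no duplication
```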
Only parsing up to a certain (too low) limit is called the "C disease" by some, but even with raised limits the limits are still there: at some point the parser will need to say "uncle" and give up, because a sender is still sending a word of infinite length and some limit (available memory, time to complete the job) has been reached. Google, for example, used to offer unlimited free storage. What happened to that? To veer into philosophy, unlimited free storage may match the "singularity" model (more better faster stronger), as opposed to a model that has limits, drawbacks, and diminishing returns.
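As for saying "uncle" in Perl, a minimal sketch with made-up limits: a reference in $/ makes readline return fixed-size records, and the reader gives up once a single run of word characters grows past an arbitrary cap.

```
#!/usr/bin/env perl
use 5.36.0;

my $MAXWORD = 1024;    # the "enough is enough" limit, arbitrary here
$/ = \4096;            # readline now returns records of at most 4096 bytes

my $run = 0;           # length of the word currently being read
while ( defined( my $rec = readline *STDIN ) ) {
    for my $piece ( split /(\W+)/, $rec ) {
        if ( $piece =~ m/^\w/ ) {
            $run += length $piece;
            die "uncle: word longer than $MAXWORD bytes\n" if $run > $MAXWORD;
        }
        else { $run = 0 }    # hit a separator, start counting afresh
    }
}
```

Memory use here is bounded by the record size no matter what the sender does; the only thing that grows is a counter.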