💾 Archived View for dmerej.info › blog › 0103-i-am-a-rusty-frog.gmi captured on 2021-12-06 at 14:29:53. Gemini links have been rewritten to link to archived content
⬅️ Previous capture (2021-11-30)
-=-=-=-=-=-=-
2019, Jul 27 - Dimitri Merejkowsky License: CC By 4.0
Do you know the fable of the frog and the boiling water? It goes like this:
If you drop a frog in a pot of boiling water, it will of course frantically try to clamber out. But if you place it gently in a pot of tepid water and turn the heat on low, it will float there quite placidly. As the water gradually heats up, the frog will sink into a tranquil stupor, exactly like one of us in a hot bath, and before long, with a smile on its face, it will unresistingly allow itself to be boiled to death.
At least that's how Daniel Quinn's tells it in *The Story of B*.
It is a powerful metaphor, although if you read the Wikipedia article[1], it turns out frogs do not actually behave that way.
1: https://en.wikipedia.org/wiki/Boiling_frog
But I think there's still some truth it in: if you experience some increasing pain gradually enough, you may not realizing how much you are actually suffering until it's too late. At least it's what happened to me.
Let's start at the beginning, before we start talking about Rust, refactoring, performance bugs and the weaknesses of the human mind.
A long time ago, I realized I spent far too much time typing `cd` to navigate between project directories, and that the default `ctrl-r` binding for `zsh` was a bit awkward to use for me.
Then I realized I could solve both of those problems by writing two very similar commands: `cwd-history` and `cmd-history`.
Let's talk about the former. It's a command-line tool that takes two actions, `add` and `list`. Here's how it works:
I use a similar program called `cwd-history` that does exactly the same thing but for the working directories. It runs in another zsh hook called `chpwd`, and is triggered by typing the command `z`.
There are a few specifics like making sure the entries are not duplicated or ensuring that every symlink in the added path is resolved, but we'll get to that later.
The two programs were written in Python and shared a ton of functionality but since I am a lazy programmer I did not bother with refactoring the code.
But something was bugging me. There was a small noticeable delay whenever I was using them, particularly on old hardware. So I decided to try and rewrite them in Rust. And surely when I measured the performance, there was a *huge* difference in the startup time: Rust compiles directly to machine code, and there was no need to wait for the Python interpreter to start. I was happy.
A few months ago, I switched from Neovim to Kakoune[2] for my main text editor and was faced with a problem.
Kakoune does not have any persisting state by design, and I was missing a particular vim feature that allowed me to quickly select any recently opened file (aka. <abbr title="Most Recently Used">MRU</abbr> files).
But I already knew how to solve that! I wrote yet another tool called `mru-files` for reading and writing the MRU files in another database, and write a bit of Kakoune script:
# Call `mru-files add` with the buffer name when opening a new buffer hook global BufOpenFile .* %{ nop %sh{ mru-files add "${kak_hook_param}" } } # Call `mru files list` when pressing <leader key>o: map global user o -docstring 'open old files' ': evaluate-commands %sh{ mru-files list --kakoune }<ret>'
Note the `--kakoune` option when calling `mru-files list`: we need to call the `menu` command so that we can present the list of MRU files to the user and open the file when it's selected.
It looks like this:
menu "/foo.txt" "edit -existing 'foo.txt:'" "/bar.txt" "edit -existing 'bar.txt'" ...
While I was it, I also added a `--kakoune` argument to `cwd-history list` so that I could switch to previously visited directories directly from Kakoune:
map global cd o -docstring 'open old working directory' ': evaluate-commands %sh{ cwd-history list --kakoune }<ret>'
So now I was faced with *three* very similar code written in Rust. It was time for a refactoring .
Here's what I did:
pub trait EntriesCollection { fn add(&mut self, entry: &str); fn list(&self) -> &Vec<String>; }
pub struct Commands { entries: Vec<String>, } impl EntriesCollection for Commands { // ... fn add(&mut self, entry: &str) { // Deduplicate the entry and store it in self.entries // ... } fn list(&self) -> Vec<String> { // return self.entries } }
pub struct WorkingDirs{ entries: Vec<String>, } impl EntriesCollection for WorkingDirs { // ... fn add(&mut self, entry: &str) { // Convert the entry to a path, check if it exists, // then canonicalize it and insert it in self.entries } }
(and similar code for MruFiles)
impl Storage { pub fn new( mut entries_collection: Box<EntriesCollection>, path: &std::path::PathBuf, ) -> Storage { let db_path = path.join(entries_collection.name()); let entries = read_db(&db_path); for entry in entries { entries_collection.add(&entry); } Storage { db_path, entries_collection, } } pub fn list(&self) -> &Vec<String> { &self.entries_collection.list() } pub fn add(&mut self, entry: &str) { &mut self.entries_collection.add(&entry); write_db(&self.db_path, &self.list()) }
pub enum StorageType { CwdHistory, CommandsHistory, FilesHistory, } pub struct StorageManager { // ... } impl StorageManager { pub fn new(storage_type: StorageType) -> Self { let entries: Box<EntriesCollection> = match storage_type { StorageType::CwdHistory => Box::new(WorkingDirs::new()), StorageType::CommandsHistory => Box::new(Commands::new()), StorageType::FilesHistory => Box::new(MruFiles::new()), }; }; let storage = Storage::new(entries, &app_dir); StorageManager { storage } } pub fn list(&self) { // Delegates to self.storage.list() } pub fn add(&mut self, entry: &str) { // Delegates to self.storage.add() } }
// In src/bin/cmd-history.rs use bwataout::StorageType; fn main() { // ... bwataout::run_storage_manager(StorageType::CommandsHistory) }
// In src/bin/cwd-history.rs use bwataout::StorageType; fn main() { // ... bwataout::run_storage_manager(StorageType::CwdHistory) }
// In src/bin/mru-files.rs.rs use bwataout::StorageType; fn main() { // ... bwataout::run_storage_manager(StorageType::FilesHistory) }
And that was my first mistake: I forgot to measure performance *after* the refactoring. I was *sure* the code was correct. After all, if it compiles, it works, right?
And sure enough, the code *did* work. I could open MRU files and old project directories from Kakoune, and I was quite pleased with myself.
Of course, by now you should have guessed there's a horrible performance bug in the code above. Did you spot it? If you did, congrats! I certainly did not at the time.
The refactoring took place 6 months ago. I am using those tools at work, which means I was using those commands for 6 months, 5 days a week. The database files contained thousands of entries. The horrible bug was still there, consuming CPU cycles without a care in the world, and I did not notice anything.
Fortunately, I did a peer programming session with one of my colleagues and something weird happened. He got *bored* watching my zsh prompt getting stuck when I was using the `z` and `ctrl-r` shortcuts.
I then measured the performance of my tools again and I found that the `cmd-history list` command took 1.7 **seconds** to run. That's *one thousand and seven hundred* milliseconds. It's a **very long** time. No wonder my colleague got bored! What's amazing though is that I did *not* notice anything wrong until he told me about it.
Now it's time to squash the bug. The culprit lies here, in the `Storage` constructor:
impl Storage { pub fn new( mut entries_collection: Box<EntriesCollection>, path: &std::path::PathBuf, ) -> Storage { let db_path = path.join(entries_collection.name()); let entries = read_db(&db_path); for entry in entries { entries_collection.add(&entry); } Storage { db_path, entries_collection, } } }
We're reading each line of the database file, and passing it to `EntriesCollection.add()`. This means we keep calling the `add()` method over and over. It does not do much, but it still has to go through *all* the entries when performing deduplication. This is a classic case of the Shlemiel algorithm[3] and it explains the abysmal performance of the tool as soon as the database gets big enough.
3: https://www.joelonsoftware.com/2001/12/11/back-to-basics/
I believe I wrote the code that way because I thought it would be nice to somehow "validate" the entries when reading the database. That way, if the algorithm in the `add` method changed, the database will be "migrated" automatically (for instance, all non-existing paths still present in the db will automatically disappear). Another classical mistake named *<abbr title="You Ain't Gonna Need It">YAGNI<abbr>*: it's doubtful I'll ever need to "migrate" the database, and when I need to, I'll probably just have to write a tiny throw-away script to do it.
Anyway, now that we've decided the "auto migrating" feature can go away, we can solve our performance issue by adding an `add_all` method to the trait, and replacing the `for` loop in the `Storage` constructor:
pub trait EntriesCollection { fn add(&mut self, entry: &str); + fn add_all(&mut self, entries: Vec<String>); fn list(&self) -> &Vec<String>; }
impl Storage { pub fn new( mut entries_collection: Box<EntriesCollection>, path: &std::path::PathBuf, ) -> Storage { let db_path = path.join(entries_collection.name()); let entries = read_db(&db_path); - for entry in entries { - entries_collection.add(&entry); - } + entries_collection.add_all(entries); Storage { db_path, entries_collection, } } }
In turn, this means we need to implement the `add_all()` method on all structs:
impl EntriesCollection for Commands { // ... fn add_all(&mut self, entries: Vec<String>) { self.entries = entries; } }
impl EntriesCollection for WorkingDirs { // ... fn add_all(&mut self, entries: Vec<String>) { self.entries = entries; } }
impl EntriesCollection for MruFiles { // ... fn add_all(&mut self, entries: Vec<String>) { self.entries = entries; } }
Surely you've noticed the implementation of `add_all()` is *exactly* the same in the three structs. Why did not we add blanket implementation of the `add_all()` method in the trait?
Well, because we need access to the `entries` field of the struct, and there's no way to do that in Rust. Since I'm in love with Rust, I can rationalize this in several ways:
Take those how you want: I can tell myself those are good arguments, but you don't have to believe me :)
Here are my takeaways:
Take care and see you another time!
----