💾 Archived View for danieljanus.pl › blog › en › 2020 › 05 › 08 › making-of-clojure-dependency › ind… captured on 2021-12-04 at 18:04:22. Gemini links have been rewritten to link to archived content
⬅️ Previous capture (2021-11-30)
-=-=-=-=-=-=-
In my previous post, “Clojure as a dependency” [1], I’ve presented the results of some toy research on Clojure version numbers seen in the wild. I’m a big believer in reproducible research [2], so I’m making available a Git repo [3] that contains code you can run yourself to reproduce these results. This post is an experience report from writing that code.
There are two main components to this project: acquisition and analysis of data (implemented in the namespaces “versions.scrape” and “versions.analyze”, respectively). Let’s look at each of these in turn.
This step uses the GitHub API v3 [4] to:
As hinted by the namespace, I’ve opted to use Skyscraper [8] to orchestrate the process. It would arguably have been simpler to use GitHub’s GraphQL v4 API [9], but I wanted to showcase Skyscraper’s custom parsing facilities.
There’s no actual HTML scraping going on (all processors use either JSON or Clojure parsers), but Skyscraper is still able to “restructure” the result – traverse the graph endpoint in a manner similar to that of GraphQL – with very little effort. It would have been possible with any other RESTful API. Plus, we get goodies like caching or tree pruning for free.
Most of the code is straightforward, but parsing of “project.clj” merits some explanation. Some of my initial assumptions proved incorrect, and it’s fun to see how. I initially tried to use clojure.edn [10], but Leiningen project definitions are not actually EDN – they are Clojure code, which is a superset of EDN. So I had to resort to “read-string” from core – with “*read-eval*” bound to nil (otherwise the code would have a Clojure injection vulnerability – think Bobby Tables [11]). Needless to say, some “project.clj”s turned out to depend on read-eval.
Some projects (I’m looking at you, Closh [12], Babashka [13] and sci [14]) keep the version number outside of “project.clj”, in a text file (typically in “resources/”), and slurp it back into “project.clj” with a read-eval’d expression:
(defproject closh-sci #=(clojure.string/trim #=(slurp "resources/CLOSH_VERSION")) …)
A trick employed by one project, Metabase [15], is to dynamically generate JVM options containing a port number at parse time, so that test suites running at the same time don’t clash with each other:
#=(eval (format "-Dmb.jetty.port=%d" (+ 3001 (rand-int 500))))
Finally, it turned out that “defproject” is not always a first form in “project.clj”. Some projects, like bridge [16], only contain a placeholder “project.clj” with no forms; others, like aleph [17], first define some constants, and then refer to them in a “defproject” form. If those constants contain parts of the dependencies list, then those dependencies won’t be processed correctly. Fortunately, not a lot of projects do this, so it doesn’t skew the results much.
Anyway, the end result of the acquisition phase is a sequence of maps describing project definitions. They look like this:
{:name "clojure-koans", :full-name "functional-koans/clojure-koans", :deps-type :leiningen, :page 1, :deps {org.clojure/clojure #:mvn{:version "1.10.0"}, koan-engine #:mvn{:version "0.2.5"}}}, :profile-deps {:dev {lein-koan #:mvn{:version "0.1.5"}}}
Homogeneity is important: every dependency description has been converted to the cli-tools format, even if it comes from a “project.clj”.
I’ve long been searching for a way to do exploratory programming in Clojure without turning the code to a tangled mess, portable only along with my computer.
Exploratory (or research) programming is very different from “normal” programming. In the latter, most of the time you typically focus on a coherent project – a program or a library. In contrast, in the former, you spend a lot of time in the REPL, trying all sorts of different things and “def”ing new values derived from already computed ones.
This is very convenient, but it’s extremely easy to get carried away in the REPL and get lost in a sea of “def”s. If you want to redo your computations from scratch, just about your only option is to take your REPL transcript and re-evaluate the expressions one by one, in the correct order. Cleaning up the code (e.g. deglobalizing) as you go is very difficult.
I’ve found an answer: Plumatic Graph [18], part of the plumbing [19] library. There are a plethora of uses for it: for example, at Fy [20], my current workplace, we’re using it to define our test fixtures. But as it turns out, it makes exploratory programming enjoyable.
The bulk of code in versions.analyze [21] consists of a big definition of a graph, with nodes representing computations – things that I’d normally have “def”’d in a REPL. Consequently, most of these definitions are short and to the point. I also gave the nodes verbose, descriptive, explicit names. Name and conquer. “raw-repos” is the output from data acquisition, “repos” is an all-important node containing those “raw-repos” that were successfully parsed, and most other things depend on it.
It also doesn’t obstruct much the normal REPL research flow. My normal workflow with REPL and Graph is something along the lines of:
1. “(def result (main))”
2. evaluate something using inputs from “result”
3. nah, it leads nowhere
4. evaluate something else
5. hey, that’s interesting!
6. add a new node to the graph definition
7. GOTO 1
Thanks to Graph’s lazy compiler, I can re-evaluate anything at need and have it evaluate only the things needed, and nothing else. Also, because the graph is explicit, it’s fairly easy to visualize it [22]. (Click the image to open it in full-size in another tab.)
[23]
Because it’s lazy, it doesn’t hurt to put extra things in there just in case, even when you’re not going to report them. For example, I was curious what things besides a version number people put in dependencies. “:exclusions”, for sure, but what else? This is the “:what-other-things-besides-versions” node.
Imagine my surprise when I found “:exlusions” (sic) in there, which turned out to be a typo in shadow-cljs’ “project.clj”! I submitted a PR [24], and Thomas Heller merged it a few days after.
My only gripe with Graph is that it runs somewhat contrary to the current trends in the Clojure community: for example, it doesn’t support namespaced keywords (although there’s an open ticket [25] for that). But on the whole, I’m sold. I’ll definitely be using it in the next piece of research in Clojure, and I’m on a lookout for something similar in pure R. If you know something, do tell me!
The plot from previous post has been generated in pure R, using ggplot2 [26] (an extremely versatile API). Clojure generates a CSV with munged data, and then R reads that CSV as a data frame and generates the plot in a few lines.
I’ve briefly played around with clojisr [27], a bridge between Clojure and R. It was an enlightening experiment, and it would let me avoid the intermediate CSV, but I decided to ditch it for a few reasons:
But it’s definitely very promising! I applaud the effort and I’ll keep a close eye on it.