💾 Archived View for tirohia.smol.pub › 1701836714 captured on 2024-06-16 at 12:02:27. Gemini links have been rewritten to link to archived content
There was a bit of a delay, but I think this thought was triggered by a post on Mastodon a few days ago, with someone asking how to improve handling and analysis of large datasets in R. My off-the-cuff response, imparted with what I considered a modicum of humor, was to suggest re-writing everything in Python. I assumed some people would miss the attempt at humor, and lo, I was right :)
The more serious response that eventually found its way to that conversation was that I think Python is better at the processing of large datasets, especially anything that needs to be processed routinely. It's a slight tangent, but I also think Python is better for anything that needs to go into production, or anything where writing secure code is required.
I suspect this is often taken as an attack on R. It ... sorta could be, but it's not. R has its uses - once you get over the (very steep) learning curve, it's a very powerful tool for analysis. I don't think it's the right tool for the job for processing large datasets though. Or for writing any public-facing content. I don't think I can be moved from the position that Shiny is awful.
Part of this is the way R is built. Part of it is that there is already a fairly large ecosystem of Python code, with thousands of people working on improving it, that R just doesn't have. R has a lot of packages, but I would hazard a guess that the vast majority of those were written by academics. And if there is one thing that my time as an academic has taught me, it is that academics need to accept that they are not software engineers.
Anyway, someone brought up the point that when working with large datasets in the machine learning space, your choice of high-level programming language (though I am still of the mind that R isn't a programming language, it's an analysis tool) isn't that important, as long as it can interface with a number of different tools that do handle large datasets well. This is a sentiment I agree with.
Recently, I had to re-write a bunch of code to use Apache Arrow tables instead of Pandas. I was passing data from a piece of Python code to a piece of R code using rpy2. As an aside, tools for differential expression/differential methylation analysis are very definitely in the category of "I wish someone would re-write this tool in Python". Apparently there is a limit to the size of the Pandas dataframe that can be passed this way, somewhere in the region of 50K samples. It might be possible to change this, but I found moving to Arrow tables to be the path of least resistance.
All of this led me to the thought, on the bus this morning, that I don't really care how Arrow tables are built. Or Pandas dataframes for that matter. I am assuming there's some Python used to build Pandas, and I think there's some C and Cython in there as well, but I just don't care. I just want the thing to work.
It then occurred to me that I have this same experience with Linux. I've been a primarily Linux desktop user since, oooo, something like Red Hat 3.1. Which is a long, long time. The command line is not a mystery to me. At a certain level though, interest stops. I think I've tried to compile a kernel once, but I largely don't care how the Linux kernel works, just that it does.
There are other parallels, but there the interest stops because I can't go any further. I like cooking things from scratch, I can find maize to turn into masa to turn into tortillas, but growing my own corn? It'd be nice, but it's so far down my list of things I'd like to do with my life that it just sorta drops off the list.
Nothing much else to add here, other than that I found it interesting for a short while to contemplate the point at which interest in a given chain of events stops, and to wonder why it stops there, and not somewhere else.