There's a thing in coding, for me at least, where you're trying to do something that seems trivial. So trivial that you convince yourself someone must have written a library, or even that there must be a native feature of the language you're using that accomplishes it in a line or two of clean, sensible code.
I've had that in the other half of my work life these past couple of weeks.
One of the things that makes biology hard is the number of variables. Down at the DNA level in the human genome there are approximately 20K genes that could be interacting in all sorts of ways. Very broadly speaking, RNA-seq lets us see the level of activity of each gene, so the problem becomes figuring out how much the activity of any particular gene has changed, and what part of that change might be responsible for whatever condition one is studying.
I study cancer: not looking for a cure for any particular type, but at how to improve the process of diagnosis, and, if we're lucky, provide more accurate prognoses.
One of the problems with machine learning/data science in biology is that the number of variables is always waaaay higher than the number of samples. In an ideal world it would be the other way around; far more samples than variables makes classifying things much easier.
There's a step called feature selection, early on in the process, where you try to reduce the number of features in your dataset, either by getting rid of features that don't appear to change between your normal and test conditions, or by combining features or parts of features in a variety of ways. A quick, easy and potentially flawed method of picking relevant features is to pick the ones that are the most variable (sketched below). There are problems with this, but as a first pass it's an okay place to start.
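A minimal sketch of that variance-based first pass, assuming the data sits in a samples × probes matrix of beta values; the file name and layout here are hypothetical, not my actual pipeline:

```python
import pandas as pd

# Hypothetical input: rows are samples, columns are methylation probes,
# values are beta values in [0, 1]. The file name is illustrative only.
betas = pd.read_csv("methylation_betas.csv", index_col=0)

K = 2000  # papers commonly keep the top 1K, 2K or 10K

# Per-probe variance across all samples.
probe_var = betas.var(axis=0)

# Keep only the K most variable probes.
selected = betas[probe_var.nlargest(K).index]

print(f"Reduced from {betas.shape[1]} to {selected.shape[1]} features")
```

The thing to notice is that the condition labels are never consulted: the selection is entirely unsupervised.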
If I get RNA-seq data, there are tools for figuring out changes in activity when comparing normal cells to cancerous ones. I'm working with methylation data at the moment: a measure of the activity of a regulatory mechanism that acts on DNA. After some fairly strenuous filtering and quality control, I end up with roughly 320K methylation points, i.e. roughly 320K variables to consider. The same tools that can be used to identify changes in RNA activity can be used for methylation.
Why then, I have come to ask myself these past couple of weeks, are we not using these tools for feature selection? Looking at blood cancers and CNS cancers, differential methylation analysis, even applied very crudely, has been delivering me some pretty good sets of features for identifying a cancer's cell of origin.
A lot of papers pick the top 1K, 2K or even 10K most highly variable features. Theoretically, all the noise in those sets, the features that are variable but aren't actually relevant to the cancer, should end up cancelling itself out.
I'm pulling out good classification results using maybe a hundred or so methylation probes that have been selected because they are differentially methylated between conditions. There's a rationale for choosing them.
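The shape of what I mean, as a hedged sketch: a per-probe test between conditions, a multiple-testing correction, then keep the most significant probes. A real analysis would lean on something like limma or minfi in R; the array shapes and the function name below are assumptions for illustration only.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def select_differential_probes(betas, is_tumour, n_probes=100, eps=1e-6):
    """Crude differential-methylation feature selection (illustrative).

    betas: (n_samples, n_probes) array of beta values in [0, 1]
    is_tumour: boolean array marking tumour vs. normal samples
    """
    # M-values are statistically better behaved than raw beta values.
    m = np.log2((betas + eps) / (1 - betas + eps))

    # Welch's t-test per probe, tumour vs. normal.
    _, pvals = stats.ttest_ind(m[is_tumour], m[~is_tumour], axis=0, equal_var=False)

    # Benjamini-Hochberg FDR correction across the ~320K tests.
    _, qvals, _, _ = multipletests(pvals, method="fdr_bh")

    # Keep the probes with the smallest adjusted p-values.
    return np.argsort(qvals)[:n_probes]
```

Unlike the variance ranking above, this selection is supervised: every probe kept has a reason to be there.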
This seems simple. And now I'm wondering why this isn't the norm. Am I missing something?