💾 Archived View for thebird.nl › blog › 2021 › 08 › gemma-wrapper-local-non-loco.gmi captured on 2023-01-29 at 02:17:00. Gemini links have been rewritten to link to archived content
⬅️ Previous capture (2022-01-08)
-=-=-=-=-=-=-
In the next step we are going to speed up GEMMA without LOCO. I never thought about it before, but recently I realised we have the tools to do this.
Unfortunately this method did not deliver what I wanted. With LOCO it works because the separate matrices need to run over the full set anyway. The parallel version w.o. LOCO runs, but is no faster due to the extra (and competing) IO for every process. To make this perform I need to split the input genotype file into smaller sections first. Now, dealing with input files is something I have to get right anyway and it belongs in the bigger picture. So rather than working on this early gemma-wrapper 'hack' I am postponing running parallel processes.
Another thing I thought of is that competing cache writes for gemma-wrapper may overwrite each other when identical runs go at the same time - we need to make that more transactional. gemma2/lib does not allow overwriting of files by default - it throws an error which is not necessarily the behaviour we want on the context of a web service. gemma-wrapper should handle that more gracefully by first writing into a unique (isolated) tempdir and then making sure there are no overwrites (as we can assume the hashed output names are identical). New release v0.99.2 adds this, see commit f4387449731af9f673d9e0ca3e82e4524a9bdf25
gemma already can compute a subset of SNPs by chromosome using the -loco switch. The difference with a normal LOCO is that one kinship matrix (GRM) is used. Running
gemma-wrapper --json -- \ -g test/data/input/BXD_geno.txt.gz \ -p test/data/input/BXD_pheno.txt \ -gk \ -debug > K.json
renders in K.json:
"files":[[null,"/tmp/da39a3ee5e6b4b0d3255bfef95601890afd80709.log.txt","/tmp/da39a3ee5e6b4b0d3255bfef95601890afd80709.cXX.txt"]]
while the LOCO version renders
"files":[["1","/tmp/da39a3ee5e6b4b0d3255bfef95601890afd80709.1.cXX.txt.log.txt","/tmp/da39a3ee5e6b4b0d3255bfef95601890afd80709.1.cXX.txt.cXX.txt"],["2","/tmp/da39a3ee5e6b4b0d3255bfef95601890afd80709.2.cXX.txt.log.txt","/tmp/da39a3ee5e6b4b0d3255bfef95601890afd80709.2.cXX.txt.cXX.txt"],["3" ...]]
To invoke GWA we know there is one single GRM, but we don't know the chromosomes yet. The logical thing to do is parse the SNP annotation file which contains the CHR as the third field:
rs31879829 4518714 1 rs36742481 4776319 1 rs6365999 4820983 1 (...)
In commit afdffb637516cb4f141d1f7211642c82ba5fcd34 I changed the behaviour of gemma-wrapper. Now the annotation file is automatically scanned and chromosomes are not passed in through the command line (less error prone too!). I added an optional --chromosomes switch to allow for limiting scans.
In the next phase we need to adapt gemma-wrapper to analyse the right chromosomes. With LOCO a set of SNPs is selected that refers to a chromosome in `setGWASnps`. This set is constructed in a function I wrote years ago named `LOCO_set_Snps` part of src/param.cpp. What it does is create on set for computing K (all SNPs except those on CHR) and GWA (only those SNPs on CHR). These sets are therefore complimentary. The good news is that for GWA we can use the LOCO switch as is, because we want to compute a CHR at a time using a full GRM.
With gemma-wrapper we call
k_files.each do | chr, kfn | # call a GWA for each chromosome gwas.call(chr,kfn,tmp_pfn,permutation) end
Without LOCO we can do the same but use the same kfn over and over. I implemented that in commit f0650ad058cbccacca4662b8ed17b1f7f1006e66.
gemma-wrapper had to change a bit. First of all 0.99.2 should is a drop-in replacement for previous versions with increased robustness against file overwrites. Later git commits, meanwhile, include a change in behaviour:
See the README for examples