💾 Archived View for thebird.nl › blog › 2021 › 08 › gemma-wrapper-local-non-loco.gmi captured on 2023-01-29 at 02:17:00. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

Run gemma-wrapper locally w.o. LOCO

In the next step we are going to speed up GEMMA without LOCO. I never thought about it before, but recently I realised we have the tools to do this.

NOTE

Unfortunately this method did not deliver what I wanted. With LOCO it works because the separate matrices need to run over the full set anyway. The parallel version w.o. LOCO runs, but is no faster due to the extra (and competing) IO for every process. To make this perform I need to split the input genotype file into smaller sections first. Now, dealing with input files is something I have to get right anyway and it belongs in the bigger picture. So rather than working on this early gemma-wrapper 'hack' I am postponing running parallel processes.

Tasks

[X] Adapt gemma-wrapper to allow for split by CHR w.o. LOCO
[ ] Deploy on GN2/Tux01

First check for overwrites of gemma output

Another thing I thought of is that competing cache writes for gemma-wrapper may overwrite each other when identical runs go at the same time - we need to make that more transactional. gemma2/lib does not allow overwriting of files by default - it throws an error which is not necessarily the behaviour we want on the context of a web service. gemma-wrapper should handle that more gracefully by first writing into a unique (isolated) tempdir and then making sure there are no overwrites (as we can assume the hashed output names are identical). New release v0.99.2 adds this, see commit f4387449731af9f673d9e0ca3e82e4524a9bdf25

[X] gemma-wrapper write to tmpdir and check before overwrite

Adapt gemma-wrapper to compute by chromose w.o. LOCO

gemma already can compute a subset of SNPs by chromosome using the -loco switch. The difference with a normal LOCO is that one kinship matrix (GRM) is used. Running

gemma-wrapper --json -- \
        -g test/data/input/BXD_geno.txt.gz \
        -p test/data/input/BXD_pheno.txt \
        -gk \
        -debug > K.json

renders in K.json:

"files":[[null,"/tmp/da39a3ee5e6b4b0d3255bfef95601890afd80709.log.txt","/tmp/da39a3ee5e6b4b0d3255bfef95601890afd80709.cXX.txt"]]

while the LOCO version renders

"files":[["1","/tmp/da39a3ee5e6b4b0d3255bfef95601890afd80709.1.cXX.txt.log.txt","/tmp/da39a3ee5e6b4b0d3255bfef95601890afd80709.1.cXX.txt.cXX.txt"],["2","/tmp/da39a3ee5e6b4b0d3255bfef95601890afd80709.2.cXX.txt.log.txt","/tmp/da39a3ee5e6b4b0d3255bfef95601890afd80709.2.cXX.txt.cXX.txt"],["3" ...]]

To invoke GWA we know there is one single GRM, but we don't know the chromosomes yet. The logical thing to do is parse the SNP annotation file which contains the CHR as the third field:

rs31879829      4518714 1
rs36742481      4776319 1
rs6365999       4820983 1
(...)

In commit afdffb637516cb4f141d1f7211642c82ba5fcd34 I changed the behaviour of gemma-wrapper. Now the annotation file is automatically scanned and chromosomes are not passed in through the command line (less error prone too!). I added an optional --chromosomes switch to allow for limiting scans.

Adapt GEMMA w. LOCO to allow for split by CHR w.o. LOCO

In the next phase we need to adapt gemma-wrapper to analyse the right chromosomes. With LOCO a set of SNPs is selected that refers to a chromosome in `setGWASnps`. This set is constructed in a function I wrote years ago named `LOCO_set_Snps` part of src/param.cpp. What it does is create on set for computing K (all SNPs except those on CHR) and GWA (only those SNPs on CHR). These sets are therefore complimentary. The good news is that for GWA we can use the LOCO switch as is, because we want to compute a CHR at a time using a full GRM.

With gemma-wrapper we call

k_files.each do | chr, kfn | # call a GWA for each chromosome
  gwas.call(chr,kfn,tmp_pfn,permutation)
end

Without LOCO we can do the same but use the same kfn over and over. I implemented that in commit f0650ad058cbccacca4662b8ed17b1f7f1006e66.

Changes

gemma-wrapper had to change a bit. First of all 0.99.2 should is a drop-in replacement for previous versions with increased robustness against file overwrites. Later git commits, meanwhile, include a change in behaviour:

--loco switch does not accept chromosome numbers but computes from annotation file
-a annotation switch is now obligatory
JSON output for GWA without LOCO is now the same layout as GWA with LOCO. I.e., the output files are split by chromosome.

See the README for examples

https://github.com/genetics-statistics/gemma-wrapper#usage