Run gemma-wrapper locally

Introduction

Even though it started as a simple script, gemma-wrapper does a lot of work already. It handles caching and job submitting to slurm:

Gemma-wrapper

In this work I plan the following steps:

In the process gemma-wrapper will be rewritten completely.

GNU Parallel

GEMMA runs a process for every chromosome and given enough resources we could all run them at the same time. Instead of invoking a shell command straightaway, gemma-wrapper now creates a set of commands and feeds them to GNU Parallel with a 10x speedup. This speedup only affects running LOCO the first time on a trait. I.e., it does not affect later runs on the same trait (because we cache results in GeneNetwork) and it does not affect running GEMMA without LOCO. Without the --parallel switch gemma-wrapper runs the standard BXD LOCO kinship computation in 14 seconds on Penguin2 and with GNU parallel it went down to 1.5 seconds:

# without parallel
real    0m14.090s
user    0m23.324s
sys     0m16.592s
# with parallel
real    0m1.611s
user    0m22.660s
sys     0m14.960s

and with GWA we get similar results. Before we get all excited note that all processes are running in RAM - so you have to have enough memory in the machine. So we need to check RAM:

Check RAM

When running out of RAM GEMMA will fail rapidly because matrices get built at the start. So, the procedure should be to run the parallel version and - on failure - fall back to the non-parallel version. Maybe the solution is embarrissingly simple: run it again without parallel if parallel fails. Run parallel twice with less processes because we cache results. It simply ignores what has been computed. So, run all chromosomes, half chromosomes, quarter chromosomes at a time and finally serially.

I.e., each step continues if the previous one fails. Implementing this will make the --parallel switch superfluous as we can simply start with the parallel version.

I updated the Guix genetwork2 package to have GNU parallel and the new gemma-wrapper and the new profile lives on Tux01 as /usr/local/guix-profiles/gn-latest-20210711

implementation

released gemma-wrapper 0.99.1

Ruby gem update

Updated Guix package

Deploy on Tux01

When Zach tried GEMMA on our production machine (Tux01). After updating the Guix profile and adding an update for gemma-wrapper computations for LOCO were much faster. Only one snag: with caching the results differed! Interestingly they differed for the lower hits, not for the high ones. This should not be possible(!). I.e., a cache file is named on a hash over its inputs. The only thing that can have changed is gemma itself.

Now, we are changing GEMMA on Tux01, so that can be the source of the problem. To check:

For production Zach is using

export GN2_PROFILE=/export/local/home/zas1024/opt/genenetwork2

with gemma-wrapper at

gemma-wrapper 0.99.1 (Ruby 2.6.5) by Pjotr Prins 2017-2021
Using GEMMA 0.98.4 (2020-12-15) by Xiang Zhou and team (C) 2012-2020

which is running the correct versions. So I reran the trait

https://genenetwork.org/show_trait?trait_id=10016&dataset=CFWPublish

Looking at the logs GN2 production is already running the parallel version(!?). E.g. for CHR Y

/gnu/store/yyrfmg0i18w190l4lb21cv86fqclalsk-gemma-gn2-git-0.98.3-47221d6/bin/gemma -no-check -g /home/gn2/production/genotype_files/genotype/bimbam/CFW_geno.txt -p /home/gn2/production/tmp/gn2/gn2/CFW_10016_SPVRK5.txt -a /home/gn2/production/genotype_files/genotype/bimbam/CFW_snps.txt -lmm 9 -maf 0.05 -outdir /home/gn2/production/tmp/gn2 -k /home/gn2/production/tmp/gn2/d9d673d3fb4723afacd40f6c679db9f3bc571af3.Y.cXX.txt.cXX.txt -o fbc433e88dd14a081da60d9fa56bf8cfac9eb52a.Y.assoc.txt -loco Y

(a useful test for running GEMMA!)

Mayber we are using an older version of gemma here which supposedly does not have support for -lmm 9 switch(!) Weirdly the wrapper reports also here

GEMMA 0.98.4 (2020-12-15) by Xiang Zhou and team (C) 2012-2020.

I'll check that. Turns out Guix injects

gemma_command = '/gnu/store/yyrfmg0i18w190l4lb21cv86fqclalsk-gemma-gn2-git-0.98.3-47221d6/bin/gemma'

which actually has above header and does not bail on the -lmm 9 switch, which looks like:

error! unknown analysis mode: 9

so, it is just a package definition. No worries.

Rerunning CWF gemma mapping gives the same results. The cache works as expected, the only thing that may be different is the handling of covariates:

/gnu/store/yyrfmg0i18w190l4lb21cv86fqclalsk-gemma-gn2-git-0.98.3-47221d6/bin/gemma -no-check -g /home/gn2/production/genotype_files/genotype/bimbam/BXD_geno.txt -p /home/gn2/production/tmp/gn2/gn2/BXD_12333_DE7U7A.txt -c /home/gn2/production/genotype_files/mapping/BXD_covariates.txt -a /home/gn2/production/genotype_files/genotype/bimbam/BXD_snps.txt -lmm 9 -maf 0.05 -outdir /home/gn2/production/tmp/gn2 -k /home/gn2/production/tmp/gn2/fa26737b1b4416b92b2e8d0a1334f9aa637829c8.Y.cXX.txt.cXX.txt -o 80261c3f8d3c542799faafbe88c7393b5243e5ae.Y.assoc.txt -loco Y

The covariates file is non-uniquely named. Same for CHR 12:

gn2@tux01:~/tmp$ /gnu/store/yyrfmg0i18w190l4lb21cv86fqclalsk-gemma-gn2-git-0.98.3-47221d6/bin/gemma -no-check -g /home/gn2/production/genotype_files/genotype/bimbam/BXD_geno.txt -p /home/gn2/production/tmp/gn2/gn2/BXD_12333_DE7U7A.txt -c /home/gn2/production/genotype_files/mapping/BXD_covariates.txt -a /home/gn2/production/genotype_files/genotype/bimbam/BXD_snps.txt -lmm 9 -maf 0.05 -outdir /home/gn2/production/tmp/gn2 -k /home/gn2/production/tmp/gn2/fa26737b1b4416b92b2e8d0a1334f9aa637829c8.12.cXX.txt.cXX.txt -o 80261c3f8d3c542799faafbe88c7393b5243e5ae.12.assoc.txt -loco 12
GEMMA 0.98.4 (2020-12-15) by Xiang Zhou and team (C) 2012-2020
Reading Files ...

## number of total individuals = 236
## number of analyzed individuals = 75
## number of covariates = 2
## number of phenotypes = 1
## leave one chromosome out (LOCO) =       12
## number of total SNPs/var        =     7321
## number of SNPS for K            =     7013
## number of SNPS for GWAS         =      308
## number of analyzed SNPs         =     7321
Start Eigen-Decomposition...
pve estimate =0.759057
se(pve) =0.197965
================================================== 100%

ls -lrt /home/gn2/production/tmp/gn2/80261c3f8d3c542799faafbe88c7393b5243e5ae.12.assoc.txt*
-rw-r--r-- 1 gn2 gn2  1726 Aug  5 09:40 /home/gn2/production/tmp/gn2/80261c3f8d3c542799faafbe88c7393b5243e5ae.12.assoc.txt.log.txt
-rw-r--r-- 1 gn2 gn2 27145 Aug  5 09:40 /home/gn2/production/tmp/gn2/80261c3f8d3c542799faafbe88c7393b5243e5ae.12.assoc.txt.assoc.txt

Apart from the dodgy file naming GEMMA+covariates appears to work fine.

Continued...