mgamma kinship computations

mgamma master page

Computing kinship matrices K for LOCO is quite expensive for larger datasets. In GeneNetwork, so far, we compute K for every case and cache the results. That is not a bad approach, but if K could be computed on the fly in seconds, the first run would be fast and we would not need the cache at all.

Also, previously we computed K for every phenotype - that is, individuals with missing phenotype values were removed from the computation. This should not be necessary, as the distance between two individuals is a fixed number between 0 and 1 based only on the genotypes involved. Down the line we will look at reducing the genotypes, but for the final result the distance should not change much.
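
To make that concrete, here is a minimal Zig sketch of a single kinship entry: a scaled inner product over the genotype vectors of two individuals, so it depends on genotypes only, not on any phenotype. The 0/1/2 genotype coding, the toy data and the scaling are assumptions for illustration; this is not the actual mgamma code and GEMMA-style marker centering is left out.

```
const std = @import("std");

// Minimal sketch of a single kinship entry: a scaled inner product of the
// genotype vectors of individuals i and j, so it depends on genotypes only.
// The 0/1/2 coding, the toy data and the scaling are illustrative assumptions.

const m = 4; // markers
const n = 3; // individuals

// geno[marker][individual]
const geno = [m][n]f64{
    .{ 0, 2, 2 },
    .{ 2, 2, 0 },
    .{ 0, 0, 2 },
    .{ 2, 2, 2 },
};

fn kinship(i: usize, j: usize) f64 {
    var sum: f64 = 0;
    var k: usize = 0;
    while (k < m) : (k += 1) sum += geno[k][i] * geno[k][j];
    // divide by markers times the squared maximum genotype so K stays in [0,1]
    return sum / (m * 4);
}

pub fn main() void {
    var i: usize = 0;
    while (i < n) : (i += 1) {
        var j: usize = 0;
        while (j < n) : (j += 1) std.debug.print("{d:.3} ", .{kinship(i, j)});
        std.debug.print("\n", .{});
    }
}
```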

Matrix computation

Here we track zig matrix computations:

zig-matrix-multiplication.gmi

Output

Previously we stored all kinship information in separate files. We are changing this by using lmdb as a key-value store. We can also store only half the data because kinship matrices are symmetric.
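
As a sketch of the storage side, assuming row-major packing of the upper triangle (the lmdb put/get calls themselves are left out, only the index arithmetic is shown):

```
const std = @import("std");

// Sketch of the symmetric storage idea: a symmetric n x n matrix only needs
// its upper triangle, packed row by row into n*(n+1)/2 values. That flat
// buffer is what would be written to lmdb as a single value; the lmdb calls
// themselves are not shown here.

fn packedIndex(n: usize, i: usize, j: usize) usize {
    // assumes i <= j, i.e. (i,j) lies in the upper triangle
    return i * n - (i * (i + 1)) / 2 + j;
}

fn get(packed_k: []const f64, n: usize, i: usize, j: usize) f64 {
    // symmetry: swap indices when asked for the lower triangle
    return if (i <= j)
        packed_k[packedIndex(n, i, j)]
    else
        packed_k[packedIndex(n, j, i)];
}

test "packed symmetric storage" {
    const n = 3;
    // full matrix:      packed upper triangle:
    //  1.0 0.2 0.3
    //  0.2 1.0 0.4  ->  1.0 0.2 0.3 1.0 0.4 1.0
    //  0.3 0.4 1.0
    const packed_k = [_]f64{ 1.0, 0.2, 0.3, 1.0, 0.4, 1.0 };
    try std.testing.expect(packed_k.len == n * (n + 1) / 2);
    try std.testing.expect(get(&packed_k, n, 2, 1) == 0.4);
    try std.testing.expect(get(&packed_k, n, 1, 2) == 0.4);
}
```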

The first iteration of reading lmdb, computing K and writing lmdb is already quite a bit faster than the comparable gemma command:

mgamma compute one K without LOCO

wrk@napoli /export/local/home/wrk/iwrk/opensource/code/genetics/mgamma [env]$ zig build -Drelease-fast=true && zig test mgamma.zig && time ./zig-out/bin/mgamma
LLVM Emit Object... LLVM Emit Object... Test [0/2] test "lmdb-setup"... Test [1/2] test "compute matrix with its transpose"... All 2 tests passed.
mgamma loading test/data/bxd...

nrows,ncols = 7321,236
top left value is 0, bottom right is 2
top left value is 0.0e+00, bottom right is 0.0e+00
top left value is 3.126e+03, top right is 4.528e+03, bottom left is 4.528e+03, bottom right is 2.0119e+04
mgamma writing K.lmdb...

real	0m0.369s
user	0m0.369s
sys	0m0.000s

The gemma compute is still faster because it uses multi-threaded openblas for a single matrix:

wrk@napoli /export/local/home/wrk/iwrk/opensource/code/genetics/gemma [env]$ time ./bin/gemma -gk -g example/BXD_geno.txt.gz -p example/BXD_pheno.txt
GEMMA 0.98.6 (2022-08-05) by Xiang Zhou, Pjotr Prins and team (C) 2012-2022
Reading Files ...

## number of total individuals = 198
## number of analyzed individuals = 67
## number of covariates = 1
## number of phenotypes = 1
## number of total SNPs/var        =     7320
## number of analyzed SNPs         =     7320
Calculating Relatedness Matrix ...
================================================== 100%


real	0m0.258s
user	0m0.818s
sys	0m1.124s

and it uses fewer individuals (198 instead of 236) to compute K, i.e. my version computes some 30% more data points (the computation is quadratic in the number of individuals).

My 'naive' computation without openblas is pretty close and I can still make it faster.

Note that gemma IO is quite heavy (see the sys value). The zig version creates a file of 294K bytes (which can still be halved by storing only the upper triangle) compared to GEMMA's 580K.

The next steps are:

- [ ] Compute all LOCO kinship matrices in one go (see the sketch after this list)

- [ ] Try concurrent matrix compute and check CPU usage
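
A sketch of the first item, under the assumption that we accumulate per-chromosome partial cross-products (which may not be how mgamma ends up doing it): once the expensive per-chromosome sums exist, every LOCO matrix can be assembled cheaply.

```
const std = @import("std");

// n individuals, n_chr chromosomes: toy sizes for the sketch.
const n = 3;
const n_chr = 2;

// partial[c] holds the genotype cross-product summed over the markers of
// chromosome c only; markers_per_chr[c] is that chromosome's marker count.
// These partials are computed once; each LOCO K is then just a cheap sum
// over the chromosomes that are kept.
fn locoK(partial: [n_chr][n][n]f64, markers_per_chr: [n_chr]f64, drop: usize) [n][n]f64 {
    var k = std.mem.zeroes([n][n]f64);
    var m_rest: f64 = 0;
    var c: usize = 0;
    while (c < n_chr) : (c += 1) {
        if (c == drop) continue; // leave this chromosome out
        m_rest += markers_per_chr[c];
        var i: usize = 0;
        while (i < n) : (i += 1) {
            var j: usize = 0;
            while (j < n) : (j += 1) k[i][j] += partial[c][i][j];
        }
    }
    var i: usize = 0;
    while (i < n) : (i += 1) {
        var j: usize = 0;
        while (j < n) : (j += 1) k[i][j] /= m_rest; // rescale by remaining markers
    }
    return k;
}

test "loco kinship from per-chromosome partials" {
    const p = [n_chr][n][n]f64{
        .{ .{ 4, 0, 0 }, .{ 0, 4, 0 }, .{ 0, 0, 4 } },
        .{ .{ 2, 2, 2 }, .{ 2, 2, 2 }, .{ 2, 2, 2 } },
    };
    const m = [n_chr]f64{ 2, 1 };
    const k0 = locoK(p, m, 0); // drop chromosome 0 -> only chromosome 1 remains
    try std.testing.expect(k0[0][1] == 2.0);
}
```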

At this stage we want to see if we can outcompete openblas's built-in multithreading; a small threading sketch follows at the end of this page. This blog post confirms our experience with openblas:

https://discourse.julialang.org/t/regarding-the-multithreaded-performance-of-openblas/75450

The key thing is that the computation is memory-bound, and it will be interesting to see how the latest AMD Genoa does.
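
Finally, a sketch of the concurrency experiment (toy sizes, two threads, static row bands; an illustration rather than mgamma's actual scheduler). Each thread fills a disjoint band of rows of K, so no locking is needed.

```
const std = @import("std");

// Sketch of concurrent K computation: split the rows of K into disjoint
// bands, one per thread. Toy sizes and dummy genotypes; not mgamma code.

const n = 8; // individuals
const m = 16; // markers

var geno: [m][n]f64 = undefined;
var k_matrix: [n][n]f64 = undefined;

fn computeRows(row_start: usize, row_end: usize) void {
    var i = row_start;
    while (i < row_end) : (i += 1) {
        var j: usize = 0;
        while (j < n) : (j += 1) {
            var sum: f64 = 0;
            var s: usize = 0;
            while (s < m) : (s += 1) sum += geno[s][i] * geno[s][j];
            k_matrix[i][j] = sum / m;
        }
    }
}

pub fn main() !void {
    // dummy genotypes so the sketch runs standalone
    var a: usize = 0;
    while (a < m) : (a += 1) {
        var b: usize = 0;
        while (b < n) : (b += 1) geno[a][b] = 1.0;
    }
    const half: usize = n / 2;
    const t = try std.Thread.spawn(.{}, computeRows, .{ @as(usize, 0), half });
    computeRows(half, n); // main thread handles the other band
    t.join();
    std.debug.print("K[0][0] = {d}\n", .{k_matrix[0][0]});
}
```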