Computing kinship matrices K for LOCO is quite expensive for larger datasets. In GeneNetwork, so far, we compute K for every case and cache the results. That is not a bad approach, but if K could be computed on the fly in seconds, the first run would be fast and we would not need the cache at all.
Also, previously we computed K for every phenotype; that is, individuals with missing phenotype values were removed from the computation. This should not be necessary, as the distance between two individuals is a fixed number between 0 and 1 that depends only on the genotypes involved. Down the line we will look at reducing the genotypes, but the final distances should not change much.
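To illustrate the reuse, a minimal Zig sketch (the `subsetK` helper is hypothetical, not mgamma code): for a phenotype with missing individuals we can slice the cached full K down to the remaining individuals instead of recomputing it.

```zig
const std = @import("std");

// Hypothetical helper: given the cached full n x n kinship matrix (row-major)
// and the indices of individuals that have a phenotype value, copy out the
// reduced k x k matrix. No distances are recomputed.
fn subsetK(alloc: std.mem.Allocator, full: []const f64, n: usize, keep: []const usize) ![]f64 {
    const k = keep.len;
    const sub = try alloc.alloc(f64, k * k);
    var i: usize = 0;
    while (i < k) : (i += 1) {
        var j: usize = 0;
        while (j < k) : (j += 1) {
            sub[i * k + j] = full[keep[i] * n + keep[j]];
        }
    }
    return sub;
}
```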
Here we track the Zig matrix computations:
Previously we stored all kinship information in separate files. We are changing this by using lmdb as a key-value store. We can also store half the data, because kinship matrices are symmetric.
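As an illustration of the halving (the packed layout here is an assumption, not necessarily the mgamma on-disk format): a symmetric n×n matrix fits in n(n+1)/2 values, with a small index helper to address the packed buffer.

```zig
const std = @import("std");

// Element (i,j) of a symmetric matrix stored as a packed lower triangle
// (row-major) lives at i*(i+1)/2 + j for j <= i; otherwise swap i and j.
fn triIndex(i: usize, j: usize) usize {
    return if (j <= i) i * (i + 1) / 2 + j else j * (j + 1) / 2 + i;
}

test "packed symmetric lookup" {
    // A 3x3 symmetric matrix stored as 6 packed values instead of 9.
    const k = [_]f64{ 1.0, 0.5, 1.0, 0.2, 0.3, 1.0 };
    try std.testing.expectEqual(@as(f64, 0.5), k[triIndex(1, 0)]);
    try std.testing.expectEqual(@as(f64, 0.5), k[triIndex(0, 1)]); // symmetry for free
}
```

For the 236 BXD individuals that is 27,966 values instead of 55,696.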
The first iteration of reading lmdb, computing K, and writing lmdb is already quite a bit faster than the comparable gemma command:
mgamma computing one K without LOCO:
```
wrk@napoli /export/local/home/wrk/iwrk/opensource/code/genetics/mgamma [env]$ zig build -Drelease-fast=true && zig test mgamma.zig && time ./zig-out/bin/mgamma
LLVM Emit Object...
LLVM Emit Object...
Test [0/2] test "lmdb-setup"...
Test [1/2] test "compute matrix with its transpose"...
All 2 tests passed.
mgamma loading test/data/bxd...
nrows,ncols = 7321,236
top left value is 0, bottom right is 2
top left value is 0.0e+00, bottom right is 0.0e+00
top left value is 3.126e+03, top right is 4.528e+03, bottom left is 4.528e+03, bottom right is 2.0119e+04
mgamma writing K.lmdb...

real    0m0.369s
user    0m0.369s
sys     0m0.000s
```
The gemma compute itself is still faster, because it uses multi-threaded OpenBLAS for a single matrix:
```
wrk@napoli /export/local/home/wrk/iwrk/opensource/code/genetics/gemma [env]$ time ./bin/gemma -gk -g example/BXD_geno.txt.gz -p example/BXD_pheno.txt
GEMMA 0.98.6 (2022-08-05) by Xiang Zhou, Pjotr Prins and team (C) 2012-2022
Reading Files ...
```
Also, gemma uses fewer individuals to compute K: 198 instead of 236. Since the work is quadratic in the number of individuals, my version computes roughly 40% more data points ((236/198)² ≈ 1.4).
My 'naive' computation without OpenBLAS is pretty close, and I can still make it faster.
Note that gemma's IO is quite heavy (see the sys value). The Zig version creates a file of 294K bytes (which can still be halved by storing only one triangle) compared to GEMMA's 580K.
The next steps are:
- [ ] compute all LOCO kinship matrices in one go
- [ ] Try concurrent matrix compute and check CPU usage (see the sketch below)
At this stage we want to see if we can outcompete OpenBLAS's built-in multithreading. This blog post confirms our experience with OpenBLAS:
https://discourse.julialang.org/t/regarding-the-multithreaded-performance-of-openblas/75450
The key thing is that the computation is memory-bound, and it will be interesting to see how the latest AMD Genoa does.
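As a starting point for the concurrency experiment, a minimal sketch, assuming a row-major f64 genotype matrix and static row blocks per thread. The names (`Job`, `kinshipRows`, `computeK`) are mine, not mgamma's, and the `@intToFloat` builtin matches the pre-0.11 Zig that the `-Drelease-fast` flag above implies.

```zig
const std = @import("std");

const Job = struct {
    geno: []const f64, // n x m genotype matrix, row-major (n individuals, m markers)
    k: []f64, // n x n kinship output
    n: usize,
    m: usize,
    start: usize, // first row this thread owns
    end: usize, // one past the last row it owns
};

// Each thread fills its own block of rows; since the row blocks do not
// overlap, no locking is needed on the output matrix.
fn kinshipRows(job: Job) void {
    const scale = 1.0 / @intToFloat(f64, job.m);
    var i = job.start;
    while (i < job.end) : (i += 1) {
        var j: usize = 0;
        while (j <= i) : (j += 1) {
            var sum: f64 = 0.0;
            var s: usize = 0;
            while (s < job.m) : (s += 1)
                sum += job.geno[i * job.m + s] * job.geno[j * job.m + s];
            job.k[i * job.n + j] = sum * scale;
            job.k[j * job.n + i] = sum * scale; // mirror into the upper triangle
        }
    }
}

pub fn computeK(alloc: std.mem.Allocator, geno: []const f64, k: []f64, n: usize, m: usize, nthreads: usize) !void {
    const threads = try alloc.alloc(std.Thread, nthreads);
    defer alloc.free(threads);
    const chunk = (n + nthreads - 1) / nthreads;
    var t: usize = 0;
    while (t < nthreads) : (t += 1) {
        const start = std.math.min(t * chunk, n);
        const end = std.math.min(start + chunk, n);
        threads[t] = try std.Thread.spawn(.{}, kinshipRows, .{Job{
            .geno = geno, .k = k, .n = n, .m = m, .start = start, .end = end,
        }});
    }
    for (threads) |th| th.join();
}
```

Because only the lower triangle is computed, later row blocks carry more work than earlier ones; a real version would balance the blocks (or hand out rows dynamically) to keep all cores busy.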