💾 Archived View for thebird.nl › gn-gemtext-threads › topics › systems › mariadb › precompute-mappin… captured on 2023-06-16 at 16:54:49. Gemini links have been rewritten to link to archived content
⬅️ Previous capture (2023-04-19)
-=-=-=-=-=-=-
GN relies on precomputed mapping scores for search and other functionality. Here we prepare for a new generation of functionality that introduces LMMs for compute and multiple significant scores for queries.
Above is the quick win for plugging in GEMMA values. We will make sure not to recompute the values that are already up to date.
Next:
The original reaper precompute lives in
https://github.com/genenetwork/genenetwork2/blob/testing/scripts/maintenance/QTL_Reaper_v6.py
This script first fetches inbredsets
MariaDB [db_webqtl]> select Id, Name from InbredSet limit 5; +----+----------+ | Id | Name | +----+----------+ | 1 | BXD | | 2 | B6D2F2 | | 4 | AXBXA | | 5 | AKXD | | 6 | B6BTBRF2 | +----+----------+
and expands them to a .geno file, e.g. BXD.geno. Note that the script does not compute with the many variations of .geno files we have today. Next it sets the Id for ProbeSetFreeze which is the same as the InbredSet Id. So, ProbeSetFreeze.Id == IndbredSet.Id.
Next for this Id we fetch, known as `ProbeSetXRefInfos`:
MariaDB [db_webqtl]> select ProbeSetId, Locus, DataId from ProbeSetXRef where ProbeSetFreezeId=1 limit 5; +------------+----------------+--------+ | ProbeSetId | Locus | DataId | +------------+----------------+--------+ | 1 | rs13480619 | 1 | | 2 | rs29535974 | 2 | | 3 | rs49742109 | 3 | | 4 | rsm10000002321 | 4 | | 5 | rsm10000019445 | 5 | +------------+----------------+--------+
Then we fetch the trait values:
MariaDB [db_webqtl]> select Strain.Name, ProbeSetData.value from Strain, ProbeSetData where Strain.Id = ProbeSetData.StrainId and ProbeSetData.Id = 1 limit 5; +----------+-------+ | Name | value | +----------+-------+ | B6D2F1 | 5.742 | | C57BL/6J | 5.006 | | DBA/2J | 6.079 | | BXD1 | 6.414 | | BXD2 | 4.885 | +----------+-------+
with genotypes and these phenotypes qtlreaper is started and next we update the values for
select * from ProbeSetXRef where ProbeSetId=1 and ProbeSetFreezeId=1 limit 5; +------------------+------------+--------+------------+------------------+------------+------------------+---------------------+------------+-----------------+--------+-------------+------+ | ProbeSetFreezeId | ProbeSetId | DataId | Locus_old | LRS_old | pValue_old | mean | se | Locus | LRS | pValue | additive | h2 | +------------------+------------+--------+------------+------------------+------------+------------------+---------------------+------------+-----------------+--------+-------------+------+ | 1 | 1 | 1 | 10.095.400 | 13.3971627898894 | 0.163 | 5.48794285714286 | 0.08525787814808819 | rs13480619 | 12.590069931048 | 0.269 | -0.28515625 | NULL | +------------------+------------+--------+------------+------------------+------------+------------------+---------------------+------------+-----------------+--------+-------------+------+
the actual update is
update ProbeSetXRef set Locus=%s, LRS=%s, additive=%s where ProbeSetId=%s and ProbeSetFreezeId=%s
so Locus, LRS and additive fields are updated.
From this exercise we can conclude:
Rob voiced a wish to retain all scores (at least those above 1.0). This is not feasible in mariadb, but we can retain GEMMA output if stored efficiently.
ProbeSetXRef is pretty small, currently @5.6Gb and 48,307,650 rows, so we could decide to add columns to track different mappers. Something funny
select count(LRS) from ProbeSetXRef; +------------+ | count(LRS) | +------------+ | 28335955 | +------------+
half the fields are used. What to think of
MariaDB [db_webqtl]> select * from ProbeSetXRef where LRS=0 limit 5; +------------------+------------+----------+-----------+---------+------------+------+------+----------------+------+--------+----------+------+ | ProbeSetFreezeId | ProbeSetId | DataId | Locus_old | LRS_old | pValue_old | mean | se | Locus | LRS | pValue | additive | h2 | +------------------+------------+----------+-----------+---------+------------+------+------+----------------+------+--------+----------+------+ | 589 | 8599010 | 70606737 | NULL | NULL | NULL | 0 | NULL | rsm10000000001 | 0 | NULL | 0 | NULL | | 589 | 8593710 | 70606738 | NULL | NULL | NULL | 0 | NULL | rsm10000000001 | 0 | NULL | 0 | NULL | | 589 | 8607637 | 70606739 | NULL | NULL | NULL | 0 | NULL | rsm10000000001 | 0 | NULL | 0 | NULL | | 589 | 8594490 | 70606740 | NULL | NULL | NULL | 0 | NULL | rsm10000000001 | 0 | NULL | 0 | NULL | | 589 | 8591994 | 70606741 | NULL | NULL | NULL | 0 | NULL | rsm10000000001 | 0 | NULL | 0 | NULL | +------------------+------------+----------+-----------+---------+------------+------+------+----------------+------+--------+----------+------+
MariaDB [db_webqtl]> select count(*) from ProbeSetXRef where LRS=0 and Locus="rsm10000000001"; +----------+ | count(*) | +----------+ | 291447 | +----------+
There is obviously more. I think this table can use some cleaning up?
A good dataset to take apart is
http://genenetwork.org/show_trait?trait_id=1436869_at&dataset=HC_M2_0606_P
because it has 71 BXD samples and 32 other samples.