💾 Archived View for thebird.nl › gn-gemtext-threads › issues › slow-correlations.gmi captured on 2023-06-14 at 14:20:09. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
Correlation in gn2 has regressed when compared to gn1
Issues experienced by users include
[x] separation of concerns: split the correlation code from the database access code, for easier debugging
[x] optimize db queries
[x] Cache for huge datasets in text files
[x] Cache for traits metadata
[x] refactor data structures used
[x] limit number of results rendered to user
[] implement parallel computation for correlation
[] Server side pagination
As Rob has pointed out before, gn2 is much, much slower than
gn1. Before, we mistakenly thought that this was because gn1 only
computed one of the correlations; but Zach correctly pointed out that
gn1 did in fact still compute all correlations in a similar
fashion to gn2.
The problems we have with gn2 are 2-fold:
- Slow computations
- UI crashing on our users for huge datasets
We took a step back and tried to probe deeper into how we do correlations. To
do a correlation, we need to run a query on the entire dataset. After
running a query on this dataset, we additionally fetch metadata on
this dataset as seen here:
This takes a long time: it's our biggest bottleneck.
For sample correlation we call this function to fetch the data:
IMO this seems to be the main issue among all queries.
For tissue correlation we call this function to fetch the data. This
doesn't take much time: less than 20 seconds to create the instance and
fetch results.
For lit correlation, we fetch the correlation from the DB; no
computation happens.
Assume a user selects "sample correlation" in the form with a limit of
2000. We fetch the results for the entire sample dataset to
compute the sample correlation, then filter the top 2000 traits. We then
fetch the tissue input for those traits, do the tissue correlation, and
finally fetch the lit results for them.
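The flow above (correlate against everything, then keep only the top N before the later stages) can be sketched roughly as follows. This is a minimal illustration, not the gn2 code: `pearson`, `top_correlated` and the dict-of-lists dataset layout are hypothetical names and shapes chosen for the example.

```python
import heapq
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def top_correlated(primary, dataset, limit):
    """Return the `limit` (trait, r) pairs with the largest |r|.

    `dataset` is assumed to map trait name -> list of sample values.
    """
    scored = ((name, pearson(primary, values)) for name, values in dataset.items())
    # nlargest keeps only `limit` candidates in memory while scanning.
    return heapq.nlargest(limit, scored, key=lambda pair: abs(pair[1]))
```

Only the traits returned by `top_correlated` would then be passed on to the tissue and lit stages, so those stages never see the full dataset.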
ATM, we know that our datasets are immutable unless @Acenteno updates
things. So why don't we just cache the results of such queries in
Redis, or in some JSON file, and use those instead of running the
query on every computation? A file look-up or a Redis look-up would be
much faster than what we already have.
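A minimal sketch of the file-based variant of this idea, assuming the query results are JSON-serialisable. `cached_query`, `run_query` and `cache_dir` are hypothetical names for illustration; invalidation when the dataset is updated is not handled here.

```python
import hashlib
import json
import os


def cached_query(query, run_query, cache_dir):
    """Return the result of `run_query(query)`, reusing a JSON file if present.

    The cache key is a hash of the query text, so the same query always
    maps to the same file.
    """
    key = hashlib.sha256(query.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, key + ".json")
    if os.path.exists(path):
        # Cache hit: a file look-up instead of a database round trip.
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    rows = run_query(query)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(rows, handle)
    return rows
```

Since the datasets rarely change, deleting the cache directory whenever a dataset is updated would be one simple way to keep the files honest.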
Also, another thing that could be improved on is replacing some basic
data-structures used during the computations with more efficient
ones. As an example, it makes little sense to use a list that holds a
huge number of elements, when we could use a generator instead, or
depending on the application, a more appropriate structure. That could
shave some more seconds.
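As a small illustration of the list-versus-generator point, the sketch below yields one parsed row at a time instead of materialising the whole dataset. The tab-separated layout and the name `iter_trait_rows` are assumptions made for the example.

```python
def iter_trait_rows(lines):
    """Yield (trait_name, values) one row at a time.

    `lines` can be any iterable of tab-separated text lines, e.g. an open
    file, so the full dataset never has to sit in memory as one list.
    """
    for line in lines:
        name, *raw_values = line.rstrip("\n").split("\t")
        yield name, [float(value) for value in raw_values]
```

A caller can loop over `iter_trait_rows(open_file)` directly and compute each correlation as the row arrives.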
Something else worth mentioning is that the fast correlations that
used parallelisation, and which produced bugs in gn2, could be re-written in a
more reliable way using threads; IIRC that's what gn1 did. So
that's something worth exploring too.
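A rough sketch of what a thread-based version could look like, using the standard library's `concurrent.futures`. The names `pearson` and `correlate_all` are hypothetical. One caveat worth stating loudly: in CPython, threads only speed up CPU-bound work if the per-trait computation releases the GIL (for example inside a C extension), so this shows the structure rather than a guaranteed speed-up.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlate_all(primary, dataset, max_workers=4):
    """Correlate `primary` against every trait in `dataset` using threads.

    `dataset` is assumed to map trait name -> list of sample values.
    """
    names = list(dataset)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so scores line up with names.
        scores = pool.map(lambda name: pearson(primary, dataset[name]), names)
        return dict(zip(names, scores))
```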
WRT the UI crashing, we rely too much on Javascript
(DataTables). AFAICT, the massive results we get from the correlations
are sent to DataTables to process. That's asking too much! We
brainstormed some high-level ideas on how to fix this. One of them
is to have the results stored somewhere, and to build pagination
off those results. Now that's up to Alex to decide how to go about it.
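One way the "store the results, then paginate" idea could look on the server side is sketched below. `paginate` and the shape of the returned dict are hypothetical; how the stored results are kept (file, Redis, database) is left open.

```python
def paginate(results, page, page_size):
    """Return one page of already-computed correlation results.

    `results` is assumed to be the full, stored result list; only the
    requested slice is sent to the browser.
    """
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    start = (page - 1) * page_size
    return {
        "page": page,
        "page_size": page_size,
        "total": len(results),
        "rows": results[start:start + page_size],
    }
```

With this, the browser only ever receives `page_size` rows per request, however large the full result set is.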
Something cool that Alex pointed out is an interesting "manual" testing
mechanism, which he can feel free to try out: separate the actual
"computation" and the "pre-fetching" in code, and see which one takes
the time.
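A minimal sketch of that measurement, assuming the two phases can be wrapped separately. `timed` is a hypothetical helper, and the labels are just dictionary keys.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, timings):
    """Record the wall-clock duration of the wrapped block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


# Example usage with stand-in work for each phase.
timings = {}
with timed("pre-fetching", timings):
    data = [float(i) for i in range(1000)]  # stands in for the DB fetch
with timed("computation", timings):
    total = sum(value * value for value in data)  # stands in for the correlation
```

Printing `timings` after a real request would show directly whether the fetch or the computation dominates.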
Atm GN2 is un-usable for Rob for basic tours and show-and-tells, and
it is a persistent problem that is getting worse the more he
complains. Correlation is slower than it has ever been, and search
is broken. For a simple search of 10,000 phenotypes, it takes a lot of
time to compute.
According to Rob, GN1 does not rely on a cache. Instead it is
computing from a materialized view of the database that is
intentionally designed for a fast web service.
Most of the above issues have been addressed.
Correlation speed has greatly improved, with no complaints
on the issue as of 12/04/2022.
For example, the computation no longer crashes for the dataset below:
http://gn2.genenetwork.org/show_trait?trait_id=ENSG00000244734&dataset=GTEXv8_Wbl_tpm_0220
Also, wrt parallel computation: implementing it in Python leads to
memory errors for forked processes, so it is best implemented in a
different language if the issue arises.
Closing down the issue. To speed things up, the gn2 correlation computation
is to be rewritten in Rust.
Added an issue tracker for this:
https://issues.genenetwork.org/issues/implement-parallel-correlation-with-rust.gmi