💾 Archived View for thebird.nl › gn-gemtext-threads › issues › slow-correlations.gmi captured on 2023-06-14 at 14:20:09. Gemini links have been rewritten to link to archived content

View Raw

More Information

⬅️ Previous capture (2022-07-16)

-=-=-=-=-=-=-

Slow Correlations and UI crashes

Correlation in gn2 has regressed when compared gn1

Issue experienced by users include

Tags

Tasks

[x] separation of concerns

split between correlation code and code to database part

for easier debug

[x] optimize db queries

[x] Cache for huge datasets in text files

[x] Cache for traits metadata

[x] refactor data structures used

[x] limit number of results rendered to user

[] implement parralel computation for correlation

[] Server side pagination

Background on the issue

As Rob has pointed out before, gn2 is much much slower than

gn1. Before, we mistakenly thought that it was because that it only

computed one of the correlations; but Zach correctly pointed out that

it, gn1, did in fact still compute all correlations in a similar

fashion to gn2.

The problems we have with gn2 are 2-fold:

- Slow computations

- UI crashing on our users for huge datasets

We took a step back; tried to probe deeper how we do correlations. To

do a correlation, we need to run a query on the entire dataset. After

running a query on this dataset, we additionally fetch metadata on

this dataset as seen here:

https://github.com/genenetwork/genenetwork2/blob/70f8ed53f85cfb42ca81ed6c3b4c9cf1060940e5/wqflask/wqflask/correlation/show_corr_results.py#L88

This takes a long time: it's our biggest bottleneck.

For sample correlation we call this function to fetch the data:

https://github.com/genenetwork/genenetwork2/blob/70f8ed53f85cfb42ca81ed6c3b4c9cf1060940e5/wqflask/base/data_set.py#L731

IMO this seems to be the main issue among all queries.

For tissue correlation we call this function to fetch the data this

doesn't take much time less than 20 seconds to create instance and

fetch results.

https://github.com/genenetwork/genenetwork2/blob/70f8ed53f85cfb42ca81ed6c3b4c9cf1060940e5/wqflask/base/mrna_assay_tissue_data.py#L78>

For lit correlation, we fetch the correlation from the DB no

computation happens

Assume a user selects "sample correlation" in the form with limit

2000, they will fetch the results for the entire sample dataset to

compute the sample correlation; then filter the top 2000 traits. Fetch

the tissue input for them then do the correlation then fetch lit

results for them.

ATM, we know that our datasets are immutable unless @Acenteno updates

things. So why don't we just cache the results of such queries in

Redis, or in some json file. And use those instead of running the

query on every computation? A file look-up or a Redis look-up would be

much faster than what we already have.

Also, another thing that could be improved on is replacing some basic

data-structures used during the computations with more efficient

ones. As an example, it makes little sense to use a list that holds a

huge number of elements, when we could use a generator instead, or

depending on the application, a more appropriate structure. That could

shave some more seconds.

Something else worth mentioning is that the fast correlations that

used parallelisation produced bugs in gn2 could be re-written in a

more reliable way using threads-- that's what IIRC what gn1 did. So

that's something worth exploring too.

WRT the UI crashing, we rely too much on Javascript

(DataTables). AFAICT, the massive results we get from the correlations

are sent to DataTables to process. That's asking too much! We

brainstormed on some high level ideas on how to do this. One of them

being to have the results stored somewhere. And to build pagination

off those results. Now that's up to Alex to decide how to go about it.

Something cool that Alex pointed is an interesting "manual" testing

mechanism which he can feel free to try out: Separate the actual

"computation" and the "pre-fetching" in code. And see what takes

time.

Updates

Mon 18 Oct 2021 12:42:17 PM EAT

Atm GN2 is un-usable for Rob for basic tours and show-and-tells, and

it is a persistent problem that is getting worse the more he

complains. Correlation is slower than it was ever before; and search

is broken. For a simple search of 10,000 phenotypes, it takes a lot of

time to compute.

According to Rob, GN1 does not rely on a cache. Instead it is

computing from a materialized view of the database that is

intentionally designed for a fast web service.

Notes

Tue, 12 April 2022

Most of the above issue have been addressed

correlation speed has greatly improved no complain't

on the issue as of 12/04/2022

for example the dataset below no longer crashes for this datashe computa

http://gn2.genenetwork.org/show_trait?trait_id=ENSG00000244734&dataset=GTEXv8_Wbl_tpm_0220

Also, wrt to parralel computation

implementation in python leads

to memory error for forked processes and

is best implemented in a different

language if the issue arises

Notes

Friday, 8 July 2022

Closing down the issue,to speed up things the gn2 correlation computation

was to be rewritten using rust

Added an issue tracker for this

https://issues.genenetwork.org/issues/implement-parallel-correlation-with-rust.gmi