Large Language Models (LLMs) & Metadata

assigned: soloshelby, priscilla
contact: bonfacem
keywords: gnsoc, LLMs, metadata

Integrate an LLM Q&A system into gn.genenetwork.org

This development will be done in stages:

[X] 1 - get API access to FahamuAI GeneNetwork Q&A system
[X] 2 - create local python Flask sandbox
[X] 3 - build placeholder UI
[X] 3.5 - integrate FahamuAI API into placeholder @ github.com/ShelbySolomonDarnell/GN-LLMs
[-] 4 - create guix image from Docker
[ ] 5 - Serve GNQA on tux02
[ ] # - Get feedback from testers of GNQA
[ ] # - Improve GNQA after evaluating feedback
[-] 6 - create UI for Q&A window that fits into current GN framework
[ ] 7 - create CI/CD tests for new module
[ ] 8 - integrate new functionality into GN{2-3}

Tasks for Priscilla

[-] 1 - Acquire 1000 research documents w.r.t. Genetics, genomics research on diabetes
[-] 2 - Acquire 1000 more research documents w.r.t GeneNetwork.org
[ ] 3 - Write Service level agreement for collaborator

Task for collaborator

[] Build feedback into api for qualifying/rating references and answers
[] Build better document referencing using a documents bibliography information

Add GN metadata

[ ] export GN RDF triples
[ ] convert data of triples into plain English sentences
[ ] submit triples-based sentences to Q&A LLM
[ ] submit RDF metadata to an Oracle to support Q&A system truthfulness

Set up system update protocol

These are all living systems that must be kept up-to-date.

GN is consistently being used for research and we are improving its design and functionality to make this statement perpetually true.

In order to keep the Q&A system up-to-date we must:

[ ] create protocol to get new publications
[ ] query web for new publications utilizing GN
[ ] pull links to the newly found documents
[ ] acquire the documents
[ ] process documents for LLM

The National Library of Medicine's PubMed is a National Institute of Health system that is one of the most widely used resources for researchers found

PubMed is consistently updated by the NIH, so we must build a script to:

[ ] poll its API on a regular basis
[ ] download new citations,
[ ] parse citations and metadata for input into LLM
[ ] upload new data into LLM

By ensuring up-to-date information about the main information sources for the GeneNetwork Q&A system, the system grows with the knowledgebase.

Add functionality that allows someone to submit documentation to the system, which is added after being reviewed by a specialist.