Project PI: Chun-Nan Hsu
Members: Ai He (USC), Shitij Bhargava, Gordon Lin, Priyanka Ganapathi, Suvir Jain
Sponsor: NHGRI/NIH 5U01HG006894
Project Period: 09/24/2012 – 06/30/2015
The Catalog of GWAS is an important resource containing published associations between SNPs and phenotypes identified by Genome-Wide Association Studies (GWAS), a well-defined study approach.
However, curation of the catalog is currently performed by expert curators. Although this ensures quality, new GWAS publications appear faster than any human curation team can handle. This project aims to solve this problem by applying information extraction techniques from text mining.
- Original PDFs of the articles (1,382 PDFs, more than 600 MB in total even when compressed)
- Sampled PDFs of the articles (81 PDFs for gene and disease mention detection)
Detection on sampled data sets
- Gene mention detection (Not yet validated "gold standards")
- BioCreative 2 gold standard: bc2GNandGMgold_Subs.tar.gz http://sourceforge.net/projects/biocreative/
- Perl script for evaluation https://github.com/bioinformatics-ua/gimli/blob/master/resources/evaluation/bc2gm/alt_eval.perl
- Disease mention detection (Same as the above)
- Supporting Resources
- A collective list of Diseases and Traits (306,478 entries in total)
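
Scoring gene or disease mention detection against a gold standard usually comes down to precision, recall and F1 over the detected mentions. The sketch below illustrates that computation in Python under simplifying assumptions: a mention is a hypothetical `(document id, start offset, end offset)` tuple, and only exact span matches count as correct. It is not a port of the `alt_eval.perl` script listed above, which may apply different matching rules, and the function name `score_mentions` is introduced here for illustration only.

```python
from typing import Iterable, Tuple

# Assumed mention format: (document id, start offset, end offset).
Mention = Tuple[str, int, int]


def score_mentions(
    predicted: Iterable[Mention], gold: Iterable[Mention]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) using exact span matching."""
    pred_set = set(predicted)
    gold_set = set(gold)

    # A true positive is a predicted mention that also appears in the gold set.
    true_positives = len(pred_set & gold_set)

    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0

    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the project.
    gold = [("doc1", 0, 5), ("doc1", 10, 15), ("doc2", 3, 8)]
    predicted = [("doc1", 0, 5), ("doc2", 3, 8), ("doc2", 20, 25), ("doc3", 1, 2)]
    p, r, f = score_mentions(predicted, gold)
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Exact matching is the strictest choice. A more lenient evaluation would also credit overlapping or alternative spans, which would mean changing how true positives are counted.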
Meeting dates 2014