Experiences from NCBI’s Biomedical Data Science Hackathon 2016


Data sharing is critical to advancing genomic research: reusing and combining existing data reduces the need to collect new data, and sharing promotes transparency and reproducibility. In January 2015, the National Institutes of Health (NIH) implemented a Genomic Data Sharing (GDS) Policy that applies to all large-scale genomic data generated with NIH funding. The policy requires that all large-scale genomic data and relevant metadata be shared in a timely manner through one of the NIH-designated data repositories. As the necessity and importance of data sharing continue to grow, researchers will increasingly rely on public repositories to gain access to these data.

Though policies are in place, and an increasing number of researchers recognize the need for data sharing, there are challenges to data sharing in practice. These challenges include personal health information privacy, misuse of data, breaches of sharing policy, and technical issues with actually obtaining the data. One such technical issue is that many of the databases, including those with similar data, operate independently of one another. This issue was one of the topics of the National Center for Biotechnology Information's (NCBI) biomedical data science hackathon, held in Bethesda, MD in early August 2016. A hackathon is a short but intensive event where programmers collaborate on software development projects.

The overarching aim of this hackathon was to bring together researchers from various disciplines, with diverse skills and at different phases of their careers, to build pipelines or tools for analyzing large genomic and biomedical datasets. The hackathon took place over three days at the National Library of Medicine, where approximately 30 researchers were divided into five groups based on their skill sets and experience. Each group was given a broad scientific problem and tasked with developing an analysis pipeline or tool to solve it. Groups were encouraged to use open-source genomic analysis tools and public datasets to design, build, and test their pipelines.

The group I was assigned to was tasked with designing a tool that would facilitate interaction between two commonly used, publicly available genomic databases: NCBI's database of Genotypes and Phenotypes (dbGaP) and The Cancer Genome Atlas (TCGA) hosted by the NIH. Though these two databases are hosted by two US federal government agencies and contain similar data, there is currently no interaction between them. The current protocol for identifying studies of interest from TCGA and dbGaP is to search both databases separately. Although this is not technically difficult, problems arise when the terms embedded in the metadata of similar studies do not match exactly. In general, the metadata query terms for TCGA are limited and controlled by TCGA at the time of study submission. In contrast, the dbGaP metadata query terms are user defined and can be extremely varied.
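To make the term-mismatch problem concrete, here is a minimal sketch of one way to compare a free-text, user-defined term against a small controlled vocabulary. It is not the method used by our tool; the vocabulary, the synonym table, the function names, and the 0.5 threshold are all hypothetical values chosen for illustration.

```python
import re

# Hypothetical controlled vocabulary, standing in for TCGA-style terms.
CONTROLLED_TERMS = ["breast invasive carcinoma", "lung adenocarcinoma"]

# Hypothetical synonym table used to map variant wording to one spelling.
SYNONYMS = {"cancer": "carcinoma", "tumour": "tumor"}


def normalize(term):
    """Lowercase the term, drop punctuation, and return its set of words
    with synonyms mapped to a single spelling."""
    words = re.findall(r"[a-z0-9]+", term.lower())
    return {SYNONYMS.get(word, word) for word in words}


def match_terms(free_text, controlled=CONTROLLED_TERMS, threshold=0.5):
    """Return (term, score) pairs for controlled terms whose word overlap
    with free_text (Jaccard similarity) is at least `threshold`."""
    query = normalize(free_text)
    hits = []
    for term in controlled:
        target = normalize(term)
        union = query | target
        score = len(query & target) / len(union) if union else 0.0
        if score >= threshold:
            hits.append((term, round(score, 2)))
    # Best matches first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    print(match_terms("Invasive Breast Cancer"))
```

An exact string comparison would miss "Invasive Breast Cancer" entirely, whereas comparing normalized word sets treats it as the same concept as the controlled term. A real tool would likely need a curated ontology rather than a hand-written synonym table.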

Our team's original idea was to develop a tool that would allow a user working with sequencing data from TCGA to identify similar studies within dbGaP and automatically download, decrypt, and reformat the corresponding sequencing data into a usable file format (SAM format). Our team's tool accomplishes this task, and its functionality has since been greatly expanded. The tool allows users to query both dbGaP and TCGA at the same time, using several different search criteria, and to identify studies from either database that match the criteria. An optional function automatically downloads the sequencing data from the dbGaP database. We added functionality for easily downloading, decrypting, and reformatting dbGaP data but not TCGA data because the TCGA sequencing data is available for download as VCF files, whereas dbGaP sequencing data is stored in Sequence Read Archive (SRA) format. VCF is a human-readable file format that can be used directly for analysis, while SRA is a compressed binary format that requires special software to manipulate and extract the data.
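As an illustration of the "special software" step, the sketch below wraps the SRA Toolkit's `sam-dump` command, which writes SAM records for a run accession to standard output. This is an assumption about how such a conversion could be scripted, not a description of our tool's internals: the helper names are hypothetical, `SRR000001` is only a placeholder accession, and controlled-access dbGaP data additionally requires an approved, configured SRA Toolkit setup that is not shown here.

```python
import re
import subprocess

# Run accessions look like SRR/ERR/DRR followed by digits.
ACCESSION_PATTERN = re.compile(r"^[SED]RR\d+$")


def build_sam_dump_command(accession):
    """Return the argument list for converting one SRA run to SAM.

    Hypothetical helper: it validates the accession and returns the
    command without running it.
    """
    if not ACCESSION_PATTERN.match(accession):
        raise ValueError(f"not a run accession: {accession!r}")
    return ["sam-dump", accession]


def convert_to_sam(accession, out_path):
    """Run sam-dump and save its standard output as a SAM file.

    Assumes the SRA Toolkit is installed and `sam-dump` is on the PATH.
    """
    command = build_sam_dump_command(accession)
    with open(out_path, "w") as out_file:
        subprocess.run(command, stdout=out_file, check=True)


if __name__ == "__main__":
    # Placeholder accession; prints the command instead of running it.
    print(" ".join(build_sam_dump_command("SRR000001")))
```

Keeping the command construction separate from its execution makes the step easy to check without the toolkit installed.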

The tool and all documentation are available at https://github.com/NCBI-Hackathons/TCGA_dbGaP, and the manuscript, entitled "Extending TCGA Queries to Automatically Identify Analogous Genomic Data from dbGaP" by Wagner et al. (2016), was submitted to F1000Research on October 13th, 2016.

The other four groups were each assigned a different scientific problem. The problems addressed included analysis, annotation, and accurate prediction of breakpoints for structural variants (svcompare); application of machine learning techniques to predict immunogenicity (categorical: positive or negative) based on a protein and its associated amino acid properties (Machine_Learning_Immunogenicity); identification and annotation of microbial genes from an environmental sample (Metagenomic_Transcriptomes); and identification of interactions between SNPs and their effect on phenotype expression (Network_Stats_Acc_Interop). These pipelines and tools, along with those developed at previous hackathons, are available at https://github.com/NCBI-Hackathons.

These pipelines and tools, and others like them, are excellent resources that BSSI utilizes as we continually strive to bring cutting-edge methodology to our clients.