A hybrid computational strategy to address WGS variant analysis in >5000 samples.

Title: A hybrid computational strategy to address WGS variant analysis in >5000 samples.
Publication Type: Journal Article
Year of Publication: 2016
Authors: Huang Z, Rustagi N, Veeraraghavan N, Carroll A, Gibbs R, Boerwinkle E, Venkata MGorentla
Secondary Authors: Yu F
Journal: BMC Bioinformatics
Volume: 17
Issue: 1
Pagination: 361
Date Published: 2016 Sep 10
ISSN: 1471-2105
Keywords: Databases, Genetic; Genome, Human; Genomics; High-Throughput Nucleotide Sequencing; Humans
Abstract

BACKGROUND: The decreasing cost of sequencing is driving the need for cost-effective, real-time variant calling of whole genome sequencing data. The scale of these projects is far beyond the capacity of the computing resources typically available to most research labs. Other infrastructures, such as the AWS cloud environment and supercomputers, have limitations of their own that make large-scale joint variant calling infeasible, and infrastructure-specific variant calling strategies either fail to scale up to large datasets or abandon joint calling altogether.

RESULTS: We present a high-throughput framework that includes multiple variant callers for single nucleotide variant (SNV) calling and leverages a hybrid computing infrastructure consisting of the AWS cloud, supercomputers and local high-performance computing infrastructures. We present a novel binning approach for large-scale joint variant calling and imputation that can scale up to over 10,000 samples while producing SNV callsets with high sensitivity and specificity. As a proof of principle, we present results of analysis on the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) WGS freeze 3 dataset, in which joint calling, imputation and phasing of over 5300 whole genome samples were completed in under 6 weeks using four state-of-the-art callers: SNPTools, GATK-HaplotypeCaller, GATK-UnifiedGenotyper and GotCloud. The computation used Amazon AWS, a 4000-core in-house cluster at Baylor College of Medicine, the IBM power PC Blue BioU at Rice and Rhea at Oak Ridge National Laboratory (ORNL). AWS was used for joint calling of 180 TB of BAM files, and the ORNL and Rice supercomputers were used for the imputation and phasing step. All other steps were carried out on the local compute cluster. The entire operation used 5.2 million core hours and transferred a total of only 6 TB of data across the platforms.
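
To put the reported resource figures in perspective, the following back-of-envelope sketch derives a few ratios from the numbers quoted in the RESULTS paragraph. It is illustrative arithmetic only, not part of the published analysis. The constant and function names are hypothetical, decimal units (1 TB = 1000 GB) are assumed, and "under 6 weeks" and "over 5300" are treated as exactly 6 weeks and 5300 samples, so the derived values are bounds rather than measurements.

```python
# Back-of-envelope figures derived only from numbers quoted in the abstract.
# All names here are hypothetical; nothing below comes from the paper's code.
CORE_HOURS = 5.2e6   # total compute reported
WALL_WEEKS = 6       # upper bound on elapsed time ("under 6 weeks")
BAM_TB = 180         # BAM input joint-called on AWS
TRANSFER_TB = 6      # data moved between platforms
SAMPLES = 5300       # lower bound on sample count ("over 5300")


def summarize():
    """Return simple ratios implied by the reported totals."""
    wall_hours = WALL_WEEKS * 7 * 24
    return {
        # Lower bound: finishing in less than 6 weeks needs more cores on average.
        "avg_concurrent_cores": CORE_HOURS / wall_hours,
        # Upper bound: more than 5300 samples means less BAM data per sample.
        # Assumes decimal units (1 TB = 1000 GB).
        "bam_gb_per_sample": BAM_TB * 1000 / SAMPLES,
        # Upper bound, for the same reason.
        "core_hours_per_sample": CORE_HOURS / SAMPLES,
        # Share of the BAM volume that had to cross platform boundaries.
        "transfer_fraction": TRANSFER_TB / BAM_TB,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the run averaged at least roughly 5,160 busy cores, which is more than the 4000-core in-house cluster offers by itself, and the cross-platform transfer amounted to about 3% of the BAM volume.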

CONCLUSIONS: Even as whole genome datasets grow, ensemble joint calling of SNVs for low-coverage data can be accomplished in a scalable, cost-effective and fast manner by using heterogeneous computing platforms, without compromising the quality of the variants.

DOI: 10.1186/s12859-016-1211-6
Alternate Journal: BMC Bioinformatics
PubMed ID: 27612449
PubMed Central ID: PMC5018196
Grant List: HHSN268201100012C / HL / NHLBI NIH HHS / United States
HHSN268201100009I / HL / NHLBI NIH HHS / United States
HHSN268201100010C / HL / NHLBI NIH HHS / United States
HHSN268201100008C / HL / NHLBI NIH HHS / United States
U01 HL080295 / HL / NHLBI NIH HHS / United States
HHSN268201500001C / HL / NHLBI NIH HHS / United States
HHSN268201100005G / HL / NHLBI NIH HHS / United States
HHSN268201100008I / HL / NHLBI NIH HHS / United States
HHSN268201100007C / HL / NHLBI NIH HHS / United States
N01 HC015103 / HC / NHLBI NIH HHS / United States
HHSN268201100011I / HL / NHLBI NIH HHS / United States
HHSN268201100011C / HL / NHLBI NIH HHS / United States
N01 HC085085 / HC / NHLBI NIH HHS / United States
N01HC55222 / HL / NHLBI NIH HHS / United States
N01HC85086 / HL / NHLBI NIH HHS / United States
HHSN268201100006C / HL / NHLBI NIH HHS / United States
HHSN268201200036C / HL / NHLBI NIH HHS / United States
HHSN268201100005I / HL / NHLBI NIH HHS / United States
HHSN268201500001I / HL / NHLBI NIH HHS / United States
N01 HC085084 / HC / NHLBI NIH HHS / United States
N01HC85082 / HL / NHLBI NIH HHS / United States
N01HC75150 / HL / NHLBI NIH HHS / United States
R01 HG008115 / HG / NHGRI NIH HHS / United States
HHSN268201100009C / HL / NHLBI NIH HHS / United States
N01HC85083 / HL / NHLBI NIH HHS / United States
HHSN268201100005C / HL / NHLBI NIH HHS / United States
N01HC25195 / HL / NHLBI NIH HHS / United States
HHSN268201100007I / HL / NHLBI NIH HHS / United States
R01 AG023629 / AG / NIA NIH HHS / United States
N01 HC045133 / HC / NHLBI NIH HHS / United States
N01HC85080 / HL / NHLBI NIH HHS / United States
N01 HC035129 / HC / NHLBI NIH HHS / United States
N01HC85081 / HL / NHLBI NIH HHS / United States