Home Projects Publications Presentations People News Activities About DCSL Internal
 
<< All Projects Principles and Practice in Scaling Computational Genomics Applications
Summary

Gene sequencing instruments are producing huge volumes of data, straining the capabilities of current database searching algorithms and hindering efforts of researchers analyzing larger collections of data to obtain greater insights. In the space of parallel genomic sequence search, most of the popular softwares, like mpiBLAST, use the database segmentation approach, wherein the entire database is sharded and searched on different nodes. However this approach does not scale well with the increasing length of individual query sequences as well as the rapid growth in size of sequence databases. In this paper, we propose a fine-grained parallelism technique, called Orion, that divides the input query into an adaptive number of fragments and shards the database. Our technique achieves higher parallelism (and hence speedup) and load balancing, while maintaining 100% accuracy. We show that it is 12.3X faster than mpiBLAST for solving a relevant comparative genomics problem.

The classic algorithm for performing sequence alignment, identifying matches between a query and a database of sequences, is the Basic Local Alignment Search Tool (BLAST) [1], [2] (References are from our Supercomputing 2014 paper). BLAST operates by comparing each of the sequences in the input query set against each of the sequences in a database to identify alignments that partially or completely overlap. The more similarity there is, the higher the alignment’s score. E-value ia numerical value that captures the likelihood that the similarity is statistically significant. Alignments with E-value below a certain threshold are output as potential matches by the algorithm. Section II describes the algorithm in more detail. The National Center of Biotechnology Information (NCBI) provides public databases of gene sequences that researchers can search using BLAST.1. Unfortunately, the explosive growth in the number of biological sequences poses a formidable challenge to the current database searching algorithms. In December 2013, the GenBank database—hosted by NCBI—had about 170 million sequences, and the number of bases has doubled approximately every 18 months [3], [19].

Given the exponential growth in the size of sequence databases, and the requirement to query longer sequences, current database searching algorithms struggle to provide the alignment and search results in a timely manner. Early parallel BLAST implementations [5], [7] exploited coarse-grained parallelism: individual queries can be processed simultaneously against the same database. However, while such parallelism improves throughput, it does not help an individual researcher with a single query: For example, a BLAST job with a query sequence of 100,000 contiguous fragments (i.e., contigs or overlapping sequenced data reads) BLASTed against the nonredundant (NR) nucleotide database could take 70 days [30]! To provide genomics researchers with reasonable latency for their searches, exploiting additional parallelism has become a necessity.

The most popular open source parallelization of BLAST is mpiBLAST, using, unsurprisingly, MPI to run BLAST in parallel on clusters [8]. mpiBLAST adopts a natural parallelization strategy. Because BLAST compares the input query against each sequence in the database separately, parallelism can be exploited by performing multiple such comparisons concurrently. mpiBLAST thus shards the database into multiple pieces each containing a subset of the databases’s sequences and distributes the shards across the computational nodes in the cluster. These shards can then be searched independently and simultaneously for alignments with the input query.

Unfortunately, while mpiBLAST can exploit parallelism by sharding large databases, and even by processing multiple input queries in parallel, it has significant limitations for many biological use cases. In long sequence alignment, a long input query is matched against a database. Such use cases are becoming increasingly common. With the rapid expansion of next generation sequencing technologies, the number of organisms whose entire genomes are being sequenced has been growing at a rapid pace. Once a genome is sequenced, it is annotated, which involves (among other processes) comparing the newly-sequenced genome, or parts thereof, with that of a closely-related organism or with the expansive NT database, to establish the evolutionary relations of this newly-sequenced organism. This results in large queries, with the upper bound being the size of the entire genome, which can be millions of nucleotides.

In this scenario, mpiBLAST runs out of parallelization opportunities. There is but one input sequence, so parallelism by processing multiple queries simultaneously is impossible. And increasing the number of database shards to increase parallelism suffers from diminishing returns: even if the database contains enough sequences to profitably create additional shards, additional shards increase scheduling overhead as well as the time required to aggregate the output from each queryshard work unit.

Moreover, mpiBLAST’s parallelization strategy can lead to severe load imbalance with large queries, or with queries of very different sizes. If a query sequence is long, or has many matches with a particular database sequence, it will take a long time to process, while a short query sequence, or one with little similarity to a database sequence can be completed much faster. As a result, the execution time of different queryshard work units can vary significantly, a problem that is only exacerbated as queries get longer [26], [12]. Further, it is difficult to predict what the running time for a unit of work will be from simple metrics as the length of the query [12]. Consequently, the static load balancing approach of mpiBLAST tends to create severe load imbalances among the different nodes processing different work units, as we experimentally show in our evaluation.

To address these concerns, we propose Orion, a new parallel BLAST implementation that exploits finer-grained parallelism than mpiBLAST, achieving both more parallelism in the face of long sequences as well as better load balance. The key insight behind Orion is that a single, long query sequence need not be matched against a database sequence serially; instead, the query can be fragmented into sub-queries (which we call “query fragments”), each of which can be matched against the database independently and in parallel. Figure 1 captures the various levels of parallelism inherent in sequence alignment. The early approaches to sequence alignment primarily targeted the lowest level, processing multiple queries in parallel against the entire database, while mpiBLAST exploits the two lowest levels, processing the same query against different database shards simultaneously. Orion exploits all levels of parallelism: inter-query, intra-database, and intra-query.

In Orion, we limit the size of the overlap by querying the input parameters such as the thresholds in the BLAST algorithm and the penalties due to a mismatch in BLAST, and employ a novel extension and aggregation strategy to avoid missing alignments. Our fragmenting strategy is such that practically there is no loss in accuracy, i.e., every sequence that will be matched successfully in BLAST will also be matched successfully in Orion. However, the overlaps are not so large as to eliminate the scope for intra-query parallelism.

We introduce three chief novelties:
1) We develop an analytical model based on BLAST’s scoring formula that identifies the optimal fragmentation strategy, avoiding redundant work.
2) We introduce a speculative extension strategy that allows alignments that may cross query fragment boundaries to be identified.
3) We build an aggregation algorithm that combines full and partial alignments from each fragment to generate a final set of alignments that matches the original sequential algorithm.

We parallelize and implement our algorithm using the Hadoop MapReduce framework, and demonstrate that our algorithm yields better parallelization, performance and load balance than mpiBLAST, while producing the same results.

 

Achieved Technical
Goals
Publications
Future Work
Students
Code & Data
Funding Source
 
 
465 Northwestern Avenue, West Lafayette, IN 47907   |  dcsl@ecn.purdue.edu   |  +1 765 494 3510
Home |  Projects  |  Publications  |  Presentations  |  People
News  |  Activities |  About DCSL  |  Internal


Last Update: June 28, 2014 14:00 by GMHoward