Radenski, A., Ehwerhemuepha, L., Speeding-Up Codon Analysis on the Cloud with Local MapReduce Aggregation, Information Sciences, Elsevier, 263 (2014), 175-185.
A notable obstacle to higher performance of data-intensive Hadoop MapReduce (MR) bioinformatics algorithms is the large volume of intermediate data that must be sorted, shuffled, and transmitted between mapper and reducer tasks. This difficulty is especially evident in MR codon analysis, which generates voluminous intermediate data that become a bottleneck in basic MR codon analysis algorithms. Our proposed approach to the intermediate data bottleneck is local in-mapper aggregation (or simply local aggregation), a technique that reduces the volume of intermediate data passed between mapper and reducer tasks in MR. We evaluate local aggregation (i) by developing codon analysis MR algorithms with and without local aggregation and (ii) by experimentally measuring their performance on Amazon Web Services (AWS), the Amazon cloud platform. Codon analysis with local aggregation maintains consistently high performance as data sets grow, while basic codon analysis without local aggregation becomes impractically slow even for smaller data sets. Our results can be beneficial (i) to members of the bioinformatics community who need to perform fast and cost-effective nucleotide MR analysis on the cloud and (ii) to computer scientists who strive to increase the performance of MR algorithms.
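The idea of local in-mapper aggregation can be illustrated with a minimal Python sketch. This is not the code distributed with the article; it is an illustration under stated assumptions: each input line is treated as an independent nucleotide sequence read in frame 0, and the names `codons`, `basic_mapper`, `aggregating_mapper`, and `reducer` are hypothetical. The basic mapper emits one pair per codon occurrence, while the aggregating mapper accumulates counts in memory and emits one pair per distinct codon, so far fewer intermediate pairs reach the sort and shuffle phase.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def codons(sequence: str) -> Iterator[str]:
    """Yield non-overlapping 3-letter codons from reading frame 0.

    Assumption: each line is an independent sequence; trailing
    nucleotides that do not form a full codon are ignored.
    """
    seq = sequence.strip().upper()
    for i in range(0, len(seq) - 2, 3):
        yield seq[i:i + 3]


def basic_mapper(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Basic mapper: emit one (codon, 1) pair per codon occurrence."""
    for line in lines:
        for codon in codons(line):
            yield codon, 1


def aggregating_mapper(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Mapper with local aggregation: count codons in memory first,
    then emit one (codon, count) pair per distinct codon."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(codons(line))
    yield from sorted(counts.items())


def reducer(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Sum the counts for each codon; works with either mapper."""
    totals: Counter = Counter()
    for codon, count in pairs:
        totals[codon] += count
    return dict(totals)


# Hypothetical sample input: two short sequences.
sample = ["ATGGCCATGGCC", "ATGTAA"]
basic_pairs = list(basic_mapper(sample))
aggregated_pairs = list(aggregating_mapper(sample))

print(len(basic_pairs), "pairs without local aggregation")
print(len(aggregated_pairs), "pairs with local aggregation")
print(reducer(aggregated_pairs))
```

Both mappers lead to the same totals after reduction; only the number of intermediate pairs differs. Because there are at most 64 distinct codons over the alphabet A, C, G, T, the in-memory table in the aggregating mapper stays small regardless of input size.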
Follow the link below to download Python implementations of our codon analysis algorithms, together with a small set of sample data and instructions on how to execute the code with Elastic MapReduce on the Amazon cloud platform, AWS.
Code plus small set of sample data
Follow the link below to download a larger dataset (267 MB zipped, 1 GB when unzipped). The unzipped dataset can be used with our code (see above) to replicate all experiments.
This download may take a long time because of the large size of the dataset. The exact time depends on the speed of your network connection. For example, from our campus (where this server is located) the download takes up to a minute. From a home location in the same area, the download completes in about 15 minutes. From more remote locations the download may take much longer. Please wait patiently for your browser to complete the download. Thank you.
Article online at http://dx.doi.org/10.1016/j.ins.2013.11.028 (access to the full text of this article will depend on your personal or institutional entitlements)
Last updated: November 2013.