Design and Optimization of MapReduce Streaming Algorithms

Participants

Dr. Atanas Radenski

Professor of Computer Science, Schmid College of Science and Technology, Chapman University, Orange, CA; http://www1.chapman.edu/~radenski/

Dr. Radenski’s research interests are in the areas of scientific computing, big data, parallel and distributed computing, cloud computing, programming languages, object-oriented computing, and e-learning. His research has been supported by long-term grants from the National Science Foundation and the National Aeronautics and Space Administration, and by grants from other agencies. He has authored 85 publications, including 50 journal articles and refereed papers in edited volumes, 28 invited papers, technical reports, and abstracts in proceedings, and 6 books.

Dr. Boyana Norris

Computer Scientist, Mathematics and Computer Science Division, Argonne National Laboratory; http://www.mcs.anl.gov/~norris/

Dr. Norris’s research is in the area of enabling technologies for high-performance simulations in computational science and engineering, with emphasis on automation of the development, deployment, testing, and performance tuning of parallel applications. Specific research areas include (1) source code analysis and transformation, specifically on automatic differentiation, performance analysis, and empirical code tuning; (2) embeddable domain-specific languages for high-performance computing; and (3) quality of service infrastructure for scientific software (including numerical software taxonomy and automated configuration), for optimizing performance, energy use, and resilience of complex applications.

Louis Ehwerhemuepha

Graduate student and adjunct faculty, Schmid College of Science and Technology, Chapman University, Orange, CA; http://www.chapman.edu/our-faculty/louis-ehwerhemuepha

Louis Ehwerhemuepha is a graduate student in Computational Sciences at Chapman University, California. Before coming to Chapman University as a graduate student, Louis obtained a BS degree in Physics from Delta State University, Nigeria and worked in the energy efficiency and conservation sector of Nigeria. His current interests are in biostatistics, bioinformatics and computational biology, and scientific computing.

Contact

Dr. Atanas Radenski, Chapman University; Radenski@chapman.edu; +1-714-744-7657

Project Motivation

We design, optimize, and empirically evaluate MapReduce (MR) streaming algorithms. We choose to work with MR streaming algorithms because they can be easier for domain scientists to understand, modify, and apply; such scientists may not be software experts but can work well with higher-level languages such as Python. In an essential departure from the standard MR semantics, MR streaming does not aggregate intermediate key-value pairs that share the same key. As a result, MR streaming can produce a very large volume of intermediate key-value records that are fed into reducers, and this can become a performance bottleneck. We aim to eliminate this bottleneck through appropriate optimizations, such as in-mapper local aggregation and strip partitioning. Technically, we design optimized MR streaming algorithms for various problems and empirically evaluate their performance on the cloud. The problems we are solving include the simulation of grid-based models, such as heat relaxation and a-life simulations, as well as codon frequency analysis.
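The idea of in-mapper local aggregation can be illustrated with a minimal Python sketch for codon counting. This is not the project's actual implementation; it is an illustration under stated assumptions: each input line holds a fragment of a nucleotide sequence, codons are read as non-overlapping triplets starting at the first base of each line, and the function names map_codons and reduce_counts are hypothetical. The mapper accumulates counts in memory and emits one record per distinct codon rather than one record per occurrence. The reducer tracks key changes itself, since MR streaming delivers sorted but ungrouped records.

```python
from collections import Counter


def map_codons(lines):
    """Mapper with in-mapper local aggregation (illustrative sketch).

    Assumption: each line is a nucleotide fragment read in frame 0;
    trailing bases that do not fill a whole triplet are ignored.
    """
    counts = Counter()
    for line in lines:
        seq = line.strip().upper()
        for i in range(0, len(seq) - 2, 3):
            counts[seq[i:i + 3]] += 1
    # Emit one tab-separated record per distinct codon,
    # instead of one record per codon occurrence.
    for codon in sorted(counts):
        yield f"{codon}\t{counts[codon]}"


def reduce_counts(sorted_lines):
    """Streaming reducer (illustrative sketch).

    Input records arrive sorted by key but not grouped, so the reducer
    detects key boundaries and sums the partial counts itself.
    """
    current, total = None, 0
    for line in sorted_lines:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"
```

In a streaming job, each function would typically be wrapped in a small script that reads records from standard input and prints the yielded lines to standard output. Without local aggregation, the mapper would emit a record such as "ATG 1" for every codon occurrence, which is the volume of intermediate data that the optimization is meant to reduce.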

The Need for Cloud Resources

We use Amazon’s Elastic MapReduce cloud to implement both basic and optimized MR algorithms, to run these implementations, to gather performance data, and to evaluate the performance benefits of our optimization techniques. We use Amazon’s EC2 cloud to configure an in-memory MapReduce server with the Phoenix framework. This permits the porting of our optimized MR streaming algorithms into optimized in-memory MR algorithms, the empirical evaluation of in-memory MR performance, and the analysis of tradeoffs between distributed MR and in-memory MR.

Technically, we use an Elastic MR cluster with up to 20 large instances for experiments with MR streaming algorithms and the S3 cloud to store data and code. We use an EC2 extra-large instance to experiment with optimized in-memory MR algorithms.

Outcomes

Our optimized MR grid-based algorithms can be used as prototypes for the development of large-scale MR algorithms for challenging data-intensive boundary-value problems and can serve as the basis for MR algorithms for large-scale cellular life simulations. Our codon analysis algorithms can be beneficial to members of the bioinformatics community who need to perform fast and cost-effective nucleotide MR analysis on the cloud.

Publications

1. Radenski, A., B. Norris. Distributed Large-Scale Laplace Relaxation on the Cloud with MapReduce (in progress)

2. Radenski, A. Using MapReduce for Distributed Life Simulation on the Cloud (in progress)

3. Radenski, A., L. Ehwerhemuepha. Speeding-Up Codon Analysis on the Cloud with Local MapReduce Aggregation (in progress)

4. Radenski, A. Distributed Simulated Annealing with MapReduce. In Proceedings of the 2012 European Conference on Applications of Evolutionary Computation (EvoApplications'12), Cecilia Chio, Alexandros Agapitos, Stefano Cagnoni, Carlos Cotta, and Francisco Fernández Vega (Eds.). Springer-Verlag, Berlin, Heidelberg, 2012.

Presentations

1. Radenski, A., L. Ehwerhemuepha. Speeding-Up Codon Analysis on the Cloud with Local MapReduce Aggregation. Chapman University Symposium on Big Data and Analytics: The 44th Symposium on the Interface of Computing Science and Statistics, April 2013.

2. Ehwerhemuepha, L., A. Radenski. Codon Analysis on the Cloud: Making the Case for MapReduce Aggregation. SuperComputing’12 poster presentation, November 2012.