SCALLA: A Platform for Scalable One-Pass Analytics Using MapReduce

Boduo Li; Edward Mazur; Yanlei Diao; Andrew McGregor; Prashant Shenoy

doi:10.1145/2389241.2389246


Publisher: Association for Computing Machinery
Copyright: © 2012 by ACM Inc.
ISSN: 0362-5915
DOI: 10.1145/2389241.2389246

Abstract

Boduo Li, Edward Mazur, Yanlei Diao, Andrew McGregor, and Prashant Shenoy, University of Massachusetts Amherst

Today's one-pass analytics applications tend to be data-intensive in nature and require the ability to process high volumes of data efficiently. MapReduce is a popular programming model for processing large datasets using a cluster of machines. However, the traditional MapReduce model is not well-suited for one-pass analytics, since it is geared towards batch processing and requires the dataset to be fully loaded into the cluster before running analytical queries. This article examines, from a systems standpoint, what architectural design changes are necessary to bring the benefits of the MapReduce model to incremental one-pass analytics. Our empirical and theoretical analyses of Hadoop-based MapReduce systems show that the widely used sort-merge implementation for partitioning and parallel processing poses a fundamental barrier to incremental one-pass analytics, despite various optimizations. To address these limitations, we propose a new data analysis platform that employs hash techniques to enable fast in-memory processing, and a new frequent key based technique to extend such processing to workloads that require a large key-state space. Evaluation of our Hadoop-based prototype using real-world workloads shows

Journal

ACM Transactions on Database Systems (TODS) – Association for Computing Machinery

Published: Dec 1, 2012
