Article 27, Publication date: December 2012.
SCALLA: A Platform for Scalable One-Pass Analytics Using MapReduce

Boduo Li, Edward Mazur, Yanlei Diao, Andrew McGregor, and Prashant Shenoy, University of Massachusetts Amherst

Today's one-pass analytics applications tend to be data-intensive in nature and require the ability to process high volumes of data efficiently. MapReduce is a popular programming model for processing large datasets using a cluster of machines. However, the traditional MapReduce model is not well-suited for one-pass analytics, since it is geared towards batch processing and requires the dataset to be fully loaded into the cluster before running analytical queries. This article examines, from a systems standpoint, what architectural design changes are necessary to bring the benefits of the MapReduce model to incremental one-pass analytics. Our empirical and theoretical analyses of Hadoop-based MapReduce systems show that the widely used sort-merge implementation for partitioning and parallel processing poses a fundamental barrier to incremental one-pass analytics, despite various optimizations. To address these limitations, we propose a new data analysis platform that employs hash techniques to enable fast in-memory processing, and a new frequent-key-based technique to extend such processing to workloads that require a large key-state space. Evaluation of our Hadoop-based prototype using real-world workloads shows
ACM Transactions on Database Systems (TODS) – Association for Computing Machinery
Published: Dec 1, 2012