Precomputing architecture for flexible and efficient big data analytics

Nigel Franciscus (n.franciscus@griffith.edu.au) · Xuguang Ren (x.ren@griffith.edu.au) · Bela Stantic (b.stantic@griffith.edu.au)
Institute for Integrated and Intelligent Systems, Griffith University, Gold Coast, QLD, Australia

Abstract The rise of big data brings revolutionary changes to many aspects of our lives. The huge volume of data, along with its complexity, poses big challenges to data analytic applications. Techniques proposed in data warehousing and online analytical processing (OLAP), such as precomputed multidimensional cubes, dramatically improve the response time of analytic queries over relational databases. Some recent works extend similar concepts to NoSQL, such as constructing cubes from NoSQL stores and converting existing cubes into NoSQL stores. However, only limited attention in the literature has been devoted to a precomputing structure within NoSQL databases. In this paper, we present an architecture for answering temporal analytic queries over big data by precomputing the results of granulated chunks of collections which are decomposed from the original large collection. In extensive experimental evaluations on drill-down and roll-up temporal queries over a large amount of data, we demonstrate its effectiveness and efficiency under different settings.

Keywords NoSQL · Data warehouse · Precompute · Temporal

1 Introduction

With the development of data-driven applications in many aspects of our daily lives, it is worth stressing the significance of big data, whose value has already been recognized by both industry and academia. A significant amount of data is being collected and analyzed to support various decision makings, and that amount is expected to increase every year. One of the major tasks in mining big data is to answer analytic queries efficiently. Analytic queries often involve sophisticated aggregations which demand significant computing power. The huge volume of big data, along with the complexity of these queries, poses a big challenge for efficient processing. Aiming to tackle these challenges and to enhance the performance of analytic query processing, the concepts of the data warehouse [13] and OLAP [6] have been introduced.

A data warehouse integrates data from different data sources into large repositories. Data arriving into data warehouses are normally in denormalised multidimensional form and are stored in data cubes to reduce the cost of expensive table joins. A large part of modern OLAP systems are built on top of data warehouses which are stored with additional information (e.g., metadata). An essential and widely used technique in OLAP systems is precomputation, where analytic results are precomputed and materialized in the data warehouse. When a user submits queries, the system simply retrieves the corresponding precomputed results and therefore efficiently returns the final result to the user.

Previous techniques proposed for data warehouses and OLAP systems mainly focus on relational data structures and relational databases. However, it is evident that relational DBMS are struggling to handle large volumes of unstructured data as they do not scale well [20,24]. Moreover, relational DBMS are not well-equipped to handle the new multidimensional networks [4,23,25]. The rise of NoSQL databases [21] has attracted the attention of the database community due to their flexibility in providing a schema-later architecture and their scalability for handling huge amounts of big data.
Various NoSQL databases have been chosen and applied in many domains, which leads to more and more data being stored in NoSQL databases. Consequently, it has become an urgent need to process analytic queries over NoSQL databases efficiently. Some recent works extend the techniques of data warehouses and OLAP into NoSQL. The work of [19] presents strategies for constructing cubes from NoSQL stores. In contrast, the work in [5] proposes a method to convert existing cubes into NoSQL stores. However, there are few works focusing on a precomputing structure dedicated to NoSQL databases.

For many modern big data applications, the complexity of analytic tasks often requires the combination of NoSQL databases and Hadoop. NoSQL databases are known for their schema-less design that flexibly translates any data into the desired format, while Hadoop fills the gap of scalability with its MapReduce framework. These two platforms work in conjunction to achieve large-scale interactive real-time processing. The Hadoop ecosystem is designed as a highly fault-tolerant system for batch processing of long-running complex jobs on very large datasets. Due to the lack of interactive exploration in HDFS, NoSQL databases are often used as the output sink of the MapReduce process for real-time interaction. To speed up the computation even further, efficient precomputation has become a good alternative [10].

Motivated by the above findings, in this paper we present an architecture for answering temporal analytic queries over big data within NoSQL databases, in particular document-oriented and key-value stores. As time is an essential dimension for most data analytic platforms, we choose the temporal query aspect as the focus and starting point of this work. Queries on this specific part of the data (e.g., the timestamp) are costly, often requiring a full scan of all key-value pairs even for simple equality queries [17]. We plan to extend our work to other dimensions in future work. The basic idea of the proposed architecture is to divide the original data into separated, smaller chunks and then precompute the results for each chunk. The precomputed results are then materialized in the NoSQL database. We process the upcoming analytic queries based on the precomputed results.

1.1 Contribution

This paper is the extended version of the precomputing architecture for answering analytic queries for NoSQL databases [11]. We extend our previous work with further practical evaluations of drill-down and roll-up temporal queries over a large amount of data from the application perspective. Specifically:

1. We proposed a technique to index raw temporal data into separated, smaller chunks based on a temporal interval.
2. We designed efficient storage structures for the precomputed results in document-oriented and key-value databases.
3. We answered three types of common query models along with the strategies to answer each query type.
4. We conducted extensive experiments to demonstrate the performance of the proposed architecture in real-world end-to-end applications.

1.2 Organization

The rest of the paper is organised as follows: in Sect. 2, we give some related work; in Sect. 3, we present the details of our precomputing architecture; in Sect. 4, we provide the experiment results; and finally, in Sect. 5, we conclude the paper and indicate some future work.

2 Related work

In this section, we present some related work which we classify into two main categories.

(1) NoSQL databases. According to a survey presented in [12], there are more than 100 NoSQL databases developed for various purposes. Specifically, NoSQL databases can be classified into four classes:

(a) Key-value stores keep the data as key-value pairs where the value can be anything and is treated as opaque binary data; the key is transformed into an index using a hash function. Redis is one of the most widely used key-value databases.
(b) Column-family stores apply a column-oriented architecture, in contrast to the row-oriented architecture of RDBMS. Cassandra and HBase are the two most used column-family databases.
(c) Document databases treat the document as the minimum data unit and are designed deliberately for managing document-oriented information, such as JSON and XML documents. MongoDB is a typical document database designed to handle JSON documents.
(d) Graph databases model the data as graphs and focus more on the relationships between data units. There are over 30 graph database systems, such as Neo4j, Titan, and Sparksee.

In this work, we store data in two NoSQL databases, MongoDB and Redis, which have different pros and cons and use entirely different mechanisms. We present the query processing performance for those two databases based on our precomputing structure.

(2) Data warehouse and OLAP. The concepts of the data warehouse and OLAP were proposed very early, aiming to answer analytic queries efficiently. The key structure in a data warehouse is the cube, which is normally stored as a denormalised multidimensional table in relational databases [3,7]. A large part of modern OLAP systems are built on top of data warehouses and utilize the cubes when processing analytic queries [8] or time-range queries [22]. Some advances have been made to extend OLAP to emerging data, such as imprecise data [2], taxonomies [18], sequences [15], and text [14]. There are some recent works extending the techniques of data warehouses and OLAP into NoSQL. The work of [19] presents strategies for constructing cubes from NoSQL stores. In contrast, the work in [5] gives rules for converting existing cubes into NoSQL stores. However, only a few works study a precomputing structure deliberately within NoSQL databases.

In contrast to previous work, we focus on the processing of analytic queries for NoSQL databases where no data are stored in relational databases.
We propose an index structure suited to answering drill-down and roll-up queries over a large amount of data within a fast response time. Similar to the work in [22], we particularly focus on temporal queries with the time range as the query parameter.

3 Precomputing architecture

In this section, we present the design of the precomputing architecture. We first give an overview of the architecture, then we elaborate each component: (1) raw data indexing, (2) precomputed results structure, and (3) query answering.

3.1 Overview

The overview of the proposed precomputing architecture is shown in Fig. 1. It can be divided into several inter-related components: (1) Raw data indexing, where we collect the raw Twitter data and store it into a NoSQL database (MongoDB, https://www.mongodb.com/) as time-indexed collections; the dotted line represents the temporal indexing structure in MongoDB, where the preprocessed results are suited to the task processing, for example, MapReduce. (2) Precomputed results structure, where we execute analytic jobs (MapReduce based on Hadoop) and then store the precomputed results into a NoSQL database (MongoDB and Redis, https://redis.io/). (3) Query answering, where we apply efficient strategies to answer queries by utilizing the precomputed results through merging. As a case study, we demonstrate our architecture using specific data sources, database platforms, analytic jobs, and processing techniques, as indicated within the above parentheses. However, it is worth noting that our architecture is quite flexible and can be easily extended to other use cases. We present more details about each chosen ingredient below.

Fig. 1 Pre-computing architecture

Data source As shown in Fig. 1, we use Twitter as the data source for a case study. Twitter is an online social networking service that enables users to post short 140-character messages called "tweets". Twitter is widely used in monitoring society trends and user behaviors due to its large user pool [1]. Tweets are natively formatted in JavaScript Object Notation (JSON) and include the textual content as well as the posting time. Every tweet carries several attributes embedded as properties beside the main text (Fig. 2). We test the performance for the Twitter dataset both with and without these attributes.

Fig. 2 Tweet structure
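Since the tweet attributes drive the later indexing and sub-tasks, a concrete, heavily trimmed example may help. The sketch below is illustrative only: field names such as timestamp_ms, text and coordinates follow the Twitter API, but the values and the exact subset of fields shown are our own assumption, not the content of Fig. 2.

```python
# A trimmed tweet as it would sit in a MongoDB collection (illustrative only;
# real tweets carry many more attributes than shown here).
tweet = {
    "created_at": "Thu Sep 29 03:01:07 +0000 2016",
    "timestamp_ms": "1475118067000",   # posting time in epoch milliseconds
    "text": "hello world",
    "user": {"id": 12345, "location": "Gold Coast"},
    "coordinates": {"type": "Point", "coordinates": [153.3, -27.0]},
}
```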
Database platform MongoDB is an open source NoSQL document-store database, founded by 10gen. MongoDB stores data in a document layout consisting of field-value pairs which can be nested as associative arrays. Documents are serialised as JSON and written internally as Binary JSON (BSON). The Twitter data is in JSON format, which is natively supported by MongoDB. We choose MongoDB to store the raw data in our case study due to its in-house flexibility. MongoDB provides some features from relational database management systems such as sorting, compound indexing, and range/equality queries. Additionally, MongoDB has its own aggregation capability and an in-house MapReduce operation. One of the major drawbacks of MongoDB is that it does not guarantee concurrency, which prevents the native MapReduce program from running multi-threaded; this is due to the implementation of the SpiderMonkey JavaScript engine and its thread-safety constraints. In addition, MongoDB's in-house MapReduce has poor analytic libraries compared to Hadoop. This leads us to use Hadoop in conjunction with MongoDB to provide better MapReduce computation.

We also store the end result in a Redis database to compare the end-to-end performance with MongoDB. Redis is a fast open-source key-value store that can instantiate results from the Hadoop computing platform. Since keys may contain strings, hashes, lists, sets, and sorted sets, Redis can be used to serve the final metrics for front-end visualisation out of Hadoop, caching the 'hot' pieces of data in memory for fast access. Combining this simple client with the power of MapReduce lets us write and read data to and from Redis in parallel.

NoSQL databases–Hadoop integration The MongoDB-Hadoop connector (http://api.mongodb.org/hadoop/) is a plugin for Hadoop that integrates with MongoDB as the source and/or sink instead of HDFS. Note that we opt not to use HDFS due to the interactivity of data exploration in MongoDB, although it is evident that reading and writing from Hadoop to HDFS is faster than to MongoDB [9]. Further, our main intention in indexing the raw data is primarily for reading data from MongoDB into Hadoop. At the time this paper is written, there is no official Redis-Hadoop connector for writing the output results from Hadoop to Redis. However, there are some open source connectors that allow integration between Redis and Hadoop. We opt to use Jedis (https://github.com/xetorthio/jedis), a Java client library, to connect both platforms.

Analytic jobs The precomputing architecture is designed to process data that are sequential or order-based. Therefore, it is flexible enough to compute a variety of jobs for both spatial and temporal data. However, we concentrate on temporal data since it typically has lower granularity, which may consistently create an excessive index-seek problem.

Text aggregation is a widely used analytic job in the literature. Its intuitive application is word frequency, which is intensively used to detect hot topics and trends in society. Compared with word frequency, we are sometimes more interested in the frequency of a word-pair (the co-occurrence of two words), as it can help us to detect hidden patterns. Therefore, we choose the job of computing word-pair frequency in our case: given a set of tweets and any word-pair, we compute the number of tweets in which this word-pair co-occurred.

Processing techniques We choose Hadoop as the processor to execute the word-pair jobs. Hadoop is an open source implementation of the MapReduce framework. As previously stated, although MongoDB ships with in-house MapReduce, it has poor analytic libraries compared to Hadoop. Additionally, Hadoop has better integration with other big data platforms through its specialised cluster management such as Zookeeper (http://zookeeper.apache.org/). We later store the precomputed results into both Redis and MongoDB.

3.2 Raw data indexing

It is evident that a tremendous amount of data is difficult to process without proper indexing. However, indexing a high-cardinality attribute, such as the timestamp, is not suitable due to excessive seeks [22]. For example, there will be numerous index entries if we index every specific timestamp of the tweets, which will lead to higher latency. To tackle this problem, in this subsection we introduce the technique of a time interval index inside the collection layer of MongoDB.

Specifically, tweets are grouped into a single collection when their times fall within the same interval. The length of the interval can be tuned based on the dataset; it can be an hour, a day, or a month. We use the timestamp of this time interval as the name of the corresponding indexed collection.
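As a concrete illustration of this bucketing step, a minimal Python sketch could route each tweet into a per-interval MongoDB collection as follows; the timestamp_ms field and the database and collection names are our own assumptions, not something the architecture prescribes.

```python
from pymongo import MongoClient

INTERVAL_MS = 24 * 60 * 60 * 1000   # tunable index interval; here, one day

db = MongoClient("mongodb://localhost:27017")["tweets"]

def index_tweet(tweet):
    """Route a tweet into the collection covering its time interval."""
    ts = int(tweet["timestamp_ms"])   # posting time in epoch milliseconds
    bucket = ts - (ts % INTERVAL_MS)  # floor to the interval boundary
    db[str(bucket)].insert_one(tweet) # collection named by its interval timestamp
```

Naming each collection after its interval's starting timestamp makes locating the collection for a given time a constant-time computation rather than an index lookup.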
By utilizing the time interval index, we dramatically alleviate the cost of index seeking while still being able to support drill-down and roll-up temporal queries. Consider the example index structure in Fig. 3, where we choose a day as the time interval: the tweets posted on the same day (shaded box) are grouped into the same collection. It is worth noting that a week is a higher interval than a day; however, we do not store a separate collection to group the tweets of the same week, as this would dramatically increase the storage size.

Fig. 3 Time interval index structure

In order to support the drill-down and roll-up temporal queries, we precompute the analytic results for each indexed collection; for example, we precompute the results for the day collections in Fig. 3. For each time-range query, the system answers the query using a bottom-up merging approach. Specifically, given a time-range query whose range is more than 1 day, we first select the tweet collections within this range and load their corresponding precomputed results. Then we merge these results to get the final results that answer the query. By implementing this technique, we remove the necessity to precompute/store the result for super-interval collections such as weekly, monthly or yearly ones.

3.3 Precomputed results structure

As discussed in the above subsection, we precompute the analytic results for each indexed collection. In this subsection, we study the structure used to store the precomputed results, which are the frequencies of word-pairs in tweets. We present the structures for two NoSQL databases: MongoDB and Redis. For the self-completeness of this paper, we also present our MapReduce algorithms to compute the frequencies of word-pairs.

MongoDB structure In MongoDB, we use a separate collection to store the results of each indexed collection. Each result collection contains a list of frequency results for word-pairs. The frequency result for any word-pair is in the following document format:

[_id, word_1, word_2, frequency]

where _id is created automatically by MongoDB if not specified, word_1 and word_2 are the words in this word-pair, and frequency is the number of tweets in which this word-pair co-occurred. Consider the example in Fig. 4: the name of the result collection is the timestamp label 1475118067000 (29 Sep 2016), and hello and world co-occurred in 100 tweets which were posted on the day of 29 Sep 2016.

Fig. 4 MongoDB result structure

Redis structure Redis is an in-memory key-value database. We use a combination of the timestamp and the word-pair as the key and the frequency as the value. The format is given as follows:

[time_x_word_1_word_2 : frequency]

Redis supports searching based on key patterns; thus, we can quickly lock down the corresponding set of records when given a specific timestamp and/or word-pair. As we can see in the example in Fig. 5, hello and world co-occurred in 100 tweets posted on 1475118067000 (29 Sep 2016), while they co-occurred in 60 tweets posted on 1474315055000 (19 Sep 2016).

Fig. 5 Redis result structure
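To make the two layouts concrete, here is a minimal sketch of how one precomputed word-pair frequency could be written to, and scanned back from, both stores. The client setup, database name and field names (word1, word2, frequency) are illustrative assumptions on our part.

```python
import redis
from pymongo import MongoClient

results = MongoClient("mongodb://localhost:27017")["wordpair_results"]
kv = redis.Redis(host="localhost", port=6379)

def store_pair(interval_ts, word1, word2, freq):
    # MongoDB: one result collection per indexed interval, named by its timestamp
    results[str(interval_ts)].insert_one(
        {"word1": word1, "word2": word2, "frequency": freq})
    # Redis: composite key "<timestamp>_<word1>_<word2>" -> frequency
    kv.set(f"{interval_ts}_{word1}_{word2}", freq)

def scan_interval(interval_ts):
    # Key-pattern search narrows the scan to one interval's records
    for key in kv.scan_iter(match=f"{interval_ts}_*"):
        yield key.decode(), int(kv.get(key))
```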
Now we give our MapReduce algorithms for computing the frequencies of word-pairs, as shown in Algorithms 1 and 2. We take the tweet collections and a pre-fixed index interval P as input in the map Algorithm 1. We first extract the timestamp T_i (Line 1) and the message (Line 2) from the tweet. Then we compute the index stamp by running a modular computation of T_i on P (Line 3). In Line 4, we tokenize and clean the text message to get a set of meaningful words. After that, for each word-pair with a preserved order, we record its appearance with a count of 1. The reduce Algorithm 2 is straightforward: it sums up the frequencies for each word-pair.

Algorithm 1: Map word-pair
Input: A tweet T_w within the collections, index interval P
1   Timestamp T_i ← ExtractTime(T_w)
2   Tokenizer T_x ← ExtractText(T_w)
3   IndexStamp IndT_i ← T_i / P
4   Tokenizer Tok_x ← TokenizeClean(T_x)
5   for each token t_x ∈ Tok_x do
6       for each token t_y ∈ Tok_x do
7           if strCompare(t_x, t_y) > 0 then
8               Key ← IndT_i · t_x · t_y
9               context.write(Key, one)
10          end
11      end
12  end

Algorithm 2: Reduce word-pair
Input: Key, List values
1   sum ← 0
2   for each value val ∈ values do
3       sum += val
4   end
5   context.write(Key, sum)

The complexity of the MapReduce algorithm is O(N × M²), where N is the number of tweets and M is the number of meaningful words in each tweet.

3.4 Query answering

In the above subsections, we presented the indexing strategy and the structures to store the precomputed results. Now we are ready to study the process of answering user queries. We classify the user queries into three types, (1) single selectivity query, (2) drill-down query, and (3) roll-up query, as captured by the following models:

1. Single selectivity query:
   QUERY data WHERE time = T_x WITH Gra(T_x) = Φ.
2. Drill-down query:
   QUERY data WHERE time = T_x, time = T_y AND T_y − T_x < Φ WITH Gra(T_x, T_y) < Φ.
3. Roll-up query:
   QUERY data WHERE time = T_x, time = T_y AND T_y − T_x > Φ WITH Gra(T_x, T_y) > Φ OR Gra(T_x, T_y) = Φ′ IF Gra(T_x, T_y) = Φ′.

In the above models, we use Φ to denote the interval at which we index the raw data (as mentioned in Sect. 3.2). The function Gra(T_x, T_y) decides the granularity of the time parameter intuitively; for example, Gra(12AM 15 Sep 2016) = hour and Gra(15 Sep 2016) = day. Accordingly, a single selectivity query targets the data falling into exactly one indexed data collection. A drill-down query targets data which are a subset of a single indexed collection. A roll-up query targets data involving multiple indexed collections. We use the parent interval collection Φ′ when the interval given in the roll-up query matches T_y − T_x = Φ′. Consider the following examples, where each query corresponds to one query type respectively:

1. Word-pair frequency on 02/April/2016.
2. Word-pair frequency from 9:00pm of 08/April/2016 to 11:00pm of 08/April/2016.
3. Word-pair frequency from 18/April/2016 to 28/April/2016.

1. It is trivial to answer the single selectivity query, as we only need to navigate by the timestamp to the corresponding result collection of MongoDB (or set of records of Redis).
2. To answer the drill-down query, we need to navigate to the corresponding indexed data collection, fetch the tweets falling into the time range and then execute the word-pair job on the filtered tweets. The time of this process depends on the complexity of the analytic job to be executed and can be very slow if the size of the fetched tweets is large.
3. To answer the roll-up query, we need to merge the multiple result collections in MongoDB (sets of records in Redis) falling into the time range. This process is similar to a table-join in a relational database. Since the weekly interval partially covers some of the daily intervals (18 April 2016–24 April 2016), we select the respective weekly collection and the rest of the daily collections until 28 April 2016.
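The following sketch shows one way this dispatch logic could look for a day-level index (Φ = one day). The function and variable names are our own; in particular, collections_to_merge simply enumerates the per-day result collections covering a range and does not implement the weekly parent-interval (Φ′) optimisation described above.

```python
DAY_MS = 24 * 60 * 60 * 1000  # index interval Phi: one day, in epoch millis

def classify_query(start_ms, end_ms):
    """Map a temporal query onto one of the three query types."""
    span = end_ms - start_ms
    if span == DAY_MS and start_ms % DAY_MS == 0:
        return "single-selectivity"   # exactly one indexed collection
    if span < DAY_MS:
        return "drill-down"           # subset of one indexed collection
    return "roll-up"                  # spans multiple indexed collections

def collections_to_merge(start_ms, end_ms):
    """Names of the per-day result collections covering [start_ms, end_ms)."""
    t = start_ms - (start_ms % DAY_MS)  # floor to the first day boundary
    names = []
    while t < end_ms:
        names.append(str(t))            # collections are named by interval stamp
        t += DAY_MS
    return names
```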
We present a basic algorithm to merge multiple result collections in the word-pair job, as shown in Algorithm 3.

Algorithm 3: MergeResults
Input: Multiple precomputed results T = {T_1 ... T_n}
Output: Final result R
1   HashMap H ← ∅
2   for each collection T_i ∈ T do
3       for each document w ∈ T_i do
4           if w.word_1 w.word_2 is not in H then
5               H(w.word_1 w.word_2) ← w.frequency
6           end
7           else
8               H(w.word_1 w.word_2) ← H(w.word_1 w.word_2) + w.frequency
9           end
10      end
11  end
12  Result R ← JSON(H)
13  return R

The merging algorithm takes multiple results as input and outputs the word-pair result R. A hashmap H is used to temporarily save the frequency (value) of each word-pair (key) (Line 1). The algorithm iterates through each collection and visits each document inside the collection (Lines 2 to 10). For each document, if there is no such word-pair in the hashmap, we add a new word-pair to the hashmap (Lines 4 to 6); if there is already one, we just add up the frequency (Lines 7 to 9). The above algorithm can be very fast if we tune the index interval properly. The merging algorithm for Redis is similar to Algorithm 3; we omit it here.
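A direct Python translation of Algorithm 3 for the MongoDB layout might look as follows; collections.Counter plays the role of the hashmap H, and the database and field names repeat the illustrative assumptions used in the earlier sketches.

```python
from collections import Counter
from pymongo import MongoClient

def merge_results(db, collection_names):
    """Algorithm 3: merge per-interval word-pair frequencies into one result."""
    totals = Counter()
    for name in collection_names:                 # each result collection T_i
        for doc in db[name].find({}, {"_id": 0}): # each document w
            totals[(doc["word1"], doc["word2"])] += doc["frequency"]
    return dict(totals)                           # final result R

# Usage sketch: merge two day-level result collections.
db = MongoClient("mongodb://localhost:27017")["wordpair_results"]
merged = merge_results(db, ["1475118067000", "1475204467000"])
```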
It is worth noting that a larger interval leads to a larger document collection: many queries will then fall into the drill-down type, and when the number of tweets in one indexed collection is large, the time to answer a drill-down query increases. In contrast, a smaller interval leads to many result collections (sets): many queries will then fall into the roll-up type, and excessive merging increases the time to answer a roll-up query. Therefore, there is a trade-off between drill-down and roll-up performance when tuning the index interval.

4 Experiments

Our architecture has already demonstrated its effectiveness within a practical HumanSensor project [11]. In this section, we present our experiment results so as to study the response time of query answering under different data settings. We describe the result of the core processing inside the database and the end-to-end processing time from back-end to front-end.

4.1 Dataset and environment

The Twitter data in our experiments were downloaded through the public API provided by Twitter. We wrapped five datasets which contain 200 × 10³, 400 × 10³, 600 × 10³, 800 × 10³ and 1 million tweets, respectively. For each dataset, we indexed the data according to a day interval. A bigger dataset leads to more indexed collections and more documents within each collection. We synthetically generated three query sets, one for each query type, each of which contains 100 queries. The process of answering selectivity queries and roll-up queries utilized the precomputed results in MongoDB and Redis; thus, we report the average response time of these two types for MongoDB and Redis, respectively. As a drill-down query only involves the indexed data collections, which are stored in MongoDB, we simply report the average response time for it.

Our experiments were conducted on a cluster with 20 nodes; each node is equipped with a quad-core Intel(R) Core(TM) i5-2400 CPU @ 3.10 GHz and 4 GB RAM. We used Hadoop (version 2.6.0), MongoDB (version 3.2.9) and Redis (version 3.0.1).

4.2 Results and analysis

Fig. 6 Document level processing time

Figure 6 depicts the processing time to read a single collection in MongoDB. We measured different numbers of documents, varying from the thousand to the million level. The results show that MongoDB performs the scan at a linear scale: for ten thousand documents, it takes approximately a second to read, and for a million it takes up to 18 s. To compute a time range, we merge the result of each affected collection. To merge multiple collections, we use a join function which takes the value of each key, maps them into a hashmap, and then groups them into the final result.

Fig. 7 Collection merging MongoDB

Figure 7 presents the multi-collection processing time. We measured two different collection sizes, one million and ten thousand documents, respectively. For each collection involved in the merging process, the computation time increases linearly. For collections at the million-document level, the merging cost is approximately 6 s per collection; however, when the number of documents per collection is small, the merging cost is trivial. In practice, keeping the number of documents low benefits the overall processing. The results suggest that dividing a large collection into several partitions greatly benefits the entire processing time.

Fig. 8 Single selectivity

The results of processing the single selectivity query are given in Fig. 8. As we can see, the time cost of answering a selectivity query stays almost constant if the precomputed results are stored in MongoDB, while it shows a linear increase when utilizing the precomputed results stored in Redis. The reason for this is the storage structure of the results in MongoDB and Redis. For MongoDB, we saved the results of an indexed data collection in a separate collection: given a result collection name, locating the corresponding collection is a hash search, which is trivial and almost constant. In contrast, the results of an indexed data collection in Redis are spread over the keys, and the internal pattern search of Redis takes linear time in the number of keys.

Fig. 9 Drill-down

Figure 9 presents the results of answering the drill-down query. As discussed in Sect. 3.4, the time cost of answering a drill-down query depends on the size of the indexed data collection and the complexity of the analytic job. As we can see in Fig. 9, the time experiences a linear increase with the size of the datasets. Note that it takes linear time to execute the word-pair job.

Fig. 10 Roll-up

The results of processing roll-up queries are given in Fig. 10. The time cost for both MongoDB and Redis shows a sharper increase with the size of the datasets. This is because the merging process is similar to the table-join of the relational database, whose time consumption may grow quickly when the data size gets bigger.
However, through proper tuning of the index interval, we can achieve a reasonable response time in practice. MongoDB shows a slightly better performance than Redis; this has the same cause as in answering the selectivity query: it takes more time for Redis to assemble the precomputed results for a given indexed collection.

4.3 Practical scenario

When the choice of an optimal execution plan is not critical, MongoDB has been shown to be a viable alternative to relational databases [16]. We implemented the precomputing system in a real-world social media analysis project, HumanSensor. We use the Twitter dataset as the main source for the analytic tasks. Each tweet is approximately 3 kB in size and is stored in a MongoDB collection, with each collection indexed on a per-day timestamp. Tweets carry information such as the text, the GPS location, and the time posted. Based on these attributes, we define several sub-tasks to monitor public opinion, urban activities, and topic modelling, with the time interval as the parameter. Figure 11 represents the main idea of the system, where the initial query is divided into sub-tasks which are queried according to the temporal information.

Fig. 11 End-to-end processing tasks

Opinion mining is measured through sentiment analysis, which scores the "good and bad" based on a sentiment score. The process relies purely on the tweet text, which is segmented based on supervised learning. Urban activity, on the other hand, is determined from the GPS location provided in the tweet attributes (see Fig. 2).
We visualise each location point on a heatmap to determine the density of urban movements; by visualising the locations, we can predict traffic congestion towards a certain period of time. Lastly, topic modelling is used to capture the trending topics of people's interest in a certain period. By combining these three aspects, we can easily understand the general point of view of a certain region.

The main problem of such complex big data analytics is the processing efficiency from collecting the raw data to the final front-end visualisation. Normally, the process goes through several steps such as data cleaning, segmentation, clustering, and score weighting. There are two ways of processing data in a typical real-world application: the first is to directly grab the raw data and process it in the application layer (front-end); the second is to process the data in the database (back-end) and then send the result to the application. Both ways are acceptable; however, as the data size increases, application-side processing becomes a major bottleneck. As an example, in our use case we use Node.js (https://nodejs.org/en/) as the front-end application, and Fig. 12 depicts the two processes described above; the dotted line indicates raw data being processed in the application layer.

Fig. 12 Front-end vs back-end processing

Finally, we measure the end-to-end information retrieval, starting from the user query down to the sub-task processing within the database in real time. Table 1 shows the average end-to-end processing time for various NoSQL queries, measured over a total of one million tweets. Compared to traditional processing without precomputation, the overall time is reduced from minutes to seconds. When the query complexity increases, the processing time without precomputation grows exponentially due to the additional MapReduce processing tasks, which are further affected by the network bandwidth cost. The precomputing approach, on the other hand, depends on the number of collections, since it only has to scan the collections that fall within the query. With the precomputing architecture, we are able to speed up the temporal merging, which enables real-time processing of complex tasks over big data.

Table 1 Average end-to-end query processing time with and without precomputing

Query | Without precomputing (s) | With precomputing (s)
FIND (sentiment_score = negative) WHERE time = 07/07/17 | 4 | 3
FIND (sentiment_score = negative) AND (location near 153.3,-27) WHERE time = 07/07/17 | 32 | 7
FIND (sentiment_score = negative) AND (location near 153.3,-27) AND (topic = food AND shop) WHERE time = 07/07/17 | 140 | 9
FIND (sentiment_score = negative) AND (location near 153.3,-27) AND (topic = food AND shop) WHERE time > 07/07/17 AND time < 09/07/17 | 187 | 12
FIND (sentiment_score = negative) AND (location near 153.3,-27) AND (topic = food AND shop) WHERE time = 07/07/17 OR time = 09/07/17 | 195 | 11
FIND (sentiment_score = negative) AND (location near 153.3,-27) AND (topic = food AND shop) WHERE time > 07/07/17 AND time < 09/07/17 OR time > 08/08/17 AND time < 10/10/17 | 220 | 24

5 Conclusion

We presented a precomputing architecture suitable for NoSQL databases, in particular MongoDB and Redis, to answer temporal analytic queries. Within the architecture, we proposed indexing techniques and results storage structures as well as query processing strategies. Based on the proposed architecture, we are able to efficiently answer drill-down and roll-up temporal queries over large amounts of data with fast response times. Through integration in a real project running in the Big Data and Smart Analytics lab at Griffith University and through an experimental performance study, we proved the effectiveness of our architecture and demonstrated its efficiency under different settings. We also showed that a document-based store can outperform a key-value store when the data fit in memory. As for future work, it would be interesting to extend the precomputed results over spatial data and to enable a distributed join for the merging functions, which would enable parallel joins and hopefully reduce the time threshold per collection.

Acknowledgements This project was partly funded through a National Environment Science Program (NESP) fund, within the Tropical Water Quality Hub (Project No: 2.3.2).

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

References

1. Becken, S., Stantic, B., Alaei, A.R.: Sentiment analysis in tourism: capitalising on big data. J. Travel Res. (2017). doi:10.1177/0047287517747753
2. Burdick, D., Doan, A., Ramakrishnan, R., Vaithyanathan, S.: OLAP over imprecise data with domain constraints. In: Proceedings of the 33rd International Conference on Very Large Data Bases, pp. 39–50. VLDB Endowment (2007)
3. Chaudhuri, S., Dayal, U.: An overview of data warehousing and OLAP technology. ACM SIGMOD Rec. 26(1), 65–74 (1997)
4. Chen, C., Yan, X., Zhu, F., Han, J., Philip, S.Y.: Graph OLAP: towards online analytical processing on graphs. In: Eighth IEEE International Conference on Data Mining (ICDM'08), pp. 103–112. IEEE (2008)
5. Chevalier, M., El Malki, M., Kopliku, A., Teste, O., Tournier, R.: Implementing multidimensional data warehouses into NoSQL. In: International Conference on Enterprise Information Systems (ICEIS 2015), pp. 172–183 (2015)
6. Codd, E.F., Codd, S.B., Salley, C.T.: Providing OLAP (on-line analytical processing) to user-analysts: an IT mandate. Codd and Date 32 (1993)
7. Coronel, C., Morris, S.: Database Systems: Design, Implementation, and Management. Cengage Learning (2016)
8. Cuzzocrea, A., Bellatreche, L., Song, I.Y.: Data warehousing and OLAP over big data: current challenges and future research directions. In: Proceedings of the Sixteenth International Workshop on Data Warehousing and OLAP, pp. 67–70. ACM (2013)
9. Dede, E., Govindaraju, M., Gunter, D., Canon, R.S., Ramakrishnan, L.: Performance evaluation of a MongoDB and Hadoop platform for scientific data analysis. In: Proceedings of the 4th ACM Workshop on Scientific Cloud Computing, pp. 13–20. ACM (2013)
10. Doulkeridis, C., Nørvåg, K.: A survey of large-scale analytical query processing in MapReduce. VLDB J. 23(3), 355–380 (2014)
11. Franciscus, N., Ren, X., Stantic, B.: Answering temporal analytic queries over big data based on precomputing architecture. In: Intelligent Information and Database Systems—9th Asian Conference, ACIIDS 2017, Kanazawa, Japan, April 3–5, 2017, Proceedings, Part I, pp. 281–290 (2017)
12. Gudivada, V.N., Rao, D., Raghavan, V.V.: Renaissance in database management: navigating the landscape of candidate systems. IEEE Comput. 49(4), 31–42 (2016)
13. Inmon, W.H., Hackathorn, R.D.: Using the Data Warehouse, vol. 2. Wiley, New York (1994)
14. Lin, C.X., Ding, B., Han, J., Zhu, F., Zhao, B.: Text cube: computing IR measures for multidimensional text database analysis. In: Eighth IEEE International Conference on Data Mining (ICDM'08), pp. 905–910. IEEE (2008)
15. Lo, E., Kao, B., Ho, W.S., Lee, S.D., Chui, C.K., Cheung, D.W.: OLAP on sequence data. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 649–660. ACM (2008)
16. Mahmood, K., Risch, T., Zhu, M.: Utilizing a NoSQL data store for scalable log analysis. In: Proceedings of the 19th International Database Engineering & Applications Symposium, pp. 49–55. ACM (2015)
17. Ntarmos, N., Patlakas, I., Triantafillou, P.: Rank join queries in NoSQL databases. Proc. VLDB Endow. 7(7), 493–504 (2014)
18. Qi, Y., Candan, K.S., Tatemura, J., Chen, S., Liao, F.: Supporting OLAP operations over imperfectly integrated taxonomies. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 875–888. ACM (2008)
19. Scriney, M., Roantree, M.: Efficient cube construction for smart city data. In: Proceedings of the Workshops of the EDBT/ICDT 2016 Joint Conference (2016)
20. Stantic, B., Pokorny, J.: Opportunities in big data management and processing. Front. Artif. Intell. Appl. 270, 15–26 (2014). IOS Press
21. Stonebraker, M.: SQL databases v. NoSQL databases. Commun. ACM 53(4), 10–11 (2010)
22. Tao, Y., Papadias, D.: The MV3R-tree: a spatio-temporal access method for timestamp and interval queries. In: Proceedings of the Very Large Data Bases Conference (VLDB), 11–14 September, Rome, pp. 431–440 (2001)
23. Wu, L., Sumbaly, R., Riccomini, C., Koo, G., Kim, H.J., Kreps, J., Shah, S.: Avatara: OLAP for web-scale analytics products. Proc. VLDB Endow. 5(12), 1874–1877 (2012)
24. Zhao, H., Ye, X.: A practice of TPC-DS multidimensional implementation on NoSQL database systems. In: Technology Conference on Performance Evaluation and Benchmarking, pp. 93–108. Springer (2013)
25. Zhao, P., Li, X., Xin, D., Han, J.: Graph cube: on warehousing and OLAP multidimensional networks. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, pp. 853–864. ACM (2011)

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Precomputing architecture for flexible and efficient big data analytics

Free
10 pages

Loading next page...
 
/lp/springer_journal/precomputing-architecture-for-flexible-and-efficient-big-data-arsr0pTVzK
Publisher
Springer Journals
Copyright
Copyright © 2018 by The Author(s)
Subject
Computer Science; Information Systems and Communication Service; Artificial Intelligence (incl. Robotics); Computer Applications; e-Commerce/e-business; Computer Systems Organization and Communication Networks; Computational Intelligence
ISSN
2196-8888
eISSN
2196-8896
D.O.I.
10.1007/s40595-018-0109-9
Publisher site
See Article on Publisher Site

Abstract

The rising of big data brings revolutionary changes to many aspects of our lives. Huge volume of data, along with its complexity poses big challenges to data analytic applications. Techniques proposed in data warehousing and online analytical processing, such as precomputed multidimensional cubes, dramatically improve the response time of analytic queries based on relational databases. There are some recent works extending similar concepts into NoSQL such as constructing cubes from NoSQL stores and converting existing cubes into NoSQL stores. However, only limited attention in literature have been devoted to precomputing structure within the NoSQL databases. In this paper, we present an architecture for answering temporal analytic queries over big data by precomputing the results of granulated chunks of collections which are decomposed from the original large collection. In extensive experimental evaluations on drill-down and roll-up temporal queries over large amount of data we demonstrated the effectiveness and efficiency under different settings. Keywords NoSQL · Data warehouse · Precompute · Temporal 1 Introduction Data warehouse integrates data from different data sources into large repositories. Data arriving into data warehouses are With the development of data-driven applications in many normally in denormalise multidimensional form which will aspects of our daily lives, it is worthy to praise the signifi- be stored into data cubes to reduce the cost of costly table cance of big data whose value has already been recognized joins. A large part of modern OLAP systems are built on top by both industry and academia. A significant amount of data of data warehouses which are stored with additional informa- is being collected and analyzed to support various decision tion (e.g., metadata). An essential and widely used technique makings, and that amount is expected to increase every year. in OLAP systems is precomputation where analytic results One of the major tasks in mining big data is to answer the are precomputed and materialized in the data warehouses. analytic queries efficiently. Analytic queries often involve When a user submits queries, the system simply retrieves the sophisticated aggregations which demand significant com- corresponding pre-computed results and therefore efficiently puting powers. The huge volume of big data, along with returns the final result to the user. complexity of these queries pose a big challenge for effi- Previous techniques proposed in data warehouses and cient processing. Aiming to tackle these challenges and to OLAP systems are mainly focusing on relational data struc- enhance the performance of analytic query processing, the ture and relational databases. However, it is evident that concept of data warehouse [13] and OLAP [6] have been relational DBMS are struggling to handle large volume of introduced. unstructured data as they do not scale well [20,24]. More- over, relational DBMS are not well-equipped to handle the new multidimensional networks [4,23,25]. The rise of B Nigel Franciscus NoSQL databases [21] has attracted the attention of database n.franciscus@griffith.edu.au community due to its flexibility in providing schema-later Xuguang Ren architecture and its scalability for handling huge amount of x.ren@griffith.edu.au big data. Various NoSQL databases have been chosen and Bela Stantic applied in many domains, which leads to more and more b.stantic@griffith.edu.au data being stored in NoSQL databases. 
Consequently, it has become an urgent need to process analytic queries based Institute for Integrated and Intelligent Systems, Griffith University, Gold Coast, QLD, Australia 123 134 Vietnam Journal of Computer Science (2018) 5:133–142 on the NoSQL databases efficiently. Some recent works are 3. We answered three types of common query models along extending the techniques of data warehouses and OLAP into with the strategies to answer each query type. NoSQL. The work of [19] present strategies for constructing 4. We conducted extensive experiments to demonstrate the cubes from NoSQL stores. In contrast, the work in [5]pro- performance of proposed architecture in the real world poses method to convert existing cubes into NoSQL stores. end-to-end applications. However, there are few works focusing on the precomputing structure dedicated to NoSQL databases. 1.2 Organization For many modern big data applications, the complexity of analytic tasks often require the combination of NoSQL The rest of the paper is organised as follows: in Sect. 2,we databases and Hadoop. NoSQL databases have been known give some related works; in Sect. 3, we present the details for its schema-less design that flexibly translate any data into of our precomputing architecture; in Sect. 4, we provide the the desired format while Hadoop fills the gap of scalabil- experiment results; and finally in Sect. 5 we conclude the ity with its MapReduce framework. These two platforms paper and indicate some future work. work in conjunction to achieve a large-scale interactive real- time processing. Hadoop ecosystem is designed as a highly fault-tolerant system for batch processing of long-running 2 Related work complex jobs on very large datasets. Due to the lack of inter- active exploration in HDFS, often NoSQL databases are used In this section, we present some related work which we clas- as the output sink for the MapReduce process for real-time sify into two main categories. interaction. To speed up the computation even further, an (1) NoSQL database. According to a survey presented in efficient pre-computation has becoming a good alternative [12] there are more than 100 NoSQL databases developed [10]. for various purposes. Specifically, No-SQL databases can be Motivated by the above findings, in this paper we present classified into four classes: an architecture for answering temporal analytic queries over big data within NoSQL database, in particular document- (a) Key-value stores the data as key-value pairs where the oriented and key-value store. As the time is an essential value can be anything and is treated as opaque binary data, dimension for most of data analytic platforms, we choose the key is transformed into an index using a hash function. temporal queries aspect as focus and as the starting point Redis is one of the widely used key-value databases. of this work. Queries on this specific part of the data (e.g., (b) Column-family applies an column-oriented architecture timestamp) are costly, often requiring full scan of all key- which is contrast to the row-oriented architecture in value pair despite for simple equality queries [17]. We plan RDBMS. Cassandra and HBase are two most used extend our work into other dimensions in future works. The column-family databases. 
basic idea of proposed architecture is to divide the original (c) Document-database treats the document as the mini- data into separated and smaller chunks and then precompute mum data unit and is designed deliberately for managing the results for each chunk. The precomputed results are then document-oriented information, such as JSON, XML materialized in the NoSQL database. We process the upcom- documents. MongoDB is a typical document database ing analytic queries based on the precomputed results. which is designed to handle JSON documents. (d) Graph-database models the data as graphs and focuses 1.1 Contribution more on the relationships between data units. There are over 30 graph database systems such as Neo4j, Titan, and This paper is the extended version of the precomputing archi- Sparksee. tecture for answering analytic queries for NoSQL databases [11]. We extend our previous work with further practical eval- In this work we stored data in two NoSQL databases, Mon- uations of the drill-down and roll-up temporal queries over goDB and Redis, which have different pros and cons and are a large amount of data in the application perspective, specif- using entirely different mechanisms. However, we present ically: the query processing performance for those two databases based on our pre-computing structure. 1. We proposed the technique to index raw temporal data (2) Data warehouse and OLAP. The concept of data ware- into separated and smaller chunks based on temporal house and OLAP have been proposed very early aiming to interval. answer analytic queries efficiently. The key structure in data 2. We designed efficient storage structures for the precom- warehouse is the cube which is normally stored as a denor- puted results in the document-oriented and key-value malised multidimensional table in relational databases [3,7]. databases. A large part of modern OLAP systems are built on top of 123 Vietnam Journal of Computer Science (2018) 5:133–142 135 Fig. 1 Pre-computing architecture data warehouses and utilize the cubes when processing ana- raw Twitter data and store them into a NoSQL database lytic queries [8] or time-range queries [22]. Some advances (MongoDB) as time-indexed collections. The dotted line rep- have been made to extend OLAP into emerging data, such as resents the temporal indexing structure in MongoDB where imprecise data [2], taxonomies [18], sequence [15], and text the preprocessed results are suited to the task processing, [14]. There are some recent works extending the techniques for example, mapreduce. (2) Precompute results structure, of data warehouses and OLAP into NoSQL. The work of where we execute analytic jobs (MapReduce based on [19] present strategies for constructing cubes from NoSQL Hadoop) and then store the precomputed results into NoSQL 1 2 stores. In contrast, the work in [5] gives the rules in con- database (MongoDB and Redis ). (3) Query answering, verting existing cubes into NoSQL stores. However, only where we apply efficient strategies to answer queries by uti- few works studying the precomputing structure deliberately lizing the precomputed results through merging. As a case within NoSQL databases. study, we demonstrate our architecture using specific data In contrast to previous work, we focus on the processing sources, database platforms, analytic jobs and processing of analytic queries for NoSQL databases where no data are techniques in this paper, as indicated within the above paren- stored in relational databases. 
We propose an index structure theses. However, it is worth noting that our architecture is suited to answer drill-down and roll-up queries over large quite flexible and can be easily extended to other use cases. amount of data within fast response time. Similar to work in We present more details about each chosen specific ingredi- [22], we particularly focus on the temporal queries with the ent. time-range as the query parameter. Data source As shown in Fig. 1, we use Twitter as the data source as a case study. Twitter is an online social net- working service that enables users to post short 140-character messages called "tweets". Twitter is widely used in monitor- 3 Precomputing architecture ing society trends and user behaviors due to its large user pool [1]. The tweets are naturally formatted into JavaScript In this section, we present design of the precomputing archi- Object Notation (JSON) and they include the textual content tecture. We first give an overview of the architecture, then as well as the posted time. Every tweet consists of several we elaborate each component, (1) Raw data indexing, (2) attributes embedded as the property beside the main text (Fig. Precomputed results structure and (3) Query answering. 2). We test the performance for Twitter dataset both with the attributes and without the attributes. 3.1 Overview The overview of proposed precomputing architecture is shown in Fig. 1. It can be divided into several inter-related MongoDB, https://www.mongodb.com/. components: (1) Raw data indexing, where we collect the Redis, https://redis.io/. 123 136 Vietnam Journal of Computer Science (2018) 5:133–142 Hadoop computing platform. Since keys may contain strings, hashes, lists, sets and sorted sets, Redis can be used to support the final metrics for front end visualisation to serve data out of Hadoop, caching the ‘hot’ pieces of data in-memory for fast access. Combining this simple client with the power of MapReduce will let you write and read data to and from Redis in parallel. NoSQL databases–Hadoop integration The MongoDB- Hadoop connector is a plugin for Hadoop to integrate with MongoDB as the source and/or sink instead of HDFS. Note that we opt not to use HDFS due to the the interactivity of data exploration in MongoDB, although it is evident that reading and writing time from Hadoop to HDFS is faster compared with MongoDB [9]. Further, our main intention to index the raw data is primarily for reading data from MongoDB to Hadoop. At this time this paper is written there is no offi- cial Redis-Hadoop connector for writing the output result to Redis from Hadoop available. However, there are some open source connectors that allows the integration between Redis and Hadoop. We opt to use Jedis, a Java client library to connect both platform. Analytic jobs The precomputing architecture is designed to process data that are sequential or based on the order. Therefore, it is flexible to compute a variety of jobs for both spatial and temporal data. However, we consider the importance of temporal data since they typically have lower granularity which may consistently create excessive seek Fig. 2 Tweet structure index problem. Text aggregation is a widely used analytic job in many lit- eratures. Its intuitive application is the word frequency which Database platform MongoDB is an open source NoSQL is intensively used to detect hot topics and trends in the soci- document-store database, founded by 10gen. MongoDB ety. 
Compared with word frequency, sometimes we are more stores data in document layout consists of field-value which interested in the frequency of word-pair (co-occurrence of can be nested as an associate arrays. Documents are serialised two words) as it can help us to detect hidden patterns. There- in JSON and written internally as Binary JSON (BSON). The fore, we choose the job of computing word-pair frequency Twitter data is in JSON format which is generally supported in our case. That is given a set of tweets and any word-pair, by the MongoDB. We choose MongoDB to store the raw data we compute the number of tweets in which this word-pair co-occurred. in our case study due to its in-house flexibility. MongoDB provides some features from relational database management Processing techniques We choose Hadoop as the proces- sor to execute the word-pair jobs. Hadoop is an open source system like sorting, compound indexing and range/equal queries. Additionally, MongoDB has its own aggregation implementation of MapReduce framework. As previously capability and in-house MapReduce operation. One of the stated, although MongoDB ships with in-house MapReduce, major drawbacks from MongoDB is that it does not guaran- it has poor analytic libraries compared to Hadoop. Addi- tee concurrency which limit the native MapReduce program tionally, Hadoop has better integration with other big data from running in multi-thread. This is due to the implemen- platforms with its specialised cluster management such as tation of SpiderMonkey JavaScript engine, known for its Zookeeper . We later store the precomputed results into both threadsafe. In addition, MongoDB in-house MapReduce has Redis and MongoDB. poor analytic libraries compared to Hadoop. This leads us to use Hadoop in conjunction to provide better MapReduce computation. MongoHadoop, http://api.mongodb.org/hadoop/. We also store the end result in Redis database to compare the end-to-end performance with MongoDB. Redis is a fast Jedis, https://github.com/xetorthio/jedis. open-source key-value store that can instantiate result from Zookeeper, http://zookeeper.apache.org/. 123 Vietnam Journal of Computer Science (2018) 5:133–142 137 Fig. 3 Time interval index structure 3.2 Raw data indexing necessity to precompute/store the result for super-interval collection such as weekly, monthly or yearly. It is evident that tremendous amount of data is difficult to process without proper indexing. However indexing the high- 3.3 Precomputed results structure cardinality attribute, such as timestamp, is not suitable due to excessive seek [22]. For example, there will be numerous As discussed in the above subsection, we precompute the ana- index entries if we index every specific timestamp for the lytic results for each indexed collection. In this subsection, tweets, which will lead to a higher latency. To tackle this we study the structure to store the precompute results which problem, in this subsection, we introduce the technique of are the frequencies of word-pair in tweets. We present the time interval index inside the collection layer of MongoDB. structures for two NoSQL databases: MongoDB and Redis. Specifically, tweets are grouped into a single collection For the self-completeness of this paper, we also present our where the time of those tweets are within the same interval. MapReduce algorithms to compute the frequencies of word- The length of the interval can be tuned based on the dataset; pair. it can be an hour, a day, or a month. 
3.2 Raw data indexing

It is evident that a tremendous amount of data is difficult to process without proper indexing. However, indexing a high-cardinality attribute such as a timestamp is not suitable due to excessive seeks [22]. For example, there would be numerous index entries if we indexed every specific timestamp of the tweets, which would lead to higher latency. To tackle this problem, in this subsection we introduce the technique of a time interval index inside the collection layer of MongoDB.

Specifically, tweets whose times fall within the same interval are grouped into a single collection. The length of the interval can be tuned based on the dataset; it can be an hour, a day, or a month. We use the timestamp of this time interval as the name of the corresponding indexed collection. By utilizing the time interval index, we dramatically alleviate the cost of index seeking while still being able to support drill-down and roll-up temporal queries. Consider the example index structure in Fig. 3, where we choose a day as the time interval. The tweets posted on the same day (shaded box) are grouped into the same collection. It is worth noting that a week is a higher interval than a day; however, we do not store a separate collection grouping the tweets of the same week, as this would dramatically increase the storage size.

Fig. 3 Time interval index structure

In order to support drill-down and roll-up temporal queries, we precompute the analytic results for each indexed collection. For example, we precompute the results for the day collections in Fig. 3. For each time-range query, the system answers the query using a bottom-up merging approach. Specifically, given a time-range query whose range spans more than one day, we first select the tweet collections within this range and load their corresponding precomputed results. Then we merge these results into the final result that answers the query. By implementing this technique, we remove the necessity to precompute and store results for super-interval collections such as weekly, monthly or yearly ones.

3.3 Precomputed results structure

As discussed in the above subsection, we precompute the analytic results for each indexed collection. In this subsection, we study the structures used to store the precomputed results, which are the frequencies of word-pairs in tweets. We present the structures for two NoSQL databases, MongoDB and Redis. For the self-completeness of this paper, we also present our MapReduce algorithms for computing the word-pair frequencies.

MongoDB structure. In MongoDB, we use a separate collection to store the results of each indexed collection. Each result collection contains a list of frequency results for word-pairs. Each frequency result for a word-pair is stored in the following document format:

[_id, word_1, word_2, frequency]

where _id is created automatically by MongoDB if not specified, word_1 and word_2 are the words of the word-pair, and frequency is the number of tweets in which this word-pair co-occurred. Consider the example in Fig. 4: the result collection is named with the timestamp label 1475118067000 (29 Sep 2016), and hello and world co-occurred in 100 tweets posted on 29 Sep 2016.

Fig. 4 MongoDB result structure

Redis structure. Redis is an in-memory key-value database. We use a combination of the timestamp and the word-pair as the key and the frequency as the value. The format is as follows:

[time_x_word_1_word_2 : frequency]

Redis supports searching based on key patterns; thus we can quickly lock down the corresponding set of records when given a specific timestamp and/or word-pair. As we can see in the example in Fig. 5, hello and world co-occurred in 100 tweets posted on 1475118067000 (29 Sep 2016), while they co-occurred in 60 tweets posted on 1474315055000 (19 Sep 2016).

Fig. 5 Redis result structure
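To illustrate the key-pattern search on this structure, the following is a minimal Jedis sketch using glob-style patterns with the KEYS command; the host and the example keys are assumptions. Note that KEYS walks the whole keyspace, which is consistent with the linear cost observed for Redis in Sect. 4.2.

import java.util.Set;
import redis.clients.jedis.Jedis;

public class WordPairLookup {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // All word-pair results of one indexed day collection.
            Set<String> dayKeys = jedis.keys("1475118067000_*");
            System.out.println(dayKeys.size() + " pairs on 29 Sep 2016");
            // The hello/world pair across all indexed days.
            for (String key : jedis.keys("*_hello_world")) {
                System.out.println(key + " -> " + jedis.get(key));
            }
        }
    }
}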
Now we give our MapReduce algorithms for computing the frequencies of word-pairs, as shown in Algorithms 1 and 2. We take the tweet collections and a pre-fixed index interval P as input of the map Algorithm 1. We first extract the timestamp T_i (Line 1) and the message (Line 2) from the tweet. Then we compute the index stamp by a modular computation of T_i on P (Line 3). In Line 4, we tokenize and clean the text message to get a set of meaningful words. After that, for each word-pair in a preserved order, we record its appearance with a count of 1. The reduce Algorithm 2 is straightforward: it sums up the frequencies of each word-pair.

Algorithm 1: Map word-pair
Input: A tweet T_w within the collections, index interval P
1  Timestamp T_i ← ExtractTime(T_w)
2  Tokenizer T_x ← ExtractText(T_w)
3  IndexStamp IndT_i ← T_i / P
4  Tokenizer Tok_x ← TokenizeClean(T_x)
5  for each token t_x ∈ Tok_x do
6    for each token t_y ∈ Tok_x do
7      if strCompare(t_x, t_y) > 0 then
8        Key ← concat(IndT_i, t_x, t_y)
9        context.write(Key, one)
10     end
11   end
12 end

Algorithm 2: Reduce word-pair
Input: Key, List values
1  sum ← 0
2  for each value val ∈ values do
3    sum += val
4  end
5  context.write(Key, sum)

The complexity of the MapReduce algorithm is O(N × M²), where N is the number of tweets and M is the number of meaningful words in each tweet, since all ordered pairs of the M tokens are enumerated per tweet.
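For concreteness, here is a hedged Java sketch of Algorithms 1 and 2 as Hadoop mapper and reducer classes, written against the MongoDB-Hadoop connector's BSON input values. The BSON field names ("created_at", "text"), the one-day interval and the whitespace tokenizer are assumptions; the paper's TokenizeClean step is reduced to a trivial lowercase/split placeholder.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.bson.BSONObject;

// Algorithm 1: emit (indexStamp_word1_word2, 1) for each ordered word-pair.
public class WordPairMapper extends Mapper<Object, BSONObject, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private static final long P = 86_400_000L; // index interval P: one day in milliseconds

    @Override
    protected void map(Object id, BSONObject tweet, Context context)
            throws IOException, InterruptedException {
        // Field names and the epoch-millisecond type are assumptions for this sketch.
        long ts = ((Number) tweet.get("created_at")).longValue();
        String text = (String) tweet.get("text");
        if (text == null) return;
        long indexStamp = ts / P;                            // Line 3 of Algorithm 1
        String[] tokens = text.toLowerCase().split("\\W+");  // stand-in for TokenizeClean
        for (String tx : tokens) {
            for (String ty : tokens) {
                if (tx.compareTo(ty) > 0) {                  // keep one order per pair
                    context.write(new Text(indexStamp + "_" + tx + "_" + ty), ONE);
                }
            }
        }
    }
}

// Algorithm 2: sum up the occurrences of each word-pair key.
class WordPairReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}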
3.4 Query answering

In the above subsections, we presented the indexing strategy and the structures for storing the precomputed results. Now we are ready to study the process of answering user queries. We classify user queries into three types, (1) Single selectivity query, (2) Drill-down query and (3) Roll-up query, as captured by the following models:

1. Single selectivity query
QUERY data WHERE time = T_x WITH Gra(T_x) = Φ.
2. Drill-down query
QUERY data WHERE time = T_x, time = T_y AND T_y − T_x < Φ WITH Gra(T_x, T_y) < Φ.
3. Roll-up query
QUERY data WHERE time = T_x, time = T_y AND T_y − T_x > Φ WITH Gra(T_x, T_y) > Φ OR Gra(T_x, T_y) = Φ′ IF Gra(T_x, T_y) = Φ′.

In the above models, we use Φ to denote the interval with which we index the raw data (as mentioned in Sect. 3.2). The function Gra(T_x, T_y) determines the granularity of the time parameters T_i; intuitively, for example, Gra(12AM 15 Sep 2016) = hour and Gra(15 Sep 2016) = day. Accordingly, the single selectivity query targets data falling into exactly one indexed data collection; the drill-down query targets data which are a subset of a single indexed collection; and the roll-up query targets data that involve multiple indexed collections. We use the parent interval collection Φ′ when the interval given in the roll-up query matches T_y − T_x = Φ′. Consider the following examples, where each query corresponds to one query type respectively:

1. Word-pair frequency on 02/April/2016.
2. Word-pair frequency from 9:00pm of 08/April/2016 to 11:00pm of 08/April/2016.
3. Word-pair frequency from 18/April/2016 to 28/April/2016.

1. It is trivial to answer the single selectivity query, as we only need to navigate to the corresponding result collection of MongoDB (set of records in Redis) by the timestamp. 2. To answer the drill-down query, we need to navigate to the corresponding indexed data collection, fetch the tweets falling into the time range and then execute the word-pair job on the filtered tweets. The time of this process depends on the complexity of the analytic job to be executed and can be very slow if the size of the fetched tweets is large. 3. To answer the roll-up query, we need to merge multiple result collections in MongoDB (sets of records in Redis) falling into the time range. This process is similar to a table-join in a relational database. Since the weekly interval partially covers the daily intervals (18 April 2016 to 24 April 2016), we select the respective weekly collection and the remaining daily collections until 28 April 2016.

We present a basic algorithm to merge multiple result collections in the word-pair job, as shown in Algorithm 3. The merging algorithm takes multiple results as input and outputs the word-pair result R. A hashmap H is used to temporarily hold the frequency (value) of each word_pair (key) (Line 1). The algorithm iterates through each collection and visits each document inside it (Lines 2 to 10). For each document, if the word_pair is not yet in the hashmap, we add it as a new entry (Lines 4 to 6); if it is already present, we simply add up the frequency (Lines 7 to 9). The above algorithm can be very fast if we tune the index interval properly. The merging algorithm for Redis is similar to Algorithm 3; we omit it here.

Algorithm 3: MergeResults
Input: Multiple precomputed results T = {T_1 ... T_n}
Output: final result R
1  HashMap H ← ∅
2  for each collection T_i ∈ T do
3    for each document w ∈ T_i do
4      if w.word_1 w.word_2 is not in H then
5        H(w.word_1 w.word_2) ← w.frequency
6      end
7      else
8        H(w.word_1 w.word_2) ← H(w.word_1 w.word_2) + w.frequency
9      end
10   end
11 end
12 Result R ← JSON(H)
13 return R

It is worth noting that a larger interval leads to larger document collections, and many queries will fall into the drill-down type; when the number of tweets in one indexed collection is large, the time to answer a drill-down query increases. In contrast, a smaller interval leads to many result collections (sets), and many queries will fall into the roll-up type; excessive merging increases the time to answer a roll-up query. Therefore, tuning the index interval is a trade-off between the performance of drill-down and roll-up queries.
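A hedged Java sketch of Algorithm 3 over the MongoDB result collections, using the synchronous MongoDB Java driver, follows. The database name and the flattened field names (word1, word2, frequency) are illustrative assumptions about how the result documents are laid out.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class MergeResults {
    // Algorithm 3: sum the word-pair frequencies across several result collections.
    public static Map<String, Integer> merge(MongoDatabase db, List<String> collections) {
        Map<String, Integer> h = new HashMap<>();                   // Line 1
        for (String name : collections) {                           // Line 2
            for (Document w : db.getCollection(name).find()) {      // Line 3
                String pair = w.getString("word1") + "_" + w.getString("word2");
                h.merge(pair, w.getInteger("frequency"), Integer::sum); // Lines 4-9
            }
        }
        return h;                                                   // JSON serialisation omitted
    }

    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("results");       // assumed database name
            Map<String, Integer> merged =
                    merge(db, List.of("1475107200000", "1475193600000"));
            System.out.println(merged.size() + " distinct word-pairs");
        }
    }
}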
4 Experiments

Our architecture has already demonstrated its effectiveness within the practical HumanSensor project [11]. In this section, we present our experimental results so as to study the response time of query answering under different data settings. We describe the results of the core processing inside the database as well as the end-to-end processing time from back-end to front-end.

4.1 Dataset and environment

The Twitter data in our experiments were downloaded through the public API provided by Twitter. We prepared 5 datasets which contain 200 × 10³, 400 × 10³, 600 × 10³, 800 × 10³ and 1 million tweets, respectively. For each dataset, we indexed the data according to a day interval; a bigger dataset leads to more indexed collections and more documents within each collection. We synthetically generated three query sets, one for each query type, each of which contains 100 queries. The processes of answering the selectivity query and the roll-up query utilize the precomputed results in MongoDB and Redis; thus we report the average response time of these two types for MongoDB and Redis respectively. As the drill-down query only involves indexed data collections, which are stored in MongoDB, we simply report its average response time. Our experiments were conducted on a cluster with 20 nodes; each node is equipped with a quad-core Intel(R) Core(TM) i5-2400 CPU @ 3.10 GHz and 4 GB RAM. We used Hadoop (version 2.6.0), MongoDB (version 3.2.9) and Redis (version 3.0.1).

4.2 Results and analysis

Figure 6 depicts the processing time to read a single collection in MongoDB. We measured different numbers of documents, varying from the thousand to the million level. The results show that MongoDB scans in linear time: for ten thousand documents it takes approximately a second to read, and for a million it takes up to 18 s. To compute a time range, we merge the results of each affected collection. To merge multiple collections, we use a join function which takes the value of each key, maps it into a hashmap, and then groups the entries into the final result. Figure 7 represents the multi-collection processing time. We measured two different collection sizes, one million and ten thousand documents respectively. For each collection involved in the merging process, the computation time increases linearly. For collections at the million-document level, the merging cost is approximately 6 s per collection; however, when the number of documents per collection is small, the merging cost is trivial. In practice, keeping the number of documents per collection low benefits the overall processing. The results suggest that dividing a large collection into several partitions greatly benefits the entire processing time.

Fig. 6 Document level processing time
Fig. 7 Collection merging MongoDB
The results of processing the single selectivity query are given in Fig. 8. As we can see, the time cost of answering the selectivity query stays almost constant when the precomputed results are stored in MongoDB, while it shows a linear increase when utilizing the precomputed results stored in Redis. The reason for this phenomenon lies in the storage structure of the results in MongoDB and Redis. For MongoDB, we saved the results of an indexed data collection in a separate collection; given a result collection name, locating the corresponding collection is a hash search, which is trivial and almost constant. In contrast, the results of an indexed data collection in Redis are spread across the keys, and the internal pattern search of Redis takes time linear in the number of KEYS.

Figure 9 presents the results of answering the drill-down query. As discussed in Sect. 3.4, the time cost of answering a drill-down query depends on the size of the indexed data collection and the complexity of the analytic job. As we can see in Fig. 9, the time experiences a linear increase with the size of the datasets. Note that it takes linear time to execute the word_pair job.

The results of processing roll-up queries are given in Fig. 10. The time costs for both MongoDB and Redis demonstrate a sharper increase with the size of the datasets. This is because the merging process is similar to the table-join of a relational database, whose time consumption may grow quickly when the data size gets bigger. However, through proper tuning of the index interval, we can achieve a reasonable response time in practice. MongoDB shows a slightly better performance than Redis; this shares the same reason as for the selectivity query, as it takes more time for Redis to assemble the precomputed results of a given indexed collection.

Fig. 8 Single selectivity
Fig. 9 Drill-down
Fig. 10 Roll-up

4.3 Practical scenario

When the choice of an optimal execution plan is not critical, MongoDB has been shown to be a viable alternative to relational databases [16]. We implemented the precomputing system in a real-world social media analysis project, HumanSensor. We use the Twitter dataset as the main source for the analytic tasks. Each tweet is approximately 3 kB in size and is stored in a MongoDB collection, with each collection indexed by a per-day timestamp. Tweets carry information such as the text, the GPS location, and the time posted. Based on these attributes, we define sub-tasks to monitor public opinion, urban activities, and topic modelling, with the time interval as the parameter. Figure 11 represents the main idea of the system, where the initial query is divided into sub-tasks which are queried according to the temporal information.

Fig. 11 End-to-end processing tasks

Opinion mining is measured through sentiment analysis, which calculates the "good and bad" based on a sentiment score; the process relies purely on the tweet text, which is segmented based on supervised learning. Urban activity, on the other hand, is determined by the GPS location provided in the tweet attributes (see Fig. 2); we visualise each location point in a heatmap to determine the density of urban movements, and by visualising the locations we can predict traffic congestion towards a certain period of time. Lastly, topic modelling is used to capture the trending topics of people's interest in a certain period. By combining these three aspects we can easily understand the general point of view of a certain region.
The main problem of such complex big data analytics is the processing efficiency from collecting the raw data to the final front-end visualisation. Normally, the process goes through several steps such as data cleaning, segmentation, clustering, and score weighting. There are two ways of processing data in a typical real-world application: the first is to directly grab the raw data and process them in the application layer (front-end); the second is to process the data in the database (back-end) and then send the result to the application. Both ways are acceptable; however, when the data size increases, the application-layer processing becomes a major bottleneck. As an example, in our use case we use NodeJS (https://nodejs.org/en/) as the front-end application, and Fig. 12 depicts the two processes described above, where the dotted line indicates that the raw data are processed in the application layer.

Fig. 12 Front-end vs back-end processing

Finally, we measure the end-to-end metrics of information retrieval, starting from the user query down to the sub-task processing within the database in real time. Table 1 shows the average processing time of the end-to-end platform for various NoSQL-based queries, measured on a total of one million tweets. Compared to the traditional processing without precomputation, the overall time is reduced from minutes to seconds. When the query complexity increases, the processing time without precomputing grows exponentially due to the additional MapReduce processing tasks, which are further affected by the network bandwidth cost. On the other hand, the precomputing time depends on the number of collections, since it only has to scan the collections that fall into the query. With the precomputing architecture, we are able to speed up the temporal merging, which enables real-time processing of complex tasks over big data.

Table 1 Average end-to-end query processing time with and without precomputing

Query                                                              Without precomputing (s)  With precomputing (s)
FIND (sentiment_score = negative)
  where time = 07/07/17                                                      4                        3
FIND (sentiment_score = negative) AND (location near 153.3,-27)
  where time = 07/07/17                                                     32                        7
FIND (sentiment_score = negative) AND (location near 153.3,-27)
  AND (topic = food AND shop) where time = 07/07/17                        140                        9
FIND (sentiment_score = negative) AND (location near 153.3,-27)
  AND (topic = food AND shop)
  where time > 07/07/17 AND time < 09/07/17                                187                       12
FIND (sentiment_score = negative) AND (location near 153.3,-27)
  AND (topic = food AND shop)
  where time = 07/07/17 OR time = 09/07/17                                 195                       11
FIND (sentiment_score = negative) AND (location near 153.3,-27)
  AND (topic = food AND shop)
  where time > 07/07/17 AND time < 09/07/17
  OR time > 08/08/17 AND time < 10/10/17                                   220                       24

5 Conclusion

We presented a precomputing architecture suitable for NoSQL databases, in particular MongoDB and Redis, to answer temporal analytic queries. Within the architecture, we proposed indexing techniques, result storage structures, as well as query processing strategies. Based on the proposed architecture, we are able to efficiently answer drill-down and roll-up temporal queries over large amounts of data with fast response times. Through the integration in a real project running in the Big Data and Smart Analytics lab at Griffith University and an experimental performance study, we proved the effectiveness of our architecture and demonstrated its efficiency under different settings. We also showed that a document-based store can outperform a key-value store when the data fit in memory. As future work, it would be interesting to extend the precomputed results to spatial data and to enable a distributed join for the merging functions, which would allow parallel joins and hopefully reduce the time threshold per collection.

Acknowledgements This project was partly funded through a National Environment Science Program (NESP) fund, within the Tropical Water Quality Hub (Project No: 2.3.2).

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References

1. Becken, S., Stantic, B., Alaei, A.R.: Sentiment analysis in tourism: capitalising on big data. J. Travel Res. (2017). doi:10.1177/0047287517747753
2. Burdick, D., Doan, A., Ramakrishnan, R., Vaithyanathan, S.: OLAP over imprecise data with domain constraints. In: Proceedings of the 33rd International Conference on Very Large Data Bases, pp. 39–50. VLDB Endowment (2007)
3. Chaudhuri, S., Dayal, U.: An overview of data warehousing and OLAP technology. ACM SIGMOD Rec. 26(1), 65–74 (1997)
4. Chen, C., Yan, X., Zhu, F., Han, J., Philip, S.Y.: Graph OLAP: towards online analytical processing on graphs. In: Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on, pp. 103–112. IEEE (2008)
5. Chevalier, M., El Malki, M., Kopliku, A., Teste, O., Tournier, R.: Implementing multidimensional data warehouses into NoSQL. In: International Conference on Enterprise Information Systems (ICEIS 2015), pp. 172–183 (2015)
6. Codd, E.F., Codd, S.B., Salley, C.T.: Providing OLAP (on-line analytical processing) to user-analysts: an IT mandate. Codd and Date 32 (1993)
7. Coronel, C., Morris, S.: Database Systems: Design, Implementation, and Management. Cengage Learning (2016)
8. Cuzzocrea, A., Bellatreche, L., Song, I.Y.: Data warehousing and OLAP over big data: current challenges and future research directions. In: Proceedings of the Sixteenth International Workshop on Data Warehousing and OLAP, pp. 67–70. ACM (2013)
9. Dede, E., Govindaraju, M., Gunter, D., Canon, R.S., Ramakrishnan, L.: Performance evaluation of a MongoDB and Hadoop platform for scientific data analysis. In: Proceedings of the 4th ACM Workshop on Scientific Cloud Computing, pp. 13–20. ACM (2013)
10. Doulkeridis, C., Nørvåg, K.: A survey of large-scale analytical query processing in MapReduce. VLDB J. 23(3), 355–380 (2014)
11. Franciscus, N., Ren, X., Stantic, B.: Answering temporal analytic queries over big data based on precomputing architecture. In: Intelligent Information and Database Systems—9th Asian Conference, ACIIDS 2017, Kanazawa, Japan, April 3–5, 2017, Proceedings, Part I, pp. 281–290 (2017)
12. Gudivada, V.N., Rao, D., Raghavan, V.V.: Renaissance in database management: navigating the landscape of candidate systems. IEEE Comput. 49(4), 31–42 (2016)
13. Inmon, W.H., Hackathorn, R.D.: Using the Data Warehouse, vol. 2. Wiley, New York (1994)
14. Lin, C.X., Ding, B., Han, J., Zhu, F., Zhao, B.: Text cube: computing IR measures for multidimensional text database analysis. In: Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on, pp. 905–910. IEEE (2008)
15. Lo, E., Kao, B., Ho, W.S., Lee, S.D., Chui, C.K., Cheung, D.W.: OLAP on sequence data. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 649–660. ACM (2008)
16. Mahmood, K., Risch, T., Zhu, M.: Utilizing a NoSQL data store for scalable log analysis. In: Proceedings of the 19th International Database Engineering & Applications Symposium, pp. 49–55. ACM (2015)
17. Ntarmos, N., Patlakas, I., Triantafillou, P.: Rank join queries in NoSQL databases. Proc. VLDB Endow. 7(7), 493–504 (2014)
18. Qi, Y., Candan, K.S., Tatemura, J., Chen, S., Liao, F.: Supporting OLAP operations over imperfectly integrated taxonomies. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 875–888. ACM (2008)
19. Scriney, M., Roantree, M.: Efficient cube construction for smart city data. In: Proceedings of the Workshops of the EDBT/ICDT 2016 Joint Conference (2016)
20. Stantic, B., Pokorny, J.: Opportunities in big data management and processing. Front. Artif. Intell. Appl. 270, 15–26 (2014). IOS Press
21. Stonebraker, M.: SQL databases v. NoSQL databases. Commun. ACM 53(4), 10–11 (2010)
22. Tao, Y., Papadias, D.: The MV3R-tree: a spatio-temporal access method for timestamp and interval queries. In: Proceedings of the Very Large Data Bases Conference (VLDB), 11–14 September, Rome, pp. 431–440 (2001)
23. Wu, L., Sumbaly, R., Riccomini, C., Koo, G., Kim, H.J., Kreps, J., Shah, S.: Avatara: OLAP for web-scale analytics products. Proc. VLDB Endow. 5(12), 1874–1877 (2012)
24. Zhao, H., Ye, X.: A practice of TPC-DS multidimensional implementation on NoSQL database systems. In: Technology Conference on Performance Evaluation and Benchmarking, pp. 93–108. Springer (2013)
25. Zhao, P., Li, X., Xin, D., Han, J.: Graph cube: on warehousing and OLAP multidimensional networks. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, pp. 853–864. ACM (2011)

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
