Association rule mining algorithms on high-dimensional datasets

* Dongmei Ai, aidongmei@sina.com
Hongfei Pan, 18813128340@163.com
Xiaoxin Li, 851482181@qq.com
Yingxin Gao, gaoyingxin16@outlook.com
Di He, 15810251026@163.com

School of Mathematics and Physics, University of Science and Technology Beijing, No.30, Xueyuan Road, Beijing 100083, China
Computer Science, City University of New York Graduate School and University Center, CUNY, 365 Fifth Avenue, New York 10016, USA

Artificial Life and Robotics

The science of bioinformatics has been advancing at a fast pace, introducing more features and handling bigger volumes of data. However, these swift changes have, at the same time, posed challenges to data mining applications, in particular efficient association rule mining. Many data mining algorithms for high-dimensional datasets have been put forward, but the sheer number of these algorithms, with varying features and application scenarios, has complicated making suitable choices. Therefore, we present a general survey of multiple association rule mining algorithms applicable to high-dimensional datasets. The main characteristics and relative merits of these algorithms are explained, and areas for improvement and optimization strategies that might be better adapted to high-dimensional datasets are pointed out, according to previous studies. Generally speaking, association rule mining algorithms that merge diverse optimization methods with advanced computer techniques can better balance scalability and interpretability.

Keywords Data mining algorithms · Association rule mining · High-dimensional datasets · Frequent itemset mining

1 Introduction

Association rule mining (ARM), an important branch of data mining, has been extensively used in many areas since Agrawal first introduced it in 1993 [1]. In general, ARM can be seen as a method aimed at discovering groups of items that co-occur with high frequency. In contrast to other data mining methods that rely on statistical models, ARM can extract possible relationships between variables from huge datasets with little prior knowledge, based on co-occurrence features. When applied to biomedical data, ARM can obtain rules that provide a better understanding of biological associations among different covariates or between covariates and response variables.

Bioinformatics techniques have been developing with increased speed, and so have several high-throughput biotechnologies, such as genomic microarray and Next Generation Sequencing (NGS). High-dimensional data, that is, data with anywhere from a few dozen to many thousands of dimensions, are often encountered in areas such as medicine, where DNA microarray technology can produce a large number of measurements at once. A dataset is a collection of data; most commonly it corresponds to the contents of a single database table or a single statistical data matrix. It follows that a concern among researchers has been the efficient and effective discovery of latent information underlying huge amounts of data. As a possible solution to this problem, ARM has been extensively applied in this field. A typical application of ARM on such high-throughput datasets is gene association analysis (GAA) [2, 3], in which the goal is to exploit the relationships among different genes based on their expression levels.

Data from these high-throughput techniques often share the feature of high dimensionality. For example, gene expression data typically take the form of an N × M matrix, where each row of the matrix represents a sample, and each column corresponds to the expression level of a certain gene.
The number of genes in a given study can be in the thousands, while the number of specimens is generally dozens or hundreds.

Such high dimensionality is also true for other kinds of biomedical datasets, e.g., Operational Taxonomic Unit (OTU) abundance datasets that have different levels of extra environmental factors in metagenomics analysis [4], as well as multiple datasets, including mRNA/miRNA expression data and Copy Number Variations (CNV) data from The Cancer Genome Atlas (TCGA) project [5].

Because of the high dimensionality of such datasets, applying traditional association rule mining methods directly to them could result in unsatisfactory performance [6]. To improve performance on high-dimensional datasets, multiple specialized algorithms have been proposed in the last decade.

2 Mining algorithms on high-dimensional datasets

2.1 Basic association rule mining algorithms

Apriori, one of the earliest ARM algorithms, was proposed by Agrawal and Srikant [7]. It successfully reduced the size of the search space with the downward closure (Apriori) property, which says that a k-itemset is frequent only if all of its subsets are frequent. The Apriori algorithm is characterized by candidate generation, whereby (k + 1)-candidate-itemsets are generated iteratively by combining any two frequent k-itemsets that share a common prefix of length (k − 1). The support of each candidate itemset is then computed to determine whether the candidate is frequent. Finally, the algorithm terminates when no more frequent itemsets can be generated.

Based on the standard Apriori algorithm, several improved variations were proposed. The performance-enhancing strategies include the hashing technique [8], partitioning technique [9], sampling approach [10], dynamic counting [11], and incremental mining [12]. As previous studies demonstrated, these Apriori-based approaches achieved good performance when the dataset was sparse and the patterns discovered were short. However, such methods suffer nontrivial costs caused by generating huge numbers of candidate itemsets and by extra scans over the dataset for support computation.

We then saw the emergence of algorithms such as FP-growth (frequent pattern growth), which mine without generating candidate itemsets [13]. First, an FP-tree, which retains the association information of the itemsets, is constructed according to the frequency of the 1-itemsets. Next, patterns of different lengths are generated recursively by concatenating suffix patterns, starting from the frequent 1-patterns, with the frequent patterns from a conditional FP-tree, which is a subtree consisting of the set of prefix paths in the FP-tree that co-occur with the suffix. In other words, this method only involves frequent pattern growth instead of Apriori-like generation-and-test. In this sense, it applies a partitioning-based divide-and-conquer strategy, and efficiency studies demonstrated that this method substantially reduced search time. Subsequently, multiple algorithms were proposed as extensions to the FP-growth approach, such as generating frequent itemsets in a depth-first manner [14], mining with a devised hyperstructure [15], pattern-growth mining by traversing an FP-tree-like structure in both directions (top-down and bottom-up) [16], and pattern-growth mining with a tree structure in an array-based implementation [17, 18]. Recursive searches on the tree could result in enormous costs in certain cases; nevertheless, such methods did lay a solid foundation for the further application of tree structures in association rule mining algorithms.

Apriori and FP-growth both adopted a horizontal data format for mining frequent itemsets. In contrast, Zaki proposed the Equivalence CLASS Transformation (Eclat) algorithm, which employs a vertical data format [19]. Eclat also uses Apriori's candidate generation of (k + 1)-itemset candidates. In Eclat, however, the support of a candidate can be computed by just intersecting the sample ID sets of the corresponding frequent k-itemsets. More simply stated, the support of any itemset is the size of the intersection of the vertical sample ID sets of its items. Thus, additional scanning of the original dataset can be saved, again reducing search time.

Generally speaking, finding all frequent itemsets of a specific dataset can be regarded as a process consisting of search space traversal, itemset support computation, and search path pruning. One common strategy for traversing the search space is breadth-first search (BFS). For example, in Apriori, no frequent (k + 1)-itemset is generated until all frequent k-itemsets have been discovered. Another common strategy is depth-first search (DFS). For example, in FP-growth, longer frequent patterns are generated recursively until no more can be found. Common strategies for support computation include counting (e.g., Apriori, FP-growth) and intersection (e.g., Eclat). In sum, the methodologies adopted by the three basic association rule mining algorithms described (Apriori [7], FP-growth [13] and Eclat [19]) serve as landmarks for the development of association rule mining and constitute the basis for subsequent association algorithms.
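The difference between Apriori's join, prune and count steps on horizontal data and Eclat's sample ID set intersection on vertical data can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the function names `apriori` and `eclat`, the absolute support threshold, and the toy transactions are hypothetical and are not taken from [7] or [19].

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Level-wise (BFS) mining: join frequent k-itemsets, prune, then count."""
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(level)
    while level:
        candidates = set()
        for a, b in combinations(sorted(level), 2):
            # Join step: two k-itemsets that share their first k-1 items.
            if a[:-1] == b[:-1]:
                cand = a + (b[-1],)
                # Prune step (downward closure): every k-subset must be frequent.
                if all(sub in level for sub in combinations(cand, len(a))):
                    candidates.add(cand)
        # Counting supports requires another scan over the horizontal data.
        level = {}
        for cand in candidates:
            support = sum(1 for t in transactions if set(cand) <= t)
            if support >= min_support:
                level[cand] = support
        frequent.update(level)
    return frequent


def eclat(transactions, min_support):
    """Depth-first mining on the vertical format: item -> set of sample IDs."""
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets.setdefault(item, set()).add(tid)
    frequent = {}

    def extend(prefix, items):
        for i, (item, tids) in enumerate(items):
            itemset = prefix + (item,)
            frequent[itemset] = len(tids)
            # Support of a longer itemset = size of the ID-set intersection.
            suffix = []
            for other, other_tids in items[i + 1:]:
                common = tids & other_tids
                if len(common) >= min_support:
                    suffix.append((other, common))
            extend(itemset, suffix)

    start = sorted(
        ((i, t) for i, t in tidsets.items() if len(t) >= min_support),
        key=lambda pair: pair[0],
    )
    extend((), start)
    return frequent


# Toy data (assumed for illustration): five samples over items a, b, c.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result_apriori = apriori(transactions, min_support=3)
result_eclat = eclat(transactions, min_support=3)
```

On this toy input both functions should return the same frequent itemsets with the same supports; only the traversal order and the way support is obtained differ.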
2.2 Maximal frequent itemset mining and frequent closed itemset mining

According to the Apriori property, all 2^k − 2 subsets of a frequent k-itemset (excluding the itemset itself and the empty set ∅) are also frequent. This characteristic results in massive, unnecessary redundancy among frequent itemsets. To limit the redundancy, two alternative concepts were advanced, namely maximal frequent itemset mining and frequent closed itemset mining [20].

Many algorithms were developed to mine these two categories of itemsets. For example, MaxMiner, the very first study on maximal frequent itemset mining, was proposed in 1998 [21]. Based on Apriori, MaxMiner adopted a breadth-first search (BFS) strategy and reduced the search space by both superset and subset frequency pruning. As another efficient maximal frequent itemset mining method, MAFIA improved support counting efficiency by adopting vertical bitmaps to compress the transaction ID list [22].

For frequent closed itemset mining, numerous methods have been proposed since 1999, when A-Close, an Apriori-based frequent closed itemset mining approach, was reported [23]. CLOSET explored frequent closed itemset mining based on the FP-tree structure [24]. Another typical frequent closed itemset mining approach is CHARM, which combined a hybrid search strategy, the diffsets technique (a compact form of tID list information), and a hash-based “non-close item disposal” approach to enhance both computation and memory efficiency [25]. In addition, AFOPT-close presented a method that can adaptively use three different structures, namely arrays, AFOPT-trees (FP-tree-like) and buckets, to represent conditional databases according to their respective densities [26]. To integrate previous effective approaches and some newly developed techniques, CLOSET+ was proposed [27]. After thorough performance studies on diverse datasets, CLOSET+ was considered one of the most efficient methods at the time.

Previous studies have shown that algorithms of these two categories are usually more efficient than the earlier algorithms. However, maximal frequent itemsets have a critical defect in that the supports of their sub-itemsets may differ from their own. This would, in turn, require extra scans over the dataset for support computation and makes them ultimately unfit for rule extraction. Frequent closed itemset mining does not encounter such problems, essentially because every frequent itemset has exactly the same support as its closure, the smallest closed itemset that contains it. Frequent closed itemsets can therefore be regarded as a compressed form of the complete frequent itemsets without information loss. Based on these features and properties, we can conclude that frequent closed itemset mining is more likely to play a vital role in the development of association rule mining.

In summary, frequent closed itemsets can provide analytical power equivalent to that of complete frequent itemsets, but with much smaller size. Many approaches have verified the higher efficiency and better interpretability obtained by frequent closed itemset mining. However, most of the above-mentioned algorithms adopt the column-enumeration strategy. Therefore, when such approaches are applied to high-dimensional datasets, the search space tends to expand exponentially with the number of features, making the computational cost prohibitive. It is thus easy to see why most of the algorithms discussed so far cannot be applied to high-dimensional datasets, which again underscores the need to develop algorithms suited to such datasets to keep pace with advancements in sequencing and computer technology.

2.3 Algorithms applicable to high-dimensional datasets

From previous sections, we can see that the applicability of the above-mentioned algorithms to high-dimensional datasets is limited. In this section, we will discuss approaches better able to meet this challenge.

Among them, approaches incorporating frequent closed itemset mining and row enumeration can serve as a possible solution. This idea was first explored in 2003 by CARPENTER [28]. Based on data in vertical format, CARPENTER constructs a row-enumeration tree and adopts a depth-first search (DFS) strategy to traverse it. Additionally, several pruning strategies are employed during the search process to cut off the branches incapable of generating frequent closed itemsets. A previous study [28] showed that CARPENTER outperformed rivals such as CHARM and CLOSET+ [27] when applied to high-dimensional microarray datasets.

Adopting similar strategies, other methods were developed. For instance, RERII is like CARPENTER, but instead of searching frequent itemsets from the whole original datasets, RERII explores frequent closed itemsets in the opposite direction, starting from the nodes that represent the complete rowsets [29]. This strategy has the potential to enhance overall performance by reducing the cost of searching short rowsets and I-item rowsets.

To make CARPENTER more adaptable to more complex datasets, COBBLER integrated the strategies of both CARPENTER and CLOSET+ [30]. Accordingly, COBBLER can dynamically switch between row enumeration and column enumeration according to estimated cost conditions. Its efficiency has been verified in experiments over datasets with high dimensionality and a relatively large number of rows.

TD-CLOSE adopts a top-down row-enumeration search strategy that enables stronger pruning than the bottom-up style adopted by CARPENTER [31]. To guarantee closedness during the mining process, an additional closedness checking method was included in TD-CLOSE. Moreover, in 2009, an improved version of TD-CLOSE, called TTD-CLOSE, was proposed [32]. With an optimized pruning strategy and data structure, TTD-CLOSE obtained better performance than the original TD-CLOSE.
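The following sketch illustrates why row enumeration suits data with few samples and many features: it enumerates combinations of rows rather than combinations of items, and the items shared by a set of rows always form a closed pattern. This is a deliberately naive illustration, not CARPENTER or any of its successors; it has none of their pruning strategies, and the function names, the toy gene identifiers, and the requirement that `min_support` be at least 1 are assumptions made here.

```python
from itertools import combinations


def closed_patterns_by_rows(rows, min_support):
    """Naive row enumeration: intersect the itemsets of every row combination.

    Assumes min_support >= 1 and that each row is a frozenset of items.
    """
    closed = {}
    n = len(rows)
    for size in range(min_support, n + 1):
        for row_ids in combinations(range(n), size):
            common = frozenset.intersection(*(rows[r] for r in row_ids))
            if not common:
                continue
            # Support is the number of rows that contain the whole pattern.
            closed[common] = sum(1 for row in rows if common <= row)
    return closed


def maximal_patterns(closed):
    """Maximal frequent itemsets: closed patterns with no frequent superset."""
    return {p for p in closed if not any(p < q for q in closed)}


def support_from_closed(itemset, closed):
    """Recover a frequent itemset's support from its closed supersets."""
    return max((s for p, s in closed.items() if itemset <= p), default=0)


# Toy data (assumed for illustration): three samples, five genes.
rows = [
    frozenset({"g1", "g2", "g3", "g4"}),
    frozenset({"g1", "g2", "g3"}),
    frozenset({"g1", "g2", "g5"}),
]
closed = closed_patterns_by_rows(rows, min_support=2)
maximal = maximal_patterns(closed)
```

With three rows there are only a handful of row combinations to inspect, whereas a column-enumeration search would have to consider subsets of all five items. The helper `support_from_closed` shows the lossless-compression property of closed itemsets: the support of any frequent itemset can be read off the closed patterns, which is not possible from the maximal patterns alone.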
To extend the applications of frequent pattern mining, new classification methods based on ARM over high-dimensional datasets, such as FARMER and TOPKRGS, emerged [33, 34]. With additional class information attached to the original datasets, both algorithms can extract classification rules in the form of X ⇒ C, where C is a class label and X is a set of items. Based on the previous analysis, it can be concluded that a rule extracted from a frequent closed itemset consisting of k items in total also implies the existence of 2^k − 2 other rules. To reduce rule redundancy, these two algorithms extract only “interesting” rule groups instead of all rules. Specifically, FARMER adopted the concept of a rule group that consists of a unique upper bound rule and a set of lower bound rules for clustering the complete set of rules, while TOPKRGS just selects the most significant top-k covering rule groups. In addition, FARMER reinforced its interestingness measures with Chi square in addition to support and confidence, while TOPKRGS adopts a prefix tree structure to speed up the frequency computation and utilizes a dynamic minimum confidence generation strategy to better fit different datasets.

To enhance mining efficiency, HDminer employs effective search space partitioning and pruning strategies. HDminer gradually narrows down the search space by pruning off the false-valued cells based on the space partition tree instead of accumulating the true-valued cells as the FP-tree- or enumeration tree-based methods do. Because there are fewer false-valued cells than true-valued cells, HDminer works much more efficiently than the FP-tree- or enumeration tree-based methods [35]. HDminer shows superiority especially on synthetic data and dense microarray data.

To summarize, previous studies have verified the relatively high efficiency of row-enumeration algorithms for mining frequent closed itemsets over high-dimensional datasets. However, with the advancement of biomedical data acquisition techniques, the volume of data has grown larger and larger, and the row size of a dataset may, therefore, become as large as the column size. In this case, methods such as COBBLER or TD-CLOSE, as described above, may still have trouble handling such large datasets. Consequently, instead of sequential algorithms, increased attention has focused on parallel and distributed algorithms. Parallel association rule mining algorithms were actually proposed quite early, in the 1990s [36, 37]. However, since the effectiveness of these algorithms was challenged by complicated strategies of workload balance, fault tolerance and data distribution, as well as interconnection costs and the limited computer hardware capacity of that time, extensive application of such algorithms was suppressed. On the other hand, with the exuberance over cloud computing and distributed computing techniques, parallelized association rule mining algorithms were revived with the opportunity to show their power. Specifically, as the most recognized large-scale data analysis technique, Hadoop has been broadly utilized in modern biomedical studies [38, 39]. Characterized by mapper and reducer functions [40], the Hadoop MapReduce framework is especially good at processing gigabytes, or even terabytes, of data. Moreover, by hiding details of the underlying controls, Hadoop enables users to concentrate on algorithm design. All of these features make Hadoop a promising candidate to propel the development of ARM in the “big data” era.

Typical examples of adapting Apriori to Hadoop MapReduce include SPC (Single Pass Counting), FPC (Fixed Pass Combined-Counting) and DPC (Dynamic Pass Counting) [41]. These algorithms share the common procedures of distributing data to different mappers and counting supports in parallel, but differ in candidate generation. Typically, SPC generates frequent itemsets of only a single length after one phase of MapReduce, whereas FPC and DPC generate frequent itemsets of several different lengths in a phase. As the names suggest, the number of lengths combined per phase is fixed in FPC, while in DPC it is determined dynamically by the number of candidate itemsets generated at each MapReduce phase. Other Hadoop-based Apriori algorithms, which work in a similar manner but differ in implementation, were also proposed [42, 43].

Traditional association rule mining algorithms cannot meet the efficiency and scalability demands of mining large amounts of data. Taking FP-growth as an example, a parallelized version of the algorithm was implemented on the Hadoop framework with the MapReduce model; it better meets the requirements of big data mining and efficiently mines frequent itemsets and association rules from large datasets [44]. MRFP-Growth (MapReduce Frequent Pattern Growth) was also implemented to solve the problem of discovering frequent patterns in massive datasets, with improved efficiency and performance reported in comparison with other mining algorithms [45]. Implementations of Eclat on MapReduce were proposed as well, such as Dist-Eclat, which focuses on speed, and its optimized version BigFIM, which adopts a hybrid approach incorporating both Apriori and Eclat and is thus better suited to very large datasets [46]. Experiments on real datasets have demonstrated their scalability. A MapReduce algorithm for mining closed frequent itemsets was implemented as well [47].

Another noteworthy approach is PARMA [48]. It applies parallel mining algorithms to randomly selected subsets of the original large dataset. Owing to this random sampling, its mined results can be considered an approximation of the exact results for the whole dataset. The approximation quality was verified by both statistical analysis and real-time application.
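The map and reduce steps that the Hadoop-based Apriori variants share, namely distributing data to mappers and counting supports in parallel, can be imitated in plain Python without a Hadoop cluster. The sketch below follows the single-length-per-phase idea described for SPC. It is an assumption-laden illustration rather than code from [41]: the names `map_count`, `reduce_sum` and `spc_phase`, the way the data are partitioned, and the toy transactions are all hypothetical, and the mappers run sequentially here.

```python
from collections import defaultdict
from itertools import combinations


def map_count(partition, candidates):
    """Mapper: emit (candidate, 1) for each candidate found in a transaction."""
    for transaction in partition:
        for cand in candidates:
            if cand <= transaction:
                yield cand, 1


def reduce_sum(pairs, min_support):
    """Reducer: sum the counts per candidate and keep the frequent ones."""
    totals = defaultdict(int)
    for cand, one in pairs:
        totals[cand] += one
    return {c: n for c, n in totals.items() if n >= min_support}


def spc_phase(partitions, candidates, min_support):
    """One simulated MapReduce phase: count candidates of a single length."""
    emitted = []
    for part in partitions:  # on a cluster, each partition goes to its own mapper
        emitted.extend(map_count(part, candidates))
    return reduce_sum(emitted, min_support)


# Toy data (assumed for illustration): five transactions split over two mappers.
partitions = [
    [frozenset("abc"), frozenset("ab")],
    [frozenset("ac"), frozenset("bc"), frozenset("abc")],
]
items = sorted({i for part in partitions for t in part for i in t})
candidates = [frozenset([i]) for i in items]
frequent = {}
k = 1
while candidates:
    level = spc_phase(partitions, candidates, min_support=3)
    frequent.update(level)
    k += 1
    # Next candidates: k-item unions whose (k-1)-subsets are all frequent.
    unions = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
    candidates = [
        c for c in unions
        if all(frozenset(s) in level for s in combinations(c, k - 1))
    ]
```

Each pass through the `while` loop corresponds to one MapReduce phase that handles a single itemset length. Combining several lengths in one phase, as described for FPC and DPC, would reduce the number of phases at the cost of counting more candidates per phase.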
The approximation quality was verified by both both approach, known as Pattern-Fusion, is based on a novel con- statistical analysis and real-time application. cept called core pattern. Pattern-Fusion is able to discover Based on CARPENTER, a new algorithm called PaMPa- approximate colossal patterns, i.e., the colossal pattern min- HD was developed. This algorithm adopts the depth-first ing algorithm based on pattern fusion improve seed pattern search process, as well, but the process is broken up into selection method, which is select pattern that the distant big, independent subprocesses to which a centralized version of rather than random seed pattern [51]. CARPENTER is applied so that it can autonomously evalu- Recently, the Graphics Processor Units (GPU) has ate subtrees of the search space. Then the final closed item- emerged as one of the most used parallel hardware to solve sets of each subprocess can be extracted in order to compute large scientific complex problems. An approach benefits the whole closed itemset result [49]. Since the subprocesses from the massively parallel power of GPU by using a large are independent, they can be executed in parallel by means number of threads to evaluate association rule mining was of a distributed computing platform such as Hadoop. proposed [52]. Then a new algorithm called MWBSO- To achieve compressed storage and avoid building condi- MEGPU was proposed. This method combine both GPU and tional pattern bases, FiDoop was brought forward. FiDoop cluster computing to improve a Bees Swarm Optimization utilizes a frequent itemset ultrametric tree. In FiDoop, the (BSO) metaheuristic. Several tests have been carried out to mappers independently decompose itemsets, the reduc- evaluate this approach. 
The results reveal that MWBSO- ers perform combination operations by constructing small MEGPU outperforms the HPC-based ARM approaches in ultrametric trees, and the actual mining of these trees is per- terms of speed up when exploring Webdocs instance [53]. formed separately, which can speed up the mining perfor- mance for high-dimensional datasets analysis [50]. Exten- sive experiments using real-world celestial spectral data 3 Discussion indicate that FiDoop is scalable and efficient. In addition to the huge computation cost, it is typical for All algorithms reviewed as applicable approaches for high- the size of derived patterns from high-dimensional datasets dimensional datasets are summarized below in Table 1. to be enormous. Such growth of derived patterns makes their The performance evaluation of association rule mining effective use difficult. Therefore, a new methodology aimed algorithms raises two major concerns. The first is scalabil- at mining approximate or representative patterns, instead of ity, which refers to the ability of an algorithm to handle a Table 1 Overall compilation of association rule mining algorithms on high-dimensional datasets Methods Category Feature Reference CARPENTER Row-enumeration closed pattern Bottom-up [28] RERII Top-down [29] COBBLER Hybrid of CARPENTER&CLOSE+ [30] TD-CLOSE Top-down [31] TTD-CLOSE Top-down [32] FARMER Classification rules Rule group [33] TOPARGS Row-enumeration closed pattern TopK-rules [34] HDMiner Space partition tree Search space partition [35] SPC/FPC/DPC Hadoop-based Apriori [41] PFP FP-growth [44] MRFP FP-growth [45] Dist-Eclat Eclat [46] BigFIM Hybrid of Apriori and Eclat [46] An Improved Algorithms Closed pattern [47] PARMA Approximate pattern [48] PaMPa-HD Sub-process [49] Fidoop Frequent items ultrametric tree [50] Pattern-Fusion Colossal pattern mining Pattern fusion [51] Bioarm-Gpu-Ga GPU-based Bio-inspired [52] MWBSO-MEGPU Bio-inspired [53] 1 3 Artificial Life and Robotics large amount of data 
in a suitably efficient way. The other is interpretability, or the capacity to translate the results to real-world issues, such as biological meaning.

With respect to scalability, our primary focus in this paper, numerous approaches to efficiently process high-dimensional datasets have been proposed. With the advantage of advanced computer technology and seemingly unlimited cloud computing resources, as well as “big data” processing techniques, parallelized association rule mining might be the most promising candidate to lead further development of association rule mining in the new “big data” era. More efforts should primarily concentrate on a more appropriate data distribution model, more efficient mining methods that take better advantage of the key-value feature, and a more reliable load balance scheme. Furthermore, the idea adopted by Pattern-Fusion, whereby approximate or representative patterns are extracted instead of full-scale patterns, is a promising methodology to address the high-dimensionality problem. By employing such a methodology, the mining process realizes cost savings by identifying shorter patterns. It also yields much smaller result sets, consisting of the longer patterns preferred in practical use. For example, in gene expression analysis, longer patterns are usually more favorable. Still, approximation quality needs to be guaranteed to avoid major loss of latent information, which may involve more statistical analysis and theoretical proof.

Interpretability is another critical issue in biomedical research. Typically, incorporating previously known biological knowledge into association rule mining algorithms is seen as providing more biologically meaningful results [6]; however, we suggest in this review that taking too much knowledge into account might lower the ability of ARM to obtain undiscovered rules, because the algorithm would tend to fit the biological knowledge more. Additionally, such an approach may increase the cost and, in turn, reduce scalability. Instead, toolkits that include fuzzy set theory, genetic algorithms, ant colony algorithms, particle swarm optimization and other heuristic algorithms can be utilized to optimize association rule mining algorithms for better interpretability. Such optimization methods have been mainly used over quantitative data without a typical pre-discretization procedure. For example, fuzzy set theory can be used to generate fuzzy association rules with more practical meanings, genetic algorithms to dynamically specify an appropriate support threshold, and ant colony algorithms to reduce the scale of the result rules [54, 55]. As also suggested in this review, association rule mining algorithms that merge diverse optimization methods might, when combined with improved computer techniques, provide researchers with tools of more practical value.

4 Conclusion

Generally speaking, ARM has been widely utilized in bioinformatics studies. ARM can be used to identify the most relevant covariates in a certain biological process and thus construct the underlying intrinsic latent network. When applied over high-dimensional datasets, many older methods cannot manage the issue of high dimensionality. Many more recent methods have been proposed to address this problem, each with its merits and faults, but no perfect solution has been achieved. For better usage in this area, new algorithms that can better balance scalability and interpretability are still in demand.

Acknowledgements This research is partially supported by National Natural Science Foundation of China (61370131). We thank David Martin for editorial assistance in English language.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

References

1. Agrawal R, Imieliński T, Swami A (1993) Mining association rules between sets of items in large databases. ACM SIGMOD Rec 22(2):207–216
2. Creighton C, Hanash S (2003) Mining gene expression databases for association rules. Bioinformatics 19(1):79–86
3. Liu YC, Cheng CP, Tseng VS (2011) Discovering relational-based association rules with multiple minimum supports on microarray datasets. Bioinformatics 27(22):3142–3148
4. Kunin V, Copeland A, Lapidus A et al (2008) A bioinformatician’s guide to metagenomics. Microbiol Mol Biol Rev 72(4):557–578
5. Network CGA (2012) Comprehensive molecular characterization of human colon and rectal cancer. Nature 487(7407):330–337
6. Alves R, Rodriguez-Baena DS, Aguilar-Ruiz JS (2010) Gene association analysis: a survey of frequent pattern mining from gene expression data. Brief Bioinform 11(2):210–224
7. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceeding 20th international conference on very large data bases, VLDB, pp 487–499
8. Park JS, Chen M-S, Yu PS (1995) An effective hash-based algorithm for mining association rules. ACM SIGMOD Rec 24(2):175–186
9. Savasere A, Omiecinski ER, Navathe SB (1995) An efficient algorithm for mining association rules in large databases. In: International conference on very large data bases, pp 432–444
10. Toivonen H (1996) Sampling large databases for association rules. VLDB, pp 134–145
11. Brin S, Motwani R, Ullman JD et al (1997) Dynamic itemset counting and implication rules for market basket data. Proc Sigmod 26(2):255–264
12. Cheung DW, Wong CY, Han J, Ng VT (1996) Maintenance of discovered association rules in large databases: An incremental updating technique.
In: Proceedings of the twelfth international conference on data engineering, pp 106–114
13. Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In: Proceeding of the 2000 ACM SIGMOD international conference on management of data, pp 1–12
14. Agarwal RC, Aggarwal CC, Prasad V (2001) A tree projection algorithm for generation of frequent item sets. J Parallel Distrib Comput 61(3):350–371
15. Pei J, Han J, Lu H et al (2007) H-Mine: Fast and space-preserving frequent pattern mining in large databases. IIE Trans 39(6):593–605
16. Liu J, Pan Y, Wang K et al (2002) Mining frequent item sets by opportunistic projection. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, pp 229–238
17. Grahne G, Zhu J (2003) Efficiently using prefix-trees in mining frequent itemsets. In: Proceeding IEEE ICSM workshop on frequent itemset mining implementations
18. Grahne G, Zhu J (2005) Fast algorithms for frequent itemset mining using fp-trees. IEEE Trans Knowl Data Eng 17(10):1347–1362
19. Zaki MJ (2000) Scalable algorithms for association mining. IEEE Trans Knowl Data Eng 12(3):372–390
20. Han J, Cheng H, Xin D et al (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Disc 15(1):55–86
21. Bayardo RJ Jr (1998) Efficiently mining long patterns from databases. ACM Sigmod Int Conf Manag Data 27(2):85–93
22. Burdick D, Calimlim M, Gehrke J (2001) MAFIA: a maximal frequent itemset algorithm for transactional databases. In: International conference on data engineering, pp 443–452
23. Pasquier N, Bastide Y, Taouil R et al (1999) Discovering frequent closed itemsets for association rules. Lect Notes Comput Sci 1540:398–416
24. Pei J, Han J, Mao R (2000) CLOSET: an efficient algorithm for mining frequent closed itemsets. ACM SIGMOD workshop on research issues in data mining and knowledge discovery, pp 21–30
25. Zaki MJ, Hsiao C-J (2002) CHARM: an efficient algorithm for closed itemset mining. In: Proceedings of the 2002 SIAM international conference on data mining, pp 457–473
26. Liu G, Lu H, Yu JX et al (2003) AFOPT: an efficient implementation of pattern growth approach. In: Proceeding of the ICDM Workshop
27. Wang J, Han J, Pei J (2003) CLOSET+: Searching for the best strategies for mining frequent closed itemsets. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining, pp 236–245
28. Pan F, Cong G, Tung AK et al (2003) Carpenter: Finding closed patterns in long biological datasets. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining, pp 637–642
29. Cong G, Tan K-L, Tung AK et al (2004) Mining frequent closed patterns in microarray data. In: 2004 ICDM’04 fourth IEEE international conference on data mining, pp 363–366
30. Pan F, Tung AK, Cong G et al (2004) COBBLER: combining column and row enumeration for closed pattern discovery.
34. Cong G, Tan K-L, Tung AK et al (2005) Mining top-k covering rule groups for gene expression data. In: Proceedings of the 2005 ACM SIGMOD international conference on management of data, pp 670–681
35. Xu J, Ji S (2014) HDminer: efficient mining of high dimensional frequent closed patterns from dense data. In: 2014 IEEE international conference on data mining workshop, pp 1061–1067
36. Agrawal R, Shafer JC (1996) Parallel mining of association rules. IEEE Trans Knowl Data Eng 8(6):962–969
37. Zaki MJ (1999) Parallel and distributed association mining: a survey. IEEE Concurr 7(4):14–25
38. Ferraro Petrillo U, Roscigno G, Cattaneo G, Giancarlo R (2017) FASTdoop: a versatile and efficient library for the input of FASTA and FASTQ files for MapReduce Hadoop bioinformatics applications. Bioinformatics 33(10):1575–1577
39. O’Driscoll A, Daugelaite J, Sleator RD (2013) ‘Big data’, Hadoop and cloud computing in genomics. J Biomed Inf 46(5):774–781
40. Dean J, Ghemawat S (2010) MapReduce: a flexible data processing tool. Commun ACM 53(1):72–77
41. Lin M-Y, Lee P-Y, Hsueh S-C (2012) Apriori-based frequent itemset mining algorithms on MapReduce. In: Proceedings of the 6th international conference on ubiquitous information management and communication, p 76
42. Li N, Zeng L, He Q et al (2012) Parallel implementation of apriori algorithm based on mapreduce. In: IEEE 13th ACIS international conference on software engineering, artificial intelligence, networking and parallel and distributed computing, pp 236–241
43. Kovacs F, Illés J (2013) Frequent itemset mining on hadoop. In: Computational cybernetics (ICCC), 2013 IEEE 9th international conference on IEEE, pp 241–245
44. Fu C, Wang X, Zhang L, Qiao L (2018) Mining algorithm for association rules in big data based on Hadoop. In: AIP conference proceedings. AIP Publishing: 040035
45. Al-Hamodi AA, Lu S (2016) MRFP: discovery frequent patterns using MapReduce frequent pattern growth. In: Network and information systems for computers (ICNISC), 2016 international conference on. IEEE: 298–301
46. Moens S, Aksehirli E, Goethals B (2013) Frequent itemset mining for big data. In: Big Data, 2013 IEEE international conference on IEEE: pp 111–118
47. Gonen Y, Gudes E (2016) An improved mapreduce algorithm for mining closed frequent itemsets. In: Software science, technology and engineering (SWSTE), IEEE international conference on: 2016. IEEE: 77–83
48. Riondato M, DeBrabant JA, Fonseca R et al (2012) PARMA: a parallel randomized algorithm for approximate association rules mining in MapReduce. In: Proceedings of the 21st ACM international conference on Information and knowledge management, pp 85–94
49. Apiletti D, Baralis E, Cerquitelli T et al (2015) PaMPa-HD: a Parallel MapReduce-based frequent pattern miner for high-dimensional data. In: IEEE international conference on data mining workshop,
In: Pro- pp 839–846 ceedings 16th international conference on scientific and statistical 50. Xun Y, Zhang J, Qin X (2016) Fidoop: Parallel mining of frequent database management, pp 21–30 itemsets using mapreduce. IEEE Trans Syst Man Cybern Syst 31. Liu H, Han J, Xin D et al.(2006) Mining frequent patterns from very 46(3):313–325 high dimensional data: a top-down row enumeration approach. In: 51. Wang Z. (2014) The colossal pattern mining algorithm based on Proceedings of the 2006 SIAM international conference on data pattern fusion., Tianjin Polytechnic University, Tianjin mining, pp 282–293 52. Djenouri Y, Bendjoudi A, Djenouri D, Comuzzi M (2017) GPU- 32. Liu H, Wang X, He J et  al (2009) Top-down mining of fre- based bio-inspired model for solving association rules mining prob- quent closed patterns from very high dimensional data. Inf Sci lem. In: Parallel, distributed and network-based processing (PDP), 179(7):899–924 2017 25th Euromicro International Conference on. IEEE: 262–269 33. Cong G, Tung AK, Xu X et al (2004) FARMER: finding interest- 53. Djenouri Y, Djenouri D, Habbas Z (2018) Intelligent mapping ing rule groups in microarray datasets. In: Proceedings of the 2004 between GPU and cluster computing for discovering big associa- ACM SIGMOD international conference on management of data, tion rules. Appl Soft Comput 65:387–399 pp 143–154 54. Mangalampalli A, Pudi V (2013) FAR-HD:A fast and efficient algo- rithm for mining fuzzy association rules in large high-dimensional 1 3 Artificial Life and Robotics datasets. Parallel implementation of apriori algorithm based on discover quantitative association rules in large-scale datasets. mapreduce Fuzzy Systems, pp 1–6 Integr Comput Aided Eng 22(1):21–39 55. Martínez-Ballesteros M, Bacardit J, Troncoso A, Riquelme JC (2015) Enhancing the scalability of a genetic algorithm to 1 3 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Artificial Life and Robotics Springer Journals

Association rule mining algorithms on high-dimensional datasets

Publisher: Springer Japan
Copyright: Copyright © 2018 by The Author(s)
Subject: Computer Science; Artificial Intelligence (incl. Robotics); Computation by Abstract Devices; Control, Robotics, Mechatronics
ISSN: 1433-5298
eISSN: 1614-7456
DOI: 10.1007/s10015-018-0437-y

Abstract

The science of bioinformatics has been accelerating at a fast pace, introducing more features and handling bigger volumes. However, these swift changes have, at the same time, posed challenges to data mining applications, in particular efficient association rule mining. Many data mining algorithms for high-dimensional datasets have been put forward, but the sheer numbers of these algorithms with varying features and application scenarios have complicated making suitable choices. Therefore, we present a general survey of multiple association rule mining algorithms applicable to high-dimensional datasets. The main characteristics and relative merits of these algorithms are explained, as well, pointing out areas for improvement and optimization strategies that might be better adapted to high-dimensional datasets, according to previous studies. Generally speaking, association rule mining algorithms that merge diverse optimization methods with advanced computer techniques can better balance scalability and interpretability.

Keywords: Data mining algorithms · Association rule mining · High-dimensional datasets · Frequent itemset mining
In: Proceedings of the twelfth international 34. Cong G, Tan K-L, Tung AK et al (2005) Mining top-k covering conference on data engineering, pp 106–114 rule groups for gene expression data. In: Proceedings of the 2005 13. Han J, Pei J, Yin Y (2000) Mining frequent patterns without can- ACM SIGMOD international conference on management of data, didate generation. In: Proceeding of the 2000 ACM SIGMOD pp 670-681 international conference on management of data, pp 1–12 35. Xu J, Ji S (2014) HDminer: ec ffi ient mining of high dimensional fre - 14. Agarwal RC, Aggarwal CC, Prasad V (2001) A tree projection quent closed patterns from dense data. In: 2014 IEEE international algorithm for generation of frequent item sets. J Parallel Distrib conference on data mining workshop, pp 1061–1067 Comput 61(3):350–371 36. Agrawal R, Shafer JC (1996) Parallel mining of association rules. 15. Pei J, Han J, Lu H et al (2007) H-Mine: Fastand space-preserving IEEE Trans Knowl Data Eng 8(6):962–969 frequent pattern mining in large databases. IIE Trans 39(6):593–605 37. Zaki MJ (1999) Parallel and distributed association mining: a survey. 16. Liu J, Pan Y, Wang K et al.(2002) Mining frequent item sets by IEEE Concurr 7(4):14–25 opportunistic projection. In: Proceedings of the eighth ACM Sigkdd 38. Ferraro Petrillo U, Roscigno G, Cattaneo G, Giancarlo R (2017) international conference on knowledge discovery and data mining, FASTdoop: a versatile and efficient library for the input of FASTA pp 229–238 and FASTQ files for MapReduce Hadoop bioinformatics applica- 17. Grahne G, Zhu J (2003) Efficiently using prefix-trees in mining fre- tions. Bioinformatics 33(10):1575–1577 quent itemsets. In: Proceeding IEEE ICSM workshop on frequent 39. O’Driscoll A, Daugelaite J, Sleator RD (2013) ‘Big data’, Hadoop itemset mining implementations and cloud computing in genomics. J Biomed Inf 46(5):774–781 18. Grahne G, Zhu J (2005) Fast algorithms for frequent itemset mining 40. 
Dean J, Ghemawat S (2010) MapReduce: a flexible data processing using fp-trees. IEEE Trans Knowl Data Eng 17(10):1347–1362 tool. Commun ACM 53(1):72–77 19. Zaki MJ (2000) Scalable algorithms for association mining. IEEE 41. Lin M-Y, Lee P-Y, Hsueh S-C (2012) Apriori-based frequent item- Trans Knowl Data Eng 12(3):372–390 set mining algorithms on MapReduce. In: Proceedings of the 6th 20. Han J, Cheng H, Xin D et al (2007) Frequent pattern mining: current international conference on ubiquitous information management and status and future directions. Data Min Knowl Disc 15(1):55–86 communication, p 76 21. Bayardo RJ Jr (1998) Efficiently mining long patterns from data- 42. Li N, Zeng L, He Q et al.(2012) Parallel implementation of apriori bases. ACM Sigmod Int Conf Manag Data 27(2):85–93 algorithm based on mapreduce. In: IEEE 13th ACIS international 22. Burdick D, Calimlim M, Gehrke J (2001) MAFIA: a maximal fre- conference on software engineering, artificial intelligence, network - quent itemset algorithm for transactional databases. In: International ing and parallel and distributed computing, pp 236–241 conference on data engineering, pp 443–452 43. Kovacs F, Illés J (2013) Frequent itemset mining on hadoop. In: 23. Pasquier N, Bastide Y, Taouil R et al (1999) Discovering frequent Computational cybernetics (ICCC), 2013 IEEE 9th international closed itemsets for association rules. Lect Notes Comput Sci conference on IEEE, pp 241–245 1540:398–416 44. Fu C, Wang X, Zhang L, Qiao L (2018) Mining algorithm for asso- 24. Pei J, Han J, Mao R (2000) CLOSET: an efficient algorithm for min - ciation rules in big data based on Hadoop. In: AIP conference pro- ing frequent closed itemsets. ACM SIGMOD workshop on research ceedings. AIP Publishing: 040035 issues in data mining and knowledge discovery, pp 21–30 45. Al-Hamodi AA, Lu S (2016) MRFP: discovery frequent patterns 25. Zaki MJ, Hsiao C-J (2002) CHARM: an efficient algorithm for using MapReduce frequent pattern growth. 
In: Network and infor- closed itemset mining. In: Proceedings of the 2002 SIAM interna- mation systems for computers (ICNISC), 2016 international confer- tional conference on data mining, pp 457–473 ence on. IEEE: 298–301 26. Liu G, Lu H, Yu JX et al.(2003) AFOPT: an efficient implementation 46. Moens S, Aksehirli E, Goethals B (2013) Frequent itemset mining of pattern growth approach. In: Proceeding of the Icdm Workshop for big data. In: Big Data, 2013 IEEE international conference on 27. Wang J, Han J, Pei J (2003) ClOSET+: Searching for the best strate- IEEE: pp 111–118 gies for mining frequent closed itemsets. In: Proceedings of the ninth 47. Gonen Y, Gudes E (2016) An improved mapreduce algorithm for ACM SIGKDD international conference on knowledge discovery mining closed frequent itemsets. In: Software science, technology and data mining, pp 236–245 and engineering (SWSTE), IEEE international conference on: 2016. 28. Pan F, Cong G, Tung AK et al.(2003) Carpenter: Finding closed IEEE: 77–83 patterns in long biological datasets. In: Proceedings of the ninth 48. Riondato M, DeBrabant JA, Fonseca R et al (2012) PARMA: a par- ACM SIGKDD international conference on knowledge discovery allel randomized algorithm for approximate association rules min- and data mining, pp 637–642 ing in MapReduce. In: Proceedings of the 21st ACM international 29. Cong G, Tan K-L, Tung AK et al.(2004) Mining frequent closed conference on Information and knowledge management, pp 85–94 patterns in microarray data. In: 2004 ICDM’04 fourth IEEE inter- 49. Apiletti D, Baralis E, Cerquitelli T et al (2015) PaMPa-HD: a Paral- national conference on data mining, pp 363–366 lel MapReduce-based frequent pattern miner for high-dimensional 30. Pan F, Tung AK, Cong G et  al.(2004) COBBLER: combining data. In: IEEE international conference on data mining workshop, column and row enumeration for closed pattern discovery. 
In: Pro- pp 839–846 ceedings 16th international conference on scientific and statistical 50. Xun Y, Zhang J, Qin X (2016) Fidoop: Parallel mining of frequent database management, pp 21–30 itemsets using mapreduce. IEEE Trans Syst Man Cybern Syst 31. Liu H, Han J, Xin D et al.(2006) Mining frequent patterns from very 46(3):313–325 high dimensional data: a top-down row enumeration approach. In: 51. Wang Z. (2014) The colossal pattern mining algorithm based on Proceedings of the 2006 SIAM international conference on data pattern fusion., Tianjin Polytechnic University, Tianjin mining, pp 282–293 52. Djenouri Y, Bendjoudi A, Djenouri D, Comuzzi M (2017) GPU- 32. Liu H, Wang X, He J et  al (2009) Top-down mining of fre- based bio-inspired model for solving association rules mining prob- quent closed patterns from very high dimensional data. Inf Sci lem. In: Parallel, distributed and network-based processing (PDP), 179(7):899–924 2017 25th Euromicro International Conference on. IEEE: 262–269 33. Cong G, Tung AK, Xu X et al (2004) FARMER: finding interest- 53. Djenouri Y, Djenouri D, Habbas Z (2018) Intelligent mapping ing rule groups in microarray datasets. In: Proceedings of the 2004 between GPU and cluster computing for discovering big associa- ACM SIGMOD international conference on management of data, tion rules. Appl Soft Comput 65:387–399 pp 143–154 54. Mangalampalli A, Pudi V (2013) FAR-HD:A fast and efficient algo- rithm for mining fuzzy association rules in large high-dimensional 1 3 Artificial Life and Robotics datasets. Parallel implementation of apriori algorithm based on discover quantitative association rules in large-scale datasets. mapreduce Fuzzy Systems, pp 1–6 Integr Comput Aided Eng 22(1):21–39 55. Martínez-Ballesteros M, Bacardit J, Troncoso A, Riquelme JC (2015) Enhancing the scalability of a genetic algorithm to 1 3

Journal

Artificial Life and Robotics, Springer Journals

Published: May 30, 2018
