Optimized distributed systems achieve significant performance improvement on sorted merging of massive VCF files

Abstract

Background: Sorted merging of genomic data is a common data operation necessary in many sequencing-based studies. It involves sorting and merging genomic data from different subjects by their genomic locations. In particular, merging a large number of variant call format (VCF) files is frequently required in large-scale whole-genome sequencing or whole-exome sequencing projects. Traditional single-machine based methods become increasingly inefficient when processing large numbers of files due to the excessive computation time and Input/Output bottleneck. Distributed systems and more recent cloud-based systems offer an attractive solution. However, carefully designed and optimized workflow patterns and execution plans (schemas) are required to take full advantage of the increased computing power while overcoming bottlenecks to achieve high performance.

Findings: In this study, we custom-design optimized schemas for three Apache big data platforms, Hadoop (MapReduce), HBase, and Spark, to perform sorted merging of a large number of VCF files. These schemas all adopt the divide-and-conquer strategy to split the merging job into sequential phases/stages consisting of subtasks that are conquered in an ordered, parallel, and bottleneck-free way. In two illustrating examples, we test the performance of our schemas on merging multiple VCF files into either a single TPED or a single VCF file, which are benchmarked with the traditional single/parallel multiway-merge methods, message passing interface (MPI)-based high-performance computing (HPC) implementation, and the popular VCFTools.

Conclusions: Our experiments suggest all three schemas either deliver a significant improvement in efficiency or render much better strong and weak scalabilities over traditional methods. Our findings provide generalized scalable schemas for performing sorted merging on genetics and genomics data using these Apache distributed systems.

Keywords: sorted merging; whole-genome sequencing; MapReduce; Hadoop; HBase; Spark

Received: 12 October 2017; Revised: 6 February 2018; Accepted: 5 May 2018. © The Author(s) 2018. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction

With the rapid development of high-throughput biotechnologies, genetic studies have entered the Big Data era. Studies like genome-wide association studies (GWASs), whole-genome sequencing (WGS), and whole-exome sequencing studies have produced massive amounts of data. The ability to efficiently manage and process such massive amounts of data becomes increasingly important for successful large-scale genetics studies [1-3]. Single-machine based methods are inefficient when processing such large amounts of data due to the prohibitive computation time, Input/Output bottleneck, as well as central processing unit (CPU) and memory limitations. Traditional high-performance computing (HPC) techniques based on message passing interface (MPI)/OpenMP also suffer from limitations such as not allowing addition of computing nodes at runtime, the lack of a fault-tolerant and highly available file system, and the inflexibility of customizing the computing environment without administrator permission on a cluster [3, 4]. It therefore becomes increasingly attractive for investigators to take advantage of more powerful distributed computing resources or the cloud to perform data processing and analyses [3, 5]. The Apache Foundation has been a leading force in this endeavor and has developed multiple platforms and systems including Hadoop [6, 7], HBase [8], and Spark [9]. All three Apache platforms have gained substantial popularity in recent years and have been endorsed and supported by major vendors such as Amazon Web Services (AWS).

In bioinformatics, researchers have already started to embrace Apache distributed systems to manage and process large amounts of high-throughput "-omics" data. For example, the Cancer Genome Atlas project makes use of the Hadoop framework to split genome data into chunks distributed over the cluster for parallel processing [3, 10]. The CloudBurst [11], Seal [12], Hadoop-BAM [13], and Crossbow [14] software take advantage of the Hadoop framework to accelerate sequencing read mapping, aligning, and manipulations as well as single-nucleotide polymorphism (SNP) calling. The Collaborative Genomic Data Model [15] adopts HBase to boost the querying speed for the main classes of queries on genomic databases. MetaSpark [16] utilizes Spark's distributed dataset to recruit large scales of metagenomics reads to reference genomes and achieves better scalability and sensitivity than single-machine based programs [17]. Industry cloud computing vendors such as Amazon [18] and Google [19] are also beginning to provide specialized environments to ease genomics data processing in the cloud.
Although numerous Apache cluster-based applications have already been developed for processing and analyzing large-scale genomics data, including ADAM [1], VariantSpark [20], SparkSeq [21], Halvade [22], and SeqHBase [23], among others, we believe there are still many opportunities in biomedical data analyses to take advantage of distributed systems as the scale and scope of data become larger and more complex. A particular example is sorted merging, which is a ubiquitous operation in processing genetics and genomics data. For instance, in WGS, variants identified from individuals are often called and stored in separate variant call format (VCF) files. Eventually these VCF files need to be merged (into a VCF or TPED file) as required by downstream analysis tools such as PLINK [24] and BlueSNP [25, 26]. Either a VCF or a TPED file requires the data to be sorted by their genomic locations, so these tasks are equivalent to the well-known sorted full-outer-joining problem [27]. Currently, they are handled by software such as VCFTools [28] and PLINK, which become considerably inefficient even in the face of a moderate number of VCF files. The main reason is that these tools adopt the multiway-merge-like method [29] with a priority queue as the underlying data structure to ensure the correct output order. Although such a method requires only one round of reading through the input files, a key deficiency is that only one consumer can access items from the data queue, which makes the writing sequential. This problem cannot be eliminated even if the multiway-merging is implemented as parallel processes, due to I/O saturation, workload imbalance among computing units, and memory limits. Therefore, these single-machine based tools are inefficient and time-consuming when handling large datasets.
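To make the single-machine baseline concrete, the following minimal Python sketch illustrates priority-queue based multiway merging of position-sorted, VCF-like files; the file names and the simplified record layout are illustrative assumptions, not the implementation used by the tools cited above. Note that a single consumer drains the queue, so the merged output is written sequentially.

```python
import heapq

def vcf_records(path):
    """Yield ((chrom, pos), line) for each data line of a position-sorted VCF-like file."""
    with open(path) as f:
        for line in f:
            if line.startswith("#"):          # skip header lines
                continue
            fields = line.rstrip("\n").split("\t")
            # Comparing chromosome names as strings is a simplification; a real
            # tool would map them to a canonical chromosome order first.
            yield (fields[0], int(fields[1])), line

def multiway_merge(vcf_paths, out_path):
    """Priority-queue (heap) based k-way merge of pre-sorted inputs.

    heapq.merge reads each input only once, but a single consumer drains the
    queue, so the merged output is written sequentially -- the bottleneck
    discussed above.
    """
    streams = [vcf_records(p) for p in vcf_paths]
    with open(out_path, "w") as out:
        for _key, line in heapq.merge(*streams):
            out.write(line)

# Hypothetical usage:
# multiway_merge(["subject1.vcf", "subject2.vcf"], "merged_by_position.txt")
```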
In this study, we use the case of sorted-merging multiple VCF files to demonstrate the benefits of using Apache distributed platforms. However, simply running sorted merging on such distributed systems runs into problems of bottlenecks, hot spots, and unordered results commonly seen in parallel computations. Rather, we believe working schemas custom designed for each specific distributed platform are required to unleash their full potential. To overcome the limitations of single-machine, traditional parallel/distributed, and naive Apache distributed system based methods, we propose and implement three schemas running on Hadoop, Spark, and HBase, respectively. We chose these three platforms because they represent cloud distributed systems providing data partitioning based parallelism with distributed storage, data partitioning based parallelism with in-memory processing, and high-dimensional table-like distributed storage, respectively.

Hadoop [6] is the open-source implementation of MapReduce [7] based on the parallel key-value processing technique and has the advantage of transparency and simplicity. HBase [8] is a data warehousing platform that adopts Google's BigTable data storing structure [30] to achieve high efficiency in storing and reading/writing large scales of sparse data. Spark [9] introduces the concept of the resilient distributed dataset (RDD) and directed acyclic graph execution to parallel key-value processing, thus enabling fast, robust, and repetitive in-memory data manipulations. Specifically, our schemas involve dividing the job into multiple phases corresponding to tasks of loading, mapping, filtering, sampling, partitioning, shuffling, merging, and outputting. Within each phase, data and tasks are evenly distributed across the cluster, enabling processing of large amounts of data in a parallel and scalable manner, which in turn improves both speed and scalability.

Methods

Overview

The benefits of using these three Apache distributed platforms to perform sorted merging are four-fold when compared to using the multiway-merge method [29], a relational database based approach, or an HPC framework. First, with genomic locations as keys and genotypes as values, the merging problem is readily transformed into the key-value model, for which all three platforms offer a rich set of parallel operations. Second, data in VCF files are semi-structured. This type of data is an ideal fit for the three platforms, which allow defining the schema during data loading and thus avoid the preprocessing of raw data into a rigid schema as in a relational database. Third, all of these platforms provide built-in efficient task coordination, high fault tolerance, data availability, and locality, which are absent in the traditional HPC framework. Fourth, the merged results are directly saved onto a distributed file system such as the Hadoop distributed file system (HDFS) or Amazon S3, where they can be used directly by subsequent cluster-based GWAS or WGS analytical tools such as BlueSNP.
Despite these advantages, simply performing sorted merging on these Apache distributed systems will not deliver the expected results, for the following reasons. First, it can lead to globally unsorted results: hash-based shuffling of input data is the default mechanism for distributing data to parallel working units in the system, and such shuffling destroys the global order. Second, bottlenecks and hot spots can happen during the processing in the cluster. Bypassing the hash-based shuffling can lead to unbalanced workloads across the cluster and result in straggling computing units that become bottlenecks for response time. In addition, for parallel loading of presorted data into HBase, data being loaded from all the loading tasks access the same node simultaneously, while other nodes may be idling, creating an I/O hot spot. Third, sampling costs could become prohibitive. Although Hadoop provides a built-in utility named total-order-merging [27] to achieve both workload balance and global order, it involves transferring to and sampling all the data on a single node. The communication costs over the network and the disk I/O can be prohibitive when the data size becomes very large. In the following sections, we illustrate in detail how our custom designed schemas are able to overcome these limitations.

Data formats and operations

In a typical WGS experiment, data analysis often starts from individual genotype files in the VCF format [31]. A VCF file contains data arranged into a table consisting of eight mandatory fields including chromosome (CHROM), the genomic coordinate of the start of the variant (POS), the reference allele (REF), and a comma-separated list of alternate alleles (ALT), among others. In our experiments, we use a dataset consisting of the VCF files of 186 individuals [32] generated from Illumina's BaseSpace software (left tables in Fig. 1). Each VCF file has around 4-5 million rows; each row contains information on one of the individual's genomic variants. Each VCF file is about 300 MB in size. In an attempt to protect the privacy of study subjects, we apply the following strategy to conceal the real genetic variant information contained in the VCF files: we first transform each original genomic location by multiplying it with an undisclosed constant real number, taking the floor integer of the result, and then adding another undisclosed constant integer number.
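The location-concealing transform described above amounts to a simple affine mapping of each position. A minimal sketch follows; the two constants are placeholders, since the values actually used are undisclosed. For a positive multiplier the mapping preserves genomic order, so the concealed files remain valid, sorted input for the merging experiments.

```python
import math

# Placeholder constants; the values actually used are undisclosed.
SCALE = 1.2345     # undisclosed real-valued multiplier
OFFSET = 6789      # undisclosed integer offset

def conceal_position(pos):
    """Map an original genomic coordinate to its concealed counterpart."""
    return math.floor(pos * SCALE) + OFFSET
```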
It is common that multiple VCF files need to be merged into a single TPED file for analysis tools such as PLINK. A TPED file resembles a big table, aggregating the genotypes of all individuals under investigation by genomic locations (right table in Fig. 1). The merging follows several rules. First, each record is associated with a data quality value in the FILTER column, which records the status of this genomic position passing all filters; usually only qualified records with a "PASS" filter value are retained. Second, genotypes in VCF files are stored in the form of allele values, where 0 stands for the reference allele, 1 stands for the first mutant allele, 2 stands for the second mutant allele, and so on. Allele values must be translated into the corresponding types of nucleotides in the TPED file. Third, all individuals need to have a genotype for genomic locations appearing in at least one VCF file. The default genotype for a missing value is a pair of homozygous reference alleles.

The merging of multiple VCF files into a single VCF file follows these rules: first, the ALT and INFO columns of a genomic location in the merged file are set as the concatenated values of the corresponding columns on that location from all input files, with duplicated values removed. Second, the QUAL column of a genomic location in the merged file is set as a weight-averaged quality value of all individuals on that location. Third, a genomic location is kept only when it appears in at least one input file and has a FILTER column value of "PASS." Fourth, if an individual does not have allele values on a genomic location in the input file, their missing allele values are designated as "." in the merged file.

For our Apache cluster-based schemas, the merging of multiple VCF files into a single TPED file and the merging of multiple VCF files into a single VCF file differ only in the value contents of the key-value pairs, so they should have the same scalability property. Although we implement the applications of both merging types using our Apache cluster-based schemas, which are available on our project website, we focused our experiments on the merging of multiple VCF files into a single TPED file and only evaluate the execution speed of the merging of multiple VCF files into a single VCF file with VCFTools as the benchmark.

MapReduce (Hadoop) schema

This schema is built on Hadoop's underlying model MapReduce and runs on Hadoop clusters. MapReduce [7] is a parallel computing model based on a split-apply-combine strategy for data analysis, in which data are mapped to key-values for splitting (mapping), shuffling, and combining (reducing) for final results. We use Apache Hadoop-2.7 as the system for our implementation. Our optimized schema consists of two MapReduce phases, as shown in Fig. 2 (the pseudocodes are shown in Supplementary Fig. S1).

First MapReduce phase

Raw data are loaded from HDFS into parallel mappers to perform the following tasks. First, unqualified data are filtered out and qualified data are mapped to key-value pairs. The mapper output key is the genomic location, and the output value is the genotype and individual ID. Second, key-value pairs are grouped together by chromosomes and temporarily saved as compressed Hadoop sequence files [33] for faster I/O in the second MapReduce phase. With this grouping, if SNPs of interest are located in a few selected chromosomes only, we can choose to merge records from just these selected chromosomes rather than from all chromosomes. Meanwhile, these records are sampled to explore the distribution profile of keys along chromosomes in order to determine boundaries. The boundaries are determined so that there is an approximately equal number of records within each segment. Because all records falling in the same segment will be assigned to the same reducer in a later phase, boundaries calculated in this way ensure the workload of each reducer is balanced. There are two rounds of sampling. The first one happens in each mapper with a pre-specified sampling rate, which in our case is set to be 0.0001. Sampled records are then separated and distributed to different reducers in this phase by chromosome, where they are sampled again with a rate equal to the reciprocal of the number of input files. This second sampling effectively limits the number of final sampled records even in the face of a very large number of input files. Because the number of reducers instantiated in the second phase equals the number of boundaries, which in turn is decided by the number of sampled records, we can therefore avoid launching unnecessary reducers, thus minimizing task overheads.
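The per-record logic of the first phase (filtering, key-value mapping, and first-round sampling) can be sketched in plain Python as follows. This is only an illustration of the logic described above, not the project's Java mappers; the field positions assume a single-sample VCF with the genotype in the tenth column.

```python
import random

SAMPLE_RATE = 0.0001   # first-round sampling rate used in the schema

def map_and_sample(vcf_lines, sample_id, rng=random.random):
    """Filter one subject's VCF lines, map them to key-value pairs, and draw a
    first-round sample of keys for later boundary estimation.

    Returns (records, sampled_keys), where each record is
    ((chrom, pos), (sample_id, ref, alt, genotype)).
    """
    records, sampled_keys = [], []
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt, filt, genotype = f[0], int(f[1]), f[3], f[4], f[6], f[9]
        if filt != "PASS":                     # keep only qualified records
            continue
        records.append(((chrom, pos), (sample_id, ref, alt, genotype)))
        if rng() < SAMPLE_RATE:                # first-round sampling of keys
            sampled_keys.append((chrom, pos))
    return records, sampled_keys
```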
Figure 1: Merging multiple VCF files into a single TPED file. Left tables represent input VCF files. The table on the right represents the merged TPED file. Records are filtered out if their Filter value is not equal to "PASS" (Pos 10 147). Individual genotypes from multiple VCF files with the same genomic location are aggregated together in one row. The resulting TPED file thus has an inclusive set of sorted genomic locations of all variants found in the input VCF files.

Figure 2: The workflow chart of the MapReduce schema. The workflow is divided into two phases. In the first phase, variants are filtered, grouped by chromosomes into bins, and mapped into key-value records. Two sampling steps are implemented to generate partition lists of all chromosomes. In the second phase, parallel jobs of specified chromosomes are launched. Within each job, records from corresponding bins are loaded, partitioned, sorted, and merged by genomic locations before being saved into a TPED file.

Second MapReduce phase

In this phase, multiple parallel MapReduce jobs are created, one for each chromosome, to handle all the records in sequence files generated from the first phase. Within each job, a partitioner redirects records to the appropriate reducer by referring to the splitting boundaries from the previous phase, so records falling in between the same pair of boundaries are aggregated together. Finally, each reducer sorts and merges aggregated records by genomic locations before saving them to a TPED file. In this way, globally sorted merging can be fulfilled.
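A range partitioner of the kind used in the second phase can be sketched as follows; the boundary values in the usage example are made up, whereas the actual schema derives them from the keys sampled in the first phase.

```python
import bisect

def make_partitioner(boundaries):
    """Build a range partitioner from sorted split boundaries (genomic positions).

    With k boundaries there are k + 1 reducers; a record goes to the reducer whose
    range contains its position, so each reducer receives a contiguous and roughly
    equal-sized slice of the key space.
    """
    boundaries = sorted(boundaries)
    def partition(pos):
        return bisect.bisect_right(boundaries, pos)
    return partition

# Hypothetical usage for one chromosome:
# part = make_partitioner([1_000_000, 2_000_000, 3_000_000])
# part(1_500_000)  # -> 1 (the second reducer)
```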
HBase schema

HBase [8] is a column-oriented database where data are grouped into column families and split horizontally into regions spreading across the cluster. This data storing structure supports efficient sequential reading and writing of large-scale data as well as fast random data accessing. Also, HBase is storage efficient because it can remember null values without saving them on disk. These features make HBase an ideal platform for managing large, sparse data with relatively low latency, which naturally fits the sorted merging case. We use HBase-1.3 as the system for our implementation. As shown in Fig. 3, our optimized HBase schema is divided into three phases as discussed next (refer to Supplementary Fig. S2 for pseudocodes).

Figure 3: The workflow chart of the HBase schema. The workflow is divided into three phases. The first is a sampling, filtering, and mapping phase. A MapReduce job samples out variants whose genomic positions are used as region boundaries when creating the HBase table. Only qualified records are mapped as key-values and saved as Hadoop sequence files. The second is the HBase bulk loading phase, in which a MapReduce job loads and writes records generated from the previous phase, aggregating them into corresponding regional HFiles in the form of HBase's row key and column families. Finished HFiles are moved into HBase data storage folders on region servers. In the third phase, parallel scans are launched over regions of the whole table to retrieve desired records that are subsequently merged and exported to the TPED file.

Sampling phase

The main challenge of HBase is to avoid computational hot spots in the cluster, which can happen when it starts loading a table from a single region hosted by a single node. Therefore, we need to presplit the table into regions of approximately equal size before loading. The sampling phase is introduced to determine reasonable presplitting regional boundaries. The total number of regions is set to half of the number of input files so the size of each region is approximately 1 GB. Meanwhile, mappers of this phase also save qualified records as compressed Hadoop sequence files on HDFS that are used as inputs in the next phase. In addition, filtering and key-value mapping also take place in this phase.
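One way to derive the presplit boundaries from the sampled keys is to take equally spaced quantiles, as in the short sketch below. This is a simplified stand-in for the sampling phase described above, assuming sampled (chromosome, position) keys are available in memory.

```python
def presplit_keys(sampled_positions, n_input_files):
    """Choose HBase region split keys from sampled (chrom, pos) keys.

    Following the schema above, the number of regions is set to half the number
    of input files, and split points are taken at equally spaced quantiles of the
    sampled keys so regions hold roughly equal numbers of records.
    """
    n_regions = max(1, n_input_files // 2)
    keys = sorted(sampled_positions)
    step = len(keys) / n_regions
    # n_regions regions require n_regions - 1 split keys.
    return [keys[int(i * step)] for i in range(1, n_regions)]

# Hypothetical usage:
# splits = presplit_keys(sampled, n_input_files=186)  # 92 split keys -> 93 regions
```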
Bulk loading phase

Even when the table has been presplit evenly, the hot spot problem of loading sorted inputs can still emerge because sorted records are loaded sequentially and, at any instant, they still access the same region and server. During the bulk loading, the key and value of each record produced from the previous phase are converted into HBase's binary row-key and column-value, respectively, and saved into an HFile, HBase's native storage format. The row-key here is in the form of chromosome-genomic location, and the column-value refers to reference allele, individual ID, and genotype. The bulk loading populates each HFile with records falling in the same pair of presplit regional boundaries. Because HFiles are written simultaneously by parallel mappers/reducers, all working nodes are actively involved, and the regional hot spot is thus circumvented. Upon finishing writing, HBase can readily load the HFiles in parallel into the table by simply moving them into local HBase storage folders. This procedure is therefore at least an order of magnitude faster than the normal loading, in which data are loaded sequentially via HBase servers' I/O routines. The order of records in the table is guaranteed because they are internally sorted by the writing reducers and HBase's Log-Structured Merge-tree [34]. It is worth mentioning that VCF records are always sparse, thus HBase is very storage efficient.

Exporting phase

A scan of a specified genomic window is performed on the table. It involves launching parallel mappers, each receiving records from a single HBase region, filling in missing genotypes, concatenating records with the same row-key, and outputting final results into TPED files.

Spark schema

Spark [9] is a distributed engine built upon the ideas of MapReduce and RDD. It can save intermediate results in the form of RDDs in memory and perform computations on them. Also, its computations are lazily evaluated, which means the execution plan can be optimized to include as many computational steps as possible. As a result, it is ideal for iterative computations such as sorted merging. We implement our optimized Spark schema on Spark-2.1. It has three stages, which we describe below and present in Fig. 4 (refer to Supplementary Fig. S3 for pseudocodes).

Figure 4: The workflow chart of the Spark schema. The workflow is divided into three stages. In the first stage, VCF records are loaded, filtered, and mapped to pairRDDs with keys of genomic position and values of genotype. The sort-by-key shuffling spans across the first two stages, sorting and grouping together records by keys. Then, grouped records with the same key are locally merged into one record in TPED format. Finally, merged records are exported to the TPED file.

RDD preprocessing stage

This stage involves loading raw data as RDDs, filtering, and mapping RDDs to paired-RDDs with keys (chromosome and genomic position) and values (reference allele, sample ID, and genotype). This stage ends with a sorting-by-key action that extends to the next stage.

Sorting and merging stage

The sort-by-key shuffling operation repartitions and sorts PairRDD records so that records with the same key are aggregated together; these are then merged into the TPED format and converted back to RDD records for outputting. However, Spark's native family of group-by-key functions should not be used for the merging here because their default partitioner is hash based and different from the range-based partitioner used by the previous sort-by-key function. Consequently, the merged results would be reshuffled into an unsorted status. We therefore optimize the merging to bypass these functions so that merging can be performed locally without data reshuffling, ensuring both order and high speed.
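A simplified PySpark sketch of the first two stages is given below. The project code itself is Java based, and the sample IDs, HDFS paths, and TPED formatting here are hypothetical; the point is that sortByKey performs the single range-partitioned shuffle, after which mapPartitions merges records locally without a group-by-key that would re-shuffle and destroy the global order.

```python
from pyspark import SparkContext

sc = SparkContext(appName="CloudMergeSketch")   # assumes a configured Spark cluster

def parse(line, sample_id):
    f = line.split("\t")
    # key: (chromosome, position); value: (reference allele, sample ID, genotype)
    return (f[0], int(f[1])), (f[3], sample_id, f[9])

# RDD preprocessing stage: load, filter, and map to pair-RDDs.
inputs = {"subject1": "hdfs:///vcf/subject1.vcf",   # hypothetical sample IDs and paths
          "subject2": "hdfs:///vcf/subject2.vcf"}
pair_rdds = [
    sc.textFile(path)
      .filter(lambda l: not l.startswith("#") and l.split("\t")[6] == "PASS")
      .map(lambda l, sid=sid: parse(l, sid))
    for sid, path in inputs.items()
]
records = sc.union(pair_rdds)

# Sorting and merging stage: sortByKey uses a range partitioner, so records with
# the same key land in the same, globally ordered partition; merging is then done
# locally with mapPartitions instead of a hash-based group-by-key.
def merge_partition(it):
    current_key, genotypes = None, []
    for key, (ref, sid, gt) in it:
        if key != current_key and current_key is not None:
            yield current_key, genotypes
            genotypes = []
        current_key = key
        genotypes.append((sid, gt))
    if current_key is not None:
        yield current_key, genotypes

merged = records.sortByKey(numPartitions=len(inputs)).mapPartitions(merge_partition)
merged.saveAsTextFile("hdfs:///merged_tped")        # exporting stage
```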
Exporting stage

In this stage, merged RDD records are saved as TPED files on HDFS.

Execution parallelism has an important impact on the performance. To maximize performance, the number of parallel tasks is set to be the number of input files. In this way, data locality is maximized, and each task is assigned a proper amount of work. In addition, unlike using MapReduce or HBase, when performing sorting by keys, no explicit sampling is needed because Spark keeps track of the number of records before determining repartition boundaries.

Parallel multiway-merge and MPI-based HPC implementations

For most bioinformatics researchers, the daily working environment is still a traditional in-house HPC cluster or a stand-alone powerful server (with ≥16 cores and ≥200 GB of memory) rather than heterogeneous cloud-based clusters. Therefore, we also implement a parallel multiway-merge program running on a single machine and an MPI-based (mpi4py v3.0) "single program, multiple data" program running on an HPC cluster as benchmarks. The source codes are available at our GitHub website [35] (CloudMerge; RRID:SCR_016051). We chose to implement multiway-merge because many existing bioinformatics tools, including VCFTools and PLINK, adopt it as the underlying algorithm for sorted merging. Multiway-merge is highly efficient on a single machine as it requires only one scan of the sorted input files, so it can theoretically run at the speed of disk I/O.

Generally, there are two types of parallelism: data parallelism and task parallelism. The former splits data horizontally into blocks of roughly equal sizes (the size of genomic intervals in our case) before assigning them to all available processes; the latter assigns a roughly equal number of input files to each process. For parallel multiway-merge, we chose data parallelism because an implementation of task parallelism would be the same as the HPC-based implementation running on a single node. Perhaps the most difficult part of data parallelism is uncertainty about the data distribution across all input files, which usually leads to the problem of workload imbalance among processes. If we pre-sample all the input files to estimate the record distribution, then a full scan of the input files is required that will almost certainly take more time than the single-process multiway-merge method. As a compromise, we assume the distributions of SNP locations in all VCF files are uniform and that the input files can be split into regions of approximately equal sizes. The total number of regions is set to be the number of concurrent processes, so that each region is specifically handled by one process. To avoid seeking of a process's file reader to its starting offset from the beginning of the file, we take advantage of the Tabix indexer [36], which builds indices on data blocks of the input file and places the reader's pointer directly onto the desired offset. One important aspect of the Tabix indexer is that it requires the input file to be compressed in bgzip format, which is not supported by Hadoop, HBase, or Spark. The compression and decompression of a file in bgzip format can be much faster than in the bz2 format used in our cluster-based schemas, single multiway-merge, and HPC-based implementations, so parallel multiway-merge can run much faster than the other methods/schemas when the input data size is small.

For the HPC-based implementation, we adopt task parallelism (Fig. 5) to avoid sampling and workload imbalance. Otherwise, the workflow of the HPC-based implementation is the same as that of the MapReduce-based schema, with the same operations in the same order: sampling in parallel, dividing the dataset into splits of equal sizes, and assigning the splits to processes to perform the merging. However, this implementation is without the data locality offered by HDFS and the task coordination offered by YARN and thus has a performance no better than the MapReduce-based schema. Specifically, input files are shared across all nodes in the cluster via a network file system (NFS). In the first round, each core/process fetches roughly the same number of files from the NFS and performs multiway-merging locally. In the following rounds, we adopted a tree-structured execution strategy. In the second round, processes with even ID numbers (process IDs start from 0) retrieve the merged file from the adjacent process to the right, which is then merged with the local merged file; processes with odd ID numbers are terminated. In the third round, processes with ID numbers divisible by 4 retrieve the merged file from the adjacent process to the right in the second round and merge it with the local merged file. This process continues until all the files are merged into a single file, for a total of log(n) rounds, where n is the number of input files.

Figure 5: The execution plan of the HPC-based implementation. The execution plan resembles a branched tree. In the first round, each process is assigned an approximately equal number of files to merge locally. In the second round, the even-numbered process retrieves the merged file of its right adjacent process to merge with its local merged file. In the third round, processes with ID numbers that can be fully divided by 4 retrieve the merged file of its right adjacent process in the second round and do the merging. This process continues recursively until all files are merged into a single TPED file (round four).
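The tree-structured reduction can be sketched with mpi4py as follows. This is a simplified illustration of the scheme described above, not the benchmark program itself: it exchanges in-memory lists by message passing rather than merged files over NFS, and the local round-1 result is faked with synthetic sorted positions.

```python
from mpi4py import MPI    # run under MPI, e.g.: mpiexec -n 8 python tree_merge.py
import heapq

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Round 1 stand-in: each process "merges" its own share of input files locally.
# Here the local result is just a sorted list of fake genomic positions.
local_result = sorted(range(rank, 1000 * size, size))

step = 1
while step < size:                                    # rounds 2, 3, ... (log2(size) in total)
    if rank % (2 * step) == 0:                        # surviving process of this round
        if rank + step < size:
            received = comm.recv(source=rank + step, tag=step)
            local_result = list(heapq.merge(local_result, received))  # pairwise sorted merge
    else:                                             # right-hand neighbour: hand over result and stop
        comm.send(local_result, dest=rank - step, tag=step)
        break
    step *= 2

if rank == 0:
    print(f"merged {len(local_result)} records using a {size}-process tree")
```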
Strong and weak scalabilities

In this study, we quantify scalability by measuring computing efficiency in tests of strong and weak scalability. We define efficiency as the average time cost of processing a file per core:

Efficiency = (T_b × C_b / N_b) / (T_i × C_i / N_i),

where T_b is the baseline running time, C_b is the baseline number of cores, N_b is the baseline number of input files, T_i is the current running time, C_i is the current number of cores, and N_i is the current number of input files. We also incorporated the parallel multiway-merge and MPI-based HPC implementations as benchmarks in the tests.

For the strong scalability test, we fix the number of input files at 93 and increase the computing resources up to 16-fold from the baseline. The baseline is a single node (four cores) for all methods/schemas except for the parallel multiway-merge, for which only a single core is used because it can only run on a single machine. For the weak scalability test, we increase both computing resources and input data size at the same pace. The ratio is 10 files/core for parallel multiway-merge and 10 files/node for all others.
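The efficiency measure transcribes directly into code; the numbers in the usage example are hypothetical and serve only to show that a run needing half the time on twice the cores keeps an efficiency of 1.0.

```python
def efficiency(t_b, c_b, n_b, t_i, c_i, n_i):
    """Efficiency = (T_b * C_b / N_b) / (T_i * C_i / N_i): the baseline per-file
    core-time divided by the current per-file core-time, so a value of 1.0 means
    scaling exactly as well as the baseline."""
    return (t_b * c_b / n_b) / (t_i * c_i / n_i)

# Illustrative check with the strong scalability setup (93 files throughout):
print(efficiency(t_b=1000, c_b=4, n_b=93, t_i=500, c_i=8, n_i=93))   # -> 1.0
```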
Results

We conducted experiments of the Apache cluster-based schemas using Amazon's Elastic MapReduce service and experiments of the HPC-based implementation using MIT's StarCluster toolkit, which launches an AWS OpenMP virtual private cluster. Within both infrastructures, we choose EC2 working nodes of the m3.xlarge type, which has four high-frequency Intel Xeon E5-2670 v2 (Ivy Bridge) processors and 15 GB of memory. We conducted experiments of parallel multiway-merge on a single EC2 r4.8xlarge instance with 32 high-frequency Intel Xeon E5-2686 v4 (Broadwell) processors and 244 GB of memory. We used a dataset consisting of 186 VCF files [32] generated from Illumina's BaseSpace software.

Overall performance analysis of cluster-based schemas

Our primary goal is to explore the scalabilities of the three schemas on input data size and available computing resources, namely, CPUs. To achieve this, in this experiment we adjusted the number of input files from 10 to 186, with an approximate total uncompressed size from 2.5 GB to 40 GB, and used a varying number of working nodes from 3 to 18, namely, 12 to 72 cores. As Fig. 6 shows, for all three schemas and given a fixed number of cores, the execution time increases at a slower pace than that of the input data size. On the one hand, the increase of execution time is more obvious with fewer cores because each core is fully utilized. As the number of input files increases, so does the number of parallel tasks assigned to each core. For example, given 12 cores, as the number of input files increases from 10 to 186 (18.6-fold), the execution time increases from 739 to 4366 seconds (~5.9-fold) for the MapReduce schema, from 375 to 5479 seconds (~14.6-fold) for the HBase schema, and from 361 to 1699 seconds (~4.7-fold) for the Spark schema. On the other hand, with relatively more cores, such as 72, this linear increasing trend is less pronounced because there are more cores than tasks, so all cores are assigned at most one task. We also notice that when the input data size is small or moderate, the Spark schema does not always show a consistent improvement in execution time with more cores. This is reflected, e.g., in the intersection of curves that occurs between 24 and 72 cores in Fig. 6c. This phenomenon is attributed to the limitation of Spark's internal task assignment policy, which gives rise to the possibility that some nodes are assigned more than one task while others remain idle.

Figure 6: The scalability of Apache cluster-based schemas on input data size. As the number of input files increases from 10 to 186, the time costs of all three schemas with 12, 24, or 72 cores increase at a slower pace than that of the input data size, especially when the number of cores is relatively large. The HBase schema with 12 cores has the largest increase (from 375 to 5479 seconds, ~14.6-fold).
Comparing strong and weak scalabilities between Apache cluster-based schemas and traditional parallel methods

Figure 7 shows the results of the strong scalability. In accordance with Amdahl's law [37], all schemas/methods show degraded efficiency with increasing computing nodes/cores. Parallel multiway-merge has the steepest degradation because the more parallel processes there are, the higher the likelihood of workload imbalances among them. In addition, disk I/O reaches saturation as more processes write simultaneously. Furthermore, to achieve data parallelism and improve execution speed, we used the Tabix indexer to index data blocks of the input files. While reading, each process needs to maintain a full copy of the file descriptors, indices, and uncompressed current data blocks of all input files in memory. When both the number of processes and the number of input files are large, great pressure is placed on memory management. For instance, a test with 93 files and 16 processes requires more than 100 GB of memory, which results in a very long memory swap and garbage collection time. In contrast, the MapReduce-based schema has the best efficiency. Surprisingly, its efficiency even improves when the number of cores doubles from the baseline. This is because it has many parallel tasks in its second MapReduce phase; when the core allowance is low, the overheads of repetitive task launching and terminating on a single core become non-negligible. Consequently, as the number of cores starts to increase, the actual proportion of overheads in the total running time decreases, leading to an improved efficiency. Nonetheless, as the number of cores further increases, the unparallelized parts of the schema gradually dominate the total running time, eventually leading to a reduced efficiency.

For the weak scalability test (Fig. 8), following Gustafson's law [38], all methods/schemas show a much better efficiency than in the strong scalability test. Meanwhile, for the same reasons as in the strong scalability test, the MapReduce-based schema enjoys the best efficiency while the HPC-based implementation has the worst. This is because, for the HPC-based implementation, as the number of input files increases, the total number of merging rounds also increases, leading to a significantly reduced efficiency. Finally, all three Apache cluster-based schemas demonstrate significantly better weak scalability than the two traditional parallel methods.

Figure 7: Comparing the strong scalability between traditional parallel/distributed methods and Apache cluster-based schemas. We fix the number of files at 93 and increase the number of nodes/cores. The baseline for the parallel multiway-merge is one single core, while for the others it is one single node (four cores). All methods/schemas show a degraded efficiency as computing resources increase 16-fold from the baseline. Specifically, the efficiency of the MapReduce-, HBase-, and Spark-based schemas drops to 0.83, 0.63, and 0.61, respectively, while the efficiency of the parallel multiway-merge and HPC-based implementations drops to 0.06 and 0.53, respectively.

Figure 8: Comparing the weak scalability between traditional parallel/distributed methods and Apache cluster-based schemas. We simultaneously increase the number of cores and input data sizes while fixing the ratio of file/core (parallel multiway-merge) or file/node (all others) at 10. The baseline is the same as in the test of strong scalability. All but the MapReduce-based schema have degraded efficiency, among which the HPC-based implementation has the steepest degradation. Specifically, when computing resources increase 16-fold from the baseline, the efficiency of the MapReduce-, HBase-, and Spark-based schemas changes to 3.1, 0.87, and 0.75, respectively, and for the parallel multiway-merge and HPC-based implementations, the efficiency reduces to 0.42 and 0.35, respectively.
Figure 9: The performance anatomy of cluster-based schemas on increasing input data size. The number of cores in these experiments is fixed at 48. Time costs of all phases of the three schemas have a linear or sublinear correlation with the input data size. (a) MapReduce schema: The two MapReduce phases have a comparable time cost, increasing 6.3- and 3.1-fold, respectively, as the number of input files increases from 10 to 186. (b) HBase schema: The time spent in each phase increases 4.2-, 5.6-, and 5.0-fold, respectively, as the number of input files increases from 10 to 186. The bulk loading and exporting phases together take up more than 80% of the total time expense. (c) Spark schema: The time cost increases 5.8-, 6.0-, and 6.0-fold, respectively, for the three stages as the number of input files increases from 10 to 186 files. Like the HBase schema, the first two stages of the Spark schema together account for more than 80% of the total time cost.

Anatomic performance analysis of Apache cluster-based schemas

Another important goal of our study is to identify potential performance bottlenecks, so we evaluate the execution time of each phase/stage of all three schemas. Figure 9 shows the trends of the anatomic computing time spent on merging an increasing number of VCF files (from 10 to 186) using 48 cores. For the MapReduce schema (Fig. 9a), its two phases account for comparable proportions of the total time and both show a linear or sublinear scalability. The reason that the time cost of the first phase between 40 and 93 input files remains flat is that both runs use two rounds of mappers. As the number of files doubles to 186, four rounds of mappers are required, which results in about a 2-fold increase in the time cost, as expected. The three phases of the HBase schema (Fig. 9b) are scalable with the input data size. Meanwhile, the second phase becomes more dominant with more input files owing to the larger amount of shuffled data during the writing of HFiles. However, we do not consider it a bottleneck since all tasks of this phase are parallelized with no workload or computational hot spot. Also, we do not observe a super-linear (relative to input data size) increment pattern in the figure. Finally, Fig. 9c shows the time costs of the three stages of the Spark schema. They show a uniform increasing trend with the number of input files. Among them, the second stage takes up a considerable proportion of the total execution time as it has a relatively expensive sort-by-key shuffling operation. Although no data are shuffled in the first stage, its time lapse is close to that of the second stage. This is because at the end of the first stage, data are sampled to determine the boundaries used by sort-by-key's range partitioner. This operation demands a considerable execution time because it scans all the data and balances them if necessary.

Given that no super-linear increasing trend is observed in the running time for any phase/stage of the three schemas, and they generally scale well with the input data size, we conclude that although the performance of these schemas might degrade to some extent when dealing with even larger input data due to overheads such as data transmission over the network, we would not expect any significant bottleneck.
Comparing execution speed between Apache cluster-based schemas and traditional methods

Another intriguing question is: how does the speed of the Apache cluster-based schemas compare to single-machine based and traditional parallel/distributed methods/applications when merging multiple VCF files into a single VCF or TPED file? To answer this question, we choose the widely used VCFTools (v4.2) and a single-process multiway-merge implementation as single-process benchmarks, and the parallel multiway-merge and HPC-based implementations as parallel/distributed benchmarks; these are the same ones used in the experiments of strong and weak scalabilities described above.

In the first experiment, we merged 40 VCF files into 1 VCF file using VCFTools as the benchmark. As shown in Table 1, VCFTools takes 30,189 seconds, while the fastest Apache cluster-based schema among the three, MapReduce based, takes only 484 seconds using 72 cores, being about 62-fold faster.

Table 1: Performance comparisons between VCFTools and Apache cluster-based schemas

                       VCFTools   MapReduce   HBase   Spark
Time cost (seconds)    30,189     484         577     596
Fold (faster)          -          62.4        52.3    50.7

Table 2: Pros and cons of MapReduce, HBase, and Spark schemas

MapReduce
• Pros: Good for large input data size and sufficient computing resources; Simple architecture and least overheads given sufficient computing resources; Best parallelism; Good for one-time merging; Performance is stable
• Cons: Merging is not incremental; Large overhead when computing resources are limited

HBase
• Pros: Good for intermediate input data size (≥20 and ≤100 VCF files); Supports incremental merging; Supports on-line analytical processing; Best storage efficiency
• Cons: Users must determine region number in advance; Has most local I/O; Complex performance tuning

Spark
• Pros: Good for large input data size (>100 VCF files) and relatively limited computing resources; Keeps intermediate results in memory and least local I/O; Good for subsequent statistical analysis on merged results
• Cons: Possibly weakened data locality during loading; Slightly unstable performance when computing resources exceed the needs of the input data size; Actual execution plan is not transparent; Complex performance tuning

In the second experiment (Fig. 10), we tested the time costs of merging multiple VCF files into a single TPED file using the single/parallel multiway-merge and HPC-based implementations as benchmarks. The single multiway-merger is run on a node with a hardware configuration (four cores and 15 GB of memory) identical to the nodes on which the Apache cluster-based schemas are run. The parallel multiway-merger is run on a node with a maximum of 18 simultaneously running processes.
The HPC-based implementation is run on an 18-node cluster with the same hardware configuration as the cluster where the Apache cluster-based schemas are run. Initially, with 10 input files, the parallel multiway-merge (~30 seconds) is much faster than all the other methods; it is about 7.3-fold faster than the fastest Apache cluster-based schema (MapReduce, 221 seconds). On the other hand, the slowest method is the single-process multiway-merger, which takes 620 seconds to finish (about 2.8-fold slower than the MapReduce-based schema). It is worth mentioning that in this test, the parallel multiway-merge is essentially the same as the single-process multiway-merge and that the speed difference (~378 seconds) between them is the result of a different compression format (bz2 vs bgzip) of the input files, as explained above. As we gradually increase the number of input files to 186, the difference in speed between the fastest overall method (parallel multiway-merger, 602 seconds) and the fastest Apache cluster-based schema (MapReduce, 809 seconds) decreases to about 1.3-fold, while the difference between the slowest overall method (single multiway-merger, 13,219 seconds) and the MapReduce-based schema increases to 16.3-fold. In addition, all three Apache schemas significantly outperform the HPC-based implementation. As explained in the strong and weak scalabilities section above, we expect that the larger the input data size, the faster the Apache cluster-based schemas would run compared to the other traditional methods.

Figure 10: Execution speed comparison among Apache cluster-based schemas and traditional methods. First, we compare the speeds of the three Apache schemas with that of three traditional methods, which are the single-process multiway-merge, parallel multiway-merge, and HPC-based implementations. As the number of input files increases from 10 to 186, the speeds of the Apache cluster-based schemas improve much more significantly than those of the traditional methods. The numbers in the figures indicate the ratio of the time cost of each traditional method to that of the fastest Apache cluster-based schema. Second, we compare the processing speed among the three Apache cluster-based schemas, which are comparable to each other regardless of the input data size. The MapReduce schema performs the best in merging 10 and 186 files; the HBase schema performs the best in merging 20, 40, and 60 files; and the Spark schema performs the best in merging 93 files.

We also compare the time cost among the three schemas (Fig. 10). They have a comparable speed.
More specifically, the MapReduce schema performs best if enough cores are available and the input data size is large; the HBase schema performs best with moderate input data size; and the Spark schema performs best if only a limited number of cores are available and the input data size is large. The rationale behind this observation is that, when the number of cores is sufficient, the MapReduce-based schema can make the most use of the available computing resources because it runs a constant 25 parallel jobs (one for each of chromosomes 1-22, X, Y, and M [mitochondria]) in its second phase. In contrast, the Spark-based schema has fewer tasks, whose number equals the number of input files to achieve maximum data-task locality. When the input data size is moderate, the HBase schema triumphs because of its internal sorting and relatively compact storage format of intermediate data. When the input data size is large and computing resources are relatively limited, the Spark-based schema outperforms the other two owing to its small number of data shufflings (only one), execution plan optimization, and ability to cache intermediate results in memory. We caution, however, that the computing time may fluctuate depending on the distribution of genomic locations in the input files as well as the data loading balance of the HDFS.

Discussion

In this report, we describe three cluster-based schemas running on the Apache Hadoop (MapReduce), HBase, and Spark platforms for performing sorted merging of variants identified from WGS. We show that all three schemas are scalable on both input data size and computing resources, suggesting that large-scale "-omics" data can be merged efficiently given the computing resources readily available in the cloud. Furthermore, the three schemas show better strong and weak scalabilities than the traditional single-machine based parallel multiway-merge and cluster-based HPC methods owing to the absence of an I/O bottleneck, better workload balance among nodes, and less pressure on memory, as well as the data locality and efficient task coordination mechanisms provided by HDFS and YARN. We also show that even with a moderate-sized cluster and input data, all three schemas significantly outperform the broadly used, single-machine based VCFTools, single-process multiway-merge, and HPC-based implementations. Although the parallel multiway-merge implementation is initially much faster than the Apache schemas owing to its advantage of local I/O and light compression of input files, its poor scalability diminishes this initial advantage as the number of concurrent processes and input files increases. Consequently, we expect the Apache cluster-based schemas to eventually outperform the parallel multiway-merge when merging a much larger scale of data using a larger number of cores.

Unlike normal merging, efficient sorted merging of many large tables has always been a difficult problem in the field of data management. Multiway-merge is the most efficient single-machine based method for sorted merging, but its performance is limited by the disk I/O [39]. Sorted merging also poses challenges to distributed system-based solutions because neither the efficient hash-based merging nor caching the intermediate table in shared memory is feasible [40]. Although a utility named total-order-joining is provided by Hadoop for addressing this problem, it suffers from both network communication and local disk I/O bottlenecks and thus is not scalable [27, 41]. In contrast, our schemas divide this problem into different phases/stages of tasks, each conquered in parallel to bypass these bottlenecks and achieve maximum parallelism. Furthermore, in addition to merging sequencing variant data, these schemas can be generalized to other key-based, sorted merging problems that are frequently encountered in genetics and genomics data processing. As an example, they can be slightly modified to merge multiple BED format files such as ChIP-seq peak lists [42] and other genomic regions of interest. Other potentially useful features include the following: unlike traditional sorted merging algorithms that usually require presorted inputs for a better performance, our schemas are free of such a requirement; and our implementations automatically take care of multi-allelic positions, which are frequent in large-scale VCF files, by retaining the information of all alleles until the merging actually occurs.

Finally, in light of the different features and specialties of these three platforms, each of the three schemas we developed has its own advantages and disadvantages under different application scenarios, as summarized in Table 2. For example, the MapReduce schema is good for a static, one-time, nonincremental merging on large-size data provided sufficient cores are available, since it has the most parallel jobs, the least overheads, and the most transparent workflow. The HBase schema, supported by data warehousing technologies, fits an incremental merging since it does not need to re-merge existing results with new ones from scratch, provided the incremental merging is performed on the same chromosomes. Also, it provides highly efficient storage and on-line analytical processing on merged results. The Spark schema is ideal for merging large-scale data with relatively limited computing resources because it has the least data shuffling and keeps intermediate results in memory. A bonus brought by Spark is that subsequent statistical analyses can be carried out directly on the merged results using its rich set of parallel statistical utilities.
Availability and requirements

Project name: CloudMerge
Project home page: https://github.com/xsun28/CloudMerge
Operating system(s): Linux
Programming language: Java, Python
Other requirements: Java 1.7 or higher, Python 2.7 or 3.6, Hadoop-2.7, HBase-1.3, Spark-2.1, StarCluster 0.95, MPI for Python 3.0.0
License: Apache License 2.0

Availability of supporting data

The source codes for the project are available in GitHub. The 186 individual VCF files used in our experiments are modified from the original VCF files obtained from WGS conducted by the Consortium on Asthma among the African-Ancestry Population in the Americas (CAAPA) [32]. To conceal potentially individually identifiable genotype information from the public, we encrypt the authentic genomic locations of the original 93 VCF files to generate a new batch of encrypted VCF files for test purposes. Please refer to the Data formats and operations section for details. These supporting data and a snapshot of the project codes are available at the GigaScience database, GigaDB [43]. Via GigaDB, we also provide sample results of merging 93 VCF files into either one VCF or one TPED file using our Apache cluster-based schemas.

Additional file

Figure S1. Pseudocodes of the MapReduce schema.
Figure S2. Pseudocodes of the HBase schema.
Figure S3. Pseudocodes of the Spark schema.

Abbreviations

AWS: Amazon Web Services; CAAPA: Consortium on Asthma among African-ancestry Population in the Americas; CPU: central processing unit; GWAS: genome-wide association studies; HDFS: Hadoop distributed file system; HPC: high-performance computing; I/O: Input/Output; MPI: message passing interface; NFS: network file system; NGS: next generation sequencing; RDD: resilient distributed dataset; SNP: single-nucleotide polymorphism; VCF: variant call format; WGS: whole-genome sequencing.

Ethics approval and consent to participate

Ethics approval for the CAAPA program was provided by the Johns Hopkins University Institutional Review Board (IRB) following commencement of the study in 2011 (IRB00045892, CAAPA) and included study team members from each CAAPA site, including Emory University (site PI, Zhaohui Qin). Access to the raw data as CAAPA team members is granted according to the guidelines of the IRB-approved study. Informed consent has been obtained from all study participants of CAAPA.

Competing interests

The authors declare that they have no competing interests.

Funding

This study was supported by grants from the National Heart, Lung, and Blood Institute (R01HL104608, R01HL117004, R01HL128439, R01HL135156, X01HL134589); National Institute of Environmental Health Sciences (R01ES015794, R21ES24844); National Institute on Minority Health and Health Disparities (P60MD006902, R01MD010443, RL5GM118984); National Institute of Neurological Disorders and Stroke (R01NS051630, P01NS097206, U54NS091859); National Science Foundation (ACI 1443054, IIS 1350885); and the Tobacco-Related Disease Research Program (24RT-0025). The Genes-Environments and Admixture in Latino Americans (GALA II) Study, the Study of African Americans, Asthma, Genes and Environments (SAGE) Study, and E.G.B. are supported by the Sandler Family Foundation, the American Asthma Foundation, the RWJF Amos Medical Faculty Development Program, and the Harry Wm. and Diana V. Hind Distinguished Professor in Pharmaceutical Sciences II.

Authors' Contributions

J.G. introduced the problem. X.S. and F.W. initiated the project. X.S. designed and implemented the CloudMerge project. X.S. drafted the manuscript. X.S., J.P., F.W., and Z.Q. revised the manuscript. K.C.B. conceived the initial consortium design, acquired biospecimens for Next Generation Sequencing (NGS), and facilitated generation of NGS data. K.C.B., R.A.M., I.R., and T.H.B. conceived initial experiments and interpreted NGS data. E.G.B. and C.E. acquired biospecimens for NGS and facilitated generation of NGS data.

CAAPA Consortium Members and Their Affiliations
Hansel, MD, MPH , Edward Horowitz, BA , Lili Huang, among African-ancestry Population in the Americas; CPU: cen- 1 1,2 1 MPH , Rasika Ann Mathias, ScD , Romina Ortiz, MA , Joseph tral processing unit; GWAS: genome-wide association stud- 1 1 4 Potee, MS , Nicholas Rafaels, MS , Ingo Ruczin-ski, PhD ,Alan ies; HDFS: Hadoop distributed Systems; HPC: high-performance 1 4 F. Scott, PhD , Margaret A. Taub, PhD , Candelaria Ver-gara, computing; I/O: Input/Output; MPI: message passing interface; 1 5 6 PhD , Jingjing Gao, PhD , Yijuan Hu, PhD , Henry Richard John- NFS: network file system; NGS: next generation sequencing; 6 6 7 ston, PhD , Zhaohui S. Qin, PhD ,AlbertM.Levin,PhD , Badri RDD: resilient distributed dataset; SNP: single-nucleotide poly- 8 8,9 Padhukasahas-ram, PhD , L. Keoki Williams, MD, MPH ,Geor- morphism; VCF: variant call format; WGS: whole-genome se- 10,11 11 gia M. Dunston, PhD , Mezbah U. Faruque, MD, PhD , Eimear quencing. 12,13 14 E. Kenny, PhD , Kimberly Gi-etzen, PhD , Mark Hansen, 14 14 14 PhD , Rob Genuario, PhD , Dave Bullis, MBA ,Cindy Law- 14 15 15 ley, PhD , Aniket Deshpande, MS , Wendy E. Grus, PhD , Ethics approval and consent to participate 15 16 DevinP.Locke,PhD , Marilyn G. Foreman, MD , Pedro C. Avila, 17 17 18 MD , Leslie Grammer, MD , Kwang-Youn A. Kim, PhD ,Ra- Ethics approval for the CAAPA program was provided by the 19,20 21 jesh Kumar, MD , Robert Schleimer, PhD , Carlos Busta- Johns Hopkins University Institutional Review Board (IRB) fol- 12 12 lowing commencement of the study in 2011 (IRB00045892, mante, PhD ,FranciscoM.DeLaVega, DS , Chris R. Gignoux, 12 12 12 MS , Suyash S. Shringarpure, PhD , Shaila Musharo, MS , CAAPA ) and included study team members from each CAAPA 12 22,23 site, including Emory University (site PI, Zhaohui Qin). Access to Genevieve Wojcik, PhD , Esteban G. Burchard, MD, MPH ,Ce- 23 24 leste Eng, BS , Pierre-Antoine Gourraud, PhD , Ryan D. Hernan- the raw data as CAAPA team members is granted according to 22,25,26 24 23,27 the guideline of the IRB-approved study. Informed consent has dez, PhD , Antoine Lizee, PhD , Maria Pino-Yanes, PhD , 23 22 Dara G. Torgerson, PhD , Zachary A. Szpiech, PhD , Raul Torres, been obtained from all study participants of CAAPA. 28 29,30 31 BS , Dan L. Nicolae, PhD , Carole Ober, PhD , Christopher O 32 29 Olopade, MD, MPH , Olufunmilayo Olopade, MD , Oluwafemi Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy052/4995263 by Ed 'DeepDyve' Gillespie user on 21 June 2018 14 Sorted Merging using Distributed Systems 29 33 25 Oluwole, MSc , Ganiyu Arinola, PhD , Timothy D. O’Connor, Institute for Human Genetics, Institute for Human Genetics, 34,35,36 34,35,36 37 PhD ,Wei Song,PhD , Goncalo Abecasis, DPhil , University of California, San Francisco, San Francisco, CA, 38 38 26 Adolfo Correa, MD, MPH, PhD , Solomon Musani, PhD , James California Institute for Quantitative Biosciences, University of 39 40 41 G. Wilson, MD , Leslie A. Lange, PhD , Joshua Akey, PhD , California, San Francisco, San Francisco, CA, 42 42 27 Michael Bamshad, MD , Jessica Chong, PhD , Wenqing Fu, CIBER de Enfermedades Respiratorias, Instituto de Salud Car- 41 41 43 PhD , Deborah Nickerson, PhD , Alexander Reiner, MD, MSc , los III, Madrid, 44 44,45 28 Tina Hartert, MD, MPH , Lorraine B. 
Ware, MD , Eugene Biomedical Sciences Graduate Program, University of Califor- 46 46 46 Bleecker, MD , Deborah Meyers, PhD ,VictorE.Ortega, MD , nia, San Francisco, San Francisco, CA, 47 47 29 Pissamai Maul, BSc, RN ,TrevorMaul, RN , Harold Watson, Department of Medicine, University of Chicago, Chicago, IL, 48,49 50 30 MD , Maria Ilma Araujo, MD, PhD , Ricardo Riccio Oliveira, Department of Statistics, University of Chicago, Chicago, IL, 51 52 53 31 PhD , Luis Caraballo, MD, PhD , Javier Marrugo, MD , Beatriz Department of Human Genetics, University of Chicago, 52 52 54 Martinez, MSc , Catherine Meza, LB , Gerardo Ayestas ,Ed- Chicago, IL, 55,56,57 32 win Francisco Herrera-Paz, MD, MSc , Pamela Landaverde- Department of Medicine and Center for Global Health, Univer- 55 55 55 Torres , Said Omar Leiva Erazo , Rosella Martinez, BSc ,varo sity of Chicago, Chicago, IL, 56 55 33 Mayorga, MD ,LuisF.Mayorga,MD , Delmy-Aracely Mejia- Department of Chemical Pathology, University of Ibadan, 56,57 55 54 54 Mejia, MD , Hector Ramos , Allan Saenz , Gloria Varela , Ibadan, Nigeria, 57 58 34 Olga Marina Vasquez , Trevor Ferguson, MBBS, DM, MSc ,Jen- Institute for Genome Sciences, University of Maryland School nifer Knight-Madden, MBBS, PhD , Maureen Samms-Vaughan, of Medicine, Baltimore, MD, 59 58 35 MBBS, DM, Ph , Rainford J. Wilks, MBBS, DM, MSc ,AkimAdeg- Program in Personalized and Genomic Medicine, University of 60,61,62 60,61,62 nika, MD, PhD , Ulysse Ateba-Ngoa, MD ,Maria Yaz- Maryland School of Medicine, Baltimore, MD, 62 36 danbakhsh, PhD Department of Medicine, University of Maryland School of Department of Medicine, Johns Hopkins University, Baltimore, Medicine, Baltimore, MD, MD, Department of Biostatistics, SPH II, University of Michigan, Department of Epidemiology, Bloomberg School of Public Ann Arbor, MI, Health, JHU, Baltimore, MD, Department of Medicine, University of Mississippi Medical Department of Medicine, The Brooklyn Hospital Center, Brook- Center, Jackson, MS, lyn, NY, Department of Physiology and Biophysics, University of Mis- Department of Biostatistics, Bloomberg School of Public Health, sissippi Medical Center, Jackson, MS, JHU, Baltimore, MD, Department of Genetics, University of North Carolina, Chapel Data and Statistical Sciences, AbbVie, North Chicago, IL, Hill, NC, 6 41 Department of Biostatistics and Bioinformatics, Emory Univer- Department of Genomic Sciences, University of Washington, sity, Atlanta, GA, Seattle, WA, 7 42 Department of Public Health Sciences, Henry Ford Health Sys- Department of Pediatrics, University of Washington, Seattle, tem, Detroit, MI, WA, 8 43 Center for Health Policy & Health Services Research, Henry Ford University of Washington, Seattle, WA, Health System, Detroit, MI, Department of Medicine, Vanderbilt University, Nashville, TN, 9 45 Department of Internal Medicine, Henry Ford Health System, Department of Pathology, Microbiology and Immunology, Van- Detroit, MI, Department of Microbiology, Howard University derbilt University, Nashville, TN, College of Medicine, Washington, DC, Center for Human Genomics and Personalized Medicine, Wake National Human Genome Center, Howard University College Forest School of Medicine, Winston-Salem, NC, of Medicine, Washington, DC, Genetics and Epidemiology of Asthma in Barbados, The Uni- Department of Genetics, Stanford University School of versity of the West Indies, Medicine, Stanford, CA, Faculty of Medical Sciences Cave Hill Campus, The University Department of Genetics and Genomics, Icahn School of of 
the West Indies, Medicine at Mount Sinai, New York, Queen Elizabeth Hospital, Queen Elizabeth Hospital, The Uni- Illumina, Inc., San Diego, CA, versity of the West Indies, 15 50 Knome Inc., Cambridge, MA, Immunology Service, Universidade Federal da Bahia, Salvador, Pulmonary and Critical Care Medicine, Morehouse School of BA, Medicine, Atlanta, GA, Laboratrio de Patologia Experimental, Centro de Pesquisas Go- Department of Medicine, Northwestern University, Chicago, IL, nalo Moniz, Salvador, BA, 18 52 Department of Preventive Medicine, Northwestern University, Institute for Immunological Research, Universidad de Carta- Chicago, IL, gena, Cartagena, 19 53 Department of Pediatrics, Northwestern University, Chicago, Instituto de Investigaciones Immunologicas, Universidad de IL, Cartagena, Cartagena, 20 54 The Ann & Robert H. Lurie Children’s Hospital of Chicago, Faculty of Medicine, Universidad Nacional Autonoma de Hon- Chicago, IL, duras en el Valle de Sula, San Pedro Sula, 21 55 Department of Medicine, Northwestern Feinberg School of Facultad de Medicina, Universidad Catolica de Honduras, San Medicine, Chicago, IL, Pedro Sula, 22 56 Department of Bioengineering and Therapeutic Sciences, Uni- Centro de Neumologia y Alergias, San Pedro Sula, versity of California, San Francisco, San Francisco, CA, Faculty of Medicine, Centro Medico de la Familia, San Pedro Department of Medicine, University of California, San Fran- Sula, cisco, San Francisco, CA, Tropical Medicine Research Institute, The University of the Department of Neurology, University of California, San Fran- West Indies, cisco, San Francisco, CA, Department of Child Health, The University of the West Indies, Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy052/4995263 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Sun et al. 15 Centre de Recherches Mdicales de Lambarn, 13. Niemenmaa M, Kallio A, Schumacher A, et al. Hadoop-BAM: Institut fr Tropenmedizin, Universitt Tbingen, directly manipulating next generation sequencing data in Department of Parasitology, Leiden University Medical Center, the cloud. Bioinformatics 2012;28(6):876–7. Netherlands. 14. Langmead B, Schatz MC, Lin J, et al. Searching for SNPs with cloud computing. Genome Biol 2009;10(11):R134. 15. Wang S, Mares MA, Guo YK. CGDM: collaborative genomic Acknowledgements data model for molecular profiling data using NoSQL. Bioin- formatics 2016;32(23):3654–60. We thank the three referees for their constructive critiques and 16. Zhou W, Li R, Yuan S et al. MetaSpark: a spark-based dis- detailed comments. We are grateful to Mary Taylor Mann and tributed processing tool to recruit metagenomic reads to ref- Alyssa Leann Duck for their editorial help during writing and erence genomes. Bioinformatics 2017;33(7):1090–2. revising of the manuscript. E.G.B. acknowledges the following 17. Niu B, Zhu Z, Fu L, et al. FR-HIT, a very fast program to re- GALA II and SAGE co-investigators for subject recruitment, sam- cruit metagenomic reads to homologous reference genomes. ple processing, and quality control: Sandra Salazar, Scott Hunts- Bioinformatics 2011;27(12):1704–5. man, MSc, Donglei Hu, PhD, Lisa Caine, Shannon Thyne, MD, 18. AWS Genomics Guide. https://d0.awsstatic.com/Industries/ Harold J. Farber, MD, MSPH, Pedro C. Avila, MD, Denise Serebrisky, HCLS/Resources/AWS Genomics WP.pdf. Accessed 10 April MD, William Rodriguez-Cintron, MD, Jose R. Rodriguez-Santana, MD, Rajesh Kumar, MD, Luisa N. Borrell, DDS, PhD, Emerita 19. Gruber K. Google for genomes. 
Nature Biotechnology 2014, Brigino-Buenaventura, MD, Adam Davis, MA, MPH, Michael A. 32, 6, 508. LeNoir, MD, Kelley Meade, MD, Saunak Sen, PhD, and Fred Lur- 20. O’Brien AR, Saunders NF, Guo Y, et al. VariantSpark: pop- mann, MS. ulation scale clustering of genotype information. BMC Ge- nomics 2015;16(1):1052. 21. Wiewiorka ´ MS, Messina A, Pacholewska A, et al. SparkSeq: References fast, scalable and cloud-ready tool for the interactive ge- 1. Massie M, Nothaft F, Hartl C, et al. Adam: Genomics formats nomic data analysis with nucleotide precision. Bioinformat- ics 2014;30(18):2652–3. and processing patterns for cloud scale computing. Univer- sity of California, Berkeley Technical Report, No UCB/EECS- 22. Decap D, Reumers J, Herzeel C, et al. Halvade: scal- 2013. 2013;207. able sequence analysis with MapReduce. Bioinformatics 2. Siretskiy A, Sundqvist T, Voznesenskiy M, et al. A quan- 2015;31(15):2482–8. titative assessment of the hadoop framework for analyz- 23. He M, Person TN, Hebbring SJ, et al. SeqHBase: a big data ing massively parallel DNA sequencing data. GigaScience toolset for family based sequencing data analysis. J Med 2015;4(1):26. Genet 2015, 52, 282–288. 3. Merelli I, Per ´ ez-Sanc ´ hez H, Gesing S, et al. Managing, 24. Purcell S, Neale B, Todd-Brown K, et al. PLINK: a tool set analysing, and integrating big data in medical bioinformat- for whole-genome association and population-based linkage ics: open problems and future perspectives. BioMed Re- analyses. Am J Hum Genet 2007;81(3):559–75. search International 2014: 134023. 25. Mohammed EA, Far BH, Naugler C. Applications of the 4. Reyes-Ortiz JL, Oneto L, Anguita D. Big data analytics in the MapReduce programming framework to clinical big data cloud: Spark on hadoop vs mpi/openmp on beowulf. Proce- analysis: current landscape and future trends. BioData Min 2014;7:22. dia Computer Science 2015;53:121–30. 5. Burren OS, Guo H, Wallace C. VSEAMS: a pipeline for variant 26. Huang H, Tata S, Prill RJ. BlueSNP: R package for highly scal- set enrichment analysis using summary GWAS data identi- able genome-wide association studies using Hadoop clus- fies IKZF3, BATF and ESRRA as key transcription factors in ters. Bioinformatics 2013;29(1):135–6. type 1 diabetes. Bioinformatics 2014;30(23):3342–8. 27. White T. Hadoop: The Definitive Guide, Sebastopol, CA. 6. Apache Hadoop. http://hadoop.apache.org/. Accessed 10 “O’Reilly Media, Inc.”; 2012. April 2018. 28. Danecek P, Auton A, Abecasis G, et al. The variant call format 7. Dean J, Ghemawat S. Mapreduce: simplified data processing and VCFtools. Bioinformatics 2011;27(15):2156–8. on large clusters. Commun Acm 2008;51(1):107–13. 29. Multiway-Merge Algorithm. https://en.wikipedia.org/wiki/ 8. Vora MN. Hadoop-HBase for large-scale data. In: Computer K-Way Merge Algorithms. Accessed 10 April 2018. Science and Network Technology (ICCSNT), 2011 Interna- 30. Chang F, , Dean J, Ghemawat S, et al. Bigtable: a distributed tional Conference on. IEEE; 2011, pp. 601–5. storage system for structured data. Acm T Comput Syst 9. Zaharia M, Chowdhury M, Das T, et al. Resilient distributed 2008;26(2): 1365815–1365816. 31. Genomes Project C, Abecasis GR, Altshuler D, et al. A map datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference of human genome variation from population-scale sequenc- on Networked Systems Design and Implementation. USENIX ing. Nature 2010;467(7319):1061–73. Association; 2012, pp. 2. 32. 
Mathias RA, Taub MA, Gignoux CR, et al. A continuum of ad- 10. McKenna A, Hanna M, Banks E, et al. The Genome mixture in the Western Hemisphere revealed by the African Analysis Toolkit: a MapReduce framework for analyz- Diaspora genome. Nat Commun 2016;7:12522. ing next-generation DNA sequencing data. Genome Res 33. Kwon Y, Balazinska M, Howe B, et al. A study of skew in 2010;20(9):1297–303. Mapreduce applications. Open Cirrus Summit 2011;11. 11. Schatz MC. CloudBurst: highly sensitive read mapping with 34. O’Neil P, Cheng E, Gawlick D, et al. The Log-Structured MapReduce. Bioinformatics 2009;25(11):1363–9. Merge-tree (LSM-tree). Acta Informatica 1996;33(4):351–85. 12. Pireddu L, Leo S, Zanetti G. SEAL: a distributed short 35. CloudMerge. https://github.com/xsun28/CloudMerge., Last access: 05-25-2018. read mapping and duplicate removal tool. Bioinformatics 2011;27(15):2159–60. 36. Li H. Tabix: fast retrieval of sequence features from generic Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy052/4995263 by Ed 'DeepDyve' Gillespie user on 21 June 2018 16 Sorted Merging using Distributed Systems TAB-delimited files. Bioinformatics 2011; 27(5):718–9. 41. Miner D, Shook A. MapReduce Design Patterns: Building Ef- 37. Amdahl GM. Validity of the single processor approach to fective Algorithms and Analytics for Hadoop and Other Sys- achieving large scale computing capabilities. In: Proceedings tems, Sebastopol, CA. “O’Reilly Media, Inc.”; 2012. of the April 18–20, 1967, Spring Joint Computer Conference. 42. Chen L, Wang C, Qin ZS, et al. A novel statistical method ACM; 1967, pp. 483–5. for quantitative comparison of multiple ChIP-seq datasets. 38. Gustafson JL. Reevaluating Amdahl’s law. Commun Acm Bioinformatics 2015;31(12):1889–96. 1988;31(5):532–3. 43. Sun X, Gao J, Jin P, et al. Supporting data for “Optimized dis- 39. Sedgewick R, Flajolet P. An Introduction to the Analysis of tributed systems achieve significant performance improve- Algorithms, Boston, MA. Addison-Wesley; 2013. ment on sorted merging of massive VCF Files” GigaScience 40. Ozsu MT, Valduriez P. Principles of Distributed Database Sys- Database 2018. http://dx.doi.org/10.5524/100423. tems, Berlin/Heidelberg, Germany. Springer Science & Busi- ness Media; 2011. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy052/4995263 by Ed 'DeepDyve' Gillespie user on 21 June 2018 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png GigaScience Oxford University Press

The main reason is that these tools adopt manage and process such massive amounts of data becomes the multiway-merge-like method [29] with a priority queue as increasingly important for successful large-scale genetics stud- the underlying data structure to ensure the correct output or- ies [1–3]. Single-machine based methods are inefficient when der. Although such a method only requires one round of read processing such large amounts of data due to the prohibitive through of the input files, a key deficiency is that it can only computation time, Input/Output bottleneck, as well as central have one consumer access items from the data queue, which processing unit (CPU) and memory limitations. Traditional high- makes it sequential upon writing. This problem cannot be elim- performance computing (HPC) techniques based on message inated even if the multiway-merging is implemented as paral- passing interface (MPI)/OpenMP also suffer from limitations lel processes due to I/O saturation, workload imbalance among such as not allowing addition of computing nodes at runtime, computing units, and memory limits. Therefore, these single- shortage of a fault-tolerant and high available file system, and machine based tools are inefficient and time-consuming when inflexibility of customizing the computing environment without handling large datasets. administrator permission of a cluster [3, 4]. It becomes increas- In this study, we use the case of sorted-merging multiple VCF ingly attractive for investigators to take advantage of more pow- files to demonstrate the benefits of using Apache distributed erful distributed computing resources or the cloud to perform platforms. However, simply running sorted merging on such dis- data processing and analyses [3, 5]. The Apache Foundation has tributed systems runs into problems of bottlenecks, hot spots, been a leading force in this endeavor and has developed multiple and unordered results commonly seen in parallel computations. platforms and systems including Hadoop [6, 7], HBase [8], and Rather, we believe working schemas custom designed for each Spark [9]. All these three Apache platforms have gained substan- specific distributed platform are required to unleash their full tial popularity in recent years and have been endorsed and sup- potential. To overcome the limitations of single-machine, tradi- ported by major vendors such as Amazon Web Services (AWS). tional parallel/distributed, and simple Apache distributed sys- In bioinformatics, researchers have already started to em- tem based methods, we propose and implement three schemas brace Apache distributed systems to manage and process large running on Hadoop, Spark, and HBase, respectively. We chose amounts of high throughput “-omics” data. For example, the these three platforms because they represent cloud distributed Cancer Genome Atlas project makes use of the Hadoop frame- systems providing data partitioning based parallelism with dis- work to split genome data into chunks distributed over the clus- tributed storage, data partitioning based parallelism with in- ter for parallel processing [3, 10]. The CloudBurst [11], Seal [12], memory based processing, and high dimensional tables like dis- Hadoop-BAM [13], and Crossbow software [14] take advantage of tributed storage, respectively. 
Hadoop [6] is the open source the Hadoop framework to accelerate sequencing read mapping, implementation of MapReduce [7] based on the parallel key- aligning, and manipulations as well as single-nucleotide poly- value processing technique and has the advantage of trans- morphism (SNP) calling. The Collaborative Genomic Data Model parency and simplicity. HBase [8] is a data warehousing plat- [15] adopts HBase to boost the querying speed for the main form that adopts Google’s BigTable data storing structure [30] classes of queries on genomic databases. MetaSpark [16]uti- to achieve high efficiency in storing and reading/writing large lizes Spark’s distributed dataset to recruit large scales of metage- scales of sparse data. Spark [9] introduces the concept of re- nomics reads to reference genomes and achieves better scala- silient distributed dataset (RDD) and directed acyclic graph ex- bility and sensitivity than single-machine based programs [17]. ecution to parallel key-value processing, thus enabling fast, ro- Industry cloud computing vendors such as Amazon [18]and bust, and repetitive in-memory data manipulations. Specifically, Google [19] are also beginning to provide specialized environ- our schemas involve dividing the job into multiple phases corre- ments to ease genomics data processing in the cloud. sponding to tasks of loading, mapping, filtering, sampling, parti- Although numerous Apache cluster-based applications have tioning, shuffling, merging, and outputting. Within each phase, already been developed for processing and analyzing large-scale data and tasks are evenly distributed across the cluster, enabling genomics data including ADAM [1], VariantSpark [20], SparkSeq processing large amounts of data in a parallel and scalable man- [21], Halvade [22], and SeqHBase [23], among others, we believe ner, which in turn improves both speed and scalability. there are still many opportunities in biomedical data analyses to take advantage of distributed systems as the scale and scope of data become larger and more complex. A particular example Methods is sorted merging, which is a ubiquitous operation in processing Overview genetics and genomics data. As an example, in WGS, variants identified from individuals are often called and stored in sep- The benefits of using these three Apache distributed platforms arate variant call format (VCF) files. Eventually these VCF files to perform sorted merging are four-fold when compared to using Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy052/4995263 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Sun et al. 3 the multiway-merge method [29], a relational database based only qualified records with a “PASS” filter value are retained. approach, or an HPC framework. First, with genomic locations Second, genotypes in VCF files are stored in the form of allele as keys and genotypes as values, it is readily transformed into values, where 0 stands for the reference allele, 1 stands for the the key-value model in which all three platforms offer a rich first mutant allele, 2 stands for the second mutant allele, and so set of parallel operations. Second, data in VCF files are semi- on. Allele values must be translated into corresponding types of structured. This type of data is an ideal fit for the three platforms nucleotides in the TPED file. 
Third, all individuals need to have that allow defining the schema during data loading, avoiding a genotype for genomic locations appearing in at least one VCF the preprocessing of raw data into a rigid schema as in a rela- file. The default genotype for a missing value is a pair of ho- tional database. Third, all of these platforms provide built-in ef- mozygous reference alleles. The merging of multiple VCF files ficient task coordination, high fault tolerance, data availability, into a single VCF file follows the rules as follows: first, the ALT and locality, which are absent in the traditional HPC framework. and INFO columns of a genomic location in the merged file are Fourth, the merged results are directly saved onto a distributed set as the concatenated values of the corresponding columns on file system such as Hadoop distributed file systems (HDFS) or that location from all input files with duplicated values removed. Amazon S3, which can be directly used for subsequent cluster- Second, the QUAL column of a genomic location in the merged based GWAS or WGS analytical tools such as BlueSNP. file is set as a weight-averaged quality value of all individuals Despite these advantages, simply performing sorted merg- on that location. Third, a genomic location is kept only when it ing on these Apache distributed systems will not deliver the appears in at least one input file and has a FILTER column value expected results for the following reasons. First, it can lead to of “PASS.” Fourth, if an individual does not have allele values on globally unsorted results. Hash-based shuffling of input data is a genomic location in the input file, their missing allele values the default mechanism for distributing data to parallel work- are designated as “.” in the merged file. ing units in the system. However, shuffling will lead to globally For our Apache cluster-based schemas, the merging of mul- unsorted results. Second, bottlenecks and hot spots can hap- tiple VCF files into a single TPED file and the merging of multiple pen during the processing in the cluster. Bypassing the hash- VCF files into a single VCF file differ only in the value contents ing based shuffling can lead to unbalanced workloads across of the key-value pairs, so they should have the same scalabil- the cluster and result in straggling computing units that become ity property. Although we implement the applications of both bottlenecks for response time. In addition, for parallel loading of merging types using our Apache cluster-based schemas, which presorted data into HBase, data being loaded from all the loading are available on our project website, we focused our experiments tasks access the same node simultaneously, while other nodes on the merging of multiple VCF files into a single TPED file and may be idling, creating an I/O hot spot. Third, sampling costs only evaluate the execution speed of the merging of multiple could become prohibitive. Although Hadoop provides a built-in VCF files into a single VCF file with VCFTools as the benchmark. utility named total-order-merging [27] to achieve both workload balance and global order, it involves transferring to and sam- MapReduce (Hadoop) schema pling all the data on a single node. The communication costs This schema is built on Hadoop’s underlying model MapReduce over the network and disk I/O can be prohibitive when data size and running on Hadoop clusters. MapReduce [7]isaparallel becomes very large. 
In the following sections, we will illustrate computing model based on a split-apply-combine strategy for data how our custom designed schemas are able to overcome these analysis, in which data are mapped to key-values for splitting limitations in detail. (mapping), shuffling, and combining (reducing) for final results. We use Apache Hadoop-2.7 as the system for our implementa- Data formats and operations tion. Our optimized schema consists of two MapReduce phases, In a typical WGS experiment, data analysis often starts from in- as shown in Fig. 2 (the pseudocodes are shown in Supplemen- tary Fig. S1). dividual genotype files in the VCF format [ 31]. A VCF file con- tains data arranged into a table consisting of eight mandatory fields including chromosome (CHROM), the genomic coordinate First MapReduce phase of the start of the variant (POS), the reference allele (REF), and Raw data are loaded from HDFS into parallel mappers to per- a comma separated list of alternate alleles (ALT), among others. form the following tasks. First, unqualified data are filtered out In our experiments, we use a dataset consisting of the VCF files and qualified data are mapped to key-value pairs. The mapper of 186 individuals [32] generated from Illumina’s BaseSpace soft- output key is the genomic location and the output value is the ware (left tables in Fig. 1). Each VCF file has around 4–5 million genotype and individual ID. Second, key-value pairs are grouped rows, each row contains information on one of the individual’s together by chromosomes and temporarily saved as compressed genomic variants. Each VCF file is about 300 MB in size. In an Hadoop sequence files [ 33] for faster I/O in the second MapRe- attempt to protect privacy of study subjects, we apply the fol- duce phase. With this grouping, if SNPs of interest are located in lowing strategy to conceal their real genetic variant information a few selected chromosomes only, we can choose to just merge contained in the VCF files: we first transform each original ge- records from these selected chromosomes rather than from all nomic location by multiplying it with an undisclosed constant chromosomes. Meanwhile, these records are sampled to explore real number, taking the floor integer of the result, and then add their distribution profile of keys along chromosomes to deter- another undisclosed constant integer number. mine boundaries. The boundaries are determined so there is an It is common that multiple VCF files need to be merged into approximately equal number of records within each segment. a single TPED file for analysis tools such as PLINK. A TPED file re- Because all records falling in the same segment will be assigned sembles a big table, aggregating genotypes of all individuals un- to the same reducer in a later phase, boundaries calculated in der investigation by genomic locations (right table in Fig. 1). The this way ensure the workload of each reducer is balanced. There merging follows several rules. First, each record is associated are two rounds of samplings. The first one happens in each map- with a data quality value in the FILTER column, which records per with a pre-specified sampling rate, which in our case is set to the status of this genomic position passing all filters. Usually be 0.0001. 
Sampled records are then separated and distributed to Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy052/4995263 by Ed 'DeepDyve' Gillespie user on 21 June 2018 4 Sorted Merging using Distributed Systems Figure 1: Merging multiple VCF files into a single TPED file. Left tables represent input VCF files. The table on the right represents the merged TPED file. Records ar e filtered out if their Filter value is not equal to “PASS” (Pos 10 147). Individual genotypes from multiple VCF files with the same genomic location are agg regated together in one row. The resulting TPED file thus has an inclusive set of sorted genomic locations of all variants found in the input VCF files. Figure 2: The workflow chart of the MapReduce schema. The workflow is divided into two phases. In the first phase, variants are filtered, grouped by chromosomes into bins, and mapped into key-value records. Two sampling steps are implemented to generate partition lists of all chromosomes. In the second phase, parallel jobs of specified chromosomes are launched. Within each job, records from corresponding bins are loaded, partitioned, sorted, and merged by genomic locat ions before being saved into a TPED file. different reducers in this phase by chromosomes, where they are nomic locations before saving them to a TPED file. In this way, sampled again with a rate equal to the reciprocal of the number globally sorted merging can be fulfilled. of input files. This second sampling effectively limits the num- ber of final sampled records even in the face of a very large num- ber of input files. Because the number of reducers instantiated in the second phase equals the number of boundaries, which HBase schema in turn is decided by the number of sampled records, we can therefore avoid launching unnecessary reducers, thus minimiz- HBase [8] is a column-oriented database where data are grouped into column families and split horizontally into regions spread- ing task overheads. ing across the cluster. This data storing structure supports effi- cient sequential reading and writing of large-scale data as well Second MapReduce phase as fast random data accessing. Also, HBase is storage efficient In this phase, multiple parallel MapReduce jobs are created, one because it can remember null values without saving them on for each chromosome, to handle all the records in sequence files disk. These features make HBase an ideal platform for manag- generated from the first phase. Within each job, a partitioner ing large, sparse data with relatively low latency, which naturally redirects records to the appropriate reducer by referring to the fits the sorted merging case. We use HBase-1.3 as the system for splitting boundaries from the previous phase, so records falling our implementation. As shown in Fig. 3, our optimized HBase in between the same pair of boundaries are aggregated together. schema is divided into three phases as discussed next (refer to Finally, each reducer sorts and merges aggregated records by ge- Supplementary Fig. S2 for pseudocodes). Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy052/4995263 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Sun et al. 5 Figure 3: The workflow chart of the HBase schema. The workflow is divided into three phases. The first is a sampling, filtering, and mapping phase. A MapReduce job samples out variants whose genomic positions are used as region boundaries when creating the HBase table. 
Only qualified records are mapped as key-val ues and saved as Hadoop sequence files. The second is the HBase bulk loading phase in which a MapReduce job loads and writes records generated from the previous p hase, aggregating them into corresponding regional HFiles in the form of HBase’s row key and column families. Finished HFiles are moved into HBase data storage folders on region servers. In the third phase, parallel scans were launched over regions of the whole table to retrieve desired records that are subsequently merged and exported to the TPED file. Sampling phase via HBase servers’ I/O routines. The order of records in the ta- The main challenge of HBase is to avoid computational hot spots ble is guaranteed because they are internally sorted by writing in the cluster, which can happen when it starts loading a table reducers and HBase’s Log-Structured Merge-tree [34]. It is worth from a single region hosted by a single node. Therefore, we need mentioning that VCF records are always sparse, thus HBase is to presplit the table into regions of approximately equal size be- very storage efficient. fore loading. The sampling phase is introduced to determine reasonable presplitting regional boundaries. The total number Exporting phase of regions is set to half of the number of input files so the size A scan of a specified genomic window is performed on the table. of each region is approximately 1 GB. Meanwhile, mappers of It involves launching parallel mappers each receiving records this phase also save qualified records as compressed Hadoop se- from a single HBase region, filling in missing genotypes, con- quence files on HDFS that are used as inputs in the next phase. catenating records with the same row-key, and outputting final In addition, filtering and key-value mapping also take place in results into TPED files. this phase. Spark schema Bulk loading phase Even when the table has been presplit evenly, the hot spot prob- Spark [9] is a distributed engine built upon the ideas of MapRe- lem of loading sorted inputs can still emerge because sorted duce and RDD. It can save intermediate results in the form of records are loaded sequentially and, at any instant, they still RDD in memory and perform computations on them. Also, its access the same region and server. During the bulk loading, computations are lazily evaluated, which means the execution the key and value of each record produced from the previous plan can be optimized to include as many computational steps phase is converted into HBase’s binary row-key and column- as possible. As a result, it is ideal for iterative computations value, respectively, and saved into an HFile, HBase’s native stor- such as sorted merging. We implement our optimized Spark age format. The row-key here is in the form of chromosome- schema on Spark-2.1. It has three stages, which we describe be- genomic location, and column-value refers to reference allele, low and present in Fig. 4 (refer to Supplementary Fig. S3 for pseu- individual ID, and genotype. The bulk loading populates each docodes). HFile with records falling in the same pair of presplit regional boundaries. Because HFiles are written simultaneously by par- allel mappers/reducers, all working nodes are actively involved, RDD preprocessing stage and the regional hotspot is thus circumvented. 
Upon finishing This stage involves loading raw data as RDDs, filtering, and map- writing, the HBase can readily load HFiles in parallel into the ping RDDs to paired-RDDs with keys (chromosome and genomic table by simply moving them into local HBase storage folders. position) and values (reference allele, sample ID, and genotype). This procedure is therefore at least an order of magnitude faster This stage ends with a sorting-by-key action that extends to the than the normal loading in which data are loaded sequentially next stage. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy052/4995263 by Ed 'DeepDyve' Gillespie user on 21 June 2018 6 Sorted Merging using Distributed Systems Figure 4: The workflow chart of the Spark schema. The workflow is divided into three stages. In the first stage, VCF records are loaded, filtered, and mapped to pairRD Ds with keys of genomic position and values of genotype. The sort-by-key shuffling spans across the first two stages, sorting and grouping together record s by keys. Then, grouped records with the same key are locally merged into one record in TPED format. Finally, merged records are exported to the TPED file. Sorting and merging stage multiway-merge because many existing bioinformatics tools, in- The sort-by-key shuffling operation repartitions and sorts Pair- cluding VCFTools and PLINK, adopt it as the underlying algo- RDD records so records with the same key are aggregated to- rithm for sorted merging. Multiway-merge is highly efficient on gether, which are then merged into the TPED format and con- a single machine as it requires only one scan of sorted input files, verted back to RDD records for outputting. However, Spark’s na- so it can theoretically run at the speed of disk I/O. tive family of group-by-key functions for merging should not Generally, there are two types of parallelism—data paral- be used here because their default partitioner is hash based lelism and task parallelism. The former splits data horizontally and different from the range-based partitioner used by previous into blocks of roughly equal sizes (the size of genomic intervals sort-by-key function. Consequently, the merged results would in our case) before assigning them to all available processes; the be reshuffled into an unsorted status. We therefore optimize the latter assigns a roughly equal number of input files to each pro- merging to bypass these functions so merging can be performed cess. For parallel multiway-merge, we chose data parallelism locally without data reshuffling to ensure both order and high because the implementation of task parallelism would be the speed. same as the HPC-based implementation running on a single node. Perhaps the most difficult part of data parallelism is un- certainty about the data distribution across all input files, which Exporting stage usually leads to the problem of workload imbalance among pro- In this stage, merged RDD records are saved as TPED files on cesses. If we pre-sample all the input files to estimate the record HDFS. distribution, then a full scan of the input files is required that Execution parallelism has an important impact on the perfor- will almost certainly take more time than the single-process mance. To maximize performance, the number of parallel tasks multiway-merge method. As a compromise, we assume the dis- is set to be the number of input files. 
In this way, data local- tributions of SNP locations in all VCF files are uniform and the ity is maximized, and each task is assigned a proper amount of input files can be split into regions of approximately equal sizes. work. In addition, unlike using MapReduce or HBase, when per- The total number of regions is set to be the number of concur- forming sorting by keys, no explicit sampling is needed because rent processes, so that each region is specifically handled by a Spark keeps track of the number of records before determining process. To avoid seeking of a process’s file reader to its starting repartition boundaries. offset from the beginning of the file, we take advantage of the Tabix indexer [36], which builds indices on data blocks of the in- put file and place the reader’s pointer directly onto the desired Parallel multiway-merge and MPI-based HPC offset. One important aspect of the Tabix indexer is that it re- implementations quires the input file to be compressed in bgzip format, which For most bioinformatics researchers, their daily working en- is not supported by Hadoop, HBase, or Spark. The compression vironment is still traditional in-house HPC clusters or stand- and decompression of a file in bgzip format can be much faster alone powerful servers (with cores ≥16 and memory ≥200 GB) than in the bz2 format used in our cluster-based schemas, sin- rather than heterogeneous cloud-based clusters. Therefore, we gle multiway-merge, and HPC-based implementations, so par- also implement a parallel multiway-merge program running on allel multiway-merge can run much faster than other meth- a single machine and an MPI-based (mpi4py v3.0) “single pro- ods/schemas when input data size is small. gram, multiple data” program running on an HPC cluster as For the HPC-based implementation, we adopt the task paral- benchmarks. The source codes are available at our GitHub web- lelism (Fig. 5) to avoid sampling and workload imbalance. Oth- site [35] (CloudMerge; RRID:SCR 016051). We chose to implement Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy052/4995263 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Sun et al. 7 Figure 5: The execution plan of the HPC-based implementation. The execution plan resembles a branched tree. In the first round, each process is assigned an appro x- imately equal number of files to merge locally. In the second round, the even-numbered process retrieves the merged file of its right adjacent process to merge with its local merged file. In the third round, processes with ID numbers that can be fully divided by 4 retrieve the merged file of its right adjacent process in the second round and do the merging. This process continues recursively until all files are merged into a single TPED file (round four). erwise, the workflow of HPC-based implementation is the same For the strong scalability test, we fix the number of input as that of the MapReduce-based schema with the same oper- files at 93 and increase the computing resources up to 16-fold ations and the same order: sampling in parallel, dividing the from the baseline. The baseline is a single node (four cores) for dataset into splits of equal sizes, and assigning the splits to pro- all methods/schemas except for the parallel multiway-merge in cesses to perform the merging. However, this implementation is which only a single core is used because it can only run on a without data locality offered by HDFS and task coordination of- single machine. 
For the weak scalability test, we increase both fered by YARN and thus has a performance no better than the computing resources and input data size at the same pace. The MapReduce-based schema. Specifically, input files are shared ratio is 10 file/core for parallel multiway-merge and 10 file/node across all nodes in the cluster via a network file system (NFS). for all others. In the first round, each core/process fetches roughly the same number of files from the NFS and performs multiway-merging Results locally. In the following rounds, we adopted a tree-structured execution strategy. In the second round, processes with even ID We conducted experiments of Apache cluster-based schemas numbers (process ID starts from 0) retrieve the merged file from using Amazon’s Elastic MapReduce service and experiments TM its adjacent process to the right, which is then merged with its of the HPC-based implementation using MIT’s StarCluster local merged file. Processes with odd ID numbers are terminated. toolkit, which launches an AWS openMP virtual private cluster. In the third round, processes with ID numbers divisible by 4 re- Within both infrastructures, we choose EC2 working nodes of trieve the merged file from its adjacent process to the right in m3.xlarge type, which has four high-frequency Intel Xeon E5- the second round to merge with its local merged file. This pro- 2670 v2 (Ivy Bridge) processors and 15 GB ofmemory. We con- cess continues until all the files are merged into a single file for ducted experiments of parallel multiway-merge on a single EC2 a total of log(n) rounds, where n is the number of the input files. r4.8xlarge instance with 32 high-frequency Intel Xeon E5–2686 v4 (Broadwell) processors and 244 GB memory. We used a dataset Strong and weak scalabilities consisting of 186 VCF files [ 32] generated from Illumina’s BaseS- pace software. In this study, we quantify scalability by measuring computing efficiency in tests of strong and weak scalabilities. We define ef- Overall performance analysis of cluster-based schemas ficiency as the average time cost of processing a file per core: Our primary goal is to explore the scalabilities of the three schemas on input data size and available computing resources, Efficiency = (T ∗C /N ) / (T ∗C /N ) b b b i i i namely, CPUs. To achieve this, in this experiment we adjusted the number of input files from 10 to 186, with an approximate where T is the baseline running time, C is the baseline num- total uncompressed size from 2.5 GB to 40 GB, and used a varying b b ber of cores, N is the baseline number of input files, T is the number of working nodes from 3 to 18, namely, 12 to 72 cores. current running time, C is the current number of cores, N is As Fig. 6 shows, for all three schemas and given a fixed num- i i the current number of input files. We also incorporated the par- ber of cores, the execution time increases at a slower pace than allel multiway-merge and MPI-based HPC implementations as that of the input data size. On the one hand, the increase of exe- benchmarks in the tests. cution time is more obvious with fewer cores because each core Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy052/4995263 by Ed 'DeepDyve' Gillespie user on 21 June 2018 8 Sorted Merging using Distributed Systems (a) (b) (c) Figure 6: The scalability of Apache cluster-based schemas on input data size. 
As the number of input files increases from 10 to 186, the time costs of all three sche mas with 12, 24, or 72 cores increase at a slower pace than that of the input data size, especially when the number of cores is relatively large. The HBase schema with 12 cores has the largest increase (from 375 to 5479 seconds, ∼14.6 fold). is fully utilized. As the number of input files increases, so does indexer to index data blocks of input files. While reading, each the number of parallel tasks assigned to each core. For example, process needs to maintain a full copy of file descriptors, in- given 12 cores, as the number of input files increases from 10 dices, and uncompressed current data blocks of all input files to 186 (18.6 fold), the execution time increases from 739 to 4366 in memory. When both the number of processes and input files seconds (∼5.9 fold) for the MapReduce schema, from 375 to 5479 are large, great pressure is placed on memory management. For seconds (∼14.6 fold) for the HBase schema, and from 361 to 1699 instance, a test with 93 files and 16 processes requires more seconds (∼4.7 fold) for the Spark schema. On the other hand, than 100 GB of memory, which results in a very long memory with relatively more cores such as 72, this linear increasing trend swap and garbage collection time. In contrast, the MapReduce- is less pronounced because there are more cores than tasks so based schema has the best efficiency. Surprisingly, its efficiency that all cores are assigned at most one task. We also notice when even improves when the number of cores doubles from the base- input data size is small or moderate, the Spark schema does not line. This is because it has many parallel tasks in its second always show a consistent improvement in terms of execution MapReduce phase; when the core allowance is low, the over- time with more cores. This is reflected, e.g., in the intersection heads of repetitive task launching and terminating on a sin- of curves that occurred between 24 and 72 cores in Fig. 6c. This gle core become non-negligible. Consequently, as the number phenomenon is attributed to the limitation of Spark’s internal of cores starts to increase, the actual proportion of overheads task assignment policy, which gives rise to the possibility that in the total running time decreases, leading to an improved ef- some nodes are assigned more than one tasks while others re- ficiency. Nonetheless, as the number of cores further increases, main idle. the unparalleled parts of the schema gradually dominate the to- tal running time, leading to a reduced efficiency eventually. For the weak scalability test (Fig. 8), following Gustafson’s law Comparing strong and weak scalabilities between [38], all methods/schemas show a much better efficiency than Apache cluster-based schemas and traditional parallel in the strong scalability test. Meanwhile, for the same reasons methods as the strong scalability, the MapReduce-based schema enjoys the best efficiency while the HPC-based implementation has the Figure 7 shows the results of the strong scalability. In accor- dance with Amdahl’s law [37], all schemas/methods show de- worst. This is because for the HPC-based implementation, as the number of input files increases, the total number of merg- graded efficiency with increasing computing nodes/cores. Par- allel multiway-merge has the steepest degradation because the ing rounds also increases, leading to a significantly reduced effi- ciency. 
Finally, all three Apache cluster-based schemas demon- more parallel processes, the higher likelihood of workload im- balances among them. In addition, disk I/O reaches saturation as strate significantly better weak scalability than the two tradi- tional parallel methods. more processes write simultaneously. Furthermore, to achieve data parallelism and improve execution speed, we used Tabix Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy052/4995263 by Ed 'DeepDyve' Gillespie user on 21 June 2018 Sun et al. 9 Figure 7: Comparing the strong scalability between traditional parallel/distributed methods and Apache cluster-based schemas. We fix the number of files at 93 and increase the number of nodes/cores. The baseline for the parallel multiway-merge is one single core, while for the others it is one single node (four cores). All methods/schemas show a degraded efficiency as computing resources increase 16 fold from the baseline. Specifically, the efficiency of MapReduce-, HBa se-, and Spark-based schemas drops to 0.83, 0.63, and 0.61, respectively, while the efficiency of parallel multiway-merge and HPC-based implementations dro ps to 0.06 and 0.53, respectively. Figure 8: Comparing the weak scalability between traditional parallel/distributed methods and Apache cluster-based schemas. We simultaneously increase the num- ber of cores and input data sizes while fixing the ratio of file/core (parallel multiway-merge) or file/node (all others) at 10. The baseline is the same as in the test of strong scalability. All but the MapReduce-based schema have degraded efficiency, among which the HPC-based implementation has the steepest degrada tion. Specif- ically, when computing resource increases 16 fold from the baseline, the efficiency of MapReduce-, HBase-, and Spark-based schemas changes to 3.1, 0. 87, and 0.75, respectively, and for parallel multiway-merge and HPC-based implementations, the efficiency reduces to 0.42 and 0.35, respectively. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/6/giy052/4995263 by Ed 'DeepDyve' Gillespie user on 21 June 2018 10 Sorted Merging using Distributed Systems (a) (b) (c) Figure 9: The performance anatomy of cluster-based schemas on increasing input data size. The number of cores in these experiments is fixed at 48. Time costs of al l phases of the three schemas have a linear or sublinear correlation with the input data size. (a) MapReduce schema: The two MapReduce phases have a comparable time cost, increasing 6.3- and 3.1-fold, respectively, as the number of input files increases from 10 to 186. (b) HBase schema: The time spent in each pha se increases 4.2-, 5.6-,and 5.0-fold, respectively, as the number of input files increases from 10 to 186. The bulk loading and exporting phases together take up mor e than 80% of total time expense. (c) Spark schema: The time cost increases 5.8-, 6.0-,and 6.0-fold, respectively, for the three stages as the number of input files i ncreases from 10 to 186 files. Like the HBase schema, the first two stages of the Spark schema together account for more than 80% of the total time cost. Anatomic performance analysis of Apache of the first stage, data are sampled to determine the boundaries used by sort-by-key’s range partitioner. This operation demands cluster-based schemas a considerable execution time because it scans all the data and Another important goal of our study is to identify potential per- balances them if necessary. 
Anatomic performance analysis of Apache cluster-based schemas

Another important goal of our study is to identify potential performance bottlenecks, so we evaluate the execution time of each phase/stage of all three schemas. Figure 9 shows the trends of the anatomic computing time spent on merging an increasing number of VCF files (from 10 to 186) using 48 cores. For the MapReduce schema (Fig. 9a), its two phases account for comparable proportions of the total time and both show linear or sublinear scalability. The time cost of the first phase remains flat between 40 and 93 input files because both runs use two rounds of mappers; as the number of files doubles to 186, four rounds of mappers are required, which results in about a 2-fold increase in time cost, as expected. The three phases of the HBase schema (Fig. 9b) are likewise scalable with the input data size. Meanwhile, the second phase becomes more dominant with more input files owing to the larger amount of shuffled data during the writing of HFiles. However, we do not consider it a bottleneck, since all tasks of this phase are parallelized with no workload or computational hot spot, and we do not observe a super-linear (relative to input data size) increase in the figure. Finally, Fig. 9c shows the time costs of the three stages of the Spark schema. They show a uniform increasing trend with the number of input files. Among them, the second stage takes up a considerable proportion of the total execution time because it performs a relatively expensive sort-by-key shuffling operation. Although no data are shuffled in the first stage, its time cost is close to that of the second stage. This is because at the end of the first stage, the data are sampled to determine the boundaries used by sort-by-key's range partitioner, an operation that demands considerable execution time because it scans all the data and balances them if necessary.

Figure 9: The performance anatomy of cluster-based schemas on increasing input data size. The number of cores in these experiments is fixed at 48. The time costs of all phases of the three schemas have a linear or sublinear correlation with the input data size. (a) MapReduce schema: the two MapReduce phases have comparable time costs, increasing 6.3- and 3.1-fold, respectively, as the number of input files increases from 10 to 186. (b) HBase schema: the time spent in each phase increases 4.2-, 5.6-, and 5.0-fold, respectively, as the number of input files increases from 10 to 186. The bulk loading and exporting phases together take up more than 80% of the total time expense. (c) Spark schema: the time costs of the three stages increase 5.8-, 6.0-, and 6.0-fold, respectively, as the number of input files increases from 10 to 186. Like the HBase schema, the first two stages of the Spark schema together account for more than 80% of the total time cost.
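For readers less familiar with Spark, the following minimal PySpark sketch shows the generic sort-by-key pattern discussed above. It is not the CloudMerge Spark stage itself; the HDFS paths, the parsing, and the partition count are hypothetical. The point is that sortByKey triggers a shuffle whose range partitioner first samples the keys to choose partition boundaries, which is the extra data scan noted above.

    # Minimal PySpark illustration of sorting variant records by (chromosome, position).
    # Not the CloudMerge implementation; paths and parsing are hypothetical.
    from pyspark import SparkContext

    sc = SparkContext(appName="sorted-merge-sketch")

    def to_key_value(line):
        fields = line.split("\t")
        # Key = (chrom, pos); chromosomes compare lexicographically in this sketch.
        return ((fields[0], int(fields[1])), line)

    variants = (sc.textFile("hdfs:///vcf_input/*.vcf")      # hypothetical input directory
                  .filter(lambda line: not line.startswith("#"))
                  .map(to_key_value))

    # sortByKey shuffles the data; its range partitioner samples the keys beforehand
    # to pick partition boundaries, which is the data-scanning cost discussed above.
    merged = variants.sortByKey(numPartitions=48)

    merged.values().saveAsTextFile("hdfs:///merged_output")  # hypothetical output path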
Given that no super-linear increasing trend is observed in the running time of any phase/stage of the three schemas, and that they generally scale well with the input data size, we conclude that although the performance of these schemas might degrade to some extent when dealing with even larger input data, owing to overheads such as data transmission over the network, we would not expect any significant bottleneck.

Comparing execution speed between Apache cluster-based schemas and traditional methods

Another intriguing question is: how does the speed of the Apache cluster-based schemas compare to single-machine based and traditional parallel/distributed methods/applications when merging multiple VCF files into a single VCF or TPED file? To answer this question, we choose the widely used VCFTools (v4.2) and a single-process multiway-merge implementation as single-process benchmarks, and the parallel multiway-merge and HPC-based implementations as parallel/distributed benchmarks; these are the same benchmarks used in the strong and weak scalability experiments described above.

In the first experiment, we merged 40 VCF files into 1 VCF file using VCFTools as the benchmark. As shown in Table 1, VCFTools takes 30,189 seconds, while the fastest Apache cluster-based schema among the three, the MapReduce-based one, takes only 484 seconds using 72 cores, being about 62-fold faster.

Table 1: Performance comparisons between VCFTools and Apache cluster-based schemas

                        VCFTools    MapReduce    HBase    Spark
Time cost (seconds)     30,189      484          577      596
Fold (faster)           -           62.4         52.3     50.7

Table 2: Pros and cons of MapReduce, HBase, and Spark schemas

MapReduce
  Pros:
  • Good for large input data size and sufficient computing resources
  • Simple architecture and least overheads given sufficient computing resources
  • Best parallelism
  • Good for one-time merging
  • Performance is stable
  Cons:
  • Merging is not incremental
  • Large overhead when computing resources are limited

HBase
  Pros:
  • Good for intermediate input data size (≥20 and ≤100 VCF files)
  • Supports incremental merging
  • Supports on-line analytical processing
  • Best storage efficiency
  Cons:
  • Users must determine the region number in advance
  • Has the most local I/O
  • Complex performance tuning

Spark
  Pros:
  • Good for large input data size (>100 VCF files) and relatively limited computing resources
  • Keeps intermediate results in memory and has the least local I/O
  • Good for subsequent statistical analysis on merged results
  Cons:
  • Possibly weakened data locality during loading
  • Slightly unstable performance when computing resources exceed the needs of the input data size
  • Actual execution plan is not transparent
  • Complex performance tuning
In the second experiment (Fig. 10), we tested the time costs of merging multiple VCF files into a single TPED file, using the single/parallel multiway-merge and HPC-based implementations as benchmarks. The single multiway-merger is run on a node with a hardware configuration (four cores and 15 GB of memory) identical to the nodes on which the Apache cluster-based schemas are run. The parallel multiway-merger is run on a node with a maximum of 18 simultaneously running processes. The HPC-based implementation is run on an 18-node cluster with the same hardware configuration as the cluster on which the Apache cluster-based schemas are run. Initially, with 10 input files, the parallel multiway-merge (∼30 seconds) is much faster than all the other methods; it is about 7.3-fold faster than the fastest Apache cluster-based schema (MapReduce, 221 seconds). On the other hand, the slowest method is the single-process multiway-merger, which takes 620 seconds to finish (about 2.8-fold slower than the MapReduce-based schema). It is worth mentioning that in this test the parallel multiway-merge is essentially the same as the single-process multiway-merge, and that the speed difference (∼378 seconds) between them is the result of a different compression format (bz2 vs bgzip) of the input files, as explained above. As we gradually increase the number of input files to 186, the difference in speed between the fastest overall method (parallel multiway-merger, 602 seconds) and the fastest Apache cluster-based schema (MapReduce, 809 seconds) decreases to about 1.3-fold, while the difference between the slowest overall method (single multiway-merger, 13,219 seconds) and the MapReduce-based schema increases to 16.3-fold. In addition, all three Apache schemas significantly outperform the HPC-based implementation. As explained in the strong and weak scalability section above, we expect that the larger the input data size, the faster the Apache cluster-based schemas would run compared with the other traditional methods.

Figure 10: Execution speed comparison among Apache cluster-based schemas and traditional methods. First, we compare the speeds of the three Apache schemas with those of three traditional methods: single-process multiway-merge, parallel multiway-merge, and HPC-based implementations. As the number of input files increases from 10 to 186, the speeds of the Apache cluster-based schemas improve much more significantly than those of the traditional methods. The numbers in the figures indicate the ratio of the time cost of each traditional method to that of the fastest Apache cluster-based schema. Second, we compare the processing speed among the three Apache cluster-based schemas, which are comparable to each other regardless of the input data size. The MapReduce schema performs best in merging 10 and 186 files; the HBase schema performs best in merging 20, 40, and 60 files; and the Spark schema performs best in merging 93 files.

We also compare the time costs among the three schemas (Fig. 10). They have comparable speeds. More specifically, the MapReduce schema performs best if enough cores are available and the input data size is large; the HBase schema performs best with moderate input data size; and the Spark schema performs best if only a limited number of cores are available and the input data size is large. The rationale behind this observation is that, when the number of cores is sufficient, the MapReduce-based schema can make the most of the available computing resources because it runs a constant 25 parallel jobs (one for each of chromosomes 1–22, X, Y, and M [mitochondria]) in its second phase. In contrast, the Spark-based schema has fewer tasks, whose number equals the number of input files, in order to achieve maximum data-task locality. When the input data size is moderate, the HBase schema triumphs because of its internal sorting and relatively compact storage format for intermediate data. When the input data size is large and computing resources are relatively limited, the Spark-based schema outperforms the other two owing to its small number of data shuffles (only one), execution plan optimization, and ability to cache intermediate results in memory. We caution, however, that the computing time may fluctuate depending on the distribution of genomic locations in the input files as well as the data loading balance of the HDFS.
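The fixed per-chromosome parallelism of the second MapReduce phase can be pictured with a small single-machine analogy. The sketch below, which uses Python's multiprocessing module, only illustrates the divide-by-chromosome, conquer-in-parallel idea; it is not the Hadoop implementation, and the record format is hypothetical.

    # Single-machine analogy of the per-chromosome divide-and-conquer described above
    # (illustration only; the schema runs one parallel job per chromosome on the cluster).
    from collections import defaultdict
    from multiprocessing import Pool

    CHROMOSOMES = [str(i) for i in range(1, 23)] + ["X", "Y", "M"]   # 25 fixed partitions

    def sort_by_position(records):
        # "Conquer" step: sort one chromosome's records by position.
        return sorted(records, key=lambda rec: rec[1])

    def merge_by_chromosome(records):
        """records: iterable of (chrom, pos, payload) tuples pooled from all input files."""
        buckets = defaultdict(list)
        for rec in records:
            buckets[rec[0]].append(rec)                               # "divide" by chromosome
        with Pool(processes=len(CHROMOSOMES)) as pool:
            sorted_buckets = pool.map(sort_by_position,
                                      [buckets[c] for c in CHROMOSOMES])
        # Concatenating the independently sorted buckets in chromosome order yields a
        # fully sorted result, since no position spans two chromosomes.
        return [rec for bucket in sorted_buckets for rec in bucket]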
Discussion

In this report, we describe three cluster-based schemas running on the Apache Hadoop (MapReduce), HBase, and Spark platforms for performing sorted merging of variants identified from WGS. We show that all three schemas are scalable in both input data size and computing resources, suggesting that large-scale "-omics" data can be merged efficiently given the computing resources readily available in the cloud. Furthermore, the three schemas show better strong and weak scalabilities than traditional single machine-based parallel multiway-merge and cluster-based HPC methods, owing to the absence of an I/O bottleneck, better workload balance among nodes, and less pressure on memory, as well as the data locality and efficient task coordination mechanisms provided by HDFS and YARN. We also show that even with a moderate-sized cluster and input data, all three schemas significantly outperform the broadly used, single-machine based VCFTools, single-process multiway-merge, and HPC-based implementations. Although the parallel multiway-merge implementation is initially much faster than the Apache schemas owing to its advantages of local I/O and lighter compression of the input files, its poor scalability diminishes this initial advantage as the number of concurrent processes and input files increases. Consequently, we expect the Apache cluster-based schemas to eventually outperform the parallel multiway-merge when merging a much larger scale of data using a larger number of cores.

Unlike normal merging, efficient sorted merging of many large tables has always been a difficult problem in the field of data management. Multiway-merge is the most efficient single-machine based method for sorted merging, but its performance is limited by disk I/O [39]. Sorted merging also poses challenges to distributed system-based solutions because neither efficient hash-based merging nor caching the intermediate table in shared memory is feasible [40]. Although Hadoop provides a utility named total-order-joining to address this problem, it suffers from both network communication and local disk I/O bottlenecks and thus is not scalable [27, 41]. In contrast, our schemas divide this problem into different phases/stages of tasks, each conquered in parallel, to bypass these bottlenecks and achieve maximum parallelism.

Furthermore, in addition to merging sequencing variant data, these schemas can be generalized to other key-based, sorted merging problems that are frequently encountered in genetics and genomics data processing. As an example, they can be slightly modified to merge multiple BED format files, such as ChIP-seq peak lists [42] and other genomic regions of interest. Other potentially useful features include the following: unlike traditional sorted merging algorithms, which usually require presorted inputs for better performance, our schemas are free of such a requirement; and our implementations automatically take care of multi-allelic positions, which are frequent in large-scale VCF files, by retaining the information of all alleles until the merging actually occurs.
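As a concrete illustration of that generalization, only the record parser and the sort key need to change. The sketch below adapts the earlier k-way merge sketch to BED intervals keyed by (chromosome, start, end); it is a hypothetical example rather than part of CloudMerge.

    # Hypothetical adaptation of the earlier k-way merge sketch to BED intervals:
    # only the parser and key change, the merge logic itself stays the same.
    import heapq

    def bed_records(path):
        """Yield (chrom, start, end, line) from a coordinate-sorted BED file."""
        with open(path) as handle:
            for line in handle:
                if line.startswith(("track", "browser", "#")):
                    continue
                chrom, start, end = line.split("\t")[:3]
                yield (chrom, int(start), int(end), line)

    def merge_sorted_beds(paths, out_path):
        streams = [bed_records(p) for p in paths]
        with open(out_path, "w") as out:
            for *_, line in heapq.merge(*streams):     # k-way merge by (chrom, start, end)
                out.write(line)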
Finally, in light of the different features and specialties of these three platforms, each of the three schemas we developed has its own advantages and disadvantages under different application scenarios, as summarized in Table 2. For example, the MapReduce schema is good for a static, one-time, nonincremental merge of large-size data, provided sufficient cores are available, since it has the most parallel jobs, the least overhead, and the most transparent workflow. The HBase schema, supported by data warehousing technologies, fits an incremental merging scenario, since it does not need to re-merge existing results with new ones from scratch as long as the incremental merging is performed on the same chromosomes. It also provides highly efficient storage and on-line analytical processing of the merged results. The Spark schema is ideal for merging large-scale data with relatively limited computing resources because it has the least data shuffling and keeps intermediate results in memory. A bonus brought by Spark is that subsequent statistical analyses can be carried out directly on the merged results using its rich set of parallel statistical utilities.
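To illustrate that last point, the short PySpark sketch below runs a simple follow-on summary directly on a merged TPED output. It is a hypothetical example: the paths and the chosen statistic are illustrative, and it assumes the standard TPED layout of chromosome, variant ID, genetic distance, and position, followed by two allele calls per sample.

    # Hypothetical downstream analysis on the merged TPED output (illustration only).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("post-merge-summary").getOrCreate()

    def allele_summary(line):
        fields = line.split()
        chrom, variant_id, _, pos = fields[:4]
        calls = [a for a in fields[4:] if a != "0"]      # "0" denotes a missing call
        return (chrom, variant_id, int(pos), len(set(calls)))

    variants = (spark.sparkContext
                    .textFile("hdfs:///merged_output/*.tped")   # hypothetical path
                    .map(allele_summary))

    # Example: count multi-allelic sites (more than two distinct alleles) on the cluster.
    multi_allelic_sites = variants.filter(lambda rec: rec[3] > 2).count()
    print(multi_allelic_sites)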
Availability and requirements

Project name: CloudMerge
Project home page: https://github.com/xsun28/CloudMerge
Operating system(s): Linux
Programming language: Java, Python
Other requirements: Java 1.7 or higher, Python 2.7 or 3.6, Hadoop-2.7, HBase-1.3, Spark-2.1, StarCluster 0.95, MPI for Python 3.0.0
License: Apache License 2.0

Availability of supporting data

The source codes for the project are available on GitHub. The 186 individual VCF files used in our experiments are modified from the original VCF files obtained from WGS conducted by the Consortium on Asthma among the African-Ancestry Population in the Americas (CAAPA) [32]. To conceal potentially identifiable genotype information from the public, we encrypted the authentic genomic locations of the original 93 VCF files to generate a new batch of encrypted VCF files for test purposes. Please refer to the Data formats and operations section for details. These supporting data and a snapshot of the project code are available in the GigaScience database, GigaDB [43]. Via GigaDB, we also provide sample results of merging 93 VCF files into either one VCF or one TPED file using our Apache cluster-based schemas.

Additional file

Figure S1. Pseudocode of the MapReduce schema.
Figure S2. Pseudocode of the HBase schema.
Figure S3. Pseudocode of the Spark schema.

Abbreviations

AWS: Amazon Web Service; CAAPA: Consortium on Asthma among African-ancestry Population in the Americas; CPU: central processing unit; GWAS: genome-wide association studies; HDFS: Hadoop distributed file system; HPC: high-performance computing; I/O: Input/Output; MPI: message passing interface; NFS: network file system; NGS: next generation sequencing; RDD: resilient distributed dataset; SNP: single-nucleotide polymorphism; VCF: variant call format; WGS: whole-genome sequencing.

Ethics approval and consent to participate

Ethics approval for the CAAPA program was provided by the Johns Hopkins University Institutional Review Board (IRB) following commencement of the study in 2011 (IRB00045892, CAAPA) and included study team members from each CAAPA site, including Emory University (site PI, Zhaohui Qin). Access to the raw data by CAAPA team members is granted according to the guidelines of the IRB-approved study. Informed consent has been obtained from all study participants of CAAPA.

Competing interests

The authors declare that they have no competing interests.

Funding

This study was supported by grants from the National Heart, Lung, and Blood Institute (R01HL104608, R01HL117004, R01HL128439, R01HL135156, X01HL134589); National Institute of Environmental Health Sciences (R01ES015794, R21ES24844); National Institute on Minority Health and Health Disparities (P60MD006902, R01MD010443, RL5GM118984); National Institute of Neurological Disorders and Stroke (R01NS051630, P01NS097206, U54NS091859); National Science Foundation (ACI 1443054, IIS 1350885); and the Tobacco-Related Disease Research Program (24RT-0025). The Genes-Environments and Admixture in Latino Americans (GALA II) Study, the Study of African Americans, Asthma, Genes and Environments (SAGE) Study, and E.G.B. are supported by the Sandler Family Foundation, the American Asthma Foundation, the RWJF Amos Medical Faculty Development Program, and the Harry Wm. and Diana V. Hind Distinguished Professor in Pharmaceutical Sciences II.
Ware, MD , Eugene Biomedical Sciences Graduate Program, University of Califor- 46 46 46 Bleecker, MD , Deborah Meyers, PhD ,VictorE.Ortega, MD , nia, San Francisco, San Francisco, CA, 47 47 29 Pissamai Maul, BSc, RN ,TrevorMaul, RN , Harold Watson, Department of Medicine, University of Chicago, Chicago, IL, 48,49 50 30 MD , Maria Ilma Araujo, MD, PhD , Ricardo Riccio Oliveira, Department of Statistics, University of Chicago, Chicago, IL, 51 52 53 31 PhD , Luis Caraballo, MD, PhD , Javier Marrugo, MD , Beatriz Department of Human Genetics, University of Chicago, 52 52 54 Martinez, MSc , Catherine Meza, LB , Gerardo Ayestas ,Ed- Chicago, IL, 55,56,57 32 win Francisco Herrera-Paz, MD, MSc , Pamela Landaverde- Department of Medicine and Center for Global Health, Univer- 55 55 55 Torres , Said Omar Leiva Erazo , Rosella Martinez, BSc ,varo sity of Chicago, Chicago, IL, 56 55 33 Mayorga, MD ,LuisF.Mayorga,MD , Delmy-Aracely Mejia- Department of Chemical Pathology, University of Ibadan, 56,57 55 54 54 Mejia, MD , Hector Ramos , Allan Saenz , Gloria Varela , Ibadan, Nigeria, 57 58 34 Olga Marina Vasquez , Trevor Ferguson, MBBS, DM, MSc ,Jen- Institute for Genome Sciences, University of Maryland School nifer Knight-Madden, MBBS, PhD , Maureen Samms-Vaughan, of Medicine, Baltimore, MD, 59 58 35 MBBS, DM, Ph , Rainford J. Wilks, MBBS, DM, MSc ,AkimAdeg- Program in Personalized and Genomic Medicine, University of 60,61,62 60,61,62 nika, MD, PhD , Ulysse Ateba-Ngoa, MD ,Maria Yaz- Maryland School of Medicine, Baltimore, MD, 62 36 danbakhsh, PhD Department of Medicine, University of Maryland School of Department of Medicine, Johns Hopkins University, Baltimore, Medicine, Baltimore, MD, MD, Department of Biostatistics, SPH II, University of Michigan, Department of Epidemiology, Bloomberg School of Public Ann Arbor, MI, Health, JHU, Baltimore, MD, Department of Medicine, University of Mississippi Medical Department of Medicine, The Brooklyn Hospital Center, Brook- Center, Jackson, MS, lyn, NY, Department of Physiology and Biophysics, University of Mis- Department of Biostatistics, Bloomberg School of Public Health, sissippi Medical Center, Jackson, MS, JHU, Baltimore, MD, Department of Genetics, University of North Carolina, Chapel Data and Statistical Sciences, AbbVie, North Chicago, IL, Hill, NC, 6 41 Department of Biostatistics and Bioinformatics, Emory Univer- Department of Genomic Sciences, University of Washington, sity, Atlanta, GA, Seattle, WA, 7 42 Department of Public Health Sciences, Henry Ford Health Sys- Department of Pediatrics, University of Washington, Seattle, tem, Detroit, MI, WA, 8 43 Center for Health Policy & Health Services Research, Henry Ford University of Washington, Seattle, WA, Health System, Detroit, MI, Department of Medicine, Vanderbilt University, Nashville, TN, 9 45 Department of Internal Medicine, Henry Ford Health System, Department of Pathology, Microbiology and Immunology, Van- Detroit, MI, Department of Microbiology, Howard University derbilt University, Nashville, TN, College of Medicine, Washington, DC, Center for Human Genomics and Personalized Medicine, Wake National Human Genome Center, Howard University College Forest School of Medicine, Winston-Salem, NC, of Medicine, Washington, DC, Genetics and Epidemiology of Asthma in Barbados, The Uni- Department of Genetics, Stanford University School of versity of the West Indies, Medicine, Stanford, CA, Faculty of Medical Sciences Cave Hill Campus, The University Department of Genetics and Genomics, Icahn School of of 
Affiliations: Department of Medicine, Johns Hopkins University, Baltimore, MD; Department of Epidemiology, Bloomberg School of Public Health, JHU, Baltimore, MD; Department of Medicine, The Brooklyn Hospital Center, Brooklyn, NY; Department of Biostatistics, Bloomberg School of Public Health, JHU, Baltimore, MD; Data and Statistical Sciences, AbbVie, North Chicago, IL; Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA; Department of Public Health Sciences, Henry Ford Health System, Detroit, MI; Center for Health Policy & Health Services Research, Henry Ford Health System, Detroit, MI; Department of Internal Medicine, Henry Ford Health System, Detroit, MI; Department of Microbiology, Howard University College of Medicine, Washington, DC; National Human Genome Center, Howard University College of Medicine, Washington, DC; Department of Genetics, Stanford University School of Medicine, Stanford, CA; Department of Genetics and Genomics, Icahn School of Medicine at Mount Sinai, New York; Illumina, Inc., San Diego, CA; Knome Inc., Cambridge, MA; Pulmonary and Critical Care Medicine, Morehouse School of Medicine, Atlanta, GA; Department of Medicine, Northwestern University, Chicago, IL; Department of Preventive Medicine, Northwestern University, Chicago, IL; Department of Pediatrics, Northwestern University, Chicago, IL; The Ann & Robert H. Lurie Children's Hospital of Chicago, Chicago, IL; Department of Medicine, Northwestern Feinberg School of Medicine, Chicago, IL; Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA; Department of Medicine, University of California, San Francisco, San Francisco, CA; Department of Neurology, University of California, San Francisco, San Francisco, CA; Institute for Human Genetics, University of California, San Francisco, San Francisco, CA; California Institute for Quantitative Biosciences, University of California, San Francisco, San Francisco, CA; CIBER de Enfermedades Respiratorias, Instituto de Salud Carlos III, Madrid; Biomedical Sciences Graduate Program, University of California, San Francisco, San Francisco, CA; Department of Medicine, University of Chicago, Chicago, IL; Department of Statistics, University of Chicago, Chicago, IL; Department of Human Genetics, University of Chicago, Chicago, IL; Department of Medicine and Center for Global Health, University of Chicago, Chicago, IL; Department of Chemical Pathology, University of Ibadan, Ibadan, Nigeria; Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD; Program in Personalized and Genomic Medicine, University of Maryland School of Medicine, Baltimore, MD; Department of Medicine, University of Maryland School of Medicine, Baltimore, MD; Department of Biostatistics, SPH II, University of Michigan, Ann Arbor, MI; Department of Medicine, University of Mississippi Medical Center, Jackson, MS; Department of Physiology and Biophysics, University of Mississippi Medical Center, Jackson, MS; Department of Genetics, University of North Carolina, Chapel Hill, NC; Department of Genomic Sciences, University of Washington, Seattle, WA; Department of Pediatrics, University of Washington, Seattle, WA; University of Washington, Seattle, WA; Department of Medicine, Vanderbilt University, Nashville, TN; Department of Pathology, Microbiology and Immunology, Vanderbilt University, Nashville, TN; Center for Human Genomics and Personalized Medicine, Wake Forest School of Medicine, Winston-Salem, NC; Genetics and Epidemiology of Asthma in Barbados, The University of the West Indies; Faculty of Medical Sciences, Cave Hill Campus, The University of the West Indies; Queen Elizabeth Hospital, The University of the West Indies; Immunology Service, Universidade Federal da Bahia, Salvador, BA; Laboratório de Patologia Experimental, Centro de Pesquisas Gonçalo Moniz, Salvador, BA; Institute for Immunological Research, Universidad de Cartagena, Cartagena; Instituto de Investigaciones Immunologicas, Universidad de Cartagena, Cartagena; Faculty of Medicine, Universidad Nacional Autonoma de Honduras en el Valle de Sula, San Pedro Sula; Facultad de Medicina, Universidad Catolica de Honduras, San Pedro Sula; Centro de Neumologia y Alergias, San Pedro Sula; Faculty of Medicine, Centro Medico de la Familia, San Pedro Sula; Tropical Medicine Research Institute, The University of the West Indies; Department of Child Health, The University of the West Indies; Centre de Recherches Médicales de Lambaréné; Institut für Tropenmedizin, Universität Tübingen; Department of Parasitology, Leiden University Medical Center, Netherlands.

Acknowledgements

We thank the three referees for their constructive critiques and detailed comments. We are grateful to Mary Taylor Mann and Alyssa Leann Duck for their editorial help during the writing and revising of the manuscript. E.G.B. acknowledges the following GALA II and SAGE co-investigators for subject recruitment, sample processing, and quality control: Sandra Salazar; Scott Huntsman, MSc; Donglei Hu, PhD; Lisa Caine; Shannon Thyne, MD; Harold J. Farber, MD, MSPH; Pedro C. Avila, MD; Denise Serebrisky, MD; William Rodriguez-Cintron, MD; Jose R. Rodriguez-Santana, MD; Rajesh Kumar, MD; Luisa N. Borrell, DDS, PhD; Emerita Brigino-Buenaventura, MD; Adam Davis, MA, MPH; Michael A. LeNoir, MD; Kelley Meade, MD; Saunak Sen, PhD; and Fred Lurmann, MS.
References

1. Massie M, Nothaft F, Hartl C, et al. ADAM: genomics formats and processing patterns for cloud scale computing. University of California, Berkeley, Technical Report No. UCB/EECS-2013-207; 2013.
2. Siretskiy A, Sundqvist T, Voznesenskiy M, et al. A quantitative assessment of the Hadoop framework for analyzing massively parallel DNA sequencing data. GigaScience 2015;4(1):26.
3. Merelli I, Pérez-Sánchez H, Gesing S, et al. Managing, analysing, and integrating big data in medical bioinformatics: open problems and future perspectives. BioMed Research International 2014:134023.
4. Reyes-Ortiz JL, Oneto L, Anguita D. Big data analytics in the cloud: Spark on Hadoop vs MPI/OpenMP on Beowulf. Procedia Computer Science 2015;53:121–30.
5. Burren OS, Guo H, Wallace C. VSEAMS: a pipeline for variant set enrichment analysis using summary GWAS data identifies IKZF3, BATF and ESRRA as key transcription factors in type 1 diabetes. Bioinformatics 2014;30(23):3342–8.
6. Apache Hadoop. http://hadoop.apache.org/. Accessed 10 April 2018.
7. Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. Commun ACM 2008;51(1):107–13.
8. Vora MN. Hadoop-HBase for large-scale data. In: Proceedings of the 2011 International Conference on Computer Science and Network Technology (ICCSNT). IEEE; 2011, pp. 601–5.
9. Zaharia M, Chowdhury M, Das T, et al. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association; 2012, p. 2.
10. McKenna A, Hanna M, Banks E, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010;20(9):1297–303.
11. Schatz MC. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics 2009;25(11):1363–9.
12. Pireddu L, Leo S, Zanetti G. SEAL: a distributed short read mapping and duplicate removal tool. Bioinformatics 2011;27(15):2159–60.
13. Niemenmaa M, Kallio A, Schumacher A, et al. Hadoop-BAM: directly manipulating next generation sequencing data in the cloud. Bioinformatics 2012;28(6):876–7.
14. Langmead B, Schatz MC, Lin J, et al. Searching for SNPs with cloud computing. Genome Biol 2009;10(11):R134.
15. Wang S, Mares MA, Guo YK. CGDM: collaborative genomic data model for molecular profiling data using NoSQL. Bioinformatics 2016;32(23):3654–60.
16. Zhou W, Li R, Yuan S, et al. MetaSpark: a Spark-based distributed processing tool to recruit metagenomic reads to reference genomes. Bioinformatics 2017;33(7):1090–2.
17. Niu B, Zhu Z, Fu L, et al. FR-HIT, a very fast program to recruit metagenomic reads to homologous reference genomes. Bioinformatics 2011;27(12):1704–5.
18. AWS Genomics Guide. https://d0.awsstatic.com/Industries/HCLS/Resources/AWS_Genomics_WP.pdf. Accessed 10 April 2018.
19. Gruber K. Google for genomes. Nature Biotechnology 2014;32(6):508.
20. O'Brien AR, Saunders NF, Guo Y, et al. VariantSpark: population scale clustering of genotype information. BMC Genomics 2015;16(1):1052.
21. Wiewiórka MS, Messina A, Pacholewska A, et al. SparkSeq: fast, scalable and cloud-ready tool for the interactive genomic data analysis with nucleotide precision. Bioinformatics 2014;30(18):2652–3.
22. Decap D, Reumers J, Herzeel C, et al. Halvade: scalable sequence analysis with MapReduce. Bioinformatics 2015;31(15):2482–8.
23. He M, Person TN, Hebbring SJ, et al. SeqHBase: a big data toolset for family based sequencing data analysis. J Med Genet 2015;52:282–8.
24. Purcell S, Neale B, Todd-Brown K, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 2007;81(3):559–75.
25. Mohammed EA, Far BH, Naugler C. Applications of the MapReduce programming framework to clinical big data analysis: current landscape and future trends. BioData Min 2014;7:22.
26. Huang H, Tata S, Prill RJ. BlueSNP: R package for highly scalable genome-wide association studies using Hadoop clusters. Bioinformatics 2013;29(1):135–6.
27. White T. Hadoop: The Definitive Guide. Sebastopol, CA: O'Reilly Media, Inc.; 2012.
28. Danecek P, Auton A, Abecasis G, et al. The variant call format and VCFtools. Bioinformatics 2011;27(15):2156–8.
29. Multiway-merge algorithm. https://en.wikipedia.org/wiki/K-Way_Merge_Algorithms. Accessed 10 April 2018.
30. Chang F, Dean J, Ghemawat S, et al. Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst 2008;26(2).
31. The 1000 Genomes Project Consortium, Abecasis GR, Altshuler D, et al. A map of human genome variation from population-scale sequencing. Nature 2010;467(7319):1061–73.
32. Mathias RA, Taub MA, Gignoux CR, et al. A continuum of admixture in the Western Hemisphere revealed by the African Diaspora genome. Nat Commun 2016;7:12522.
33. Kwon Y, Balazinska M, Howe B, et al. A study of skew in MapReduce applications. Open Cirrus Summit 2011;11.
34. O'Neil P, Cheng E, Gawlick D, et al. The log-structured merge-tree (LSM-tree). Acta Informatica 1996;33(4):351–85.
35. CloudMerge. https://github.com/xsun28/CloudMerge. Accessed 25 May 2018.
36. Li H. Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics 2011;27(5):718–9.
37. Amdahl GM. Validity of the single processor approach to achieving large scale computing capabilities. In: Proceedings of the April 18–20, 1967, Spring Joint Computer Conference. ACM; 1967, pp. 483–5.
38. Gustafson JL. Reevaluating Amdahl's law. Commun ACM 1988;31(5):532–3.
39. Sedgewick R, Flajolet P. An Introduction to the Analysis of Algorithms. Boston, MA: Addison-Wesley; 2013.
40. Ozsu MT, Valduriez P. Principles of Distributed Database Systems. Berlin/Heidelberg, Germany: Springer Science & Business Media; 2011.
41. Miner D, Shook A. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. Sebastopol, CA: O'Reilly Media, Inc.; 2012.
42. Chen L, Wang C, Qin ZS, et al. A novel statistical method for quantitative comparison of multiple ChIP-seq datasets. Bioinformatics 2015;31(12):1889–96.
43. Sun X, Gao J, Jin P, et al. Supporting data for "Optimized distributed systems achieve significant performance improvement on sorted merging of massive VCF files." GigaScience Database 2018. http://dx.doi.org/10.5524/100423.
