Core Hunter 3: flexible core subset selection

Core Hunter 3: flexible core subset selection Background: Core collections provide genebank curators and plant breeders a way to reduce size of their collections and populations, while minimizing impact on genetic diversity and allele frequency. Many methods have been proposed to generate core collections, often using distance metrics to quantify the similarity of two accessions, based on genetic marker data or phenotypic traits. Core Hunter is a multi-purpose core subset selection tool that uses local search algorithms to generate subsets relying on one or more metrics, including several distance metrics and allelic richness. Results: In version 3 of Core Hunter (CH3) we have incorporated two new, improved methods for summarizing distances to quantify diversity or representativeness of the core collection. A comparison of CH3 and Core Hunter 2 (CH2) showed that these new metrics can be effectively optimized with less complex algorithms, as compared to those used in CH2. CH3 is more effective at maximizing the improved diversity metric than CH2, still ensures a high average and minimum distance, and is faster for large datasets. Using CH3, a simple stochastic hill-climber is able to find highly diverse core collections, and the more advanced parallel tempering algorithm further increases the quality of the core and further reduces variability across independent samples. We also evaluate the ability of CH3 to simultaneously maximize diversity, and either representativeness or allelic richness, and compare the results with those of the GDOpt and SimEli methods. CH3 can sample equally representative cores as GDOpt, which was specifically designed for this purpose, and is able to construct cores that are simultaneously more diverse, and either are more representative or have higher allelic richness, than those obtained by SimEli. Conclusions: In version 3, Core Hunter has been updated to include two new core subset selection metrics that construct cores for representativeness or diversity, with improved performance. It combines and outperforms the strengths of other methods, as it (simultaneously) optimizes a variety of metrics. In addition, CH3 is an improvement over CH2, with the option to use genetic marker data or phenotypic traits, or both, and improved speed. Core Hunter 3 is freely available on http://www.corehunter.org. Keywords: Core collections, Multi-objective, Local search heuristics Background working) and base collections. Examples of active collec- Genebanks were established by national or international tions include seed stores or live plants that can be accessed breeding, or conservation programs with the goal to safe- quickly by plant breeders and researchers through germi- guard genetic diversity for future use. Many breeding pro- nation or clonal propagation. In contrast, accessions in grams have established genebanks as a resource for new base collections are held in long-term storage, such as variation in the crops they breed, allowing them to react cryopreservation, and require some time for regeneration to changing environments and emerging biotic and abiotic and propagation before being made available. stresses. Accessions are often divided between active (or During the last few decades the collections stored in genebanks have grown enormously, and cost of main- taining viable germplasm within genebanks has increased. *Correspondence: herman.debeukelaer@ugent.be Department of Applied Mathematics, Computer Science and Statistics, Ghent Genebank curators must make decisions about which University, Krijgslaan 281 S9, 9000 Gent, Belgium accessions to maintain in the active collection versus the Full list of author information is available at the end of the article base collection, and may even consider not maintaining © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. De Beukelaer et al. BMC Bioinformatics (2018) 19:203 Page 2 of 12 an accession at all. The concept of a core collection was [16]. The genetic distance sampling strategy constructs introduced to help with these decisions, and is defined as cores with a given minimum distance between selected subset of the complete collection which most represents accessions by repeatedly including a random acces- the diversity of the entire collection with minimum redun- sion and removing all others within a certain sampling dancy [1]. Genebank curators can use core collections to radius [17]. define the active collection over the base collection. Core Core Hunter was designed to meet the variety of crite- collections can also be used to aid researchers and plant ria used to evaluate core collections for different purposes, breeders in the choice of starting material. For example, and supports optimization of several of these metrics, the potential for use of core collections has been shown using flexible local search algorithms [18]. Core Hunter for association studies [2, 3]. can construct core collections for specific applications, A variety of measures have been used to evaluate core and combines multiple objectives to bring the different collections based on genetic marker data or phenotypic perspectives closer together, for example by simultane- traits, including pairwise distances and allelic richness. ously maximizing genetic dissimilarity and allelic rich- The choice of the most appropriate evaluation measure ness. Although Core Hunter is mainly focused at fixed size depends on the purpose of the core collection [4]. Some- core subset selection, version 1 and 2 allowed to spec- times core collections are sampled based on a combi- ify a minimum and maximum size and preferred smaller nation of both genotypes and phenotypes [5–7]. Many cores with the same value. Core Hunter was shown to methods have been proposed to sample high quality core outperform stratified sampling strategies, MSTRAT and collections according to the measure(s) of interest. The PowerCore. first methods were stratified sampling techniques that It has been assumed that, to obtain a diverse core, the cluster the accessions, based on distance matrices calcu- average distance between its entries should be maximized lated from their allele scores or phenotypic trait values, [9, 18]. However, a high entry-to-entry distance does not and then select several accessions from each cluster using guarantee that selected accessions are sufficiently differ- a certain allocation method. Brown suggested to randomly ent, and it is known that maximizing this criterion over- select either a constant (C) number of accessions per clus- represents extreme values [4, 19]. Core Hunter 2 (CH2) ter, or a number proportional (P) to the size or logarithm deals with this issue by also maximizing the minimum (L) of the size of the cluster, and argued that the L-method distance between selected accessions [19]. Although aver- is preferred [8]. It was later shown that more diverse cores age distance and allelic richness can be effectively opti- are obtained when the number of included accessions is mized using simple and fast local search algorithms, such proportional to the within-cluster diversity [9]. as a stochastic hill-climber, a more complex and slower Another allocation method, the M-method, maximizes mixed replica search (MixRep) was required to maximize the probability to retain all observed alleles in order to minimum distance in the Core Hunter framework. The construct cores with high allelic richness [10]. This idea MixRep algorithm runs multiple types of stochastic local led to the development of the MSTRAT software, which searches in parallel, as well as a constructive algorithm implements a generalized M-method that directly sam- (LR) that starts from an empty selection that is itera- ples from the entire collection to maximize allelic richness tively extended. In case an active search is unable to find with a simple hill-climbing algorithm [11]. Other heuris- any further improvements, it is terminated and replaced tics work by repeatedly removing one of the two most with a new local search engine starting from a selection similar accessions from the collection until the desired that is obtained by combining two previously found high- core size is obtained, either randomly (least distance quality selections, in an attempt to further explore other stepwise sampling [12]), or using a specific elimination interesting regions of the search space, as in a genetic criterion maximizing the distance to the remaining acces- algorithm [20]. sions or expected heterozygosity of the reduced collection Another approach to maximize diversity, while at the (SimEli) [13]. The genetic distance optimization strategy same time avoiding inclusion of too similar accessions at (GDOpt) was designed to construct highly representative the extremes of the collection, is to maximize the average cores, in which each accession from the entire collection distance between each entry and the closest other entry is represented by a similar core entry [14]. GDOpt par- in the core, as proposed by Odong et al. [4]. The SimEli titions the data around a number of identified medoids, algorithm was shown to outperform Core Hunter 2 in which are then selected as the core entries. Methods terms of this new entry-to-nearest-entry (E-NE) metric. for variable size core sampling have also been devel- Alternatively, one may desire to optimally represent the oped. PowerCore minimizes the size of the core, while individual accessions, instead of the whole range of diver- covering all observed marker alleles and/or trait values sity. In such case, Odong et al. reccomend to minimize the [15]. GenoCore was developed for the same purpose, average distance between each accession in the full collec- and specifically tailored to high-density marker datasets tion and the most similar accession contained in the core. De Beukelaer et al. BMC Bioinformatics (2018) 19:203 Page 3 of 12 The GDOpt strategy was specifically developed to mini- Evaluation measures mize this accession-to-nearest-entry (A-NE) metric, and Core Hunter 3 includes various evaluation measures that shown to outperform both Core Hunter 2 and SimEli for can be selected as optimization objectives, including but this purpose [13, 14]. not limited to those described below. We refer to the We introduce Core Hunter 3 (CH3), which incorporates website http://www.corehunter.org for an overview of all the two improved methods for summarizing distances, provided measures. entry-to-nearest-entry (E-NE) and accession-to-nearest- Distance measures entry (A-NE), proposed by Odong et al. [4]. CH3 attempts We used the Modified Roger’s distance [18, 21]toassess to find the maximum entry-to-nearest-entry distance to the dissimilarity of accessions based on genetic marker obtain diverse cores, whereas accession-to-nearest-entry data. For phenotypic traits we used Gower’s distance [22] distance is minimized to represent as much as possible which simultaneously takes into account qualitative and all accessions from the entire collection. More specifi- quantitative traits. Pairwise distances are aggregated as cally, CH3 can sample fixed size cores based on molecular follows to evaluate the diversity or representativeness of marker data, phenotypic traits, a precomputed distance the core [4]: matrix, or a combination of these. The distance matrix can be generated using an appropriate measure, such as Entry-to-nearest-entry (E-NE): the average distance Modified Roger’s distance for genotypes [21] or Gower’s between each selected accession and the closest other distance for phenotypes [22]. As in previous versions, core entry. This criterion can be maximized to Core Hunter 3 can also maximize allelic richness, as well construct highly diverse cores in which all accessions as a combination of multiple metrics. In particular, we are maximally different. assess whether the new distance-based E-NE and A-NE Accession-to-nearest-entry (A-NE): the mean metrics can be effectively optimized using fast local search distance between each accession from the entire algorithms, and whether maximizing E-NE indirectly also collection and the most similar core entry, including yields a high minimum distance, without the need for itself in case the accession has been selected. a more complex algorithm. Furthermore, we assess the Minimizing this criterion yields cores that maximally ability of Core Hunter 3 to simultaneously maximize E- represent all individual accessions. NE and A-NE, or E-NE and allelic richness, and com- pare the results with those obtained with Core Hunter When comparing CH3 with CH2 we also evaluated the 2, GDOpt, and SimEli, for three marker datasets with minimum distance (DMIN) between selected accessions, different allelic composition and varying size, and one but this is not an objective that can be directly opti- phenotypic trait dataset. Core Hunter 3 is available as an mized by CH3, for reasons explained in the discussion. Rpackage corehunter on CRAN and as an open source A detailed description and comparison of the E-NE and project on GitHub. A prototype graphical user interface A-NE metrics are provided in [4]. is also available. See http://www.corehunter.org for more Allelic richness information. To evaluate the allelic richness of cores sampled based on genetic marker data, we used the average expected Methods heterozygosity (HE) per locus [18, 23], calculated as Core selection problem Given a collection A that contains n accessions, and a desired core size 1 < k < n, the feasible solution space of 0 ≤ HE = 1 − p ˆ ≤ 1 la possible core subsets is defined as l=1 a=1 = {C | C ⊂ A ∧ |C| = k} where L is thenumberofmarkers (loci), n is the number of observed alleles at the lth locus, and p ˆ is the fre- la where |C| denotes the size of the subset. The core selection quency of the ath allele at the lth locus in the selected core problem then consists of finding an optimal subset C ∈ collection. that maximizes a certain evaluation measure F(C) :  → R,i.e. Weighted index and normalization As in previous versions, Core Hunter can simultaneously optimize k measures by maximizing a weighted index C = argmax F(C). C∈ In case the evaluation measure F(C) is intended to be F(c) = α F (c) i i minimized, this can be achieved by maximizing −F(C). i=1 De Beukelaer et al. BMC Bioinformatics (2018) 19:203 Page 4 of 12 where F is the ith included evaluation measure and 0 < Algorithm 1 Random descent. α < 1 is the weight assigned to this objective, with i Input: collection A,coresize(k), evaluation measure α = 1. In case of a measure F that is to be F(C), neighbourhood N (C) :  → P () with  = i i i=1 minimized, such as A-NE, it is transformed into a max- {C | C ⊂ A ∧ |C| = k} imization objective F =−F when it is included in the Output: best found core C ∈ weighted index. The individual measures are automati- 1: C ← random element of cally normalized to [ 0, 1], following the Pareto minimum 2: repeat based upper-lower-bound approach as described in [24], 3: pick random neighbour C ∈ N (C) to ensure a fair balance between the included objectives, 4: if F(C )> F(C) then independent of their original range. More information 5: C ← C about this normalization is provided in the documenta- 6: end if tion of the R package. 7: until stop condition satisfied 8: return C Core sampling algorithms We evaluate the performance of three general purpose selection heuristics to optimize the chosen evaluation higher probability to accept inferior modifications, simi- measureorweightedindex forafixedcoresize: ran- lar to the frequently used simulated annealing algorithm dom descent, parallel tempering, and a genetic algorithm. [25]. The acceptance function is commonly defined as Based on the findings in this study, only the former two were included in Core Hunter 3, which defaults to the par- 1if > 0 p(, t) = allel tempering algorithm, but also provides a fast mode /t e else in which the random descent algorithm is applied. Note that these two stochastic local search algorithms were where  = F(C ) − F(C ) and t is the temperature of also available in CH2, although they were not used by the replica. This acceptance function ensures that neigh- default. The search algorithms are executed until either an bours with a better score are always accepted, whereas absolute runtime limit has been exceeded, or no further inferior neighbours are accepted at a probability that improvements were obtained during a certain amount exponentially decreases as the solution gets poorer or as of time. the temperature is decreased. In addition, searches with similar temperature periodically exchange their current Random descent selection, which has the effect to push the most promis- This basic local search outlined in Algorithm 1 starts with ing solutions towards the coolest searches to promote a random selection of the desired size and then iteratively convergence towards a common solution, and the worst tries to improve its quality by slightly modifying the core. solutions towards the hottest searches allowing them to The obtained similar selection, referred to as a neighbour escape from local optima. The probability that replica r of the current selection, is accepted if and only if it has and r + 1 will swap their current selection is commonly a higher objective function value according to the cho- defined as sen evaluation measure. Otherwise, another move is tried from the current selection. Core Hunter uses a single- 1if > 0 q ( , t , t ) = 1 1 r r r+1 swap neighbourhood, i.e. considers all neighbours that can − t t r+1 e else be obtained from the current selection by replacing one selected accession with a currently unselected accession. with  = F (C ) − F(C ). As such, if the current selec- r r+1 r tion of replica r + 1 has a better objective function value Parallel tempering than that of the rth replica, these are always swapped. In Algorithm 2 describes the more advanced parallel tem- addition, similar to the probabilistic acceptance of infe- pering method [18], also referred to as replica exchange rior neighbours, swaps that push solutions in the opposite Monte Carlo (REMC), which consists of multiple coop- direction may also be performed—yet with a probability erating local searches that are executed in parallel. Each that decreases for a larger difference in objective func- search performs the same procedure as random descent, tion value and replica temperature. The parallel tempering but may also accept inferior modifications to be able to algorithm implemented in Core Hunter 3 consists of p = −8 −4 escape from local optima, i.e. to further improve the cur- 10 searches with a temperature range of 10 ,10 ,and rent selection even if none of the considered neighbours uses the same single-swap neighbourhood as the random has a better score. For this purpose, the search repli- descent method described above. The number of replica cates are assigned fixed, increasing temperatures, equally steps per iteration is fixed to q = 500, and the default spread in a given range. A higher temperature leads to a acceptance and swap functions are applied. De Beukelaer et al. BMC Bioinformatics (2018) 19:203 Page 5 of 12 Algorithm 2 Parallel tempering. Selection (SELECT).We randomly picked five candidates from the current population, from which the one Input: collection A,coresize(k), evaluation measure with the highest objective function value was chosen F(C), neighbourhood N (C) :  → P () with as a parent (tournament selection). = {C | C ⊂ A ∧ |C| = k}, number of replicas (p), Crossover (CROSS). A child was created from two par- temperature range [ t , t ], acceptance function min max ents by repeatedly adding an arbitrary accession that p(, t),swapfunction q(, t , t ), number of replica 1 2 is selected in either parent solution (at random with steps per iteration (q) equal probability) until the desired core size was Output: best found core C ∈ obtained. 1: for i from 1 to p do Mutation (MUTATE). As mutation operator we applied i−1 2: t ← t + (t − t ) i min max min p−1 the random descent heuristic described above, start- 3: C ← random element of ing from the given solution, until no improvement 4: end for was found in the last 5000 steps. 5: C ← argmax F(C ) best i 1≤i≤p Survival (SURVIVE). We applied a roulette selection to 6: s ← 0 discard five solutions in each step, so that the pop- 7: repeat ulation size remained fixed over all generations. A 8: for i from 1 to p (in parallel) do solution C was assigned a weight of 1/F(C) meaning 9: repeat q times that the probability that it is discarded is inversely 10: pick random neighbour C ∈ N (C ) proportional to its fitness. 11: compute  ← F(C ) − F(C ) i i 12: with probability p( , t ):set C ← C i i i 13: if F(C )> F(C ) then best Algorithm 3 Genetic algorithm. 14: C ← C best i 15: end if Input: collection A,coresize(k), evaluation measure 16: end repeat F(C),populationsize(p), number of children per gen- 17: end for eration (c), selection operator SELECT :  → 18: r ← s + 1 with  = {C | C ⊂ A ∧ |C| = k}, crossover operator 19: while r < p do CROSS :  → ,mutationoperatorMUTATE :  → p+c p 20: compute  ← F(C ) − F(C ) , survival operator SURVIVE :  → r r+1 r 21: with probability q( , t , t ):swap C and Output: best found core C ∈ r r r+1 r r+1 1: pop ←∅ 22: r ← r + 2 2: for i from 1 to p do 23: end while 3: add random element of  to pop 24: s ← 1 − s 4: end for 25: until stop condition satisfied 5: C ← argmax F(C) best C∈pop 26: return C best 6: repeat 7: children ←∅ 8: for i from 1 to c (in parallel) do 9: P ← SELECT(pop) 10: P ← SELECT(pop) Genetic algorithm 11: C ← MUTATE(CROSS(P , P )) To assess the potential improvement of a global opti- 1 2 12: add C to children mization engine over a local search we also applied the 13: end for genetic algorithm [20] outlined in Algorithm 3. Here, 14: for C ∈ children do a population of initially randomly generated solutions 15: add C to pop (cores) is maintained. In every step, new child solutions 16: if F(C)> F(C ) then are produced by combining two randomly chosen par- best 17: C ← C ent solutions (crossover), followed by one or more swaps best 18: end if of accessions (mutation) between the unselected and the 19: end for selected subset. These children are added to the pop- 20: pop ← SURVIVE(pop) ulation, and certain solutions are discarded to simulate 21: until stop condition satisfied survival of the fittest individuals in natural evolution. For 22: return C our experiments we used a population size of p = 25 and best generated c = 5 children in each step (in parallel). We applied the following operators: De Beukelaer et al. BMC Bioinformatics (2018) 19:203 Page 6 of 12 Comparison with GDOpt and SimEli R through the package corehunter (https://cran.r-project. For the GDOpt selection strategy [14], we used the org/package=corehunter). GDOpt, SimEli, and all compu- k-medoids algorithm of Kaufman and Rousseuw [26] tational experiments were implemented in R v3.3.1 [31]. through the R function pam, to identify a representative Note that the R function pam used in GDOpt calls a C core collection. The number of clusters was chosen equal function which performs the actual partitioning. Exper- to the desired core size and the returned medoids were iments were executed on a computing server with two selected as core accessions. We also implemented SimEli 10-core Intel E5-2660v3 (2.6 GHz) CPUs and 128 GB in R, considering both elimination criteria suggested in RAM. [13]. In each step, one of the two most similar acces- sions was eliminated, maximizing either the average dis- Results tance to the remaining accessions (SimEli-A-RA) or the Optimizing E-NE and A-NE with local searches expected heterozygosity of the reduced collection (SimEli- We sampled 10 cores from each dataset using random HE), until the desired core size was obtained. The source descent, parallel tempering, and the described genetic code for these implementations is available on GitHub algorithm, configured to maximize E-NE with a runtime (corehunter/corehunter3-paper). limit of 30 min. Table 1 shows mean values and standard deviations of the obtained cores. The results indicate that Datasets parallel tempering yields the highest E-NE values, with We used four datasets of varying size and composition the lowest variability across independent samples. Vari- to compare the performance of different core sampling ability in solution quality is always at least one order of algorithms: magnitude below that observed for random descent and the genetic algorithm. Still, variability is already quite low 1 Rice data: 1000 accessions for which 39 phenotypic when using the basic random descent heuristic. Although traits were recorded, including 28 qualitative and 11 the genetic algorithm also outperforms random descent, quantitative traits. Available from the PowerCore it is not as effective as parallel tempering. We performed project [15] and previously used to assess the a pairwise comparison of the results obtained with the performance of several other core sampling three applied methods, for the four considered datasets, algorithms, including SimEli [13]. using a Wilcoxon rank-sum test [32]. The twelve result- 2 Coconut data: 1014 accessions characterized using ing p-values were adjusted for multiple testing to control 30 crop-specific SSR markers. Used in multiple the family-wise error rate (FWER) using Holm’s method previous core selection studies [4, 13, 14]. [33]. All differences were statistically significant at the α = 3 Maize data: 1250 accessions characterized with 1117 0.05 confidence level, with adjusted p-values ranging from SNP markers. Distributed as part of the R package 0.00013 to 0.00049. Figure 1 displays convergence curves synbreedData [27]. of the three applied algorithms, again averaged over 10 4 Pea data: 4428 accessions characterized by 17 RBIP runs, for the large pea dataset. These plots confirm that markers [28, 29]. Previously used to compare the all algorithms are able to iteratively improve an arbitrarily performance of Core Hunter 2 with other core bad random selection to reach a high E-NE value. Again sampling algorithms for large datasets [19]. we see that parallel tempering yields the highest-quality cores (left). Moreover, this algorithm is almost as fast as All cores sampled in the performed experiments com- the basic random descent heuristic (right). Both meth- prised 20% of the entire collection for the rice, coconut ods very quickly improve the initial random selection, and and maize datasets, and 10% for the large pea dataset. after less than 10 s, parallel tempering found a better solu- Implementation and hardware tion than random descent, after which it keeps improving Core Hunter 3 has been reimplemented in Java 8, using the the quality of the core. In contrast, the genetic algorithm JAMES framework (v1.2) for discrete optimization with takes a slower start, catches up with random descent after local search metaheuristics [30]and wasexecutedfrom 20 s, and then also further improves the selection—but not Table 1 Comparison of random descent, parallel tempering, and a genetic algorithm, when maximizing the entry-to-nearest-entry criterion (E-NE). Mean values and standard deviations are reported for 10 independently sampled core collections Rice Coconut Maize Pea Random descent 0.1500 ± 1.83e-04 0.5748 ± 5.22e-04 0.4332 ± 2.73e-04 0.3337 ± 1.70e-03 Parallel tempering 0.1508 ± 1.40e-15 0.5759 ± 2.12e-06 0.4359 ± 8.56e-05 0.3412 ± 1.46e-04 Genetic algorithm 0.1506 ± 1.12e-04 0.5755 ± 1.04e-04 0.4346 ± 3.45e-04 0.3386 ± 8.00e-04 De Beukelaer et al. BMC Bioinformatics (2018) 19:203 Page 7 of 12 Fig. 1 Convergence curves for pea dataset. These curves show the E-NE value of the best found solution at each point in time during execution of random descent, parallel tempering, and the genetic algorithm, averaged over 10 independent runs, for the large pea dataset. The left plot reports the progress during the entire run with a runtime of 30 min while the right plot is zoomed in on the first 40 s as effectively as parallel tempering. We performed these and execution time for 10 independent samples, obtained experiments only for the E-NE measure but assume that with both methods, and for each dataset except the rice our findings also hold for A-NE due to the very similar collection, because CH2 cannot sample cores based on composition of both criteria. All following CH3 results phenotypic traits. For all three datasets, CH3 yields higher were obtained with the parallel tempering algorithm. E-NE and DMIN than CH2. However, a detailed inspec- tion of the output generated by CH2 (not shown) revealed Comparison with Core Hunter 2 that the LR replica—one of the search replicas in the To assess whether maximizing E-NE indirectly also yields MixRep algorithm used by CH2—did not always complete a high minimum distance (DMIN) between selected before CH2 was terminated. This LR search is a con- accessions, we compared the results of CH3 and CH2. structive heuristic that starts with an empty selection and We configured CH2 to maximize a weighted index includ- iteratively adds the two best accessions, i.e. those yielding ing both average and minimum pairwise distance, with the best possible score when added to the current selec- tion. After each two additions, one accession is removed equal weight, and CH3 to maximize E-NE. Both algo- rithms were terminated when no improvement was found from the selection, again chosen to optimize the score of during the last 10 s. Table 2 reports average E-NE, DMIN, the remaining selection. This procedure is repeated until the desired core size has been reached. The LR replica was specifically included in CH2 to construct cores with Table 2 Comparison of Core Hunter 2 and 3 high minimum distance [19]. Therefore, we repeated the E-NE DMIN Time (s) CH2 experiments with an absolute runtime limit that was Coconut empirically determined per dataset to ensure that the LR replica terminated in each run (CH2L). Especially for the CH2 0.552 ± 3.53e-2 0.501 ± 9.76e-2 27.6 ± 06.0 large pea dataset, significantly more time was needed in CH3 0.576 ± 9.35e-5 0.540 ± 0.00e-0 37.5 ± 07.9 this configuration. Table 2 shows that CH2L is indeed CH2L 0.569 ± 5.91e-4 0.548 ± 0.00e-0 31.0 ± 00.1 able to construct cores with a much higher minimum dis- Maize tance than CH2, and also outperforms CH3 in terms of CH2 0.416 ± 1.52e-2 0.396 ± 2.46e-2 78.3 ± 10.6 this measure. Although differences in minimum distance CH3 0.435 ± 2.70e-4 0.409 ± 3.05e-3 74.3 ± 26.5 obtained with CH2L and CH3 are not larger than 4%, they are statistically significant for the coconut and maize CH2L 0.429 ± 5.00e-4 0.415 ± 1.11e-3 78.6 ± 02.0 datasets (p = 0.000097), but not for the pea dataset (p = Pea 0.3064). Moreover, CH3 still yields significantly higher- CH2 0.219 ± 1.49e-3 0.000 ± 0.00e-0 85.6 ± 04.5 quality core collections in terms of the E-NE criterion CH3 0.338 ± 1.04e-3 0.287 ± 1.34e-2 154.1 ± 49.7 (p = 0.000097), and is faster for large datasets. CH2L 0.325 ± 8.21e-4 0.297 ± 0.00e-0 802.3 ± 00.8 CH2 maximizes a weighted index including average and minimum pairwise Comparison with GDOpt and SimEli distance, with equal weight, while CH3 maximizes E-NE. Mean E-NE, DMIN, runtime We approximated the Pareto front obtained by Core and corresponding standard deviations are reported for 10 independent Hunter 3 when simultaneously optimizing E-NE, and executions. The highest obtained E-NE and DMIN value per dataset is shown in bold. CH3 was terminated when no improvements were found during 10 s. For CH2, either A-NE or HE, with varying weights α ∈[0, 1] and two alternatives were considered: (a) the same stop condition as for CH3 (CH2); and α = 1 − α , respectively, and compared the results with 2 1 (b) an absolute runtime limit that was empirically determined per dataset to ensure those obtained by GDOpt and SimEli. Note that A-NE is that the LR replica of MixRep terminated in each run (CH2L) De Beukelaer et al. BMC Bioinformatics (2018) 19:203 Page 8 of 12 minimized, while E-NE and HE are maximized. As before, Core Hunter 3 is able to simultaneously improve over CH3 was terminated when no improvement was found SimEli in terms of both objectives (E-NE and HE value). during 10 s. Figure 2 shows that GDOpt and CH3 are able Average execution times of GDOpt, SimEli and CH3 to construct representative cores with low A-NE, which is (configured to optimize E-NE, A-NE or HE) are reported not the case for SimEli. In fact, all cores sampled by SimEli in Table 3. Core Hunter 3 was slower than GDOpt and have a worse A-NE value than those obtained by GDOpt SimEli for the rice and coconut datasets. For the maize and CH3, even when the latter is configured to maximize dataset CH3 was faster than GDOpt and SimEli-HE when E-NE only. On the other hand, SimEli scores much bet- maximizing HE or E-NE but slower when minimizing A- ter than GDOpt in terms of diversity (high E-NE). Still, NE and always slower than SimEli-A-RA. Finally, for the Core Hunter 3 is able to find cores which simultaneously pea dataset, CH3 was faster than both GDOpt and SimEli. have a higher diversity and are more representative than Core Hunter 3 was also consistently faster when maximiz- those obtained with SimEli. For the maize dataset, SimEli- ing HE as compared to the configurations where E-NE or A-RA and SimEli-HE found cores of similar quality, while A-NE were optimized. for the coconut and pea dataset SimEli-A-RA showed to be preferred in terms of both E-NE and A-NE. For the rice Discussion dataset, SimEli-HE was not included because expected Depending on the purpose of a core collection, a vari- heterozygosity can only be evaluated for genotypic data. ety of metrics is used to evaluate its quality. Distance- Figure 3 shows that GDOpt yields cores with significantly based measures are attractive because they are intuitive lower HE than any of the other methods. SimEli performs to understand and can capture both diversity within the better in this respect, especially SimEli-HE, but as before core as well as representativeness of the accessions from Fig. 2 Simultaneous optimization of entry-to-nearest-entry (E-NE) and accession-to-nearest-entry (A-NE) distance. These Pareto front approximations for Core Hunter 3 were obtained by sampling cores with varying weights α ∈[0, 1] and α = 1 − α assigned to the E-NE and A-NE 1 2 1 measures, respectively, with a step size of 0.05. The quality of the cores constructed by CH3 is compared with those obtained by GDOpt and SimEli, in terms of both objective functions. All reported values are averages of 10 independently sampled cores with the same settings De Beukelaer et al. BMC Bioinformatics (2018) 19:203 Page 9 of 12 Fig. 3 Simultaneous maximization of entry-to-nearest-entry distance (E-NE) and expected heterozygosity (HE). These Pareto front approximations for Core Hunter 3 were obtained by sampling cores with varying weights α ∈[0, 1] and α = 1 − α assigned to the E-NE and HE measures, 1 2 1 respectively, with a step size of 0.05. The quality of the cores constructed by CH3 is compared with those obtained by GDOpt and SimEli, in terms of both objective functions. All reported values are averages of 10 independently sampled cores with the same settings. The rice dataset is excluded here because expected heterozygosity can only be evaluated for genotypic data the full collection, computed from either genetic mark- minimum distance, meaning that the search has no clue as ers or phenotypes. However, pairwise distances need to be to whether these modifications may eventually lead to an aggregated in an appropriate way to evaluate the selected improved solution. To smooth out the objective function, core. Although many studies and methods have used aver- CH2 maximized a combination of average and minimum age pairwise distance to assess the diversity in the core, distance. Also, the applied MixRep algorithm includes a it is known that a high average does not guarantee that constructive LR heuristic (see “Results”), which is much all accessions in the core are sufficiently different from better suited to maintain a high minimum distance as it each other [4, 19]. Maximizing this criterion tends to iteratively adds accessions to an initially empty selection. overrepresent the extremes of the distribution in the full Unfortunately, the LR algorithm becomes slow for large collection. datasets, because it builds the core bottom-up, instead of Core Hunter 2 addressed this issue by maximizing min- iteratively refining a randomly chosen initial selection. imum distance in addition to average distance, using Two new distance-based metrics, entry-to-nearest- a complex mixed replica search (MixRep) consisting of entry (E-NE) and accession-to-nearest-entry (A-NE), different cooperating strategies [19]. The original Core introduced by [4], were shown to generate improved cores Hunter software used a local search algorithm to optimize for specific goals. The E-NE criterion takes all acces- the chosen evaluation measure, but such local searches sions into account and can therefore presumably be more are not well suited to optimize minimum distance because effectively optimized with local searches as compared to this measure is very sensitive to the precise selection. Sim- minimum distance, but still focuses on maintaining a high ilar cores may have very different values, while at the same distance between each pair of closest accessions which, time very different cores may have a similar or even the in contrast to average pairwise distance, avoids overrep- same minimum distance. This makes it difficult for a local resentation of extreme values. Therefore, in Core Hunter search to find its way from a randomly generated selection 3, the minimum distance measure was replaced with the to a high-quality core. In particular, for a given current newly proposed E-NE criterion. The A-NE metric was solution, many possible modifications may not affect the also included to sample cores that maximally represent all individual accessions from the full collection. We assessed whether the new E-NE metric can indeed Table 3 Average execution times (seconds) of GDOpt, both be effectively optimized with local search algorithms, in SimEli implementations and CH3 for 10 independent samples an attempt to avoid the complexity of the MixRep algo- from each dataset. Three configurations are considered for CH3: rithm used by CH2, and in particular the slowness of the (a) maximize E-NE; (b) minimize A-NE; and (c) maximize HE LR replica. We showed that even a very basic stochastic Rice Coconut Maize Pea hill-climber (random descent) can already construct cores GDOpt 14.9 7.1 91.2 350.1 with high E-NE value and quite little variability in qual- ity across independent samples. Still, the value of the core SimEli-A-RA 7.6 7.5 11.5 514.7 is further improved, and variability further reduced, when SimEli-HE - 15.9 78.0 502.3 using the more advanced parallel tempering algorithm. CH3 E-NE 45.8 37.5 74.3 154.1 Since parallel tempering takes advantage of modern multi- CH3 A-NE 74.6 55.7 133.1 86.7 core CPUs, the associated computational overhead is very CH3 HE - 16.6 40.2 62.8 limited. In our experiments, even for the large pea dataset De Beukelaer et al. BMC Bioinformatics (2018) 19:203 Page 10 of 12 with over 4000 accessions, parallel tempering was only it may be confusing that there is a possibly large time gap marginally slower than random descent. We also assessed between the last improvement found by the other replicas whether a genetic algorithm could further improve these and that obtained when the LR replica has finished. In this results. Such global optimization strategy iteratively com- respect, CH3 is more user-friendly because it uses a well- bines currently known high-quality solutions (crossover) known local search algorithm that gradually improves the in an attempt to explore other interesting regions of the E-NE value of the core. Large gaps between significant solution space. The obtained solutions are then exploited improvements are not expected, which makes it easier to by applying local modifications (mutation). We used the determine an appropriate time limit and even more so to random descent heuristic as a mutation operator, since it use a convenient adaptive stop condition such as a maxi- showed to be able to effectively improve the E-NE value mum time without finding an improvement, in which case of a given selection. Although the genetic algorithm out- the execution time is automatically adjusted—to some performed random descent, it showed to be slower and extent—to the size of the collection. produced cores with slightly lower E-NE values as com- One of the main advantages of Core Hunter 3 and pre- pared to parallel tempering. These results indicate that the vious versions is its flexibility. While other methods are intelligent exploitation of parallel tempering is more effec- often developed for a specific purpose such as maximiz- tive to optimize E-NE than the more global exploration ing diversity, representativeness, or allelic richness, Core of the evaluated genetic algorithm. We thus conclude Hunter is suited for each of these as it includes a variety of that parallel tempering is preferred, and that more com- evaluation measures that can directly be optimized, and if plex algorithms are not needed to optimize E-NE, since desired combined in a weighted index. We compared CH3 a basic stochastic hill-climber (random descent) already with GDOpt, designed to maximize representativeness, yields high-quality cores and a global optimization engine and SimEli, where the elimination criterion was chosen (genetic algorithm) did not provide any further advantage. either to maximize diversity (SimEli-A-RA) or expected Moreover, parallel tempering does not yield a significant heterozygosity (SimEli-HE). Core Hunter was configured computationaloverhead—itisalmostasfastasrandom to optimize a weighted index including E-NE and either A- descent. We assume that the same conclusion holds for NE (Fig. 2)orHE(Fig. 3), with varying weights, in order to A-NE duetothevery similarcomposition of both met- approximate the corresponding Pareto front. The results rics. Therefore, Core Hunter 3 uses parallel tempering showed that, as expected, GDOpt is especially suited to by default, which is also known to effectively optimize construct cores that optimally represent all accessions the other measures that were already included in CH2, from the entire collection (low A-NE), as it was specifically such as allelic richness [19]. Afastmodeisalsoprovided developed for this purpose. On the other hand, in terms in which the basic random descent algorithm is applied, of diversity (E-NE) and allelic richness (HE), SimEli scores in case execution time is critical, but it was not used in much better than GDOpt. From the two considered elim- this study. ination criteria, SimEli-HE resulted in the highest allelic To validate the effectiveness of the new E-NE mea- richness, while SimEli-A-RA showed to be most suited sure, we assessed whether maximizing E-NE indirectly to maximize diversity (E-NE). Again, this was expected also yields a high minimum distance. A comparison with and confirms that the SimEli method can be adjusted to Core Hunter 2, configured to sample cores with high aver- some extent, by using an appropriate elimination criterion age and minimum distance, revealed that this is indeed depending on the purpose of the core collection. How- the case. The minimum distance obtained with CH3 is ever, Core Hunter 3 found cores that simultaneously have slightly lower as compared to CH2, but more importantly higher E-NE (more diverse), and lower A-NE (more rep- CH3 yields higher E-NE values because it actively opti- resentative) or higher HE values (higher allelic richness), mizes this criterion. As the minimum distance captures than those obtained by SimEli. In addition, CH3 was able less information about the core than E-NE, we believe that to construct equally representative cores as GDOpt, and the latter criterion better reflects within-core diversity. As thus combines and improves over the advantages of both expected, CH3 was faster than CH2 for large datasets, other methods. due to the quadratic time complexity of the LR replica. A comparison of execution times showed that CH3 Because of its constructive nature, LR only produces use- needs less time to optimize HE as compared to E-NE ful results if given enough time to complete. Therefore, and A-NE. This is not surprising, as it is known that a potential additional issue of CH2 is that the user is allelic richness can also be effectively maximized with a responsible to set an appropriate time limit that allows basic stochastic hill-climber [19]. As we showed that the the LR replica to complete, when aiming at a high min- more advanced parallel tempering algorithm is preferred imum distance. It is not possible to affect the execution to optimize E-NE and A-NE, it is clearly more difficult to time of the LR replica and therefore this method does find cores with high E-NE and low A-NE than to maxi- not provide a quality-runtime tradeoff to the user. Also, mize allelic richness. In our experiments CH3 was slower De Beukelaer et al. BMC Bioinformatics (2018) 19:203 Page 11 of 12 than GDOpt and SimEli for smaller datasets but faster for function value, minimizing the core size may not always the large pea dataset. Note that although these methods be desired, depending on the purpose of the core. We were implemented in different programming languages, are therefore convinced that fixed and variable size core which affects their absolute execution times, the latter sampling should be treated as separate problems, using does not affect the observed trend in their execution times specific evaluation measures and optimization strategies. when sampling from increasingly large collections. Here, the main advantage of Core Hunter is again its flexibil- Conclusions ity. For example, the runtime of SimEli is determined We introduced Core Hunter 3 (CH3) and showed that by the size of the dataset and the sampled core. When it constructs core collections with high diversity (high sampling a small core from a large collection, many acces- entry-to-nearest-entry distance; E-NE) and which maxi- sions need to be eliminated, and finding the two most mally represent the individual accessions from the entire similar accessions in each step as well as deciding which collection (low accession-to-nearest-entry distance; A- one to eliminate requires many computations. In contrast, NE) using flexible and fast local search algorithms. By the runtime of Core Hunter can be adjusted by using an default, the parallel tempering algorithm is used. Version appropriate stop condition. It is possible to limit the total 3 improves over Core Hunter 2 (CH2) in multiple ways. runtime, but we used an adaptive condition that termi- CH3 is able to find cores with higher E-NE, within less nated the search when no more improvement was found time for large datasets, which also have a high minimum during 10 s. distance, without the need for a more complex algorithm There is of course a tradeoff between execution time and like the mixed replica search from CH2. In addition, CH3 solution quality, and we may be able to further increase finds similar and often better cores than GDOpt and the quality of the core collections sampled from any of our SimEli, which were reported to outperform CH2 in terms datasets by allowing a longer runtime. For the large pea of E-NE and A-NE. In particular, CH3 can create equally dataset for example, we indeed see that allowing no more representative cores as GDOpt, which was designed for than 10 s without finding further improvements (Table 2) this purpose, while at the same time being able to con- yields a slightly lower E-NE value as compared to a config- struct cores that are simultaneously more diverse, and uration with an absolute runtime limit of 30 min (Table 1). either are more representative or have a higher allelic Since each of the tested methods was able to sample cores richness, than cores obtained with SimEli. As in previ- from collections with up to multiple thousands of acces- ous versions, one of the main strengths of Core Hunter is sionsinatmostafewminutes,wedonot expectthat its flexibility. The applied local search algorithms are not the execution time of any of these algorithms will be lim- confined to a specific evaluation measure and new criteria iting for most practical applications. Still, Core Hunter can easily be introduced and optimized without the need is the only one whose runtime can be controlled by the to alter the underlying algorithms. Moreover, multiple cri- user in various ways, which yields an interesting quality- teria can be simultaneously optimized and the execution runtime tradeoff that can be used to either reduce the time is controlled by the user through various stop condi- execution time for large datasets when needed, or to more tions, which offers a convenient quality-runtime tradeoff. thoroughly explore the solution space when more time We therefore believe that Core Hunter is a very broadly is available, neither of which is possible with the other applicable core subset selection tool with a lot of opportu- methods. Note that although we did not experiment with nities to be further extended. For example, we may explore genotypic datasets with tens or hundreds of thousands of the ability of Core Hunter 3 to sample cores based on markers, these can easily be dealt with by precomputing a a combination of genotypes and phenotypes, or extend distance matrix, if necessary, so that only the number of Core Hunter to properly incorporate variable size core accessions affects the performance of Core Hunter. sampling such as a method to construct covering cores of minimum size. Variable size core sampling Abbreviations Previous versions of Core Hunter also supported variable A-NE: Average accession-to-nearest-entry distance; CH2: Core Hunter 2; CH3: size core sampling. We decided to remove this function- Core Hunter 3; DMIN: Minimum distance; E-NE: Average entry-to-nearest-entry distance; GDOpt: Genetic distance optimization; HE: Expected heterozygosity; ality from Core Hunter 3, and to focus on fixed size core MixRep: Mixed replica search; REMC: Replica exchange Monte Carlo search sampling for the provided evaluation measures, because these measures are not generally applicable to compare Acknowledgements cores of different sizes. For example, reducing the core We thank Nathan Sinnesael who performed preliminary experiments that size artificially increases dissimilarity between selected supported the development of Core Hunter 3. The computational resources (Stevin Supercomputer Infrastructure) and services used in this work were accessions, while adding more accessions always yields a provided by the VSC (Flemish Supercomputer Center), funded by Ghent more representative core. Also, while CH1 and CH2 pre- University, the Hercules Foundation and the Flemish Government - ferred thesmallest of twocores with thesameobjective department EWI. De Beukelaer et al. BMC Bioinformatics (2018) 19:203 Page 12 of 12 Funding 12. Wang J, Hu J, Xu H, Zhang S. A strategy on constructing core collections Herman De Beukelaer is supported by a Ph.D. grant from the Research by least distance stepwise sampling. Theor Appl Genet. 2007;115(1):1–8. Foundation - Flanders (FWO). 13. Krishnan RR, Sumathy R, Ramesh S, Bindroo B, Naik GV. SimEli: Similarity elimination method for sampling distant entries in development of core Availability of data and materials collections. Crop Sci. 2014;54(3):1070–8. The raw rice, coconut and maize datasets are available from the cited 14. Odong T, van Heerwaarden J, Jansen J, van Hintum TJ, van Eeuwijk F. references [14, 15, 27] or on request. The raw pea dataset and computed Statistical techniques for defining reference sets of accessions and distance matrices are also available on request. microsatellite markers. Crop Sci. 2011;51(6):2401–11. 15. Kim K-W, Chung H-K, Cho G-T, Ma K-H, Chandrabalan D, Gwag J-G, Kim T-S, Authors’ contributions Cho E-G, Park Y-J. PowerCore: a program applying the advanced m HDB and GD implemented the Core Hunter 3 library in Java. HDB was strategy with a heuristic search for establishing core sets. Bioinformatics. responsible for the R package while GD developed the graphical interface. 2007;23(16):2155–62. HDB performed all experiments, under the supervision of VF. HDB wrote the 16. Jeong S, Kim J-Y, Jeong S-C, Kang S-T, Moon J-K, Kim N. Genocore: A initial manuscript with all authors contributing to the final version. All authors simple and fast algorithm for core subset selection from large genotype read and approved the final manuscript. datasets. PLoS ONE. 2017;12(7):0181420. 17. Jansen J, Van Hintum T. Genetic distance sampling: a novel sampling Ethics approval and consent to participate method for obtaining core collections using genetic distances with an Not applicable. application to cultivated lettuce. Theor Appl Genet. 2007;114(3):421–8. 18. Thachuk C, Crossa J, Franco J, Dreisigacker S, Warburton M, Davenport Competing interests GF. Core Hunter: an algorithm for sampling genetic resources based on The authors declare that they have no competing interests. multiple genetic measures. BMC Bioinformatics. 2009;10(1):1. 19. De Beukelaer H, Smykal ` P, Davenport GF, Fack V. Core Hunter II: fast core Publisher’s Note subset selection based on multiple genetic diversity measures using Springer Nature remains neutral with regard to jurisdictional claims in mixed replica search. BMC Bioinformatics. 2012;13(1):1. published maps and institutional affiliations. 20. Holland JH. Adaptation in Natural and Artificial Systems: an Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. Author details Ann Arbor: U Michigan Press; 1975. Department of Applied Mathematics, Computer Science and Statistics, Ghent 21. Wright S. Evolution and genetics of populations. vol IV. Chicago: The University, Krijgslaan 281 S9, 9000 Gent, Belgium. New Zealand Institute for University of Chicago Press; 1978. p. 91. Plant & Food Research Limited, 412 No1 Rd RD2, Te Puke, New Zealand. 22. Gower JC. A general coefficient of similarity and some of its properties. J C Gower Biometrics. 1971;27(4):857–71. Received: 12 October 2016 Accepted: 16 May 2018 23. Berg EE, Hamrick J. Quantification of genetic diversity at allozyme loci. Can J For Res. 1997;27(3):415–24. 24. Marler RT, Arora JS. Function-transformation methods for multi-objective optimization. Eng Optim. 2005;37(6):551–70. References 25. Kirkpatrick S, Gelatt CD, Vecchi MP. Optimization by simulated annealing. 1. Frankel O, et al. Genetic perspectives of germplasm conservation. Genetic Science. 1983;220(4598):671–80. manipulation: impact on man and society. Cambridge: Cambridge 26. Kaufman L, Rousseeuw PJ. Chapter 2 Partitioning Around Medoids University Press; 1984. pp. 161–170. (Program PAM) in Finding groups in data: an introduction to cluster 2. El Bakkali A, Haouane H, Moukhli A, Costes E, Van Damme P, Khadari B. analysis. New York: Wiley; 1990. pp. 68–125. Construction of core collections suitable for association mapping to 27. Wimmer V, Albrecht T, Auinger H-J, Schoen C-C. synbreedData: Data for optimize use of mediterranean olive (olea europaea l.) genetic resources. the Synbreed Package. 2015. R package version 1.5. https://CRAN.R- PLoS ONE. 2013;8(5):1–13. https://doi.org/10.1371/journal.pone.0061265. project.org/package=synbreedData. 3. Muñoz-Amatriaín M, Cuesta-Marcos A, Endelman JB, Comadran J, 28. Jing R, Vershinin A, Grzebyta J, Shaw P, Smykal ` P, Marshall D, Ambrose Bonman JM, Bockelman HE, Chao S, Russell J, Waugh R, Hayes PM, MJ, Ellis TN, Flavell AJ. The genetic diversity and evolution of field pea Muehlbauer GJ. The usda barley core collection: Genetic diversity, (pisum) studied by high throughput retrotransposon based insertion population structure, and potential for genome-wide association studies. polymorphism (rbip) marker analysis. BMC Evol Biol. 2010;10(1):1. PLoS ONE. 2014;9(4):1–13. https://doi.org/10.1371/journal.pone.0094688. 29. Smykal ` P., Kenicer G, Flavell AJ, Corander J, Kosterin O, Redden RJ, Ford 4. Odong T, Jansen J, Van Eeuwijk F, van Hintum TJ. Quality of core R, Coyne CJ, Maxted N, Ambrose MJ, et al. Phylogeny, phylogeography collections for effective utilisation of genetic resources review, discussion and genetic diversity of the pisum genus. Plant Genet Resour. 2011;9(01): and interpretation. Theor Appl Genet. 2013;126(2):289–305. 4–18. 5. Wang J-C, Hu J, Liu N-N, Xu H-M, Zhang S. Investigation of combining 30. De Beukelaer H, Davenport GF, De Meyer G, Fack V. JAMES: An plant genotypic values and molecular marker information for object-oriented java framework for discrete optimization using local constructing core subsets. J Integr Plant Biol. 2006;48(11):1371–8. search metaheuristics. Softw Pract Experience. 2017;47(6):921–38. 6. Franco J, Crossa J, Desphande S. Hierarchical multiple-factor analysis for 31. R Core Team. R: A Language and Environment for Statistical Computing. classifying genotypes based on phenotypic and genetic data. Crop Sci. Vienna, Austria: R Foundation for Statistical Computing; 2016. R 2010;50(1):105–17. Foundation for Statistical Computing. https://www.R-project.org/. 7. Borrayo E, Machida-Hirano R, Takeya M, Kawase M, Watanabe K. 32. Hollander M, Wolfe DA, Chicken E. Nonparametric Statistical Methods. Principal components analysis-k-means transposon element based foxtail Chichester: Wiley; 2013. millet core collection selection method. BMC Genet. 2016;17(1):1. 33. Holm S. A simple sequentially rejective multiple test procedure. Scand J 8. Brown A. Core collections: a practical approach to genetic resources Stat. 1979;65–70. management. Genome. 1989;31(2):818–24. 9. Franco J, Crossa J, Taba S, Shands H. A sampling strategy for conserving genetic diversity when forming core subsets. Crop Sci. 2005;45(3): 1035–44. 10. Schoen D, Brown A. Conservation of allelic richness in wild crop relatives is aided by assessment of genetic markers. Proc Natl Acad Sci. 1993;90(22):10623–7. 11. Gouesnard B, Bataillon T, Decoux G, Rozale C, Schoen D, David J. MSTRAT: An algorithm for building germ plasm core collections by maximizing allelic or phenotypic richness. J Hered. 2001;92(1):93–4. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png BMC Bioinformatics Springer Journals

Core Hunter 3: flexible core subset selection

Free
12 pages
Loading next page...
 
/lp/springer_journal/core-hunter-3-flexible-core-subset-selection-UXH0fDB6LV
Publisher
BioMed Central
Copyright
Copyright © 2018 by The Author(s)
Subject
Life Sciences; Bioinformatics; Microarrays; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Algorithms
eISSN
1471-2105
D.O.I.
10.1186/s12859-018-2209-z
Publisher site
See Article on Publisher Site

Abstract

Background: Core collections provide genebank curators and plant breeders a way to reduce size of their collections and populations, while minimizing impact on genetic diversity and allele frequency. Many methods have been proposed to generate core collections, often using distance metrics to quantify the similarity of two accessions, based on genetic marker data or phenotypic traits. Core Hunter is a multi-purpose core subset selection tool that uses local search algorithms to generate subsets relying on one or more metrics, including several distance metrics and allelic richness. Results: In version 3 of Core Hunter (CH3) we have incorporated two new, improved methods for summarizing distances to quantify diversity or representativeness of the core collection. A comparison of CH3 and Core Hunter 2 (CH2) showed that these new metrics can be effectively optimized with less complex algorithms, as compared to those used in CH2. CH3 is more effective at maximizing the improved diversity metric than CH2, still ensures a high average and minimum distance, and is faster for large datasets. Using CH3, a simple stochastic hill-climber is able to find highly diverse core collections, and the more advanced parallel tempering algorithm further increases the quality of the core and further reduces variability across independent samples. We also evaluate the ability of CH3 to simultaneously maximize diversity, and either representativeness or allelic richness, and compare the results with those of the GDOpt and SimEli methods. CH3 can sample equally representative cores as GDOpt, which was specifically designed for this purpose, and is able to construct cores that are simultaneously more diverse, and either are more representative or have higher allelic richness, than those obtained by SimEli. Conclusions: In version 3, Core Hunter has been updated to include two new core subset selection metrics that construct cores for representativeness or diversity, with improved performance. It combines and outperforms the strengths of other methods, as it (simultaneously) optimizes a variety of metrics. In addition, CH3 is an improvement over CH2, with the option to use genetic marker data or phenotypic traits, or both, and improved speed. Core Hunter 3 is freely available on http://www.corehunter.org. Keywords: Core collections, Multi-objective, Local search heuristics Background working) and base collections. Examples of active collec- Genebanks were established by national or international tions include seed stores or live plants that can be accessed breeding, or conservation programs with the goal to safe- quickly by plant breeders and researchers through germi- guard genetic diversity for future use. Many breeding pro- nation or clonal propagation. In contrast, accessions in grams have established genebanks as a resource for new base collections are held in long-term storage, such as variation in the crops they breed, allowing them to react cryopreservation, and require some time for regeneration to changing environments and emerging biotic and abiotic and propagation before being made available. stresses. Accessions are often divided between active (or During the last few decades the collections stored in genebanks have grown enormously, and cost of main- taining viable germplasm within genebanks has increased. *Correspondence: herman.debeukelaer@ugent.be Department of Applied Mathematics, Computer Science and Statistics, Ghent Genebank curators must make decisions about which University, Krijgslaan 281 S9, 9000 Gent, Belgium accessions to maintain in the active collection versus the Full list of author information is available at the end of the article base collection, and may even consider not maintaining © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. De Beukelaer et al. BMC Bioinformatics (2018) 19:203 Page 2 of 12 an accession at all. The concept of a core collection was [16]. The genetic distance sampling strategy constructs introduced to help with these decisions, and is defined as cores with a given minimum distance between selected subset of the complete collection which most represents accessions by repeatedly including a random acces- the diversity of the entire collection with minimum redun- sion and removing all others within a certain sampling dancy [1]. Genebank curators can use core collections to radius [17]. define the active collection over the base collection. Core Core Hunter was designed to meet the variety of crite- collections can also be used to aid researchers and plant ria used to evaluate core collections for different purposes, breeders in the choice of starting material. For example, and supports optimization of several of these metrics, the potential for use of core collections has been shown using flexible local search algorithms [18]. Core Hunter for association studies [2, 3]. can construct core collections for specific applications, A variety of measures have been used to evaluate core and combines multiple objectives to bring the different collections based on genetic marker data or phenotypic perspectives closer together, for example by simultane- traits, including pairwise distances and allelic richness. ously maximizing genetic dissimilarity and allelic rich- The choice of the most appropriate evaluation measure ness. Although Core Hunter is mainly focused at fixed size depends on the purpose of the core collection [4]. Some- core subset selection, version 1 and 2 allowed to spec- times core collections are sampled based on a combi- ify a minimum and maximum size and preferred smaller nation of both genotypes and phenotypes [5–7]. Many cores with the same value. Core Hunter was shown to methods have been proposed to sample high quality core outperform stratified sampling strategies, MSTRAT and collections according to the measure(s) of interest. The PowerCore. first methods were stratified sampling techniques that It has been assumed that, to obtain a diverse core, the cluster the accessions, based on distance matrices calcu- average distance between its entries should be maximized lated from their allele scores or phenotypic trait values, [9, 18]. However, a high entry-to-entry distance does not and then select several accessions from each cluster using guarantee that selected accessions are sufficiently differ- a certain allocation method. Brown suggested to randomly ent, and it is known that maximizing this criterion over- select either a constant (C) number of accessions per clus- represents extreme values [4, 19]. Core Hunter 2 (CH2) ter, or a number proportional (P) to the size or logarithm deals with this issue by also maximizing the minimum (L) of the size of the cluster, and argued that the L-method distance between selected accessions [19]. Although aver- is preferred [8]. It was later shown that more diverse cores age distance and allelic richness can be effectively opti- are obtained when the number of included accessions is mized using simple and fast local search algorithms, such proportional to the within-cluster diversity [9]. as a stochastic hill-climber, a more complex and slower Another allocation method, the M-method, maximizes mixed replica search (MixRep) was required to maximize the probability to retain all observed alleles in order to minimum distance in the Core Hunter framework. The construct cores with high allelic richness [10]. This idea MixRep algorithm runs multiple types of stochastic local led to the development of the MSTRAT software, which searches in parallel, as well as a constructive algorithm implements a generalized M-method that directly sam- (LR) that starts from an empty selection that is itera- ples from the entire collection to maximize allelic richness tively extended. In case an active search is unable to find with a simple hill-climbing algorithm [11]. Other heuris- any further improvements, it is terminated and replaced tics work by repeatedly removing one of the two most with a new local search engine starting from a selection similar accessions from the collection until the desired that is obtained by combining two previously found high- core size is obtained, either randomly (least distance quality selections, in an attempt to further explore other stepwise sampling [12]), or using a specific elimination interesting regions of the search space, as in a genetic criterion maximizing the distance to the remaining acces- algorithm [20]. sions or expected heterozygosity of the reduced collection Another approach to maximize diversity, while at the (SimEli) [13]. The genetic distance optimization strategy same time avoiding inclusion of too similar accessions at (GDOpt) was designed to construct highly representative the extremes of the collection, is to maximize the average cores, in which each accession from the entire collection distance between each entry and the closest other entry is represented by a similar core entry [14]. GDOpt par- in the core, as proposed by Odong et al. [4]. The SimEli titions the data around a number of identified medoids, algorithm was shown to outperform Core Hunter 2 in which are then selected as the core entries. Methods terms of this new entry-to-nearest-entry (E-NE) metric. for variable size core sampling have also been devel- Alternatively, one may desire to optimally represent the oped. PowerCore minimizes the size of the core, while individual accessions, instead of the whole range of diver- covering all observed marker alleles and/or trait values sity. In such case, Odong et al. reccomend to minimize the [15]. GenoCore was developed for the same purpose, average distance between each accession in the full collec- and specifically tailored to high-density marker datasets tion and the most similar accession contained in the core. De Beukelaer et al. BMC Bioinformatics (2018) 19:203 Page 3 of 12 The GDOpt strategy was specifically developed to mini- Evaluation measures mize this accession-to-nearest-entry (A-NE) metric, and Core Hunter 3 includes various evaluation measures that shown to outperform both Core Hunter 2 and SimEli for can be selected as optimization objectives, including but this purpose [13, 14]. not limited to those described below. We refer to the We introduce Core Hunter 3 (CH3), which incorporates website http://www.corehunter.org for an overview of all the two improved methods for summarizing distances, provided measures. entry-to-nearest-entry (E-NE) and accession-to-nearest- Distance measures entry (A-NE), proposed by Odong et al. [4]. CH3 attempts We used the Modified Roger’s distance [18, 21]toassess to find the maximum entry-to-nearest-entry distance to the dissimilarity of accessions based on genetic marker obtain diverse cores, whereas accession-to-nearest-entry data. For phenotypic traits we used Gower’s distance [22] distance is minimized to represent as much as possible which simultaneously takes into account qualitative and all accessions from the entire collection. More specifi- quantitative traits. Pairwise distances are aggregated as cally, CH3 can sample fixed size cores based on molecular follows to evaluate the diversity or representativeness of marker data, phenotypic traits, a precomputed distance the core [4]: matrix, or a combination of these. The distance matrix can be generated using an appropriate measure, such as Entry-to-nearest-entry (E-NE): the average distance Modified Roger’s distance for genotypes [21] or Gower’s between each selected accession and the closest other distance for phenotypes [22]. As in previous versions, core entry. This criterion can be maximized to Core Hunter 3 can also maximize allelic richness, as well construct highly diverse cores in which all accessions as a combination of multiple metrics. In particular, we are maximally different. assess whether the new distance-based E-NE and A-NE Accession-to-nearest-entry (A-NE): the mean metrics can be effectively optimized using fast local search distance between each accession from the entire algorithms, and whether maximizing E-NE indirectly also collection and the most similar core entry, including yields a high minimum distance, without the need for itself in case the accession has been selected. a more complex algorithm. Furthermore, we assess the Minimizing this criterion yields cores that maximally ability of Core Hunter 3 to simultaneously maximize E- represent all individual accessions. NE and A-NE, or E-NE and allelic richness, and com- pare the results with those obtained with Core Hunter When comparing CH3 with CH2 we also evaluated the 2, GDOpt, and SimEli, for three marker datasets with minimum distance (DMIN) between selected accessions, different allelic composition and varying size, and one but this is not an objective that can be directly opti- phenotypic trait dataset. Core Hunter 3 is available as an mized by CH3, for reasons explained in the discussion. Rpackage corehunter on CRAN and as an open source A detailed description and comparison of the E-NE and project on GitHub. A prototype graphical user interface A-NE metrics are provided in [4]. is also available. See http://www.corehunter.org for more Allelic richness information. To evaluate the allelic richness of cores sampled based on genetic marker data, we used the average expected Methods heterozygosity (HE) per locus [18, 23], calculated as Core selection problem Given a collection A that contains n accessions, and a desired core size 1 < k < n, the feasible solution space of 0 ≤ HE = 1 − p ˆ ≤ 1 la possible core subsets is defined as l=1 a=1 = {C | C ⊂ A ∧ |C| = k} where L is thenumberofmarkers (loci), n is the number of observed alleles at the lth locus, and p ˆ is the fre- la where |C| denotes the size of the subset. The core selection quency of the ath allele at the lth locus in the selected core problem then consists of finding an optimal subset C ∈ collection. that maximizes a certain evaluation measure F(C) :  → R,i.e. Weighted index and normalization As in previous versions, Core Hunter can simultaneously optimize k measures by maximizing a weighted index C = argmax F(C). C∈ In case the evaluation measure F(C) is intended to be F(c) = α F (c) i i minimized, this can be achieved by maximizing −F(C). i=1 De Beukelaer et al. BMC Bioinformatics (2018) 19:203 Page 4 of 12 where F is the ith included evaluation measure and 0 < Algorithm 1 Random descent. α < 1 is the weight assigned to this objective, with i Input: collection A,coresize(k), evaluation measure α = 1. In case of a measure F that is to be F(C), neighbourhood N (C) :  → P () with  = i i i=1 minimized, such as A-NE, it is transformed into a max- {C | C ⊂ A ∧ |C| = k} imization objective F =−F when it is included in the Output: best found core C ∈ weighted index. The individual measures are automati- 1: C ← random element of cally normalized to [ 0, 1], following the Pareto minimum 2: repeat based upper-lower-bound approach as described in [24], 3: pick random neighbour C ∈ N (C) to ensure a fair balance between the included objectives, 4: if F(C )> F(C) then independent of their original range. More information 5: C ← C about this normalization is provided in the documenta- 6: end if tion of the R package. 7: until stop condition satisfied 8: return C Core sampling algorithms We evaluate the performance of three general purpose selection heuristics to optimize the chosen evaluation higher probability to accept inferior modifications, simi- measureorweightedindex forafixedcoresize: ran- lar to the frequently used simulated annealing algorithm dom descent, parallel tempering, and a genetic algorithm. [25]. The acceptance function is commonly defined as Based on the findings in this study, only the former two were included in Core Hunter 3, which defaults to the par- 1if > 0 p(, t) = allel tempering algorithm, but also provides a fast mode /t e else in which the random descent algorithm is applied. Note that these two stochastic local search algorithms were where  = F(C ) − F(C ) and t is the temperature of also available in CH2, although they were not used by the replica. This acceptance function ensures that neigh- default. The search algorithms are executed until either an bours with a better score are always accepted, whereas absolute runtime limit has been exceeded, or no further inferior neighbours are accepted at a probability that improvements were obtained during a certain amount exponentially decreases as the solution gets poorer or as of time. the temperature is decreased. In addition, searches with similar temperature periodically exchange their current Random descent selection, which has the effect to push the most promis- This basic local search outlined in Algorithm 1 starts with ing solutions towards the coolest searches to promote a random selection of the desired size and then iteratively convergence towards a common solution, and the worst tries to improve its quality by slightly modifying the core. solutions towards the hottest searches allowing them to The obtained similar selection, referred to as a neighbour escape from local optima. The probability that replica r of the current selection, is accepted if and only if it has and r + 1 will swap their current selection is commonly a higher objective function value according to the cho- defined as sen evaluation measure. Otherwise, another move is tried from the current selection. Core Hunter uses a single- 1if > 0 q ( , t , t ) = 1 1 r r r+1 swap neighbourhood, i.e. considers all neighbours that can − t t r+1 e else be obtained from the current selection by replacing one selected accession with a currently unselected accession. with  = F (C ) − F(C ). As such, if the current selec- r r+1 r tion of replica r + 1 has a better objective function value Parallel tempering than that of the rth replica, these are always swapped. In Algorithm 2 describes the more advanced parallel tem- addition, similar to the probabilistic acceptance of infe- pering method [18], also referred to as replica exchange rior neighbours, swaps that push solutions in the opposite Monte Carlo (REMC), which consists of multiple coop- direction may also be performed—yet with a probability erating local searches that are executed in parallel. Each that decreases for a larger difference in objective func- search performs the same procedure as random descent, tion value and replica temperature. The parallel tempering but may also accept inferior modifications to be able to algorithm implemented in Core Hunter 3 consists of p = −8 −4 escape from local optima, i.e. to further improve the cur- 10 searches with a temperature range of 10 ,10 ,and rent selection even if none of the considered neighbours uses the same single-swap neighbourhood as the random has a better score. For this purpose, the search repli- descent method described above. The number of replica cates are assigned fixed, increasing temperatures, equally steps per iteration is fixed to q = 500, and the default spread in a given range. A higher temperature leads to a acceptance and swap functions are applied. De Beukelaer et al. BMC Bioinformatics (2018) 19:203 Page 5 of 12 Algorithm 2 Parallel tempering. Selection (SELECT).We randomly picked five candidates from the current population, from which the one Input: collection A,coresize(k), evaluation measure with the highest objective function value was chosen F(C), neighbourhood N (C) :  → P () with as a parent (tournament selection). = {C | C ⊂ A ∧ |C| = k}, number of replicas (p), Crossover (CROSS). A child was created from two par- temperature range [ t , t ], acceptance function min max ents by repeatedly adding an arbitrary accession that p(, t),swapfunction q(, t , t ), number of replica 1 2 is selected in either parent solution (at random with steps per iteration (q) equal probability) until the desired core size was Output: best found core C ∈ obtained. 1: for i from 1 to p do Mutation (MUTATE). As mutation operator we applied i−1 2: t ← t + (t − t ) i min max min p−1 the random descent heuristic described above, start- 3: C ← random element of ing from the given solution, until no improvement 4: end for was found in the last 5000 steps. 5: C ← argmax F(C ) best i 1≤i≤p Survival (SURVIVE). We applied a roulette selection to 6: s ← 0 discard five solutions in each step, so that the pop- 7: repeat ulation size remained fixed over all generations. A 8: for i from 1 to p (in parallel) do solution C was assigned a weight of 1/F(C) meaning 9: repeat q times that the probability that it is discarded is inversely 10: pick random neighbour C ∈ N (C ) proportional to its fitness. 11: compute  ← F(C ) − F(C ) i i 12: with probability p( , t ):set C ← C i i i 13: if F(C )> F(C ) then best Algorithm 3 Genetic algorithm. 14: C ← C best i 15: end if Input: collection A,coresize(k), evaluation measure 16: end repeat F(C),populationsize(p), number of children per gen- 17: end for eration (c), selection operator SELECT :  → 18: r ← s + 1 with  = {C | C ⊂ A ∧ |C| = k}, crossover operator 19: while r < p do CROSS :  → ,mutationoperatorMUTATE :  → p+c p 20: compute  ← F(C ) − F(C ) , survival operator SURVIVE :  → r r+1 r 21: with probability q( , t , t ):swap C and Output: best found core C ∈ r r r+1 r r+1 1: pop ←∅ 22: r ← r + 2 2: for i from 1 to p do 23: end while 3: add random element of  to pop 24: s ← 1 − s 4: end for 25: until stop condition satisfied 5: C ← argmax F(C) best C∈pop 26: return C best 6: repeat 7: children ←∅ 8: for i from 1 to c (in parallel) do 9: P ← SELECT(pop) 10: P ← SELECT(pop) Genetic algorithm 11: C ← MUTATE(CROSS(P , P )) To assess the potential improvement of a global opti- 1 2 12: add C to children mization engine over a local search we also applied the 13: end for genetic algorithm [20] outlined in Algorithm 3. Here, 14: for C ∈ children do a population of initially randomly generated solutions 15: add C to pop (cores) is maintained. In every step, new child solutions 16: if F(C)> F(C ) then are produced by combining two randomly chosen par- best 17: C ← C ent solutions (crossover), followed by one or more swaps best 18: end if of accessions (mutation) between the unselected and the 19: end for selected subset. These children are added to the pop- 20: pop ← SURVIVE(pop) ulation, and certain solutions are discarded to simulate 21: until stop condition satisfied survival of the fittest individuals in natural evolution. For 22: return C our experiments we used a population size of p = 25 and best generated c = 5 children in each step (in parallel). We applied the following operators: De Beukelaer et al. BMC Bioinformatics (2018) 19:203 Page 6 of 12 Comparison with GDOpt and SimEli R through the package corehunter (https://cran.r-project. For the GDOpt selection strategy [14], we used the org/package=corehunter). GDOpt, SimEli, and all compu- k-medoids algorithm of Kaufman and Rousseuw [26] tational experiments were implemented in R v3.3.1 [31]. through the R function pam, to identify a representative Note that the R function pam used in GDOpt calls a C core collection. The number of clusters was chosen equal function which performs the actual partitioning. Exper- to the desired core size and the returned medoids were iments were executed on a computing server with two selected as core accessions. We also implemented SimEli 10-core Intel E5-2660v3 (2.6 GHz) CPUs and 128 GB in R, considering both elimination criteria suggested in RAM. [13]. In each step, one of the two most similar acces- sions was eliminated, maximizing either the average dis- Results tance to the remaining accessions (SimEli-A-RA) or the Optimizing E-NE and A-NE with local searches expected heterozygosity of the reduced collection (SimEli- We sampled 10 cores from each dataset using random HE), until the desired core size was obtained. The source descent, parallel tempering, and the described genetic code for these implementations is available on GitHub algorithm, configured to maximize E-NE with a runtime (corehunter/corehunter3-paper). limit of 30 min. Table 1 shows mean values and standard deviations of the obtained cores. The results indicate that Datasets parallel tempering yields the highest E-NE values, with We used four datasets of varying size and composition the lowest variability across independent samples. Vari- to compare the performance of different core sampling ability in solution quality is always at least one order of algorithms: magnitude below that observed for random descent and the genetic algorithm. Still, variability is already quite low 1 Rice data: 1000 accessions for which 39 phenotypic when using the basic random descent heuristic. Although traits were recorded, including 28 qualitative and 11 the genetic algorithm also outperforms random descent, quantitative traits. Available from the PowerCore it is not as effective as parallel tempering. We performed project [15] and previously used to assess the a pairwise comparison of the results obtained with the performance of several other core sampling three applied methods, for the four considered datasets, algorithms, including SimEli [13]. using a Wilcoxon rank-sum test [32]. The twelve result- 2 Coconut data: 1014 accessions characterized using ing p-values were adjusted for multiple testing to control 30 crop-specific SSR markers. Used in multiple the family-wise error rate (FWER) using Holm’s method previous core selection studies [4, 13, 14]. [33]. All differences were statistically significant at the α = 3 Maize data: 1250 accessions characterized with 1117 0.05 confidence level, with adjusted p-values ranging from SNP markers. Distributed as part of the R package 0.00013 to 0.00049. Figure 1 displays convergence curves synbreedData [27]. of the three applied algorithms, again averaged over 10 4 Pea data: 4428 accessions characterized by 17 RBIP runs, for the large pea dataset. These plots confirm that markers [28, 29]. Previously used to compare the all algorithms are able to iteratively improve an arbitrarily performance of Core Hunter 2 with other core bad random selection to reach a high E-NE value. Again sampling algorithms for large datasets [19]. we see that parallel tempering yields the highest-quality cores (left). Moreover, this algorithm is almost as fast as All cores sampled in the performed experiments com- the basic random descent heuristic (right). Both meth- prised 20% of the entire collection for the rice, coconut ods very quickly improve the initial random selection, and and maize datasets, and 10% for the large pea dataset. after less than 10 s, parallel tempering found a better solu- Implementation and hardware tion than random descent, after which it keeps improving Core Hunter 3 has been reimplemented in Java 8, using the the quality of the core. In contrast, the genetic algorithm JAMES framework (v1.2) for discrete optimization with takes a slower start, catches up with random descent after local search metaheuristics [30]and wasexecutedfrom 20 s, and then also further improves the selection—but not Table 1 Comparison of random descent, parallel tempering, and a genetic algorithm, when maximizing the entry-to-nearest-entry criterion (E-NE). Mean values and standard deviations are reported for 10 independently sampled core collections Rice Coconut Maize Pea Random descent 0.1500 ± 1.83e-04 0.5748 ± 5.22e-04 0.4332 ± 2.73e-04 0.3337 ± 1.70e-03 Parallel tempering 0.1508 ± 1.40e-15 0.5759 ± 2.12e-06 0.4359 ± 8.56e-05 0.3412 ± 1.46e-04 Genetic algorithm 0.1506 ± 1.12e-04 0.5755 ± 1.04e-04 0.4346 ± 3.45e-04 0.3386 ± 8.00e-04 De Beukelaer et al. BMC Bioinformatics (2018) 19:203 Page 7 of 12 Fig. 1 Convergence curves for pea dataset. These curves show the E-NE value of the best found solution at each point in time during execution of random descent, parallel tempering, and the genetic algorithm, averaged over 10 independent runs, for the large pea dataset. The left plot reports the progress during the entire run with a runtime of 30 min while the right plot is zoomed in on the first 40 s as effectively as parallel tempering. We performed these and execution time for 10 independent samples, obtained experiments only for the E-NE measure but assume that with both methods, and for each dataset except the rice our findings also hold for A-NE due to the very similar collection, because CH2 cannot sample cores based on composition of both criteria. All following CH3 results phenotypic traits. For all three datasets, CH3 yields higher were obtained with the parallel tempering algorithm. E-NE and DMIN than CH2. However, a detailed inspec- tion of the output generated by CH2 (not shown) revealed Comparison with Core Hunter 2 that the LR replica—one of the search replicas in the To assess whether maximizing E-NE indirectly also yields MixRep algorithm used by CH2—did not always complete a high minimum distance (DMIN) between selected before CH2 was terminated. This LR search is a con- accessions, we compared the results of CH3 and CH2. structive heuristic that starts with an empty selection and We configured CH2 to maximize a weighted index includ- iteratively adds the two best accessions, i.e. those yielding ing both average and minimum pairwise distance, with the best possible score when added to the current selec- tion. After each two additions, one accession is removed equal weight, and CH3 to maximize E-NE. Both algo- rithms were terminated when no improvement was found from the selection, again chosen to optimize the score of during the last 10 s. Table 2 reports average E-NE, DMIN, the remaining selection. This procedure is repeated until the desired core size has been reached. The LR replica was specifically included in CH2 to construct cores with Table 2 Comparison of Core Hunter 2 and 3 high minimum distance [19]. Therefore, we repeated the E-NE DMIN Time (s) CH2 experiments with an absolute runtime limit that was Coconut empirically determined per dataset to ensure that the LR replica terminated in each run (CH2L). Especially for the CH2 0.552 ± 3.53e-2 0.501 ± 9.76e-2 27.6 ± 06.0 large pea dataset, significantly more time was needed in CH3 0.576 ± 9.35e-5 0.540 ± 0.00e-0 37.5 ± 07.9 this configuration. Table 2 shows that CH2L is indeed CH2L 0.569 ± 5.91e-4 0.548 ± 0.00e-0 31.0 ± 00.1 able to construct cores with a much higher minimum dis- Maize tance than CH2, and also outperforms CH3 in terms of CH2 0.416 ± 1.52e-2 0.396 ± 2.46e-2 78.3 ± 10.6 this measure. Although differences in minimum distance CH3 0.435 ± 2.70e-4 0.409 ± 3.05e-3 74.3 ± 26.5 obtained with CH2L and CH3 are not larger than 4%, they are statistically significant for the coconut and maize CH2L 0.429 ± 5.00e-4 0.415 ± 1.11e-3 78.6 ± 02.0 datasets (p = 0.000097), but not for the pea dataset (p = Pea 0.3064). Moreover, CH3 still yields significantly higher- CH2 0.219 ± 1.49e-3 0.000 ± 0.00e-0 85.6 ± 04.5 quality core collections in terms of the E-NE criterion CH3 0.338 ± 1.04e-3 0.287 ± 1.34e-2 154.1 ± 49.7 (p = 0.000097), and is faster for large datasets. CH2L 0.325 ± 8.21e-4 0.297 ± 0.00e-0 802.3 ± 00.8 CH2 maximizes a weighted index including average and minimum pairwise Comparison with GDOpt and SimEli distance, with equal weight, while CH3 maximizes E-NE. Mean E-NE, DMIN, runtime We approximated the Pareto front obtained by Core and corresponding standard deviations are reported for 10 independent Hunter 3 when simultaneously optimizing E-NE, and executions. The highest obtained E-NE and DMIN value per dataset is shown in bold. CH3 was terminated when no improvements were found during 10 s. For CH2, either A-NE or HE, with varying weights α ∈[0, 1] and two alternatives were considered: (a) the same stop condition as for CH3 (CH2); and α = 1 − α , respectively, and compared the results with 2 1 (b) an absolute runtime limit that was empirically determined per dataset to ensure those obtained by GDOpt and SimEli. Note that A-NE is that the LR replica of MixRep terminated in each run (CH2L) De Beukelaer et al. BMC Bioinformatics (2018) 19:203 Page 8 of 12 minimized, while E-NE and HE are maximized. As before, Core Hunter 3 is able to simultaneously improve over CH3 was terminated when no improvement was found SimEli in terms of both objectives (E-NE and HE value). during 10 s. Figure 2 shows that GDOpt and CH3 are able Average execution times of GDOpt, SimEli and CH3 to construct representative cores with low A-NE, which is (configured to optimize E-NE, A-NE or HE) are reported not the case for SimEli. In fact, all cores sampled by SimEli in Table 3. Core Hunter 3 was slower than GDOpt and have a worse A-NE value than those obtained by GDOpt SimEli for the rice and coconut datasets. For the maize and CH3, even when the latter is configured to maximize dataset CH3 was faster than GDOpt and SimEli-HE when E-NE only. On the other hand, SimEli scores much bet- maximizing HE or E-NE but slower when minimizing A- ter than GDOpt in terms of diversity (high E-NE). Still, NE and always slower than SimEli-A-RA. Finally, for the Core Hunter 3 is able to find cores which simultaneously pea dataset, CH3 was faster than both GDOpt and SimEli. have a higher diversity and are more representative than Core Hunter 3 was also consistently faster when maximiz- those obtained with SimEli. For the maize dataset, SimEli- ing HE as compared to the configurations where E-NE or A-RA and SimEli-HE found cores of similar quality, while A-NE were optimized. for the coconut and pea dataset SimEli-A-RA showed to be preferred in terms of both E-NE and A-NE. For the rice Discussion dataset, SimEli-HE was not included because expected Depending on the purpose of a core collection, a vari- heterozygosity can only be evaluated for genotypic data. ety of metrics is used to evaluate its quality. Distance- Figure 3 shows that GDOpt yields cores with significantly based measures are attractive because they are intuitive lower HE than any of the other methods. SimEli performs to understand and can capture both diversity within the better in this respect, especially SimEli-HE, but as before core as well as representativeness of the accessions from Fig. 2 Simultaneous optimization of entry-to-nearest-entry (E-NE) and accession-to-nearest-entry (A-NE) distance. These Pareto front approximations for Core Hunter 3 were obtained by sampling cores with varying weights α ∈[0, 1] and α = 1 − α assigned to the E-NE and A-NE 1 2 1 measures, respectively, with a step size of 0.05. The quality of the cores constructed by CH3 is compared with those obtained by GDOpt and SimEli, in terms of both objective functions. All reported values are averages of 10 independently sampled cores with the same settings De Beukelaer et al. BMC Bioinformatics (2018) 19:203 Page 9 of 12 Fig. 3 Simultaneous maximization of entry-to-nearest-entry distance (E-NE) and expected heterozygosity (HE). These Pareto front approximations for Core Hunter 3 were obtained by sampling cores with varying weights α ∈[0, 1] and α = 1 − α assigned to the E-NE and HE measures, 1 2 1 respectively, with a step size of 0.05. The quality of the cores constructed by CH3 is compared with those obtained by GDOpt and SimEli, in terms of both objective functions. All reported values are averages of 10 independently sampled cores with the same settings. The rice dataset is excluded here because expected heterozygosity can only be evaluated for genotypic data the full collection, computed from either genetic mark- minimum distance, meaning that the search has no clue as ers or phenotypes. However, pairwise distances need to be to whether these modifications may eventually lead to an aggregated in an appropriate way to evaluate the selected improved solution. To smooth out the objective function, core. Although many studies and methods have used aver- CH2 maximized a combination of average and minimum age pairwise distance to assess the diversity in the core, distance. Also, the applied MixRep algorithm includes a it is known that a high average does not guarantee that constructive LR heuristic (see “Results”), which is much all accessions in the core are sufficiently different from better suited to maintain a high minimum distance as it each other [4, 19]. Maximizing this criterion tends to iteratively adds accessions to an initially empty selection. overrepresent the extremes of the distribution in the full Unfortunately, the LR algorithm becomes slow for large collection. datasets, because it builds the core bottom-up, instead of Core Hunter 2 addressed this issue by maximizing min- iteratively refining a randomly chosen initial selection. imum distance in addition to average distance, using Two new distance-based metrics, entry-to-nearest- a complex mixed replica search (MixRep) consisting of entry (E-NE) and accession-to-nearest-entry (A-NE), different cooperating strategies [19]. The original Core introduced by [4], were shown to generate improved cores Hunter software used a local search algorithm to optimize for specific goals. The E-NE criterion takes all acces- the chosen evaluation measure, but such local searches sions into account and can therefore presumably be more are not well suited to optimize minimum distance because effectively optimized with local searches as compared to this measure is very sensitive to the precise selection. Sim- minimum distance, but still focuses on maintaining a high ilar cores may have very different values, while at the same distance between each pair of closest accessions which, time very different cores may have a similar or even the in contrast to average pairwise distance, avoids overrep- same minimum distance. This makes it difficult for a local resentation of extreme values. Therefore, in Core Hunter search to find its way from a randomly generated selection 3, the minimum distance measure was replaced with the to a high-quality core. In particular, for a given current newly proposed E-NE criterion. The A-NE metric was solution, many possible modifications may not affect the also included to sample cores that maximally represent all individual accessions from the full collection. We assessed whether the new E-NE metric can indeed Table 3 Average execution times (seconds) of GDOpt, both be effectively optimized with local search algorithms, in SimEli implementations and CH3 for 10 independent samples an attempt to avoid the complexity of the MixRep algo- from each dataset. Three configurations are considered for CH3: rithm used by CH2, and in particular the slowness of the (a) maximize E-NE; (b) minimize A-NE; and (c) maximize HE LR replica. We showed that even a very basic stochastic Rice Coconut Maize Pea hill-climber (random descent) can already construct cores GDOpt 14.9 7.1 91.2 350.1 with high E-NE value and quite little variability in qual- ity across independent samples. Still, the value of the core SimEli-A-RA 7.6 7.5 11.5 514.7 is further improved, and variability further reduced, when SimEli-HE - 15.9 78.0 502.3 using the more advanced parallel tempering algorithm. CH3 E-NE 45.8 37.5 74.3 154.1 Since parallel tempering takes advantage of modern multi- CH3 A-NE 74.6 55.7 133.1 86.7 core CPUs, the associated computational overhead is very CH3 HE - 16.6 40.2 62.8 limited. In our experiments, even for the large pea dataset De Beukelaer et al. BMC Bioinformatics (2018) 19:203 Page 10 of 12 with over 4000 accessions, parallel tempering was only it may be confusing that there is a possibly large time gap marginally slower than random descent. We also assessed between the last improvement found by the other replicas whether a genetic algorithm could further improve these and that obtained when the LR replica has finished. In this results. Such global optimization strategy iteratively com- respect, CH3 is more user-friendly because it uses a well- bines currently known high-quality solutions (crossover) known local search algorithm that gradually improves the in an attempt to explore other interesting regions of the E-NE value of the core. Large gaps between significant solution space. The obtained solutions are then exploited improvements are not expected, which makes it easier to by applying local modifications (mutation). We used the determine an appropriate time limit and even more so to random descent heuristic as a mutation operator, since it use a convenient adaptive stop condition such as a maxi- showed to be able to effectively improve the E-NE value mum time without finding an improvement, in which case of a given selection. Although the genetic algorithm out- the execution time is automatically adjusted—to some performed random descent, it showed to be slower and extent—to the size of the collection. produced cores with slightly lower E-NE values as com- One of the main advantages of Core Hunter 3 and pre- pared to parallel tempering. These results indicate that the vious versions is its flexibility. While other methods are intelligent exploitation of parallel tempering is more effec- often developed for a specific purpose such as maximiz- tive to optimize E-NE than the more global exploration ing diversity, representativeness, or allelic richness, Core of the evaluated genetic algorithm. We thus conclude Hunter is suited for each of these as it includes a variety of that parallel tempering is preferred, and that more com- evaluation measures that can directly be optimized, and if plex algorithms are not needed to optimize E-NE, since desired combined in a weighted index. We compared CH3 a basic stochastic hill-climber (random descent) already with GDOpt, designed to maximize representativeness, yields high-quality cores and a global optimization engine and SimEli, where the elimination criterion was chosen (genetic algorithm) did not provide any further advantage. either to maximize diversity (SimEli-A-RA) or expected Moreover, parallel tempering does not yield a significant heterozygosity (SimEli-HE). Core Hunter was configured computationaloverhead—itisalmostasfastasrandom to optimize a weighted index including E-NE and either A- descent. We assume that the same conclusion holds for NE (Fig. 2)orHE(Fig. 3), with varying weights, in order to A-NE duetothevery similarcomposition of both met- approximate the corresponding Pareto front. The results rics. Therefore, Core Hunter 3 uses parallel tempering showed that, as expected, GDOpt is especially suited to by default, which is also known to effectively optimize construct cores that optimally represent all accessions the other measures that were already included in CH2, from the entire collection (low A-NE), as it was specifically such as allelic richness [19]. Afastmodeisalsoprovided developed for this purpose. On the other hand, in terms in which the basic random descent algorithm is applied, of diversity (E-NE) and allelic richness (HE), SimEli scores in case execution time is critical, but it was not used in much better than GDOpt. From the two considered elim- this study. ination criteria, SimEli-HE resulted in the highest allelic To validate the effectiveness of the new E-NE mea- richness, while SimEli-A-RA showed to be most suited sure, we assessed whether maximizing E-NE indirectly to maximize diversity (E-NE). Again, this was expected also yields a high minimum distance. A comparison with and confirms that the SimEli method can be adjusted to Core Hunter 2, configured to sample cores with high aver- some extent, by using an appropriate elimination criterion age and minimum distance, revealed that this is indeed depending on the purpose of the core collection. How- the case. The minimum distance obtained with CH3 is ever, Core Hunter 3 found cores that simultaneously have slightly lower as compared to CH2, but more importantly higher E-NE (more diverse), and lower A-NE (more rep- CH3 yields higher E-NE values because it actively opti- resentative) or higher HE values (higher allelic richness), mizes this criterion. As the minimum distance captures than those obtained by SimEli. In addition, CH3 was able less information about the core than E-NE, we believe that to construct equally representative cores as GDOpt, and the latter criterion better reflects within-core diversity. As thus combines and improves over the advantages of both expected, CH3 was faster than CH2 for large datasets, other methods. due to the quadratic time complexity of the LR replica. A comparison of execution times showed that CH3 Because of its constructive nature, LR only produces use- needs less time to optimize HE as compared to E-NE ful results if given enough time to complete. Therefore, and A-NE. This is not surprising, as it is known that a potential additional issue of CH2 is that the user is allelic richness can also be effectively maximized with a responsible to set an appropriate time limit that allows basic stochastic hill-climber [19]. As we showed that the the LR replica to complete, when aiming at a high min- more advanced parallel tempering algorithm is preferred imum distance. It is not possible to affect the execution to optimize E-NE and A-NE, it is clearly more difficult to time of the LR replica and therefore this method does find cores with high E-NE and low A-NE than to maxi- not provide a quality-runtime tradeoff to the user. Also, mize allelic richness. In our experiments CH3 was slower De Beukelaer et al. BMC Bioinformatics (2018) 19:203 Page 11 of 12 than GDOpt and SimEli for smaller datasets but faster for function value, minimizing the core size may not always the large pea dataset. Note that although these methods be desired, depending on the purpose of the core. We were implemented in different programming languages, are therefore convinced that fixed and variable size core which affects their absolute execution times, the latter sampling should be treated as separate problems, using does not affect the observed trend in their execution times specific evaluation measures and optimization strategies. when sampling from increasingly large collections. Here, the main advantage of Core Hunter is again its flexibil- Conclusions ity. For example, the runtime of SimEli is determined We introduced Core Hunter 3 (CH3) and showed that by the size of the dataset and the sampled core. When it constructs core collections with high diversity (high sampling a small core from a large collection, many acces- entry-to-nearest-entry distance; E-NE) and which maxi- sions need to be eliminated, and finding the two most mally represent the individual accessions from the entire similar accessions in each step as well as deciding which collection (low accession-to-nearest-entry distance; A- one to eliminate requires many computations. In contrast, NE) using flexible and fast local search algorithms. By the runtime of Core Hunter can be adjusted by using an default, the parallel tempering algorithm is used. Version appropriate stop condition. It is possible to limit the total 3 improves over Core Hunter 2 (CH2) in multiple ways. runtime, but we used an adaptive condition that termi- CH3 is able to find cores with higher E-NE, within less nated the search when no more improvement was found time for large datasets, which also have a high minimum during 10 s. distance, without the need for a more complex algorithm There is of course a tradeoff between execution time and like the mixed replica search from CH2. In addition, CH3 solution quality, and we may be able to further increase finds similar and often better cores than GDOpt and the quality of the core collections sampled from any of our SimEli, which were reported to outperform CH2 in terms datasets by allowing a longer runtime. For the large pea of E-NE and A-NE. In particular, CH3 can create equally dataset for example, we indeed see that allowing no more representative cores as GDOpt, which was designed for than 10 s without finding further improvements (Table 2) this purpose, while at the same time being able to con- yields a slightly lower E-NE value as compared to a config- struct cores that are simultaneously more diverse, and uration with an absolute runtime limit of 30 min (Table 1). either are more representative or have a higher allelic Since each of the tested methods was able to sample cores richness, than cores obtained with SimEli. As in previ- from collections with up to multiple thousands of acces- ous versions, one of the main strengths of Core Hunter is sionsinatmostafewminutes,wedonot expectthat its flexibility. The applied local search algorithms are not the execution time of any of these algorithms will be lim- confined to a specific evaluation measure and new criteria iting for most practical applications. Still, Core Hunter can easily be introduced and optimized without the need is the only one whose runtime can be controlled by the to alter the underlying algorithms. Moreover, multiple cri- user in various ways, which yields an interesting quality- teria can be simultaneously optimized and the execution runtime tradeoff that can be used to either reduce the time is controlled by the user through various stop condi- execution time for large datasets when needed, or to more tions, which offers a convenient quality-runtime tradeoff. thoroughly explore the solution space when more time We therefore believe that Core Hunter is a very broadly is available, neither of which is possible with the other applicable core subset selection tool with a lot of opportu- methods. Note that although we did not experiment with nities to be further extended. For example, we may explore genotypic datasets with tens or hundreds of thousands of the ability of Core Hunter 3 to sample cores based on markers, these can easily be dealt with by precomputing a a combination of genotypes and phenotypes, or extend distance matrix, if necessary, so that only the number of Core Hunter to properly incorporate variable size core accessions affects the performance of Core Hunter. sampling such as a method to construct covering cores of minimum size. Variable size core sampling Abbreviations Previous versions of Core Hunter also supported variable A-NE: Average accession-to-nearest-entry distance; CH2: Core Hunter 2; CH3: size core sampling. We decided to remove this function- Core Hunter 3; DMIN: Minimum distance; E-NE: Average entry-to-nearest-entry distance; GDOpt: Genetic distance optimization; HE: Expected heterozygosity; ality from Core Hunter 3, and to focus on fixed size core MixRep: Mixed replica search; REMC: Replica exchange Monte Carlo search sampling for the provided evaluation measures, because these measures are not generally applicable to compare Acknowledgements cores of different sizes. For example, reducing the core We thank Nathan Sinnesael who performed preliminary experiments that size artificially increases dissimilarity between selected supported the development of Core Hunter 3. The computational resources (Stevin Supercomputer Infrastructure) and services used in this work were accessions, while adding more accessions always yields a provided by the VSC (Flemish Supercomputer Center), funded by Ghent more representative core. Also, while CH1 and CH2 pre- University, the Hercules Foundation and the Flemish Government - ferred thesmallest of twocores with thesameobjective department EWI. De Beukelaer et al. BMC Bioinformatics (2018) 19:203 Page 12 of 12 Funding 12. Wang J, Hu J, Xu H, Zhang S. A strategy on constructing core collections Herman De Beukelaer is supported by a Ph.D. grant from the Research by least distance stepwise sampling. Theor Appl Genet. 2007;115(1):1–8. Foundation - Flanders (FWO). 13. Krishnan RR, Sumathy R, Ramesh S, Bindroo B, Naik GV. SimEli: Similarity elimination method for sampling distant entries in development of core Availability of data and materials collections. Crop Sci. 2014;54(3):1070–8. The raw rice, coconut and maize datasets are available from the cited 14. Odong T, van Heerwaarden J, Jansen J, van Hintum TJ, van Eeuwijk F. references [14, 15, 27] or on request. The raw pea dataset and computed Statistical techniques for defining reference sets of accessions and distance matrices are also available on request. microsatellite markers. Crop Sci. 2011;51(6):2401–11. 15. Kim K-W, Chung H-K, Cho G-T, Ma K-H, Chandrabalan D, Gwag J-G, Kim T-S, Authors’ contributions Cho E-G, Park Y-J. PowerCore: a program applying the advanced m HDB and GD implemented the Core Hunter 3 library in Java. HDB was strategy with a heuristic search for establishing core sets. Bioinformatics. responsible for the R package while GD developed the graphical interface. 2007;23(16):2155–62. HDB performed all experiments, under the supervision of VF. HDB wrote the 16. Jeong S, Kim J-Y, Jeong S-C, Kang S-T, Moon J-K, Kim N. Genocore: A initial manuscript with all authors contributing to the final version. All authors simple and fast algorithm for core subset selection from large genotype read and approved the final manuscript. datasets. PLoS ONE. 2017;12(7):0181420. 17. Jansen J, Van Hintum T. Genetic distance sampling: a novel sampling Ethics approval and consent to participate method for obtaining core collections using genetic distances with an Not applicable. application to cultivated lettuce. Theor Appl Genet. 2007;114(3):421–8. 18. Thachuk C, Crossa J, Franco J, Dreisigacker S, Warburton M, Davenport Competing interests GF. Core Hunter: an algorithm for sampling genetic resources based on The authors declare that they have no competing interests. multiple genetic measures. BMC Bioinformatics. 2009;10(1):1. 19. De Beukelaer H, Smykal ` P, Davenport GF, Fack V. Core Hunter II: fast core Publisher’s Note subset selection based on multiple genetic diversity measures using Springer Nature remains neutral with regard to jurisdictional claims in mixed replica search. BMC Bioinformatics. 2012;13(1):1. published maps and institutional affiliations. 20. Holland JH. Adaptation in Natural and Artificial Systems: an Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. Author details Ann Arbor: U Michigan Press; 1975. Department of Applied Mathematics, Computer Science and Statistics, Ghent 21. Wright S. Evolution and genetics of populations. vol IV. Chicago: The University, Krijgslaan 281 S9, 9000 Gent, Belgium. New Zealand Institute for University of Chicago Press; 1978. p. 91. Plant & Food Research Limited, 412 No1 Rd RD2, Te Puke, New Zealand. 22. Gower JC. A general coefficient of similarity and some of its properties. J C Gower Biometrics. 1971;27(4):857–71. Received: 12 October 2016 Accepted: 16 May 2018 23. Berg EE, Hamrick J. Quantification of genetic diversity at allozyme loci. Can J For Res. 1997;27(3):415–24. 24. Marler RT, Arora JS. Function-transformation methods for multi-objective optimization. Eng Optim. 2005;37(6):551–70. References 25. Kirkpatrick S, Gelatt CD, Vecchi MP. Optimization by simulated annealing. 1. Frankel O, et al. Genetic perspectives of germplasm conservation. Genetic Science. 1983;220(4598):671–80. manipulation: impact on man and society. Cambridge: Cambridge 26. Kaufman L, Rousseeuw PJ. Chapter 2 Partitioning Around Medoids University Press; 1984. pp. 161–170. (Program PAM) in Finding groups in data: an introduction to cluster 2. El Bakkali A, Haouane H, Moukhli A, Costes E, Van Damme P, Khadari B. analysis. New York: Wiley; 1990. pp. 68–125. Construction of core collections suitable for association mapping to 27. Wimmer V, Albrecht T, Auinger H-J, Schoen C-C. synbreedData: Data for optimize use of mediterranean olive (olea europaea l.) genetic resources. the Synbreed Package. 2015. R package version 1.5. https://CRAN.R- PLoS ONE. 2013;8(5):1–13. https://doi.org/10.1371/journal.pone.0061265. project.org/package=synbreedData. 3. Muñoz-Amatriaín M, Cuesta-Marcos A, Endelman JB, Comadran J, 28. Jing R, Vershinin A, Grzebyta J, Shaw P, Smykal ` P, Marshall D, Ambrose Bonman JM, Bockelman HE, Chao S, Russell J, Waugh R, Hayes PM, MJ, Ellis TN, Flavell AJ. The genetic diversity and evolution of field pea Muehlbauer GJ. The usda barley core collection: Genetic diversity, (pisum) studied by high throughput retrotransposon based insertion population structure, and potential for genome-wide association studies. polymorphism (rbip) marker analysis. BMC Evol Biol. 2010;10(1):1. PLoS ONE. 2014;9(4):1–13. https://doi.org/10.1371/journal.pone.0094688. 29. Smykal ` P., Kenicer G, Flavell AJ, Corander J, Kosterin O, Redden RJ, Ford 4. Odong T, Jansen J, Van Eeuwijk F, van Hintum TJ. Quality of core R, Coyne CJ, Maxted N, Ambrose MJ, et al. Phylogeny, phylogeography collections for effective utilisation of genetic resources review, discussion and genetic diversity of the pisum genus. Plant Genet Resour. 2011;9(01): and interpretation. Theor Appl Genet. 2013;126(2):289–305. 4–18. 5. Wang J-C, Hu J, Liu N-N, Xu H-M, Zhang S. Investigation of combining 30. De Beukelaer H, Davenport GF, De Meyer G, Fack V. JAMES: An plant genotypic values and molecular marker information for object-oriented java framework for discrete optimization using local constructing core subsets. J Integr Plant Biol. 2006;48(11):1371–8. search metaheuristics. Softw Pract Experience. 2017;47(6):921–38. 6. Franco J, Crossa J, Desphande S. Hierarchical multiple-factor analysis for 31. R Core Team. R: A Language and Environment for Statistical Computing. classifying genotypes based on phenotypic and genetic data. Crop Sci. Vienna, Austria: R Foundation for Statistical Computing; 2016. R 2010;50(1):105–17. Foundation for Statistical Computing. https://www.R-project.org/. 7. Borrayo E, Machida-Hirano R, Takeya M, Kawase M, Watanabe K. 32. Hollander M, Wolfe DA, Chicken E. Nonparametric Statistical Methods. Principal components analysis-k-means transposon element based foxtail Chichester: Wiley; 2013. millet core collection selection method. BMC Genet. 2016;17(1):1. 33. Holm S. A simple sequentially rejective multiple test procedure. Scand J 8. Brown A. Core collections: a practical approach to genetic resources Stat. 1979;65–70. management. Genome. 1989;31(2):818–24. 9. Franco J, Crossa J, Taba S, Shands H. A sampling strategy for conserving genetic diversity when forming core subsets. Crop Sci. 2005;45(3): 1035–44. 10. Schoen D, Brown A. Conservation of allelic richness in wild crop relatives is aided by assessment of genetic markers. Proc Natl Acad Sci. 1993;90(22):10623–7. 11. Gouesnard B, Bataillon T, Decoux G, Rozale C, Schoen D, David J. MSTRAT: An algorithm for building germ plasm core collections by maximizing allelic or phenotypic richness. J Hered. 2001;92(1):93–4.

Journal

BMC BioinformaticsSpringer Journals

Published: May 31, 2018

References

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off