What do we gain from simplicity versus complexity in species distribution models?

Species distribution models (SDMs), also known as ecological niche models or habitat selection models, are widely used in ecology, evolutionary biology, and conservation (Elith and Leathwick , Franklin , Zimmermann et al. , Peterson et al. , Svenning et al. , Guisan et al. ). SDMs can provide insights into generalities and idiosyncrasies of the drivers of complex patterns of species' geographic distributions. SDMs are built using a variety of statistical methods – e.g. generalized linear/additive models, tree-based models, maximum entropy – which span a range of complexity in the occurrence–environment relationships that they fit. Capturing the appropriate amount of complexity for particular study objectives is challenging. By building 'underfit' models, with insufficient flexibility to describe observed occurrence–environment relationships, we risk misunderstanding the factors shaping species distributions. By building 'overfit' models, with excessive flexibility, we risk inadvertently ascribing pattern to noise or building opaque models. As such, determining a suitable amount of complexity to include in SDMs is crucial for biological applications. Because traditional model selection is challenging when comparing models from different SDM modeling approaches (e.g. those in Table ), we argue that researchers must constrain model complexity based on attributes of the data, study objectives, and an understanding of how these interact with the underlying biological processes. Here, we discuss the challenges that choosing an appropriate amount of model complexity poses and how this influences the use of different statistical methods and modeling decisions (Elith and Graham ).

Common modeling paradigms used to build SDMs and decisions used to control their complexity.
The variation among response curves from different modeling paradigms and different model settings suggests that they are suitable for different study objectives and attributes of the data. Response curves come from fitting SDMs to presence/background data on the overstory shrub, Protea punctata, from the Cape Floristic Region of South Africa (see Merow et al. for details of the data) with different degrees of control over the complexity of the fitted response curves. All models were constructed using the biomod2 package (Thuiller et al. ) within the statistical software R (R Core Team). Response curves of different complexity are shown which are representative of those commonly observed during SDM building. Dark grey curves were fitted using settings at or near the default options in biomod2 (for illustration), with the exception of forcing the package to perform only a single fit per method using all of the presence data in model fitting. Black (light grey) curves were fitted by choosing options to make the fitted response curves simpler (more complex). Note that the complexity of any of these paradigms is affected by changing the number of predictors, the order of interactions, and model averaging; hence these decisions are not explicitly included in the table.

- Bioclimatic envelope models (BIOCLIM). Responses are built from: quantiles, between which occurrence probability is 1. Features: step functions. Complexity controlled by: quantiles.
- Generalized linear models (GLM). Responses are built from: parametric terms specified by the user. Features: polynomials, piecewise functions, splines. Complexity controlled by: feature complexity specified by the user.
- Generalized additive models (GAM). Responses are built from: a combination of parametric terms and flexible smooth functions suggested by the data or the user. Features: parametric terms as in GLMs and various smoothers (e.g. splines, loess). Complexity controlled by: number of nodes; penalties.
- Multivariate adaptive regression splines (MARS). Responses are built from: the sum of multiple piecewise basis functions of predictors suggested by the data. Features: splines. Complexity controlled by: number of knots; cost per degree of freedom; pruning.
- Artificial neural networks (ANN). Responses are built from: networks of interactions between simple functions of predictors suggested by the data. Complexity controlled by: number of hidden layers.
- Classification and regression trees (CART). Responses are built from: repeated partitioning of predictors into different categories, suggested by the data, associated with different occurrence probabilities. Features: thresholds, with implicit interactions. Complexity controlled by: minimum observations for a split/terminal node; maximum node depth; complexity threshold to attempt a split.
- Random forests (RF). Responses are built from: an average of multiple CARTs, each constructed on bootstrapped samples of the data and using different random subsets of the full predictor set. Features: thresholds, with implicit interactions. Complexity controlled by: see CARTs; number of trees.
- Boosted regression trees (BRT). Responses are built from: regression trees fitted at multiple steps; at each step, the residuals from the sum of all previous models, weighted by the learning rate, are modeled. Features: thresholds, with implicit interactions. Complexity controlled by: see CARTs; number of trees; learning rate.
- Maximum entropy (MAXENT). Responses are built from: a GLM with a large number of features, which are suggested by the data or the user. Features: linear, quadratic, interaction, hinge, threshold. Complexity controlled by: feature classes used; regularization penalty.

Complexity is a fundamental feature of observed occurrence patterns because occurrence–environment relationships may be obscured by processes that are not exclusively related to the environment, such as dispersal, response to disturbance, and biotic interactions (Pulliam , Holt , Boulangeat et al. ). Consequently, SDMs can be dynamic and process-based, explicitly representing aspects of the underlying biology. This paper focuses on the more widely used static, correlative SDMs, although many of the issues considered relate to process-based SDMs as well.
Describing this complexity is critical for many applications of SDMs, and using flexible occurrence–environment relationships allows biologists to hypothesize about the drivers of complexity or make accurate predictions that derive from their representation in SDMs. Such hypotheses are a valuable step toward the types of process‐based models discussed in this issue (Merow et al. , Snell et al. ). However, building complex models comes with the challenge of differentiating true complexity from noise (see chapter 7 in Hastie et al. for a statistical viewpoint on optimising model complexity). Some believe that flexible models are often overfit to the noise prevalent in many occurrence data sets. Thus, with such variation in both needs and opinions regarding model complexity, many modeling approaches are in current use (Table ). We characterize model complexity by the shape of the inferred occurrence–environment relationships (Table ) and the number of posited predictors and parameters used to describe them. A simpler model typically has relatively fewer parameters and fewer relationships among predictors compared to a more complex model. However, it remains a challenge to quantify complexity in a way that is appropriate across the spectrum of modeling approaches in Table (e.g. Janson et al. showed effective degrees of freedom to be an unreliable metric when defining complexity). Univariate ‘response curves’ are commonly used to give an impression of the complexity of the predicted occurrence–environment relationships. These are one‐dimensional ‘slices’ of multivariate space. The most common approach is to plot the predicted occurrence probability against the predictor of interest by holding all other predictors at their mean or median values (Elith et al. ; Table ), although other approaches are possible (Fox , Hastie et al. ). 
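The 'slicing' procedure behind such univariate response curves can be sketched in a few lines. The following is an illustrative example in Python (not the R/biomod2 workflow used elsewhere in this paper), with a hypothetical fitted `predict` function standing in for any SDM:

```python
import math

def response_curve(predict, X, var_index, n_points=50):
    """Univariate response curve: vary one predictor across its observed
    range while holding every other predictor at its median value."""
    cols = list(zip(*X))
    medians = [sorted(c)[len(c) // 2] for c in cols]
    lo, hi = min(cols[var_index]), max(cols[var_index])
    grid = [lo + (hi - lo) * i / (n_points - 1) for i in range(n_points)]
    curve = []
    for v in grid:
        x = list(medians)      # all other predictors fixed at their medians
        x[var_index] = v
        curve.append((v, predict(x)))
    return curve

# Hypothetical fitted model: a logistic response to two predictors
def predict(x):
    eta = -2.0 + 1.5 * x[0] - 0.8 * x[1]
    return 1.0 / (1.0 + math.exp(-eta))

# Toy predictor matrix (rows = sites, columns = predictors)
X = [[0.1, 2.0], [0.5, 1.0], [0.9, 3.0], [0.4, 2.5]]
curve = response_curve(predict, X, var_index=0)
```

Plotting `curve` gives the one-dimensional slice described above; the same routine applied to each predictor in turn yields the panels typically shown alongside SDM results.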
When visualized in this way, a simpler model is relatively smooth, containing fewer inflection and turning points than a more complex model. Though insightful, univariate curves represent the true fitted response only incompletely (3-dimensional response surfaces or the 'inflated response curves' of Zurell et al. ( ) help here). Complex models contain more interactions than simpler models, and these can only be visualized on higher-dimensional surfaces. Such responses must be interpreted as conditional on the other predictors being held at their means or medians, and may differ from the responses obtained with variables held at other values (Zurell et al. ), or from an unconditional model. Nonetheless, uni- and multivariate response curves remain one of the best standardized ways to assess relative model complexity. In this paper, we develop general guidelines for deciding on an appropriate level of complexity in occurrence–environment relationships. Uncertainty about how best to describe ecological complexity has to some extent divided biologists between those who prefer to use the principle of parsimony to identify model complexity (preferring the simplest model that is consistent with the data), and those who try to approximate more of the complexities of real-world relationships. We review the literature and the general modeling principles emerging from these two viewpoints, and we discuss the ways in which these overlap or differ in light of study objectives and attributes of the data. We make a variety of recommendations for choosing levels of complexity under different circumstances, while highlighting unresolved scenarios where viewpoints differ. We conclude with suggestions for drawing on the strengths of each modeling approach in order to advance our knowledge of current and future species' geographical ranges.
Complexity in ecology

Many interacting biotic and abiotic processes influence species distributions and can manifest as complex occurrence–environment relationships (Soberón , Boulangeat et al. ). One essential challenge to recovering the primary environmental drivers of these distributions, however, is to differentiate the signals of range determinants from sampling and environmental noise. Before embarking on statistical analyses of range determinants, ecological theory can focus an investigation (Austin , , , Pulliam , Chase and Leibold , Holt ). There is, a priori, a set of common drivers of populations that can be used to propose general shapes of occurrence–environment relationships. For example, we expect that for many variables, response curves describing a fundamental niche should be smooth, because sudden jumps in fitness along an environmental gradient are unlikely to exist (Pulliam , Chase and Leibold , Holt ). For other variables, e.g. those related to thermal tolerance, steep thresholds may exist due to loss of physiological function (Buckley et al. ). However, response curves describing realized niches might exhibit discontinuities due to the multiple interacting factors that can limit a species' occurrence in any particular location. Unimodal responses (e.g. a bell-shaped curve) are expected because conditions too extreme for survival often exist at either end of a proximal gradient (Austin ). However, response curves can be linear where only part of the environmental range of the species has been sampled (e.g. one side of a unimodal response; Albert et al. ). Austin and Smith's ( ) continuum concept for plant species distributions predicts that skewed unimodal response curves are likely when plant species distributions are predominantly determined by one or a few environmental variables that strongly regulate survivorship and/or reproduction (e.g.
by temperature thresholds), but that more irregular response curves are expected given that species are influenced by a range of regulatory factors (e.g. different limiting nutrients, biotic and abiotic interactions) and historical contingencies (Austin et al. , Normand et al. ). Even for single factors, the processes that determine fitness may differ across the range, e.g. where one temperature extreme leads to abrupt loss of function while the other causes gradually reduced performance. Interaction terms can be desirable to capture covariation between predictors or tradeoffs along resource gradients (e.g. higher temperatures are tolerable with greater rainfall). Many applications of SDMs do not explicitly consider such theoretical constraints on the shape of response curves (but see Santika and Hutchinson ), perhaps because it is difficult to work out how they translate into observations. We are faced with the challenge of inferring unknown levels of ecological complexity through the lens of data and models that imperfectly capture it.

Complexity in models

Two attributes of model fitting determine the complexity of inferred occurrence–environment relationships in SDMs: the underlying statistical method and the modeling decisions made about inputs and settings. Together, these define what we will call different modeling approaches, a number of which are illustrated in Table .

Statistical methods

One of the primary differences among the available statistical methods for fitting SDMs is the range of transformations of predictors that they typically consider (in machine learning parlance: which 'features' to allow), and this helps to define the upper limit of complexity for their fitted response surfaces. We detail commonly used modeling approaches and demonstrate examples of their response curves in Table . Rectilinear or convex-hull environmental envelopes (e.g. BIOCLIM or DOMAIN) and distance-based approaches in multivariate environmental spaces (e.g.
Mahalanobis) are used in the simplest SDMs. Their response curves are simple functions (e.g. linear, hinge or step; Elith et al. ). Generalized linear models (GLMs), which for SDMs are typically fitted with linear or polynomial features up to second-order terms (rarely third or fourth order), and often without interactions, admit more complexity. Generalized additive models (GAMs) are potentially more complex because they allow non-parametric smooth functions of variable flexibility (Hastie and Tibshirani , Wood ). Decision trees (Breiman et al. ) can also become quite complex because they can use a large number of step functions (each requiring a parameter) and can implicitly include high-order interaction terms to depict response curves of arbitrary complexity.

Modeling decisions

Decisions that affect model complexity apply to all the statistical methods described above. For example, if a large set of predictors is available, then model complexity will differ depending on whether the full set, or a small subset, is used. One must also determine which features are considered in the model. Each feature requires at least one parameter in the occurrence–environment relationship and hence increases model complexity (see the increased complexity of black vs grey MAXENT response curves due to the increased number of features; Table ). Large numbers of predictors are more commonly used in machine-learning approaches, because these automate feature selection, whereas fewer are often used in simpler models where features are specified a priori. For example, maximum entropy models (MAXENT) can consider any number of linear, quadratic, product, threshold (step function) or hinge transformations of the predictors (Phillips et al. , Phillips and Dudik ). In principle, this same complexity could be fitted in a traditional GLM, but this is typically impractical and not of interest to ecologists.
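To illustrate how feature classes inflate the number of parameters, the following sketch (a simplified, hypothetical version of MAXENT-style feature expansion; the knot positions are arbitrary) expands a single predictor value into linear, quadratic, hinge and threshold features:

```python
def hinge(knot):
    # hinge feature: zero below the knot, increasing linearly above it
    return lambda x: max(0.0, x - knot)

def threshold(knot):
    # threshold (step) feature: a 0/1 indicator above the knot
    return lambda x: 1.0 if x > knot else 0.0

def expand(x, knots):
    """Expand one predictor value into linear, quadratic, hinge and
    threshold features; every feature adds a coefficient to the model."""
    feats = [x, x * x]                      # linear and quadratic features
    for k in knots:
        feats.append(hinge(k)(x))
        feats.append(threshold(k)(x))
    return feats

# Three arbitrary knots turn one predictor into eight features
features = expand(2.0, knots=[0.5, 1.5, 2.5])
```

With many knots per predictor, a handful of environmental variables can generate hundreds of candidate features, which is why regularization (discussed below) becomes essential.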
SDM complexity is amplified when interactions between predictors are included to account for non-additive relationships. GLMs and GAMs can include interactions that have been specified during model formulation as potentially ecologically relevant, but these are usually used only sparingly. Decision trees include interactions implicitly through their hierarchical structure; i.e. the response to one variable depends on the values of inputs higher in the tree, meaning that high-order interaction terms (that depend on all the predictors along a branch) are possible. However, interactions between variables are fitted automatically if supported by the data and cannot be explicitly controlled by the user (except to specify the permissible order of the interactions considered). Using ensembles of models can increase or decrease complexity. Ensembles are combinations of models in which the component models can be chosen based on selected criteria (e.g. predictive performance on held-out data; Araújo and New ) or with an ensemble algorithm (a machine learning method). For instance, regression models selected via an information criterion can be combined using 'multi-model inference', allowing distributions over effect sizes and over predictions to new sites (Burnham and Anderson ). A typical machine learning approach to ensembles uses an algorithm to build an ensemble of simple models that together predict better than any one component model. Examples include bagging and boosting; while these can be used on any component models, in ecology the most used component models are decision trees (e.g. in random forests, Breiman 2001; and boosted regression trees, Friedman ). Bagging (bootstrap aggregation) fits many models to bootstrapped replicates of the dataset (with or without random subsetting of the predictors used across trees, as in random forests).
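A minimal sketch of bagging, using a hypothetical one-split 'stump' as the component model (illustrative only; real random forests use full decision trees and also subsample predictors):

```python
import random

def fit_stump(data):
    """Component model: a one-split 'stump' that predicts the mean
    response on either side of the sample's median predictor value."""
    xs = sorted(x for x, _ in data)
    split = xs[len(xs) // 2]
    left = [y for x, y in data if x < split] or [0.0]
    right = [y for x, y in data if x >= split] or [0.0]
    lmean, rmean = sum(left) / len(left), sum(right) / len(right)
    return lambda x: lmean if x < split else rmean

def bag(data, n_models=25, seed=1):
    """Fit one stump per bootstrap replicate of the data; the ensemble
    prediction averages the component models."""
    rng = random.Random(seed)
    models = [fit_stump([rng.choice(data) for _ in data])
              for _ in range(n_models)]
    return lambda x: sum(m(x) for m in models) / n_models

# Toy data: occurrence switches from 0 to 1 halfway along the gradient
data = [(x / 10, 1.0 if x >= 5 else 0.0) for x in range(10)]
ensemble = bag(data)
```

Note how the averaged ensemble produces a smoother, more complex response than any single step-function stump, which is the point made above about ensembles of simple models.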
In contrast, boosting uses a forward stagewise method to build an ensemble, at each step modeling the residuals of the models fitted to date. Taking ensembles of relatively simple models usually increases complexity, because combinations of simple models will not necessarily be simple. In contrast, ensembles of more complex models can average over the idiosyncrasies of individual models to produce smoother response curves (Elder ).

Model comparison

To avoid overfitting and underfitting, it is common to compare models of differing complexity and select the model that optimizes some measure of performance. However, comparing models across modeling approaches (e.g. those in Table ) can be challenging. This is one of our motivations for constraining model complexity based on study objectives and data attributes. Information-theoretic measures are a conventional way to choose model complexity and are relatively easy to apply for models where estimating the number of degrees of freedom is possible. However, these cannot be calculated for ensemble-based methods, nor for many other methods in common use (Janson et al. ). In fact, Janson et al. ( ) warn, 'contrary to folk intuition, model complexity and degrees of freedom are not synonymous and may correspond very poorly'. One way to compare models produced by different algorithms is to adopt a common currency for model performance by evaluating model predictions on either the training data or independent testing data. Measures such as AUC, Cohen's Kappa, and the True Skill Statistic are based on correctly distinguishing presences from absences. Measures based on non-thresholded predictions are also relevant and preferable in many situations (Lawson et al. ). However, each of these metrics has weaknesses in different circumstances (Lobo et al. ) and, further, they only represent heuristic diagnostics for presence-only data, because presences must be compared to pseudoabsence/background data (Hirzel et al. ).
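As one concrete example of such a common currency, AUC can be computed directly as the probability that a randomly chosen presence receives a higher predicted score than a randomly chosen background point (the Mann–Whitney formulation); a simple, if inefficient, pairwise sketch:

```python
def auc(presence_scores, background_scores):
    """AUC as the probability that a randomly chosen presence is ranked
    above a randomly chosen background point (ties count one half)."""
    wins = 0.0
    for p in presence_scores:
        for b in background_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(presence_scores) * len(background_scores))

# Hypothetical predicted suitabilities at presence and background sites
score = auc([0.9, 0.8, 0.6], [0.7, 0.4, 0.2, 0.1])  # 11 of 12 pairs correct
```

Because the 'negative' class here is background rather than true absence, the resulting value is a relative ranking diagnostic, as noted above, not an absolute measure of discrimination.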
Once one has determined a suitable modeling approach, tuning the amount of complexity is more straightforward using a range of model selection techniques. Feature significance (e.g. p-values), measures of model fit (e.g. likelihood), and information criteria (e.g. AIC, AICc, BIC; Burnham and Anderson ) can be applied to regression-based methods. Cross-validation or other resampling techniques are also used to set the smoothness of splines in GAMs (Wood ) or to determine tuning parameters in most machine learning methods (Hastie et al. ). Shrinkage or regularization is often used in regression, MAXENT and boosted regression trees to constrain coefficient estimates so that models predict reliably (Phillips et al. , Hastie et al. ). Loss functions, which penalize errors in prediction, can be constructed for any of the modeling approaches we consider (Hastie et al. ). An alternative approach employs null models to evaluate whether additional complexity has led to spurious predictive accuracy (Raes and ter Steege ). Evaluation against fit to training data alone cannot control for overfitting and risks selecting excessively complex models (Pearce and Ferrier , Araújo et al. ). In general, best practice involves splitting the data into training data to fit the model, validation data for model selection, and test data to evaluate the predictive performance of the selected model (Hastie et al. ). Recent studies have emphasized that care should also be taken in how the data are partitioned into training, validation and test sets, in particular to control for spatial autocorrelation (Latimer et al. , Dormann et al. , Veloz , Hijmans ; see below for more details). Hence, methods such as block cross-validation (where blocks are spatially stratified) are gaining momentum (Hutchinson et al. , Pearson et al. , Warton et al. ). Failure to factor out spatial autocorrelation in data partitioning can lead to misleadingly good estimates of model predictive performance.
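A minimal sketch of how spatially stratified blocks for block cross-validation might be constructed from point coordinates (the simple gridding rule and block count here are illustrative assumptions, not a prescribed method):

```python
def block_folds(coords, n_blocks=2):
    """Assign each point to a spatial block by gridding longitude and
    latitude, so that cross-validation folds are spatially separated."""
    lons = [c[0] for c in coords]
    lats = [c[1] for c in coords]

    def bin_index(v, lo, hi):
        if hi == lo:
            return 0
        return min(int((v - lo) / (hi - lo) * n_blocks), n_blocks - 1)

    folds = []
    for lon, lat in coords:
        bx = bin_index(lon, min(lons), max(lons))
        by = bin_index(lat, min(lats), max(lats))
        folds.append(bx + n_blocks * by)   # block id doubles as fold id
    return folds

# Toy coordinates: nearby points share a fold, distant points do not
coords = [(0.0, 0.0), (0.1, 0.2), (9.8, 0.1), (9.9, 9.9), (0.2, 9.7)]
folds = block_folds(coords)
```

Holding out one block at a time then tests the model on sites that are spatially separated from the training sites, which reduces the inflation of performance estimates caused by spatial autocorrelation.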
Basing model comparison on holdout data presents some practical challenges. Sample size may be insufficient to subset the data without introducing bias. Subsets of the data can contain the same or different biases compared to the full data set. In particular, it can be difficult to remove spatial correlation between training and holdout data when the sampling design for the occurrence data is unknown or when a species is restricted geographically or environmentally (discussed below). Importantly, all of these approaches to model comparison have strengths and weaknesses, and none can unambiguously select between models of differing complexity built with different statistical methods and underlying assumptions. The tried and tested methods of statistics and machine learning for model selection are valuable when working within a particular modeling approach, but to benefit from them, it is valuable to narrow the scope of the feasible models based on biological considerations. We therefore now move to exploring approaches for identifying the appropriate level of complexity for particular study objectives based on data limitations and the underlying biological processes.

Philosophical, statistical and biological considerations when choosing complexity

In this section, we discuss factors that should influence the choice of model complexity. First, we outline general considerations and philosophical differences underlying both simple and complex modeling strategies (section Simple versus complex: fundamental approaches to describing natural systems). Next, we discuss how study goals (section Study objectives) and data attributes (section Data attributes) interact with model complexity. Figure summarizes our findings. Importantly, a general consensus for choosing model complexity is not possible in many cases.
To reflect the different schools of thought, we divide our facts, ideas and opinions into those that are relatively uncontroversial (subsections denoted 'Recommendations'), those that favor simple models (denoted 'Simple'), and those that favor more complex models (denoted 'Complex'). We recall that 'simple' and 'complex' refer to the extremes along a gradient of complexity in response curves produced by distinct statistical methods and modeling decisions (section Complexity in models and Table ).

Influence of study objectives and data attributes on the choice of model complexity. Green arrows illustrate attributes where the choice of complexity is of no particular concern. Red arrows illustrate situations where caution and/or experimentation with model complexity is needed. Gray arrows indicate decisions that involve interactions with other study goals or data attributes. The thickness of the arrows illustrates the strength of the arguments in favor of choosing a specific level of complexity, with thicker arrows indicating stronger arguments.

Simple versus complex: fundamental approaches to describing natural systems

Simple

Simple models tend towards a conservative, parsimonious approach and typically avoid overfitting. They link model structure to hypotheses that posit occurrence–environment relationships a priori and examine whether the resulting model meets these expectations. Simple models have greater tractability, can facilitate the interpretation of coefficients (cf. Tibshirani ), can help in understanding the primary drivers of species occurrence patterns, and are likely to be more easily generalized to new data sets (Randin et al. , Elith et al. ). Although complex responses surely exist in nature, we often cannot detect them because their signal is weak or confounded with sampling noise, bias or spatial autocorrelation.
By using models that are too complex, one can inadvertently assign patterns due to data limitations or missing processes, or both, to environmental suitability, and fit those patterns simply by chance.

Complex

Complex models are often semi- or fully non-parametric, and are preferred when there is no desire to impose parametric assumptions, specific functional forms, or to pre-select predictors a priori. This does not mean that they are not biologically motivated, but rather emphasizes the reality that Nature is complex. Simple models may be readily interpretable but misleading (Breiman ), and for many applications of SDMs a preference for predictive accuracy in new data sets over interpretability is justifiable. Also, complex models are not necessarily difficult to interpret. Indeed, their complexity can be valuable for suggesting novel, unexpected responses. If we do not explore the full spectrum of complexity, there is a risk of obtaining an overly simplified, or even biased, view of ecological responses. Complex models can, depending on how they are structured, still identify simple relationships if responses are strong and robust.

Study objectives

Niche description vs range mapping

Two prominent applications of SDMs are characterizing the predictors that define a species' niche and projecting fitted models across a landscape. Niche characterization quantifies the variables, primarily climatic and physical, that affect a species' distribution. This is often done by analyzing response curves, the functions (coefficients or smoothing terms) that define them, and their relative importance in the model. Projecting these fitted models across a landscape can predict the geographic locations where the species may occur in the present or in the future. In some studies, the focus lies in the final mapped predictions rather than in how they derive from the underlying fitted models.
Recommendations

Some evaluation of the biological plausibility of the shape and complexity of response curves is always valuable, even if the objective is not niche description. Such evaluation is particularly critical for extrapolation (section Interpolate vs extrapolate), though it is admittedly quite challenging in multivariate models. Modelers should also carefully evaluate whether maps built from complex models substantially differ from maps built from simple models. If the predictions differ, the source of the difference should be explored. If the interest lies in interpretation, it is important to assess whether the mapped predictions are right for the right reason, and that complex environmental responses have not become proxies for sources of spatial aggregation in the data that lead to bias when projected to other locations (whether interpolating or extrapolating; section Spatial autocorrelation).

Simple

Simple models are preferable for niche description because they usually yield straightforward, smooth response curves that can be linked directly to ecological niche theory (section Complexity in models; Austin ), in contrast to the often irregular shapes that result from complex models (Table ). Assumptions about species responses are also more transparent when simple models are projected into new situations.

Complex

Complex models can be valuable for describing a species' niche when only qualitative descriptors of response curves are necessary (e.g. positive/negative, modality, relative importance); i.e. even complex responses can be described in terms of main trends. Allowing complexity might offer more chance of identifying relevant response shapes. Complex models can be powerful for accurately mapping within the fitting region (Elith et al. , Randin et al. ) when one is not necessarily concerned with an ecological understanding of the complexity of the underlying models.
Although the source of complex relationships may remain unknown, complex models have the flexibility to describe them. Abrupt steps in response curves might also help to uncover strictly unsuitable sites when mapping a distribution in space.

Hypothesis testing vs hypothesis generation

Some SDM studies are focused on testing specific hypotheses about how species are distributed in relation to particular predictors or features. In others, little is known about the predictors shaping the distribution, and the objective is to explore occurrence–environment relationships and generate hypotheses for explanation. For example, SDMs are valuable exploratory analyses for detecting the processes that confound occurrence–environment relationships, such as transient dynamics, dispersal, biotic interactions, or human modification of landscapes. The indirect effect of such processes can be seen in occurrence patterns, often as abrupt changes or nonlinearities in response curves, leading to hypothesis generation. Whether one is testing or generating hypotheses critically affects the level of complexity permitted, because hypothesis testing depends on being able to isolate the effects of particular features, whereas this matters less when exploring data in order to generate hypotheses.

Recommendations

When testing hypotheses, insights from ecological theory can guide the selection of features to include. A higher degree of control over the specific details of the underlying response surface is likely needed for hypothesis testing, which is made much easier using simple models. Hypothesis testing is more challenging in complex models with correlated features that can trade off with one another. Complex models are well suited to hypothesis generation, enabling a wider range of environmental covariates and modeling options than can be conveniently explored with simple models.
Simple

When the goal is hypothesis testing, simple parametric models allow investigation of the strength and shape of relationships between species occurrence and a small set of features. Furthermore, parametric models allow hypothesis tests to examine whether specific nonlinear features should be included in the selected model(s). The problem with complex models in such a setting is that, with the large suite of potential features that they use, it is challenging to determine the significance of a single feature or attribute of the response curve, or to compare alternative models. Instead, one is constrained to accept the features selected by the statistical method (e.g. feature classes in MAXENT; splits in tree-based methods) to represent that predictor (within some user-specified bounds). It is instead preferable to specify a set of features (or multiple sets for competing models) to determine their suitability for describing a particular pattern. For example, when features are selected automatically, it may be challenging to determine whether a quadratic term that makes the response unimodal is important, or how much better or worse the model might be without it.

Complex

The starting premise for hypothesis testing is a priori ecological understanding that enables the user to select a small set of features. However, we do not always have this prior understanding. Complex models explore much larger sets of nonlinear features and interactions than simple models and are suited to generating hypotheses about underlying processes (Boulangeat et al. ) derived from potentially flexible responses that would often not be detected with simpler models (e.g. bimodality). This same flexibility can be used to augment existing knowledge. For example, if we know that a species is associated with dry, high-elevation locations, we do not need a simplified model to describe this, but rather more insight from a potentially complex model to capture bimodality or strong asymmetries.
Complex models also provide tools for evaluating predictor importance, which is useful both for generating and for testing hypotheses and can lead to inference that differs little from simpler models (Grömping ). These importance indices can be generated from permutation tests (Strobl et al. , Grömping ), contribution to the likelihood (e.g. ‘percent contribution’ in MAXENT), or the proportion of deviance explained (decision trees).

Interpolate vs extrapolate

When predicting species' distributions over space and time, it is important to distinguish between interpolation and extrapolation. A point interpolated by a fitted model lies within the known range of the predictors but was not measured for its response; an extrapolated point lies outside the observed range of the predictors. Both interpolation and extrapolation can occur in geographic or environmental space (cf. Peterson et al. , Aarts et al. ). Extrapolation requires caution in all scenarios but cannot be avoided when assessing questions relating to ‘no‐analogue’ climate scenarios (Araújo et al. ) or range expansion. The correlative models discussed here are often not optimal for extrapolation; process‐based models are generally preferred because the functional form of the response curve captures the processes that apply beyond the range of observed data (Kearney and Porter , Thuiller et al. , Merow et al. ).

Recommendations

The challenges associated with interpolation and extrapolation, though they manifest differently, arise for models of any complexity, and hence the simple and complex perspectives align. Interpolation within the range of the observed data will be accurate if the model includes all processes operating in the interpolation extent and is based on well‐structured data. Without that, prediction to unsampled sites will average across unrepresented processes and may reflect biases in the sample. 
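The permutation‐based importance indices mentioned above are model‐agnostic: shuffle one predictor at a time and record how much a chosen performance metric degrades. A hedged sketch (Python; the hand‐specified `predict` function stands in for a fitted SDM and, like the toy data, is purely illustrative):

```python
import numpy as np

def permutation_importance(predict, X, y, metric, n_rep=20, seed=0):
    """Mean drop in a performance metric when each predictor column is shuffled;
    a larger drop means the model relies on that predictor more."""
    rng = np.random.default_rng(seed)
    base = metric(y, predict(X))
    importance = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_rep):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])
            drops.append(base - metric(y, predict(Xp)))
        importance[j] = np.mean(drops)
    return importance

# Toy data: occurrence depends only on the first of two candidate predictors
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + 0.1 * rng.normal(size=400) > 0).astype(float)
predict = lambda X: 1.0 / (1.0 + np.exp(-4.0 * X[:, 0]))   # stand-in fitted SDM
accuracy = lambda y, p: np.mean((p > 0.5) == y)

imp = permutation_importance(predict, X, y, accuracy)
print(imp)   # first predictor important, second essentially zero
```

Because it only requires a prediction function, the same routine applies unchanged to tree ensembles, MAXENT‐style models, or any other black box.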
More generally, it may not matter whether a response curve is complex as long as it retains the basic qualities of a simpler model. For example, a line or a sequence of small step functions parallel to the line can produce similar predictions. Some caution should be taken with complex models, as complex combinations of features can become proxies for unmeasured spatial factors in unintended ways and inadvertently model clustering in geographic space as complexity in environmental space, which can lead to errant interpolation (section Spatial autocorrelation). Extrapolation always requires that response curves have been checked for biological plausibility (cf. section Niche description vs range mapping). Of course, even simple models can extrapolate poorly. For example, Thuiller et al. ( ) showed that a simple GLM or GAM run on a restricted and incomplete range could create spurious termination of the smoothed relationships, leading to errant extrapolation. Hence, the importance of extrapolation can depend on the chosen spatial extent and on the selected features (section Spatial extents and resolution). Complex models should be carefully monitored at the edges of the data range, both because small sample sizes and the ways different statistical methods handle extrapolation can have drastic effects on predictions (Pearson et al. ). When using complex models, feature space may be sparsely sampled, which means that when one expects to interpolate a predictor, there may be inadvertent extrapolation of nonlinear features. For example, in a model with interaction terms, one may adequately sample the linear features for all predictors while poorly sampling the relevant combinations of these predictors (Zurell et al. ). Complex models can lead to different combinations of features producing similar model performance in the present (Maggini et al. ), but vastly diverging spatial predictions when transferred to other conditions (Thuiller , Thuiller et al. , Pearson et al. 
, Edwards et al. , Elith et al. ). Narrowing the range of possibilities using a simpler model that controls for the biological plausibility of the response curves (cf. section Complexity in models) can reduce this divergence (Randin et al. ).

Data attributes

Sample size

The number of occurrence records is a critical limiting factor when building SDMs. With presence–absence data, the number of records in the least frequent class determines the amount of information available for modeling. Small sample sizes can lead to low signal‐to‐noise ratios, making it difficult to evaluate the strength of any occurrence–environment pattern in the presence of confounding processes.

Recommendations

Simple models are necessary for species with few occurrences to avoid over‐fitting (Fig. ). This suggests few predictors and only simple features. Support for features can be found by reporting intervals on response curves (e.g. from confidence intervals or subsamples), with an eye for tight intervals around pronounced nonlinearities. For large data sets, any of the modeling approaches described earlier is potentially suitable, depending on study objectives.

Simple

We expect a large amount of noise in occurrence data due to processes unrelated to environmental responses, and this noise can be particularly influential when sample sizes are small. For example, if a basic temperature response is built from data that are variably influenced by a strong land‐use history and dispersal limitation throughout the range, failure to take that into account results in a misspecified climate response surface. While simple models have a chance of smoothing over such variation, complex models can more readily fit these latent patterns, leading to biased prediction when models are projected to other locations where the latent processes differ. Complex models fitting many features are only appropriate when there are sufficient data to meaningfully train, test and validate the model (cf. 
Hastie et al. ).

Complex

If data are available, increasing the number of predictors can provide a more accurate understanding of the drivers of distributions. If the data set is small, it is still possible to use a method that is potentially complex, as long as it is well controlled by the user to protect against over‐fitting, e.g. using penalized likelihoods (Tibshirani ), a reduced set of features in MAXENT (Phillips and Dudík , Merow et al. ), or heavy pruning in tree‐based methods. Permitting some complexity may be useful for identifying counterintuitive response curves and for developing stratified sampling strategies for future data collection to support or refute the modeled responses.

Sampling bias

Sampling bias arises from imperfect sampling design, which includes purposive, non‐probabilistic, or targeted sampling (Schreuder et al. , Edwards et al. ) and imperfect detection (MacKenzie et al. ). The important question is whether sampling bias – which often arises in geographic space – transfers to bias in environmental space, and further, whether some environments are completely unsampled. No statistical manipulation can fully overcome biased sampling. The main challenge when choosing complexity is that – particularly for models based on presence‐only data – it may be unclear whether patterns in environmental space derive from habitat suitability, divergence between the fundamental and realized niches (Pulliam ), transient behavior, or sampling problems (Phillips et al. , Hefley et al. , Warton et al. ). For presence–absence data with perfect detection, sampling biases may not be too detrimental as long as at least some samples exist across the environments into which the model is required to predict (Zadrozny , but see Edwards et al. for contrasting results).

Recommendations

More flexible models will be more prone to finding patterns in restricted parts of environmental space where sampling is problematic. 
Poor performance on test data could identify over‐fitting to sampling bias, but only if the test data are unbiased. In practice, if unbiased testing data were available, they could be used to build an unbiased model in the first place. Recent advances that enable presence‐only and presence–absence data to be modeled together, and across species, will be useful in this context (Fithian et al. ). A tradeoff exists between a complex model that might fit, e.g., step functions to few data points in poorly sampled regions, and a simple model that predicts smooth but potentially meaningless functions from just a few points.

Simple

The hope when using simple models for biased data is that the main trends are still identified. Complex models can over‐fit to the bias (particularly if the bias is heterogeneous in space) and miss the true main trends. Methods for dealing with imperfect detection (MacKenzie and Royle , Welsh et al. ) or sampling design often specify relatively simple responses to environment because they simultaneously fit a model for sampling (Latimer et al. ), and identifiability can become an issue when too many parameters are used that might relate to either observation or occurrence. In such cases, inference will be limited to very general trends.

Complex

If the sampling bias is strongly linked to environmental gradients, even simple models can predict spurious relationships (Lahoz‐Monfort et al. ). Complex models could be useful in understanding, or hypothesizing about, the nature of the sampling bias: for example, the most parsimonious explanation for sharp changes in the probability of presence could in some circumstances be sampling bias, although we know of no published examples. Detection and sampling‐bias models are not restricted to simple models – the former have recently been developed for boosted regression trees (Hutchinson et al. ) and the latter are often used with MAXENT (Phillips et al. ). 
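A recurring recommendation in the sample‐size and sampling‐bias discussions above is to rein in flexible methods explicitly, e.g. via penalized likelihoods (Tibshirani's lasso). A minimal illustration of how an L1 penalty zeroes out weak coefficients on a small dataset (Python; a pedagogical proximal‐gradient fitter on simulated data, not a production implementation):

```python
import numpy as np

def l1_logistic(X, y, lam, step=0.1, iters=3000):
    """L1-penalized logistic regression via proximal gradient descent (ISTA).
    Soft-thresholding shrinks weak coefficients exactly to zero; the
    intercept (column 0) is left unpenalized."""
    n = X.shape[0]
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        b = b - step * (X.T @ (p - y)) / n          # gradient step on the log-likelihood
        b[1:] = np.sign(b[1:]) * np.maximum(np.abs(b[1:]) - step * lam, 0.0)
    return b

# Small sample (n = 40), one real predictor hidden among five spurious ones
rng = np.random.default_rng(3)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 6))])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.3 + 1.5 * X[:, 1]))))
b = l1_logistic(X, y, lam=0.15)
print(np.round(b, 2))   # spurious coefficients are shrunk toward (mostly to) zero
```

The penalty weight `lam` plays the same role as MAXENT's regularization multiplier or tree pruning: it is the single dial that trades flexibility against over‐fitting, and it can be tuned by cross‐validation.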
Predictor variables: proximal vs distal

A priority in selecting candidate predictors is to identify variables that are as proximal as possible to the factors constraining the species' distribution. Proximal variables (e.g. soil moisture for plants) best represent the resources and direct gradients that influence species ranges (Austin ). More distal predictors, such as topographic aspect used as a surrogate for soil moisture, do not directly affect species distributions but act indirectly through their imperfect relationships with the proximal predictors they replace. The problem with distal predictors is that their correlation with the proximal predictor can change across the species' range, even if the proximal predictor's relationship with the species does not (Dormann et al. ). We rarely have access to all of the most important proximal predictors across a study region, so the main question is: what response shapes should we expect for more distal predictors? Imagine that a species is limited by the duration of the growing season, but that the response is instead modeled with a combination of mean annual temperature and topographic position (aspect, slope, etc.). It is difficult to anticipate the shape of the multivariate surface that mimics the species' response to the proximal predictor.

Recommendations

Responses to proximal predictors over sufficiently large gradients should be relatively strong (Austin and references therein), and either simple or complex models should be able to identify these responses if complexity is suitably controlled. However, the extent to which the included set of predictors is proximal or distal may be unknown. Experimentation with complex and simple models may help test hypotheses about which predictors are more proximal, and thus potentially best encapsulated in a simple response curve, and which are more distal and better represented with more complex curves. 
As physiological mechanisms generally provide the best insights into how environmental gradients translate into demographic (and therefore population) patterns, informed physiological understanding can provide a valuable starting point (Austin , Kearney and Porter ).

Simple

Ecological theory supports using unimodal or skewed smooth responses to proximal variables (Austin and Nicholls , Oksanen , Austin , , Guisan and Thuiller , Franklin ), which motivates constraining the functional form of response curves a priori (section Complexity in models; e.g. specific features in a GLM, few nodes in a GAM). Remotely sensed data, even for proximal predictors, may introduce noise into the environmental covariates due to imprecision and the use of long‐term averaged data (Austin , Letten et al. ), and may be prone to over‐fitting with complex models if those data fail to describe the local habitat conditions accurately. One can use simple models to smooth over such idiosyncrasies if the main trends are sufficiently strong, or omit predictors if trends are weak. Parametric, latent‐variable models can help to deal with this imprecision (McInerny and Purves ).

Complex

Ecological theory is based on responses to idealized gradients, whereas we observe (often imperfectly) a messy reality. Specifying an overly simple model will result in over‐ and under‐estimation of the response at points throughout covariate space (Barry and Elith ). Given that the relationship between proximal and distal predictors is unlikely to be linear and may vary across landscapes, the true response to distal variables may also be complex and best represented by a model that allows flexible fits and interactions. Hence the complex viewpoint still adheres to ecological theory, but allows for a modified view of idealized relationships as seen through the available data. 
Spatial extents and resolution

Interpretation of ecological patterns is scale dependent; hence changing spatial extent and/or resolution affects the patterns and processes that can be modeled (Tobalske , Chave ). Ecologists often use hierarchical concepts to describe influences of environment on species distributions – for instance, that climate dominates distributions of terrestrial species at the global scale (coarsest grain, largest extent), while topography, lithology or habitat structure create the finer‐scale variation that impacts species at regional to local scales, together with dispersal limitation and biotic interactions (Boulangeat et al. , Dubuis et al. , Thuiller et al. ). SDMs built across large spatial extents often rely on remotely sensed, coarse‐resolution or highly interpolated predictors, creating inherent biases and sampling issues (section Sampling bias). The choice of extent can also determine whether the species' entire range is included in the model or whether data are censored (e.g. limited by political borders).

Recommendations

Resolutions should be chosen that provide data on proximal rather than distal variables. Such data are becoming available at high resolution with expanded and technologically enhanced monitoring networks and more sophisticated interpolation of climate data (e.g. PRISM). The choice of resolution hence reduces to the discussion of proximal versus distal predictors in section Predictor variables: proximal vs distal. When the extent is chosen to contain the species' entire range, models should include sufficient complexity to detect unimodal or skewed responses (section Complexity in models).

Simple

Smooth responses, characteristic of simpler models, are to be expected at large spatial extents and coarse resolutions that smooth over the confounding processes affecting finer‐resolution occurrence patterns (Austin ). 
At finer resolutions, it may also be undesirable to incorporate the full complexity of the response curve: much of the finer detail may derive from factors for which no predictor variables are available or that are irrelevant to the purpose of the investigation (e.g. microhabitat or regional competition effects).

Complex

At small spatial extents, we might have data on the relevant proximal factors (e.g. soil properties), so fitting complex models along small‐scale gradients can capture this complexity. Also, complex models may be useful for exploring the nonlinearities that arise in response curves from distal variables at broad scales, in that they potentially provide insight into important unmeasured variables.

Spatial autocorrelation

Many processes omitted from SDMs have spatial structure. For example, dispersal limitation, foraging behavior, competition, prevailing weather patterns, and even sampling bias can all lead to spatially structured occurrence patterns that are not explained by the set of predictors included in the SDM (Legendre , Barry and Elith , but see Latimer et al. , Dormann et al. ). When these spatial patterns are not appropriately accounted for, biased estimates of environmental responses may emerge.

Recommendations

If presence–absence data are available, one should assess the degree of spatial autocorrelation in the residuals and implement methods to control for it. Methods include spatially explicit models that separate the spatial pattern from the environmental response (Latimer et al. , Dormann et al. , Beale et al. ), using spatial eigenvectors as predictors (Diniz‐Filho and Bini ), or stratified sub‐sampling of the data to minimize autocorrelation (Hijmans ). Complex models should be used cautiously in the presence of spatial autocorrelation, because their flexibility may lead them to confound aggregation in geographic space with complexity in environmental space. 
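The recommended first step – assessing spatial autocorrelation in model residuals – is commonly done with Moran's I. A self‐contained sketch (Python, binary distance‐band weights; the simulated residual fields and the 1.5‐unit neighborhood are illustrative choices):

```python
import numpy as np

def morans_i(coords, resid, radius=1.5):
    """Moran's I of residuals using binary distance-band spatial weights:
    pairs closer than `radius` are neighbors, the diagonal is zero."""
    d = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1))
    W = ((d > 0) & (d < radius)).astype(float)
    z = resid - resid.mean()
    n = len(resid)
    return (n / W.sum()) * (z @ W @ z) / (z @ z)

rng = np.random.default_rng(4)
coords = rng.uniform(0, 10, size=(200, 2))
trend = coords[:, 0] - 5.0                            # smooth west-east structure
resid_structured = trend + 0.5 * rng.normal(size=200)
resid_shuffled = rng.permutation(resid_structured)    # same values, no structure

print(morans_i(coords, resid_structured))   # clearly positive
print(morans_i(coords, resid_shuffled))     # near zero (expectation -1/(n-1))
```

Strongly positive residual autocorrelation in a fitted SDM is the warning sign discussed above: some spatially structured process is missing from the predictor set.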
For example, if a large number of presences are recorded in a small region of environmental space due to social behavior in geographic space, a complex model is more likely to find some feature in environmental space that correlates with this clustering. This will result in biased interpretation, or biased mapped projections in other locations where the social behavior is absent. Cross‐validation can eliminate such spurious fits, but only if it is spatially stratified at an appropriate scale. However, when used for exploratory purposes, complex models may reveal information about this spatial structure within their response curves.

Simple

Simple parametric models can accommodate spatial structure under assumptions about the correlation structure (Latimer et al. , Dormann et al. ). If a non‐spatial model is used, simple models can be valuable because they are not flexible enough to model discontinuities in the response curve that derive from spatial structure; however, they will still exhibit bias due to aggregated observations. Another solution is to model at a sufficiently coarse resolution (suggesting simple models; see Spatial extents and resolution) that geographic clustering occurs within (and not among) cells, so it can effectively be ignored. One should be cautious building complex models because, in practice, obtaining spatially independent cross‐validation samples is extremely challenging when the underlying spatial process is unknown, and failing to do so likely leads to over‐fitting (cf. Hijmans ).

Complex

It may be desirable to use complex response curves as proxies for geographic clustering in mapping applications if the model focuses on small extents where nonlinear relationships are likely to hold across the landscape of interest (e.g. interpolation). 
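The spatially stratified cross‐validation mentioned above can be sketched by assigning whole grid blocks, rather than individual points, to folds, so that clustered observations do not straddle the train/test split (Python; the block size and fold count are arbitrary choices for illustration):

```python
import numpy as np

def spatial_block_folds(coords, block_size, n_folds=5, seed=0):
    """Assign each grid block (not each point) to a cross-validation fold,
    so spatially clustered points stay on the same side of every split."""
    rng = np.random.default_rng(seed)
    cells = np.floor(coords / block_size).astype(int)   # grid cell of each point
    _, block_id = np.unique(cells, axis=0, return_inverse=True)
    block_id = block_id.ravel()
    fold_of_block = rng.integers(0, n_folds, block_id.max() + 1)
    return fold_of_block[block_id]

coords = np.random.default_rng(5).uniform(0, 100, size=(300, 2))
folds = spatial_block_folds(coords, block_size=20.0)
print(np.bincount(folds, minlength=5))   # points per fold: uneven, by design
```

Choosing the block size remains the hard part: it should exceed the range of the residual spatial autocorrelation, which is exactly what is unknown when the underlying spatial process has not been characterized.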
For example, Santika and Hutchinson ( ) showed that restricting logistic regression to linear responses, rather than allowing unimodal responses as in semi‐parametric GAMs, reduced model performance and introduced spatial autocorrelation in the residuals. Methods for dealing with spatial and temporal autocorrelation have recently become available for complex models as well (Hothorn et al. , Crase et al. ).

Conclusions

Methodological

Based on our observations on the appropriate use of different statistical methods and modeling decisions, how should modelers proceed to build SDMs? Many modelers' preferences for particular statistical methods derive from the types of data they typically use and the questions they ask, rather than from any fundamental philosophy of statistical modeling. For this reason, it is valuable for modelers to have experience with both simple and complex modeling strategies. We suggest that researchers develop a comprehensive understanding of regression models in general and GLMs in particular, as these represent the foundation of almost all of the more complex modeling frameworks. Understanding at least one approach to building complex SDMs also allows for sequential tests of more complex model structure. Importantly, because there are many different approaches to handling the same challenges in the data, it is less critical to understand each and every modeling approach than to become expert in applying representatives of simple and complex approaches. Bias can come from over‐fitting complex models, and it can come from misspecified simple models. To find a model of optimal complexity, many approaches are possible and are readily justified if sufficient cross‐validation has been performed. One might consider starting simple and adding the minimum complexity necessary (Snell et al. , this issue), or conversely starting with a complex model and removing as much superfluous complexity as possible. 
If one can use the considerations discussed here to narrow the potential complexity to models within a particular modeling approach (Table ), then traditional model selection techniques are appropriate (section Modeling decisions). Due to the exploratory nature of many SDMs and the desire to discover spatial patterns and their drivers, we recommend that analyses begin with complex models to determine an upper bound on the complexity of response curves. Over‐fitting can be controlled through cross‐validation (e.g. k‐fold, and particularly block resampling methods), even if a full decomposition into train–validation–test data is not feasible. Furthermore, complex models can be used to identify smooth, simple occurrence–environment relationships if patterns are sufficiently strong, and thereby guide specification of simpler models. In contrast, it will be more difficult to overcome a misspecified simple model should a more complex response exist. If exploration with complex models reveals smooth relationships, one can shift to a simpler model. If instead strong nonlinearities are prevalent, one should consider biological explanations for them. If complex nonlinearities cannot be avoided, one should focus on minimizing the complexity, understanding it through sensitivity and uncertainty analysis (below), and providing biologically based hypotheses about it. The end result is a model that adds complexity only to the extent necessary to reproduce observed patterns. Uncertainty analysis is a relatively untapped resource for understanding appropriate model complexity. When the influence of particular model components is unknown (e.g. whether a predictor or feature is relevant a priori), it is particularly critical to account for uncertainty in modeled relationships to explore the implications of our ignorance. 
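One inexpensive uncertainty analysis of the kind advocated here is bootstrapping the fitted response curve and inspecting pointwise intervals. A hedged sketch (Python; a least‐squares refit on the 0/1 response stands in for a full GLM refit, so predicted values can stray outside [0, 1]; data and settings are illustrative):

```python
import numpy as np

def boot_response_band(x, y, grid, n_boot=200, seed=0):
    """Pointwise 95% bootstrap band for a quadratic response curve.
    Tight intervals around a pronounced nonlinearity lend it support."""
    rng = np.random.default_rng(seed)
    G = np.column_stack([np.ones_like(grid), grid, grid ** 2])
    preds = np.empty((n_boot, grid.size))
    for i in range(n_boot):
        idx = rng.integers(0, x.size, x.size)          # resample sites with replacement
        Xb = np.column_stack([np.ones(x.size), x[idx], x[idx] ** 2])
        coef, *_ = np.linalg.lstsq(Xb, y[idx], rcond=None)
        preds[i] = G @ coef
    return np.percentile(preds, [2.5, 97.5], axis=0)

rng = np.random.default_rng(6)
x = rng.uniform(-2, 2, 500)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(1.0 - 2.0 * x ** 2)))).astype(float)
grid = np.linspace(-2, 2, 41)
lo, hi = boot_response_band(x, y, grid)
print(round(lo[20], 2), round(hi[20], 2))   # interval near the fitted optimum
```

If the lower band at the modeled optimum clearly exceeds the upper band at the gradient extremes, the unimodality is supported by the data rather than an artifact of flexibility.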
By studying uncertainty, one can gain confidence in pronounced nonlinearities when they come with tight confidence intervals. Information on parameter uncertainty, and consequently prediction uncertainty, can be obtained from any means of simulating from parameter distributions, including posterior sampling, sampling based on point estimates and covariance matrices, or bootstrapping. Bayesian models have the advantage of using the full data set to estimate parameter uncertainty, but are generally restricted to simpler models to avoid convergence issues (Latimer et al. , Ibáñez et al. ). One way of reducing uncertainty in predictions is to analyze the importance of predictors given the model and data using ‘average predictive comparisons’ (Gelman and Pardoe ), a form of sensitivity analysis that incorporates parameter uncertainty. One can also quantify uncertainty due to modeling decisions by using ensembles of models built with different statistical methods or decisions (Pearson et al. , Araújo and New , Thuiller et al. ), provided that each component model is built from modeling decisions reflecting a common goal.

Biological

Despite the valuable insights we can gain from occurrence models, it is worth acknowledging that fundamental limitations to biological inference may emerge from these studies (Tyre et al. , Araújo and Guisan , Araújo and Peterson , Merow et al. ). Balancing complex and simple models so as to discover and discuss these limits may be as important as the actual patterns identified with some datasets. More broadly, it is important to keep in mind that we are ultimately performing exploratory analyses of occurrence–environment relationships. Occurrence records are not the ideal data for predicting attributes of populations; Thuiller et al. ( ) provide an interesting cautionary note by showing weak relationships between occurrence probability and various demographic parameters for 108 tree species in temperate forests. 
However, often no other data are available at large spatial extents that might inform range models. Thus, while the limits may be obvious, insights from occurrence‐based correlative models may be an essential step in developing new hypotheses and research programs that can lead to the next generation of mechanistic models (Schurr et al. , Thuiller et al. , Snell et al. ). A novel, and potentially important, application of SDMs is informing mechanistic models about the shapes of response curves, for instance in demographic models (Merow et al. ) or dynamic spatio‐temporal population models (Pagel and Schurr , Boulangeat et al. , Thuiller et al. ). Simple models may be preferable for these tasks because it is important to have a clear hypothesis to evaluate when linking a response curve to a particular process (Thuiller et al. ). For example, SDMs might inform variable selection for the growth, survival and fecundity models in Integral Projection Models (Easterling et al. ). However, highly nonlinear relationships would not be desirable for vital‐rate models because of the unlikely life‐history transitions they might imply (cf. Merow et al. ). It is particularly important to avoid confounding missing processes with complex environmental responses (as might occur in complex models) when the mechanistic model explicitly describes the mechanisms that produce aggregation (e.g. dispersal or species interactions: Kissling et al. ). The challenge in using SDMs in this way lies in ensuring that response curves truly reflect environmental limitations; while environmental tolerance may limit a species' distribution at one end of a gradient, other (e.g. biotic) factors may limit it at the other end (Zimmermann et al. ). Many issues of response‐curve complexity that we discuss are also relevant for process‐based SDMs. Representations of processes are incorporated into SDMs to improve precision and accuracy, or to improve our understanding of ecological processes. 
Consequently, process‐based models are used more for prediction and hypothesis testing than for description and hypothesis generation. Yet preferences for different model complexity persist (Evans et al. , Lonergan et al. ). Study objectives influence the choice of complexity – i.e. whether the model is intended for extrapolation or for understanding the potential importance of mechanisms. For understanding, simple models can make the study of the role of a mechanism more analytically tractable, while preference might instead be towards more complex models when the roles of specific mechanisms must be understood in relation to other interconnected mechanisms. When the objective is prediction, complex models are valuable for representing all known relevant mechanisms in order to obtain the ‘best guess’. Simpler models are valuable when analyses imply that only certain key mechanisms are needed for sufficient predictive accuracy (further discussion in Evans et al. ). Attributes of the available data may be less important with process‐based models when relevant test datasets are well understood. However, data considerations are important when mechanisms or parameters are inferred from data, or when assessing the spatiotemporal resolution over which particular degrees of abstraction and parameter values are relevant (Evans et al. , Lonergan , Snell et al. ). In any case, we expect that progress towards improved process‐based models lies in challenging occurrence‐based SDMs with stronger biological justifications and interpretations that aim to shed light on the mechanisms that drive process‐based models.

Acknowledgements

This study arose from two workshops entitled ‘Advancing concepts and models of species range dynamics: understanding and disentangling processes across scales’. Funding was provided by the Danish Council for Independent Research | Natural Sciences (grant no. 10‐085056 to SN). 
CM acknowledges funding from NSF grants 1046328 and 1137366. WT acknowledges support from the European Research Council under the European Community's Seventh Framework Programme FP7/2007–2013 Grant Agreement no. 281422 (TEEMBIO). RW acknowledges support from the Swiss National Science Foundation (Synergia Project CRS113‐125240, Early Postdoc Mobility Grant PBZHP3_147226). JE acknowledges funding from the Australian Research Council (grant FT0991640). TE states that any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.




Publisher: Wiley
Copyright: Ecography © 2014 Nordic Society Oikos
ISSN: 0906-7590
eISSN: 1600-0587
DOI: 10.1111/ecog.00845

The variation among response curves from different modeling paradigms and different model settings suggests that they are suitable for different study objectives and attributes of the data. 
Response curves come from fitting SDMs to presence/background data on the overstory shrub, Protea punctata, from the Cape Floristic Region of South Africa (see Merow et al. for details of the data) with different degrees of control over the complexity of the fitted response curves. All models were constructed using the biomod2 package (Thuiller et al. ) within the statistical software R (R Core Team). Response curves of different complexity are shown which are representative of those commonly observed during SDM building. Dark grey curves were fitted using settings at or near the default options in biomod2 (for illustration), with the exception of forcing the package to perform only a single fit per method using all of the presence data in model fitting. Black (light grey) curves were fitted by choosing options to make the fitted response curves simpler (more complex). Note that the complexity of any of these paradigms is affected by changing the number of predictors, the order of interactions, and model averaging; hence these decisions are not explicitly included in the table.

– Bioclimatic envelope models (BIOCLIM). Responses are built from: quantiles, between which occurrence probability is 1. Features: step functions. Complexity controlled by: quantiles.
– Generalized linear models (GLM). Responses are built from: parametric terms specified by the user. Features: polynomials, piecewise functions, splines. Complexity controlled by: feature complexity specified by the user.
– Generalized additive models (GAM). Responses are built from: a combination of parametric terms and flexible smooth functions suggested by the data or the user. Features: parametric terms as in GLMs and various smoothers (e.g. splines, loess). Complexity controlled by: number of nodes; penalties.
– Multivariate adaptive regression splines (MARS). Responses are built from: the sum of multiple piecewise basis functions of predictors suggested by the data. Features: splines. Complexity controlled by: number of knots; cost per degree of freedom; pruning.
– Artificial neural networks (ANN). Responses are built from: networks of interactions between simple functions of predictors suggested by the data. Complexity controlled by: number of hidden layers.
– Classification and regression trees (CART). Responses are built from: repeated partitioning of predictors into different categories, suggested by the data, associated with different occurrence probabilities. Features: threshold, with implicit interactions. Complexity controlled by: minimum observations for a split/terminal node; maximum node depth; complexity threshold to attempt a split.
– Random forests (RF). Responses are built from: an average of multiple CARTs, each constructed on bootstrapped samples of the data and using different random subsets of the full predictor set. Features: threshold, with implicit interactions. Complexity controlled by: see CARTs; number of trees.
– Boosted regression trees (BRT). Responses are built from: regression trees fitted at multiple steps; at each step, the model fits the residuals from the sum of all previous models weighted by the learning rate. Features: threshold, with implicit interactions. Complexity controlled by: see CARTs; number of trees; learning rate.
– Maximum entropy (MAXENT). Responses are built from: a GLM with a large number of features, which are suggested by the data or the user. Features: linear, quadratic, interaction, hinge, threshold. Complexity controlled by: feature classes used; regularization penalty.

Complexity is a fundamental feature of observed occurrence patterns because occurrence–environment relationships may be obscured by processes that are not exclusively related to the environment, such as dispersal, response to disturbance, and biotic interactions (Pulliam , Holt , Boulangeat et al. ). Consequently, SDMs can be dynamic and process‐based, explicitly representing aspects of the underlying biology. This paper focuses on the more widely used static, correlative SDMs, although many of the issues considered relate to process‐based SDMs as well.
Describing this complexity is critical for many applications of SDMs, and using flexible occurrence–environment relationships allows biologists to hypothesize about the drivers of complexity or make accurate predictions that derive from their representation in SDMs. Such hypotheses are a valuable step toward the types of process‐based models discussed in this issue (Merow et al. , Snell et al. ). However, building complex models comes with the challenge of differentiating true complexity from noise (see chapter 7 in Hastie et al. for a statistical viewpoint on optimising model complexity). Some believe that flexible models are often overfit to the noise prevalent in many occurrence data sets. Thus, with such variation in both needs and opinions regarding model complexity, many modeling approaches are in current use (Table ). We characterize model complexity by the shape of the inferred occurrence–environment relationships (Table ) and the number of posited predictors and parameters used to describe them. A simpler model typically has relatively fewer parameters and fewer relationships among predictors compared to a more complex model. However, it remains a challenge to quantify complexity in a way that is appropriate across the spectrum of modeling approaches in Table (e.g. Janson et al. showed effective degrees of freedom to be an unreliable metric when defining complexity). Univariate ‘response curves’ are commonly used to give an impression of the complexity of the predicted occurrence–environment relationships. These are one‐dimensional ‘slices’ of multivariate space. The most common approach is to plot the predicted occurrence probability against the predictor of interest by holding all other predictors at their mean or median values (Elith et al. ; Table ), although other approaches are possible (Fox , Hastie et al. ). 
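The most common approach just described – varying one predictor over its observed range while holding the others at their medians – can be sketched in a few lines. This is an illustrative sketch in Python for concreteness, not the workflow of any particular SDM package; `fitted_model` and `toy_model` are hypothetical stand‐ins for any function that returns a predicted occurrence probability from a vector of predictor values:

```python
import math
import statistics

def response_curve(fitted_model, X, var_index, n_points=50):
    """Univariate 'slice' of a fitted model: vary one predictor over its
    observed range while holding all other predictors at their medians."""
    medians = [statistics.median(col) for col in zip(*X)]
    lo = min(row[var_index] for row in X)
    hi = max(row[var_index] for row in X)
    grid, curve = [], []
    for i in range(n_points):
        x = lo + (hi - lo) * i / (n_points - 1)
        point = list(medians)        # all predictors at their medians...
        point[var_index] = x         # ...except the one being varied
        grid.append(x)
        curve.append(fitted_model(point))
    return grid, curve

# Hypothetical 'fitted model': a logistic response to two predictors.
toy_model = lambda p: 1 / (1 + math.exp(-(0.8 * p[0] - 0.1 * p[1])))
X = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0], [4.0, 0.0]]
grid, curve = response_curve(toy_model, X, var_index=0, n_points=5)
```

Because the toy model responds positively to the first predictor, the resulting curve rises monotonically across the grid; for a real fitted SDM the same procedure traces out whatever shape the model has learned.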
When visualized in this way, a simpler model is relatively smooth, containing fewer inflection and turning points than a more complex model. Though insightful, univariate curves represent the true fitted response only incompletely (3‐dimensional response surfaces or the ‘inflated response curves’ of Zurell et al. ( ) help here). Complex models contain more interactions than simpler models, and these can only be visualized on higher‐dimensional surfaces. Such responses must be interpreted as conditional on the other predictors being held at their mean or median values, and may differ from the responses obtained when those variables are held at other values (Zurell et al. ), or from an unconditional model. Nonetheless, uni‐ and multivariate response curves remain one of the best standardized ways to assess relative model complexity.

In this paper, we develop general guidelines for deciding on an appropriate level of complexity in occurrence–environment relationships. Uncertainty about how best to describe ecological complexity has to some extent divided biologists between those who prefer to use the principle of parsimony to identify model complexity (preferring the simplest model that is consistent with the data) and those who try to approximate more of the complexity of real‐world relationships. We review the literature and the general modeling principles emerging from these two viewpoints, and we discuss the ways in which they overlap or differ in light of study objectives and attributes of the data. We make a variety of recommendations for choosing levels of complexity under different circumstances, while highlighting unresolved scenarios where viewpoints differ. We conclude with suggestions for drawing on the strengths of each modeling approach in order to advance our knowledge of current and future species' geographical ranges.
Complexity in ecology

Many interacting biotic and abiotic processes influence species distributions and can manifest as complex occurrence–environment relationships (Soberón , Boulangeat et al. ). One essential challenge to recovering the primary environmental drivers of these distributions, however, is to differentiate the signals of range determinants from sampling and environmental noise. Before embarking on statistical analyses of range determinants, ecological theory can focus an investigation (Austin , , , Pulliam , Chase and Leibold , Holt ). There is, a priori, a set of common drivers of populations that can be used to propose general shapes of occurrence–environment relationships. For example, we expect that for many variables, response curves describing a fundamental niche should be smooth because sudden jumps in fitness along an environmental gradient are unlikely to exist (Pulliam , Chase and Leibold , Holt ). For other variables, e.g. those related to thermal tolerance, steep thresholds may exist due to loss of physiological function (Buckley et al. ). However, response curves describing realized niches might exhibit discontinuities due to the multiple interacting factors that can limit a species' occurrence in any particular location. Unimodal responses (e.g. a bell‐shaped curve) are expected because conditions too extreme for survival often exist at either end of a proximal gradient (Austin ). However, response curves can be linear where only part of the environmental range of the species has been sampled (e.g. one side of a unimodal response; Albert et al. ). Austin and Smith's ( ) continuum concept for plant species distributions predicts that skewed unimodal response curves are likely when plant species distributions are predominantly determined by one or a few environmental variables that strongly regulate survivorship and/or reproduction (e.g.
by temperature thresholds), but that more irregular response curves are expected given that species are influenced by a range of regulatory factors (e.g. different limiting nutrients, biotic and abiotic interactions) and historical contingencies (Austin et al. , Normand et al. ). Even with single factors, the processes that determine fitness may differ across the range, e.g. where one temperature extreme leads to abrupt loss of function while the other extreme causes gradually reduced performance. Interaction terms can be desirable to capture covariation between predictors or tradeoffs along resource gradients (e.g. higher temperatures are tolerable with greater rainfall). Many applications of SDMs do not explicitly consider such theoretical constraints on the shape of response curves (but see Santika and Hutchinson ), perhaps because it is difficult to work out how they translate into observations. We are faced with the challenge of inferring unknown levels of ecological complexity through the lens of data and models that imperfectly capture it.

Complexity in models

Two attributes of model fitting determine the complexity of inferred occurrence–environment relationships in SDMs: the underlying statistical method and the modeling decisions made about inputs and settings. Together, these define what we will call different modeling approaches, a number of which are illustrated in Table .

Statistical methods

One of the primary differences among the available statistical methods for fitting SDMs is the range of transformations of predictors that they typically consider (in machine learning parlance: which ‘features’ to allow), and this helps to define the upper limit of complexity for their fitted response surfaces. We detail commonly used modeling approaches and demonstrate examples of their response curves in Table . Rectilinear or convex‐hull environmental envelopes (e.g. BIOCLIM or DOMAIN) and distance‐based approaches in multivariate environmental spaces (e.g.
Mahalanobis) are used in the simplest SDMs. Their response curves are simple functions (e.g. linear, hinge or step; Elith et al. ). Generalized linear models (GLMs), which for SDMs are typically fitted with linear or polynomial features up to second‐order terms (rarely third or fourth order), and often without interactions, admit more complexity. Generalized additive models (GAMs) are potentially more complex because they allow non‐parametric smooth functions of variable flexibility (Hastie and Tibshirani , Wood ). Decision trees (Breiman et al. ) can also become quite complex because they can use a large number of step functions (each requiring a parameter) and can implicitly include high‐order interaction terms to depict response curves of arbitrary complexity.

Modeling decisions

Decisions that affect model complexity apply to all the statistical methods described above. For example, if a large set of predictors is available, then model complexity will differ depending on whether the full set, or a small subset, is used. One must also determine which features are considered in the model. Each feature requires at least one parameter in the occurrence–environment relationship and hence increases model complexity (see the increased complexity of black vs grey MAXENT response curves due to the increased number of features; Table ). Large numbers of predictors are more commonly used in machine‐learning approaches because they automate feature selection, whereas fewer are often used in simpler models where features are specified a priori. For example, maximum entropy models (MAXENT) can consider any number of linear, quadratic, product, threshold (step function) or hinge transformations of the predictors (Phillips et al. , Phillips and Dudik ). In principle, this same complexity could be fitted in a traditional GLM, but this is typically impractical and not of interest to ecologists.
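To illustrate how such transformations multiply model complexity, the following sketch expands two predictors into MAXENT‐style candidate features. It is a simplified illustration, not MAXENT's actual algorithm: the knot positions are arbitrary here, whereas MAXENT places them from the data and then uses regularization to zero out most candidate features.

```python
def expand_features(x1, x2, knots=(0.25, 0.5, 0.75)):
    """Illustrative MAXENT-style feature expansion for two predictors.
    Each transformation contributes candidate features; regularization
    then decides which features receive non-zero weight."""
    feats = {
        "linear_x1": x1, "linear_x2": x2,         # linear features
        "quad_x1": x1 ** 2, "quad_x2": x2 ** 2,   # quadratic features
        "prod_x1x2": x1 * x2,                     # product (interaction)
    }
    for k in knots:
        feats[f"hinge_x1_{k}"] = max(0.0, x1 - k)         # hinge feature
        feats[f"thresh_x1_{k}"] = 1.0 if x1 > k else 0.0  # threshold (step)
    return feats

f = expand_features(0.6, 0.2)
```

Even in this toy version, two predictors and three knots already yield eleven candidate features for a single variable pair, which is why each feature class admitted to the model raises its potential complexity.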
SDM complexity is amplified when interactions between predictors are included to account for nonadditive relationships. GLMs and GAMs can include interactions that have been specified during model formulation as potentially ecologically relevant, but these are usually used only sparingly. Decision trees include interactions implicitly through their hierarchical structure; i.e. the response to one variable depends on the values of inputs higher in the tree, meaning that high‐order interaction terms (that depend on all the predictors along a branch) are possible. However, interactions between variables are fitted automatically if supported by the data and cannot be explicitly controlled by the user (except to specify the permissible order of the interactions considered). Using ensembles of models can increase or decrease complexity. Ensembles are combinations of models in which the component models can be chosen based on selected criteria (e.g. predictive performance on held‐out data; Araújo and New ) or with an ensemble algorithm (a machine learning method). For instance, regression models selected via an information criterion can be combined using ‘multi‐model inference’, allowing distributions over effect sizes and over predictions to new sites (Burnham and Anderson ). A typical machine learning approach to ensembles uses an algorithm to build an ensemble of simple models that together predict better than any one component model. Examples include bagging and boosting – while these can be used with any component models, in ecology the most used component models are decision trees (e.g. in random forests, Breiman 2001; and boosted regression trees, Friedman ). Bagging (bootstrap aggregation) can be used to fit many models to bootstrapped replicates of the dataset (with or without random subsetting of the predictors used across trees, as in random forests).
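The bagging idea can be sketched generically. In this illustrative sketch, `fit` is a placeholder for any base‐learner training function (one that returns a prediction function); real implementations such as random forests additionally subsample the predictors within each tree.

```python
import random

def bag_predict(fit, X, y, x_new, n_models=25, seed=0):
    """Bootstrap aggregation: fit the base learner to bootstrap
    replicates of the data and average the resulting predictions."""
    rng = random.Random(seed)
    n = len(X)
    preds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        model = fit([X[i] for i in idx], [y[i] for i in idx])
        preds.append(model(x_new))
    return sum(preds) / len(preds)

# A trivial base learner: predict the mean occurrence in its bootstrap sample.
mean_learner = lambda Xb, yb: (lambda x: sum(yb) / len(yb))
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 1, 1, 1]
p = bag_predict(mean_learner, X, y, x_new=[1.5])
```

Averaging across bootstrap replicates is what smooths (or, for very simple base learners, enriches) the combined response surface relative to any single component model.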
In contrast, boosting uses a forward stagewise method to build an ensemble, at each step modeling the residuals of the models fitted to date. Taking ensembles of relatively simple models usually increases complexity because combinations of simple models will not necessarily be simple. In contrast, ensembles of more complex models can average over the idiosyncrasies of individual models to produce smoother response curves (Elder ).

Model comparison

To avoid overfitting and underfitting, it is common to compare models of differing complexity and select the model that optimizes some measure of performance. However, comparing models across modeling approaches (e.g. those in Table ) can be challenging. This is one of our motivations for constraining model complexity based on study objectives and data attributes. Information‐theoretic measures are a conventional way to choose model complexity and are relatively easy to apply for models where estimating the number of degrees of freedom is possible. However, these cannot be calculated for ensemble‐based methods nor for many other methods in common use (Janson et al. ). In fact, Janson et al. ( ) warn, ‘contrary to folk intuition, model complexity and degrees of freedom are not synonymous and may correspond very poorly’. One way to compare models produced by different algorithms is to adopt a common currency for model performance by evaluating model predictions on either the training data or independent testing data. Measures such as AUC, Cohen's Kappa, and the True Skill Statistic are based on correctly distinguishing presences from absences. Measures based on non‐thresholded predictions are also relevant and preferable in many situations (Lawson et al. ). However, each of these metrics has weaknesses in different circumstances (Lobo et al. ) and, further, they represent only heuristic diagnostics for presence‐only data, because presences must be compared to pseudoabsence/background data (Hirzel et al. ).
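For concreteness, AUC has a simple ranking interpretation: it is the probability that a randomly chosen presence receives a higher predicted score than a randomly chosen absence (or background point, in the presence‐only case). A minimal sketch of that definition:

```python
def auc(pred_presence, pred_absence):
    """Rank-based AUC: the probability that a randomly chosen presence
    is scored above a randomly chosen absence (ties count as 0.5)."""
    wins = 0.0
    for p in pred_presence:
        for a in pred_absence:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5
    return wins / (len(pred_presence) * len(pred_absence))

auc_perfect = auc([0.9, 0.8], [0.1, 0.2])  # all presences outrank all absences
```

This ranking view makes the presence‐only caveat above transparent: when the "absences" are background points that may include suitable habitat, the score measures discrimination against that background rather than against true absences.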
Once one has determined a suitable modeling approach, tuning the amount of complexity is more straightforward using a range of model selection techniques. Feature significance (e.g. p‐values), measures of model fit (e.g. likelihood), and information criteria (e.g. AIC, AICc, BIC; Burnham and Anderson ) can be applied to regression‐based methods. Cross‐validation or other resampling techniques are also used to set the smoothness of splines in GAMs (Wood ) or to determine tuning parameters in most machine learning methods (Hastie et al. ). Shrinkage or regularization is often used in regression, MAXENT and boosted regression trees to constrain coefficient estimates so that models predict reliably (Phillips et al. , Hastie et al. ). Loss functions, which penalize errors in prediction, can be constructed for any of the modeling approaches we consider (Hastie et al. ). An alternative approach employs null models to evaluate whether additional complexity has led to spurious predictive accuracy (Raes and ter Steege ). Evaluation against fit to training data alone cannot control for overfitting and risks selecting excessively complex models (Pearce and Ferrier , Araújo et al. ). In general, best practice involves splitting the data into training data to fit the model, validation data for model selection, and test data to evaluate the predictive performance of the selected model (Hastie et al. ). Recent studies have emphasized that care should also be taken in how data are partitioned into training, validation and test data, in particular to control for spatial autocorrelation (Latimer et al. , Dormann et al. , Veloz , Hijmans ; see below for more details). Hence methods such as block cross‐validation (where blocks are spatially stratified) are gaining momentum (Hutchinson et al. , Pearson et al. , Warton et al. ). Failure to factor out spatial autocorrelation in data partitioning can lead to misleadingly good estimates of model predictive performance.
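The spatial blocking idea can be sketched as follows, assuming planar coordinates on an arbitrary grid. This is a minimal illustration: the block size must in practice be tuned to the scale of spatial autocorrelation, and dedicated packages implement more sophisticated assignment schemes than the checkerboard rule used here.

```python
def block_folds(coords, block_size, n_folds):
    """Assign each (x, y) record to a cross-validation fold by spatial
    block, so that nearby points always end up in the same fold and
    training/validation data are spatially segregated."""
    folds = [[] for _ in range(n_folds)]
    for i, (x, y) in enumerate(coords):
        bx, by = int(x // block_size), int(y // block_size)
        folds[(bx + by) % n_folds].append(i)  # checkerboard-style assignment
    return folds

# Two nearby points (same block), two distant ones:
coords = [(0.5, 0.5), (0.6, 0.4), (1.5, 0.5), (2.5, 0.5)]
folds = block_folds(coords, block_size=1.0, n_folds=2)
```

Because fold membership is decided per block rather than per record, neighboring records can never be split between training and validation sets, which is what guards against the optimistic performance estimates described above.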
Basing model comparison on holdout data presents some practical challenges. Sample size may be insufficient to subset the data without introducing bias. Subsets of data can contain the same or different biases compared to the full data set. In particular, it can be difficult to remove spatial correlation between training and holdout data when the sampling design for the occurrence data is unknown or when a species is restricted geographically or environmentally (this is discussed below). Importantly, all these approaches to model comparison have strengths and weaknesses, and none can unambiguously select between models of differing complexity built with different statistical methods and underlying assumptions. The tried and tested model selection methods of statistics and machine learning are valuable when working within a particular modeling approach, but to benefit from them, it is valuable to narrow the scope of the feasible models based on biological considerations. We therefore now move to exploring approaches for identifying the appropriate level of complexity for particular study objectives based on data limitations and the underlying biological processes.

Philosophical, statistical and biological considerations when choosing complexity

In this section, we discuss factors that should influence the choice of model complexity. First, we outline general considerations and philosophical differences underlying both simple and complex modeling strategies (section Simple versus complex: fundamental approaches to describing natural systems). Next, we discuss how study goals (section Study objectives) and data attributes (section Data attributes) interact with model complexity. Figure summarizes our findings. Importantly, a general consensus for choosing model complexity is not possible in many cases.
To reflect the different schools of thought, we divide our facts, ideas and opinions into those that are relatively uncontroversial (subsections denoted ‘Recommendations’), those that favor simple models (denoted ‘Simple’), and those that favor more complex models (denoted ‘Complex’). We recall that ‘simple’ and ‘complex’ refer to the extremes along a gradient of complexity in response curves produced by distinct statistical methods and modeling decisions (section Complexity in models and Table ).

Figure legend: Influence of study objectives and data attributes on the choice of model complexity. Green arrows illustrate attributes where the choice of complexity is of no particular concern. Red arrows illustrate situations where caution and/or experimentation with model complexity is needed. Gray arrows indicate decisions that involve interactions with other study goals or data attributes. The thickness of the arrows illustrates the strength of the arguments in favor of choosing a specific level of complexity, with thicker arrows indicating stronger arguments.

Simple versus complex: fundamental approaches to describing natural systems

Simple

Simple models tend towards a conservative, parsimonious approach and typically avoid over‐fitting. They link model structure to hypotheses that posit occurrence–environment relationships a priori and examine whether the resulting model meets these expectations. Simple models have greater tractability, can facilitate the interpretation of coefficients (cf. Tibshirani ), can help in understanding the primary drivers of species occurrence patterns, and are likely to be more easily generalized to new data sets (Randin et al. , Elith et al. ). Although complex responses surely exist in nature, we often cannot detect them because their signal is weak or because they are confounded with sampling noise, bias or spatial autocorrelation.
By using models that are too complex, one can inadvertently assign patterns due to data limitations or missing processes, or both, to environmental suitability, and fit the patterns simply by chance.

Complex

Complex models are often semi‐ or fully non‐parametric, and are preferred when there is no desire to impose parametric assumptions or specific functional forms, or to pre‐select predictors a priori. This does not mean that they are not biologically motivated, but rather emphasizes the reality that Nature is complex. Simple models may be readily interpretable but misleading (Breiman ), and for many applications of SDMs a preference for predictive accuracy in new data sets over interpretability is justifiable. Also, complex models are not necessarily difficult to interpret. Indeed, their complexity can be valuable for suggesting novel, unexpected responses. If we do not explore the full spectrum of complexity, there is a risk of obtaining an overly simplified, or even biased, view of ecological responses. Complex models can, depending on how they are structured, still identify simple relationships if responses are strong and robust.

Study objectives

Niche description vs range mapping

Two prominent applications of SDMs are characterizing the predictors that define a species' niche and projecting fitted models across a landscape. Niche characterization quantifies the variables, primarily climatic and physical, that affect a species' distribution. This is often done by analyzing response curves, the functions (coefficients or smoothing terms) that define them, and their relative importance in the model. Projecting these fitted models across a landscape can predict the geographic locations where the species may occur in the present or in the future. In some studies, the focus lies in the final mapped predictions rather than in how they derive from the underlying fitted models.
Recommendations

Some evaluation of the biological plausibility of the shape and complexity of response curves is always valuable, even if the objective is not niche description. Such evaluation is particularly critical for extrapolation (section Interpolate vs extrapolate), though it is admittedly quite challenging in multivariate models. Modelers should also carefully evaluate whether maps built from complex models substantially differ from maps built from simple models. If the predictions differ, the source of the difference should be explored. If the interest lies in interpretation, it is important to assess whether the mapped predictions are right for the right reason, and that complex environmental responses have not become proxies for sources of spatial aggregation in the data that lead to bias when projected to other locations (whether interpolating or extrapolating; section Spatial autocorrelation).

Simple

Simple models are preferable for niche description because they usually yield straightforward, smooth response curves that can be linked directly to ecological niche theory (section Complexity in models; Austin ), in contrast to the often irregular shapes that result from complex models (Table ). Assumptions about species responses are more transparent when simple models are projected in new situations.

Complex

Complex models can be valuable for describing a species' niche when only qualitative descriptors of response curves are necessary (e.g. positive/negative, modality, relative importance) – i.e. even complex responses can be described in terms of main trends. Allowing complexity might offer more chance of identifying relevant response shapes. Complex models can be powerful for accurately mapping within the fitting region (Elith et al. , Randin et al. ) when one is not necessarily concerned with an ecological understanding of the complexity of the underlying models.
Although the source of complex relationships may remain unknown, complex models have the flexibility to describe them. Abrupt steps in response curves might help to uncover strictly unsuitable sites when mapping a distribution in space.

Hypothesis testing vs hypothesis generation

Some SDM studies are focused on testing specific hypotheses related to how species are distributed in relation to particular predictors or features. In others, little is known about the predictors shaping the distribution and the objective is to explore occurrence–environment relationships and generate hypotheses for explanation. For example, SDMs are valuable exploratory analyses for detecting the processes that confound occurrence–environment relationships, such as transient dynamics, dispersal, biotic interactions, or human modification of landscapes. The indirect effect of such processes can be seen in occurrence patterns, often as abrupt changes or nonlinearities in response curves, leading to hypothesis generation. Whether one is testing or generating hypotheses critically affects the level of complexity permitted, because hypothesis testing depends on being able to isolate the effects of particular features, whereas this matters less when exploring data in order to generate hypotheses.

Recommendations

When testing hypotheses, insights from ecological theory can guide the selection of features to include. A higher degree of control over the specific details of the underlying response surface is likely needed for hypothesis testing, which is made much easier using simple models. Hypothesis testing is more challenging in complex models with correlated features that can trade off with one another. Complex models are well suited to hypothesis generation, enabling a wider range of environmental covariates and modeling options than can be conveniently explored with simple models.
Simple

When the goal is hypothesis testing, simple parametric models allow investigation of the strength and shape of relationships between species occurrence and a small set of features. Furthermore, parametric models allow hypothesis tests to examine whether specific nonlinear features should be included in the selected model(s). The problem with complex models in such a setting is that, with the large suite of potential features that they use, it is challenging to determine the significance of a single feature or attribute of the response curve, or to compare alternative models. Instead, one is constrained to accept the features selected by the statistical method (e.g. feature classes in MAXENT; splits in tree‐based methods) to represent a given predictor (within some user‐specified bounds). It is preferable instead to specify a set of features (or multiple sets for competing models) to determine their suitability for describing a particular pattern. For example, when features are selected automatically, it may be challenging to determine whether a quadratic term that makes the response unimodal is important, or how much better/worse the model might be without it.

Complex

The starting premise for hypothesis testing is a priori ecological understanding that enables the user to select a small set of features. However, we do not always have this prior understanding. Complex models explore much larger sets of nonlinear features and interactions than simple models and are suited to generating hypotheses about underlying processes (Boulangeat et al. ) derived from potentially flexible responses that would often not be detected with simpler models (e.g. bimodality). This same flexibility can be used to augment existing knowledge. For example, if we know that a species is associated with dry, high‐elevation locations, we don't need a simplified model to describe this, but rather more insight from a potentially complex model to capture bimodality or strong asymmetries.
Complex models also provide tools for evaluating predictor importance, which is useful for both generation and testing of hypotheses and can lead to inference that differs little from simpler models (Grömping ). These importance indices can be generated from permutation tests (Strobl et al. , Grömping ), contribution to the likelihood (e.g. ‘percent contribution’ in MAXENT), or the proportion of deviance explained (decision trees).

Interpolate vs extrapolate

When predicting species' distributions over space and time, it is important to distinguish between interpolation and extrapolation. When a point is interpolated by a fitted model, it lies within the known data range of the predictors but was not measured for its response. Alternatively, an extrapolated point is one that lies outside the observed range of the predictors. Both interpolation and extrapolation can occur in geographic or environmental space (cf. Peterson et al. , Aarts et al. ). Extrapolation requires caution in all scenarios but cannot be avoided when assessing questions relating to ‘no‐analogue’ climate scenarios (Araújo et al. ) or range expansion. The correlative models discussed here are not optimal for extrapolation in many cases; process‐based models are generally preferred because the functional form of the response curve captures the processes that apply beyond the range of observed data (Kearney and Porter , Thuiller et al. , Merow et al. ).

Recommendations

The challenges associated with interpolation and extrapolation, though differing in the way they manifest, are apparent for models of any complexity, and hence the simple and complex perspectives align. Interpolation within the range of the observed data will be accurate if the model includes all processes operating in the interpolation extent and is based on well‐structured data. Without that, prediction to unsampled sites will average across unrepresented processes and might reflect biases in the sample.
More generally, it may not matter whether a response curve is complex as long as it retains the basic qualities of a simpler model. For example, a line or a sequence of small step functions parallel to the line can produce similar predictions. Some caution should be taken with complex models, as complex combinations of features can become proxies for unmeasured spatial factors in unintended ways and inadvertently model clustering in geographic space as complexity in environmental space, which can lead to errant interpolation (section Spatial autocorrelation). Extrapolation always requires that response curves have been checked for biological plausibility (cf. section Niche description vs range mapping). Of course, even simple models can extrapolate poorly. For example, Thuiller et al. ( ) showed that a simple GLM or GAM run on a restricted and incomplete range could create spurious termination of the smoothed relationships, leading to errant extrapolation. Hence, the importance of extrapolation can depend on the chosen spatial extent and on the selected features (section Spatial extents and resolution). Complex models should be carefully monitored at the edges of the data range, because both small sample sizes and the ways different statistical methods handle extrapolation can have drastic effects on predictions (Pearson et al. ). When using complex models, feature space may be sparsely sampled, which means that when one expects to interpolate a predictor, there may be inadvertent extrapolation of nonlinear features. For example, in a model with interaction terms, one may adequately sample the linear features for all predictors while poorly sampling the relevant combinations of these predictors (Zurell et al. ). Complex models can lead to different combinations of features producing similar model performance in the present (Maggini et al. ), but vastly diverging spatial predictions when transferred to other conditions (Thuiller , Thuiller et al. , Pearson et al.
, Edwards et al. , Elith et al. ). Narrowing the range of possibilities using a simpler model that controls for the biological plausibility of the response curves (cf. section Complexity in models) can reduce this divergence (Randin et al. ).
Data attributes
Sample size
The number of occurrence records is a critical limiting factor when building SDMs. With presence–absence data, the number of records in the least frequent class determines the amount of information available for modeling. Small sample sizes can lead to low signal to noise ratios, thereby making it difficult to evaluate the strength of any occurrence–environment pattern in the presence of confounding processes.
Recommendations
Simple models are necessary for species with few occurrences to avoid over‐fitting (Fig. ). This suggests few predictors and only simple features. Support for features can be found by reporting intervals on response curves (e.g. from confidence intervals or subsamples), with an eye for tight intervals around pronounced nonlinearities. For large data sets, any of the modeling approaches described earlier are potentially suitable, depending on study objectives.
Simple
We expect a large amount of noise in occurrence data due to processes unrelated to environmental responses, and this noise can be particularly influential when sample sizes are small. For example, if a basic temperature response is built from data that are variably influenced by a strong land‐use history and dispersal limitation throughout the range, a failure to take that into account results in a misspecified climate response surface. While simple models have a chance of smoothing over such variations, complex models can more readily fit these latent patterns, leading to biased prediction when models are projected to other locations where the latent processes differ. Complex models fitting many features are only appropriate when there are sufficient data to meaningfully train, test and validate the model (cf.
Hastie et al. ).
Complex
If data are available, increasing the number of predictors ensures a more accurate understanding of the drivers of distributions. If the data set is small, it is possible to use a method that can be potentially complex, as long as it is well controlled by the user to protect against over‐fitting, e.g. using penalized likelihoods (Tibshirani ), a reduced set of features in MAXENT (Phillips and Dudik , Merow et al. ), or heavy pruning in tree‐based methods. Permitting some complexity may be useful to identify counterintuitive response curves and develop stratified sampling strategies for future data collection to support or refute the model responses.
Sampling bias
Sampling bias arises from imperfect sampling design, which includes purposive, non‐probabilistic, or targeted sampling (Schreuder et al. , Edwards et al. ) and imperfect detection (MacKenzie et al. ). The important question is whether sampling bias – which often arises in geographic space – transfers to bias in environmental space, and further, whether some environments are completely unsampled. No statistical manipulation can fully overcome biased sampling. The main challenge when choosing complexity is that – particularly for models based on presence‐only data – it may be unclear whether patterns in environmental space derive from habitat suitability, divergence between the fundamental and realized niches (Pulliam ), transient behavior, or sampling problems (Phillips et al. , Hefley et al. , Warton et al. ). For presence–absence data with perfect detection, sampling biases may not be too detrimental as long as at least some samples exist across the environments into which the model is required to predict (Zadrozny , but see Edwards et al. for contrasting results).
Recommendations
More flexible models will be more prone to finding patterns in restricted parts of environmental space where sampling is problematic.
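The penalized-likelihood route mentioned above can be sketched with an L1 (lasso) penalty, which shrinks the coefficients of unsupported features exactly to zero so that a potentially complex model remains controlled on a small sample. A sketch on simulated data (predictor count and penalty strengths are illustrative, not recommendations):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 150                                # small sample, as discussed above
X = rng.normal(size=(n, 10))           # 10 candidate predictors
logit = 1.5 * X[:, 0] - 1.0 * X[:, 1]  # only two predictors truly matter
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Smaller C means a stronger L1 penalty and hence a simpler model
strong = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)
weak = LogisticRegression(penalty="l1", C=10.0, solver="liblinear").fit(X, y)

print("nonzero coefficients, strong penalty:", int(np.sum(strong.coef_ != 0)))
print("nonzero coefficients, weak penalty:", int(np.sum(weak.coef_ != 0)))
```

In practice the penalty strength would itself be chosen by cross-validation rather than fixed a priori.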
Poor performance on test data could identify over‐fitting to sampling bias, but only if the test data are unbiased. In practice, if unbiased testing data were available, they could be used to build an unbiased model in the first place. Recent advances that enable presence‐only and presence–absence data to be modeled together, and across species, will be useful in this context (Fithian et al. ). A tradeoff exists between a complex model that might fit, e.g. step functions to few data points in poorly sampled regions, and simple models that predict smooth but potentially meaningless functions from just a few points.
Simple
The hope when using simple models for biased data is that the main trends are still identified. Complex models can over‐fit to the bias (particularly if the bias is heterogeneous in space) and miss the true main trends. Methods for dealing with imperfect detection (MacKenzie and Royle , Welsh et al. ) or sampling design often specify relatively simple responses to environment because they simultaneously fit the model for sampling (Latimer et al. ), and identifiability can become an issue when too many parameters are used that might relate to either observation or occurrence. In such cases, inference will be limited to very general trends.
Complex
If the sampling bias is strongly linked to the environmental gradients, even simple models can predict spurious relationships (Lahoz‐Monfort et al. ). Complex models could be useful in understanding, or hypothesizing about, the nature of the sampling bias: for example, the most parsimonious explanation for sharp changes in the probability of presence in some circumstances could be sampling bias, although we know of no published examples. Detection and sampling bias models are not restricted to simple models – for instance, the former have recently been developed for boosted regression trees (Hutchinson et al. ) and the latter are often used with MAXENT (Phillips et al. ).
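One blunt but common mitigation for geographic sampling bias is spatial thinning: keeping at most one record per grid cell so that heavily surveyed areas do not dominate the fit. A minimal sketch (coordinates and cell size are illustrative; real analyses would use an equal-area projection):

```python
import numpy as np

def thin_by_grid(lon, lat, cell_size):
    """Keep the first record in each cell of a regular lon/lat grid."""
    cells = np.stack([np.floor(lon / cell_size), np.floor(lat / cell_size)], axis=1)
    _, keep = np.unique(cells, axis=0, return_index=True)
    return np.sort(keep)

rng = np.random.default_rng(3)
# Heavily clustered sampling around one well-surveyed area, sparse elsewhere
lon = np.concatenate([rng.normal(5.0, 0.1, 300), rng.uniform(0, 10, 50)])
lat = np.concatenate([rng.normal(45.0, 0.1, 300), rng.uniform(40, 50, 50)])

keep = thin_by_grid(lon, lat, cell_size=0.5)
print(f"kept {len(keep)} of {len(lon)} records")
```

Thinning discards data, so for species with few occurrences the sample-size considerations above apply with extra force.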
Predictor variables: proximal vs distal
A priority in selecting candidate predictors is to identify variables that are as proximal as possible to the factors constraining the species' distribution. Proximal variables (e.g. soil moisture for plants) best represent the resources and direct gradients that influence species ranges (Austin ). More distal predictors, such as topographic aspect used as a surrogate for soil moisture, do not directly affect species distributions but act indirectly through their imperfect relationships with the proximal predictors they replace. The problem with using distal predictors is that their correlation with the proximal predictor can change across the species' range, even if the proximal predictor's relationship with the species does not (Dormann et al. ). We rarely have access to all of the most important proximal predictors across a study region, so the main question is: what response shapes should we expect for more distal predictors? Imagine that a species is limited by the duration of the growing season, but that the response is instead modeled with a combination of mean annual temperature and topographic position (aspect, slope, etc.). It is difficult to anticipate the shape of the multivariate surface that mimics the species' response to the proximal predictor.
Recommendations
Responses to proximal predictors over sufficiently large gradients should be relatively strong (Austin and references therein), and either simple or complex models should be able to identify these responses if complexity is suitably controlled. However, the extent to which the included set of predictors is proximal or distal may be unknown. Experimentation with complex and simple models may help test hypotheses about which predictors are more proximal, potentially best encapsulated in a simple response curve, and which are more distal and better represented with more complex curves.
As physiological mechanisms generally provide the best insights into how environmental gradients translate into demographic (and therefore population) patterns, informed physiological understanding can provide a valuable starting point (Austin , Kearney and Porter ).
Simple
Ecological theory supports using unimodal or skewed smooth responses to proximal variables (Austin and Nicholls , Oksanen , Austin , , Guisan and Thuiller , Franklin ), which motivates constraining the functional form of response curves a priori (section Complexity in models; e.g. specific features in a GLM, few nodes in a GAM). Remotely sensed data, even for proximal predictors, may introduce noise into the environmental covariates due to imprecision and to the use of long‐term averaged data (Austin , Letten et al. ), and may be prone to over‐fitting with complex models if those data generally fail to describe the local habitat conditions accurately. One can use simple models to smooth over such idiosyncrasies if the main trends are sufficiently strong, or one can omit predictors if trends are weak. Parametric, latent variable models can help to deal with this imprecision (Mcinerny and Purves ).
Complex
Ecological theory is based on responses to idealized gradients, whereas we observe (often imperfectly) a messy reality. Specifying an overly simple model will result in over‐ and under‐estimation of the response at points throughout the covariate space (Barry and Elith ). Given that the relationship between proximal and distal predictors is unlikely to be linear and may vary across landscapes, the true response to distal variables might also be complex and best represented by a model that allows flexible fits and interactions. Hence the complex viewpoint still adheres to ecological theory, but allows for a modified view of idealized relationships as seen through available data.
Spatial extents and resolution
Interpretation of ecological patterns is scale dependent; hence changing spatial extent and/or resolution affects the patterns and processes that can be modeled (Tobalske , Chave ). Ecologists often use hierarchical concepts to describe influences of environment on species distributions – for instance, that climate dominates distributions of terrestrial species at the global scale (coarsest grain, largest extent), while topography, lithology or habitat structure create the finer scale variation that impacts species at regional to local scales, together with dispersal limitations and biotic interactions (Boulangeat et al. , Dubuis et al. , Thuiller et al. ). SDMs built across large spatial extents often rely on remotely sensed, coarse resolution or highly interpolated predictors, creating inherent biases and sampling issues (section Sampling bias). The choice of extent can also determine whether the species' entire range is included in the model or whether data are censored (e.g. limited by political borders).
Recommendations
Resolutions should be chosen that provide data on proximal rather than distal variables. Such data are becoming available at high resolutions with expanded and technologically enhanced monitoring networks and more sophisticated interpolation of climate data (e.g. PRISM). The choice of resolution hence reduces to the discussion of proximal versus distal predictors in section Predictor variables: proximal vs distal. When the extent is chosen to contain the species' entire range, models should include sufficient complexity to detect unimodal, skewed responses (section Complexity in models).
Simple
Smooth responses, characterized by simpler models, are to be expected at large spatial extents and coarse resolutions that smooth over the confounding processes affecting finer resolution occurrence patterns (Austin ).
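The effect of grain can be explored directly by aggregating a predictor grid to a coarser resolution before fitting. A minimal block-mean sketch (the array stands in for a raster; the variable name and aggregation factor are illustrative):

```python
import numpy as np

def aggregate(raster, factor):
    """Block-mean aggregation of a 2-D array to a coarser grid."""
    r, c = raster.shape
    assert r % factor == 0 and c % factor == 0, "dimensions must divide evenly"
    return raster.reshape(r // factor, factor, c // factor, factor).mean(axis=(1, 3))

rng = np.random.default_rng(4)
elev = rng.normal(1000, 200, size=(8, 8))  # hypothetical elevation grid
coarse = aggregate(elev, 2)
print(coarse.shape)
```

Comparing models fit at several aggregation factors is one way to ask at which grain a predictor's signal is strongest.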
At finer resolutions, it may also be undesirable to incorporate the full complexity of the response curve: many of the finer details may derive from factors for which no predictor variables are available, or may be irrelevant to the purpose of the investigation (e.g. microhabitat or regional competition effects).
Complex
At small spatial extents, we might have data on the relevant proximal factors (e.g. soil properties), so fitting complex models along small‐scale gradients can capture this complexity. Also, complex models may be useful for exploring the nonlinearities that arise in response curves from distal variables at broad scales, in that they potentially provide insight into important unmeasured variables.
Spatial autocorrelation
Many processes omitted from SDMs have spatial structure. For example, dispersal limitation, foraging behavior, competition, prevailing weather patterns, and even sampling bias can all lead to spatially structured occurrence patterns that are not explained by the set of predictors included in the SDM (Legendre , Barry and Elith , but see Latimer et al. , Dormann et al. ). When these spatial patterns are not appropriately accounted for, biased estimates of environmental responses may emerge.
Recommendations
If presence–absence data are available, one should assess the degree of spatial autocorrelation in the residuals and implement methods to control for it. Methods include spatially‐explicit models that separate the spatial pattern from the environmental response (Latimer et al. , Dormann et al. , Beale et al. ), using spatial eigenvectors as predictors (Diniz‐Filho and Bini ), or stratified sub‐sampling of the data to minimize autocorrelation (Hijmans ). Complex models should be used cautiously in the presence of spatial autocorrelation, because their flexibility may lead them to confound aggregation in geographic space with complexity in environmental space.
For example, if a large number of presences are recorded in a small region of environmental space due to social behavior in geographic space, it is more likely that a complex model can find some feature in environmental space that correlates with this clustering. This will result in biased interpretation or mapped projections in other locations where this social behavior is absent. Cross‐validation can eliminate such spurious fits, but only if it is spatially stratified at an appropriate scale. However, when used for exploratory purposes, complex models may reveal information about this spatial structure within their response curves.
Simple
Simple parametric models can accommodate spatial structure under assumptions about the correlation structure (Latimer et al. , Dormann et al. ). If a non‐spatial model is used, simple models can be valuable because they are not flexible enough to model discontinuities in the response curve that derive from spatial structure; however, they will still exhibit bias due to aggregated observations. Another solution to dealing with spatial aggregation is to model at a sufficiently coarse resolution (suggesting simple models; see Spatial extents and resolution) that geographic clustering occurs within (and not among) cells, so it can effectively be ignored. One should be cautious building complex models because, in practice, obtaining spatially independent cross‐validation samples is extremely challenging when the underlying spatial process is unknown, and failing to do so likely leads to over‐fitting (cf. Hijmans ).
Complex
It may be desirable to use complex response curves as proxies for geographic clustering for mapping applications if the model focuses on small extents where nonlinear relationships are likely to hold across the landscape of interest (e.g. interpolation).
For example, Santika and Hutchinson ( ) showed that restricting logistic regression to linear responses, rather than allowing for unimodal responses as in semi‐parametric GAMs, reduced model performance and introduced spatial autocorrelation into the residuals. Methods broadly dealing with spatial and temporal autocorrelation have more recently become available for complex models (Hothorn et al. , Crase et al. ).
Conclusions
Methodological
Based on our observations on the appropriate use of different statistical methods and modeling decisions, how should modelers proceed to build SDMs? Many modelers’ preferences for particular statistical methods derive from the types of data they typically use and the questions they ask, rather than any fundamental philosophy of statistical modeling. For this reason, it is valuable for modelers to have experience in both simple and complex modeling strategies. We suggest that researchers develop a comprehensive understanding of regression models in general and GLMs in particular, as these represent the foundation of almost all of the more complex modeling frameworks. Also, understanding at least one approach to building complex SDMs can allow for sequential tests of more complex model structure. Importantly, because there are many different approaches to handling the same challenges in the data, it is less critical to understand each and every modeling approach than to become an expert in applying representatives of simple and complex modeling approaches. Bias can come from over‐fitting complex models, and it can come from misspecified simple models. To find a model of optimal complexity, many approaches are possible and are readily justified if sufficient cross‐validation has been performed. One might consider starting simple and adding the minimum complexity necessary (Snell et al. , this issue), or conversely starting with a complex model and removing as much superfluous complexity as possible.
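The spatially stratified cross-validation recommended in the spatial autocorrelation section can be sketched by assigning records to coarse geographic blocks and then splitting by block, so that nearby (and hence correlated) records never straddle the train/test boundary. A sketch with scikit-learn (block size and coordinates are illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(5)
n = 200
lon, lat = rng.uniform(0, 10, n), rng.uniform(40, 50, n)
X = rng.normal(size=(n, 3))    # environmental predictors
y = rng.binomial(1, 0.5, n)    # occurrence records

# Assign each record to a 2.5-degree spatial block
block = (np.floor(lon / 2.5) * 100 + np.floor(lat / 2.5)).astype(int)

cv = GroupKFold(n_splits=4)
for train_idx, test_idx in cv.split(X, y, groups=block):
    # no block contributes records to both sides of any split
    assert not set(block[train_idx]) & set(block[test_idx])
    # a model would be fit on train_idx and scored on test_idx here
```

The block size should be chosen relative to the range of spatial autocorrelation, which is exactly the hard part when the underlying spatial process is unknown.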
If one can narrow down the potential complexity based on the considerations discussed here to consider models within a particular modeling approach (Table ), then traditional model selection techniques are appropriate (section Modeling decisions). Due to the exploratory nature of many SDMs and the desire to discover spatial patterns and their drivers, we recommend that analyses begin with exploration using complex models to determine an upper bound on the complexity of response curves. Over‐fitting can be controlled through cross‐validation (e.g. k‐fold, and particularly block resampling methods), even if a full decomposition into train–validation–test data is not feasible. Furthermore, complex models can be used to identify smooth, simple occurrence–environment relationships if patterns are sufficiently strong, and thereby guide specification of simpler models. In contrast, it will be more difficult to overcome a misspecified simple model, should a more complex response exist. If the exploration with complex models reveals smooth relationships, one can shift to a simpler model. If instead strong nonlinearities are prevalent, one should consider biological explanations for them. If complex nonlinearities cannot be avoided, one should focus on minimizing the complexity, understanding it through sensitivity analysis and uncertainty analysis (below), and providing biologically based hypotheses about it. The end result is a model that adds complexity only to the extent necessary to reproduce observed patterns. Uncertainty analysis is a relatively untapped resource for understanding appropriate model complexity. When the influence of particular model components is unknown (e.g. whether a predictor or feature is relevant a priori), it is particularly critical to account for uncertainty in modeled relationships to explore the implications of our ignorance.
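Bootstrapping, one of the simulation routes to parameter and prediction uncertainty, can put an interval around a fitted response curve so that pronounced nonlinearities can be judged against the width of their intervals. A minimal sketch with a quadratic logistic response on simulated data (the weakly penalized logistic regression stands in for a plain GLM):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
temp = rng.uniform(-2, 2, 300)
y = rng.binomial(1, 1 / (1 + np.exp(-(1 - 2 * temp**2))))
X = np.column_stack([temp, temp**2])
grid = np.linspace(-2, 2, 50)
G = np.column_stack([grid, grid**2])

# Refit the model on bootstrap resamples and collect the predicted curves
curves = []
for _ in range(200):
    idx = rng.integers(0, len(y), len(y))
    m = LogisticRegression(C=1e6).fit(X[idx], y[idx])  # C large => negligible penalty
    curves.append(m.predict_proba(G)[:, 1])

# Pointwise 95% interval around the response curve
lo, hi = np.percentile(curves, [2.5, 97.5], axis=0)
print("interval width at mid-gradient:", round(float(hi[25] - lo[25]), 3))
```

A nonlinearity whose interval stays tight across resamples is better evidence of a real feature than one whose interval could contain a flat or linear response.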
By studying uncertainty, one can gain confidence in pronounced nonlinearities when they come with tight confidence intervals. Information on parameter uncertainty, and consequently prediction uncertainty, can be obtained from any means of simulation from parameter distributions, including posterior sampling, sampling based on point estimates and covariance matrices, or bootstrapping. Bayesian models have the advantage of using the full data set to estimate parameter uncertainty, but are generally restricted to simpler models to avoid convergence issues (Latimer et al. , Ibáñez et al. ). One way of reducing uncertainty in predictions is to analyze the importance of predictors given the model and data using ‘average predictive comparisons’ (Gelman and Pardoe ), a form of sensitivity analysis that incorporates parameter uncertainty. One can also quantify uncertainty due to our modeling decisions by using ensembles of models built with different statistical methods or decisions (Pearson et al. , Araújo and New , Thuiller et al. ), provided that each component model is built based on modeling decisions reflecting a common goal.
Biological
Despite the valuable insights we can gain from occurrence models, it is worth acknowledging that fundamental limitations to biological inference may emerge from these studies (Tyre et al. , Araújo and Guisan , Araujo and Peterson , Merow et al. ). Balancing complex and simple models in such a way as to discover and discuss these limits may be as important as the actual patterns identified with some datasets. More broadly, it is important to keep in mind that we are ultimately performing exploratory analyses of occurrence–environment relationships. Occurrence records are not the ideal data to predict attributes of populations; Thuiller et al. ( ) provide an interesting cautionary note by showing weak relationships between occurrence probability and various demographic parameters for 108 tree species in temperate forests.
However, often no other data are available at large spatial extents that might inform range models. Thus, while the limits may be obvious, insights from occurrence‐based correlative models may be an essential step in developing new hypotheses and research programs that can lead to the next generation of mechanistic models (Schurr et al. , Thuiller et al. , Snell et al. ). A novel, and potentially important, application of SDMs is informing mechanistic models about the shapes of response curves in demographic models (Merow et al. ) or dynamic spatio‐temporal population models (Pagel and Schurr , Boulangeat et al. , Thuiller et al. ). Simple models may be preferable for these tasks because it is important to have a clear hypothesis to evaluate when linking an SDM to a particular process (Thuiller et al. ). For example, SDMs might inform variable selection for the growth, survival and fecundity models in Integral Projection Models (Easterling et al. ). However, highly nonlinear relationships would not be desirable for vital rate models due to the unlikely transitions through the life history that they might imply (cf. Merow et al. ). It is particularly important to avoid confounding missing processes with complex environmental responses (as might occur in complex models) when the mechanistic model explicitly describes the mechanisms that produce spatial aggregation (e.g. dispersal or species interactions: Kissling et al. ). The challenge in using SDMs in this way lies in ensuring that response curves truly reflect environmental limitations; while environmental tolerance may limit a species' distribution at one end of a gradient, other (e.g. biotic) factors may limit it at the other end (Zimmermann et al. ). Many issues of response curve complexity that we discuss are also relevant for process‐based SDMs. Representations of processes are incorporated into SDMs to improve precision and accuracy, or to improve our understanding of ecological processes.
Consequently, process‐based models are used more for prediction and hypothesis testing than for description and hypothesis generation. Yet preferences for different model complexity persist (Evans et al. , Lonergan et al. ). Study objectives influence the choice of complexity; i.e. whether the model is intended for extrapolation or for understanding the potential importance of mechanisms. For the latter, simple models are useful to make the study of the role of a mechanism more analytically tractable, while preference might instead be towards more complex models where the roles of specific mechanisms can be understood in relation to other interconnected mechanisms. When the objective is prediction, complex models are valuable to represent all known relevant mechanisms in order to obtain the ‘best guess’. Simpler models are valuable when analyses imply that only certain key mechanisms are needed for sufficient predictive accuracy (further discussion in Evans et al. ). Attributes of the available data may be less important with process‐based models when relevant test datasets are well understood. However, data considerations are important when mechanisms or parameters are inferred from data or when assessing the spatiotemporal resolution over which particular degrees of abstraction and parameter values are relevant (Evans et al. , Lonergan , Snell et al. ). In any case, we expect that progress towards improved process‐based models lies in challenging occurrence‐based SDMs with stronger biological justifications and interpretations that aim to shed light on the mechanisms that drive process‐based models.
Acknowledgements
This study arose from two workshops entitled ‘Advancing concepts and models of species range dynamics: understanding and disentangling processes across scales’. Funding was provided by the Danish Council for Independent Research | Natural Sciences (grant no. 10‐085056 to SN).
CM acknowledges funding from NSF grants 1046328 and 1137366. WT acknowledges support from the European Research Council under the European Community's Seventh Framework Programme FP7/2007–2013 Grant Agreement no. 281422 (TEEMBIO). RW acknowledges support from the Swiss National Science Foundation (Synergia Project CRS113‐125240, Early Postdoc Mobility Grant PBZHP3_147226). JE acknowledges funding from the Australian Research Council (grant FT0991640). TE states that any mention of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.

Journal: Ecography (Wiley)
Published: Dec 1, 2014
