N. Raes, H. ter Steege (2007) A null-model for significance testing of presence-only species distribution models. Ecography, 30
S. Phillips, R. Anderson, R. Schapire (2006) Maximum entropy modeling of species geographic distributions. Ecological Modelling, 190
A. Guisan, W. Thuiller (2005) Predicting species distribution: offering more than simple habitat models. Ecology Letters, 8
J. Elith, S. Ferrier, F. Huettmann, J. Leathwick (2005) The evaluation strip: a new and robust method for plotting predicted responses from species distribution models. Ecological Modelling, 186
M. Austin (2007) Species distribution models and ecological theory: a critical assessment and some possible new approaches. Ecological Modelling, 200
I. Boulangeat, D. Georges, C. Dentant, R. Bonet, J. Van Es, S. Abdulhak, N. Zimmermann, W. Thuiller (2014) Anticipating the spatio-temporal response of plant diversity and vegetation structure to climate and land use change in a protected area. Ecography, 37
W. Thuiller, T. Münkemüller, K. Schiffers, D. Georges, S. Dullinger, V. Eckhart, T. Edwards, D. Gravel, G. Kunstler, C. Merow, K. Moore, C. Piedallu, S. Vissault, N. Zimmermann, D. Zurell, F. Schurr (2014) Does probability of occurrence relate to population dynamics? Ecography, 37
A. Dubuis, S. Giovanettina, L. Pellissier, J. Pottier, P. Vittoz, A. Guisan (2013) Improving the prediction of plant species distribution and community composition by adding edaphic to topo-climatic variables. Journal of Vegetation Science, 24
J. Soberón (2007) Grinnellian and Eltonian niches and geographic distributions of species. Ecology Letters, 10
M. Araújo, A. Peterson (2012) Uses and misuses of bioclimatic envelope modeling. Ecology, 93
T. Hastie, R. Tibshirani (2014) Generalized Additive Models
M. Austin (2002) Spatial prediction of species distribution: an interface between ecological theory and statistical modelling. Ecological Modelling, 157
G. Aarts, J. Fieberg, J. Matthiopoulos (2012) Comparative interpretation of count, presence–absence and point methods for species distribution models. Methods in Ecology and Evolution, 3
F. Schurr, J. Pagel, J. Cabral, J. Groeneveld, O. Bykova, R. O'Hara, F. Hartig, W. Kissling, H. Linder, G. Midgley, B. Schröder, A. Singer, N. Zimmermann (2012) How to understand species' niches and range dynamics: a demographic research agenda for biogeography. Journal of Biogeography, 39
J. Diniz-Filho, L. Bini (2005) Modelling geographical patterns in species richness using eigenvector-based spatial filters. Global Ecology and Biogeography, 14
L. Janson, W. Fithian, T. Hastie (2013) Effective degrees of freedom: a flawed metaphor. Biometrika, 102
I. Boulangeat, D. Gravel, W. Thuiller (2012) Accounting for dispersal and biotic interactions to disentangle the drivers of species distributions and their abundances. Ecology Letters, 15
J. Elith, C. Graham (2009) Do they? How do they? WHY do they differ? On finding reasons for differing performances of species distribution models. Ecography, 32
C. Merow, M. Smith, J. Silander (2013) A practical guide to MaxEnt for modeling species' distributions: what it does, and why inputs and settings matter. Ecography, 36
A. Letten, M. Ashcroft, D. Keith, J. Gollan, D. Ramp (2013) The importance of temporal climate variability for spatial patterns in plant diversity. Ecography, 36
C. Strobl, A. Boulesteix, T. Kneib, T. Augustin, A. Zeileis (2008) Conditional variable importance for random forests. BMC Bioinformatics, 9
H. Pulliam (2000) On the relationship between niche and distribution. Ecology Letters, 3
A. Peterson, J. Soberón, R. Pearson, R. Anderson, E. Martínez-Meyer, M. Nakamura, M. Araújo (2011) Ecological Niches and Geographic Distributions
W. Thuiller, L. Brotóns, M. Araújo, S. Lavorel (2004) Effects of restricting environmental range of data to project current and future species distributions. Ecography, 27
R. Hijmans (2012) Cross-validation of species distribution models: removing spatial sorting bias and calibration with a null model. Ecology, 93
M. Austin, T. Smith (1989) A new model for the continuum concept. Vegetatio, 83
S. Veloz (2009) Spatially autocorrelated sampling falsely inflates measures of accuracy for presence-only niche models. Journal of Biogeography, 36
S. Wood (2006) Generalized Additive Models: An Introduction with R
A. Tyre, H. Possingham, D. Lindenmayer (2001) Inferring process from pattern: can territory occupancy provide information about life history parameters? Ecological Applications, 11
J. Pearce, S. Ferrier (2000) Evaluating the predictive performance of habitat models developed using logistic regression. Ecological Modelling, 133
W. Fithian, J. Elith, T. Hastie, D. Keith (2014) A Proportional Observer Bias Model for Multispecies Distribution Modeling
U. Grömping (2009) Variable importance assessment in regression: linear regression versus random forest. The American Statistician, 63
J. Elith, M. Kearney, S. Phillips (2010) The art of modelling range-shifting species. Methods in Ecology and Evolution, 1
M. Tomz, J. Wittenberg, G. King (2003) Clarify: software for interpreting and presenting statistical results. Journal of Statistical Software, 8
M. Easterling, S. Ellner, P. Dixon (2000) Size-specific sensitivity: applying a new structured population model. Ecology, 81
M. Araújo, R. Pearson, W. Thuiller, M. Erhard (2005) Validation of species–climate impact models under climate change. Global Change Biology, 11
M. Kearney, W. Porter (2009) Mechanistic niche modelling: combining physiological and spatial data to predict species' ranges. Ecology Letters, 12
D. MacKenzie, J. Royle (2005) Designing occupancy studies: general advice and allocating survey effort. Journal of Applied Ecology, 42
T. Hastie, R. Tibshirani, J. Friedman (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction
B. Crase, A. Liedloff, B. Wintle (2012) A new method for dealing with residual spatial autocorrelation in species distribution models. Ecography, 35
L. Breiman (1983) Classification and Regression Trees
J. Oksanen (1997) Why the beta-function cannot be used to estimate skewness of species responses. Journal of Vegetation Science, 8
D. MacKenzie, J. Nichols, G. Lachman, S. Droege, J. Royle, C. Langtimm (2002) Estimating site occupancy rates when detection probabilities are less than one. Ecology, 83
J. Franklin (2010) Moving beyond static species distribution models in support of conservation biogeography. Diversity and Distributions, 16
J. Pagel, F. Schurr (2012) Forecasting species ranges by statistical estimation of ecological niches and spatial population dynamics. Global Ecology and Biogeography, 21
I. Ibáñez, J. Silander, A. Wilson, N. Lafleur, N. Tanaka, I. Tsuyama (2009) Multivariate forecasts of potential distributions of invasive plant species. Ecological Applications, 19
T. Hothorn, J. Müller, B. Schröder, T. Kneib, R. Brandl (2011) Decomposing environmental, spatial, and spatiotemporal components of species distributions. Ecological Monographs, 81
R. Holt (2009) Bringing the Hutchinsonian niche into the 21st century: ecological and evolutionary perspectives. Proceedings of the National Academy of Sciences, 106
D. Zurell, J. Elith, B. Schröder (2012) Predicting to new environments: tools for visualizing model behaviour and impacts on mapped distributions. Diversity and Distributions, 18
J. Svenning, C. Fløjgaard, K. Marske, D. Nogués-Bravo, S. Normand (2011) Applications of species distribution modeling to paleobiology. Quaternary Science Reviews, 30
R. Pearson, W. Thuiller, M. Araújo, E. Martínez-Meyer, L. Brotóns, C. McClean, L. Miles, P. Segurado, T. Dawson, D. Lees (2006) Model-based uncertainty in species range prediction. Journal of Biogeography, 33
C. Lawson, J. Hodgson, R. Wilson, S. Richards (2014) Prevalence, thresholds and the performance of presence–absence models. Methods in Ecology and Evolution, 5
J. Jankowski, G. Londoño, S. Robinson, M. Chappell (2013) Exploring the role of physiology and biotic interactions in determining elevational ranges of tropical animals. Ecography, 36
C. Merow, A. Latimer, A. Wilson, S. McMahon, A. Rebelo, J. Silander (2014) On using integral projection models to generate demographically driven predictions of species' distributions: development and validation using sparse data. Ecography, 37
S. Normand, U. Treier, C. Randin, P. Vittoz, A. Guisan, J. Svenning (2009) Importance of abiotic stress as a range-limit determinant for European plants: insights from species responses to climatic gradients. Global Ecology and Biogeography, 18
T. Edwards, D. Cutler, N. Zimmermann, L. Geiser, G. Moisen (2006) Effects of sample survey design on the accuracy of classification tree models in species distribution models. Ecological Modelling, 199
P. Legendre (1993) Spatial autocorrelation: trouble or new paradigm? Ecology, 74
R. Hutchinson, L. Liu, T. Dietterich (2011) Incorporating boosted regression trees into ecological latent variable models. Proceedings of the AAAI Conference on Artificial Intelligence
S. Phillips, M. Dudík (2008) Modeling of species distributions with Maxent: new extensions and a comprehensive evaluation. Ecography, 31
M. Araújo, A. Guisan (2006) Five (or so) challenges for species distribution modelling. Journal of Biogeography, 33
N. Zimmermann, T. Edwards, C. Graham, P. Pearman, J. Svenning (2010) New trends in species distribution modelling. Ecography, 33
A. Welsh, D. Lindenmayer, C. Donnelly (2013) Fitting and interpreting occupancy models. PLoS ONE, 8
T. Hefley, A. Tyre, D. Baasch, E. Blankenship (2013) Nondetection sampling bias in marked presence-only data. Ecology and Evolution, 3
J. Scott (2002) Predicting Species Occurrences: Issues of Accuracy and Scale
A. Hirzel, G. Le Lay, V. Helfer, C. Randin, A. Guisan (2006) Evaluating the ability of habitat suitability models to predict species presences. Ecological Modelling, 199
J. Lobo, A. Jiménez-Valverde, R. Real (2008) AUC: a misleading measure of the performance of predictive distribution models. Global Ecology and Biogeography, 17
S. Phillips, M. Dudík, J. Elith, C. Graham, A. Lehmann, J. Leathwick, S. Ferrier (2009) Sample selection bias and presence-only distribution models: implications for background and pseudo-absence data. Ecological Applications, 19
D. Warton, I. Renner, D. Ramp (2013) Model-based control of observer bias for the analysis of presence-only data in ecology. PLoS ONE, 8
J. Chase, M. Leibold (2003) Ecological Niches
J. Friedman (2001) Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29
A. Latimer, S. Wu, A. Gelfand, J. Silander (2006) Building statistical models to analyze species distributions. Ecological Applications, 16
C. Dormann, J. Elith, S. Bacher, C. Buchmann, G. Carl, G. Carré, J. Márquez, B. Gruber, B. Lafourcade, P. Leitão, T. Münkemüller, C. McClean, P. Osborne, B. Reineking, B. Schröder, A. Skidmore, D. Zurell, S. Lautenbach (2013) Collinearity: a review of methods to deal with it and a simulation study evaluating their performance. Ecography, 36
C. Albert, W. Thuiller, N. Yoccoz, R. Douzet, S. Aubert, S. Lavorel (2010) A multi-trait approach reveals the structure and the relative importance of intra- vs. interspecific variability in plant traits. Functional Ecology, 24
L. Breiman (2001) Statistical modeling: the two cultures (with comments and a rejoinder by the author). Statistical Science, 16
M. Lonergan (2014) Data availability constrains model complexity, generality, and utility: a response to Evans et al. Trends in Ecology & Evolution, 29
M. Austin, A. Nicholls, M. Doherty, J. Meyers (1994) Determining species response functions to an environmental gradient by means of a β-function. Journal of Vegetation Science, 5
J. Fox (2003) Effect displays in R for generalised linear models. Journal of Statistical Software, 8
K. Burnham, D. Anderson (2003) Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach
J. Chave (2013) The problem of pattern and scale in ecology: what have we learned in 20 years? Ecology Letters, 16
W. Thuiller (2003) BIOMOD – optimizing predictions of species distributions and projecting potential future shifts under global change. Global Change Biology, 9
A. Gelman, I. Pardoe (2007) Average predictive comparisons for models with nonlinearity, interactions, and variance components. Sociological Methodology, 37
R. Maggini, A. Lehmann, N. Zimmermann, A. Guisan (2006) Improving generalized regression analysis for the spatial prediction of forest communities. Journal of Biogeography, 33
J. Lahoz-Monfort, G. Guillera-Arroita, B. Wintle (2014) Imperfect detection impacts the performance of species distribution models. Global Ecology and Biogeography, 23
B. Zadrozny (2004) Learning and evaluating classifiers under sample selection bias. Proceedings of the Twenty-First International Conference on Machine Learning
J. Elder (2003) The generalization paradox of ensembles. Journal of Computational and Graphical Statistics, 12
J. Elith, J. Leathwick (2010) Species Distribution Models: Ecological Explanation and Prediction Across Space and Time
M. Evans, V. Grimm, K. Johst, T. Knuuttila, R. Langhe, C. Lessells, M. Merz, M. O'Malley, S. Orzack, M. Weisberg, D. Wilkinson, O. Wolkenhauer, T. Benton (2013) Do simple models lead to generality in ecology? Trends in Ecology & Evolution, 28
W. Thuiller, B. Lafourcade, R. Engler, M. Araújo (2009) BIOMOD – a platform for ensemble forecasting of species distributions. Ecography, 32
M. Austin (1976) On non-linear species response models in ordination. Vegetatio, 33
R. Snell, A. Huth, J. Nabel, G. Bocedi, J. Travis, D. Gravel, H. Bugmann, Á. Gutiérrez, T. Hickler, S. Higgins, B. Reineking, M. Scherstjanoi, N. Zurbriggen, H. Lischke (2014) Using dynamic vegetation models to simulate plant range shifts. Ecography, 37
M. Austin, A. Nicholls (1997) To fix or not to fix the species limits, that is the ecological question: response to Jari Oksanen. Journal of Vegetation Science, 8
R. Tibshirani (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58
R. Pearson, S. Phillips, M. Loranty, P. Beck, T. Damoulas, S. Knight, S. Goetz (2013) Shifts in Arctic vegetation and associated feedbacks under climate change. Nature Climate Change, 3
W. Kissling, C. Dormann, J. Groeneveld, T. Hickler, I. Kühn, G. McInerny, J. Montoya, C. Römermann, K. Schiffers, F. Schurr, A. Singer, J. Svenning, N. Zimmermann, R. O'Hara (2012) Towards novel approaches to modelling biotic interactions in multispecies assemblages at large spatial extents. Journal of Biogeography, 39
L. Buckley, S. Waaser, H. MacLean, R. Fox (2011) Does including physiology improve species distribution model predictions of responses to recent climate change? Ecology, 92
S. Barry, J. Elith (2006) Error and uncertainty in habitat models. Journal of Applied Ecology, 43
W. Thuiller, T. Münkemüller, S. Lavergne, D. Mouillot, N. Mouquet, K. Schiffers, D. Gravel (2013) A road map for integrating eco-evolutionary processes into biodiversity models. Ecology Letters, 16
C. Beale, J. Lennon, J. Yearsley, M. Brewer, D. Elston (2010) Regression analysis of spatial data. Ecology Letters, 13
M. Araújo, M. New (2007) Ensemble forecasting of species distributions. Trends in Ecology & Evolution, 22
T. Santika, M. Hutchinson (2009) The effect of species response form on species distribution model prediction and inference. Ecological Modelling, 220
A. Guisan, R. Tingley, J. Baumgartner, I. Naujokaitis-Lewis, P. Sutcliffe, A. Tulloch, T. Regan, L. Brotóns, E. McDonald-Madden, C. Mantyka-Pringle, T. Martin, J. Rhodes, R. Maggini, S. Setterfield, J. Elith, M. Schwartz, B. Wintle, O. Broennimann, M. Austin, S. Ferrier, M. Kearney, H. Possingham, Y. Buckley (2013) Predicting species distributions for conservation decisions. Ecology Letters, 16
N. Zimmermann, N. Yoccoz, T. Edwards, E. Meier, W. Thuiller, A. Guisan, D. Schmatz, P. Pearman (2009) Climatic extremes improve predictions of spatial patterns of tree species. Proceedings of the National Academy of Sciences, 106
H. Schreuder, T. Gregoire, J. Weyer (2001) For what applications can probability and non-probability sampling be used? Environmental Monitoring and Assessment, 66
C. Tobalske (2002) Effects of spatial scale on the predictive ability of habitat models for the green woodpecker in Switzerland. Diversity and Distributions
C. Randin, T. Dirnböck, S. Dullinger, N. Zimmermann, M. Zappa, A. Guisan (2006) Are niche-based species distribution models transferable in space? Journal of Biogeography, 33
G. McInerny, D. Purves (2011) Fine-scale environmental variation in species distribution modelling: regression dilution, latent variables and neighbourly advice. Methods in Ecology and Evolution, 2
C. Dormann, J. McPherson, M. Araújo, R. Bivand, J. Bolliger, G. Carl, R. Davies, A. Hirzel, W. Jetz, W. Kissling, I. Kühn, R. Ohlemüller, P. Peres-Neto, B. Reineking, B. Schröder, F. Schurr, R. Wilson (2007) Methods to account for spatial autocorrelation in the analysis of species distributional data: a review. Ecography, 30
Species distribution models (SDMs), also known as ecological niche models or habitat selection models, are widely used in ecology, evolutionary biology, and conservation (Elith and Leathwick, Franklin, Zimmermann et al., Peterson et al., Svenning et al., Guisan et al.). SDMs can provide insights into generalities and idiosyncrasies of the drivers of complex patterns of species' geographic distributions. SDMs are built using a variety of statistical methods – e.g. generalized linear/additive models, tree-based models, maximum entropy – which span a range of complexity in the occurrence–environment relationships that they fit. Capturing the appropriate amount of complexity for particular study objectives is challenging. By building 'underfit' models, with insufficient flexibility to describe observed occurrence–environment relationships, we risk misunderstanding the factors shaping species distributions. By building 'overfit' models, with excessive flexibility, we risk inadvertently ascribing pattern to noise or building opaque models. Determining a suitable amount of complexity to include in SDMs is therefore crucial for biological applications. Because traditional model selection is challenging when comparing models from different SDM modeling approaches (e.g. those in the table below), we argue that researchers must constrain model complexity based on attributes of the data, on study objectives, and on an understanding of how these interact with the underlying biological processes. Here, we discuss the challenges that choosing an appropriate amount of model complexity poses and how this influences the use of different statistical methods and modeling decisions (Elith and Graham). Common modeling paradigms used to build SDMs, and the decisions used to control their complexity: the variation among response curves from different modeling paradigms and different model settings suggests that they are suitable for different study objectives and attributes of the data.
Response curves come from fitting SDMs to presence/background data on the overstory shrub Protea punctata from the Cape Floristic Region of South Africa (see Merow et al. for details of the data), with different degrees of control over the complexity of the fitted response curves. All models were constructed using the biomod2 package (Thuiller et al.) within the statistical software R (R Core Team). The response curves shown, of varying complexity, are representative of those commonly observed during SDM building. Dark grey curves were fitted using settings at or near the default options in biomod2 (for illustration), except that the package was forced to perform a single fit per method using all of the presence data. Black (light grey) curves were fitted by choosing options that make the fitted response curves simpler (more complex). Note that the complexity of any of these paradigms is also affected by the number of predictors, the order of interactions, and model averaging; these decisions are not explicitly included in the table.

Bioclimatic envelope models (BIOCLIM). Responses built from: quantiles, between which occurrence probability is 1. Features: step functions. Complexity controlled by: quantiles.

Generalized linear models (GLM). Responses built from: parametric terms specified by the user. Features: polynomials, piecewise functions, splines. Complexity controlled by: feature complexity specified by the user.

Generalized additive models (GAM). Responses built from: a combination of parametric terms and flexible smooth functions suggested by the data or the user. Features: parametric terms as in GLMs and various smoothers (e.g. splines, loess). Complexity controlled by: number of nodes; penalties.

Multivariate adaptive regression splines (MARS). Responses built from: the sum of multiple piecewise basis functions of predictors suggested by the data. Features: splines. Complexity controlled by: number of knots; cost per degree of freedom; pruning.

Artificial neural networks (ANN). Responses built from: networks of interactions between simple functions of predictors suggested by the data. Complexity controlled by: number of hidden layers.

Classification and regression trees (CART). Responses built from: repeated partitioning of predictors into different categories, suggested by the data, associated with different occurrence probabilities. Features: thresholds, with implicit interactions. Complexity controlled by: minimum observations for a split/terminal node; maximum node depth; complexity threshold to attempt a split.

Random forests (RF). Responses built from: an average of multiple CARTs, each constructed on bootstrapped samples of the data and using different random subsets of the full predictor set. Features: thresholds, with implicit interactions. Complexity controlled by: as for CARTs; number of trees.

Boosted regression trees (BRT). Responses built from: regression trees fit at multiple steps; at each step, a tree models the residuals from the sum of all previous models, weighted by the learning rate. Features: thresholds, with implicit interactions. Complexity controlled by: as for CARTs; number of trees; learning rate.

Maximum entropy (MAXENT). Responses built from: a GLM with a large number of features, which are suggested by the data or the user. Features: linear, quadratic, interaction, hinge, threshold. Complexity controlled by: feature classes used; regularization penalty.

Complexity is a fundamental feature of observed occurrence patterns because occurrence–environment relationships may be obscured by processes that are not exclusively related to the environment, such as dispersal, response to disturbance, and biotic interactions (Pulliam, Holt, Boulangeat et al.). Consequently, SDMs can be dynamic and process-based, explicitly representing aspects of the underlying biology. This paper focuses on the more widely used static, correlative SDMs, although many of the issues considered relate to process-based SDMs as well.
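The 'features' that recur throughout the table can be made concrete with a small sketch. Everything below (function names, knot values, the choice of feature classes) is illustrative rather than taken from any particular package; it shows how a single raw predictor is expanded into linear, quadratic, hinge, and threshold basis functions, which a MAXENT-style model then combines linearly under a regularization penalty:

```python
def hinge(x, knot):
    # 0 below the knot, then increases linearly (a "hockey stick")
    return max(0.0, x - knot)

def threshold(x, knot):
    # step function: 0 below the knot, 1 at or above it
    return 1.0 if x >= knot else 0.0

def expand(x, knots):
    """Expand one raw predictor value into a feature vector.
    Allowing more feature classes and more knots raises the upper
    limit of complexity the fitted response curve can express."""
    features = [x, x * x]  # linear and quadratic features
    for k in knots:
        features.append(hinge(x, k))
        features.append(threshold(x, k))
    return features

# e.g. a (hypothetical) scaled temperature value expanded against three arbitrary knots
print(expand(0.6, [0.25, 0.5, 0.75]))
```

Restricting the expansion to the first two features yields a simple quadratic response; adding many hinge and threshold features permits arbitrarily wiggly or stepped curves, which is why regularization and feature-class choices control MAXENT's effective complexity.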
Describing this complexity is critical for many applications of SDMs, and using flexible occurrence–environment relationships allows biologists to hypothesize about the drivers of complexity or to make accurate predictions that derive from their representation in SDMs. Such hypotheses are a valuable step toward the types of process-based models discussed in this issue (Merow et al., Snell et al.). However, building complex models comes with the challenge of differentiating true complexity from noise (see chapter 7 in Hastie et al. for a statistical viewpoint on optimising model complexity). Some believe that flexible models are often overfit to the noise prevalent in many occurrence data sets. Thus, with such variation in both needs and opinions regarding model complexity, many modeling approaches are in current use (see the table). We characterize model complexity by the shape of the inferred occurrence–environment relationships and by the number of posited predictors and parameters used to describe them. A simpler model typically has relatively fewer parameters and fewer relationships among predictors than a more complex model. However, it remains a challenge to quantify complexity in a way that is appropriate across the spectrum of modeling approaches in the table (e.g. Janson et al. showed effective degrees of freedom to be an unreliable metric for defining complexity). Univariate 'response curves' are commonly used to give an impression of the complexity of the predicted occurrence–environment relationships. These are one-dimensional 'slices' of multivariate space. The most common approach is to plot the predicted occurrence probability against the predictor of interest while holding all other predictors at their mean or median values (Elith et al.), although other approaches are possible (Fox, Hastie et al.).
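The recipe just described – plot predicted occurrence against one predictor while holding the others at their medians – can be written down generically. The following sketch assumes a fitted model exposed as a plain prediction function; `toy_model` and the predictor names `temp` and `rain` are hypothetical, for illustration only:

```python
import math
from statistics import median

def response_curve(predict, data, var, n_steps=5):
    # Vary `var` across its observed range while holding all other
    # predictors fixed at their median values in the data.
    meds = {k: median(row[k] for row in data) for k in data[0]}
    lo = min(row[var] for row in data)
    hi = max(row[var] for row in data)
    curve = []
    for i in range(n_steps):
        x = lo + (hi - lo) * i / (n_steps - 1)
        point = dict(meds, **{var: x})  # medians everywhere except `var`
        curve.append((x, predict(point)))
    return curve

# hypothetical fitted model: occurrence probability from two predictors
def toy_model(p):
    return 1.0 / (1.0 + math.exp(-(0.5 * p["temp"] - 0.1 * p["rain"])))

data = [{"temp": t, "rain": r} for t, r in [(1, 10), (3, 20), (5, 30), (7, 40)]]
for x, y in response_curve(toy_model, data, "temp"):
    print(f"temp={x:.1f}  predicted occurrence={y:.3f}")
```

Note that the curve is conditional on the medians chosen for the other predictors; re-running it with the other predictors held at different values is one simple way to expose the interactions that univariate slices otherwise hide.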
When visualized in this way, a simpler model is relatively smooth, with fewer inflection and turning points than a more complex model. Though insightful, univariate curves represent the true fitted response only incompletely (3-dimensional response surfaces, or the 'inflated response curves' of Zurell et al., help here). Complex models contain more interactions than simpler models, and interactions can only be visualized on higher-dimensional surfaces. Such responses must be interpreted as conditional on the other predictors being held at their means or medians; the responses may differ when those variables are held at other values (Zurell et al.), or from those of an unconditional model. Nonetheless, uni- and multivariate response curves remain one of the best standardized ways to assess relative model complexity. In this paper, we develop general guidelines for deciding on an appropriate level of complexity in occurrence–environment relationships. Uncertainty about how best to describe ecological complexity has to some extent divided biologists between those who prefer the principle of parsimony to identify model complexity (choosing the simplest model consistent with the data) and those who try to approximate more of the complexity of real-world relationships. We review the literature and the general modeling principles emerging from these two viewpoints, and we discuss the ways in which they overlap or differ in light of study objectives and attributes of the data. We make a variety of recommendations for choosing levels of complexity under different circumstances, while highlighting unresolved scenarios where viewpoints differ. We conclude with suggestions for drawing on the strengths of each modeling approach in order to advance our knowledge of current and future species' geographical ranges.
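The tension between parsimony and flexibility can be made concrete with a minimal simulation. Everything below is hypothetical: the 'true' response, the data, and the binned-mean stand-in for a fitted SDM are invented for illustration and are not from the study. A very flexible model tracks the training observations closely, but comparing its training error against held-out data suggests how much of that apparent signal is noise:

```python
import math
import random

random.seed(42)

def true_prob(x):
    # smooth, unimodal "true" occurrence-environment relationship
    return math.exp(-((x - 0.5) ** 2) / 0.05)

def simulate(n):
    # presence/absence observations along a single environmental gradient
    xs = [random.random() for _ in range(n)]
    ys = [1 if random.random() < true_prob(x) else 0 for x in xs]
    return xs, ys

def fit_binned(xs, ys, n_bins):
    # "model" = mean observed occurrence within each bin of the gradient;
    # more bins = a more flexible (complex) model
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * n_bins), n_bins - 1)]

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = simulate(200)
test_x, test_y = simulate(200)

fit_error = {}
for n_bins in (4, 100):  # simple vs. very flexible model
    m = fit_binned(train_x, train_y, n_bins)
    fit_error[n_bins] = (mse(m, train_x, train_y), mse(m, test_x, test_y))
    print(f"{n_bins:>3} bins  train MSE={fit_error[n_bins][0]:.3f}  "
          f"test MSE={fit_error[n_bins][1]:.3f}")
```

The same comparison underlies cross-validation-based complexity control in real SDM workflows: the flexible model's gap between training and held-out error is the overfitting that regularization, pruning, and similar settings in the table are designed to limit.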
Complexity in ecology

Many interacting biotic and abiotic processes influence species distributions and can manifest as complex occurrence–environment relationships (Soberón , Boulangeat et al. ). One essential challenge to recovering primary environmental drivers of these distributions, however, is to differentiate the signals of range determinants from sampling and environmental noise. Before embarking on statistical analyses of range determinants, ecological theory can focus an investigation (Austin , , , Pulliam , Chase and Leibold , Holt ). There is, a priori, a set of common drivers of populations that can be used to propose general shapes of occurrence–environment relationships. For example, we expect that for many variables, response curves describing a fundamental niche should be smooth because sudden jumps in fitness along an environmental gradient are unlikely to exist (Pulliam , Chase and Leibold , Holt ). For other variables, e.g. related to thermal tolerance, steep thresholds may exist due to loss of physiological function (Buckley et al. ). However, response curves describing realized niches might exhibit discontinuities due to the multiple interacting factors that can limit a species' occurrence in any particular location. Unimodal responses are expected (e.g. a bell‐shaped curve) because conditions too extreme for survival often exist at either end of a proximal gradient (Austin ). However, response curves can be linear where only part of the environmental range of the species has been sampled (e.g. one side of a unimodal response; Albert et al. ). Austin and Smith's ( ) continuum concept for plant species distributions predicts that skewed unimodal response curves are likely when plant species distributions are predominantly determined by one or a few environmental variables that strongly regulate survivorship and/or reproduction (e.g.
by temperature thresholds), but that more irregular response curves are expected given that species are influenced by a range of regulatory factors (e.g. different limiting nutrients, biotic and abiotic interactions) and historical contingencies (Austin et al. , Normand et al. ). Even with single factors, the processes that determine fitness may be different across the range, e.g. where one temperature extreme leads to abrupt loss of function while the other extreme causes gradually reduced performance. Interaction terms can be desirable to capture covariation between predictors or tradeoffs along resource gradients (e.g. higher temperatures are tolerable with greater rainfall). Many applications of SDMs do not explicitly consider such theoretical constraints on the shape of response curves (but see Santika and Hutchinson ), perhaps because it is difficult to work out how they translate into observations. We are faced with the challenge of inferring unknown levels of ecological complexity through the lens of data and models that imperfectly capture it.

Complexity in models

Two attributes of model fitting determine the complexity of inferred occurrence–environment relationships in SDMs: the underlying statistical method and modeling decisions made about inputs and settings. Together, these define what we will call different modeling approaches, a number of which are illustrated in Table .

Statistical methods

One of the primary differences among the available statistical methods for fitting SDMs is the range of transformations of predictors that they typically consider (in machine learning parlance: which ‘features’ to allow), and this helps to define the upper limit of complexity for their fitted response surfaces. We detail commonly used modeling approaches and demonstrate examples of their response curves in Table . Rectilinear or convex‐hull environmental envelopes (e.g. BIOCLIM or DOMAIN) and distance‐based approaches in multivariate environmental spaces (e.g.
Mahalanobis) are used in the simplest SDMs. Their response curves are simple functions (e.g. linear, hinge or step; Elith et al. ). Generalized linear models (GLMs) admit more complexity; for SDMs they are typically fitted with linear or polynomial features up to second order terms (rarely third or fourth order), and often without interactions. Generalized additive models (GAMs) are potentially more complex because they allow non‐parametric smooth functions of variable flexibility (Hastie and Tibshirani , Wood ). Decision trees (Breiman et al. ) can also become quite complex because these can use a large number of step functions (each requiring a parameter) and can implicitly include high order interaction terms to depict response curves of arbitrary complexity.

Modeling decisions

Decisions that affect model complexity apply to all the statistical methods described above. For example, if a large set of predictors is available, then model complexity will differ depending on whether the full set, or a small subset, is used. One must also determine which features are considered in the model. Each feature requires at least one parameter in the occurrence–environment relationship and hence increases model complexity (see the increased complexity of the black vs grey MAXENT response curves due to the increased number of features; Table ). Large numbers of predictors are more commonly used in machine‐learning approaches because they automate feature selection, whereas fewer are often used in simpler models where features are specified a priori. For example, maximum entropy models (MAXENT) can consider any number of linear, quadratic, product, threshold (step functions) or hinge transformations of the predictors (Phillips et al. , Phillips and Dudík ). In principle, this same complexity could be fit in a traditional GLM, but this is typically impractical and not of interest to ecologists.
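The link between feature sets and parameter counts can be made concrete with a minimal sketch in the spirit of MAXENT's feature classes (the function names are illustrative, not from the MAXENT software): each transformation of a raw predictor adds one coefficient, so the permitted feature classes bound the complexity of the fitted response curve.

```python
import numpy as np

# Each feature expansion of a single raw predictor x adds columns to the
# design matrix, and each column requires one fitted coefficient.
def linear_features(x):
    return x[:, None]

def quadratic_features(x):
    return np.column_stack([x, x ** 2])

def hinge_features(x, n_knots=5):
    # One hinge feature max(0, x - k) per knot; together they can trace
    # an arbitrary piecewise-linear response.
    knots = np.quantile(x, np.linspace(0.1, 0.9, n_knots))
    return np.maximum(0.0, x[:, None] - knots[None, :])

x = np.linspace(-2.0, 2.0, 50)
n_params = {"linear": linear_features(x).shape[1],
            "quadratic": quadratic_features(x).shape[1],
            "hinge": hinge_features(x).shape[1]}
```

Allowing hinge features for every predictor thus multiplies the parameter count, which is why regularization usually accompanies such expansions.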
SDM complexity is amplified when interactions between predictors are included to account for nonadditive relationships. GLMs and GAMs can include interactions that have been specified during model formulation as potentially ecologically relevant, but interactions are usually included only sparingly. Decision trees include interactions implicitly through their hierarchical structure; i.e. the response to one variable depends on the values of inputs higher in the tree, meaning that high order interaction terms (that depend on all the predictors along a branch) are possible. However, interactions between variables are fitted automatically if supported by the data and cannot be explicitly controlled by the user (except to specify the permissible order of the interactions considered). Using ensembles of models can increase or decrease complexity. Ensembles are combinations of models in which the component models can be chosen based on selected criteria (e.g. predictive performance on held‐out data; Araújo and New ) or with an ensemble algorithm (a machine learning method). For instance, regression models selected via an information criterion can be combined using ‘multi‐model inference’, allowing distributions over effect sizes and over predictions to new sites (Burnham and Anderson ). A typical machine learning approach to ensembles uses an algorithm to build an ensemble of simple models that together predict better than any one component model. Examples include bagging and boosting – while these can be used on any component models, in ecology the most commonly used component models are decision trees (e.g. in random forests, Breiman 2001; and boosted regression trees, Friedman ). Bagging (bootstrap aggregation) can be used to fit many models to bootstrapped replicates of the dataset (with or without random subsetting of the predictors used across trees, as in random forests).
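The bagging idea, and its effect on response-curve complexity, can be sketched with single-split regression 'stumps' (a toy example with invented data; real random forests use deeper trees and more careful split searches). Each stump is a crude step function, but the ensemble average is much smoother, which is one way ensembles of simple models change complexity.

```python
import numpy as np

def fit_stump(x, y):
    """Fit a single-split 'stump': the split minimizing squared error
    of two constant pieces, searched over a grid of quantiles."""
    best_err, best_split = np.inf, None
    lo = hi = y.mean()
    for s in np.quantile(x, np.linspace(0.1, 0.9, 17)):
        left, right = y[x <= s], y[x > s]
        if len(left) == 0 or len(right) == 0:
            continue
        err = left.var() * len(left) + right.var() * len(right)
        if err < best_err:
            best_err, best_split, lo, hi = err, s, left.mean(), right.mean()
    s, a, b = best_split, lo, hi
    return lambda q: np.where(q <= s, a, b)

def bagged_predict(x, y, grid, n_models=100, seed=0):
    """Average stumps fitted to bootstrap resamples of the data."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(x), len(x))  # bootstrap resample
        preds.append(fit_stump(x[idx], y[idx])(grid))
    return np.mean(preds, axis=0)

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 150)
y = 1.0 / (1.0 + np.exp(-2 * x)) + rng.normal(0, 0.1, 150)  # smooth truth + noise
grid = np.linspace(-3, 3, 50)
single = fit_stump(x, y)(grid)        # one stump: a single step
ensemble = bagged_predict(x, y, grid) # average of 100 stumps: smoother
```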
In contrast, boosting uses a forward stagewise method to build an ensemble, at each step modeling the residuals of the models fitted to date. Taking ensembles of relatively simple models usually increases complexity because combinations of simple models will not necessarily be simple. In contrast, ensembles of more complex models can average over the idiosyncrasies of individual models to produce smoother response curves (Elder ).

Model comparison

To avoid overfitting and underfitting, it is common to compare models of differing complexity and select the model that optimizes some measure of performance. However, comparing models across modeling approaches (e.g. those in Table ) can be challenging. This is one of our motivations for constraining model complexity based on study objectives and data attributes. Information theoretic measures are a conventional way to choose model complexity and are relatively easy to apply for models where estimating the number of degrees of freedom is possible. However, these cannot be calculated for ensemble‐based methods nor for many other methods in common use (Janson et al. ). In fact, Janson et al. ( ) warn, ‘contrary to folk intuition, model complexity and degrees of freedom are not synonymous and may correspond very poorly’. One way to compare models produced by different algorithms is to adopt a common currency for model performance by evaluating model predictions on either the training data or independent testing data. Measures such as AUC, Cohen's Kappa, and the True Skill Statistic are based on correctly distinguishing presences from absences. Measures based on non‐thresholded predictions are also relevant and preferable in many situations (Lawson et al. ). However, each of these metrics has weaknesses in different circumstances (Lobo et al. ) and, further, they represent only heuristic diagnostics for presence‐only data, because presences must be compared to pseudoabsence/background data (Hirzel et al. ).
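Of the metrics just mentioned, AUC has a particularly direct interpretation that can be computed from its rank definition: the probability that a randomly chosen presence receives a higher predicted score than a randomly chosen absence (ties count one half), with 0.5 indicating no better than chance. A minimal sketch:

```python
import numpy as np

def auc(scores_pres, scores_abs):
    """AUC via pairwise comparison of presence vs absence scores.
    Equivalent to the Mann-Whitney U statistic, normalized."""
    sp = np.asarray(scores_pres, dtype=float)[:, None]
    sa = np.asarray(scores_abs, dtype=float)[None, :]
    return ((sp > sa).sum() + 0.5 * (sp == sa).sum()) / (sp.size * sa.size)

perfect = auc([0.9, 0.8], [0.2, 0.1])  # every presence outscores every absence
chance = auc([0.5, 0.5], [0.5, 0.5])   # all ties: no discrimination
```

Note that with presence-only data the 'absences' here are background points, which is exactly why AUC is only a heuristic diagnostic in that setting.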
Once one has determined a suitable modeling approach, tuning the amount of complexity is more straightforward using a range of model selection techniques. Feature significance (e.g. p‐values), measures of model fit (e.g. likelihood), and information criteria (e.g. AIC, AICc, BIC; Burnham and Anderson ) can be applied to regression‐based methods. Cross‐validation or other resampling techniques are also used to set the smoothness of splines in GAMs (Wood ) or to determine tuning parameters in most machine learning methods (Hastie et al. ). Shrinkage or regularization is often used in regression, MAXENT and boosted regression trees to constrain coefficient estimates so that models predict reliably (Phillips et al. , Hastie et al. ). Loss functions, which penalize for errors in prediction, can be constructed for any of the modeling approaches we consider (Hastie et al. ). An alternative approach employs null models to evaluate whether additional complexity has led to spurious predictive accuracy (Raes and ter Steege ). Evaluation against fit to the training data alone cannot control for overfitting and risks selecting excessively complex models (Pearce and Ferrier , Araújo et al. ). In general, best practice involves splitting the data into training data to fit the model, validation data for model selection, and test data to evaluate the predictive performance of the selected model (Hastie et al. ). Recent studies have emphasized that care should also be taken in how data are partitioned into training, validation and test data, in particular to control for spatial autocorrelation (Latimer et al. , Dormann et al. , Veloz , Hijmans ; see below for more details). Hence, methods such as block cross‐validation (where blocks are spatially stratified) are gaining momentum (Hutchinson et al. , Pearson et al. , Warton et al. ). Failure to factor out spatial autocorrelation in data partitioning can lead to misleadingly good estimates of model predictive performance.
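The core of spatial block cross-validation is that fold membership is assigned to whole geographic blocks rather than to individual points, so training and test points are spatially separated. A minimal sketch (square blocks on a grid; the function name is ours, and real implementations offer many block shapes and assignment schemes):

```python
import numpy as np

def spatial_block_folds(coords, block_size, n_folds, seed=0):
    """Assign each point a fold label by the spatial block it falls in."""
    cells = np.floor(coords / block_size).astype(int)   # grid cell per point
    block_id = np.unique(cells, axis=0, return_inverse=True)[1].reshape(-1)
    rng = np.random.default_rng(seed)
    # Whole blocks, not points, are randomized across folds.
    fold_of_block = rng.integers(0, n_folds, block_id.max() + 1)
    return fold_of_block[block_id]

rng = np.random.default_rng(7)
coords = rng.uniform(0, 100, size=(300, 2))  # x, y locations of records
folds = spatial_block_folds(coords, block_size=20.0, n_folds=5)
```

Because nearby points share a block, they can never be split between a training and a test fold, which is the property that weakens spatially autocorrelated leakage.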
Basing model comparison on holdout data presents some practical challenges. Sample size may be insufficient to subset the data without introducing bias. Subsets of the data can contain the same or different biases compared to the full data set. In particular, it can be difficult to remove spatial correlation between training and holdout data when the sampling design for the occurrence data is unknown or when a species is restricted geographically or environmentally (this is discussed below). Importantly, all these approaches to model comparison have strengths and weaknesses, and none can unambiguously select between models of differing complexity built with different statistical methods and underlying assumptions. The tried and tested methods of statistics and machine learning for model selection are valuable when working within a particular modeling approach, but to benefit from these, it helps to narrow the scope of the feasible models based on biological considerations. We therefore now move to exploring approaches for identifying the appropriate level of complexity for particular study objectives based on data limitations and the underlying biological processes.

Philosophical, statistical and biological considerations when choosing complexity

In this section, we discuss factors that should influence the choice of model complexity. First, we outline general considerations and philosophical differences underlying both simple and complex modeling strategies (section Simple versus complex: fundamental approaches to describing natural systems). Next, we discuss how the study goals (section Study objectives) and data attributes (section Data attributes) interact with model complexity. Figure summarizes our findings. Importantly, a general consensus for choosing model complexity is not possible in many cases.
To reflect the different schools of thought, we divide our facts, ideas and opinions into those that are relatively uncontroversial (subsections denoted ‘Recommendations’), those that favor simple models (denoted ‘Simple’), and those that favor more complex models (denoted ‘Complex’). We recall that ‘simple’ and ‘complex’ refer to the extremes along a gradient of complexity in response curves produced by distinct statistical methods and modeling decisions (section Complexity in models and Table ).

Influence of study objectives and data attributes on the choice of model complexity. Green arrows illustrate attributes where the choice of complexity is of no particular concern. Red arrows illustrate situations where caution and/or experimentation with model complexity is needed. Gray arrows indicate decisions that involve interactions with other study goals or data attributes. The thickness of the arrows indicates the strength of the arguments in favor of choosing a specific level of complexity, with thicker arrows indicating stronger arguments.

Simple versus complex: fundamental approaches to describing natural systems

Simple

Simple models tend towards a conservative, parsimonious approach and typically avoid over‐fitting. They link model structure to hypotheses that posit occurrence–environment relationships a priori and examine whether the resulting model meets these expectations. Simple models have greater tractability, can facilitate the interpretation of coefficients (cf. Tibshirani ), can help in understanding the primary drivers of species occurrence patterns, and are likely to be more easily generalized to new data sets (Randin et al. , Elith et al. ). Although complex responses surely exist in nature, we often cannot detect them because their signal is weak or confounded with sampling noise, bias or spatial autocorrelation.
By using models that are too complex, one can inadvertently assign patterns caused by data limitations or missing processes, or both, to environmental suitability and fit those patterns simply by chance.

Complex

Complex models are often semi‐ or fully non‐parametric, and are preferred when there is no desire to impose parametric assumptions or specific functional forms, or to pre‐select predictors a priori. This does not mean that they are not biologically motivated, but rather emphasizes the reality that Nature is complex. Simple models may be readily interpretable but misleading (Breiman ), and for many applications of SDMs a preference for predictive accuracy in new data sets over interpretability is justifiable. Also, complex models are not necessarily difficult to interpret. Indeed, their complexity can be valuable for suggesting novel, unexpected responses. If we do not explore the full spectrum of complexity, there is a risk of obtaining an overly simplified, or even biased, view of ecological responses. Complex models can, depending on how they are structured, still identify simple relationships if responses are strong and robust.

Study objectives

Niche description vs range mapping

Two prominent applications of SDMs are characterizing the predictors that define a species' niche and projecting fitted models across a landscape. Niche characterization quantifies the variables, primarily climatic and physical, that affect a species' distribution. This is often done by analyzing response curves, the functions (coefficients or smoothing terms) that define them, and their relative importance in the model. Projecting these fitted models across a landscape can predict the geographic locations where the species may occur in the present or in the future. In some studies, the focus lies in the final mapped predictions rather than in how they derive from the underlying fitted models.
Recommendations

Some evaluation of the biological plausibility of the shape and complexity of response curves is always valuable, even if the objective is not niche description. Such evaluation is particularly critical for extrapolation (section Interpolate vs extrapolate), though it is admittedly quite challenging in multivariate models. Modelers should also carefully evaluate whether maps built from complex models substantially differ from maps built from simple models. If the predictions differ, the source of the difference should be explored. If the interest lies in interpretation, it is important to assess whether the mapped predictions are right for the right reasons, and that complex environmental responses have not become proxies for sources of spatial aggregation in the data that lead to bias when projected to other locations (whether interpolation or extrapolation; section Spatial autocorrelation).

Simple

Simple models are preferable for niche description because they usually yield straightforward, smooth response curves that can be linked directly to ecological niche theory (section Complexity in models; Austin ), in contrast to the often irregular shapes that result from complex models (Table ). Assumptions about species responses are also more transparent when simple models are projected in new situations.

Complex

Complex models can be valuable for describing a species' niche when only qualitative descriptors of response curves are necessary (e.g. positive/negative, modality, relative importance) – i.e. even complex responses can be described in terms of main trends. Allowing complexity might offer more chance of identifying relevant response shapes. Complex models can be powerful for accurately mapping within the fitting region (Elith et al. , Randin et al. ) when one is not necessarily concerned with an ecological understanding of the complexity of the underlying models.
Although the source of complex relationships may remain unknown, complex models have the flexibility to describe them. Abrupt steps in response curves might be helpful for uncovering strictly unsuitable sites when mapping a distribution in space.

Hypothesis testing vs hypothesis generation

Some SDM studies are focused on testing specific hypotheses about how species are distributed in relation to particular predictors or features. In others, little is known about the predictors shaping the distribution, and the objective is to explore occurrence–environment relationships and generate hypotheses for explanation. For example, SDMs are valuable exploratory analyses for detecting the processes that confound occurrence–environment relationships, such as transient dynamics, dispersal, biotic interactions, or human modification of landscapes. The indirect effect of such processes can be seen in occurrence patterns, often as abrupt changes or nonlinearities in response curves, which in turn suggest new hypotheses. Whether one is testing or generating hypotheses critically affects the level of complexity permitted, because hypothesis testing depends on being able to isolate the effects of particular features, whereas this matters less when exploring data in order to generate hypotheses.

Recommendations

When testing hypotheses, insights from ecological theory can guide the selection of features to include. A higher degree of control over the specific details of the underlying response surface is likely needed for hypothesis testing, which is much easier with simple models. Hypothesis testing is more challenging in complex models with correlated features that can trade off with one another. Complex models are well suited to hypothesis generation, enabling a wider range of environmental covariates and modeling options than can be conveniently explored with simple models.
Simple

When the goal is hypothesis testing, simple parametric models allow investigation of the strength and shape of relationships between species occurrence and a small set of features. Furthermore, parametric models allow hypothesis tests to examine whether specific nonlinear features should be included in the selected model(s). The problem with complex models in such a setting is that, with the large suite of potential features that they use, it is challenging to determine the significance of a single feature or attribute of the response curve, or to compare alternative models. Instead, one is constrained to accept the features selected by the statistical method (e.g. feature classes in MAXENT; splits in tree‐based methods) to represent a predictor (within some user‐specified bounds). It is preferable instead to specify a set of features (or multiple sets for competing models) to determine their suitability for describing a particular pattern. For example, when features are selected automatically, it may be challenging to determine whether a quadratic term that makes the response unimodal is important, or how much better or worse the model might be without it.

Complex

The starting premise for hypothesis testing is a priori ecological understanding that enables the user to select a small set of features. However, we do not always have this prior understanding. Complex models explore much larger sets of nonlinear features and interactions than simple models and are suited for generating hypotheses about underlying processes (Boulangeat et al. ) derived from potentially flexible responses that would not often be detected with simpler models (e.g. bimodality). This same flexibility can be used to augment existing knowledge. For example, if we know that a species is associated with dry, high elevation locations, we do not need a simplified model to describe this, but rather more insight from a potentially complex model that can capture bimodality or strong asymmetries.
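One widely used, model-agnostic way to generate hypotheses about which predictors matter is permutation importance: permute one predictor at a time and measure how much predictive error increases. A minimal sketch (the 'fitted model' here is a made-up suitability function standing in for any SDM, and the noiseless setup is purely illustrative):

```python
import numpy as np

# Hypothetical fitted model: predictor 0 matters much more than predictor 1.
def predict(X):
    return 1.0 / (1.0 + np.exp(-(2.0 * X[:, 0] + 0.1 * X[:, 1])))

def permutation_importance(X, y, predict, seed=0, n_repeats=20):
    """Mean increase in squared error when each column is permuted."""
    rng = np.random.default_rng(seed)
    base_err = np.mean((y - predict(X)) ** 2)
    importance = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break the x_j-y link
            importance[j] += np.mean((y - predict(Xp)) ** 2) - base_err
    return importance / n_repeats

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2))
y = predict(X)  # noiseless responses, so the baseline error is zero
imp = permutation_importance(X, y, predict)
```

With correlated predictors such indices must be read cautiously, since permutation creates unrealistic covariate combinations.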
Complex models also provide tools for evaluating predictor importance, which is useful for both generation and testing of hypotheses and can lead to inference that differs little from simpler models (Grömping ). These importance indices can be generated from permutation tests (Strobl et al. , Grömping ), contribution to the likelihood (e.g. ‘percent contribution’ in MAXENT), or the proportion of deviance explained (decision trees).

Interpolate vs extrapolate

When predicting species' distributions over space and time, it is important to distinguish between interpolation and extrapolation. When a point is interpolated by a fitted model, it lies within the known data range of the predictors but was not measured for its response. Alternatively, an extrapolated point is one that lies outside the observed range of the predictors. Both interpolation and extrapolation can occur in geographic or environmental space (cf. Peterson et al. , Aarts et al. ). Extrapolation requires caution in all scenarios but cannot be avoided when assessing questions relating to ‘no‐analogue’ climate scenarios (Araújo et al. ) or range expansion. The correlative models discussed here are not optimal for extrapolation in many cases; process‐based models are generally preferred because the functional form of the response curve captures the processes that apply beyond the range of the observed data (Kearney and Porter , Thuiller et al. , Merow et al. ).

Recommendations

The challenges associated with interpolation and extrapolation, though differing in the way they manifest, are apparent for models of any complexity, and hence the simple and complex perspectives align. Interpolation within the range of the observed data will be accurate if the model includes all processes operating in the interpolation extent and is based on well‐structured data. Without that, prediction to unsampled sites will average across unrepresented processes and might reflect biases in the sample.
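A simple safeguard in this spirit is to flag projection points that fall outside the training range of any predictor, a crude univariate cousin of multivariate environmental-similarity checks (the function name and data are ours, for illustration only):

```python
import numpy as np

def outside_training_range(X_train, X_project):
    """Flag projection points where any predictor exceeds the range seen
    in training. Catches only univariate extrapolation: novel
    *combinations* of predictors (relevant for interaction features)
    can still slip through undetected."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    return ((X_project < lo) | (X_project > hi)).any(axis=1)

X_train = np.array([[0.0, 10.0], [1.0, 12.0], [2.0, 14.0]])
X_proj = np.array([[1.5, 11.0],   # inside both predictor ranges
                   [3.0, 11.0],   # outside predictor 1
                   [1.0, 20.0]])  # outside predictor 2
flags = outside_training_range(X_train, X_proj)
```

Predictions at flagged cells depend entirely on how the fitted features behave beyond the data, so they deserve the extra scrutiny discussed below.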
More generally, it may not matter whether a response curve is complex as long as it retains the basic qualities of a simpler model. For example, a line and a sequence of small step functions parallel to that line will produce similar predictions. Some caution should be taken with complex models, as complex combinations of features can become proxies for unmeasured spatial factors in unintended ways and inadvertently model clustering in geographic space as complexity in environmental space, which can lead to errant interpolation (section Spatial autocorrelation). Extrapolation always requires that response curves have been checked for biological plausibility (cf. section Niche description vs range mapping). Of course, even simple models can extrapolate poorly. For example, Thuiller et al. ( ) showed that a simple GLM or GAM run on a restricted and incomplete range could create spurious termination of the smoothed relationships, leading to errant extrapolation. Hence, the importance of extrapolation can depend on the chosen spatial extent and on the selected features (section Spatial extents and resolution). Complex models should be carefully monitored at the edges of the data range, because both small sample sizes and the ways different statistical methods handle extrapolation can have drastic effects on predictions (Pearson et al. ). When using complex models, feature space may be sparsely sampled, which means that when one expects to interpolate a predictor, there may be inadvertent extrapolation of nonlinear features. For example, in a model with interaction terms, one may adequately sample the linear features for all predictors while poorly sampling the relevant combinations of these predictors (Zurell et al. ). Complex models can lead to different combinations of features producing similar model performance in the present (Maggini et al. ), but vastly diverging spatial predictions when transferred to other conditions (Thuiller , Thuiller et al. , Pearson et al.
, Edwards et al. , Elith et al. ). Narrowing the range of possibilities using a simpler model that controls for the biological plausibility of the response curves (cf. section Complexity in models) can reduce this divergence (Randin et al. ).

Data attributes

Sample size

The number of occurrence records is a critical limiting factor when building SDMs. With presence–absence data, the number of records in the least frequent class determines the amount of information available for modeling. Small sample sizes can lead to low signal to noise ratios, making it difficult to evaluate the strength of any occurrence–environment pattern in the presence of confounding processes.

Recommendations

Simple models are necessary for species with few occurrences to avoid over‐fitting (Fig. ). This suggests few predictors and only simple features. Support for features can be found by reporting intervals on response curves (e.g. from confidence intervals or subsamples), with an eye for tight intervals around pronounced nonlinearities. For large data sets, any of the modeling approaches described earlier is potentially suitable, depending on the study objectives.

Simple

We expect a large amount of noise in occurrence data due to processes unrelated to environmental responses, and this noise can be particularly influential when sample sizes are small. For example, if a basic temperature response is built from data that are variably influenced by a strong land‐use history and dispersal limitation throughout the range, a failure to take that into account results in a misspecified climate response surface. While simple models have a chance of smoothing over such variations, complex models can more readily fit these latent patterns, leading to biased prediction when models are projected to other locations where the latent processes differ. Complex models fitting many features are only appropriate when there are sufficient data to meaningfully train, test and validate the model (cf.
Hastie et al. ).

Complex

If data are available, increasing the number of predictors ensures a more accurate understanding of the drivers of distributions. If the data set is small, it is still possible to use a method that can be potentially complex, as long as it is well controlled by the user to protect against over‐fitting, e.g. using penalized likelihoods (Tibshirani ), a reduced set of features in MAXENT (Phillips and Dudík , Merow et al. ), or heavy pruning in tree‐based methods. Permitting some complexity may be useful to identify counterintuitive response curves and to develop stratified sampling strategies for future data collection to support or refute the modeled responses.

Sampling bias

Sampling bias arises from imperfect sampling design, which includes purposive, non‐probabilistic, or targeted sampling (Schreuder et al. , Edwards et al. ) and imperfect detection (MacKenzie et al. ). The important question is whether sampling bias – which often arises in geographic space – transfers to bias in environmental space, and further, whether some environments are completely unsampled. No statistical manipulation can fully overcome biased sampling. The main challenge when choosing complexity is that – particularly for models based on presence‐only data – it may be unclear whether patterns in environmental space derive from habitat suitability, divergence between the fundamental and realized niches (Pulliam ), transient behavior, or sampling problems (Phillips et al. , Hefley et al. , Warton et al. ). For presence–absence data with perfect detection, sampling biases may not be too detrimental as long as at least some samples exist across the environments into which the model is required to predict (Zadrozny , but see Edwards et al. for contrasting results).

Recommendations

More flexible models will be more prone to finding patterns in restricted parts of environmental space where sampling is problematic.
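One concrete way to keep a flexible model conservative when data are few or unevenly sampled, as the penalized-likelihood suggestion above envisions, is shrinkage. Closed-form ridge regression is the simplest illustration (a toy sketch with invented data; MAXENT's regularization and the lasso apply the same idea with an L1 penalty, which can zero coefficients out entirely):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression: minimize ||y - X b||^2 + lam * ||b||^2.
    The penalty pulls coefficients toward zero, so a model with many
    features still behaves conservatively on a small data set."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 10))        # small sample, many candidate features
beta_true = np.zeros(10)
beta_true[0] = 2.0                   # only one feature truly matters
y = X @ beta_true + rng.normal(0, 0.5, 30)

beta_ols = ridge_fit(X, y, lam=0.0)  # unpenalized least squares
beta_ridge = ridge_fit(X, y, lam=10.0)
```

Increasing `lam` trades a little bias for a large reduction in variance, which is exactly the control over complexity that small or biased samples demand; `lam` itself is usually chosen by cross-validation.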
Poor performance on test data could identify overfitting to sampling bias, but only if the test data are unbiased. In practice, if unbiased testing data were available, they could be used to build an unbiased model in the first place. Recent advances that enable presence‐only and presence–absence data to be modeled together, and across species, will be useful in this context (Fithian et al. ). A tradeoff exists between a complex model that might fit, e.g., step functions to few data points in poorly sampled regions and a simple model that predicts smooth but potentially meaningless functions from just a few points.

Simple

The hope when using simple models for biased data is that the main trends are still identified. Complex models can over‐fit to the bias (particularly if the bias is heterogeneous in space) and miss the true main trends. Methods for dealing with imperfect detection (MacKenzie and Royle , Welsh et al. ) or sampling design often specify relatively simple responses to the environment because they simultaneously fit a model for sampling (Latimer et al. ), and identifiability can become an issue when too many parameters are used that might relate to either observation or occurrence. In such cases, inference will be limited to very general trends.

Complex

If the sampling bias is strongly linked to the environmental gradients, even simple models can predict spurious relationships (Lahoz‐Monfort et al. ). Complex models could be useful in understanding, or hypothesizing about, the nature of the sampling bias: for example, the most parsimonious explanation for sharp changes in the probability of presence could in some circumstances be sampling bias, although we know of no published examples. Detection and sampling bias models are not restricted to simple models – for instance, the former have recently been developed for boosted regression trees (Hutchinson et al. ) and the latter are often used with MAXENT (Phillips et al. ).
Predictor variables: proximal vs distal
A priority in selecting candidate predictors is to identify variables that are as proximal as possible to the factors constraining the species' distribution. Proximal variables (e.g. soil moisture for plants) best represent the resources and direct gradients that influence species ranges (Austin ). More distal predictors, such as topographic aspect used as a surrogate for soil moisture, do not affect species distributions directly but do so indirectly through their imperfect relationships with the proximal predictors they replace. The problem with using distal predictors is that their correlation with the proximal predictor can change across the species' range, even if the proximal predictor's relationship with the species does not (Dormann et al. ). We rarely have access to all of the most important proximal predictors across a study region, so the main question is: what response shapes should we expect for more distal predictors? Imagine that a species is limited by the duration of the growing season, but that the response is instead modeled with a combination of mean annual temperature and topographic position (aspect, slope, etc.). It is difficult to anticipate the shape of the multivariate surface that mimics the species' response to the proximal predictor.

Recommendations
Responses to proximal predictors over sufficiently large gradients should be relatively strong (Austin and references therein), and either simple or complex models should be able to identify these responses if complexity is suitably controlled. However, the extent to which the included set of predictors is proximal or distal may be unknown. Experimentation with complex and simple models may help test hypotheses about which predictors are more proximal, and potentially best encapsulated in a simple response curve, and which are more distal and better represented with more complex curves.
As physiological mechanisms generally provide the best insights into how environmental gradients translate into demographic (and therefore population) patterns, informed physiological understanding provides a valuable starting point (Austin , Kearney and Porter ).

Simple
Ecological theory supports using unimodal or skewed smooth responses to proximal variables (Austin and Nicholls , Oksanen , Austin , , Guisan and Thuiller , Franklin ), which motivates constraining the functional form of response curves a priori (section Complexity in models; e.g. specific features in a GLM, few nodes in a GAM). Remotely sensed data, even for proximal predictors, may introduce noise into the environmental covariates due to imprecision and to the use of long-term averaged data (Austin , Letten et al. ), and may be prone to over-fitting with complex models if those data generally fail to describe the local habitat conditions accurately. One can use simple models to smooth over such idiosyncrasies if the main trends are sufficiently strong, or omit predictors if trends are weak. Parametric, latent-variable models can help to deal with this imprecision (Mcinerny and Purves ).

Complex
Ecological theory is based on responses to idealized gradients, whereas we observe (often imperfectly) a messy reality. Specifying an overly simple model will result in over- and under-estimation of the response at points throughout the covariate space (Barry and Elith ). Given that the relationship between proximal and distal predictors is unlikely to be linear and may vary across landscapes, the true response to distal variables is likely to be complex and best represented by a model that allows flexible fits and interactions. Hence the complex viewpoint still adheres to ecological theory, but allows for a modified view of idealized relationships as seen through the available data.
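Constraining the functional form a priori, as the simple viewpoint advocates, can be as little as one quadratic term on the logit scale. The sketch below is ours (simulated data, hypothetical coefficients): a logistic GLM with linear and quadratic features yields a unimodal response whose optimum can be read directly off the coefficients.

```python
# Sketch only: a GLM-style model constrained a priori to a unimodal
# (quadratic-on-the-logit) response, per ecological theory.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
temp = rng.uniform(0, 30, 1000)
# hypothetical "truth" with an optimum at 20 on the gradient
p_true = 1 / (1 + np.exp(-(-6 + 0.8 * temp - 0.02 * temp**2)))
y = rng.binomial(1, p_true)

X = np.column_stack([temp, temp**2])  # linear + quadratic feature
glm = LogisticRegression(C=1e6, max_iter=5000).fit(X, y)  # large C ~ unpenalized GLM
b1, b2 = glm.coef_.ravel()

# For b2 < 0 the fitted response is unimodal, with its optimum at -b1 / (2 * b2).
optimum = -b1 / (2 * b2)
print(f"fitted quadratic coefficient: {b2:.4f}; estimated optimum: {optimum:.1f}")
```

The same constraint is expressed in a GAM by limiting the number of knots; the quadratic logit is simply the most transparent version.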
Spatial extents and resolution
Interpretation of ecological patterns is scale dependent; hence changing spatial extent and/or resolution affects the patterns and processes that can be modeled (Tobalske , Chave ). Ecologists often use hierarchical concepts to describe influences of environment on species distributions – for instance, climate dominates distributions of terrestrial species at the global scale (coarsest grain, largest extent), while topography, lithology, or habitat structure create the finer-scale variation that impacts species at regional to local scales, together with dispersal limitations and biotic interactions (Boulangeat et al. , Dubuis et al. , Thuiller et al. ). SDMs built across large spatial extents often rely on remotely sensed, coarse-resolution, or highly interpolated predictors, creating inherent biases and sampling issues (section Sampling bias). The choice of extent can also determine whether the species' entire range is included in the model or whether data are censored (e.g. limited by political borders).

Recommendations
Resolutions should be chosen that provide data on proximal rather than distal variables. Such data are becoming available at high resolutions with expanded and technologically enhanced monitoring networks and more sophisticated interpolation of climate data (e.g. PRISM). The choice of resolution hence reduces to the discussion of proximal versus distal predictors in section Predictor variables: proximal vs distal. When the extent is chosen to contain the species' entire range, models should include sufficient complexity to detect unimodal, skewed responses (section Complexity in models).

Simple
Smooth responses, characterized by simpler models, are to be expected at large spatial extents and coarse resolutions that smooth over the confounding processes affecting finer-resolution occurrence patterns (Austin ).
At finer resolutions, it may also be undesirable to incorporate the full complexity of the response curve: many of the finer details may derive from factors for which no predictor variables are available, or that are irrelevant to the purpose of the investigation (e.g. microhabitat or regional competition effects).

Complex
At small spatial extents, we might have data on the relevant proximal factors (e.g. soil properties), so fitting complex models along small-scale gradients can capture this complexity. Also, complex models may be useful for exploring the nonlinearities that arise in response curves from distal variables at broad scales, in that they potentially provide insight into important unmeasured variables.

Spatial autocorrelation
Many processes omitted from SDMs have spatial structure. For example, dispersal limitation, foraging behavior, competition, prevailing weather patterns, and even sampling bias can all lead to spatially structured occurrence patterns that are not explained by the set of predictors included in the SDM (Legendre , Barry and Elith , but see Latimer et al. , Dormann et al. ). When these spatial patterns are not appropriately accounted for, biased estimates of environmental responses may emerge.

Recommendations
If presence–absence data are available, one should assess the degree of spatial autocorrelation in the residuals and implement methods to control for it. Methods include spatially explicit models that separate the spatial pattern from the environmental response (Latimer et al. , Dormann et al. , Beale et al. ), using spatial eigenvectors as predictors (Diniz-Filho and Bini ), or stratified sub-sampling of the data to minimize autocorrelation (Hijmans ). Complex models should be used cautiously in the presence of spatial autocorrelation, because their flexibility may lead them to confound aggregation in geographic space with complexity in environmental space.
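Assessing residual spatial autocorrelation, as recommended above, is commonly done with Moran's I. The following is a minimal, self-contained sketch (our own implementation with inverse-distance weights; coordinates and residuals are simulated, not from the paper):

```python
# Sketch only: Moran's I on hypothetical residuals at simulated coordinates.
import numpy as np

def morans_i(values, coords):
    """Moran's I with inverse-distance weights (zero on the diagonal)."""
    values = np.asarray(values, dtype=float)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    w = np.where(d > 0, 1.0 / np.maximum(d, 1e-12), 0.0)  # w_ii = 0
    z = values - values.mean()
    n = len(values)
    return (n / w.sum()) * (z @ w @ z) / (z @ z)

rng = np.random.default_rng(3)
coords = rng.uniform(0, 100, size=(200, 2))

# Spatially structured residuals (a west-east trend plus noise) vs pure noise
structured = coords[:, 0] / 50.0 + 0.3 * rng.normal(size=200)
random = rng.normal(size=200)

i_struct = morans_i(structured, coords)
i_rand = morans_i(random, coords)
print(f"Moran's I, structured residuals: {i_struct:.3f}")
print(f"Moran's I, random residuals:     {i_rand:.3f}")
```

Values near zero are consistent with no autocorrelation (the null expectation is -1/(n-1)); clearly positive values signal spatial structure left in the residuals that the environmental predictors did not absorb.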
For example, if a large number of presences are recorded in a small region of environmental space due to social behavior in geographic space, it is more likely that a complex model will find some feature in environmental space that correlates with this clustering. This will result in biased interpretation or biased mapped projections in other locations where this social behavior is absent. Cross-validation can eliminate such spurious fits, but only if it is spatially stratified at an appropriate scale. However, when used for exploratory purposes, complex models may reveal information about this spatial structure within their response curves.

Simple
Simple parametric models can accommodate spatial structure under assumptions about the correlation structure (Latimer et al. , Dormann et al. ). If a non-spatial model is used, simple models can be valuable because they are not flexible enough to model discontinuities in the response curve that derive from spatial structure; however, they will still exhibit bias due to aggregated observations. Another way of dealing with spatial aggregation is to model at a sufficiently coarse resolution (suggesting simple models; see Spatial extents and resolution) that geographic clustering occurs within (and not among) cells, so it can effectively be ignored. One should be cautious when building complex models because, in practice, obtaining spatially independent cross-validation samples is extremely challenging when the underlying spatial process is unknown, and failing to do so likely leads to over-fitting (cf. Hijmans ).

Complex
It may be desirable to use complex response curves as proxies for geographic clustering in mapping applications if the model focuses on small extents where nonlinear relationships are likely to hold across the landscape of interest (e.g. interpolation).
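The spatially stratified cross-validation mentioned above can be approximated by holding out whole spatial blocks rather than random points. A sketch (our own, with simulated coordinates and a hypothetical 4 x 4 block grid; scikit-learn's GroupKFold does the block-wise splitting):

```python
# Sketch only: spatially blocked cross-validation on simulated data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(4)
n = 600
coords = rng.uniform(0, 100, size=(n, 2))
env = coords[:, 0] / 10 + rng.normal(0, 1, n)  # environment follows a spatial gradient
y = rng.binomial(1, 1 / (1 + np.exp(-(env - 5))))

# Assign each point to a spatial block (here a 4 x 4 grid of 25 x 25 cells);
# whole blocks are held out, so test folds are spatially separated from training data.
blocks = (coords[:, 0] // 25).astype(int) * 4 + (coords[:, 1] // 25).astype(int)

scores = []
for train, test in GroupKFold(n_splits=5).split(env.reshape(-1, 1), y, groups=blocks):
    m = LogisticRegression().fit(env[train].reshape(-1, 1), y[train])
    scores.append(m.score(env[test].reshape(-1, 1), y[test]))
print(f"spatially blocked CV accuracy: {np.mean(scores):.3f}")
```

The block size is the key tuning decision: blocks must be larger than the scale of the residual spatial autocorrelation for the held-out folds to count as independent.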
For example, Santika and Hutchinson ( ) showed that restricting logistic regression to linear responses, instead of allowing for unimodal responses as in semi-parametric GAMs, reduced model performance and misleadingly introduced spatial autocorrelation into the residuals. Methods for dealing with spatial and temporal autocorrelation have recently become available for complex models as well (Hothorn et al. , Crase et al. ).

Conclusions
Methodological
Based on our observations on the appropriate use of different statistical methods and modeling decisions, how should modelers proceed to build SDMs? Many modelers' preferences for particular statistical methods derive from the types of data they typically use and the questions they ask, rather than from any fundamental philosophy of statistical modeling. For this reason, it is valuable for modelers to have experience with both simple and complex modeling strategies. We suggest that researchers develop a comprehensive understanding of regression models in general and GLMs in particular, as these represent the foundation of almost all of the more complex modeling frameworks. Also, understanding at least one approach to building complex SDMs can allow for sequential tests of more complex model structure. Importantly, because there are many different approaches to handling the same challenges in the data, it is less critical to understand each and every modeling approach than to become an expert in applying representatives of simple and complex modeling approaches. Bias can come from over-fitting complex models, and it can come from misspecified simple models. To find a model of optimal complexity, many approaches are possible and are readily justified if sufficient cross-validation has been performed. One might consider starting simple and adding the minimum complexity necessary (Snell et al. , this issue), or conversely starting with a complex model and removing as much superfluous complexity as possible.
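Either direction – starting simple and adding complexity, or starting complex and pruning – reduces in practice to comparing candidate complexities on held-out data. A sketch (our own, with simulated data; polynomial degree stands in for model complexity, and cross-validated log-loss is the yardstick):

```python
# Sketch only: choosing complexity by cross-validated comparison on simulated data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(5)
x = rng.uniform(0, 30, 800)
p = 1 / (1 + np.exp(-(-6 + 0.8 * x - 0.02 * x**2)))  # true response is quadratic
y = rng.binomial(1, p)

results = {}
for degree in (1, 2, 4, 8):
    model = make_pipeline(PolynomialFeatures(degree), StandardScaler(),
                          LogisticRegression(max_iter=5000))
    # neg_log_loss: higher (closer to zero) is better
    results[degree] = cross_val_score(model, x[:, None], y, cv=5,
                                      scoring="neg_log_loss").mean()
    print(f"degree {degree}: mean CV log-loss {-results[degree]:.3f}")

best = max(results, key=results.get)
print(f"best-scoring degree: {best}")
```

In this simulation the linear model is penalized for misspecification, while degrees beyond the true quadratic typically buy little; one would then pick the simplest degree whose score is within tolerance of the best.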
If one can use the considerations discussed here to narrow the potential complexity down to models within a particular modeling approach (Table ), then traditional model selection techniques are appropriate (section Modeling decisions). Due to the exploratory nature of many SDMs and the desire to discover spatial patterns and their drivers, we recommend that analyses begin by exploring complex models to determine an upper bound on the complexity of response curves. Over-fitting can be controlled through cross-validation (e.g. k-fold, and particularly block resampling methods), even if a full decomposition into train–validation–test data is not feasible. Furthermore, complex models can be used to identify smooth, simple occurrence–environment relationships if patterns are sufficiently strong, and to guide the specification of simpler models. In contrast, it will be more difficult to overcome a misspecified simple model, should a more complex response exist. If the exploration with complex models reveals smooth relationships, one can shift to a simpler model. If instead strong nonlinearities are prevalent, one should consider biological explanations for them. If complex nonlinearities cannot be avoided, one should focus on minimizing the complexity, understanding it through sensitivity and uncertainty analysis (below), and providing biologically based hypotheses about it. The end result is a model that adds complexity only to the extent necessary to reproduce observed patterns. Uncertainty analysis is a relatively untapped resource for understanding appropriate model complexity. When the influence of particular model components is unknown (e.g. whether a predictor or feature is relevant a priori), it is particularly critical to account for uncertainty in modeled relationships to explore the implications of our ignorance.
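One inexpensive, simulation-based route to such uncertainty in modeled relationships is bootstrapping. A sketch (our own: simulated data, a hypothetical quadratic-logistic model, and a pointwise 95% band on the fitted response curve):

```python
# Sketch only: bootstrap uncertainty band for a fitted response curve.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
x = rng.uniform(0, 30, 400)
p = 1 / (1 + np.exp(-(-6 + 0.8 * x - 0.02 * x**2)))
y = rng.binomial(1, p)
X = np.column_stack([x, x**2])

grid = np.linspace(0, 30, 61)       # evaluation points along the gradient
G = np.column_stack([grid, grid**2])

# Refit on bootstrap resamples to get pointwise uncertainty in the response curve.
preds = []
for _ in range(200):
    idx = rng.integers(0, len(y), len(y))
    m = LogisticRegression(C=1e6, max_iter=5000).fit(X[idx], y[idx])
    preds.append(m.predict_proba(G)[:, 1])
preds = np.array(preds)

lo, hi = np.percentile(preds, [2.5, 97.5], axis=0)
width = hi - lo
print(f"mean 95% interval width along the gradient: {width.mean():.3f}")
```

A pronounced nonlinearity accompanied by a tight band warrants more confidence than one whose band is wide enough to contain a smooth alternative.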
By studying uncertainty, one can gain confidence in pronounced nonlinearities when they come with tight confidence intervals. Information on parameter uncertainty, and consequently prediction uncertainty, can be obtained from any means of simulation from parameter distributions, including posterior sampling, sampling based on point estimates and covariance matrices, or bootstrapping. Bayesian models have the advantage of using the full data set to estimate parameter uncertainty, but are generally restricted to simpler models to avoid convergence issues (Latimer et al. , Ibáñez et al. ). One way of reducing uncertainty in predictions is to analyze the importance of predictors given the model and data using 'average predictive comparisons' (Gelman and Pardoe ), a form of sensitivity analysis that incorporates parameter uncertainty. One can also quantify uncertainty due to modeling decisions by using ensembles of models built with different statistical methods or decisions (Pearson et al. , Araújo and New , Thuiller et al. ), provided that each component model is built on modeling decisions reflecting a common goal.

Biological
Despite the valuable insights we can gain from occurrence models, it is worth acknowledging that fundamental limits to biological inference may emerge from these studies (Tyre et al. , Araújo and Guisan , Araujo and Peterson , Merow et al. ). Balancing complex and simple models in such a way as to discover and discuss these limits may be as important as the actual patterns identified with some datasets. More broadly, it is important to keep in mind that we are ultimately performing exploratory analyses of occurrence–environment relationships. Occurrence records are not the ideal data for predicting attributes of populations; Thuiller et al. ( ) provide an interesting cautionary note by showing weak relationships between occurrence probability and various demographic parameters for 108 tree species in temperate forests.
However, often no other data are available at the large spatial extents that might inform range models. Thus, while the limits may be obvious, insights from occurrence-based correlative models may be an essential step in developing new hypotheses and research programs that can lead to the next generation of mechanistic models (Schurr et al. , Thuiller et al. , Snell et al. ). A novel, and potentially important, application of SDMs is informing mechanistic models about the shapes of response curves in demographic models (Merow et al. ) or dynamic spatio-temporal population models (Pagel and Schurr , Boulangeat et al. , Thuiller et al. ). Simple models may be preferable for these tasks because it is important to have a clear hypothesis to evaluate when linking a response to a particular process (Thuiller et al. ). For example, SDMs might inform variable selection for the growth, survival, and fecundity models in Integral Projection Models (Easterling et al. ). However, highly nonlinear relationships would not be desirable for vital-rate models, due to the unlikely transitions through the life history that they might imply (cf. Merow et al. ). It is particularly important to avoid confounding missing processes with complex environmental responses (as might occur in complex models) when the mechanistic model explicitly describes the mechanisms that produce that aggregation (e.g. dispersal or species interactions: Kissling et al. ). The challenge in using SDMs in this way lies in ensuring that response curves truly reflect environmental limitations; while environmental tolerance may limit a species' distribution at one end of a gradient, other (e.g. biotic) factors may limit it at the other end (Zimmermann et al. ). Many issues of response-curve complexity that we discuss are also relevant for process-based SDMs. Representations of processes are incorporated into SDMs to improve precision and accuracy, or to improve our understanding of ecological processes.
Consequently, process-based models are used more for prediction and hypothesis testing than for description and hypothesis generation. Yet preferences for different model complexity persist (Evans et al. , Lonergan et al. ). Study objectives influence the choice of complexity, i.e. whether the model is intended for extrapolation or for understanding the potential importance of mechanisms. When the aim is understanding, simple models can make the study of the role of a mechanism more analytically tractable, whereas more complex models allow the roles of specific mechanisms to be understood in relation to other interconnected mechanisms. When the objective is prediction, complex models are valuable for representing all known relevant mechanisms in order to obtain the 'best guess', while simpler models are valuable when analyses imply that only certain key mechanisms are needed for sufficient predictive accuracy (further discussion in Evans et al. ). Attributes of the available data may be less important with process-based models when relevant test datasets are well understood. However, data considerations are important when mechanisms or parameters are inferred from data, or when assessing the spatiotemporal resolution over which particular degrees of abstraction and parameter values are relevant (Evans et al. , Lonergan , Snell et al. ). In any case, we expect that progress towards improved process-based models lies in challenging occurrence-based SDMs with stronger biological justifications and interpretations that aim to shed light on the mechanisms that drive process-based models.

Acknowledgements
This study arose from two workshops entitled 'Advancing concepts and models of species range dynamics: understanding and disentangling processes across scales'. Funding was provided by the Danish Council for Independent Research | Natural Sciences (grant no. 10-085056 to SN).
CM acknowledges funding from NSF grant 1046328 and NSF grant 1137366. WT acknowledges support from the European Research Council under the European Community's Seventh Framework Programme FP7/2007–2013 Grant Agreement no. 281422 (TEEMBIO). RW acknowledges support from the Swiss National Science Foundation (Synergia Project CRS113-125240, Early Postdoc Mobility Grant PBZHP3_147226). JE acknowledges funding from the Australian Research Council (grant FT0991640). TE notes that any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.
Ecography – Wiley
Published: Dec 1, 2014