Cross-validation of species distribution models: removing spatial sorting bias and calibration with a null model

Hijmans, R.J., 2012. Cross-validation of species distribution models: removing spatial sorting bias and calibration with a null model. Ecology.

Spatial sorting bias, the tendency for testing-presence points to lie closer to training-presence points than testing-absence points do, and the credibility of cross-validation for assessing model accuracy remain large issues for SDMs. Hijmans (2012) evaluates two ways of selecting testing-presence data and two ways of selecting testing-absence data in order to better understand how spatial sorting bias and cross-validation may lead to inflated confidence in SDMs. He found that a null model based solely on distance to the nearest presence point performed about as well (0.69) as Bioclim (0.64) and Maxent (0.73). This suggests that uncalibrated cross-validation results, as reported in most studies using SDMs, are difficult to interpret directly, and that calibrating with a null model could lead to more accurate assessments of model performance. This study calls into question many results from SDMs, especially those using data that are inherently clumpy (e.g. museum records). I think this is an especially open area for research, with questions such as: How can knowledge of a species' biology be used to pre-process (filter) species occurrence data before they are input into SDMs? Or how does clumpiness of species occurrence data affect the predictability of a species' range?
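To make the null-model idea concrete, here is a minimal sketch of a distance-only "model" and one possible calibration. The function names, the planar Euclidean distances, the simulated points, and the calibration formula (subtracting whatever the null model gains over 0.5) are all my assumptions for illustration, not code from the paper.

```python
import math
import random


def nearest_distance(point, reference):
    """Euclidean distance from one (x, y) point to its nearest reference point."""
    return min(math.dist(point, ref) for ref in reference)


def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


def null_model_auc(train_pres, test_pres, test_abs):
    """AUC of a null model that scores sites only by closeness to training presences."""
    pos = [-nearest_distance(p, train_pres) for p in test_pres]
    neg = [-nearest_distance(p, train_pres) for p in test_abs]
    return auc(pos, neg)


def calibrated_auc(model_auc, null_auc):
    """Assumed calibration: remove the part of the AUC the null model already explains."""
    return model_auc + 0.5 - max(0.5, null_auc)


# Simulated, clumpy presences and widely scattered absences (hypothetical data).
rng = random.Random(0)
train_pres = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
test_pres = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)]
test_abs = [(rng.uniform(-10, 10), rng.uniform(-10, 10)) for _ in range(30)]

null_auc = null_model_auc(train_pres, test_pres, test_abs)
print("null-model AUC:", round(null_auc, 3))
```

With data this clumpy, the distance-only score separates testing presences from testing absences well without using any environmental information, which is the reason a raw cross-validation AUC can be misleading.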

The crucial role of the accessible area in ecological niche modeling and species distribution modeling

Barve, Narayani, et al. “The crucial role of the accessible area in ecological niche modeling and species distribution modeling.” Ecological Modelling 222.11 (2011): 1810-1819.

doi:10.1016/j.ecolmodel.2011.02.011

Conceptual biases remain little explored in broad-scale ecological niche modeling and species distribution modeling. Species can respond to their environment in diverse ways: ecological niches may evolve or remain conserved. According to the BAM diagram (Fig. 1), the region where a species can be found is the intersection of A (environmental factors whose values do not depend on the species' population dynamics), B (sets of variables that do depend on the species' population), and M (regions that are accessible to the species, independent of A). Region M depends on opportunities for and constraints on the movement of species and is often not included in modeling efforts. Barve et al. examined the conceptual and empirical reasons behind the choice of study area extent and presented three approaches for estimating M: 1. Biotic regions: regions within which a species is known to occur; 2. Niche-model-based regions: the historical distributions of species reconstructed from models based on their current ecological niche characteristics; and 3. Full dynamic dispersal models, which explicitly take into consideration the spatially path-dependent nature of the effects of environmental change. They asserted that the area accessible over relevant time periods is the most appropriate for model development, testing, and comparison. Although Barve et al. emphasized estimating the set of areas from which species are sampled for niche modeling, this idea also has implications for biogeography, macrogeography, and phylogeography.
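The BAM intersection can be written as plain set logic. The sketch below uses made-up grid cells and a hypothetical helper, background_cells, to show the practical consequence: background or pseudo-absence candidates would be drawn only from M.

```python
# Hypothetical grid cells, labelled by (row, col).
abiotic_ok = {(0, 0), (0, 1), (1, 0), (1, 1), (2, 2)}   # A: abiotically suitable
biotic_ok = {(0, 1), (1, 0), (1, 1), (2, 2), (3, 3)}    # B: biotically suitable
accessible = {(0, 0), (0, 1), (1, 1), (3, 3)}           # M: reachable by the species

# Where the species is expected to occur: the intersection of A, B, and M.
occupied = abiotic_ok & biotic_ok & accessible

# Suitable cells the species cannot reach; treating these as absences
# would wrongly suggest that the environment there is unsuitable.
suitable_but_unreachable = (abiotic_ok & biotic_ok) - accessible


def background_cells(all_cells, accessible_area):
    """Restrict background (pseudo-absence) candidates to the accessible area M."""
    return sorted(cell for cell in all_cells if cell in accessible_area)


all_cells = [(r, c) for r in range(4) for c in range(4)]
print(sorted(occupied))
print(sorted(suitable_but_unreachable))
print(background_cells(all_cells, accessible))
```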

Data prevalence matters when assessing species’ responses using data-driven species distribution models

Fukuda, S. & De Baets, B., 2016. Data prevalence matters when assessing species’ responses using data-driven species distribution models. Ecological Informatics, 32, pp.69–78.

The accuracy of SDMs is highly dependent on the quality and quantity of the data used: both data set size (i.e. the number of data points in a data set) and data prevalence (i.e. the proportion of presences in a data set) matter. Fukuda and De Baets (2016) investigated this by simulating nineteen sets of virtual species data from real habitat conditions (using field observations) and hypothetical habitat suitability curves under four conditions. They then built SDMs to assess the effects of data prevalence on model accuracy and habitat information. The three SDMs they tested were the Fuzzy Habitat Suitability Model (FHSM), Random Forests (RF), and Support Vector Machines (SVMs). The effects of data prevalence were evaluated based on model accuracy (AUC and MSE) and habitat information such as species response curves. Data prevalence affected both model accuracy and the assessment of species’ responses, with a stronger influence on species response curves. The effects of data prevalence on model accuracy were less pronounced for RF and SVMs. Data prevalence also affected the shapes of the response curves: curves obtained from a data set with higher prevalence were less dependent on unsuitable habitat conditions, which emphasizes the importance of accounting for data prevalence when assessing species–environment relationships. Taken together, these results show that data prevalence should be controlled for when building SDMs.
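As a small illustration of what "controlling for prevalence" could look like in practice, the sketch below computes prevalence and subsamples a data set to a chosen prevalence. The function names, record layout, and example numbers are hypothetical; the paper's own simulation design is more involved.

```python
import random


def prevalence(labels):
    """Proportion of presences (1s) among presence/absence labels."""
    return sum(labels) / len(labels)


def resample_to_prevalence(records, target, n, seed=0):
    """Draw n records without replacement so that presences make up `target` of them.

    records: list of (features, label) tuples with label 1 = presence, 0 = absence.
    """
    rng = random.Random(seed)
    presences = [r for r in records if r[1] == 1]
    absences = [r for r in records if r[1] == 0]
    n_pres = round(n * target)
    n_abs = n - n_pres
    if n_pres > len(presences) or n_abs > len(absences):
        raise ValueError("not enough records for the requested size and prevalence")
    sample = rng.sample(presences, n_pres) + rng.sample(absences, n_abs)
    rng.shuffle(sample)
    return sample


# Hypothetical survey: 30 presences and 170 absences (prevalence 0.15).
records = [((i,), 1) for i in range(30)] + [((i,), 0) for i in range(170)]
balanced = resample_to_prevalence(records, target=0.5, n=40)

print(prevalence([label for _, label in records]))
print(prevalence([label for _, label in balanced]))
```

Subsampling throws data away, so it trades sample size against prevalence; weighting records is an alternative that I have not sketched here.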


Generating realistic assemblages with a joint species distribution model

Harris, D. J. (2015), Generating realistic assemblages with a joint species distribution model. Methods in Ecology and Evolution, 6: 465–473. doi: 10.1111/2041-210X.12332


The last article I reported on examined stacked species distribution models (SDMs) to predict species richness across a landscape. This paper extends the idea of using SDMs for studies at the community level by incorporating information that stacked SDMs ignore (i.e., data on species co-occurrences). One method that incorporates data on species co-occurrences is joint species distribution modeling (JSDM). Here, the author extends this approach using a stochastic neural network (which he refers to as mistnet). This approach is compared to two common approaches: first, a stacked SDM made up of boosted regression models trained for each species, and second, a deterministic neural network. All approaches used breeding bird survey data. These data were split into training and test sets, with 280 routes in the test set and 1559 routes in the training set, separated by a 150 km buffer (see Figure 2 from the paper). The deterministic neural net performed comparably to mistnet in predicting species occurrence probabilities, but mistnet outperformed the deterministic neural net when predicting community composition at a given site. The traditional joint SDM did not perform well in either task. The article doesn’t go into the tuning of mistnet (e.g., number of hidden layers), but it looks really cool, and all the code is available on GitHub.
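The buffered train/test split is worth sketching, since it is what keeps training routes from sitting right next to test routes. This is my own minimal version, assuming planar coordinates in kilometres and a pre-chosen set of test routes; it is not the author's code, and the route names and positions are made up.

```python
import math


def buffered_split(sites, test_ids, buffer_km):
    """Split sites into train/test, dropping non-test sites inside the buffer.

    sites: dict mapping site id -> (x_km, y_km) in a planar projection (assumed).
    Returns (train_ids, test_ids_found, dropped_ids).
    """
    test = [i for i in sites if i in test_ids]
    train, dropped = [], []
    for site_id, xy in sites.items():
        if site_id in test_ids:
            continue
        nearest_test = min(math.dist(xy, sites[t]) for t in test)
        if nearest_test >= buffer_km:
            train.append(site_id)
        else:
            dropped.append(site_id)
    return train, test, dropped


# Hypothetical routes: "a" is a test route, the others are candidates for training.
routes = {"a": (0, 0), "b": (100, 0), "c": (200, 0), "d": (0, 150)}
train, test, dropped = buffered_split(routes, test_ids={"a"}, buffer_km=150)
print(train, test, dropped)
```

Route "b" is discarded because it lies within 150 km of the test route, so it can contribute to neither set.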

A probabilistic approach to niche-based community models for spatial forecasts of assemblage properties and their uncertainties

Pellissier, Loïc, et al. “A probabilistic approach to niche‐based community models for spatial forecasts of assemblage properties and their uncertainties.” Journal of Biogeography 40.10 (2013): 1939-1946.


Species distribution models (SDMs) are typically developed for a single species, because most of the time the goal is to predict habitat suitability for the occurrence of a single species. However, could data on the presences of other species hold more information about latent environmental traits, or about the probability of species occurrence? Probably. These authors investigated an approach to quantifying uncertainty in predictions of community properties from stacked species distribution models. Stacked species distribution models are simply a set of independently trained species distribution models that are then laid on top of one another to predict community composition or species richness across a landscape. They don’t incorporate co-occurrence data directly, which is a flaw in my opinion, and this is recognized and has been tackled in other papers. To assess the ability of stacked SDMs to predict species richness, the authors compared a hard threshold approach (the output of each SDM was converted into binary presence-absence predictions, and the sum of the predicted presences formed the species richness in a given cell) and a probabilistic approach (each SDM predicted a probability, and these probabilities were compared against 10,000 draws from a binomial distribution). The latter approach resulted in a stronger correlation between expected and observed species richness values. Further, the authors argue that this approach gets at uncertainty in model predictions, by using the variability across the 10,000 draws. This demonstrates the utility of considering community context in species distribution modeling. Methods directly incorporating information on co-occurring species will likely provide an even better view of the realized niche of species, or of community composition across a landscape.
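The two stacking approaches can be contrasted for a single grid cell in a few lines. This is a rough sketch under my own assumptions (a 0.5 threshold, independent Bernoulli draws per species, and invented occurrence probabilities), not the authors' implementation.

```python
import random
import statistics


def threshold_richness(probs, threshold=0.5):
    """Hard-threshold stacking: count species whose probability reaches the threshold."""
    return sum(p >= threshold for p in probs)


def simulate_richness(probs, n_draws=10_000, seed=0):
    """Probabilistic stacking: each draw flips one biased coin per species."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for p in probs) for _ in range(n_draws)]


# Hypothetical predicted occurrence probabilities for six species in one cell.
probs = [0.9, 0.6, 0.4, 0.4, 0.2, 0.1]

draws = simulate_richness(probs)
print("threshold richness:", threshold_richness(probs))
print("expected richness (sum of probabilities):", round(sum(probs), 2))
print("simulated mean:", round(statistics.mean(draws), 2))
print("simulated sd (uncertainty):", round(statistics.stdev(draws), 2))
```

The threshold approach discards the two species at 0.4 entirely, while the probabilistic approach lets them contribute to expected richness and gives a spread around it.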

Grassland species loss resulting from reduced niche dimension

Harpole, W. Stanley, and David Tilman. “Grassland species loss resulting from reduced niche dimension.” Nature 446.7137 (2007): 791-793.


This study aimed to test a hypothesis derived from niche theory called the ‘niche dimension hypothesis’. This hypothesis posits that the addition of co-limiting resources should reduce species diversity while also increasing productivity. To test this, the authors used data from a previous enrichment study, combined with a similar experiment, to get at the role of co-limiting nutrients in plant community dynamics in a grassland community. They varied the number of limiting resources they added (nitrogen, phosphorus, calcium, and water) in all possible pairs, finding that no one resource was strongly limiting, but many resources were co-limiting. They found that the number of resources added was negatively and non-linearly related to the number of species in the community, but positively related to above-ground biomass. This suggests that a small subset of species is able to dominate in high-resource environments, and is some of the motivating work behind the biodiversity-productivity navel-gazing fest that is currently taking place among ecosystem ecologists (see these papers).

I read this paper because I thought it was going to specifically discuss plant niches and dimensionality reduction. They use dimensionality to discuss the combined effects of the limiting nutrients on species diversity. They further argue for the possibility that competition isn’t the only factor in reducing species diversity, but that plants sensitive to nutrient additions could be exposed to abiotic conditions outside of their niche boundaries. They also discuss the effect of increased leaf litter, which is not a direct competitive interaction (like competition for light).

Model‐based uncertainty in species range prediction

Pearson, Richard G., Wilfried Thuiller, Miguel B. Araújo, Enrique Martinez‐Meyer, Lluís Brotons, Colin McClean, Lera Miles, Pedro Segurado, Terence P. Dawson, and David C. Lees.
Journal of Biogeography 33, no. 10 (2006): 1704-1711. doi:10.1111/j.1365-2699.2006.01460.x

This paper addresses the sources of uncertainty in assessments of the impacts of climate change on biodiversity. Pearson et al. used a variety of environmental niche modelling techniques (artificial neural network, climate envelope range, constrained Gower metric, classification tree analysis, genetic algorithm, generalized additive model, genetic algorithm for rule-set prediction, and generalized linear model) to evaluate the impact of model choice (the magnitude of variation) on predicted species distributions under current and projected climate change scenarios, and why model outputs may differ. They used data on four endemic plant species of Proteaceae found in South Africa, collected from 3996 sampled sites located within different 1′ × 1′ cells, and used identical input variables that are considered critical to plant physiology and survival. Model predictions were compared by testing agreement between observed and simulated distributions for the present day (using AUC and kappa statistics), and consistency in the prediction of range size changes under future climate was assessed using cluster analysis. Distribution was characterized by the number of grid cells occupied. Each technique was applied to a randomly selected 70% of sites, and the remaining 30% were used to test agreement between observed and modelled distributions. Under climate change scenarios, for all models except CER and GA, the suitability of each cell was calculated at decision thresholds increasing from 0 to 1, and cluster analysis was used to group the ranges predicted by different methods under current and future climate conditions. They found that variation between model predictions can be attributed to whether models use presence-only or presence-absence data (so realized vs. fundamental niche predictions), as the two groups performed differently. Another key factor that should be carefully considered for ENMs is model extrapolation assumptions. 
For example, extrapolating environmental variables beyond the training range under climate change yielded uncertainty in model predictions. As in class discussion this week, this paper presents models of endemic (and plant) species; it would be interesting to apply the same objective to a non-endemic vertebrate species and compare model predictions.
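Two of the quantities above are easy to show in code: the kappa statistic for agreement at one threshold, and range size (number of occupied cells) as the decision threshold increases from 0 to 1. The suitability values, observations, and function names below are invented for illustration.

```python
def cohens_kappa(observed, predicted):
    """Cohen's kappa for binary presence/absence vectors of equal length."""
    n = len(observed)
    tp = sum(o == 1 and p == 1 for o, p in zip(observed, predicted))
    tn = sum(o == 0 and p == 0 for o, p in zip(observed, predicted))
    fp = sum(o == 0 and p == 1 for o, p in zip(observed, predicted))
    fn = sum(o == 1 and p == 0 for o, p in zip(observed, predicted))
    p_obs = (tp + tn) / n
    p_exp = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    if p_exp == 1:
        return 0.0  # kappa is undefined here; returning 0.0 is my convention
    return (p_obs - p_exp) / (1 - p_exp)


def range_size(suitability, threshold):
    """Number of grid cells predicted occupied at a given decision threshold."""
    return sum(s >= threshold for s in suitability)


# Hypothetical test cells: observed presence/absence and modelled suitability.
observed = [1, 1, 1, 0, 0, 0, 0, 1]
suitability = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]

predicted = [int(s >= 0.5) for s in suitability]
kappa = cohens_kappa(observed, predicted)
sizes = [range_size(suitability, t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
print(kappa, sizes)
```

The range-size curve shows how strongly the predicted range depends on the threshold, which is one reason to compare models across thresholds rather than at a single cut-off.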

Support vector machines to map rare and endangered native plants in Pacific islands forests

Pouteau, Robin, et al. “Support vector machines to map rare and endangered native plants in Pacific islands forests.” Ecological Informatics 9 (2012): 37-46.
doi:10.1016/j.ecoinf.2012.03.003

Occurrence records are scarce for rare species, which results in small training samples for species distribution models. Support Vector Machines (SVMs) were traditionally used in remote sensing to classify object reflectance, a task that is substantially the same as the classification used in species distribution models. Since the decision made by an SVM is based solely on a few meaningful pixels, the method is well suited to predicting the distribution of species with scarce occurrence records. Pouteau et al. compared two machine-learning methods, random forest (RF) and SVM, to determine which is more relevant for mapping rare species and predicting potential habitat from their current observed range. The comparison was performed using three rare plants found on the island of Moorea. Biophysical variables included elevation, climate, geology, soil substrate, disturbance regime, floristic region, plant dispersal capacities, and ecological plant type and function. Their results showed that SVM consistently performed better than RF in distribution prediction in terms of the Kappa coefficient and the area under the curve (AUC). In this case, the predicted distribution generated by SVM was sufficiently accurate with only 13 training pixels. This was attributed to the ability of SVM to train a model from a few meaningful pixels with limited information and to resist noise from insignificant pixels. By comparing a species' potential habitat with its current observed range, we will be able to better understand the causes of the conservation status of the targeted species. So far, there are only limited applications of SVM in species distribution models. It would be interesting to repeat the application for other rare plants or animals.

The influence of spatial errors in species occurrence data used in distribution models

Graham, Catherine H., et al. “The influence of spatial errors in species occurrence data used in distribution models.” Journal of Applied Ecology 45.1 (2008): 239-247.

http://onlinelibrary.wiley.com/doi/10.1111/j.1365-2664.2007.01408.x/full

Graham et al. set out to determine what effect spatial error in species presence data can have on a spectrum of species distribution models. Error in species location data can be produced by mistakes in recording or copying information, and by broad or imprecise locality information that is difficult to georeference accurately. Although some of these erroneous points can be identified and removed during data cleaning, this decreases the sample size of training points and, in turn, the potential accuracy of predictive models. The authors used a data set consisting of 4 geographic regions, each with extremely accurate presence/absence data for 10 species. All models were trained using a subset of these data containing only presence points (to simulate the typical lack of reliable absence data in museum collections and the like), as well as a version of this data set manipulated such that the x and y coordinates of each presence point were shifted in a random direction by an amount sampled from a Normal distribution with a mean of 0 and a standard deviation of 5 km. Area under the receiver operating characteristic curve (AUC) was used as a measure of fit for each model, tested against a held-out presence-absence data set. Models were compared using ranked AUC (i.e. for a specific species and treatment, the model with the highest AUC was given rank 1, etc.) to account for the fact that direct comparisons of differences in AUC can be a questionable metric. The models tested fell into a few distinct categories: presence-only models (BIOCLIM, DOMAIN, LIVES), regression-based presence-pseudoabsence models (Generalized Additive Models, Generalized Linear Models, Multivariate Adaptive Regression Splines), and relatively new machine-learning-based approaches (Maximum Entropy and Boosted Regression Trees). In general, model performance across all regions was lower when trained on the error-manipulated data than when trained on the accurate data. 
There were, however, a number of instances in which a model trained on the error-added data performed better than its non-error counterpart. The smallest effect of error on performance occurred in the Australian Wet Tropics, where AUC values were relatively low in general and often close to random, meaning that not much decrease in performance could be expected. The predictions made by all presence-only models, along with GARP and BRT, declined significantly with the addition of error. Nonetheless, BRT was consistently the best-performing model across both data sets (though it was not significantly different from MaxEnt on the error-added data). BIOCLIM and LIVES were consistently the lowest-performing models. Presence-only techniques likely suffered the most from added error because they did not have the benefit of randomly sampled background points with which to weight their models. The authors recognize that this is a useful but relatively limited study, with only one spatial data degradation treatment, and suggest a number of potential avenues for advancing this research. Beyond simply increasing the number of different treatments, they highlight the need to study the effects of error in the environmental variables used in models and potential methods of mitigating the effects of such error. Although certainly in need of extension and more systematic clarification, this study provides some comfort that, even in the face of inaccurate spatial data, many of our preferred modeling methods will decrease only slightly in performance.
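The error treatment and the rank-based comparison are both simple to sketch. The version below assumes planar coordinates in kilometres and a uniformly random direction, and the AUC values in the example are made up purely to show the ranking step; none of this is the authors' code.

```python
import math
import random


def add_spatial_error(points_km, sd_km=5.0, seed=0):
    """Shift each (x, y) point in a random direction by a Normal(0, sd_km) distance."""
    rng = random.Random(seed)
    shifted = []
    for x, y in points_km:
        angle = rng.uniform(0, 2 * math.pi)
        dist = rng.gauss(0, sd_km)
        shifted.append((x + dist * math.cos(angle), y + dist * math.sin(angle)))
    return shifted


def rank_models(auc_by_model):
    """Rank models for one species and treatment: rank 1 = highest AUC.

    Ties are broken by input order, which is a simplification.
    """
    ordered = sorted(auc_by_model, key=auc_by_model.get, reverse=True)
    return {model: rank for rank, model in enumerate(ordered, start=1)}


presences = [(10.0, 20.0), (12.5, 18.0), (40.0, 41.0)]
print(add_spatial_error(presences))

# Invented AUC values, used only to demonstrate ranking.
example_auc = {"BIOCLIM": 0.62, "BRT": 0.80, "MaxEnt": 0.78}
print(rank_models(example_auc))
```

Comparing ranks rather than raw AUC differences sidesteps the question of whether a given AUC difference means the same thing for different species.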

Classification in conservation biology: A comparison of five machine-learning methods

Kampichler, C., Wieland, R., Calmé, S., Weissenberger, H., & Arriaga-Weiss, S. (2010). Classification in conservation biology: A comparison of five machine-learning methods. Ecological Informatics, 5(6), 441–450. http://doi.org/10.1016/j.ecoinf.2010.06.003


Machine learning methods have recently been adopted by ecologists for classification (e.g. bioindicator identification, species distribution models, vegetation mapping), and there is a growing literature comparing the strengths and weaknesses of different machine learning techniques across a variety of applications. Kampichler et al. add to this base of knowledge by comparing five machine learning techniques against the more conventional discriminant function analysis in an analysis of abundance and distribution loss of the ocellated turkey (Meleagris ocellata) in the Yucatan Peninsula. They used data on turkey flock abundance (including absences) from the study area and 44 explanatory variables, including prior turkey abundance in local and regional cells, vegetation and land use types, and socio-demographic variables.

The techniques investigated were:
– Classification trees (CT): use a binary branching tree to describe the relationships between the explanatory and response variables
– Random forests (RF): construct many trees from bootstrap samples of the data (bagging) and aggregate their predictions
– Back-propagation neural networks (BPNN): create a network whose connections are weighted using the training data
– Automatically induced fuzzy rule-based models (FRBM): process variables based on algorithms using fuzzy logic
– Support vector machines (SVM): map training data into a higher-dimensional space using a kernel function and find the hyperplane that maximizes separation between the classes
– Discriminant analysis (DA): combines the explanatory variables linearly in an effort to “maximize the ratio between the separation of class means and within-class variance”

They compared the techniques based on their ability to correctly classify training and test data, using the normalized mutual information criterion, which is based on the confusion matrices and measures the similarity between predictions and observations from 0 (random) to 1 (complete correspondence). In general, RF and CT performed the best; however, the authors ranked CT first because of its high interpretability. An interesting point brought up is that, in spite of the recent influx of machine learning in the scientific literature, most conservation decisions do not consider its results, most likely because of the models' lack of interpretability and the expertise needed to optimize them. With this in mind, SVMs, which perform relatively well, may not be the appropriate choice for conservation management because they are not well understood by ecologists lacking the proper mathematical training.
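For readers who have not met the criterion, here is a minimal sketch of a normalized mutual information score computed from a confusion matrix. I normalize by the entropy of the observed classes, which is one common choice and my assumption here; the paper may define the normalization differently.

```python
import math


def normalized_mutual_information(confusion):
    """Mutual information between observed and predicted classes, divided by
    the entropy of the observed classes.

    confusion[i][j] = number of cases with observed class i and predicted class j.
    Returns 0 for predictions unrelated to observations and 1 for perfect agreement.
    """
    total = sum(sum(row) for row in confusion)
    row_p = [sum(row) / total for row in confusion]        # observed class frequencies
    col_p = [sum(col) / total for col in zip(*confusion)]  # predicted class frequencies
    mutual = 0.0
    for i, row in enumerate(confusion):
        for j, count in enumerate(row):
            if count:
                p = count / total
                mutual += p * math.log(p / (row_p[i] * col_p[j]))
    entropy_obs = -sum(p * math.log(p) for p in row_p if p)
    return mutual / entropy_obs if entropy_obs else 0.0


print(normalized_mutual_information([[10, 0], [0, 10]]))  # perfect agreement
print(normalized_mutual_information([[5, 5], [5, 5]]))    # no information
print(normalized_mutual_information([[8, 2], [3, 7]]))    # somewhere in between
```

Unlike plain percent correct, this score does not reward a classifier for always predicting the most common class.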
