Invasive species distribution modeling (iSDM): Are absence data and dispersal constraints needed to predict actual distributions?

Václavík, T. and R. K. Meentemeyer (2009). “Invasive species distribution modeling (iSDM): Are absence data and dispersal constraints needed to predict actual distributions?” Ecological Modelling 220(23): 3248-3258.

http://www.sciencedirect.com/science/article/pii/S0304380009005742

Václavík and Meentemeyer focus on the specific problem of modeling the distribution of an invasive species that is still spreading across a landscape. iSDM forces the modeler to confront the likely common problem of dispersal limitation, because there will necessarily be locations across the landscape that are environmentally suitable but currently inaccessible to the species. This paper examines how adding a measure of dispersal to iSDMs affects model performance, alongside an attempt to determine the differences in performance among models trained on presence-absence, presence-pseudoabsence, and presence-only data. The authors perform this analysis using Phytophthora ramorum, an invasive generalist pathogen responsible for sudden oak death. 890 field plots were exhaustively sampled for evidence of the pathogen in the summers of 2003, 2004, and 2005, providing a reliable presence-absence data set. A set of pseudoabsences (randomly chosen points that could potentially be inhabited by the pathogen but were not sampled), equal in number to the verified absences, was also generated. Eight environmental variables were used to fit the models, including the spatial distribution of the pathogen's key infectious host. All models were trained either exclusively on these environmental variables or on these variables plus a measurement of the "force of invasion" at a given point. Force of invasion was modeled using the following equation:

$\mathrm{FOI}_i = \sum_{k=1}^{n} e^{-d_{ik}/a}$

where d_ik is the Euclidean distance between each potential source of invasion k and the target plot i. The parameter a determines the shape of the dispersal kernel, with low values of a indicating high dispersal limitation, and can only be estimated from presence-absence data. For models trained without true absence data, a simpler force-of-invasion metric based exclusively on the above-mentioned Euclidean distances is used. Two models using presence-only data (ENFA and MaxEnt) and two using presence-absence or presence-pseudoabsence data (GLM and CT) were used. GLM and CT trained on presence-absence data with dispersal constraints were the highest-performing models. The inclusion of dispersal constraints significantly increased the performance of most models. Without dispersal constraints, presence-only models outperformed the other types of models (though this pattern was clearly driven by the good performance of MaxEnt). Presence-only models generally predicted larger areas of invasion than both presence-absence and presence-pseudoabsence models, but all models showed a clear reduction in predicted area when dispersal constraints were included. This paper clearly illustrates the importance of including probability of dispersal in SDMs for species in the process of invading a landscape. The estimates of force of invasion seem as if they would suffer substantially from any bias in the sampling of presence points, though the authors may have been able to account for this in their sampling strategy. It would be interesting and useful to determine how these concepts could be applied to non-invasive species that nonetheless have dispersal restrictions preventing them from accessing some favorable areas of a landscape, allowing us to generally relax the assumption that distributions are at equilibrium across the landscape. Such applications would likely require a more complex estimation of the force of invasion: in these closer-to-equilibrium cases there are likely landscape features that significantly slow dispersal across certain areas and thereby create the observed pattern, so we cannot assume an even rate of dispersal over time and space.
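
A minimal sketch of how such a dispersal predictor can be computed, assuming the negative-exponential kernel reconstructed above (the coordinates, the set of invaded sources, and the value of a below are all hypothetical):

import numpy as np

def force_of_invasion(targets, sources, a):
    # FOI_i = sum over sources k of exp(-d_ik / a); a small a means strong
    # dispersal limitation because the kernel decays quickly with distance.
    d = np.linalg.norm(targets[:, None, :] - sources[None, :, :], axis=2)
    return np.exp(-d / a).sum(axis=1)

# Hypothetical plot coordinates in meters; the resulting FOI values would be
# appended to the eight environmental predictors before model fitting.
rng = np.random.default_rng(0)
targets = rng.uniform(0, 10_000, size=(890, 2))   # the 890 field plots
sources = targets[rng.random(890) < 0.2]          # plots where the pathogen was detected
foi = force_of_invasion(targets, sources, a=500.0)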


[Figure from Václavík & Meentemeyer (2009)]

The influence of spatial errors in species occurrence data used in distribution models

Graham, Catherine H., et al. “The influence of spatial errors in species occurrence data used in distribution models.” Journal of Applied Ecology 45.1 (2008): 239-247.

http://onlinelibrary.wiley.com/doi/10.1111/j.1365-2664.2007.01408.x/full


Graham et al. set out to determine what effect spatial error in species presence data can have on a spectrum of species distribution models. Error in species location data can be produced by mistakes in recording or copying of information, and by broad or imprecise locality information that is difficult to georeference accurately. Although some of these erroneous points can be identified and removed in data cleaning, this decreases the sample size of training points and in turn the potential accuracy of predictive models. The authors used a data set consisting of four geographic regions, each with extremely accurate presence/absence data for 10 species. All models were trained using a subset of this data including exclusively presence points (to simulate the typical lack of reliable absence data in museum collections and the like), as well as a version of this data set manipulated such that the x and y coordinates of each presence point were shifted in a random direction by an amount sampled from a Normal distribution with a mean of 0 and a standard deviation of 5 km. Area under the receiver operating characteristic curve (AUC) was used as the measure of fit of each model, tested against a held-out presence-absence data set. Models were compared using ranked AUC (i.e. for a specific species and treatment the model with the highest AUC was given rank 1, etc.) because direct comparison of differences in AUC can be a questionable metric. The models tested fell into a few distinct categories: presence-only models (BIOCLIM, DOMAIN, LIVES), regression-based presence-pseudoabsence models (generalized additive models, generalized linear models, multivariate adaptive regression splines), and relatively new machine-learning approaches (Maximum Entropy and boosted regression trees), along with the genetic-algorithm-based GARP. In general, model performance across all regions was lower when trained on the error-manipulated data than when trained on the accurate data. There were, however, a number of instances when a model trained on the error-added data performed better than its non-error counterpart. The smallest effect of error on performance occurred in the Australian Wet Tropics, where AUC values were relatively low in general and often close to random, meaning that not much decrease in performance could be expected. The performance of all presence-only models, along with GARP and BRT, declined significantly with the addition of error. Nonetheless, BRT was consistently the best-performing model on both data sets (though it was not significantly different from MaxEnt on the error-added data). BIOCLIM and LIVES were consistently the lowest-performing models. Presence-only techniques likely suffered the most from added error because they did not have the benefit of the randomly sampled background points with which to weight their models. The authors recognize that this is a useful but relatively limited study, with only one spatial data degradation treatment, and suggest a number of potential avenues for advancing this research. Beyond simply increasing the number of different treatments, they highlight the need to study the effects of error in the environmental variables used in models and potential methods of mitigating such error. Although certainly in need of extension and more systematic clarification, this study provides some comfort that, even in the face of inaccurate spatial data, many of our preferred modeling methods will decline only slightly in performance.
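
The degradation treatment itself is easy to reproduce. A minimal sketch, assuming a toy landscape and a single logistic model in place of the paper's real data and full suite of algorithms:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

def env(xy):
    # Stand-in environmental surface: two smooth gradients over the landscape.
    return np.column_stack([np.sin(xy[:, 0] / 2e4), np.cos(xy[:, 1] / 2e4)])

def add_spatial_error(xy, sd=5_000):
    # The paper's treatment: shift each record by Normal(0, 5 km) in x and y.
    return xy + rng.normal(0.0, sd, size=xy.shape)

xy = rng.uniform(0, 1e5, size=(1_000, 2))   # simulated record coordinates (m)
e = env(xy)
y = rng.binomial(1, 1 / (1 + np.exp(-(3 * e[:, 0] - 2 * e[:, 1]))))  # true occupancy

train, test = np.arange(600), np.arange(600, 1_000)
for label, coords in [("accurate", xy[train]), ("error-added", add_spatial_error(xy[train]))]:
    model = LogisticRegression().fit(env(coords), y[train])
    auc = roc_auc_score(y[test], model.predict_proba(env(xy[test]))[:, 1])
    print(label, round(auc, 3))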

[Figure from Graham et al. (2008)]

Classification in conservation biology: A comparison of five machine-learning methods

Kampichler, C., Wieland, R., Calmé, S., Weissenberger, H., & Arriaga-Weiss, S. (2010). Classification in conservation biology: A comparison of five machine-learning methods. Ecological Informatics, 5(6), 441–450. http://doi.org/10.1016/j.ecoinf.2010.06.003



Machine learning methods have recently been adopted by ecologists for classification tasks (e.g. bioindicator identification, species distribution modeling, vegetation mapping), and there is an increasing amount of literature comparing the strengths and weaknesses of different machine learning techniques across a variety of applications. Kampichler et al. add to this base of knowledge by comparing five machine learning techniques against the more conventional discriminant function analysis, applied to an analysis of the loss of abundance and distribution of the ocellated turkey (Meleagris ocellata) in the Yucatán Peninsula. They used data on turkey flock abundance (including absences) from the study area and 44 explanatory variables, including prior turkey abundance in local and regional cells, vegetation and land use types, and socio-demographic variables.

The techniques investigated were:
– Classification trees (CT): uses a binary branching tree to describe the relationship between the explanatory variables and the response
– Random forests (RF): constructs many trees on bootstrapped samples of the data and aggregates (bags) their predictions
– Back-propagation neural networks (BPNN): creates a network whose connection weights are fitted to the training data by back-propagation of errors
– Automatically induced fuzzy rule-based models (FRBM): processes variables with rule-based algorithms built on fuzzy logic
– Support vector machines (SVM): maps the training data into a higher-dimensional feature space via a kernel function and fits a hyperplane that maximizes the separation between the classes
– Discriminant analysis (DA): combines the explanatory variables linearly in an effort to "maximize the ratio between the separation of class means and within-class variance"

They compared the techniques based on their ability to correctly classify training and test data, using the normalized mutual information criterion, which is computed from the confusion matrices and measures the similarity between predictions and observations on a scale from 0 (random) to 1 (complete correspondence). In general, RF and CT performed the best; however, the authors ranked CT first because of its high interpretability. An interesting point brought up is that, in spite of the recent influx of machine learning in the scientific literature, most conservation decisions do not consider its results, most likely because of the models' lack of interpretability and the expertise needed to optimize them. With this in mind, SVM, which performs relatively well, may not be the appropriate choice for conservation management because it is not well understood by ecologists lacking the proper mathematical training.
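
For reference, a normalized mutual information score can be computed directly from observed and predicted classes; a minimal sketch with made-up labels (scikit-learn's normalization may differ in detail from the criterion used in the paper):

from sklearn.metrics import confusion_matrix, normalized_mutual_info_score

# Hypothetical observed and predicted presence/absence classes for test cells.
observed  = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
predicted = [1, 0, 1, 0, 0, 0, 1, 1, 1, 1]

print(confusion_matrix(observed, predicted))              # the underlying confusion matrix
print(normalized_mutual_info_score(observed, predicted))  # 0 = random, 1 = complete correspondence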

[Figures from Kampichler et al. (2010)]

Insights into the area under the receiver operating characteristic curve (AUC) as a discrimination measure in species distribution modeling

Jiménez-Valverde, A., 2012. Insights into the area under the receiver operating characteristic curve (AUC) as a discrimination measure in species distribution modelling. Global Ecology and Biogeography, 21(4), pp. 498–507. http://doi.wiley.com/10.1111/j.1466-8238.2011.00683.x

The AUC has been popularized as an all-purpose statistic for assessing the predictive accuracy of species distribution models. Most studies rationalize using the AUC value to rank models by claiming that it avoids setting arbitrary thresholds for predictive decisions. Here, this claim is examined through the relationship between the AUC and sensitivity/specificity when modeling realized and potential niches. By definition, the AUC should not depend on any particular point on the ROC curve, but in both simulated and real data there was a strong relationship between AUC values and certain points on the curve (the point closest to perfect discrimination and the point where specificity equals sensitivity). In different settings (i.e. studying the realized vs. the potential niche), this dependence could be problematic because the weighting of errors should not be the same in each circumstance. For instance, type I errors (false positives) should count less in modeling potential distributions than they do in modeling realized distributions. Thus, the author suggests that instead of reporting AUC values alone, reporting contingency tables at varying thresholds of sensitivity and specificity may give us more insight into the predictive performance of SDMs. Overall, I agree with the author that researchers evaluating model performance need to be aware of the problems associated with AUC values, but I am unsure what systematic approach would be appropriate for reporting contingency tables with thresholded values of sensitivity and specificity.
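
The author's alternative is easy to put into practice; a minimal sketch with simulated scores, reporting the contingency table, sensitivity, and specificity at several thresholds alongside the single AUC value:

import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(1)
y_true = rng.binomial(1, 0.4, size=500)                            # presences/absences
y_score = np.clip(0.3 * y_true + rng.uniform(0, 0.7, 500), 0, 1)   # made-up model output

print("AUC:", round(roc_auc_score(y_true, y_score), 3))
for t in (0.3, 0.5, 0.7):                                          # report more than one threshold
    tn, fp, fn, tp = confusion_matrix(y_true, (y_score >= t).astype(int)).ravel()
    sens, spec = tp / (tp + fn), tn / (tn + fp)
    print(f"t={t}: sensitivity={sens:.2f} specificity={spec:.2f} [[{tn} {fp}] [{fn} {tp}]]")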

Discrimination capacity in species distribution models depends on the representativeness of the environmental domain

Jiménez‐Valverde, Alberto, et al. “Discrimination capacity in species distribution models depends on the representativeness of the environmental domain.” Global Ecology and Biogeography 22.4 (2013): 508-516. DOI: 10.1111/geb.12007

Discrimination capacity, or the effectiveness of the classifier as was discussed in class, is usually the only characteristic assessed in evaluating the performance of predictive models. In SDM, AUC is widely adopted as a measure of discrimination capacity, and what matters for AUC is the ranking of the output values, not their absolute differences. However, calibration, or how well the estimated probability of presence represents the observed proportion of presences, is another aspect of model performance worth evaluating.

Jiménez-Valverde et al. thus examined how changes in the distribution of the predicted probabilities of occurrence make discrimination capacity a context-dependent characteristic. Through simulation, they showed that a well-calibrated model, in which the estimated probabilities match the observed frequencies of presence, need not attain a high AUC: their well-calibrated example reached an AUC of only 0.83. This confirms that discrimination depends on the distribution of the probabilities. Figure 2 shows some extreme cases demonstrating trade-offs between discrimination capacity and calibration reliability. When a model is well calibrated, the dots should line up along the solid line.

[Figure 2 from Jiménez-Valverde et al. (2013)]

This paper not only explains well the difference between discrimination and calibration and why increasing one compromises the other, it also points out two implications for the field of SDM: first, it explains the devilish effect of geographic extent, which is the reason for the negative relationship between relative occurrence area and discrimination capacity; second, discrimination should not be used to compare different modeling techniques beyond the same data population, nor to generalize conclusions beyond that population. It is worth being aware of these limitations and conditions when evaluating our own models. One practical step is to not report AUC alone, but to accompany it with information about the distribution of the predicted scores and, if possible, model calibration plots.
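
The recommended calibration plot is a reliability diagram; a minimal sketch with simulated data, which also illustrates the point above: a perfectly calibrated model whose probabilities are spread uniformly attains an AUC of only about 0.83:

import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
p_pred = rng.uniform(0, 1, 2_000)   # model's predicted probabilities of presence
y = rng.binomial(1, p_pred)         # outcomes drawn from those probabilities -> perfect calibration

# Observed proportion of presences per bin vs. mean predicted probability:
# a well-calibrated model lies on the 1:1 line, yet its AUC is far from 1.
prob_true, prob_pred = calibration_curve(y, p_pred, n_bins=10)
for o, p in zip(prob_true, prob_pred):
    print(f"predicted {p:.2f} -> observed {o:.2f}")
print("AUC:", round(roc_auc_score(y, p_pred), 3))   # ~5/6 = 0.83 for uniform probabilities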

AUC: a misleading measure of the performance of predictive distribution models

Lobo, J. M., Jiménez-Valverde, A., & Real, R. (2008). AUC: a misleading measure of the performance of predictive distribution models. Global Ecology and Biogeography, 17(2), 145–151. http://doi.org/10.1111/j.1466-8238.2007.00358.x



With the increase in the use of predictive distribution models, especially with regard to species niche modeling, many are turning to the area under the receiver operating characteristic curve (AUC) to assess the predictive accuracy of those models. Lobo et al. raise five main issues with the use of AUC in this manner. According to the authors, AUC…

1) is insensitive to rank-preserving transformations of the predicted probabilities, meaning that a model may be well fitted yet discriminate poorly, and vice versa
2) summarizes performance over regions of extreme false-positive and false-negative rates in which researchers are rarely interested, leading the authors to suggest partial AUC instead
3) weights omission and commission errors equally; with presence-absence data, false absences are more likely than false presences, so the two error types should not count the same
4) does not describe the spatial distribution of errors, which would allow researchers to examine whether errors are spatially heterogeneous
5) does not assess accuracy well when the environmental extent of the analysis is much larger than the geographical extent of the presence data, as is the case for most SDM predictions

Additionally, AUC is often used to determine a 'threshold' probability for converting an SDM into a binary map, in spite of the fact that a claimed 'benefit' of AUC is that it is independent of any chosen threshold and its corresponding subjectivity. The only use the authors encourage is distinguishing species whose distributions are more general (low AUC) from those that are restricted (high AUC). To combat the failings of AUC, Lobo et al. suggest that sensitivity and specificity also be reported and that AUC only be used to compare models of the same species over an identical extent. I think another important point to include would be data quality. A cause of several of these problems is the bias in absence data for species distributions, and extra effort to combat this bias and ensure more complete presence-absence data sets would reduce the bias that AUC introduces.
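
The partial AUC suggested in point 2 is available off the shelf; a minimal sketch with made-up data (scikit-learn standardizes the partial area onto [0.5, 1]):

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
y = rng.binomial(1, 0.3, 1_000)                                     # presences/absences
score = np.clip(0.5 * y + rng.normal(0.5, 0.25, size=1_000), 0, 1)  # made-up model output

print("full AUC:   ", round(roc_auc_score(y, score), 3))
# Partial AUC restricted to low false-positive rates, the region of the
# ROC curve practitioners usually care about.
print("partial AUC:", round(roc_auc_score(y, score, max_fpr=0.2), 3))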

What do we gain from simplicity versus complexity in species distribution models?

Merow, Cory, et al. “What do we gain from simplicity versus complexity in species distribution models?.” Ecography 37.12 (2014): 1267-1281.

A variety of methods can be used to generate species distribution models (SDMs), such as generalized linear/regression models, tree-based models, maximum entropy, etc. Building models with an appropriate level of complexity is critical for robust inference. An "under-fit" model risks misunderstanding the factors that shape species distributions, whereas an "over-fit" model risks inadvertently ascribing pattern to noise and building opaque models. However, it is usually difficult to compare models from different SDM modeling approaches. Focusing on static, correlative SDMs, Merow et al. defined complexity for SDMs as the shape of the inferred occurrence-environment relationships and the number of parameters used to describe them. They developed general guidelines for deciding on an appropriate level of complexity, with recommendations for choosing levels of complexity under different circumstances.

Two attributes determine the complexity of the inferred occurrence-environment relationships in SDMs: the underlying statistical method (from simple to complex: BIOCLIM, GLM, GAM, and decision trees) and the modeling decisions made about inputs and settings. As for modeling decisions, larger numbers of predictors are often used in machine-learning methods than in traditional GLMs. Incorporating model ensembles and predictor interactions also increases model complexity.
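
A minimal sketch of the simplicity-complexity trade-off on simulated data, assuming a GLM as the "simple" end and boosted trees as the "complex" end (not the authors' own experiments):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(5)
X = rng.normal(size=(2_000, 4))                     # four environmental predictors
# True occurrence-environment relationship: curved, with one interaction.
logit = 1.5 * X[:, 0] - X[:, 1] ** 2 + X[:, 2] * X[:, 3]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
simple = LogisticRegression().fit(X_tr, y_tr)            # linear terms only: may under-fit
complex_ = GradientBoostingClassifier().fit(X_tr, y_tr)  # learns curvature and interactions
for name, m in [("GLM", simple), ("BRT", complex_)]:
    print(name, round(roc_auc_score(y_te, m.predict_proba(X_te)[:, 1]), 3))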


[Figure 1 from Merow et al. (2014)]

Figure 1 summarizes their findings in terms of general considerations and philosophical differences underlying modeling strategies. They suggested that before deciding on a modeling approach, researchers should experiment with both simple and complex modeling strategies and carefully weigh their study objectives (niche description or range mapping? hypothesis testing or generation? interpolation or extrapolation?) and data attributes (sample size, sampling bias, proximal versus distal predictors, spatial resolution and scale, and spatial autocorrelation). Generally speaking, complex models work better when the objective is prediction, and simpler models are valuable when analyses imply that only certain variables are needed for sufficient accuracy. Finally, they concluded that combining insights from both simple and complex SDM approaches will advance our knowledge of current and future species ranges.

Relating this paper to the Breiman paper, Merow et al. regarded data models as simpler than algorithmic models, which are usually semi- or fully non-parametric. But they also acknowledged that this conception is relative: interpreting complex models is not necessarily difficult, and complex models can still identify simple relationships. It is clear, though, that Merow et al. regarded interpretation as one of the goals of modeling. In their view there are no absolute situations in which simple or complex models violate the nature of science; their merits are case-dependent. I think the combining of simple and complex models, as Merow et al. suggested, is the trend in statistical modeling, and Breiman may have over-emphasized the distinction between the two cultures of the modeling world.

Looking Forward by Looking Back: Using Historical Calibration to Improve Forecasts of Human Disease Vector Distributions

Sohanna, A. & Thomas, K., 2015. Looking Forward by Looking Back: Using Historical Calibration to Improve Forecasts of Human Disease Vector Distributions. Vector-Borne and Zoonotic Diseases, 15(3), pp.173–183. link

With rising concerns about how environmental change impacts disease vector distributions, many studies aim to predict future vector distributions under varying climate change scenarios using information available at the present time. Many types of species distribution models enable us to produce highly accurate present-day predictions of vector distributions. However, when trying to forecast or 'hindcast' species distributions, many models are never validated against independent data on past or separately observed distributions. This review focuses on (1) methods of validation for present-day spatial models, (2) how these models should be projected into the future, and (3) introducing the method of historical calibration for validation. The authors explain three methods of validating present-day spatial models and their limitations: the commonly used split-data approach (training and test data), independent dataset validation (geographically or temporally distinct data sets), and validation via occurrence of disease in reservoir species. Next, the authors review the use of GCMs to model future climates and their limitations, including ignoring biological processes and non-linearities, and assuming constant increments of environmental change without setting theoretical limits. Lastly, they suggest that historical calibration, a validation method rooted in macroecology, is more temporally transferable for projecting vector distributions and, when coupled with reliable ensemble models, could reduce current shortcomings in forecasting species distributions.
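
A minimal sketch of the contrast between the split-data approach and temporally independent validation, with invented records, years, and predictors (not the authors' data):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(11)
year = rng.integers(1990, 2015, size=3_000)          # invented sampling years
X = np.column_stack([rng.normal(size=3_000),
                     0.02 * (year - 1990) + rng.normal(size=3_000)])  # one drifting predictor
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] + X[:, 1]))))

# (1) Split-data approach: random train/test partition of the same records.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
auc_split = roc_auc_score(y_te, LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1])

# (2) Temporally independent validation: fit on early records, test on later ones.
early, late = year < 2005, year >= 2005
m = LogisticRegression().fit(X[early], y[early])
auc_temporal = roc_auc_score(y[late], m.predict_proba(X[late])[:, 1])
print(round(auc_split, 3), round(auc_temporal, 3))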


Modelling ecological niches with support vector machines

Drake, J. M., Randin, C., & Guisan, A. (2006). Modelling ecological niches with support vector machines. Journal of Applied Ecology, 43, 424-432.

Machine learning approaches get around common obstacles to species distribution modeling (autocorrelated data, presence-only data, etc.), but they are relatively recent additions to the toolkit. The authors promote and demonstrate the use of support vector machines (SVMs) to model ecological niches, using 106 alpine plant species as a case study.

Benefits of SVM approach

1. Not based on an assumed statistical distribution (no independence requirement)
2. SVMs are a one-class approach, simplifying the classification problem (presence-only)
3. Fewer tuning parameters, and deterministic results (the model will always converge to the same solution for a given dataset)
4. Not many observations needed (n = 40). Not sure how this compares to other methods though.
5. SVMs are cross-validated
6. Both an SVM and a niche are defined as a boundary in hyperspace, so using SVMs is on firm conceptual ground

The authors tested three different methods that used dimensionality reduction or variable removal. SVMs performed comparably to MaxEnt, ENFA, and other methods (they didn't examine all methods on their own data, but compared the accuracy they obtained with other published studies on different systems). SVMs without any feature reduction or variable transformations performed the best.
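
A minimal sketch of the one-class, presence-only formulation in scikit-learn (simulated predictors; the authors' implementation and tuning differ):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
# Environmental conditions at presence-only records (e.g., temperature, moisture).
presences = rng.normal(loc=[8.0, 0.6], scale=[1.5, 0.1], size=(40, 2))  # n = 40, as in the paper

scaler = StandardScaler().fit(presences)
svm = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(scaler.transform(presences))

# Classify new sites: +1 = inside the fitted niche boundary, -1 = outside.
sites = np.array([[8.2, 0.58], [12.0, 0.9]])
print(svm.predict(scaler.transform(sites)))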

Do they? How do they? WHY do they differ? On finding reasons for differing performances of species distribution models

Elith, J., & Graham, C. H. (2009). Do they? How do they? WHY do they differ? On finding reasons for differing performances of species distribution models. Ecography. http://doi.org/10.1111/j.1600-0587.2008.05505.x

With the expansion of SDM has come an increasing emphasis on machine-learning models; however, there are few resources available to help newcomers choose which model to use for which application, or end goal. As a first step toward such a guide, Elith & Graham used a simulated plant presence-absence data set and assessed the success of five algorithms at three common goals in SDM: 1) understanding the relationship between a species and its environment, 2) creating a map of habitat suitability, and 3) extrapolating to new environmental conditions. The five algorithms were a generalized linear model, boosted regression trees, random forests, MaxEnt, and GARP, the last two using presence-only data. They compared each algorithm's performance on each of the three applications using four different statistical measures. Their results are summed up in the table below, and I'm not going to rehash them here. An important conclusion they drew from their comparisons, however, is that researchers must understand the algorithm they are using and the ecological background of their system in order to choose the best model for their application. For example, GARP does not model categorical variables well, and presence-only models may not be well calibrated depending on the range of suitability. I found it interesting that, even though these algorithms still represent a 'black box', a user's understanding of their strengths and weaknesses will allow better interpretation of the somewhat subjective output when choosing a model of 'best fit' for a chosen goal.

[Summary table from Elith & Graham (2009)]
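
In the same spirit, the appeal of simulated data is that the truth is known; a minimal sketch (not the authors' simulation) checking whether a fitted model recovers a known species-environment relationship (goal 1):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(9)
X = rng.uniform(-2, 2, size=(3_000, 2))
p = 1 / (1 + np.exp(-(2 - 3 * X[:, 0] ** 2 + X[:, 1])))   # known unimodal truth
y = rng.binomial(1, p)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Crude partial-dependence curve for predictor 0: average predictions over predictor 1.
for x0 in np.linspace(-2, 2, 5):
    Xg = np.column_stack([np.full(500, x0), rng.uniform(-2, 2, 500)])
    fitted = rf.predict_proba(Xg)[:, 1].mean()
    reference = 1 / (1 + np.exp(-(2 - 3 * x0 ** 2)))      # truth at the predictor-1 midpoint
    print(f"x0={x0:+.1f}  fitted={fitted:.2f}  reference={reference:.2f}")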