Mapping species distributions with MAXENT using a geographically biased sample of presence data: a performance assessment of methods for correcting sampling bias

Fourcade, Y., et al. (2014). “Mapping species distributions with MAXENT using a geographically biased sample of presence data: a performance assessment of methods for correcting sampling bias.” PLoS One 9(5): e97122.

 

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0097122

 

Fourcade et al. assess the effectiveness of several methods for correcting sampling bias in species distribution modeling with MaxEnt. Bias in species sampling is well established as an important and difficult problem in species distribution modeling. The authors take large, and likely spatially unbiased, presence-only sample sets for one virtual and two real species and impose four types of spatial bias on them to simulate sampling bias: (1) Two Areas, in which the northern region has high sample density and the southern region low density; (2) Gradient, a density gradient decreasing from north to south; (3) Center, in which density decreases gradually from the core of the distribution to the edges; and (4) Travel Time, in which the probability of keeping a record is highest where travel time to the nearest city is lowest. Five data-processing methods were used to limit the effect of these spatial biases: (a) Systematic Sampling, in which a grid of a defined cell size was superimposed on the distribution and one record was chosen per grid cell; (b) Bias File, in which MaxEnt is given a file representing sampling effort, which it uses to weight the sampled points; (c) Restricted Background, in which MaxEnt's background points were drawn exclusively from buffer areas around the biased occurrences; (d) Cluster, in which a PCA was performed on the environmental predictors, occurrence points were analyzed for clustering in the two-dimensional environmental PCA space, and one record was randomly sampled per cluster; and (e) Split, in which occurrences were split into a northern and a southern group and MaxEnt was applied independently to each area. These methods all seem reasonably well grounded individually, but the combination makes little sense. Most notably, it is entirely unclear why grid-based selection is used for spatial thinning and cluster analysis for environmental thinning.
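To make the Systematic Sampling idea (a) concrete, here is a minimal Python sketch of grid-based thinning that keeps one record per grid cell. It is only an illustration under assumptions: the function name, the toy coordinates, and the cell size are hypothetical and are not taken from the paper.

```python
import random

def systematic_sample(records, cell_size, seed=0):
    """Keep one randomly chosen record per square grid cell.

    records: list of (x, y) coordinates (hypothetical units, e.g. degrees).
    cell_size: side length of the grid cells, in the same units.
    """
    rng = random.Random(seed)
    cells = {}
    for x, y in records:
        # Index of the grid cell that contains this record.
        key = (int(x // cell_size), int(y // cell_size))
        cells.setdefault(key, []).append((x, y))
    # One record per occupied cell, visiting cells in a fixed order.
    return [rng.choice(cells[key]) for key in sorted(cells)]

# Toy data: a densely sampled northern cell and two isolated southern records.
records = [(0.1, 9.1), (0.2, 9.3), (0.4, 9.2), (0.3, 9.8), (5.5, 1.2), (8.7, 2.4)]
thinned = systematic_sample(records, cell_size=1.0)
print(len(records), "->", len(thinned))
```

The four clustered northern records collapse to a single record, while the isolated southern records are kept, which is the intended effect of spatial thinning.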
Models were compared using AUC and the overlap of predicted species probabilities in environmental (Denv) and geographic (Dgeo) space. Biased models invariably had lower AUCs than unbiased models and clearly deviated from the unbiased model by all measures. The decrease in AUC, however, was small, and the AUC of biased models was usually still in the range generally accepted as indicating a well-fit model. The effect size of each bias type depended on the species and evaluation method. The authors focused on Dgeo (Denv was strongly correlated with it) as the main measure of the effectiveness of bias correction. Overall, only 29% of all combinations (species * bias type * bias intensity * correction method) showed improvement over the biased model, with the simulated species substantially easier to correct (57% of corrections were successful). Restricted Background (c) failed in almost all cases (6% successful). All other methods performed better but were ranked differently depending on the combination of factors. Systematic Sampling (a), though not always ranked first, performed most consistently and slightly better overall than the competing methods (33% successful). The Bias File (b) and Cluster (d) methods sometimes outperformed Systematic Sampling but were slightly less successful overall (23% each). Probably the most important result of this work is the poor performance of the Restricted Background (c) method, as it seems to be consistently used and recommended when fitting MaxEnt to biased data. Though Systematic Sampling (a) performs well and consistently, the authors acknowledge a second main conclusion: the best way of handling bias is often context specific, so one ought to try multiple correction methods in practice. Their somewhat strangely chosen set of correction methods further reinforces this point, as other methods that have been demonstrated to be effective went untested or were replaced with minor variants that may have changed their effect.

 


Effects of sample size on the performance of species distribution models

Wisz, M.S., Hijmans, R.J., Li, J., Peterson, A.T., Graham, C.H. and Guisan, A., 2008. Effects of sample size on the performance of species distribution models. Diversity and Distributions, 14(5), pp.763-773.

http://onlinelibrary.wiley.com/doi/10.1111/j.1472-4642.2008.00482.x/pdf

 

Wisz et al. set out to address the ways in which limited sample size (often a problem when constructing species distribution models) affects model performance. Sample size matters for SDMs because the uncertainty of parameter estimates decreases as sample size increases. This importance is compounded by the high dimensionality of the environmental space often being modeled and by the fact that interactions between multiple environmental dimensions are often important. All this serves to increase the total number of parameters to be estimated, placing further demands on sample size. The authors therefore tested the sensitivity of 12 different species distribution models (see Table 1 in the paper for the full list) to variation in sample size. Models were trained with presence-only data from natural history collections and evaluated, via AUC, with independent presence-absence data from planned surveys. All 12 models were trained on 10, 30, and 100 randomly sampled presence points for each species. Each of these training subsets was evaluated for its degree of overlap in environmental space with the evaluation points for that species. Finally, linear mixed effects models (LMEs) were used to determine which factors significantly affect AUC. Smaller samples exhibited significantly lower environmental range overlap than larger samples. The LMEs showed a significant effect of the interaction between modeling method and sample size. Nearly all algorithms performed better with more records (the exceptions being DOMAIN and LIVES). MaxEnt performed best at low sample sizes and was the second-best performer at intermediate and high sample sizes (outperformed by GBM), with intermediate variances at all sizes. BIOCLIM exhibited the lowest variance across all models but also very low AUCs. At 10 records, MaxEnt, OM-GARP, and DOMAIN all had high AUCs with intermediate variances.
As expected, methods that model complex predictor relationships were particularly sensitive to sample size (e.g. GAM, GBM, BRUTO). A major open question left by the work is what the difference is between randomly subsetted data and “naturally data-depauperate” species records resulting from biological rarity. It stands to reason that the above analyses may not accurately apply to such a situation.
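Since AUC on independent presence-absence data is the yardstick throughout this study, a small Python sketch of how it can be computed may help. This is a generic rank-based definition (the probability that a presence site receives a higher score than an absence site), not the authors' code, and the scores below are made up for illustration.

```python
def auc(presence_scores, absence_scores):
    """Probability that a random presence site outscores a random absence site.

    Ties between a presence score and an absence score count as half a win.
    """
    wins = 0.0
    for p in presence_scores:
        for a in absence_scores:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5
    return wins / (len(presence_scores) * len(absence_scores))

# Hypothetical model predictions at evaluation sites.
presence = [0.9, 0.8, 0.4]
absence = [0.7, 0.3, 0.2, 0.4]
print(auc(presence, absence))  # 10.5 wins out of 12 pairs -> 0.875
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of presences from absences.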

 


Potential for spread of the white-nose fungus (Pseudogymnoascus destructans) in the Americas: use of Maxent and NicheA to assure strict model transference

Escobar et al. (2014) model the range of Pseudogymnoascus destructans, the pathogenic fungus that causes white-nose syndrome (WNS) in bats, in North America and Europe. They collected 218 observations of WNS for use in Maxent species distribution models. Variables in their model included the usual suspects (temperature, etc.) along with some that are more important for modeling bat and fungus distribution (mean temperatures of coldest winters, annual precipitation, etc.). Maxent models were calibrated on buffer areas of 500 km around the European and North American occurrences, and the calibrated models were then projected to South America. Three different approaches to projection were used: one without extrapolation, one with extrapolation, and NicheA models. NicheA models describe the fundamental niche in environmental space, identifying geographic locations whose environments fall inside the minimum ellipsoid in environmental space that contains all training points. To understand habitat differences between North America and Europe, the authors conducted tests of background similarity. They found no significant divergence between the two continents, suggesting that no niche differentiation has demonstrably taken place. However, depending on the projection approach and the origin of the training data (North America, Europe, or both), Maxent transfer models differed markedly. Their use of NicheA, which reminds me of RangeBag, produced results similar to those of Maxent. I’m not sure I fully buy their projections to South America, given the differing composition of bat communities there and the lack of clear evidence that WNS can infect these different bat species.
Even if these results give rise to errors of commission in modeling WNS, I think the study is important because it can point ecologists and conservationists to regions at risk for transmission.

 

Escobar, L.E. et al., 2014. Potential for spread of the white-nose fungus (Pseudogymnoascus destructans) in the Americas: use of Maxent and NicheA to assure strict model transference. Geospatial Health, 9(1), pp.221–229.

 

Poisson point process models solve the “pseudo-absence problem” for presence-only data in ecology

Warton, David I.; Shepherd, Leah C. Poisson point process models solve the “pseudo-absence problem” for presence-only data in ecology. Ann. Appl. Stat. 4 (2010), no. 3, 1383–1402. doi:10.1214/10-AOAS331. http://projecteuclid.org/euclid.aoas/1287409378.

“Pseudo-absences” are commonly used by ecologists to model species distributions so that traditional presence/absence regression methods can be applied. However, this approach has three main weaknesses, which relate to model specification, interpretation, and implementation. Warton and Shepherd propose point process models as an appropriate tool for species distribution modeling of presence-only data, given that presence data are actually a set of point locations. Assuming the locations of point events are independent, the intensity at a point is modeled as a function of explanatory variables. They also link the point process model to the logistic regression approach, showing that when a logistic regression model is fitted with an increasing number of pseudo-absences, its slope parameters converge to the point process slope estimates. As an illustration, they construct Poisson point process models for the intensity of Angophora costata records as a function of a set of explanatory variables. They summarize how the point process model (PPM) addresses the three weaknesses of the logistic regression approach:
Specification – A point process is a plausible model of the data-generating mechanism for presence-only data, whereas logistic regression coerces the data to fit the model rather than choosing a model that fits the original data.
Interpretation – The intensity at a point has a natural interpretation as the expected number of presences per unit area, which is not sensitive to the choice of quadrature points.
Implementation – The PPM offers a framework for choosing pseudo-absences, which is not available for logistic regression.
The point process model introduced in this paper directly addresses some key concerns currently raised about “pseudo-absence” approaches to species distribution modeling. Though the independence of points, the basic assumption of Poisson point process models, may result in some lack of fit for specific data sets, this can be addressed by modeling spatial clustering to account for spatial dependence. It would be great to see an example that employs point process models with systematic consideration of sampling bias, analysis of point independence, model fitting, and model diagnostics.
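For reference, the log-linear intensity and log-likelihood of an inhomogeneous Poisson point process, as summarized above, can be written as follows. The notation is the generic textbook form and is an assumption here rather than a copy of the paper's: \lambda(s) is the intensity at location s, x(s) the explanatory variables, s_1, ..., s_n the presence locations in the study region A, and s_1, ..., s_m quadrature points with weights w_j used to approximate the integral. This is where pseudo-absences enter: they play the role of quadrature points.

```latex
\log \lambda(s) = \beta_0 + x(s)^\top \beta,
\qquad
\ell(\beta_0, \beta)
  = \sum_{i=1}^{n} \log \lambda(s_i) - \int_{A} \lambda(s)\, ds + \text{const}
  \approx \sum_{i=1}^{n} \log \lambda(s_i) - \sum_{j=1}^{m} w_j\, \lambda(s_j) + \text{const}
```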

Historically calibrated predictions of butterfly species’ range shift using global change as a pseudo-experiment

Kharouba, Heather M., Adam C. Algar, and Jeremy T. Kerr. “Historically calibrated predictions of butterfly species’ range shift using global change as a pseudo-experiment.” Ecology 90.8 (2009): 2213-2222.

DOI: 10.1890/08-1304.1

In this case study, Kharouba et al. used climate and land use change in Canada as a pseudo-experiment to test how reliably models predict species range shifts over long time periods (30-60 years) and very large geographical areas, for 297 butterfly species. They used historical distribution data with six environmental predictor variables over a 10 million km2 area and modeled distributions with MaxEnt. The steps were: (1) generate a historical species distribution model (1900-1930); (2) project this model onto environmental data from 1960-1990 (projected model); (3) model the species distribution using current environmental data and species occurrence records (current model); and (4) test how well the projected models predict the current models (by comparison with the actual current distribution), to determine whether the method is suitable for predicting changes in species distributions over time. The accuracy of each model was assessed using AUC. Models of the historical and current distributions built individually had high AUC values, but when the historical model was used to project the current distribution, it both underestimated and overestimated suitable habitat relative to the actual current distribution. Results depended on the species of interest and how that species responds to environmental change. Using this method to predict future distributions in response to climate change can be considered reliable, but projection accuracy depends on scale (pixel vs. region). Other factors that should be considered when using this method, and that could make it even stronger, include plant responses to climate change (plants being butterfly resources), butterfly feeding habits (i.e. generalist vs. specialist), species traits and their responses to climate change, and species responses to community-level changes.

Forecasting Chikungunya spread in the Americas via data-driven empirical approaches

Escobar, L. E., Qiao, H., & Peterson, A. T. (2016). Forecasting Chikungunya spread in the Americas via data-driven empirical approaches. Parasites & Vectors, 1–12. http://doi.org/10.1186/s13071-016-1403-y


 

The goal of this paper was to predict the spread of Chikungunya in the Americas during the epidemic using 1) ecological niche models of Aedes aegypti and Aedes albopictus, 2) air travel data as a measure of imported cases, and 3) curves fitted to reported CHIKV data as a measure of local transmission. Case data for the Americas were reported by country by the Pan American Health Organization. In addition to the lack of standardization in reporting, the case data showed ‘surveillance fatigue’, in which reporting became erratic and uneven in the later stages of the epidemic, suggesting that reports from earlier in the epidemic may yield more accurate models. When imported and local cases were combined, model predictions based on earlier reports matched later case data, suggesting that air travel is an important and accurate predictor of country-to-country transmission.

The ecological niches were estimated using climate envelopes, which create ellipsoids, similar to a convex hull method. The minimum-volume ellipsoid method of climate envelopes creates semi-axes which reduce the Euclidean distances between occurrence points in environmental space. Rather than using all of the WorldClim variables, the authors used a principal components analysis to remove correlation among them, and chose the first three principal components as their environmental axes. The authors chose to use occurrence data from the global distribution of the vectors, in an attempt to estimate the fundamental niche rather than the realized niche. The output of the climate envelope was a niche centroid, where the semi-axes cross in environmental space. Hotspots were defined as the areas closest to the niche centroid in environmental space. It seems, then, that the envelope is not a boundary classifier but ranks locations by distance to the niche centroid, so it may not perform as well at the edges of the species range as estimators such as support vector machines. They found that the niche models generally agreed with the CHIKV case data, with the areas closest to the niche centroid located in the Caribbean, where CHIKV was first introduced in the Americas.
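The idea of ranking locations by their distance to the niche centroid can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the authors' method: the centroid is taken as the plain mean of the occurrence points (not the center of a fitted minimum-volume ellipsoid), distances are Euclidean, and the site names and coordinates in the three-dimensional environmental space are made up.

```python
import math

def centroid(points):
    """Coordinate-wise mean of the points; a stand-in for the ellipsoid center."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))

def rank_by_centroid_distance(occurrences, sites):
    """Return site names ordered from closest to farthest from the niche centroid."""
    c = centroid(occurrences)
    dist = {name: math.dist(c, env) for name, env in sites.items()}
    return sorted(dist, key=dist.get)

# Hypothetical occurrences and candidate sites in a 3-axis environmental space.
occurrences = [(1.0, 2.0, 0.0), (3.0, 2.0, 0.0), (2.0, 4.0, 0.0), (2.0, 0.0, 0.0)]
sites = {"A": (2.0, 2.0, 0.0), "B": (5.0, 2.0, 0.0), "C": (2.0, 3.0, 0.0)}
print(rank_by_centroid_distance(occurrences, sites))  # ['A', 'C', 'B']
```

Because the output is a ranking rather than an inside/outside decision, every site receives a score, which is why such an approach says little about where the niche boundary lies.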

The use of a minimum-volume ellipsoid was well suited to the study of the start of the CHIKV epidemic because the epidemic began in the area that is also best suited for the vector. I do not think it would be as appropriate when applied to more temperate areas further from the niche centroid, because the centroid seems to be where the model is most accurate.

Ecological niche and potential distribution of Anopheles arabiensis in Africa in 2050

Drake, J. M., & Beier, J. C. (2014). Ecological niche and potential distribution of Anopheles arabiensis in Africa in 2050. Malaria Journal, 13(1), 213–23. http://doi.org/10.1186/1475-2875-13-213


Anopheles arabiensis is an important vector of malaria in sub-Saharan Africa because it is exophilic, and therefore less likely to be controlled by current elimination efforts focused on indoor residual spraying and insecticide-treated nets. Drake & Beier used a presence-only method of ecological niche modeling to predict the distribution of this vector in 2050, based on climate projection models. This is one of the few studies to use LOBAG-OC since the original paper was published in 2014. LOBAG-OC was chosen because it is a better discriminator of niche boundaries, the area in which other species distribution models tend to fail. The model used 307 occurrence points, of which 246 were in the training set, and 86 environmental features constructed from WorldClim data, all of which were clipped to the African continent. The authors conducted a principal components analysis on the environmental features to examine the gross structure and found that the majority of variation was explained by the first two principal components. The fitted model describes An. arabiensis as a climate generalist, because of its wide baseline distribution across the African continent. When the fitted model was applied to three climate change scenarios for 2050 based on IPCC projections, all three scenarios predicted significant reductions in the area suitable for An. arabiensis. Variation among the three scenarios was calculated as a measure of uncertainty, and showed strong congruence among models. The key drivers of the predicted decrease in area are temperature and precipitation during the dry season. It is suggested that a cordon sanitaire may help control this fragmented, reduced population of malaria vectors in the future. Given the importance of urban areas in current and future vector-borne disease risk, I would be interested in seeing a similar method applied to incorporate predictions of population growth.
It may be that these reductions (many in rural areas) are counterbalanced by increases in urban areas, so that the overall per capita burden is unchanged.

Spatially explicit predictions of blood parasites in a widely distributed African rainforest bird

Sehgal, R. N. M., et al. “Spatially explicit predictions of blood parasites in a widely distributed African rainforest bird.” Proceedings of the Royal Society of London B: Biological Sciences 278.1708 (2011): 1025-1033.


Predicting the potential spatial distribution of parasite species has both obvious rewards (e.g., mitigating human disease) and inherent difficulties. One of these difficulties is that the distribution of parasites is commonly determined by two different, but interacting, filters. Parasites are obligately dependent on a host at some life stage, meaning their distributions are constrained by host distributions. Further, they are still subject to the external environment. Here, the authors use infected host records as point occurrences to train Maximum Entropy models. Specifically, occurrence records consisted of olive sunbird hosts infected by one of two avian parasites (_Plasmodium_ or _Trypanosoma_). These parasites exist on other hosts, and the host likely exists outside the area examined (West Africa). Using these occurrence records, they created geographic maps of the occurrence probability of infected birds (as that is what their occurrence records are). They determined environmental variable importance (Figure 1 in the paper) for both parasites, and then combined these results with a random forest analysis to predict pathogen prevalence across space. This was done by training random forests on prevalence data using environmental covariates, and then projecting the results onto unsampled regions in space, constrained by the MaxEnt occurrence probability predictions. Neat idea, neat paper, lots of questions raised about their approach. There are many assumptions built into using infected hosts as occurrence points, and even more in projecting prevalence-environment relationships onto point predictions from MaxEnt (i.e., doesn’t this assume transmission is not a function of host density, population genetics, the interacting community of hosts/non-hosts, etc., but is instead only a function of environment?).

MaxEnt versus MaxLike: empirical comparisons with ant species distributions

Fitzpatrick, M. C., N. J. Gotelli, and A. M. Ellison. 2013. MaxEnt versus MaxLike: empirical comparisons with ant species distributions. Ecosphere 4(5):55. http://dx.doi.org/10.1890/ES13-00066.1
MaxEnt is one of the most widely used tools for species distribution modeling using presence-background data. Despite its popularity, the exponential model implemented by MaxEnt does not directly estimate occurrence probability but rather an index of relative habitat suitability. Royle et al. suggested that the logistic output of MaxEnt may differ substantially from the underlying occurrence probabilities. MaxLike is a relatively new maximum-likelihood estimator of the probability of occurrence from presence-only data. Fitzpatrick et al. compared the performance and relative merits of MaxEnt and MaxLike using occurrence records for six species of ants in New England. They evaluated model outputs in terms of their statistical fit to the training data (AIC), their spatial predictions of occurrence relative to testing data (minimum predicted area and AUC), and the authors' own professional judgment. Though MaxEnt accounts for sampling bias and allows greater model complexity, their results showed that MaxLike outperformed MaxEnt when occurrence data were relatively few and spatial coverage of the range was limited. They therefore suggested using MaxLike as an alternative to the widely used MaxEnt framework. I think it is necessary to remain critical of widely used modeling methods and to think about alternatives. It would be interesting to test these two methods on species other than ants. Since MaxLike is a relatively new method, its robustness remains to be tested in more applications, while MaxEnt has already been used for a wide variety of species.

Measuring ecological niche overlap from occurrence and spatial environmental data

Broennimann, Olivier, et al. “Measuring ecological niche overlap from occurrence and spatial environmental data.” Global Ecology and Biogeography 21.4 (2012): 481-497.

The authors put forth a new method that measures niche overlap between two similar species, or within the same species in different geographic regions (endemic and invasive). The framework follows three steps: first, calculate the density of species occurrences and of environmental factors along the environmental axes of a multivariate analysis; second, measure the niche overlap along the gradients of the multivariate analysis; and third, test niche equivalency and similarity. To account for differences in sampling strategy, the researchers apply a kernel density function to species occurrences in environmental space. The same function is also applied to the occurrence of environmental cells.


Niche overlap is then compared using the D metric:

D = 1 - \frac{1}{2} \sum_{ij} | z_{1ij} - z_{2ij} |

where z1ij is the occupancy of species 1 and z2ij the occupancy of species 2 in cell ij of environmental space. D varies between 0 (no overlap) and 1 (complete overlap). Comparing the two niches statistically entails two tests: niche equivalency (whether the niches in the two ranges are effectively identical) and niche similarity (whether one niche resembles the other more than expected by chance).
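As a rough illustration of the D metric described above, the following Python sketch computes it for two small occupancy grids. It assumes that D is Schoener's D (1 minus half the summed absolute differences between the two occupancy surfaces) and that each grid is first normalized to sum to 1; the function name and the toy grids are hypothetical.

```python
def schoener_d(z1, z2):
    """Schoener's D between two occupancy grids given as equally shaped nested lists."""
    flat1 = [v for row in z1 for v in row]
    flat2 = [v for row in z2 for v in row]
    # Normalize each grid to sum to 1 so that D lies between 0 and 1.
    s1, s2 = sum(flat1), sum(flat2)
    return 1.0 - 0.5 * sum(abs(a / s1 - b / s2) for a, b in zip(flat1, flat2))

# Toy occupancy grids over a 2 x 2 environmental space.
z1 = [[0.0, 2.0], [2.0, 0.0]]
z2 = [[0.0, 1.0], [0.0, 3.0]]
print(schoener_d(z1, z2))  # 0.25
```

Two identical grids give D = 1, and two grids that never occupy the same cell give D = 0.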

 

To evaluate the proposed method, the researchers conducted a simulation study of two virtual entities with varying degrees of niche overlap. The environmental parameters that drive species distribution were based on climate conditions found in North America and Europe. The researchers also tested the method on two cases of species invasion. Finally, they compared their framework across species distribution models (e.g., MaxEnt) and ordination techniques.

Results in niche detection were variable with traditional SDM methods (Figures 3-5). Among ordination methods that did not depend on prior grouping, PCA-env performed best on both the EU and NA data sets. No method was considered best among those that depended on prior grouping. Among the SDM methods, MaxEnt achieved the best result in measuring niche overlap.

Figure 4. Sensitivity analysis of simulated versus detected niche overlap for different SDM algorithms: (a) generalized linear models, (b) MaxEnt, (c) gradient boosting machine, and (d) random forests.

The results demonstrate the framework's ability to determine niche overlap between and within species. The method presented here improves on previous methods in two ways. First, it removes the dependency of species occurrence on the frequency of the different climatic conditions that occur across a region. Second, smoothing species densities makes species occurrence independent of both sampling effort and the resolution of the environmental data. Both of these improvements help minimize the influence of data resolution on the measurement of niche overlap.