Fast and flexible Bayesian species distribution modelling using Gaussian processes

Golding and Purse suggest that Gaussian process (GP) species distribution models (SDMs) with Bayesian priors may benefit ecologists who wish to incorporate prior knowledge of their system while retaining the speed and predictive accuracy of other models. Gaussian processes can fit complex statistical models (i.e. models with more statistical terms), but typically require computationally intensive inference methods (e.g. Markov chain Monte Carlo, MCMC). Consequently, the authors evaluate another way of fitting GP SDMs by comparing predictive ability and run time with other commonly used approaches, using a dataset from the North American Breeding Bird Survey for both presence/absence and presence-only data. Models compared in their study include a GP model, a generalized additive model (GAM), and a boosted regression tree (BRT) model. Instead of fitting GP SDMs with MCMC, they evaluate the efficacy of more efficient deterministic inference procedures, namely Laplace approximation and expectation propagation. Deterministic approximations are subject to error that may decrease the accuracy of predictions, but the authors argue that even with these limitations GP models fitted with deterministic inference are a promising method for SDM analyses. They found that the predictive accuracy of GP SDMs fitted by Laplace approximation was higher than that of BRTs, GAMs, and logistic regression for presence/absence data and higher than that of all compared models for presence-only data. Additionally, GP SDMs were just as fast as GAMs. For situations in which data on species occurrence are sparse (such as vector abundance and distribution) but the distributions of hosts (e.g. cattle or humans) are better documented, this method would allow integration of multiple types of prior information.

 

Golding, N. & Purse, B.V., 2016. Fast and flexible Bayesian species distribution modelling using Gaussian processes. Methods in Ecology and Evolution. doi: 10.1111/2041-210X.

Species distribution models that do not incorporate global data misrepresent potential distributions: a case study using Iberian diving beetles

Species distribution models have been used since the 1980s to predict probable distributions using a combination of species occurrence data and environmental predictors thought to influence those distributions. While distribution modeling offers a way to predict species distributions from incomplete data, using data that do not encompass the entire range of a species may lead to geographic bias in the potential distribution predicted by the model. This study aims to determine whether modeling with regionally biased data predicts incomplete potential distributions and to examine why regional data may not adequately describe the potential distribution. The results show that distributions predicted with regional data provide an incomplete description of the environmental limits of a species when compared to distributions modeled using data covering the entire species range. For this reason, it is recommended that potential distributions be modeled using data from all known populations or from a subsample of populations across the entire range. While this study demonstrates the importance of using data from across the entire known range when predicting climate-based potential distributions, it does not consider other factors that may influence distribution. Some areas within the range of the beetles have no presence records, which may be due to limits on the natural dispersal of these species rather than to the climate variables in those areas.

 

Sanchez-Fernandez, D., Lobo, J. M. and Hernandez-Manrique, O. L. 2011. Species distribution models that do not incorporate global data misrepresent potential distributions: a case study using Iberian diving beetles. Diversity and Distributions, 17, 163-171. DOI: 10.1111/j.1472-4642.2010.00716.x

Anonymous nuclear markers reveal taxonomic incongruence and long-term disjunction in a cactus species complex with continental-island distribution in South America

Motivation: The Pilosocereus aurisetus complex comprises 8 cactus species associated with the rocky savannas of eastern Brazil. Species have been defined by morphological and genetic traits. However, different genetic markers lead to different conclusions. For these reasons, the authors attempt to answer the following questions regarding the diversification of the complex:

(1) Are the northern P. aurisetus populations more closely related to the other conspecific populations in the Espinhaço Mountain range or to populations of other species in Central Brazil, as shown by cpDNA data?

(2) Is the currently recognized P. machrisii species composed of two distinct lineages?

(3) What is the relationship of P. jauruensis with the other species of the complex?

Additionally, the authors tested climatic niche differences between the observed geographic lineages in the hope of making some inference about the complex’s phylogeographical history.

Methods: The authors used amplicons from AFLP of 40 Pilosocereus samples, comprising 4 species from the P. aurisetus species group and an outgroup. These species have the widest distributions and were the most phylogenetically unresolved. Sequences were processed to identify loci and then alleles across the species and populations. The alleles were used to infer the most likely number of interbreeding groups in the data set without any sampling site information. These interbreeding groups were then treated as operational taxonomic units (OTUs) and used to estimate a species phylogenetic tree. Species occurrence data were obtained from GPS measurements taken during transects of the range, in addition to occurrences from the Global Biodiversity Information Facility. Sample sizes were generally small for each species and therefore not prone to overfitting. Climatic divergence, in addition to genetic divergence, was tested by grouping the occurrences according to the genetic lineages recovered by the phylogenetic analysis. The effects of past climatic oscillations on the niche of each lineage were determined by fitting the models under present, 21 kya (Last Glacial Maximum, LGM), and 135 kya (Last Interglacial, LIG) scenarios using 3 different algorithms. Of the 19 BIOCLIM variables, the authors used 6 that were shown to have low correlation and high informativeness. The model outputs were converted into presence/absence data using a threshold at which the ratio of true positives to actual positives equals the ratio of true negatives to actual negatives (i.e. sensitivity equals specificity). An area with at least 3 overlapping projections was considered suitable; climatically stable areas were those suitable in all 3 time periods.
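The thresholding and consensus steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the example suitability scores, and the four example binary maps are hypothetical and are not taken from the paper, which does not spell out its implementation in this summary.

```python
def sens_spec(scores, labels, threshold):
    """Return (sensitivity, specificity) when scores >= threshold count as presence."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, tn / negatives


def equal_sens_spec_threshold(scores, labels):
    """Pick the observed score at which sensitivity and specificity are closest."""
    def gap(threshold):
        sens, spec = sens_spec(scores, labels, threshold)
        return abs(sens - spec)
    return min(sorted(set(scores)), key=gap)


def consensus_suitable(binary_maps, min_votes=3):
    """Mark a cell suitable when at least min_votes binary projections agree."""
    return [int(sum(cell) >= min_votes) for cell in zip(*binary_maps)]


# Hypothetical suitability scores (1 = known presence, 0 = known absence).
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
print(equal_sens_spec_threshold(scores, labels))  # 0.6

# Four hypothetical binary projections over the same four grid cells.
maps = [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 0, 0], [0, 1, 1, 1]]
print(consensus_suitable(maps))  # [1, 1, 0, 1]
```

At the chosen threshold both sensitivity and specificity are 0.75 in this toy example; with real model output the two rarely match exactly, which is why the sketch minimises their difference instead of requiring equality.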

Results and Discussion: The genetic analysis inferred 5 mating groups split between two main geographic lineages. The two lineages had minimal overlap in all time periods, and this overlap was even smaller for stable areas (those suitable in all three time periods). The climatic niche does not appear to have changed over time, indicating that range shifts were not crucial for present-day distributions.

Thoughts: The genetic analysis was very thorough and well developed. However, the niche mapping wasn’t fully integrated into the rest of the study. I think this is a good example of the consequences of developing easy-to-use data (WorldClim). It is not really clear how determining the climatic niche over time strengthens the authors’ phylogenetic conclusions.


Manolo F. Perez, Bryan C. Carstens, Gustavo L. Rodrigues, Evandro M. Moraes. Anonymous nuclear markers reveal taxonomic incongruence and long-term disjunction in a cactus species complex with continental-island distribution in South America. Molecular Phylogenetics and Evolution. Volume 95, February 2016, Pages 11–19 doi:10.1016/j.ympev.2015.11.005

Evaluating alternative data sets for ecological niche models of birds in the Andes

Typically, researchers use interpolated climate data or remotely sensed environmental data to build ecological niche models (ENMs). Parra et al. conducted the first assessment of the relative performance of models built from three different datasets: climate data, Normalized Difference Vegetation Index (NDVI), and elevation data. They compared predicted versus expected distributions of six bird species in the Ecuadorian Andes. Using BIOCLIM, they developed seven models based on the three datasets and all of their combinations. Predictive maps were compared with maps based on expert knowledge, and sensitivity, specificity, positive predictive power, and Kappa were calculated. They found that models that included climate variables performed well across most measures, whereas those that used only NDVI performed the worst. Meanwhile, models based on elevation data showed high over-prediction errors. They concluded that it is usually beneficial to include various datasets in ENMs when possible. The quality of remote sensing data should be evaluated carefully before the data are included, especially for regions with complex topography or cloudy weather. This comparison, however, may reveal a regional trend for the Ecuadorian Andes rather than a general rule, considering the special landscape, high levels of endemism, and species richness of the study area. Therefore, similar modeling comparisons elsewhere would improve our understanding of how the choice of data affects ENMs.
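For readers unfamiliar with the four agreement measures, the sketch below shows how they are conventionally computed from a cell-by-cell comparison of a predicted map with an expert map. The function names and the eight-cell example maps are hypothetical, and this is the textbook definition of each measure, not code from Parra et al.

```python
def confusion_counts(predicted, expected):
    """Count true/false positives and negatives over paired map cells."""
    tp = fp = fn = tn = 0
    for p, e in zip(predicted, expected):
        if p and e:
            tp += 1
        elif p and not e:
            fp += 1  # over-prediction (commission error)
        elif not p and e:
            fn += 1  # under-prediction (omission error)
        else:
            tn += 1
    return tp, fp, fn, tn


def map_agreement(predicted, expected):
    """Return sensitivity, specificity, positive predictive power and Cohen's kappa."""
    tp, fp, fn, tn = confusion_counts(predicted, expected)
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals of both maps.
    chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "positive_predictive_power": tp / (tp + fp),
        "kappa": (observed - chance) / (1 - chance),
    }


# Hypothetical 8-cell maps: 1 = species predicted/expected present.
predicted = [1, 1, 1, 0, 0, 0, 1, 0]
expected = [1, 1, 0, 0, 0, 1, 1, 0]
print(map_agreement(predicted, expected))
```

In this toy case sensitivity, specificity, and positive predictive power are all 0.75 and kappa is 0.5. A model with high over-prediction, like the elevation-only models described above, would show up as a large false-positive count and therefore low specificity and positive predictive power.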

Sensitivity of predictive species distribution models to change in grain size

When using species distribution models (SDMs), grain size (resolution) is a spatial factor that may influence predictive model outcomes. Guisan et al. (2007) tested the effect of grain size on SDMs by comparing the performance of 10 predictive modelling techniques (DIVA-GIS, DOMAIN, GLM, GAM, BRUTO, MARS, BRT, OM-GARP, GDMSS, and MAXENT-T) on presence-only data for 50 species in 5 different regions (from Elith et al. 2006), and also determined whether the effects observed depended on the type of region, modelling technique, or organism considered. Model performance at two grain sizes (original and 10-fold coarser) was assessed, and prediction success was compared and ranked using the area under the ROC curve. Increasing grain size did not strongly affect model performance, although it did degrade models on average. Although the outcome is surprising, the somewhat fundamental question reflects realistic issues in SDM. Testing 10 modelling techniques was a well-thought-out approach to determining factors that apparently aren’t influenced by grain (unless the original data lacked predictive power, which wouldn’t be influenced by scale anyway). It would be interesting for a follow-up paper to test other variables that may be more affected by changes in grain size (sessile organisms, species with small home ranges, or factors at the microhabitat level).
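As a reminder of what the comparison metric measures, here is a minimal sketch of the area under the ROC curve computed as the probability that a randomly chosen presence site scores higher than a randomly chosen absence site (ties count as half). The function name and the two sets of scores are hypothetical and purely illustrative; they are not values from Guisan et al.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise presence/absence comparisons."""
    presences = [s for s, y in zip(scores, labels) if y == 1]
    absences = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in presences:
        for a in absences:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5  # ties contribute half a win
    return wins / (len(presences) * len(absences))


# Hypothetical evaluation sites and model scores at two grain sizes.
labels = [1, 1, 1, 0, 0, 0]
fine_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
coarse_scores = [0.7, 0.5, 0.4, 0.6, 0.5, 0.2]

print(round(auc(fine_scores, labels), 3))    # 0.889
print(round(auc(coarse_scores, labels), 3))  # 0.611
```

Ranking techniques by AUC at each grain size, as the study did, then amounts to repeating this calculation per technique, species, and resolution and comparing the resulting values.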

The role of land cover in bioclimatic models depends on spatial resolution

DOI: 10.1111/j.1466-8238.2006.00262.x

The spatial scale at which species distribution modeling is undertaken is of fundamental importance for ecological studies. The current paradigm holds that climate governs species distributions at broad biogeographical scales, whereas land cover and habitat suitability affect species occupancy patterns, especially at fine resolution. In this context, Luoto et al. tested whether integrating land cover data affects bioclimatic models by constructing generalized additive models for 80 bird species as a function of (1) pure climate and (2) climate and land cover variables. Models were constructed at 10 km, 20 km, 40 km, and 80 km resolutions. They evaluated their models using area under the curve (AUC) and found that model performance generally increased when land cover was included at 10 km and 20 km. In contrast, the inclusion of land cover decreased model AUC at 80 km resolution. Therefore, they concluded that the determinants of bird species distributions are hierarchically structured, and that integrating land cover at 10-20 km resolution can improve our understanding of biogeographical patterns of birds in their study area. This paper examined the effects of spatial resolution over a range of scales. However, whether a certain spatial resolution is fine or coarse is species-dependent and question-driven. It would be interesting to discuss a protocol that helps determine the appropriate spatial scale for general species distribution modeling.

Projected distributions of two species with different modelling accuracies and habitat preferences: the occurrence of marsh harrier (Circus aeruginosus): (a) climate model and (b) climate-land cover model; and the occurrence of grey-headed woodpecker (Picus canus): (c) climate model and (d) climate-land cover model. Black dots represent the sampling plots where the species was present, and shaded areas are the areas modelled as suitable for the species. To determine the probability thresholds at which the predicted values for species occupancy are optimally classified as absence or presence values, we used prevalence of the species as the probability level as suggested by Liu et al. (2005). D2 = percentage of explained deviance and AUC = the area under the curve of a receiver operating characteristic (ROC) plot.

SDMdata: A Web-Based Software Tool for Collecting Species Occurrence Records

Obtaining data dynamically and programmatically is necessary for reproducible research. This is a blanket statement. What I mean specifically is that the ability to access data programmatically from a version-controlled source allows for the consistent use of data. Currently, many databases are accessible through web-based interfaces but have no API or other method to access the data programmatically. This matters because subsequent analysis of the data is based only on a snapshot of a potentially dynamic database. Ideally, a complete workflow would include pulling the data from a database, cleaning it, analyzing it, and outputting results. This paper introduces a tool to download and clean species occurrence data from GBIF (Global Biodiversity Information Facility). The tool is web-based and written in Python; it takes a list of species names and outputs occurrence data from GBIF. The authors argue that the current _R_ implementation (`rgbif`) is flawed because of memory limitations (which is a pretty facile argument). I do like that `SDMdata` has an error-checking feature that will flag suspected errors. However, the proliferation of tools to query databases tends to “muddy the waters” in my opinion. Several resources already exist for programmatic data acquisition from GBIF in R, SQL, and Python. Perhaps this tool adds something novel; perhaps we should focus on making existing tools better.
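To make the "cleaning" step of such a workflow concrete, here is a minimal sketch of the kind of error flagging one might apply to downloaded occurrence records. The checks, the function names, and the example records are my own assumptions for illustration and are not the checks `SDMdata` actually performs; the field names follow the Darwin Core terms used in GBIF occurrence records (`decimalLatitude`, `decimalLongitude`).

```python
def flag_record(record):
    """Return a list of reasons a record looks suspect (empty list = passed)."""
    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    if lat is None or lon is None:
        return ["missing coordinates"]
    flags = []
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        flags.append("coordinates out of range")
    if lat == 0 and lon == 0:
        flags.append("zero coordinates")
    return flags


def clean_occurrences(records):
    """Split records into (kept, flagged); flagged pairs each record with its reasons."""
    kept, flagged, seen = [], [], set()
    for record in records:
        reasons = flag_record(record)
        key = (record.get("species"),
               record.get("decimalLatitude"),
               record.get("decimalLongitude"))
        if not reasons and key in seen:
            reasons.append("duplicate")
        if reasons:
            flagged.append((record, reasons))
        else:
            seen.add(key)
            kept.append(record)
    return kept, flagged


# Made-up records standing in for a downloaded snapshot.
records = [
    {"species": "Circus aeruginosus", "decimalLatitude": 60.2, "decimalLongitude": 24.9},
    {"species": "Circus aeruginosus", "decimalLatitude": 60.2, "decimalLongitude": 24.9},
    {"species": "Circus aeruginosus", "decimalLatitude": 0.0, "decimalLongitude": 0.0},
    {"species": "Circus aeruginosus", "decimalLatitude": None, "decimalLongitude": 24.9},
    {"species": "Circus aeruginosus", "decimalLatitude": 95.0, "decimalLongitude": 24.9},
]
kept, flagged = clean_occurrences(records)
print(len(kept), [reasons for _, reasons in flagged])
```

Keeping the flagged records alongside the reasons, instead of silently dropping them, means the cleaning decisions stay visible and the same script can be rerun against a later snapshot of the database.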

 

Link to paper

Link to software 

Not the time or the place: the missing spatio‐temporal link in publicly available genetic data

Data archiving is mandatory for many journals in order to encourage data openness and the re-use of scientific data. However, **how** the data are archived can be really important in determining the usefulness of the data to researchers. Further, data archiving itself does not ensure that the study in which the data were used is reproducible. This article demonstrates that 31% of genetic datasets archived as a condition for publication in _Molecular Ecology_ could not be used to reproduce the original analyses. This was largely a metadata problem, in that the data and metadata were not linked well. Furthermore, the quality of the data was a bit coarse in some instances, with geographic data provided in terms of geopolitical location instead of geographic coordinates. Taken together, the authors stress that the data deposition policy has promoted the re-use of data, and that the quality of data increased from 2009 to 2013, but that current genetic data formats that do not allow the inclusion of metadata should be revised, and that data should be well-documented and curated in an appropriate repository instead of as a supplementary file.

 

Article available here

Interpretation of Models of Fundamental Ecological Niches and Species’ Distributional Areas

Soberón and Peterson present a discussion of two broad ways in which researchers generally estimate the fundamental niche of a species. The first is the mechanistic approach, which combines studies of the physiology that contributes to positive fitness with information from a geographic information system to display suitable habitats. The second indirectly identifies important characteristics of species fitness by using survey data and climate factors associated with species occurrence. While the first method may provide a deeper understanding of the within-species drivers that contribute to distribution, it may neglect the effects of species interactions. The second method provides an opportunity to explicitly model species interactions, yet this correlative approach may be subject to some bias. Soberón and Peterson also consider what role scale plays in species distribution, and how various factors can differ in their importance as scale changes. Another consideration is that species absence information needs to be treated carefully with regard to the study objective. Lastly, Soberón and Peterson stress the importance of model validation and suggest the need for well-developed methods. This paper provides insight into key differences between mechanistic niche modeling and the ‘correlative approach’. However, one improvement to the findings in this paper could be a better developed case study (potentially two) or more mathematical reasoning.

DOI: http://dx.doi.org/10.17161/bi.v2i0.4

Changing habitat areas and static reserves: challenges to species protection under climate change

Garden, J. G., O’Donnell, T. and Catterall, C. P. 2015. Changing habitat areas and static reserves: challenges to species protection under climate change. Landscape Ecology, 30, 1959-1973. DOI: 10.1007/s10980-015-0223-3

Changing climates can lead to shifts in the spatial distribution of a species and its suitable habitat, potentially altering the effectiveness of previously fixed protected areas. This paper develops a broad approach to characterizing species’ climate-induced distributional changes due to location displacement or refugial dynamics, along with the effectiveness of the protected area network. Distributional data, climate data, and other environmental data were used to produce species distribution models for 13 species. Areas of suitable habitat for each species were predicted under three climate regimes and overlaid with GIS maps of protected areas. Suitable habitat extent decreased across climate regimes for all 13 species, as did the proportion of refugia extent within the original suitable habitat extent. The amount of protected habitat decreased under future climates, though this is likely due to overall decreases in habitat extent, as the proportion of habitat protected in the study area did not change over time. This study forecasts a decline in suitable habitat for forest-obligate species within the study area as the climate changes. Patterns of species response to the changing climate were better characterized by refugial dynamics than by location displacement. These findings are consistent with species ranges shrinking in the future around refugia within or near the current distribution, as opposed to shifting in location. The purpose of this study was to predict the impact of climate change on the habitat extent of these species, and as such other threats to habitat, such as deforestation, were intentionally not considered. To better predict suitable habitat extent, future research would need to include all threats to habitat in the study area.
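The overlay logic behind these results can be illustrated with a small sketch that treats each map as a set of grid-cell identifiers. Everything here is hypothetical: the function names, the cell sets, and the definition of refugia as cells suitable under both current and future climate are assumptions made for illustration, not the authors' GIS workflow.

```python
def protection_summary(suitable, protected):
    """Return (habitat extent, protected extent, proportion protected) in cells."""
    covered = suitable & protected
    proportion = len(covered) / len(suitable) if suitable else 0.0
    return len(suitable), len(covered), proportion


def refugia(current, future):
    """Cells suitable both now and in the future (assumed definition of refugia)."""
    return current & future


# Hypothetical grid-cell ids for one species.
current_habitat = {1, 2, 3, 4, 5, 6, 7, 8}
future_habitat = {3, 4, 5, 6}
protected_cells = {1, 2, 3, 4, 10}

print(protection_summary(current_habitat, protected_cells))  # (8, 4, 0.5)
print(protection_summary(future_habitat, protected_cells))   # (4, 2, 0.5)
print(len(refugia(current_habitat, future_habitat)) / len(current_habitat))  # 0.5
```

The toy numbers mirror the pattern reported above: the amount of protected habitat falls from 4 cells to 2 while the proportion protected stays at 0.5, because the loss is driven by the shrinking habitat extent and not by a change in reserve coverage.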