Forecasting Chikungunya spread in the Americas via data-driven empirical approaches

Escobar et al. Forecasting Chikungunya spread in the Americas via data-driven empirical approaches. Parasit Vectors. 2016; 9: 112. Published online 2016 Feb 29. doi: 10.1186/s13071-016-1403-y

Chikungunya is endemic to Africa and Asia and is transmitted primarily by Aedes aegypti and Aedes albopictus. The authors map disease risk in the Americas using novel computational tools and data streams: weekly CHIKV reports, air travel, geographic distance and connectivity, and climate suitability for the vector species. Using these data sources, the authors quantified imported cases, local cases at the country level, and geographic hotspots.

The geographic transmission hotspots were identified using a species distribution model (SDM) in which transmission is limited by climate. The fundamental ecological niche was estimated using a climate envelope, based on a minimum-volume ellipsoid describing ecological features of the occupied range. The number of WorldClim variables was reduced by using the top three components of a PCA instead of all variables. The niche centroid of the three components was then used to quantify proximity to the centroid on a continuous map. Summary metrics were calculated for each country.
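The proximity-to-centroid idea can be sketched in a few lines. This is only an illustration under stated assumptions: the PCA scores are taken as already computed, the centroid is the plain arithmetic mean of the occurrence scores (not the center of a minimum-volume ellipsoid), and the Euclidean distance, the linear rescaling and all names and numbers are hypothetical rather than taken from the paper.

```python
import math

def niche_centroid(pc_scores):
    """Mean position of the occurrence records along each PCA axis.

    Assumption: a simple mean stands in for the ellipsoid centroid.
    """
    n = len(pc_scores)
    dims = len(pc_scores[0])
    return tuple(sum(row[d] for row in pc_scores) / n for d in range(dims))

def centroid_suitability(cells, centroid):
    """Rescale distance to the centroid so 1 = at centroid, 0 = farthest cell."""
    distances = [math.dist(cell, centroid) for cell in cells]
    farthest = max(distances)
    if farthest == 0:
        return [1.0] * len(distances)
    return [1.0 - d / farthest for d in distances]

# Hypothetical scores on the first three principal components.
occurrences = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 3.0, 0.0)]
grid_cells = [(1.0, 1.0, 0.0), (1.0, 1.0, 2.0), (5.0, 1.0, 0.0)]

centroid = niche_centroid(occurrences)
scores = centroid_suitability(grid_cells, centroid)
print(centroid, scores)
```

Cells closer to the centroid in climate space get scores nearer 1, which is the quantity that would be mapped continuously and then summarized per country.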

CHIKV was introduced to the Americas in regions with highly competent vectors. Identified hotspots for Ae. aegypti are Haiti, Dominican Republic, Puerto Rico, Guadeloupe, Dominica, Martinique, St Lucia, Saint Vincent and the Grenadines, and Grenada, plus, on the mainland, coastal Venezuela and Brazil, Central America, and the lowlands of Peru and Bolivia. Ae. albopictus had areas of high transmission in the southeastern United States, southern Brazil, central Chile, Central America, and across the Andes Mountains in Bolivia.

Eight (and a half) deadly sins of spatial analysis

Hawkins, B. A. (2012), Eight (and a half) deadly sins of spatial analysis. Journal of Biogeography, 39: 1–9. doi: 10.1111/j.1365-2699.2011.02637.x 


Spatial autocorrelation is not the only issue in spatial analysis, and it is not just a data quality issue. The issues raised focus on regression models.

1. Spatial autocorrelation generates bias
Nature is autocorrelated; species are distributed non-randomly. Understanding the pattern in autocorrelation is the goal of ecologists and biogeographers. However, parametric statistical modeling often requires random data, so perhaps this approach, specifically significance testing, is not appropriate.

2. Spatial regression is best
A common assertion in the literature: if ordinary least squares regression is biased, then generalized least squares must be the best (and only) method.
But there are multiple ways to cope with the bias (or uncertainty), and there is no single best approach. Alternatives include presenting multiple models or model averaging; however, this will never correct for uncritical use of multiple regression.

3. The world is stationary
Stationarity is the assumption that the relationship between predictor and response variables is invariant throughout the data. The consequences of violating it vary with model choice, but will influence the interpretation of parameter values. Despite non-stationarity being common in ecological data, very few studies test or account for this assumption. Testing for it is the bare minimum; ideally, authors would also incorporate non-parametric methods such as CART.

4. Partial regression coefficients mean something
Ecologists would like to identify the most important influence on spatial patterns, but multiple regression is designed to ignore correlations among predictors, making this a very poor approach. Alternatives such as CART or SEM are better suited to asserting causal links.

5. Regression coefficients identify effects
‘Correlation is not causation’ is well known, and ignored. The distinction between statistical effect and mechanistic effect needs to be clearer in both communication and thought.

6. Species richness generates bias
This is a misunderstanding of sampling theory. All samples will converge to the parametric mean, if the sample is random. The non-random assortment of species is the pattern we are trying to test. The need to correct for species richness is the result of confusion between bias and precision. It is clear that the claim that richness generates bias in estimates of means is without foundation.
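The bias-versus-precision distinction is easy to check with a small simulation. The sketch below is an illustration only: the sample size stands in for the species richness of a cell, and the population, seed and sample sizes are hypothetical choices, not values from the paper.

```python
import random
import statistics

def sample_means(population, sample_size, n_samples, rng):
    """Means of repeated random samples of a fixed size."""
    return [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]

rng = random.Random(42)  # fixed seed so the run is repeatable
population = [rng.gauss(10.0, 2.0) for _ in range(10_000)]
true_mean = statistics.fmean(population)

# "Species-poor" cells (5 values) versus "species-rich" cells (50 values).
poor = sample_means(population, 5, 2000, rng)
rich = sample_means(population, 50, 2000, rng)

# Bias: how far the average estimate sits from the true mean.
bias_poor = statistics.fmean(poor) - true_mean
bias_rich = statistics.fmean(rich) - true_mean

# Precision: how much the estimates scatter around their own average.
spread_poor = statistics.stdev(poor)
spread_rich = statistics.stdev(rich)

print(f"bias:   poor={bias_poor:+.3f}  rich={bias_rich:+.3f}")
print(f"spread: poor={spread_poor:.3f}  rich={spread_rich:.3f}")
```

Both biases should sit near zero, while the spread is clearly larger for the small samples: low richness costs precision, not accuracy.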

7. The earth is round (P<0.05)
P-values and AIC/BIC are not complementary tests for model evaluation. Either the model should be compared to the null (as in p-value) or the most parsimonious model should be chosen (AIC/BIC). CART can lend itself to model selection based on information theory.

8. Spatial processes explain spatial patterns

Legendre (1993) provided a heuristic method for distinguishing environmental and spatial structure in ecological data by means of a partial regression (or constrained ordinations) that partitions ‘(a) nonspatial environmental variation’, ‘(b) spatially structured environmental variation’ and ‘(c) spatial variation of the target variable(s) that is not shared by the environmental variables’ (p. 1666). His use of the language was careful, and this method is now widely used, but it is not uncommon to read that (c) is the effect of pure space, or the effect of spatial processes. Is it?
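For reference, the three fractions are usually obtained by subtracting R² values from three regressions (environment only, space only, both together). The sketch below shows that arithmetic; the function name and the R² values are hypothetical, and nothing here says what fraction (c) actually means, which is the point of this sin.

```python
def partition_variation(r2_env, r2_space, r2_both):
    """Split explained variation into fractions (a), (b), (c) and the residual.

    r2_env   : R² of the model with environmental predictors only
    r2_space : R² of the model with spatial predictors only
    r2_both  : R² of the model with both sets of predictors
    """
    a = r2_both - r2_space            # nonspatial environmental variation
    c = r2_both - r2_env              # spatial variation not shared with environment
    b = r2_env + r2_space - r2_both   # spatially structured environmental variation
    unexplained = 1.0 - r2_both
    return {"a": a, "b": b, "c": c, "unexplained": unexplained}

# Hypothetical R² values for illustration.
fractions = partition_variation(r2_env=0.5, r2_space=0.4, r2_both=0.6)
print(fractions)
```

Fraction (c) is only what the spatial terms explain after the measured environment is accounted for, so it can just as easily reflect unmeasured environmental variables as any spatial process.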

8 and a half. Spatial autocorrelation causes red shifts in regression models

Overemphasis on the importance of broad-scale (vs. local) predictors is called a red shift. If anything, we have this backwards. Range maps contain false positives, and survey data contain false negatives. Range maps are created by filling in ‘presences’ between points, meaning that closer cells will have more distortion than distant cells. Of course, the level of distortion is grain dependent, but so are the processes that influence diversity.

Can changes in the distributions of resident birds in China over the past 50 years be attributed to climate change?

Among vertebrates, birds may be the most sensitive to climate change. Over the past 100 years, the global mean air temperature has increased by 0.85°C. In the last decade, this shift in temperature has been accompanied by a northward shift of bird species in China. The authors use species distribution models to ask whether rising temperature caused the changes in the ranges of 9 resident bird species (20 subspecies) over the last 50 years.

The 9 chosen species are endangered in China and have large point distribution data sets; additionally, these birds have been found outside their historical boundaries in recent years. Given that the dataset consists of presence-only data and that the biotic and abiotic variables are uncertain, the authors used a fuzzy envelope model trained on data from 1951-1960. Climate factors were chosen based on their influence on environmental suitability for reproduction. From this, the suitability of each grid cell was calculated for each year between 1961 and 2010. The total suitability of each grid cell was calculated by summing the suitability across the years. Model accuracy was evaluated with the kappa statistic (k), using 1951-1960 as the baseline for each decade.
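As a reminder of what the kappa statistic measures, here is a minimal sketch of Cohen's kappa for two binary suitability maps flattened to lists of 0/1 cells. The maps and the function name are hypothetical, and whether the authors used exactly this two-class form is an assumption.

```python
def cohen_kappa(baseline, predicted):
    """Agreement between two binary maps, corrected for chance agreement."""
    n = len(baseline)
    if n == 0 or n != len(predicted):
        raise ValueError("maps must be non-empty and of equal length")
    observed = sum(b == p for b, p in zip(baseline, predicted)) / n
    p_base = sum(baseline) / n   # share of suitable cells in the baseline map
    p_pred = sum(predicted) / n  # share of suitable cells in the predicted map
    expected = p_base * p_pred + (1 - p_base) * (1 - p_pred)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both maps are constant")
    return (observed - expected) / (1 - expected)

# Hypothetical 10-cell maps: baseline decade versus a later decade.
baseline_map = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
later_map = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

kappa = cohen_kappa(baseline_map, later_map)
print(round(kappa, 3))
```

A kappa of 1 means the later map matches the baseline exactly, and values near 0 mean agreement no better than chance.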

[Figure: Wu, Fig. 7]

The range centers of 7 species shifted northward, 6 eastward, and 3 westward. The suitable range of 9 subspecies increased with climate change, while the others exhibited no change.

Novel methods for the design and evaluation of marine protected areas in offshore waters

Leathwick, J., Moilanen, A., Francis, M., Elith, J., Taylor, P., Julian, K., Hastie, T. and Duffy, C. (2008), Novel methods for the design and evaluation of marine protected areas in offshore waters. Conservation Letters, 1: 91–102. doi: 10.1111/j.1755-263X.2008.00012.x


Marine protected areas (MPAs) are essential to buffer marine populations from human impact. While there is consensus in favor of a global network of MPAs, they currently protect only 0.6% of the oceans. A major hurdle is determining areas that maximize conservation benefits while minimizing economic loss to fisheries (due to excessive reserve size). This paper provides an analytical guide to finding this balance while incorporating:

(1) realistic interpolation of species distributions based on biological and environmental data; (2) ability to handle relatively fine-scale data over large geographic areas; (3) obviation of the need for prior definition of planning units; and (4) identification of a nested set of reserve solutions that comprehensively describe trade-offs between conservation benefits and reserve extent.

The authors applied this guide to the waters off the coast of New Zealand, where by law 10% must be MPAs. The distributions of 96 bottom-dwelling fish species were predicted by boosted regression trees (BRT) using catch data from 21,000 research trawls. Environmental predictors were chosen based on functional relevance, including trawl depth, temperature, salinity, primary productivity, and zone of ocean mixing/currents. The BRTs were built using 17,000 trawls and validated on the remainder. Given that the data were zero-skewed (many absences), 2 BRT models were built for each species: the first predicted the probability of catch from presence/absence data, while the second was fit to trawls where that species occurred. Both models were evaluated with AUC. These models were then used to make environment-based predictions of the catch per unit effort of each species for the 1.59 million 1 x 1 km grid cells surrounding New Zealand. These predictions were based on fixed trawl parameters. The probability and catch predictions were then multiplied together to form one predictive data layer for each species. This layer was then fed into MPA design software.
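The two-part structure (probability of presence multiplied by catch when present) can be sketched without any boosted trees. In the illustration below, simple per-stratum averages stand in for the two BRT models; the strata, catches and function names are all hypothetical.

```python
from collections import defaultdict

def fit_two_part(trawls):
    """Fit both parts from (stratum, catch) records.

    Part 1: share of trawls in each stratum that caught the species.
    Part 2: mean catch among the trawls where the species occurred.
    Assumption: stratum averages replace the paper's BRT models.
    """
    counts = defaultdict(int)
    presences = defaultdict(int)
    positive_total = defaultdict(float)
    for stratum, catch in trawls:
        counts[stratum] += 1
        if catch > 0:
            presences[stratum] += 1
            positive_total[stratum] += catch
    p_presence = {s: presences[s] / counts[s] for s in counts}
    catch_if_present = {
        s: positive_total[s] / presences[s] if presences[s] else 0.0
        for s in counts
    }
    return p_presence, catch_if_present

def expected_catch(stratum, p_presence, catch_if_present):
    """Combine the two parts into one prediction layer value."""
    return p_presence[stratum] * catch_if_present[stratum]

# Hypothetical zero-heavy trawl records: (depth stratum, catch).
trawls = [
    ("shallow", 0), ("shallow", 4), ("shallow", 0), ("shallow", 8),
    ("deep", 0), ("deep", 0),
]
p_presence, catch_if_present = fit_two_part(trawls)
shallow = expected_catch("shallow", p_presence, catch_if_present)
deep = expected_catch("deep", p_presence, catch_if_present)
print(shallow, deep)
```

Splitting the model this way keeps the many zero catches from dragging down the abundance estimate, while the product still reflects them in the final layer.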

Depth, temperature, and salinity have the strongest contributions to predicting species distributions. Together they accounted for roughly half of the variation in catch. The distribution BRT models had high predictive ability when assessed using cross validation and in predicting independent trawls (AUC range 0.86 – 0.99).
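AUC comes up in several of these papers, so a small sketch of what it measures may help: the probability that a randomly chosen presence gets a higher predicted score than a randomly chosen absence. The scores and the function name below are hypothetical, and the pairwise loop is chosen for readability, not speed.

```python
def auc(scores_present, scores_absent):
    """Share of presence/absence pairs ranked correctly (ties count half)."""
    wins = 0.0
    for p in scores_present:
        for a in scores_absent:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5
    return wins / (len(scores_present) * len(scores_absent))

# Hypothetical predicted probabilities for validation trawls.
present = [0.9, 0.8, 0.4]
absent = [0.7, 0.3, 0.2, 0.1]

value = auc(present, absent)
print(round(value, 3))
```

An AUC of 0.5 means the model ranks no better than chance, which puts the reported 0.86 to 0.99 range in context.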

New trends in species distribution modelling

Zimmermann, N. E., Edwards, T. C., Graham, C. H., Pearman, P. B. and Svenning, J.-C. (2010), New trends in species distribution modelling. Ecography, 33: 985–989. doi: 10.1111/j.1600-0587.2010.06953.x


 

*Keep in mind this was written in 2010*

From 2000 to 2010, SDMs underwent rapid development, taking advantage of new computational resources. Major areas of improvement include:

  1. implementation of new statistical models
  2. the evaluation of sampling design on performance
  3. sample size and prevalence impact on accuracy
  4. removal of spatial autocorrelation from model fitting
  5. comparison of a range of statistical methods
  6. model evaluation

More recent studies have shifted focus to clarification of the niche concept, model parameterization schemes, model selection, model evaluation and variable selection methods.

The papers focus on five active areas of research involving SDMs, including: 1) historical legacies; 2) niche stability and evolution; 3) biotic interactions; 4) the importance of sample designs; and 5) species invasions. We believe these papers set the stage for future SDM research questions, and represent several next logical steps in SDM research and application.

1) Legacy of history: The effect of history on range size and distribution patterns is generally not considered; in other words, ranges are assumed to be in equilibrium. However, violating this assumption can lead to incorrect conclusions.

2) Niche stability and evolution: Niche stability can be thought of as a measure of phylogenetic conservation. Stable niches of closely related species will have similar environmental constraints, while differences can be attributed to local adaptation.  SDMs can be used in this area by examining niche response to environmental drivers at a sub-species level.

3) Biotic interactions: Commonly, biotic interactions are ignored when modeling species ranges at large spatial scales. However, inclusion of biotic factors has increased model performance given an environmental disturbance.

4) Design for sampling: Available datasets are highly biased due to haphazard sampling. Exploring the impact of different sampling biases in silico, or through controlled surveys, will guide future SDM sampling bias corrections.

5) Species invasion: SDMs are often used to assess invasion risk. However, the equilibrium assumption in the native range, or novel favorable habitat in the foreign range, may lead to underestimation of that risk. Developing methods to overcome these limitations will greatly improve SDM accuracy with respect to invasions.

 

Static species distribution models in dynamically changing systems: how good can predictions really be?

Zurell, D., Jeltsch, F., Dormann, C. F. and Schröder, B. (2009), Static species distribution models in dynamically changing systems: how good can predictions really be?. Ecography, 32: 733–744. doi: 10.1111/j.1600-0587.2009.05810.x


SDMs are often used to predict changes in species’ distributions under climate change. However, these models implicitly assume equilibrium, and do not incorporate dispersal, demographic processes or biotic interactions explicitly. In order to understand the implications of such assumptions, the authors created a spatially explicit multi-species model on a 2-dimensional lattice of 148 x 113 sites with absorbing boundaries: the system was populated with butterflies and parasitoids that were able to leave, but not return. Climate, which influenced habitat, was assigned to each site and changed with time. Simulations ran for 150 years. Occurrence data, collected by a ‘virtual ecologist’, were used to fit a GLM and a boosted regression tree (BRT).
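To make "absorbing boundaries" concrete, here is a minimal sketch of one dispersal step on such a lattice. It is not the authors' model: the nearest-neighbor random walk, the reading of 148 as width and 113 as height, and all names are assumptions made for illustration.

```python
import random

WIDTH, HEIGHT = 148, 113  # assumption: 148 columns by 113 rows
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def disperse(positions, rng):
    """Move every individual one random step.

    Individuals that step off the lattice are dropped for good,
    which is what makes the boundary absorbing.
    """
    survivors = []
    for x, y in positions:
        dx, dy = rng.choice(MOVES)
        nx, ny = x + dx, y + dy
        if 0 <= nx < WIDTH and 0 <= ny < HEIGHT:
            survivors.append((nx, ny))
    return survivors

rng = random.Random(1)
corner = disperse([(0, 0)] * 1000, rng)     # half of the moves leave the lattice
interior = disperse([(74, 56)] * 100, rng)  # no single step can leave
print(len(corner), len(interior))
```

Populations near the edge lose individuals every step, so ranges pushed toward the boundary by a changing climate shrink without any chance of recolonization from outside.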

[Figure: Zurell et al. 2009, Fig. 1]

Under average climate, GLMs and BRTs had high predictive accuracy. Abrupt range shifts caused a loss in predictive power, but it was regained after a small lag period as the system settled at a new equilibrium. Generally, BRTs outperformed GLMs under range expansion, and long-dispersal (vs. short-dispersal) organisms were tracked better.

Application of bioclimatic models coupled with network analysis for risk assessment of the killer shrimp, Dikerogammarus villosus, in Great Britain

Gallardo et al. Application of bioclimatic models coupled with network analysis for risk assessment of the killer shrimp, Dikerogammarus villosus, in Great Britain. Biol Invasions (2012) 14:1265–1278 DOI 10.1007/s10530-011-0154-0


Freshwater systems are particularly prone to invasive species. Propagules lead to established populations when the invaded system matches the species’ ecological requirements. The environmental match between native and foreign systems is commonly modeled using SDMs, with climate as the main driver. Dispersal of established species is limited by hydrological connectivity; this can be modeled as a network. The authors combine these two approaches to model the potential spread of killer shrimp in Great Britain, where it is currently established in 3 confined locations.

First, areas in Great Britain of climatic similarity to the native range were identified. These areas are considered high risk. A total of 248 European occurrences and a set of 6 bioclimatic factors were used to build a 2-class support vector machine (SVM). The data set was split 80/20 into training/testing. Pseudo-absences were drawn from a Europe-wide background and used to evaluate the model via AUC. The minimum training presence was also reported. The SVM model was projected onto Europe, where probabilities could be derived via Platt’s method (i.e., fitting a logistic regression model to the estimated decision values). Model outputs were converted to binary outcomes based on the threshold that maximized specificity and sensitivity. Hydrological networks were used to model 3 different speeds of dispersal: high (100 km/yr), medium (60 km/yr) and low (20 km/yr). Areas that would be colonized within 5 years were considered highest risk.
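The network step can be pictured as a shortest-path search with a distance budget of speed times years. The sketch below uses Dijkstra's algorithm on a toy waterway graph; the node names, link lengths and function name are hypothetical, and the paper's actual network analysis may differ.

```python
import heapq

def colonized_within(network, sources, speed_km_per_yr, years=5):
    """Nodes reachable along the network within speed * years kilometres.

    network: dict mapping node -> list of (neighbour, length_km) links.
    """
    budget = speed_km_per_yr * years
    best = {s: 0.0 for s in sources}
    queue = [(0.0, s) for s in sources]
    heapq.heapify(queue)
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, length_km in network.get(node, []):
            new_dist = dist + length_km
            if new_dist <= budget and new_dist < best.get(neighbour, float("inf")):
                best[neighbour] = new_dist
                heapq.heappush(queue, (new_dist, neighbour))
    return set(best)

# Hypothetical chain of waterbodies A-B-C-D with link lengths in km.
network = {
    "A": [("B", 80)],
    "B": [("A", 80), ("C", 150)],
    "C": [("B", 150), ("D", 250)],
    "D": [("C", 250)],
}
low = colonized_within(network, ["A"], 20)
medium = colonized_within(network, ["A"], 60)
high = colonized_within(network, ["A"], 100)
print(sorted(low), sorted(medium), sorted(high))
```

Intersecting the reachable set with the climatically suitable cells would then give the areas that are both reachable and habitable, which is the combination the authors are after.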

The SVM had a high accuracy score (AUC = 0.97), and the minimum training presence was relatively low (11%). Based on the model, habitat suitability was greatest at altitudes below 500 m, maximum temperatures between 20 and 30°C, minimum temperatures between -5 and 5°C and annual precipitation lower than 1000 mm. Unfortunately, 44% of Great Britain showed climate suitability higher than 50%. Regardless of speed, the network analysis indicated that the north-east part of the study site is at high risk of being invaded in the next 5 years. Areas of highest suitability within Great Britain already support a well-established and abundant population of a Ponto-Caspian species (zebra mussel).

What do we gain from simplicity versus complexity in species distribution models?

Merow, C., Smith, M. J., Edwards, T. C., Guisan, A., McMahon, S. M., Normand, S., Thuiller, W., Wüest, R. O., Zimmermann, N. E. and Elith, J. (2014), What do we gain from simplicity versus complexity in species distribution models?. Ecography, 37: 1267–1281. doi: 10.1111/ecog.00845


The variety of methods and implementations of SDMs allows for a wide range of complexity; however, it is critical to match complexity to study objectives for robust inference. On one hand, “under fit” models insufficiently describe observed occurrence-environment relationships, risking misunderstanding of the factors shaping species distributions. On the other hand, “over fit” models risk inadvertently ascribing pattern to noise or building opaque models. Finding the balance between over- and under-fit models must be constrained by the attributes of the data and the study objective rather than by traditional model selection. The authors characterize model complexity by the shape of the inferred occurrence-environment relationships (see Table 1). This paper develops guidelines for deciding the appropriate level of model complexity, as outlined in Fig 1.

[Figure: wk5_fig]

Ecologists’ preferences for simple or complex models are often influenced by their past experience with data types and questions, rather than by philosophical approach.

Testing projected wild bee distributions in agricultural habitats: predictive power depends on species traits and habitat type

Marshall et al. Testing projected wild bee distributions in agricultural habitats: predictive power depends on species traits and habitat type. Ecology and Evolution 2015; 5(19): 4426–4436. DOI: 10.1002/ece3.1579


Pollinators are ecologically and economically important, but have been in decline. Some conservation initiatives have been implemented, but their effectiveness depends on the characteristics of the surrounding landscape and other environmental variables. Creating species distribution models (SDMs) for wild bees can be challenging given their high mobility. Additionally, SDMs often use data aggregated over a number of years and are rarely validated with external data. The authors examine the performance of SDMs in correctly predicting wild bee occurrences from field surveys. Furthermore, they attempt to identify species and/or traits that are better suited to SDMs.

They expect species with highly specialized habitat needs or rare species to have higher predicted habitat suitability by the SDM. Additionally, the authors expect better performance in agricultural areas that are stable, such as orchards, than in agriculture subject to crop rotation.

The distribution of wild bees in the Netherlands was modeled using a total of 43,989 observations for 193 species across 25 genera. Records dated back as early as 1990. The MAXENT model included 13 variables: seven land use, five climate, and elevation. Background points were sampled from areas where wild bee species had been found since 1990. AUC values were calculated from a 10-fold cross-validation scheme, and the final model was validated with independent field surveys from agricultural sites.

The performance of the SDM in predicting wild bee occurrences in field surveys depended on species traits, target habitat, and sampling technique. Generally, the model performed better for highly specialized species with restricted habitats. This is promising, given that most species identified for conservation purposes are specialists. Many species were found in habitats predicted to be unsuitable, but this is most likely due to seasonal changes in crop flowering or crop rotation that are not captured in the SDM. This study demonstrates the need to incorporate more specific information about landscape type and crop type, including fine-scale vegetation and information on seasonal flower availability, into SDMs used for conservation purposes.

Predicting the conservation status of data-deficient species

Bland, L. M., Collen, B., Orme, C. D. L. and Bielby, J. (2015), Predicting the conservation status of data-deficient species. Conservation Biology, 29: 250–259. doi: 10.1111/cobi.12372


One-sixth of the >65,000 species assessed by the IUCN are classified as data deficient (DD) due to a lack of information on taxonomy, geographic distribution, population status, or threats. Field surveys of DD species are not feasible, but the large amounts of life history, ecological, and phylogenetic information that are available can be combined for a comparative study of extinction risk based on species trait data.

The authors address the following questions:

  1. What are the relative abilities of 7 different ML methods (classification trees, random forests, boosted trees, k nearest neighbors, support vector machines, neural networks, and decision stumps) to predict extinction risk in terrestrial mammals?

Random forests, boosted trees, support vector machines, and neural networks performed particularly well. Classification trees and k nearest neighbors performed relatively poorly.

  2. How accurately can those methods predict current geographical patterns of extinction risk?

The presented models were less likely to assign narrow-ranging non-threatened species and wide-ranging threatened species to their correct status.

  3. Using the models obtained, what is the predicted level of extinction risk faced by DD species?

313 of 493 (63.5%) DD species are predicted to be threatened; this increases the global proportion of threatened terrestrial mammals from 22% to 27%.

  4. How do our findings change current geographical patterns of extinction risk for terrestrial mammals?

Not really.

Methods: The authors collated a database of 4461 terrestrial mammals classed as either non-threatened, threatened, vulnerable, endangered, critically endangered or data deficient. Additionally, life history traits, biogeographic distribution and habitat suitability were collected for each mammal. ML models (to predict threatened/non-threatened status) were developed using all mammals, along with separate models for rodents, bats, primates, and carnivores to explore the taxonomic transferability of ML predictive accuracy. Highly correlated and low-variance variables were removed before fitting any models. The training/testing (75/25) data set did not include any DD species. All models were tuned to maximize AUC values. The Youden index was used to set the probability threshold that distinguishes between the two classes. The range maps of species predicted to be threatened (by the best global ML model) were then compared to current global patterns of extinction risk.
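The Youden index is J = sensitivity + specificity - 1, and the chosen threshold is the one that maximizes J. A minimal sketch of that search follows; the predicted probabilities, labels and function name are hypothetical, and scanning only the observed probabilities as candidate thresholds is a simplifying assumption.

```python
def youden_threshold(probabilities, labels):
    """Return (threshold, J) maximizing sensitivity + specificity - 1.

    labels: 1 = threatened, 0 = non-threatened.
    A species is classed as threatened when its probability >= threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_threshold, best_j = None, -1.0
    for threshold in sorted(set(probabilities)):
        true_pos = sum(
            1 for p, y in zip(probabilities, labels) if p >= threshold and y == 1
        )
        true_neg = sum(
            1 for p, y in zip(probabilities, labels) if p < threshold and y == 0
        )
        j = true_pos / positives + true_neg / negatives - 1.0
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j

# Hypothetical model outputs for seven assessed species.
probs = [0.1, 0.2, 0.35, 0.4, 0.6, 0.8, 0.9]
status = [0, 0, 0, 1, 0, 1, 1]

threshold, j = youden_threshold(probs, status)
print(threshold, round(j, 3))
```

Because J weights sensitivity and specificity equally, the threshold does not simply default to 0.5 when threatened and non-threatened species are unevenly represented.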