Data prevalence matters when assessing species’ responses using data-driven species distribution models

Fukuda, S. & De Baets, B., 2016. Data prevalence matters when assessing species’ responses using data-driven species distribution models. Ecological Informatics, 32, pp.69–78.

The accuracy of species distribution models (SDMs) depends heavily on the quality and quantity of the data used: both data set size (i.e. the number of data points in a data set) and data prevalence (i.e. the proportion of presences in a data set) matter. Fukuda and De Baets (2016) investigated the effect of prevalence by simulating nineteen sets of virtual species data from real habitat conditions (field observations) and hypothetical habitat suitability curves under four conditions. They then built SDMs to assess the effects of data prevalence on model accuracy and habitat information. The three SDMs they tested were the Fuzzy Habitat Suitability Model (FHSM), Random Forests (RF), and Support Vector Machines (SVMs). The effects of data prevalence were evaluated in terms of model accuracy (AUC and MSE) and habitat information such as species response curves. Data prevalence affected both model accuracy and the assessment of species’ responses, with a stronger influence on the response curves. Its effects on model accuracy were less pronounced for RF and SVMs. Data prevalence also affected the shape of the response curves: curves obtained from a data set with higher prevalence were less dependent on unsuitable habitat conditions, which emphasizes the importance of accounting for data prevalence when assessing species–environment relationships. Taken together, these results show that data prevalence should be controlled for when building SDMs.
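To make the terms concrete, here is a minimal sketch in plain Python of how prevalence, AUC, and MSE can be computed on virtual presence/absence data resampled to different prevalence levels. It is not the authors' procedure. The triangular `suitability` curve, the single habitat variable, the sample sizes, and the choice of nineteen prevalence levels from 0.05 to 0.95 in steps of 0.05 are all assumptions made for illustration, and the "model" scored here is simply the known suitability curve rather than an FHSM, RF, or SVM.

```python
import random


def suitability(x):
    """Hypothetical triangular habitat suitability curve on [0, 1] (assumption)."""
    return max(0.0, 1.0 - 2.0 * abs(x - 0.5))


def prevalence(labels):
    """Data prevalence: proportion of presences (1s) in a data set."""
    return sum(labels) / len(labels)


def resample_to_prevalence(records, target, n, rng):
    """Draw n (x, label) records with replacement so that round(target * n) are presences."""
    presences = [r for r in records if r[1] == 1]
    absences = [r for r in records if r[1] == 0]
    n_pres = round(target * n)
    sample = [rng.choice(presences) for _ in range(n_pres)]
    sample += [rng.choice(absences) for _ in range(n - n_pres)]
    rng.shuffle(sample)
    return sample


def auc(labels, scores):
    """AUC as the probability that a presence outscores an absence (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def mse(labels, scores):
    """Mean squared error between observed labels (0/1) and predicted scores."""
    return sum((y - s) ** 2 for y, s in zip(labels, scores)) / len(labels)


rng = random.Random(0)

# Virtual species: presence is drawn with probability equal to the suitability
# of a single habitat variable x (a stand-in for real field observations).
records = []
for _ in range(1000):
    x = rng.random()
    records.append((x, 1 if rng.random() < suitability(x) else 0))

# Nineteen prevalence levels, 0.05 to 0.95 (assumed spacing).
results = []
for k in range(1, 20):
    target = k / 20
    sample = resample_to_prevalence(records, target, 200, rng)
    labels = [y for _, y in sample]
    scores = [suitability(x) for x, _ in sample]
    results.append((target, prevalence(labels), auc(labels, scores), mse(labels, scores)))

for target, prev, a, m in results:
    print(f"prevalence={prev:.2f}  AUC={a:.3f}  MSE={m:.3f}")
```

Because the scoring function is held fixed, any change in AUC or MSE across rows comes only from the composition of the evaluation data. That is a much narrower question than the one the paper addresses, which refits each model on data of each prevalence.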