Effects of sample size on the performance of species distribution models

Wisz, M.S., Hijmans, R.J., Li, J., Peterson, A.T., Graham, C.H. and Guisan, A., 2008. Effects of sample size on the performance of species distribution models. Diversity and Distributions, 14(5), pp.763-773.

http://onlinelibrary.wiley.com/doi/10.1111/j.1472-4642.2008.00482.x/pdf

 

Wisz et al. set out to address the ways in which limited sample size (which is often a problem when constructing Species Distribution Models) on model performance. The importance of sample size to SDM comes in the fact of the lower uncertainty of parameter estimation with increasing sample size. This importance is compounded by the high dimensionality of the environmental space often being modeled and the fact that the interactions between multiple environmental dimensions are often important. All this serves to increase the total number of parameters to be estimated, placing further demands on sample size. Therefore, the authors chose to test the sensitivity of 12 different species distribution models (see Table 1 in paper for full list) to sample size variation. Models were trained with presence-only data from natural history collections and evaluated with independent presence-absence data from planned surveys, via AUC. All 12 models were trained on 10, 30, and 100 randomly sampled presence points for each species. Each of these training subsets was evaluated for its degree of overlap in environmental space with the evaluation points for that species. Finally, linear mixed effects models were used to determine which factors significantly affect AUC. Smaller sample sizes exhibited significantly lower environmental range overlap than larger samples. LMEs showed a significant effect of the interaction between modeling method and sample size. Nearly all algorithms performed better with more records (exceptions being DOMAIN and LIVES). MaxEnt performed best at low sample sizes and was the second best performer at intermediate and high sample sizes (outperformed by GBM) with intermediate variances at all sizes. BIOCLIM exhibited the lowest variance across all models but also very low AUCs. At 10 records MAXENT, OM-GARP, and DOMAIN all had high AUCs with intermediate variances. As expected, methods that model complex predictor relationships were particularly sensitive to sample size (e.g. GAM, GBM, BRUTO). A major open-ended question of the work is what the difference between randomly subsetted data and “naturally data-depauperate” species records resulting from biological rarity. It stands to reason that the above analyses may not accurately apply to such a situation.

 

wisz et al.