The recently published paper by Jane Elith, Catherine Graham et al., “Novel methods improve prediction of species’ distributions from occurrence data” (EG06), is sure to be a landmark study in the field. EG06 compares 16 modeling methods using 226 well-surveyed species in 6 regions of the world. Measures of statistical skill on held-back data show a spread across a wide range of methods, from older methods such as BIOCLIM and DOMAIN, through GARP, GLM and GAM, to the newer arrivals from machine learning (MAXENT, BRT) and the community-based method GDM, prompting the conclusion that “novel methods improve prediction”. The work of a great many people is appreciated, as these results will no doubt be very helpful to many biodiversity modellers in the future.
Why novel methods work
EG06 attributes the success of the newer methods to their ability to represent the complexity of the relationships between species and their environment:
One feature that they all share in common is a high level of flexibility in fitting complex responses
The same thing was found in the early 80’s when novel machine learning methods — particularly neural nets, decision trees (CART), and genetic algorithms (GARP) — were first used for species distribution modeling. When BIOCLIM and GLM were the only species distribution methods, these early experiments showed heuristic approaches from machine learning would benefit the field.
The opposing view, still widely held, is that approaches should be based strictly on ecological theory, such as using BIOCLIM to represent the Hutchinsonian niche. This view is valid too, if your primary aim is to evaluate the theory and not necessarily to maximize accuracy. This is a familiar theme in ML: real-world performance requires heuristic complexity at the expense of theoretical elegance. Happily, the species prediction problem has come to the attention of leading-edge researchers in present-day ML, and both theory and practice will benefit from the interplay.
Historic progression of models
In addition to the complexity dimension, the range of statistical skill across methods also represents a historic progression. GARP in the early 80’s used genetic algorithms to combine the major methods of the time (BIOCLIM, GLM and surrogate models) into a multi-model rule-set. The strategy of using multiple models for prediction is also used in the highest performing method in EG06, BRT (boosted regression trees). Under ideal conditions the ensemble approach in GARP would be expected to be better than the worst of the methods it uses (BIOCLIM), but no better than the best of them (GLM), and this is what EG06 shows. It is likely that other high-performing approaches such as BRT have similarly evolved from experience with earlier regression tree algorithms such as classification and regression trees (CART).
Unanswered issues
The major unanswered issue in species distribution modeling is environmental data selection. In EG06, selection of environmental data reflects the typical practices:
The environmental data used for each region were determined according to their relevance to the species being modeled (Austin 2002) as determined by the data provider (Tables 1 and 3).
EG06 does not address the problem of environmental data selection. No more than about 13 environmental data sets were used in each region, and no statistics were provided on the predictive power of these data sets. This in no way undermines their conclusions. However, people want to develop the best models possible. It has been shown in “Improving ecological niche models by data mining large environmental datasets for surrogate models” (S06) that monthly climate variables may be much more effective than annual climate averages, suggesting that the variables typically used are not the best possible.
Just as GARP arose out of concern with arbitrariness in the use of functional forms and generalized over them, WhyWhere arises out of concern with arbitrariness in the selection of environmental data. This problem has only become apparent as the number of available data sets has burgeoned. There is also the large-dataset problem of modeling species distributions in the marine environment, where the depth parameter multiplies the number of possible variables enormously (e.g. nutrient levels at each depth).
Where WhyWhere fits in
One of the findings in “Effects of sample size on accuracy of species distribution models” (SP04), a major comparison of 1060 species in Mexico using logistic regression (LR), GARP and a simple surrogate model (SS), was that the old SS method performed surprisingly well. This was interesting, as a very simple approach can be the basis for a very high-performance algorithm, enabling the analysis of large data sets. Rather than use a more theoretically precise approach to clustering the environment, such as k-means, which I have found to be inefficient and unreliable, a quicker, more reliable heuristic method for classifying colors in images was used to develop a practical approach to data mining on the order of 1000s of datasets, called WhyWhere.
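To make the idea of a simple, low-dimensional model concrete, here is a minimal Python sketch of a binned model on a single environmental variable. It is not the WhyWhere code: equal-width binning merely stands in for the image color classification heuristic, and the function names (quantize, fit_binned_model, predict) and the sample values are hypothetical.

```python
from collections import Counter


def quantize(value, lo, hi, n_bins):
    """Map a continuous value to one of n_bins equal-width categories."""
    if hi <= lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(max(idx, 0), n_bins - 1)  # clamp values at or beyond the range


def fit_binned_model(presence_values, background_values, n_bins=4):
    """For each bin, estimate the share of points that are presences."""
    all_values = list(presence_values) + list(background_values)
    lo, hi = min(all_values), max(all_values)
    pres = Counter(quantize(v, lo, hi, n_bins) for v in presence_values)
    back = Counter(quantize(v, lo, hi, n_bins) for v in background_values)
    scores = {}
    for b in range(n_bins):
        total = pres[b] + back[b]
        scores[b] = pres[b] / total if total else 0.0
    return {"lo": lo, "hi": hi, "n_bins": n_bins, "scores": scores}


def predict(model, value):
    """Look up the piece-wise score of the bin that the value falls in."""
    b = quantize(value, model["lo"], model["hi"], model["n_bins"])
    return model["scores"][b]


# Made-up values of one environmental layer at presence and background points.
presence = [18, 19, 20, 21, 22]
background = [0, 5, 10, 15, 20, 25, 30, 35]

model = fit_binned_model(presence, background, n_bins=4)
print(predict(model, 20))  # 5/7, about 0.71: the bin holding all presences
print(predict(model, 5))   # 0.0: a bin with background points only
```

The model is just a lookup table with one score per bin, which is why fitting it is cheap enough to repeat over a very large number of candidate layers.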
It was subsequently shown in S06 that a relatively simple, low-dimensional SS model searching a very large set of data could outperform a more complex model using a small set of general environmental variables. This is because some specific variables, particularly monthly climate variables, correlate well with most species, but which ones do so varies from species to species (see Surprising finding #3 for recent results). The opposing view is that variable selection should follow ecological theory, and that a small set of climate variables represents ecological determinants adequately. This is rhetorically similar to the arguments for ecologically-based models: valid, if and only if you are willing to sacrifice maximum accuracy, giving up real-world performance for theoretical elegance.
There is also a larger agenda. Just as GARP and other algorithms demonstrated the value of machine learning approaches in the 80’s, WhyWhere is promoting biodiversity modeling by demonstrating the value of data mining approaches. This strategy enables the best minds in computer science to engage productively with the biodiversity field, and raises the profile of biodiversity modeling, still a minor player compared with climate and population modeling.
Role of WhyWhere
WhyWhere can be used in a ‘pre-modeling’ stage. Points can be run through the server here just to see which variables give the greatest accuracy. After objectively determining the best variables from the 528 terrestrial variables currently available, you can include them in other approaches if required.
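A pre-modeling screen of this kind can be sketched roughly as follows. This is an illustration under assumptions, not the algorithm the WhyWhere server runs: each candidate layer gets a one-variable binned model fitted on training points, and layers are ranked by AUC on held-back points. The layer names, data values and helper functions (auc, fit_binned, rank_layers) are all made up for the example.

```python
def auc(pos_scores, neg_scores):
    """Chance that a random presence outscores a random background point
    (ties count as half a win)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def fit_binned(presence, background, n_bins=4):
    """Fit a one-variable binned model and return it as a scoring function."""
    values = list(presence) + list(background)
    lo, hi = min(values), max(values)

    def bin_of(v):
        if hi <= lo:
            return 0
        return min(max(int((v - lo) / (hi - lo) * n_bins), 0), n_bins - 1)

    pres = [0] * n_bins
    back = [0] * n_bins
    for v in presence:
        pres[bin_of(v)] += 1
    for v in background:
        back[bin_of(v)] += 1
    scores = [p / (p + b) if p + b else 0.0 for p, b in zip(pres, back)]
    return lambda v: scores[bin_of(v)]


def rank_layers(layers, n_bins=4):
    """Rank candidate layers by AUC on held-back points, best first."""
    results = []
    for name, d in layers.items():
        score = fit_binned(d["train_presence"], d["train_background"], n_bins)
        skill = auc([score(v) for v in d["test_presence"]],
                    [score(v) for v in d["test_background"]])
        results.append((name, skill))
    return sorted(results, key=lambda r: r[1], reverse=True)


# Toy data: the species clusters on layer_a but is spread evenly on layer_b.
background = [5, 10, 15, 20, 25, 30, 35, 40]
layers = {
    "layer_a": {"train_presence": [20, 21, 22, 23], "train_background": background,
                "test_presence": [21, 22], "test_background": [8, 12, 33, 38]},
    "layer_b": {"train_presence": [10, 20, 30, 40], "train_background": background,
                "test_presence": [12, 33], "test_background": [8, 21, 22, 38]},
}

ranking = rank_layers(layers)
print(ranking)  # [('layer_a', 1.0), ('layer_b', 0.5)]
```

The top-ranked layers from such a screen are the ones you would then carry into another modeling method. With hundreds of layers the same loop applies unchanged, since each layer is scored independently.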
The advantages of using WhyWhere in a pre-modeling stage are:
- Greater objectivity in environmental variable selection
- Applicability to environments with a large number of variables (e.g. marine environments, with their depth dimension)
- Generality to applications other than species distribution (e.g. house prices and climate)
- Potentially more accurate models using optimal datasets
- No need for each person to develop a new set of variables
Research questions:
1. On average, it appears that the Worldclim monthly climate average datasets are most frequently the highest performers (see Surprising finding #3 for recent results). While BIOCLIM may have performed relatively poorly in EG06, these preliminary results suggest that the datasets largely associated with it may be very powerful. Will combining monthly climate and other variables with the novel methods increase skill?
2. Can the novel methods be run on 1000 environmental variables, many of which are categorical with 100s of categories each? For example regarding BRT:
Therefore, it is not prudent to analyze categorical dependent variables (class variables) with more than, approximately, 100 or so classes.
The approach of using low-dimensional models with a piece-wise fit seems necessary for achieving reliable high performance on large numbers of environmental data sets.
3. How should we best address the problem of burgeoning numbers of environmental correlates? This is the next nettle that must be grasped to continue to move the field forward.
4. Are the data and algorithms in EG06 freely available for benchmarking other models?
Conclusions
EG06 provides an objective and comprehensive evaluation of the statistical skill of a wide range of methods at predicting the distribution of well-surveyed species from presence-only data using a small number of generic data sets, identifying the best methods for future studies. The study did not address the important role of environmental data set selection. Results and logic suggest that a wider range of environmental data than is currently used, such as the monthly climate averages, will improve accuracy even more. People should start using them.
References
EG06 – Jane Elith, Catherine H. Graham, Robert P. Anderson, Miroslav Dudík, Simon Ferrier, Antoine Guisan, Robert J. Hijmans, Falk Huettmann, John R. Leathwick, Anthony Lehmann, Jin Li, Lucia G. Lohmann, Bette A. Loiselle, Glenn Manion, Craig Moritz, Miguel Nakamura, Yoshinori Nakazawa, Jacob McC. M. Overton, A. Townsend Peterson, Steven J. Phillips, Karen Richardson, Ricardo Scachetti-Pereira, Robert E. Schapire, Jorge Soberón, Stephen Williams, Mary S. Wisz and Niklaus E. Zimmermann, 2006. Novel methods improve prediction of species’ distributions from occurrence data. Ecography 29: 129-.
S06 – Stockwell D.R.B., 2006. Improving ecological niche models by data mining large environmental datasets for surrogate models. Ecological Modelling 192: 188–196.
SP04 – Stockwell D.R.B., Peterson A.T., 2002. Effects of sample size on accuracy of species distribution models. Ecological Modelling 148 (1): 1–13.
Worldclim – Hijmans, R.J., S.E. Cameron, J.L. Parra, P.G. Jones and A. Jarvis, 2005. Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology 25: 1965-1978.