Novel methods continue to improve prediction of species' distributions.

The recently published paper by Jane Elith, Catherine Graham et al., “Novel methods improve prediction of species’ distributions from occurrence data” (EG06), is sure to be a landmark study in the field. EG06 compares 16 modeling methods using 226 well-surveyed species in 6 regions of the world. Measures of statistical skill on held-back data show a spread across a wide range of methods: from older methods such as BIOCLIM and DOMAIN, through GARP, GLM and GAM, to the newer arrivals from machine learning, MAXENT and BRT, and the community-based method GDM. This prompts the conclusion that “novel methods improve prediction”. The work of a great many people is appreciated, as these results will no doubt be very helpful to many biodiversity modellers in the future.

Why novel methods work

EG06 attributes the success of the newer methods to representing complexity of the relationships of species to their environment.

One feature that they all share in common is a high level of flexibility in fitting complex responses

The same thing was found in the early 80’s when novel machine learning methods — particularly neural nets, decision trees (CART), and genetic algorithms (GARP) — were first used for species distribution modeling. When BIOCLIM and GLM were the only species distribution methods, these early experiments showed heuristic approaches from machine learning would benefit the field.

The opposing view, still widely held, is that approaches should be based strictly on ecological theory, such as using BIOCLIM to represent the Hutchinsonian niche. This view is valid too, if your primary aim is to evaluate the theory and not necessarily to maximize accuracy. This is a familiar theme in ML: real world performance requires heuristic complexity at the expense of theoretical elegance. Happily, the species prediction problem has come to the attention of the leading edge researchers in present day ML, and both theory and practice will benefit from the interplay.

Historic progression of models

In addition to the complexity dimension, the range of statistical skill across methods also represents an historic progression. GARP in the early 80’s used genetic algorithms to combine the major methods of the time, BIOCLIM, GLM and surrogate, into a multi-model rule-set. The strategy of using multiple models for prediction is also used in the highest performing method in EG06, boosted regression trees (BRT). Under ideal conditions the ensemble approach in GARP would be expected to be better than the worst of the methods it uses (BIOCLIM), but no better than the best of the methods (GLM), and this is shown in EG06. It is likely that other high-performing approaches such as BRT have evolved from experience with earlier regression tree algorithms such as classification and regression trees (CART).

Unanswered issues

The major unanswered issue in species distribution modeling is environmental data selection. In EG06, selection of environmental data reflects the typical practices:

The environmental data used for each region were determined according to their relevance to the species being modeled (Austin 2002) as determined by the data provider (Tables 1 and 3).

EG06 does not address the problem of environmental data selection. No more than about 13 environmental data sets were used in each region. No statistics were provided for the power of these datasets. This in no way undermines their conclusions. However, people want to develop the best models possible. It has been shown in “Improving ecological niche models by data mining large environmental datasets for surrogate models” (S06) that monthly climate variables may be much more effective than annual climate averages, suggesting the variables typically used are not the best possible variables.

Just as GARP arose out of concern with arbitrariness in the use of functional forms and generalized over them, WhyWhere arises out of concern with arbitrariness in the selection of environmental data. This problem has only become apparent as the number of data sets available has burgeoned. There is also the large dataset problem of modeling species distribution in the marine environment, where the depth parameter multiplies the number of possible variables enormously (e.g. nutrient levels at each depth).

Where WhyWhere fits in

One of the findings in “Effects of sample size on accuracy of species distribution models” (SP04), a major comparison of 1060 species in Mexico using logistic regression (LR), GARP and a simple surrogate model (SS), was that the old SS method performed surprisingly well. This was interesting, as a very simple approach can be the basis for a very high performance algorithm, enabling the analysis of large data sets. Rather than use a more theoretically precise approach to clustering the environment, such as k-means, which I have found to be inefficient and unreliable, a quicker and more reliable heuristic method for classifying colors in images was used to develop a practical approach to data mining on the order of 1000s of datasets, called WhyWhere.

It was subsequently shown in S06 that a relatively simple, low dimensional SS model searching a very large set of data could outperform a more complex model using a small set of general environment variables. This is because some specific variables, particularly monthly climate variables, correlate well with most species, but the best ones vary from species to species (see Surprising finding #3 for recent results). The opposing view is that variable selection should follow ecological theory, and that a small set of climate variables represents ecological determinants adequately. This is rhetorically similar to the arguments for ecologically-based models. It is valid if, and only if, you are willing to sacrifice maximum accuracy: here too, real world performance comes at the expense of theoretical elegance.

There is also a larger agenda. Just as GARP and other algorithms demonstrated the value of machine learning approaches in the 80’s, WhyWhere is promoting biodiversity modeling by demonstrating the value of data mining approaches. This strategy enables the best minds in computer science to engage productively with the biodiversity field, and promotes biodiversity modeling, still a minor player compared with climate and population modeling.

Role of WhyWhere

WhyWhere can be used in a ‘pre-modeling’ stage. Points can be run through the server here just to see which variables give the greatest accuracy. After objectively determining the best variables from the 528 terrestrial variables currently available, you can include them in other approaches if required.

The advantages of using WhyWhere in a pre-modelling stage are:

  1. Greater objectivity in environmental variable selection
  2. Applicability to environments with a large number of variables (e.g. marine environments with a depth dimension)
  3. Generality to applications other than species distribution (e.g. house prices and climate)
  4. Potentially more accurate models using optimal datasets
  5. No need for each person to develop a new set of variables
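As a rough illustration of what such a pre-modeling stage does, the sketch below scores each candidate variable on its own with a very simple binned model and ranks the variables by accuracy on held-back points. It is not the WhyWhere code: the function names, the equal-width binning, the 70/30 split and the synthetic layers `july_temp` and `noise` are all assumptions made up for the example.

```python
import random


def bin_index(x, lo, hi, n_bins):
    """Map a value to one of n_bins equal-width bins over [lo, hi]."""
    if hi == lo:
        return 0
    i = int((x - lo) / (hi - lo) * n_bins)
    return min(max(i, 0), n_bins - 1)  # clamp values outside the training range


def score_variable(train, test, n_bins=4):
    """Held-back accuracy of a one-variable binned model.

    train, test -- lists of (value, presence) pairs, presence being 0 or 1.
    A bin predicts presence when its training presence rate exceeds the
    overall training presence rate (an arbitrary rule chosen for this sketch).
    """
    xs = [x for x, _ in train]
    lo, hi = min(xs), max(xs)
    present = [0] * n_bins
    total = [0] * n_bins
    for x, y in train:
        b = bin_index(x, lo, hi, n_bins)
        present[b] += y
        total[b] += 1
    overall = sum(y for _, y in train) / len(train)

    correct = 0
    for x, y in test:
        b = bin_index(x, lo, hi, n_bins)
        rate = present[b] / total[b] if total[b] else overall
        predicted = 1 if rate > overall else 0
        correct += int(predicted == y)
    return correct / len(test)


def rank_variables(layers, presence, test_fraction=0.3, seed=0):
    """Rank candidate variables by held-back accuracy, best first.

    layers   -- dict: variable name -> list of values, one per sample point
    presence -- list of 0/1 labels, one per sample point
    """
    idx = list(range(len(presence)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_fraction))
    train_idx, test_idx = idx[:cut], idx[cut:]

    scores = {}
    for name, values in layers.items():
        train = [(values[i], presence[i]) for i in train_idx]
        test = [(values[i], presence[i]) for i in test_idx]
        scores[name] = score_variable(train, test)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Synthetic example: the species occurs only where july_temp exceeds 20,
# while the noise layer carries no information about presence.
n_points = 200
layers = {
    "july_temp": [i * 0.2 for i in range(n_points)],
    "noise": [(i * 37) % 101 for i in range(n_points)],
}
presence = [1 if t > 20 else 0 for t in layers["july_temp"]]

ranking = rank_variables(layers, presence)
for name, accuracy in ranking:
    print(f"{name}: {accuracy:.2f}")
```

The top few names in `ranking` are the ones you would carry forward into another modeling method. With a real set of layers the loop is the same, only longer.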

Research questions:

1. On average it appears the Worldclim monthly climate average datasets are most frequently the highest performers (see Surprising finding #3 for recent results). While BIOCLIM may have performed relatively poorly in EG06, these preliminary results suggest the datasets largely associated with it may be very powerful. Will the combination of monthly climate and other variables with the novel methods increase skill?

2. Can the novel methods be run on 1000 environmental variables, many of which are categorical with 100s of categories each? For example regarding BRT:

Therefore, it is not prudent to analyze categorical dependent variables (class variables) with more than, approximately, 100 or so classes.

The approach of using low-dimensional models with a piecewise fit seems necessary for achieving reliable high performance on large numbers of environmental data sets.

3. How should we best address the problem of burgeoning numbers of environmental correlates? This is the next nettle that must be grasped to continue to move the field forward.

4. Are the data and algorithms in EG06 freely available for benchmarking other models?
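To make the low-dimensional, piecewise fit raised under question 2 concrete, here is a minimal sketch for a single categorical variable: each class simply gets its own estimated probability of presence, so hundreds of classes cost no more than a lookup table. This is an illustration only, not the surrogate model of SP04 or S06. The names `fit_piecewise` and `predict`, the `smoothing` constant and the land cover classes are hypothetical.

```python
from collections import defaultdict


def fit_piecewise(values, presences, smoothing=1.0):
    """Estimate P(presence) separately for each class of one variable.

    values    -- class label of the variable at each sample point
    presences -- 1 where the species was recorded, 0 for background points
    smoothing -- assumed shrinkage constant: sparsely sampled classes are
                 pulled towards the overall presence rate
    """
    counts = defaultdict(lambda: [0, 0])  # class -> [presences, total]
    for v, p in zip(values, presences):
        counts[v][0] += p
        counts[v][1] += 1
    overall = sum(presences) / len(presences)
    table = {
        v: (k + smoothing * overall) / (n + smoothing)
        for v, (k, n) in counts.items()
    }
    return table, overall


def predict(model, value):
    """Look up the fitted rate; classes never seen get the overall rate."""
    table, overall = model
    return table.get(value, overall)


# Hypothetical land cover classes at six sample points.
land_cover = ["forest", "forest", "grass", "grass", "grass", "urban"]
recorded = [1, 1, 0, 1, 0, 0]

model = fit_piecewise(land_cover, recorded)
for cover in ["forest", "grass", "urban", "desert"]:
    print(cover, round(predict(model, cover), 3))
```

Because the fit is one pass over the points and one table per variable, repeating it over a thousand layers is cheap, which is the property that matters when searching large numbers of environmental data sets.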

Conclusions

EG06 provides an objective and comprehensive evaluation of the statistical skill of a wide range of methods at predicting the distribution of well-surveyed species from presence-only data using a small number of generic data sets, identifying the best methods for future studies. The study did not address the important role of environmental data set selection. Results and logic suggest that a wider range of environmental data than is currently used, such as the monthly climate averages, will improve accuracy even more. People should start using them.

References

EG06 – Jane Elith, Catherine H. Graham, Robert P. Anderson, Miroslav Dudík, Simon Ferrier, Antoine Guisan, Robert J. Hijmans, Falk Huettmann, John R. Leathwick, Anthony Lehmann, Jin Li, Lucia G. Lohmann, Bette A. Loiselle, Glenn Manion, Craig Moritz, Miguel Nakamura, Yoshinori Nakazawa, Jacob McC. M. Overton, A. Townsend Peterson, Steven J. Phillips, Karen Richardson, Ricardo Scachetti-Pereira, Robert E. Schapire, Jorge Soberón, Stephen Williams, Mary S. Wisz and Niklaus E. Zimmermann, 2006. Novel methods improve prediction of species’ distributions from occurrence data. Ecography 29: 129-.

S06 – Stockwell D.R.B., 2006. Improving ecological niche models by data mining large environmental datasets for surrogate models. Ecological Modelling 192: 188–196.

SP04 – Stockwell D.R.B., Peterson A.T., 2002. Effects of sample size on accuracy of species distribution models. Ecological Modelling 148 (1): 1-13.

Worldclim – Hijmans, R.J., S.E. Cameron, J.L. Parra, P.G. Jones and A. Jarvis, 2005. Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology 25: 1965-1978.

0 thoughts on “Novel methods continue to improve prediction of species' distributions.”

  1. David, thanks for the thoughtful review of this important data. I think your comment about the environmetal datasets is quite important. I work in the marine environment, with museum-type presence data that comes from a broad range of locations and depths. Climatologies of marine parameters often have a couple dozen depth levels, and the size and number of datasets quickly gets quite large. Some of the methods in the paper were more able than others to handle large numbers of environmental data layers, and your suggestion about using WhyWhere to weed out environmental parameters to a manageable number seems like a good one.

  2. Thanks Karen. Modeling has always consisted of a stage where relevant variables were identified, then models developed and validated using those variables. There are still alternative methods to use at each stage, and whether one method can handle both stages remains an open question. I think it is important to remember there are two stages and not do only one or the other. Regards
