Some have been asking for an explanation of WhyWhere and how it fits in relation to other methods, particularly GARP. Although the details are in the paper, they are in a more academic form, so I thought I would try to explain it here.
Here is a nice schematic prepared by Jean Tate describing the basic one-dimensional model output from a run on the Yellow Star Thistle, illustrated as a frequency histogram. A 2D model would be similar, just with columns over two environmental dimensions.
The blue columns show the frequency, B, of each environmental class, in this case standard deviation of precipitation. You can see the distribution of values is very skewed. The red histogram is the frequency, P, of those values in the cells where the species occurrences are. You can see there is a difference between the distributions, with the YST locations showing a tendency toward greater SD. Finally, the green crosses are the posterior probability of the species being present in a category, i.e. P/(P+B). This is virtually identical to the approach of surrogate models.
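The per-category calculation can be sketched in a few lines. This is an illustration of the P/(P+B) formula only, with made-up counts rather than the actual YST data, and the function name is my own:

```python
import numpy as np

def posterior_per_class(background_counts, presence_counts):
    """Posterior probability of presence in each class: P / (P + B)."""
    b = np.asarray(background_counts, dtype=float)
    p = np.asarray(presence_counts, dtype=float)
    # Normalise counts to frequencies so the two distributions are comparable.
    B = b / b.sum()
    P = p / p.sum()
    # Posterior is defined only where either distribution has mass.
    return np.where(P + B > 0, P / (P + B), 0.0)

# Illustrative numbers: a skewed background, with presences shifted
# toward the higher-SD classes, as in the schematic.
background = [40, 25, 15, 10, 6, 4]
presence   = [ 2,  4,  8, 12, 16, 18]
post = posterior_per_class(background, presence)
print(post)  # rises toward 1 in the higher classes
```

Categories where the red (presence) frequency dominates the blue (background) frequency get a posterior near 1; categories the species avoids get a posterior near 0.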
Why is it different? Notice the form of the posterior probability: exponential and asymptotic, not uni-modal as you usually see. Because WW is based on discrete intervals, it can detect strong associations with virtually any distribution. It does not require assumptions about the form of the distributions (Gaussian, sigmoidal, exponential, etc.) as an input; it shows the distribution of the most accurate variable as an output. This gives one more confidence that if a significant association exists, it will be found.
Secondly, the 2D and 3D versions capture simple interactions between variables. There are no additive terms like you find in regression; I never could understand what it meant to add temperature and precipitation together. By looking at these figures you can see the form of the response in a low-dimensional space, which is easier to visualise and interpret. Based on preliminary work comparing with GARP, there is no loss in predictive accuracy from limiting the model to fewer variables, particularly as the most accurate variables are selected from thousands of potential correlates.
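To see how a 2D discretisation captures an interaction with no additive terms, here is a toy sketch. The synthetic "species" occurs only where two variables are jointly high, and the posterior table picks this up cell by cell. All data and class boundaries here are invented for illustration, not taken from WhyWhere:

```python
import numpy as np

rng = np.random.default_rng(0)
temp = rng.uniform(0, 30, 5000)        # background temperature values
precip = rng.uniform(0, 200, 5000)     # background precipitation values
# Toy interaction: present only where it is BOTH hot and wet.
occ = (temp > 18) & (precip > 120)

t_bins = np.linspace(0, 30, 7)         # 6 temperature classes
p_bins = np.linspace(0, 200, 7)        # 6 precipitation classes

# Background (B) and presence (P) frequencies per 2D category.
B2, _, _ = np.histogram2d(temp, precip, bins=[t_bins, p_bins])
P2, _, _ = np.histogram2d(temp[occ], precip[occ], bins=[t_bins, p_bins])
B2 = B2 / B2.sum()
P2 = P2 / P2.sum()

# Same posterior formula, now per (temp class, precip class) cell.
post2 = np.where(P2 + B2 > 0, P2 / (P2 + B2), 0.0)
# post2 is high only in the hot-and-wet corner of the table; a purely
# additive model would smear probability along both margins instead.
```

Because each cell gets its own posterior, the joint response surface is read directly off the table rather than assumed to decompose into main effects.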
Actually, in tests with the same set of environmental variables there appears to be little difference between the results of GARP and WhyWhere. Below is a comparison of the predictions for the Yellow Star Thistle. Where WhyWhere can potentially be more accurate is in selecting a few very accurate variables from a large data set (e.g. using monthly climate averages) versus developing a model with a larger number of less correlated variables (such as annual climate averages).
Here is a set of images for the YST comparing GARP and WhyWhere predictions on the same datasets, and on the Corvallis data sets with different approaches to integrating transportation data. I hope to make this paper available soon.
Thirdly, this simple approach allows for a range of efficient measures for selecting the optimal combination of variables. Currently I use expected accuracy as a guide, calculated by assuming all categories with a posterior greater than 0.5 predict presence, and all others absence. The approach would admit other measures; a Chi-square test of the difference between the two distributions is an obvious one. I have also been experimenting with entropy measures, as used in the MaxEnt algorithms.
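One plausible reading of that expected-accuracy measure can be sketched as follows. It thresholds the posterior at 0.5 and weights correct presence predictions by P and correct absence predictions by B, with the two distributions contributing equally (mirroring the equal weighting implicit in P/(P+B)). The exact weighting in the paper may differ; the function name and example counts are mine:

```python
import numpy as np

def expected_accuracy(background_counts, presence_counts):
    """Expected accuracy: categories with posterior > 0.5 predict presence,
    all others absence; score correct classifications under P and B."""
    b = np.asarray(background_counts, dtype=float)
    p = np.asarray(presence_counts, dtype=float)
    B = b / b.sum()
    P = p / p.sum()
    post = np.where(P + B > 0, P / (P + B), 0.0)
    predict_presence = post > 0.5
    # Correct when predicting presence where the species occurs (mass of P),
    # or absence where it does not (mass of B); equal weight to each.
    return 0.5 * P[predict_presence].sum() + 0.5 * B[~predict_presence].sum()

ea = expected_accuracy([40, 25, 15, 10, 6, 4], [2, 4, 8, 12, 16, 18])
print(ea)
```

A variable (or pair of variables) whose categories separate the two distributions well scores close to 1; one carrying no information scores near 0.5, which makes it a natural ranking criterion when screening many candidate layers.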
Fourthly, WhyWhere comes supplied with access to a large number of global datasets. At last count 950 layers were available for testing for correlates and building models. These are held remotely in a Storage Resource Broker archive and are individually cropped and scaled as needed for a given analysis.
Fifthly, WhyWhere has an innovative architecture based on separating the modelling server from the interface. This allows multiple interfaces to be implemented while retaining a standardized protocol for calling the modelling services.
There are a number of open questions, such as how many categories to use and how best to allocate them. The number of colors needs to be reduced: in the 2D case there are 256×256 potential categories, and in the 3D case 256×256×256. Controlling the number of categories is the main way to control over-fitting. Currently I use Heckbert's median cut, a binning algorithm that allocates more categories to the more frequent color classes in the image. This is a very nonlinear algorithm that has proven itself to produce acceptable color compression over the long term.
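The intuition behind median cut can be shown with a toy version: repeatedly split the most populous box of pixels along its widest channel at the median, so that dense regions of color space end up with more categories. This is a simplified illustration of the idea, not Heckbert's full algorithm or the WhyWhere implementation (it assumes there are more distinct pixels than requested categories):

```python
import numpy as np

def median_cut(pixels, n_categories):
    """Toy median cut. pixels: (N, 3) array of color values.
    Returns a list of index arrays, one per category."""
    boxes = [np.arange(len(pixels))]
    while len(boxes) < n_categories:
        # Split the box holding the most pixels, so frequent color
        # regions receive more categories than sparse ones.
        i = max(range(len(boxes)), key=lambda k: len(boxes[k]))
        idx = boxes.pop(i)
        sub = pixels[idx]
        # Cut along the channel with the widest spread, at the median.
        widest = np.argmax(sub.max(axis=0) - sub.min(axis=0))
        order = np.argsort(sub[:, widest])
        half = len(order) // 2
        boxes.append(idx[order[:half]])
        boxes.append(idx[order[half:]])
    return boxes

rng = np.random.default_rng(1)
pixels = rng.integers(0, 256, size=(1000, 3))  # synthetic 3-band "image"
cats = median_cut(pixels, 16)                  # 16 categories instead of 256^3
```

The number of categories requested here plays the same role as the category count discussed above: fewer boxes means coarser environmental classes and less risk of over-fitting.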