GARP and numbers of data points

A number of issues arise in the analysis of spatial data points. Two that are often raised are too few data points and spatial autocorrelation.

As a general principle, external accuracy (on the test set) increases asymptotically as the number of data points increases, and internal accuracy (on the training set) decreases asymptotically as the number of data points increases. We are most often interested in external accuracy, so more data is better, but the returns are diminishing. There are also plenty of results showing that the shape of the curve varies a lot between methods and species, but it's not clear why.

Spatial autocorrelation occurs when occurrences are sufficiently close together that they essentially duplicate each other. The main problem is that it leads to spuriously high indications of significance, because the usual assumption is that the points are independent. This could create problems if, for example, you were to develop a logistic regression using a stepwise method and the variables were at different scales or resolutions. I would then expect the finer-scale variables to produce misleadingly high correlations.

What GARP does is map all the environmental layers and the occurrences onto a grid of the same extent and resolution. This reduces the number of data points, with more being eliminated at coarser resolutions. I believe this gives a more authentic representation of the information contained in the presence points and reduces, though does not eliminate, spatial autocorrelation. This is completely built in to the system, though it is somewhat independent of the actual GARP algorithm. It's just one of those things I added to try to achieve a more general-purpose system.
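To make the effect of gridding concrete, here is a minimal sketch in Python. It is not GARP's actual code. It assumes that the reduction comes from occurrences in the same grid cell collapsing into a single presence, and the function name thin_to_grid, the coordinates, and the cell sizes are all made up for illustration.

```python
import math


def thin_to_grid(points, x_min, y_min, cell_size):
    """Collapse occurrence points onto a regular grid, one record per cell.

    Assumption for illustration: every point falling in the same cell
    counts as a single presence, and the first point seen is kept.
    Returns a dict mapping (col, row) cell indices to that point.
    """
    cells = {}
    for x, y in points:
        col = math.floor((x - x_min) / cell_size)
        row = math.floor((y - y_min) / cell_size)
        # setdefault keeps the first point and ignores later duplicates
        cells.setdefault((col, row), (x, y))
    return cells


# Hypothetical occurrence records (x, y), with two tight clusters
occurrences = [
    (0.10, 0.10),
    (0.15, 0.12),
    (0.40, 0.10),
    (1.20, 0.30),
    (1.22, 0.35),
    (2.70, 2.10),
]

fine = thin_to_grid(occurrences, 0.0, 0.0, 0.25)   # finer resolution
coarse = thin_to_grid(occurrences, 0.0, 0.0, 1.0)  # coarser resolution

print(len(occurrences), len(fine), len(coarse))  # prints: 6 4 3
```

The two tight clusters each collapse to one record at the finer resolution, and the coarser grid removes one more point, which is the pattern described above: fewer points survive as the cells get larger.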


