Data about species occurrences come in two main forms: lists of locations where a species has been found, called presence-only (P) data, and lists of locations where a species has been recorded as either present or absent, called presence-absence (PA) data. In developing ecological niche models (ENMs), PA data are often said to be preferable to P data (e.g. Austin and Meyers 1996), and some have shown empirical results supporting this view (e.g. Broto et al. 2004). But is there an intrinsic advantage to PA data?
The problem of presence-only data arises largely from using museum collections as a source of geocoordinates for species distributions. Specimens in museum collections often have the longitude and latitude recorded for the place where they were found. But there is no information on where they were not found, as there can logically be no absent specimens in a collection. Attempting to model and map species with the locational data from these collections leads to the P-only problem. PA data arise from structured surveys where either all species in a given area are recorded, so that absence can be assumed from omission from the list, or absences are explicitly recorded.
There are a number of ways of framing the modeling of species distributions. The main way is to view the PA data as a binary variable, i.e. a variable with two values, 1 and 0. This framing is most explicit in binomial (or binary) logistic regression (e.g. GLMs or GAMs), which models the relationship between environment and the PA data, and it underlies many other methods. One of the main assumptions of a binomial variable is that its values or categories must be mutually exclusive and exhaustive; that is, both values must exist and be logical opposites. To use PA methods with P-only data, therefore, the A data must be generated. These data are termed ‘pseudo-absences’, and are most often generated as a random sample from the entire set of available grid cells, called the ‘background’, or they may be supplied by subjective expert opinion. However, these A data are guaranteed to violate exclusivity, as they will invariably contain some P points. Deviations from assumptions can lead to unpredictable results.
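The contamination of background pseudo-absences can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the grid size, the 20% occupancy, and the function names draw_pseudo_absences and contamination are all hypothetical, chosen for illustration and not taken from any ENM package.

```python
import random


def draw_pseudo_absences(n_cells, n_draws, rng):
    """Sample distinct cell indices uniformly from the whole background grid."""
    return rng.sample(range(n_cells), n_draws)


def contamination(pseudo_absences, occupied_cells):
    """Fraction of pseudo-absences that fall on cells where the species occurs."""
    hits = sum(1 for cell in pseudo_absences if cell in occupied_cells)
    return hits / len(pseudo_absences)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
n_cells = 10_000         # hypothetical number of grid cells in the background

# Assumed "truth" for the simulation: the species occupies 20% of the cells.
occupied = set(rng.sample(range(n_cells), 2_000))

# Pseudo-absences are drawn with no knowledge of where the species occurs.
pseudo = draw_pseudo_absences(n_cells, 1_000, rng)
frac = contamination(pseudo, occupied)

print(f"{frac:.1%} of the pseudo-absences fall on occupied cells")
```

In this setup the expected contamination equals the prevalence of the species in the background, so the more widespread the species, the more 0s in the binary variable are really 1s.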
An alternative way of looking at P-only methods is to view the Ps as a draw, or sample, from a larger population consisting of all the available grid cells. The goal of an ENM algorithm such as WhyWhere is to determine which combinations of environmental variables maximize the apparent non-randomness of this draw. For models derived in this way, it is entirely consistent with assumptions to compare Ps with either the background frequencies or a random sample of the background, and numerous statistical tests are available for testing the deviation of a sample from a population.
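One such test can be sketched as follows. This is not the WhyWhere algorithm; it is a generic Monte Carlo test, assuming a single categorical environmental variable, that asks whether the presences look like a random draw from the background. The class labels, the counts, and the function names chi_square_stat and monte_carlo_p are hypothetical.

```python
import random
from collections import Counter


def chi_square_stat(observed, expected):
    """Pearson chi-square statistic of observed counts against expected counts."""
    return sum((observed.get(k, 0) - e) ** 2 / e for k, e in expected.items())


def monte_carlo_p(presence_classes, background_classes, n_sims, rng):
    """Test whether the presences deviate from a random draw of the background.

    Assumes every presence lies in a background cell, so each presence class
    also occurs in the background.
    """
    n = len(presence_classes)
    total = len(background_classes)
    # Counts expected if the presences were a random sample of the background.
    expected = {k: n * c / total for k, c in Counter(background_classes).items()}
    observed_stat = chi_square_stat(Counter(presence_classes), expected)

    # Null distribution: random samples of the same size from the background.
    exceed = 0
    for _ in range(n_sims):
        draw = rng.sample(background_classes, n)
        if chi_square_stat(Counter(draw), expected) >= observed_stat:
            exceed += 1
    return observed_stat, (exceed + 1) / (n_sims + 1)


rng = random.Random(1)
# Hypothetical background: 10,000 grid cells in three elevation classes.
background = ["low"] * 5_000 + ["mid"] * 3_000 + ["high"] * 2_000
# Hypothetical presences: 50 records, concentrated at high elevation.
presences = ["low"] * 5 + ["mid"] * 15 + ["high"] * 30

stat, p_value = monte_carlo_p(presences, background, 999, rng)
print(f"chi-square = {stat:.1f}, Monte Carlo p = {p_value:.3f}")
```

No absences, real or generated, enter the calculation: the presences are compared only with the population of cells from which they were drawn.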
Empirical findings of the superiority of PA approaches may be due to factors other than the A data alone. Quantitative comparisons of PA with P data are confounded by the two different methods used (e.g. Broto et al. 2004), so a difference in results cannot be attributed to the data type alone. One explanation for better results is a reduction in the effect of bias. It is possible that using As with the same survey structure as the Ps reduces the effect of biases such as roadside collection. But this effect is not a priori; it depends on the case.
Could the expressed preference for PA over P data be due only to the apparent conformance of PA data to the specific statistical methods used? That is, because GLMs require PA data, are PA data then regarded as preferable? As I have indicated above, there may not be any inherent theoretical superiority of PA over P, as there are valid statistical tests for P-only situations in the form of tests of the deviation of a sample from a population.
In fact, it may be argued that, on a theoretical basis at least, a P approach is preferable because it does not require truly exhaustive and exclusive absences, a requirement that most biodiversity data do not meet. Moreover, using a random sample from the background population to supply pseudo-absences may have unexpected consequences for results when the method expects true absences.
Austin M P and Meyers J A 1996 Current Approaches to Modeling the Environmental Niche of Eucalyptus: Implications for Management of Forest Biodiversity; For. Ecol. Manag. 85 95–106
Broto L, Thuiller W, Araújo M B and Hirzel A H 2004 Presence-absence versus presence-only modelling methods for predicting bird habitat suitability; Ecography 27(4) 437