The paper by S.J. Phillips, R.P. Anderson, and R.E. Schapire, “A maximum entropy approach to species distribution modeling”, introduces niche modelers for the first time to the maximum entropy (Maxent) approach, which is well known in machine learning. They also provide the Maxent software for predicting species distributions, and evaluate it against a well-known method called DesktopGARP in predicting the distributions of two Neotropical mammals: a sloth, Bradypus variegatus, and a rodent, Microryzomys minutus.
The Maxent principle is to estimate the probability distribution, such as the spatial distribution of a species, that is most spread out subject to constraints such as the known observations of the species. Maxent uses entropy as the means to generalize specific observations of presence of a species, and does not require or even incorporate absence points within the theoretical framework. Presence-only points are observations of the presence of a species. For a variety of reasons, absence of a species is not usually recorded.
It would seem to be an advantage to have a framework that is based on presence-only points, the most common form of data in niche modeling.
Usually we have to shoehorn methods developed for discriminating presence from absence points into working with presence-only points. The trial reviewed here shows that the method performs well, and it will no doubt be increasingly used.
There are some aspects of the way Maxent characterizes the problem of niche modeling that will seem surprising and make it a little difficult to grasp.
Feature vectors: Rather than working directly with environmental variables like temperature and precipitation as predictors, Maxent transforms the variables into feature vectors, which may be the mean of variables, their squares, their products with other variables, thresholds, or binarizations of categorical variables.
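To make the idea of feature vectors concrete, here is a minimal sketch of how raw variables might be expanded into features. It is not the Maxent software's code: the function name `make_features`, the variable names, and the threshold value of 20.0 are all hypothetical choices for illustration.

```python
def make_features(temp, precip, veg_type, temp_threshold=20.0):
    """Expand raw variables into one feature row per pixel.

    Hypothetical illustration only: the names and the threshold
    value are assumptions, not taken from the Maxent software.
    """
    categories = sorted(set(veg_type))
    names = ["temp", "precip", "temp^2", "temp*precip",
             f"temp>{temp_threshold}"]
    names += [f"veg={c}" for c in categories]

    rows = []
    for t, p, v in zip(temp, precip, veg_type):
        row = [
            float(t),                            # the variable itself
            float(p),
            float(t) ** 2,                       # squared feature
            float(t) * float(p),                 # product of two variables
            1.0 if t > temp_threshold else 0.0,  # threshold feature
        ]
        # One 0/1 indicator per category of the categorical variable
        row += [1.0 if v == c else 0.0 for c in categories]
        rows.append(row)
    return names, rows


names, rows = make_features([10, 25], [100, 50], ["forest", "grass"])
for name, value in zip(names, rows[1]):
    print(name, value)
```

Each pixel ends up with a longer vector than the number of raw variables, and the model is then fitted to these features rather than to temperature and precipitation directly.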
Generative: While claiming that Maxent is most similar to generalized linear models (GLM) in approach, Phillips et al. state that it is a ‘generative’ approach rather than a ‘discriminative’ method like GLM. This means that where Y is the response (probability of occurrence) and X are the inputs (feature vectors), it models Pr(X|Y), the probability of the inputs given the response. A discriminative approach models Pr(Y|X), the probability of a response given the inputs, which is what is required for prediction. Phillips et al. state that Maxent uses Bayes’ rule to get from Pr(X|Y) to Pr(Y|X), and in fact uses Pr(X|Y=1), as only occurrence points are used. This is one of the many things I would have liked to see explained further, but I couldn’t find an explanation in the paper.
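For reference, the step from the generative to the discriminative quantity is the standard Bayes’ rule identity, written here for the presence case with the same symbols as above. This is only the textbook identity, not a reconstruction of how Maxent handles each term.

```latex
\Pr(Y = 1 \mid X = x) \;=\; \frac{\Pr(X = x \mid Y = 1)\,\Pr(Y = 1)}{\Pr(X = x)}
```

The generative model supplies the first factor in the numerator; the overall prevalence Pr(Y=1) is a separate quantity that presence-only data does not obviously provide, which is part of why I wanted more explanation.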
Cumulative: Another big difference is that Maxent estimates a probability distribution over every pixel in the study area, which means the probabilities of all pixels sum to 1. Rather than individual probabilities of 0.8, 0.9, etc. representing the suitability of each pixel for a species, the probabilities are each very small. To get around this, Maxent assigns a “cumulative” probability to each pixel, which is the sum of the probabilities for all pixels with lesser probability. The interpretation is that a distribution predicted using a cumulative probability threshold t is expected to omit approximately a fraction t of test locations while minimizing the predicted area. For example, t=0.05 will omit 5% of occurrences while minimizing area, which is what we would want to represent a type of boundary on the distribution.
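The cumulative representation is easier to see in a small worked example. The sketch below is my own illustration, not the Maxent implementation: the function names are hypothetical, and it assumes that a pixel's cumulative value includes the pixel itself and any pixels tied with it, so that the most suitable pixel gets a value of 1.

```python
def cumulative_values(probs):
    """Cumulative value per pixel: the summed probability of all pixels
    whose probability is no higher than this pixel's.

    Assumption: the pixel itself and any ties are included in the sum.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    out = [0.0] * len(probs)
    running = 0.0
    i = 0
    while i < len(order):
        # Find the group of pixels tied at this probability
        j = i
        while j < len(order) and probs[order[j]] == probs[order[i]]:
            running += probs[order[j]]
            j += 1
        # Every pixel in the tied group gets the same cumulative value
        for k in range(i, j):
            out[order[k]] = running
        i = j
    return out


def predicted_presence(probs, t):
    """Predict presence where the cumulative value exceeds threshold t."""
    return [c > t for c in cumulative_values(probs)]


raw = [0.4, 0.1, 0.3, 0.2]          # raw probabilities, summing to 1
print(cumulative_values(raw))        # approximately [1.0, 0.1, 0.6, 0.3]
print(predicted_presence(raw, 0.2))  # drops only the least suitable pixel
```

Raising t removes the least suitable pixels first, which is why a small threshold gives a generous boundary that still excludes the area carrying the least probability.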
Much is made in the paper of a subjective visual comparison of maps produced by Maxent and DesktopGARP, with the claim that Maxent predictions appeared to have more fine detail. It is not clear why. While they try variations on the 10 ‘best subsets’ approach, they don’t seem to check whether the different ways of representing probability explain the apparent lack of definition. Reference is also made to DesktopGARP testing the limit of storage, but I don’t know exactly what information DesktopGARP writes to produce 20GB, compared with 285MB for Maxent. The GARP algorithm shouldn’t need a lot of temporary storage, so it’s probably a detail of implementation that could be easily changed, and not inherent in the algorithm. Similarly, figures of 2 hours for 100 runs of DesktopGARP compared with 2.3 hours for one run of Maxent are quoted, so be prepared for long runs if using Maxent for repeated analyses.
One of the experiments examines the effect of adding the categorical variable vegetation type, which is claimed to improve accuracy in Maxent but to have little effect in DesktopGARP. Atomic rules contribute in simple datasets such as the Waratah Creek Greater Glider data, provided with the original UNIX version of GARP, but in the context of these types of mixed climate and vegetation type datasets, logistic rules tend to dominate atomic rules. A new implementation of the GARP algorithm in OpenModeller has omitted atomic rules entirely.
Overfitting is reduced with a regularization principle equivalent to a Gibbs distribution with a constraint, the minimization of which encourages Maxent to focus on the most important feature vectors. Gibbs distributions obey constraints such that in the normalized feature weights:
1 = Sum(exp(w f(x)))
where f(x) are the feature vectors and w are their weights. Gibbs distributions are usually used to define the equilibrium probabilities of stationary microscopic states. That the most accurate model on independent data should be constrained by a Gibbs distribution is an interesting aside, but the relevance to niche modeling is not made clear.
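As far as I can tell, the normalization above amounts to the following: each pixel gets a score w·f(x), the score is exponentiated, and the results are divided by their total so that they sum to 1. The sketch below shows only that step. The function name is hypothetical, the weights are simply given rather than fitted, and no regularization is included.

```python
import math


def gibbs_distribution(features, weights):
    """Probability per pixel proportional to exp(w . f(x)).

    features: one list of feature values per pixel
    weights:  one weight per feature (given here, not fitted)
    """
    scores = [sum(w * f for w, f in zip(weights, row)) for row in features]
    top = max(scores)  # subtract the max before exp() for numerical stability
    unnormalized = [math.exp(s - top) for s in scores]
    total = sum(unnormalized)  # the normalizing constant
    return [u / total for u in unnormalized]


# Three pixels with a single feature and a positive weight
q = gibbs_distribution([[0.0], [1.0], [2.0]], [1.0])
print(q, sum(q))
```

With all weights at zero, every pixel gets the same probability, which is the most spread out distribution possible. Non-zero weights pull the distribution away from uniform only as far as the features require.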
Phillips et al. throw in references to the theory of convex duality, Gibbs distributions, and Bayes’ rule, but you would need a lot of background in machine learning to understand fully how the typical problems in niche modeling, such as bias, small samples, non-linear distributions and autocorrelation, would affect the method. Not being up on statistical mechanics or machine learning, I would personally have liked to see a lot more explanation of the methodology and less of the evaluation, much of which consisted of subjective attempts to interpret the results in terms of species biology, and of large and largely irrelevant tables (the ROC graphs would have sufficed).
As I said here, “novel methods from Machine Learning continue to improve prediction”. The insight of the theoreticians and their input into niche modeling are greatly appreciated, as these results will no doubt help to propel niche modeling in the future.