There are a number of ways to answer this question.
There is a rich diversity of methods to predict
species’ distributions, and they could be listed and described.
Alternatively, the biological relationships between
species and the environment could be emphasized, and approaches from
population dynamics used as a starting point.
A more general approach to niche modeling can be based on the
statistical idea of the probability distribution.
Definition: A niche model is a probability distribution defined on environmental variables.
Definition: A probability distribution f(E) is an assignment of a probability
to every interval on a set of environmental variables E.
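The definition can be illustrated with a minimal sketch in Python, assuming a single environmental variable divided into equal-width intervals. The function name, the temperature values and the bin width are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def empirical_distribution(values, bin_width):
    """Assign a probability to each interval of one environmental variable."""
    # Label each value by the lower edge of its interval [edge, edge + bin_width).
    counts = Counter((v // bin_width) * bin_width for v in values)
    total = sum(counts.values())
    return {edge: n / total for edge, n in sorted(counts.items())}

# Hypothetical annual mean temperatures (deg C) at sites where a species was recorded.
temps = [12.1, 13.4, 14.8, 15.2, 15.9, 16.3, 17.0, 18.5]
f = empirical_distribution(temps, bin_width=5.0)
print(f)  # {10.0: 0.375, 15.0: 0.625}
```

The probabilities over all intervals sum to one, which is the only property the definition requires.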
This definition of the niche as a probability distribution
over sets of environmental variables allows for
developing niche models in new ways over
Based on this definition, the ‘entity’ being
modeled is probabilistic, not an
actual physical quantity such as the population density of animals or a group of plants.
Thus, the object of the modeling
is similar to a quantum entity — in the realm of possibility
rather than actuality. Niche models often describe
fairly vague concepts, such as habitat suitability.
Nevertheless they are useful if
one is careful not to carry the metaphor too far, partly because
the fundamental constraints that govern microscopic physical
systems, such as conservation of energy, do not hold.
Niche models are sometimes called equilibrium models,
as generally the niche represents a stable relationship
of a species to its environment.
Stability in this sense refers to the overall stability of a population despite
non-equilibrium disturbances such as annual cycles and episodic threats.
For example, the processes that lead to expansion of the
range of the species balance the processes that lead to contraction
and result in an equilibrium.
But equilibrium conditions are not a necessary assumption
to develop these models.
Any form of reasonably ‘stable’ probability distribution can be used
to make a niche model.
For example, while migrating species move in relation to their
environment, it has been shown that many are ‘niche followers’
by remaining in a fairly constant climate as the seasons change.
Invasive species are another example of species that are not at ‘equilibrium’,
yet they generally spread only to environmental niches similar to
those occupied in their country of origin.
Types of niche model
The quantitative basis of niche modeling really lies
in the Hutchinsonian definition of a niche. This was
described as a region in an n-dimensional ‘hyperspace’ of
environmental variables where a critter lives. Developing
a model using the ideas of Hutchinson
simply requires defining the range of the species along the axes
of the set of environmental variables.
This approach was used in BIOCLIM, one of the first niche modeling
tools, which was first used in an early study of the distribution
of snakes in Australia by Henry Nix.
The approach is intuitively convincing, and captures the sense of a niche
as understood by ecologists: that the occurrence of species should be limited by
a range of environmental factors, and that an envelope around
those ranges would have predictive utility.
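The envelope idea can be sketched in a few lines of Python. This is an illustration of the general min/max envelope under simple assumptions, not the BIOCLIM implementation; the variable names and values are hypothetical.

```python
def fit_envelope(records):
    """Return the (min, max) range of each variable over the occurrence records."""
    variables = records[0].keys()
    return {v: (min(r[v] for r in records), max(r[v] for r in records))
            for v in variables}

def in_envelope(envelope, site):
    """A site is predicted suitable only if every variable lies within its range."""
    return all(lo <= site[v] <= hi for v, (lo, hi) in envelope.items())

# Hypothetical occurrence records: temperature (deg C) and annual rainfall (mm).
occurrences = [
    {"temp": 14.0, "rain": 600.0},
    {"temp": 18.5, "rain": 950.0},
    {"temp": 16.2, "rain": 720.0},
]
envelope = fit_envelope(occurrences)
inside = in_envelope(envelope, {"temp": 15.0, "rain": 800.0})
outside = in_envelope(envelope, {"temp": 15.0, "rain": 1200.0})
print(envelope, inside, outside)
```

Note that the prediction is a yes or no for each site: the envelope defines a region but says nothing about how suitability varies within it.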
However, the approach runs into some practical problems.
Firstly, there is no way to exclude irrelevant variables from being included in the model.
The range of an irrelevant variable is determined more by chance
than by any causal relationship. Yet those irrelevant variables will act as
constraints, excluding sites that fall outside the range.
Secondly, in environmental envelopes the limits of the species are defined
largely by the tails of the
probability distribution. The tails of a probability distribution
usually have the lowest probability, the
fewest samples, and hence the greatest uncertainty.
One way of reducing the variability of the range limits
is to estimate the 95th percentile.
But this approach produces a progressive reduction in ecological area
with each variable. As the number of variables increases, the
potential area is reduced. To overcome this problem, models have
been proposed based on the mean and standard deviation.
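The size of the reduction from percentile limits can be illustrated with simple arithmetic. The sketch below assumes, purely for illustration, that the variables are statistically independent and that each envelope retains the central 95 percent of values; real environmental variables are usually correlated, so the actual reduction would be smaller.

```python
def retained_fraction(coverage, n_variables):
    """Fraction of points inside all envelopes, assuming independent variables."""
    return coverage ** n_variables

for n in (1, 5, 10, 20):
    print(n, round(retained_fraction(0.95, n), 3))
# 1 0.95
# 5 0.774
# 10 0.599
# 20 0.358
```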
Thirdly, the environmental envelopes defined by the limits
of each variable in turn form squares or box-like shapes.
Alternative geometries have been proposed to correct these problems, by
allowing more flexible geometric shapes to
describe the distribution.
Generalized linear models
While the above approaches to correcting the deficiencies of environmental envelopes
led to some improvements, there was an essential component missing that was
a concern for more statistically minded researchers. Environmental
envelopes do not explicitly estimate probability.
That is, while they define a region in space, the variation in probability
within that region is undefined.
Logistic regression was used to place niche
modeling on a firmer statistical footing. Logistic regression is
a form of generalized linear regression modeling where the dependent
variable, the variable to be estimated from the environmental
variables, is specifically a probability and not abundance or some other variable.
Logistic regression models are a
well studied and understood statistical methodology.
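A minimal sketch of logistic regression on one environmental variable is given below, fitted by plain gradient ascent on the log-likelihood. The data are hypothetical presence (1) and absence (0) records against a centered temperature value, and the learning rate and step count are arbitrary choices; in practice a statistical package would be used.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, rate=0.1, steps=5000):
    """Fit p(presence | x) = sigmoid(b0 + b1 * x) by gradient ascent."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # observed minus predicted probability
            g0 += err
            g1 += err * x
        b0 += rate * g0 / n
        b1 += rate * g1 / n
    return b0, b1

# Hypothetical records: temperature relative to the regional mean, and presence.
xs = [-3, -2, -1, -1, 0, 0, 1, 1, 2, 3]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)
p_warm = sigmoid(b0 + b1 * 3)
p_cool = sigmoid(b0 + b1 * -3)
print(b0, b1, p_warm, p_cool)
```

Unlike an envelope, the fitted model returns a probability that varies smoothly along the environmental gradient.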
While the introduction of statistical rigor is to be preferred over
ad hoc approaches, there were still problems identified.
One of the first was called ‘naughty naughts’.
The ‘naughty naughts’ referred to the great many areas
with essentially zero probability, such as oceans for a terrestrial species.
Most probability distributions are continuous with finite (though very small)
probability over the whole range. The need to eliminate the naughts
leads to the use of truncated
and other more complex probability distributions
in an attempt to fit the expected shape of the probability distribution.
Actually, the problem of finding the best shape for a distribution of a species
is not trivial and cannot be taken for granted.
Species distributions are not necessarily ‘normal’ and there can be
good ecological reasons for highly unusual distributions.
They can be skewed, bimodal, exponential or sigmoidal.
They can also have long tails. Both justifying the shape of the distribution and
modeling with the range of possible distributions involve difficult
and challenging statistical tests using classical statistical approaches.
Secondly, the treatment of categorical variables such as
vegetation types, ecological regions, and so on, is problematic. Logistic
regression usually handles categorical variables by treating each
category as a binary variable. For example, if a variable has 100 categories, then
this would produce 100 new variables. However, with more categories
and more variables the number of variables that would need to be introduced
is enormous. Logistic regression would be a method of choice with well
behaved distributions of largely continuous variables.
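The expansion of a categorical variable into binary variables can be sketched as follows. The vegetation classes are hypothetical, and the function shows only the basic encoding, with one new column per category.

```python
def one_hot(values):
    """Replace one categorical variable by one binary variable per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical vegetation type recorded at four sites.
veg = ["forest", "grassland", "forest", "wetland"]
categories, rows = one_hot(veg)
print(categories)  # ['forest', 'grassland', 'wetland']
print(rows)        # [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

With three categories this is manageable, but the number of columns grows with every category of every categorical variable.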
Machine learning methods
Due in part to the problems posed by categorical variables,
and essentially arbitrary distributions, machine
learning was seen as potentially applicable to niche modeling.
Machine learning methods have been used in a variety of
problems where there are no exact analytical solutions.
The popular early methods, decision trees and neural nets,
were tried and found useful. The GARP approach was an attempt to meld the three traditional
approaches in a genetic algorithm that evolved a set of solutions
consisting of environmental envelopes, logistic regression and categorical rules.
The idea of a genetic algorithm is to generate a set of rules for each type of relationship
and then iteratively test and refine them until a stable solution is achieved, letting the best rules win.
This approach was intended to capture complex heterogeneous types of
relationships of species to the environment.
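The generate, test and refine cycle can be sketched with a toy genetic algorithm. This is not the GARP algorithm: it evolves only one kind of rule (an interval on a single variable), uses mutation and selection without crossover, and all names, data and parameter values are hypothetical.

```python
import random

def accuracy(rule, data):
    """Fraction of records where 'x lies in [lo, hi]' agrees with presence."""
    lo, hi = rule
    return sum((lo <= x <= hi) == present for x, present in data) / len(data)

def evolve(data, generations=50, size=20, seed=0):
    rng = random.Random(seed)
    xs = [x for x, _ in data]

    def random_rule():
        a, b = rng.uniform(min(xs), max(xs)), rng.uniform(min(xs), max(xs))
        return (min(a, b), max(a, b))

    population = [random_rule() for _ in range(size)]
    history = []  # best accuracy at the start of each generation
    for _ in range(generations):
        population.sort(key=lambda r: accuracy(r, data), reverse=True)
        history.append(accuracy(population[0], data))
        survivors = population[: size // 2]  # the best rules are kept unchanged
        children = []
        for lo, hi in survivors:
            # Mutate each surviving rule by jittering its limits.
            lo2, hi2 = lo + rng.gauss(0, 1.0), hi + rng.gauss(0, 1.0)
            children.append((min(lo2, hi2), max(lo2, hi2)))
        population = survivors + children
    best = max(population, key=lambda r: accuracy(r, data))
    return best, history

# Hypothetical data: the species is present where 10 <= x <= 20.
data = [(x, 10 <= x <= 20) for x in range(0, 31, 2)]
best, history = evolve(data)
print(best, accuracy(best, data))
```

Because the best rules survive each generation unchanged, the best accuracy can never decrease from one generation to the next.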
Although machine learning methods meet the
requirement that they estimate probability,
all of them proved difficult to interpret in terms of
a Hutchinsonian niche, despite attempts to do so.
Another drawback was that some required multiple runs and
were computationally intensive.
Nevertheless, the development of these machine learning methods has progressed
and many are giving very good results that clearly exceed the
classical approaches. However, the difficulty in interpreting
the results in terms of ecological theory remains, as do potential limitations
with large sets of variables.
Data mining is the automated search for patterns in large amounts of data.
A couple of aspects of niche modeling make data mining potentially useful.
Firstly, as often little is known about the factors determining species’
distributions, we don’t know what factors will be most accurate at predicting the species.
Because of this uncertainty, we can’t always apply annual averages of temperature and rainfall and
expect to get a good model.
For example, species in freshwater and marine environments
are not well modeled by annual climate factors, and as the popularity of niche modeling
grows, more entities in exotic environments will be of interest.
Data mining makes it possible to test a large number of datasets as
potential candidates for models.
Secondly, there is a lot more data available now than there was, a
factor described in a following chapter.
The philosophy behind a data mining approach to niche modeling allows
for minimal assumptions to be made about the type of variables
and the form of the probability distribution that can potentially be used
in a model. Also, an approach that allows virtually any variable to be used
opens analysis up to modeling potentially anything.
To a large extent the only difference between niche modeling
and data mining is the number of independent variables. Approaches whereby models
are developed with all variables simultaneously cannot usually be used with large
data sets due to memory limitations. Generally a sequential approach
to including variables in the model is needed. Data mining also needs
to robustly discover information about a range of types of data with
arbitrary statistical distributions.
One of the most popular approaches used in data mining is the
induction of decision trees.
Mentioned before under machine learning methods,
decision tree methods have continued to be improved
with the use of more complex algorithms for
Another approach to data mining is clustering.
Often used as an exploratory method of data analysis,
methods such as k-means quantize variables into
a discrete number of groups, and characterize the
points in the groups by representative features, such
as the group centroids.
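A minimal sketch of k-means on a single variable is shown below. The starting centroids, the number of iterations and the data are hypothetical choices for illustration; practical implementations handle many variables and choose starting points more carefully.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Alternate between assigning values to the nearest centroid and updating centroids."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # Each centroid moves to the mean of its group (unchanged if the group is empty).
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return centroids, groups

# Hypothetical values of one environmental variable, quantized into two groups.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, groups = kmeans_1d(values, centroids=[0.0, 5.0])
print(centroids)  # [2.0, 11.0]
print(groups)     # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
```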
The statistics of k-means and decision tree methods have been
well researched, and in comparison to more heuristic
methods are well understood. They have been used successfully in a variety
A clustering approach is used in the WhyWhere algorithm.
Here an image processing method is used to derive the groups
from up to three environmental variables at once, characterized as
the list of reduced colors. Efficient approximate implementations of k-means are
used for the color reduction. The method used, Heckbert’s median cut,
is commonly applied to reduce the color palette of GIF and other indexed-color images,
and has been shown to give good results for images.
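The following is a simplified sketch of median cut, assuming three environmental variables scaled to the 0 to 255 range of the color channels. It is not the WhyWhere code, and it differs from production implementations in how it picks the next box to split; the data are hypothetical.

```python
def median_cut(points, n_boxes):
    """Split 3-tuples into up to n_boxes groups by repeated median splits."""
    def axis_range(box, axis):
        column = [p[axis] for p in box]
        return max(column) - min(column)

    boxes = [list(points)]
    while len(boxes) < n_boxes:
        # Pick the box with the widest range on any axis.
        i = max(range(len(boxes)),
                key=lambda j: max(axis_range(boxes[j], a) for a in range(3)))
        if len(boxes[i]) < 2:
            break  # nothing left to split
        # Split that box at the median of its widest axis.
        axis = max(range(3), key=lambda a: axis_range(boxes[i], a))
        ordered = sorted(boxes[i], key=lambda p: p[axis])
        mid = len(ordered) // 2
        boxes[i:i + 1] = [ordered[:mid], ordered[mid:]]
    return boxes

def centroid(box):
    """Representative 'color' of a box: the mean of each channel."""
    return tuple(sum(p[a] for p in box) / len(box) for a in range(3))

# Hypothetical grid cells, each with three scaled environmental values.
cells = [(10, 10, 10), (12, 14, 10), (200, 30, 40), (210, 35, 45)]
boxes = median_cut(cells, 2)
palette = [centroid(b) for b in boxes]
print(palette)  # [(11.0, 12.0, 10.0), (205.0, 32.5, 42.5)]
```

Each entry in the resulting palette plays the role of one environmental category in the reduced set of colors.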
In clustering approaches, the probability for a
prediction at a specific point is derived from a single probability at each
cluster. This can simply be the probability at the cluster the point belongs to, or
a weighted sum of probabilities at a number of clusters.
In WhyWhere the probabilities of presence or absence are calculated from
the proportion of occurrences of the points in a group
relative to the proportion of environmental values in that category.
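The comparison of proportions can be sketched as below. The sketch computes only the ratio of the two proportions for each group; how WhyWhere converts these quantities into probabilities of presence or absence is not specified here, so the function, the group labels and the counts are hypothetical.

```python
from collections import Counter

def occurrence_ratio(occurrence_groups, background_groups):
    """Proportion of occurrences in each group relative to the group's share of the environment."""
    occ = Counter(occurrence_groups)
    bg = Counter(background_groups)
    n_occ, n_bg = len(occurrence_groups), len(background_groups)
    return {g: (occ[g] / n_occ) / (bg[g] / n_bg) for g in bg}

# Hypothetical group labels for all grid cells and for the cells with occurrences.
background = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
occurrences = ["a"] * 2 + ["b"] * 6 + ["c"] * 2
ratios = occurrence_ratio(occurrences, background)
print(ratios)
```

A ratio above one indicates a group in which the species occurs more often than its share of the environment would suggest, and a ratio below one indicates the opposite.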
Here we presented a brief overview of the background to niche modeling methods,
the strengths and limitations of various approaches and the status of contemporary
trends. The next chapter discusses the other most important component of
ecological niche modeling — the environmental data.