The following manual has been moved from its original location.
by Karen Payne and D.R.B. Stockwell
Welcome to the GARP Modelling System (GMS)! GARP is an acronym for
Genetic Algorithm for Rule Set Production. The GMS is a set of modules
primarily designed for predicting the potential distribution of biological
entities from raster based environmental and biological data. The modules
perform a variety of analytical functions in an automated way, thus making
possible rapid unsupervised production of animal and plant distributions.
This manual describes the use of the software which has widespread application
where a simple to use, robust and informative modelling system is needed.
The package you have just downloaded consists of three parts. First,
this manual which is intended to serve as a gentle introduction to those
interested in using the GMS. Secondly, you have all the programs necessary
for using the GMS. Finally, this package also contains a small example
data set and scripts for running them, do.x and do2.x. These examples are
referred to in this tutorial paper.
- revision history
- retrieving and installing garp
- contacting the author
- other sources of information
- conventions used in this manual
- a note on the examples provided
- general structure of analytical systems
- general structure of this manual
- overview of logic and
- rule types
- introduction to genetic
This program is the copyrighted and intellectual property of David Stockwell. Permission is given to use this program for evaluation. For regular
use a fee may be charged.
Caveat on the use of the GARP modelling package
The author disclaims any warranties of fitness of programs for any particular purpose.
This is version 1.0, the first public release version of GARP.
Downloading and installing the GMS
GARP is available for download from biodi.sdsc.edu
This package is a C coded version of an earlier system called Ttree
which was written in Turbo PROLOG for MSDOS. This version has run successfully
on Sun workstations and IBM PCs (using Linux).
Installation in UNIX
The following commands should compile the programs. Modify the makefile
for system specific compilers and installation destination.
Unzip, untar and compile.
> gunzip garp-1.0.tar.gz
> tar xovf garp-1.0.tar
> make all
Installation in DOS
For DOS installation use the syntax:
>pkunzip -d garp_1.0.zip
in your garp directory.
After you install garp you should have the following files on your system:
rasteriz* initial* presampl* explain* predict* verify* image* translat*
CAVEATS COPYRIGHT FAQ README garp.txt formats.txt rasterize.txt initial.txt presample.txt explain.txt predict.txt verify.txt image.txt translate.txt index.txt
Contacting the author
Bugs, comments, money, contracts and general praise of the GMS can be
directed to the author via one of the following contacts:
- David Stockwell
- davids99us at yahoo.com
What published information and applications are available?
Biodi also has supporting documentation in addition
to links to other relevant sites and can be viewed at: http://biodi.sdsc.edu.
You may also wish to join the GARP public mailing list by sending the
> subscribe garp
The list owner is David Stockwell and can be contacted at either of
the following email addresses: firstname.lastname@example.org
If you are interested in learning more about genetic algorithms you may
view the Genetic
Algorithms FAQ from the newsgroup comp.ai.genetic.
A note about the notation used in this manual
File names are given in quotation marks (eg. "test"). Commands
that you would type in at your terminal are preceded with a > indicating
the machine prompt. cd means change directory.
A note on the examples provided
Example programs are included for testing and tutorial. These are contained
in the directories Example and Example2. To run the GMS on these examples
on UNIX or DOS machines cd to the "Example" directory and type:
This example uses most of the model development tools in their typical
usage. The output is an ascii map of a 20×20 distribution of Greater Glider
density. To run the second example cd into "Example2" and type:
This example applies the rules developed in "do" to a 140×100
map to predict the density of Greater Glider over a larger area. The outputs
are images in portable grey map (pgm) format. You will need an image viewer
such as xv to view or convert these images.
It may be useful to examine the files "do", "do2"
and "parameters" as an example of how to run GARP in a batch
mode. A typical batch file for a UNIX machine, the "do" file,
is shown below:
set -x
echo "Datadir Example" > paramete
cp Example/layer00 .
./presampl -prop
./initial
./explain
cat test | ./verify
./predict | ./image -pnm
./translat
cat predI.pgm
Each line of the batch file is a separate command to the operating system
which is executed in order in the batch file. The pipe symbol (|) directs
the output of one program into the input of another. The redirect symbol
(>) directs the output of one program into a named file.
General structure of this manual
This manual is designed to guide you through the GARP Modelling System.
The GMS contains tools for database modelling and visualisation. Provided
the requisite files are available each module can be run independently
at any given time. The intent is to describe how the GMS runs and explain
how and why this modelling system performs the way it does.
The GMS modules and typical order of application are:
size, prior probability, and spatial coverage. It produces training and
test sets for the GARP modelling system
models and output a set of the best models
test data set
and cell of the raster data set
While the main form of this analysis is statistical, it has a number
of outstanding features:
to be developed.
allows applications to be developed where analysis bridges the gap between
databases and graphic visualisation packages.
logistic regression. These are evaluated and applied at the same time allowing
simultaneous comparison of different methods.
and significance, and then ranks them according to quality. This provides a
robust system which attempts to provide the best patterns that can be found
in the data.
The package has so far been used to develop predictive models
of the distribution of biological species from survey data, although many
other applications are possible.
General structure of the analytical system
The GMS is a "production line" type architecture, with a linear
configuration of components which provides an efficient, simple structure.
Within the range of alternative architectures for spatial information
systems the GMS can be classified as a loosely coupled system (Abel et al.
1992). Loosely coupled or open systems support re-configuration and
therefore ease of integration and customisation. For example, the recorded
sightings can be extracted from an ORACLE data base, and the results displayed
in browsers such as NETSCAPE.
Data formats and preparation
There are three stages to data preparation. You begin with geocoded
flat files, turn these into a series of "layer" files and then
sample the layers to create training and testing datasets.
The challenge in modelling biological patterns is to take a set of site
based records of a species and produce an accurate map of the pattern of
the potential distribution. The records are scattered unevenly throughout
the region and points of absence may or may not be recorded (Fig 1).
Figure 1: An example of biological survey data. Points where a species
occurs are shown in white and points where it doesn’t occur in black.
The "data" file
The data that you wish to use in modelling must be in a "data"
file called a point coverage. The point coverage is an ASCII file: the
first two columns contain the geocode (longitude and the latitude or easting
and northing), and the following columns contain an abundance value for
a species, or value of a variable, eg:
150.775 -35.005 0 0 1178 195 169 0
148.005 -35.005 1 3 824 204 138 5
...
These values can originate from any source; most database and GIS applications
can output points in this form. This format is also known as a point coverage
in ARCINFO or as a geocoded flat file in other parlance.
The "parameters" file
You must also create a file called "parameters" in your working
directory. This file serves two functions. First, it stores information
about each of the variables you use in your experiment. Secondly, it contains
parameters for controlling the options available to the programs in the
GMS. The listing below shows an example of the minimal contents of a parameter
file for two independent variables.
Columns 0 20 2
Rows 0 20 2
Variable 0 ExM 0 3 c degC %2.0f
Variable 1 Dev 0 2 c % %2.0f
Variable 2 StC 0 1 c mm %2.0f
The meaning of the "parameters" file is:
Columns (x min) (x max) (increment)
Rows (y min) (y max) (increment)
Variable (column) (name) (min) (max) (type) (units) (format)
Variable ...
The Columns and Rows parameters control the spatial information for
mapping of point coverages into layers. The first number is the minimum
spatial extent, the second is the maximum spatial extent, and the last is
the cell size. The size of the layer is determined from the equation:
(max - min)/size
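As a rough illustration of the equation above, the sketch below computes the number of cells along one axis from the three numbers on a Columns or Rows line. The function name layer_size is hypothetical and is not part of the GMS. Taking the absolute value is an assumption made here so that a Rows line whose values decrease still gives a positive count.

```python
def layer_size(first, last, cell_size):
    """Number of cells along one axis: |last - first| / cell size.

    abs() is an assumption of this sketch, so that a Rows line whose
    values decrease from first to last still yields a positive count.
    """
    return int(round(abs(last - first) / cell_size))


# Values taken from the "parameters" examples in this manual.
print(layer_size(0, 20, 2))                # Columns 0 20 2            -> 10
print(layer_size(250000, 265750, 30))      # Columns 250000 265750 30  -> 525
print(layer_size(6069265, 6053515, 30))    # Rows 6069265 6053515 30   -> 525
```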
Together the first values of the rows and columns parameters should
define the upper left hand corner of the image. This means that the order
of the geocodes in the "parameters" file depends on the reference
grid that you are using and correspondingly which hemisphere you are working
in. For example, if you are using easting and northing measures characteristic
of UTM projections in the southern hemisphere then the order of the geocodes
will look something like:
Columns 250000 265750 30
Rows 6069265 6053515 30
Here the values of the eastings (or columns) increase as you move east.
The northings (rows) decrease as you move away from the equator in the
southern hemisphere. In contrast, if you were working with latitude and
longitude measures then the first value of the rows parameter would be
smaller than the second value.
The remaining parameters are details of the model variables: variable
number, name (short identifier), minimum value, maximum value, types (categorical,
ordered or continuous), units of measure, and printf printing format. Note
that because the first two fields of your "data" file are reserved
for their geocodes, Variable 0 corresponds to the third column of your
"data" file, Variable 1 corresponds to the fourth column of your
"data" file etc. The printf format determines how the variables
will be printed in the final model, that is, how many digits will be represented,
and follows the conventions of C programming.
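To make the layout of the "parameters" file concrete, here is a minimal sketch of a reader for the Columns, Rows and Variable lines described above. It is not the parser used by the GMS: the function name parse_parameters and the dictionary keys are hypothetical, and any other line (for example Datadir) is simply ignored.

```python
def parse_parameters(text):
    """Parse Columns, Rows and Variable lines into a dictionary."""
    params = {"variables": []}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        key = fields[0]
        if key in ("Columns", "Rows"):
            first, last, step = (float(f) for f in fields[1:4])
            params[key.lower()] = (first, last, step)
        elif key == "Variable":
            number, name, vmin, vmax, vtype, units, fmt = fields[1:8]
            params["variables"].append({
                "number": int(number),
                "name": name,
                "min": float(vmin),
                "max": float(vmax),
                "type": vtype,
                "units": units,
                "format": fmt,
                # Columns 1 and 2 of the "data" file hold the geocode,
                # so Variable 0 is read from column 3 (counting from 1).
                "data_column": int(number) + 3,
            })
    return params


EXAMPLE = """Columns 0 20 2
Rows 0 20 2
Variable 0 ExM 0 3 c degC %2.0f
Variable 1 Dev 0 2 c % %2.0f
Variable 2 StC 0 1 c mm %2.0f
"""

params = parse_parameters(EXAMPLE)
for var in params["variables"]:
    # The printf-style format is applied with Python's % operator.
    print(var["name"], "column", var["data_column"], "max", var["format"] % var["max"])
```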
A parameters file described above typically resides in a directory
containing all the data. It is possible to have another "parameters"
file in a working directory with the line:
Datadir [absolute path name]
The Datadir entry points the applications to another directory where
the full parameter file (as described above) is located. This feature allows
data to be located in a single location away from temporary directories where
the program may be running. To reiterate, this type of "parameters"
file can be put in a separate directory as long as another file called
"parameters" is present in the working directory and contains
a line such as:
where /usr/Data/Australia is the directory where the "data"
and the "parameters" file containing the definitions of the variables
is kept. Alternatively, you could use the command line option
when running a module to specify where the data is kept.
The parameters file in the working directory can contain other lines
and flags affecting the running of the program. For example, one special
parameter is the Variables list which specifies the variables to use in
the analysis: e.g.
This parameter will cause only variables 1,2,5 and 6 to be used. A full
explanation of the options available to the GMS that are controlled by flags
in the "parameters" file is given in the man pages at the end
of this manual.
The next step in data preparation uses the program "rasteriz"
to convert your "data" file into a series of binary image files,
called "layers" which have one byte value per grid cell. Typically
all variables used in the modelling are layers. This format has a number
of advantages, the first being compression of information. For example,
a typical grid of 258×410 contains 106K points, requiring significant memory
resources if stored as floating point numbers. Storing these layers with
one byte per cell reduces the amount of memory needed. In practice the
approximation has not been a limitation.
The program "rasteriz" maps point data into a byte valued
spatial grid at a given scale. A cell is a single byte, its value determined
by linearly scaling the point value between 1 and 254. Suppose, for example,
that you had a data file that recorded the absence (0) or presence (1)
of a species at a series of locations. After running rasteriz over this
data the output files (called "test" and "train") will
have two values: 1 for absent species and 254 for records where species
are present. It has been mentioned that byte values are represented efficiently
on computers, contributing to computational efficiencies. Other advantages
are that the normalizing and scaling of the variables into single bytes
reduces the effects of differing magnitude between variables that can affect
some analytical techniques.
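The linear scaling described above can be sketched as follows. This is an illustration only: the function name to_byte is hypothetical, and the rounding and the clamping of out-of-range values are assumptions, since the manual does not say how rasteriz handles them.

```python
def to_byte(value, vmin, vmax):
    """Linearly scale value from [vmin, vmax] onto the byte range 1..254."""
    fraction = (value - vmin) / (vmax - vmin)
    # Clamping is an assumption of this sketch, not documented behaviour.
    fraction = min(max(fraction, 0.0), 1.0)
    return 1 + int(round(fraction * 253))


# Presence/absence data with min 0 and max 1, as in the example above.
print(to_byte(0, 0, 1))   # absence  -> 1
print(to_byte(1, 0, 1))   # presence -> 254
```

With 253 steps between the minimum and maximum, the resolution of a stored variable is (max - min)/253 in its original units.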
Mapping the data to scaled byte values also has the effect of changing
all variables to a common type. Rasterize recognises three types of variables,
which are recorded in the type parameter of the "parameters" file.
Each type is treated differently:
- for species or presence/absence data, which are denoted with an "s",
a cell takes a presence value if one or more points falls within it, otherwise
it remains zero.
- for categorical data denoted with a "c", a cell takes the
value of the mode of the values of the points that fall within it.
- continuous or real data are marked with an "r" and the cell
takes the mean value.
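The three treatments listed above can be sketched as a single function that combines the values of all points falling in one cell. The name cell_value is hypothetical, and two details are assumptions: ties in the mode are broken in favour of the value seen first, and an empty cell of type "c" or "r" returns None.

```python
from collections import Counter


def cell_value(values, var_type):
    """Combine the point values that fall within one grid cell."""
    if var_type == "s":
        # Presence if one or more points with a non-zero value fall in the cell.
        return 1 if any(v > 0 for v in values) else 0
    if not values:
        return None  # assumption: no data for an empty cell
    if var_type == "c":
        # Mode of the values; ties go to the value encountered first.
        return Counter(values).most_common(1)[0][0]
    if var_type == "r":
        return sum(values) / len(values)
    raise ValueError("unknown variable type: %r" % var_type)


print(cell_value([0, 1, 0], "s"))        # -> 1
print(cell_value([3, 6, 6], "c"))        # -> 6
print(cell_value([1.0, 2.0, 3.0], "r"))  # -> 2.0
```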
The mapping also has the effect of bringing all of the data to the same
scale. Spatial auto-correlations caused by localised intensive sampling
or duplicates are eliminated. The magnitude of the effect can be seen in
the decrease in the number of effective data points. In an example of the
output from rasterize below, 58 data points are read but only 31 points
are recorded in the data grid:
RASTERIZE - point coverage to gray layer
No of data points read 58
Raster cell size is 0.17x0.17 degrees
Presences 31
Absences 0
In setting up an application the rasteriz program is used to prepare
the environmental layers for subsequent analysis. This has already been
done in the example in the distribution package; the layers are contained in
the directory Example.
Figure 2: Examples of independent environmental variables in binary
layer form. The values range from the minimum value in black to the highest
value in white. The layers shown above from top left are geology, annual
temperature, annual rainfall and latitude.
A typical implementation uses over 30 environmental (predictor) data
layers, each containing variables such as temperature, rainfall, geology,
and topography. These layers remain constant. These layers are named layer01
to layer30, and are the independent variables for the model development.
Creating "layer" files from the data can be done in a number of ways:
- creating a single "layer" file one at a time
- creating a series of "layer" files from all variables in
a point coverage
- creating a "layer" file from an arc grid file
Each of these three procedures is detailed below.
1. creating a single "layer" file one at a time
The object of this modelling system is to take this point data which
is referred to as the dependent variable, and create a model which relates
it to your suite of independent variables.
Say for example that you wanted to predict a tree species based on a
set of field observations. You would have a geocoded data set of observations
of the dependent variable which cover only some of the rasters in your
dataset. Additionally you will need a set of predictor variables.
The general, minimal form of the "parameters" file looks like:
Columns (x min) (x max) (increment)
Rows (y min) (y max) (increment)
Variable (column) (name) (min) (max) (type) (units) (format)
Variable ...
And in this instance the "data" file of observational data
may look something like:
250013 6053501
250605 6053720
250447 6067253
etc.
where each location indicates a known presence (absence is not recorded).
Alternatively, the "data" file may be of the form:
250013 6053501 0
250605 6053720 1
250447 6067253 0
etc.
where a 0 indicates a known absence at a site and 1 indicates a known presence.
The "parameters" file which describes this set of dependent
or observational data may look something like:
Columns 250000 265750 30
Rows 6069265 6053515 30
Variable 0 Eumac 0 1 s species %2.0f
Where "Eumac" indicates that the first variable (located in
the third column) in the "data" file is the variable in question,
it is a species variable and may have values 0 or 1. Other types of variables
are "c" for categorical variables and "r" for continuous
variables. This allows you to model species presence or absence or abundance
or different types of variables. The location of "species" in
the parameter file is reserved for the unit of measure.
In this case the "rasteriz" program is run using either of the following:
> cat data | rasteriz
> rasteriz -file data
Each time you run "rasteriz" on a single variable you generate
an output file layer called "layer00". If "layer00"
already exists in your current working directory and you run "rasteriz"
then "layer00" will be overwritten. "layer00"
is always the dependent variable being modelled. In this example "layer00"
is the layer representing the presence or absence of the Eumac species.
2. creating a series of "layer" files from all variables
in a point coverage
The "data" file for the independent variables may look something like:
250000 6069250 26.20 6.5862 6
250030 6069250 24.52 6.1418 6
250060 6069250 22.91 6.2675 6
etc.
In this case the first two columns are again the geocodes. The third
column is elevation, a real variable ranging from -1 (no data) to 281.51
meters. The fourth column is slope, another real variable ranging from
-1 (no data) to 209.3667 and measured in percent. The final predictor variable
is geology type, a categorical variable that may take integer values from
0 to 6.
The "parameters" file associated with the independent data
may look something like:
Columns 250000 265750 30
Rows 6069265 6053515 30
Variable 0 ele -1 281.51 r m %2.2f
Variable 1 slo -1 209.3667 r perc %2.4f
Variable 2 geo 0 6 c type %2.0f
To rasterize all of the variables in this point coverage at once use
> rasteriz -file data -all
This will create a series of layer files: "layer00," "layer01,"…
"layern". There is one layer file for each of your predictor
variables. In the next section you will create models using the dependent
and independent variables. When you do this the dependent variable being
modelled will always be "layer00." For this reason if you choose
to rasterize all of your independent data at once you should rename each
of the layer files so that "layer00" becomes "layer01",
"layer01" becomes "layer02" etc. This allows "layer00"
created in the previous step to be the dependent variable in your model.
Additionally, you will also want to modify the "parameters" file
so that it reflects the new layer names. In our example the new "parameters"
file will look like:
Columns 250000 265750 30
Rows 6069265 6053515 30
Variable 0 Eumac 0 1 s species %2.0f
Variable 1 ele -1 281.51 r m %2.2f
Variable 2 slo -1 209.3667 r perc %2.4f
Variable 3 geo 0 6 c type %2.0f
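The renaming step described above can be scripted. The sketch below is one possible way to do it and is not part of the GMS; the function names shift_layers and demo are hypothetical. It renames from the highest number downwards so that no file is overwritten before it has been moved, and the demonstration works on dummy files in a temporary directory.

```python
import os
import tempfile


def shift_layers(directory, count):
    """Rename layer00..layer(count-1) to layer01..layer(count).

    Works from the highest number down so that an existing file is
    never overwritten by the one below it.
    """
    for n in range(count - 1, -1, -1):
        src = os.path.join(directory, "layer%02d" % n)
        dst = os.path.join(directory, "layer%02d" % (n + 1))
        os.rename(src, dst)


def demo():
    """Create three dummy layer files, shift them, and list the result."""
    with tempfile.TemporaryDirectory() as workdir:
        for n in range(3):
            with open(os.path.join(workdir, "layer%02d" % n), "wb") as handle:
                handle.write(bytes([n]))
        shift_layers(workdir, 3)
        return sorted(os.listdir(workdir))


print(demo())  # -> ['layer01', 'layer02', 'layer03']
```

After the shift, the "layer00" file for the dependent variable can be copied into the same directory.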
3. creating a "layer" file from an arc grid file
An ARCgrid file has a header describing the size of the file and its
geographic location, followed by the data points.
ncols
nrows
xllcorner
yllcorner
cellsize
NODATA_value
0 0 0 0 0 0 0 1.2 0 5 6 6 6 7.2 0 ...
In the case where the max and min values of the ARCgrid file are not
known we first run option -layer 2 to find the values. Dummy values can
be used for the max and min in the parameters file.
> cat data | rasteriz -layer 2
Missing val Min is 0, max is 54.0
Columns 0 20 1
Rows 0 20 1
No of columns 20, no of rows 20.
The output lists the max and min values, and appropriate entries for the
Rows, Columns and Missingval lines in the parameters file. After amending
the parameters file to reflect these values the ARCgrid file can be input
using option -layer 3.
>cat data | rasteriz -layer 3
The final step in data preparation is to take the "layer*"
files and create a training set of data with which we can induce a model
and an independent test set which allows us to assess the model’s performance,
ie. we can count how many mistakes the induced model makes on the "test"
file. The presample program is run using:
Note there is no "e" at the end of this command.
Running the program presample will produce two point files from the
stack of layers developed in the rasterize program. These files are called
"train" and "test". The "train" file is used
for training the genetic algorithm and the "test" file is used
for testing its overall accuracy. As presampl is running it will print
the following diagnostics to the screen:
PRESAMPLE - from data to train and test sets
201386 points with pred 0
85 points with pred 1
119 points with pred 254
presences 119 absences 85 background 201386 no_of_data 201590
102 points to train and 102 points to test
The "test" and "train" files produced from presampl look like:
0 0 85 254 1 102 169 85 1 254
0 0 254 254 1 203 169 254 1 254
0 0 254 127 1 152 85 85 1 254
0 0 169 127 1 152 169 169 1 254
0 0 1 124 1 1 1 1 1 254
0 0 169 254 1 254 254 169 1 254
0 0 254 254 254 203 254 169 1 254
0 0 1 254 1 1 254 169 1 254
0 0 254 254 1 152 85 254 1 254
0 0 85 254 1 203 169 169 1 254
...
The columns are the x and y locations of the point, and the values of
each of the bytes in the layers at a particular location with the dependent
variable, or layer00, at column 3.
The default behaviour of the presampling algorithm is as follows:
- a random sample with replacement, of
- a fixed number of points from outside the masked area,
- with equal proportions of the values of the dependent variable
- plus a "background" value.
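The default behaviour listed above can be approximated by the following sketch, which draws a fixed number of records with replacement and in equal proportions for each value of the dependent variable. It is not the presampl algorithm itself: the function name balanced_sample and the record layout are hypothetical, and the masked area and the "background" value are left out.

```python
import random


def balanced_sample(records, total, seed=None):
    """Draw `total` records with replacement, an equal share per class.

    records: list of (dependent_value, payload) pairs.
    """
    rng = random.Random(seed)
    classes = {}
    for record in records:
        classes.setdefault(record[0], []).append(record)
    per_class = total // len(classes)
    sample = []
    for value in sorted(classes):
        # choices() samples with replacement, so a class with very few
        # records can still supply its full share.
        sample.extend(rng.choices(classes[value], k=per_class))
    rng.shuffle(sample)
    return sample


# A rare species: 5 presences (254) and 95 absences (1).
records = [(254, i) for i in range(5)] + [(1, i) for i in range(95)]
sample = balanced_sample(records, 200, seed=1)
presences = sum(1 for value, _ in sample if value == 254)
absences = sum(1 for value, _ in sample if value == 1)
print(presences, absences)  # -> 100 100
```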
The GMS allows the user to choose whether or not to sample with replacement,
that is, the user must decide if the "test" and "train"
files are limited to the number of observations that were actually made
or if these files should be made larger by allowing single data points
to be represented more than once. This option is controlled by the Resamples
and the Replace flags in the parameters file (or by the -resamples and
-noreplace command line options). There are advantages and disadvantages
to both approaches and these arguments are outlined below.
Using the command line option -prop or the parameters flag Propflag
There are two primary reasons why we would choose to control the proportions
of the dependent values. First, a data set of sightings of a species will
be composed of values in varying proportions. Varying proportions make
it difficult to compare the predictive accuracy of models on different
species. For instance, in rare species, if the proportion of sightings
is very low, such as less than 5% of the total, the strategy of predicting
the absence of the species everywhere will have an expected accuracy of
95%. Presampling the data to an even distribution i.e. 50% presences and
50% absences, allows consistent comparison of accuracy between species.
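The arithmetic behind this argument is shown in the short sketch below. The function name always_absent_accuracy is hypothetical; it simply gives the accuracy of a model that predicts absence at every site.

```python
def always_absent_accuracy(n_present, n_absent):
    """Accuracy of predicting absence everywhere."""
    return n_absent / (n_present + n_absent)


print(always_absent_accuracy(5, 95))    # rare species, 5% presences -> 0.95
print(always_absent_accuracy(50, 50))   # after even presampling     -> 0.5
```

With even proportions the trivial strategy scores 50% for every species, so any accuracy above that reflects information in the model rather than the rarity of the species.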
Secondly, we will see later that the GMS uses a measure of "significance"
in determining how good a rule is and if it should be maintained in the
final model. The measure of significance used in the GMS calls on the normal
approximation to the binomial distribution. This estimate becomes more
inaccurate at either end of the abundance scale. That is, the estimate
becomes more inaccurate when a species is either very rare in the dataset
(probabilities near zero) or extremely common (probabilities near one).
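The loss of accuracy near the ends of the scale can be seen by comparing an exact binomial probability with its normal approximation. The sketch below is a generic illustration and does not reproduce how the GMS computes significance, which is not shown in this section. The function names, the choice of n = 100 and the use of a continuity correction are all assumptions.

```python
import math


def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k + 1))


def normal_cdf_approx(k, n, p):
    """Normal approximation to P(X <= k), with continuity correction."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    z = (k + 0.5 - mean) / sd
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


n = 100
errors = {}
for p, k in [(0.5, 50), (0.02, 2)]:
    exact = binomial_cdf(k, n, p)
    approx = normal_cdf_approx(k, n, p)
    errors[p] = abs(exact - approx)
    print("p=%.2f exact=%.4f approx=%.4f" % (p, exact, approx))
```

For these values the approximation error at p = 0.02 is far larger than at p = 0.5, which is the effect that even proportions are meant to avoid.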
Using the command line option -noreplace or the parameters flag Replaceflag
Small numbers of records occur with rare species or those of a very
restricted range. When this occurs sampling with even proportions cannot
occur without replacement of the record. While presample supports the option
of sampling with or without replacement, sampling with replacement provides
generality by allowing model development using the range of possible frequencies
of records, including from a single datum.
The disadvantage of sampling with replacement is that you are in essence
saying that the species and its associated attributes occur more often
than it actually does in the landscape. Ultimately, we would like a modelling
system that performs well on rare, common and those species in between
these two extremes. The generality of being able to compare models as if
all plants occurred with the same frequency in the landscape is achieved
at the expense of the assumption of independence of the probability of
selecting data points.
Using the command line option -resamples or the parameters flag Resamples
Allowing arbitrarily large numbers of data points can lead to long
computation times for little gain in predictive accuracy. The default for
resampling with replacement is 2500 data points. This typically provides
sufficient information for the system to have data points from throughout
the range of possible configurations. If you wish to change this limit
you may specify a new limit in either the parameters file or on the command line.
The Resamples X line in the parameters file (or -resamples X on the
command line) will write X number of points to each of the "test"
and "train" files. If the resample option is used by itself
then the "test" and "train" files will have equal occurrences
of presence and absence even if this is not the case in the original data.
Specifying a Resamples X value in the parameters file (eg. Resamples 500)
will write X data points each to the "test" and "train"
files even if you have, say, only 100 points in your data file.
In summary, the command line option -noreplace or the parameters file
line Replaceflag 0 will set the sampling to non-replacement. When you turn
the Replaceflag off (ie. set it to 0) then you should still specify a Resamples
value in the parameters file to control how many of your points will be
written to the "test" and "train" data sets. In this
case the Resamples value will be less than the number of points in your data file.
The Modelling Tools
To model you create a preliminary set of rules, refine them by using
a genetic algorithm and then apply it to your test data set to assess its
The initial program is run using
The training set generated by presample is input to the next program
initial. This produces an initial model – a good initial starting point
for the next stage of developing a model. The initial model, output in
the file "prelim", is a set of rules.
Each rule, which is a model in itself, is an if-then statement used
for making inferences about the values of the variable of interest. The
sets of rules developed by the GMS are more accurately described as inferential
models rather than mathematical models. Inferential modelling differs from
mathematical modelling in that the models are more closely related to logic
than mathematics and the basic process is logical inference rather than
The general form of a rule is as follows:
Given that if A then B, and A is true, then predict B.
The statement denoted as A is called the precondition while the one
denoted as B is called the conclusion. The accuracy of a rule is determined
from simple probability calculations. A set of data can be identified with
the precondition of a rule (eg. the set of data with rainfall between 600mm
and 700mm). The probability of occurrence of the species can be calculated
from the number of these cells in which the species occurs, divided by
the number of cells selected by the precondition. For those who are unfamiliar
with probability calculations a brief introduction is provided in the following
section. If you are familiar with this topic you may wish to skip
to the next section.
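The calculation just described, the number of selected cells in which the species occurs divided by the number of cells selected by the precondition, can be sketched as follows. The function name rule_probability, the cell layout and the rainfall figures are hypothetical and are used only for illustration.

```python
def rule_probability(cells, precondition):
    """Probability of occurrence among the cells selected by a precondition."""
    selected = [cell for cell in cells if precondition(cell)]
    if not selected:
        return None  # the precondition selects no data
    present = sum(1 for cell in selected if cell["present"])
    return present / len(selected)


# Invented cells: rainfall in mm and whether the species was recorded.
cells = [
    {"rainfall": 550, "present": False},
    {"rainfall": 620, "present": True},
    {"rainfall": 650, "present": True},
    {"rainfall": 680, "present": False},
    {"rainfall": 690, "present": True},
    {"rainfall": 720, "present": False},
]

# Rule: if rainfall is between 600mm and 700mm then predict presence.
print(rule_probability(cells, lambda c: 600 <= c["rainfall"] <= 700))  # -> 0.75
```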
Logical statements are elementary propositions which can be either true
or false. These propositions are designated by symbols, usually capital
letters. Compound propositions are formed from elementary propositions
through modification by negation or through connectives as listed below.
Logical deduction also requires a rule of inference from which the truth
value of unknown propositions can be determined. The main such rule is
called modus ponens, also shown below.
Name               Symbol  Example       Natural language
propositions       P, Q    P             x>42, x is between 0 and 5, y=3
not                !       !P            x is not between 0 and 5
and                ^       P^Q           x=4 and y=5
or                 v       PvQ           x=3 or y=5
implies            =>      P=>Q          if P then Q
iff                <=>     P<=>Q         P if and only if Q
rule of inference  |=      P, P=>Q |= Q  if P then Q and P is true, conclude Q is true
Basic elements of Probability
The basic elements of probability are sample points and events.
Sample points are observations or measurements of the world (e.g. when
a species occurs in a cell) and are generally denoted with the capital
letters E1, E2, E3, etc. These sample points can be counted.
A specific collection of sample points is referred to as an event and
is generally denoted with a single capital letter A, B, C, etc.
The sample points that compose events can be counted by summing the
observations. In this manual this is denoted: #(A) (e.g. number of cells
in which species A occurs). For example, let an "x" in the next
figure represent a recorded presence of species A in a raster dataset.
This is equivalent to saying each "x" is a sample point in the
event "species A is present."
The probability of the outcome of any event is denoted P(A) and is equal
to the sum of the probabilities of the sample points in A. A particular
event is said to occur if any sample point in the event occurs. That is,
if A is observed #A times (a sample point within A occurs) then the probability
of A, P(A), is:
P(A) = #A/n
where n is the total number of data points.
In our example then:
P(A) = #A/n = 6/36 = 0.17
In general, the probability that an event A will occur is between 0
(there is no chance of the event occurring) and 1 (it is certain that the
event will occur).
If two events are affiliated in such a way that the occurrence of one
indicates something about the occurrence of the other then the two events
are said to be related. The magnitude of the relationship is given by a
conditional probability. A conditional probability is the probability that
event A will occur given that event B has occurred and is written P(A|B).
P(A|B) can be calculated by:
P(A|B) = P(AB)/P(B)
P(B) is defined above and P(AB) is the probability of the intersection of
events A and B. That is, AB is the event that both A and B occur.
Any sample point that occurs simultaneously in both A and B indicates
that the event AB has occurred. P(AB) can be calculated by using the Multiplicative
Law of Probability which states that given two events A and B, the probability
of the intersection AB is:
P(AB) = P(A)P(B|A) = P(B)P(A|B).
In the notation of this manual then:
P(A|B) = P(AB)/P(B) = (#AB/n)/(#B/n) = #AB/#B
Where #AB is the number of cells in which both A and B occur.
Continuing with our example, let's say that we have, in addition to the
survey data represented in figure 1 above, an image with a value of an
attribute estimated for each raster in the dataset.
You could imagine that this (simplistic!) raster image represents two
geology types found in your study area. Now in order to make a prediction
you want to look at the relationship between the geology types and the
occurrence of species A. We defined the probability of A given B above:
P(A|B) = P(AB)/P(B) = # of cells in which both A and B occur/# of cells
where B occurs = #AB/#B
So in the case that B=1
P(A|B) = 5/20 = 0.25
While in the case that B=2
P(A|B) = 1/16 = 0.06
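As a rough sketch, the same conditional probabilities can be computed by overlaying two grids in Python. Both grid layouts below are hypothetical; they are arranged only so that the counts match the example (20 cells of geology type 1 holding 5 presences, 16 cells of type 2 holding 1 presence).

```python
# Hypothetical layouts; only the counts agree with the worked example.
geology = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 2, 2, 2, 2],
    [2, 2, 2, 2, 2, 2],
    [2, 2, 2, 2, 2, 2],
]
species = [
    [0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
]

def conditional_probability(species, geology, geology_type):
    """Return P(A|B) = #AB / #B, where B is 'cell has this geology type'."""
    n_b = 0   # cells where B occurs
    n_ab = 0  # cells where both A and B occur
    for species_row, geology_row in zip(species, geology):
        for present, gtype in zip(species_row, geology_row):
            if gtype == geology_type:
                n_b += 1
                n_ab += present
    return n_ab / n_b

p_given_1 = conditional_probability(species, geology, 1)
p_given_2 = conditional_probability(species, geology, 2)
print(p_given_1, p_given_2)
```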
The explanation of the above problems is presented in the language
of probability theory. For those of you acquainted with statistics this
particular problem may sound like the familiar binomial theorem. In order
to continue with this discussion we should first define a few terms.
A discrete random variable is one that can assume a countable
number of values. We may denote a discrete random variable as y and in
the case of our example let y equal the number of occurrences of species
A in our dataset. The probability of the occurrence of A is denoted p(y)
and is equivalent to P(A) when stated according to probability theory.
A particular type of discrete random variable is the randomly distributed
binomial variable. This variable can take the value of 0 (in our example
this may indicate the absence of species A) or 1 (present). This is written
in shorthand notation as:
y ~ (n,p)
where n is the number of trials (the number of points in your dataset)
and p is the probability that the outcome of a single trial is 1 (in our
example, that the species is present).
In the most general terms then, the example given in figures 1 and 2
above can also be described in statistical terms as a binomial experiment.
A binomial experiment:
1. Consists of n identical trials.
2. The outcome of each trial falls into one of k classes. For binomial
experiments k=2.
3. The probability that the outcome of a single trial will fall in a
particular class is denoted pi where i=1,2. This means that both classes
have an associated probability of occurrence. Note that the sum of all
pi’s is 1, i.e. p1 + p2 = 1.
4. We denote ni where i=1,2 (eg. n1, n2) as the number of trials in
which the outcome falls into class i. Note that n1 + n2 = n which is the
total number of data points in your dataset.
5. The probabilities pi remain the same from trial to trial. And finally,
6. The trials are independent.
The general problem of predicting the presence or absence of a species,
as we have described it above, satisfies all the definitions of a binomial
experiment except the last one. Due to spatial autocorrelation we cannot
truly say that the trials are independent. The notion of spatial autocorrelation
can generally be thought of as the observation that things (natural phenomena)
are more similar to those things which are spatially close to them and
more dissimilar to those things farther away. For this reason the notion
of an independent trial is violated. While this situation is not strictly
optimal, the problem of spatial autocorrelation is an entire subfield of
geographical studies and its solution is beyond the scope of this manual.
Calculating exact probabilities for our binomial distribution y~(n,p) when n is large
is labor intensive. However, we can avoid these calculations because under
certain circumstances the binomial distribution can be approximated using
a normal or Gaussian distribution. In other words, when n is large and
p is not too close to zero or one, the binomial probability distribution
has a shape that is closely approximated by a normal curve which has a
mean = np and a standard deviation = sqrt(npq) where q = 1-p.
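To get a feel for how close the approximation is, the sketch below compares an exact binomial probability with its normal approximation in Python. The values n = 100, p = 0.3 and k = 35 are illustrative assumptions, and the continuity correction (+0.5) is a common refinement that this manual does not discuss.

```python
import math

def binomial_cdf(k, n, p):
    """Exact P(Y <= k) for Y ~ (n, p), summing the binomial probabilities."""
    return sum(math.comb(n, y) * p**y * (1 - p) ** (n - y) for y in range(k + 1))

def normal_cdf(x, mean, sd):
    """P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

n, p = 100, 0.3            # illustrative values, not from the manual
q = 1 - p
mean = n * p               # mean of the approximating normal curve
sd = math.sqrt(n * p * q)  # its standard deviation

k = 35
exact = binomial_cdf(k, n, p)
# The +0.5 is a continuity correction (an assumption added here).
approx = normal_cdf(k + 0.5, mean, sd)
print(f"exact={exact:.4f} normal approximation={approx:.4f}")
```

The exact sum needs k+1 terms with large binomial coefficients, while the approximation needs only the mean and standard deviation, which is the labor saving described above.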
The following normal approximation to the binomial distribution follows
from Mendenhall et al. (1981, p. 550):
With 1 degree of freedom, large n (>4), k=2, and pi not too close
to zero or one, the following statistic for an outcome i is approximately
standard normal:
z = (ni - n*pi)/sqrt(n*pi*(1-pi))
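A minimal Python sketch of this statistic follows. Plugging in the geology example is an assumption made here for illustration: the n = 20 cells of geology type 1 are treated as the trials, ni = 5 is the number of those cells where species A was observed, and pi = 6/36 is the overall probability of A.

```python
import math

def z_statistic(n_i, n, p_i):
    """z = (ni - n*pi) / sqrt(n*pi*(1 - pi))."""
    return (n_i - n * p_i) / math.sqrt(n * p_i * (1 - p_i))

# Assumed mapping of the geology example onto the formula (see lead-in).
z = z_statistic(n_i=5, n=20, p_i=6 / 36)
print(round(z, 2))
```

A z value near zero would mean the species occurs on geology type 1 about as often as its overall probability suggests; larger positive values would mean it occurs there more often than expected.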
Later we will use the binomial theorem to evaluate the models we develop
with the GMS.
Below is a diagram representing a GIS scene combined with a landcover
layer. Imagine you have to determine the relationships between two attributes,
say perhaps species distribution and forest cover.
Below is an idealized version of the example above.
An event space is a graphical representation of sets of events. The
event space has a parallel in the spatial representation of GIS. In a trial
consisting of random sampling of points in the event space, the probability
of a randomly chosen point satisfying a proposition A, say, is equal to
the area of shape A as a fraction of the total area of the event space.
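The link between area and probability can be checked by simulation. The Python sketch below assumes a unit square as the event space and a circle of radius 0.3 at its centre as shape A; both are hypothetical choices, as is the number of trials.

```python
import math
import random

def in_circle(x, y, cx=0.5, cy=0.5, r=0.3):
    """True if the point (x, y) falls inside the hypothetical shape A."""
    return (x - cx) ** 2 + (y - cy) ** 2 <= r ** 2

def estimate_probability(trials, seed=0):
    """Sample points uniformly in the unit square and count hits in A."""
    rng = random.Random(seed)
    hits = sum(in_circle(rng.random(), rng.random()) for _ in range(trials))
    return hits / trials

estimate = estimate_probability(100_000)
area = math.pi * 0.3 ** 2  # area of A; the event space has area 1
print(round(estimate, 3), round(area, 3))
```

With a large number of trials the estimated probability should settle close to the area of the circle, since the square has area 1.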
Inferences for Prediction
To predict, according to the Concise Oxford Dictionary, is to foretell
or prophesy. To predict accurately is to correctly foretell an outcome.
If we wish to foretell an event accurately, such as whether a cell is A
or B, then it would be wise to do so when the probability of that event
is one. One way to do this is to use the rule of deduction. We would expect
to predict accurately by the rules of logical inference: Given that if
A then B, and A is true, then predict B.
In probabilistic terms we would expect that if the probability of B
given A is 1 (P(B|A)=1) and the probability of A is 1 (P(A)=1) then the
probability of B is 1 (P(B)=1). The problem of prediction then consists
of finding situations in which the conditional probability P(B|A) is one,
or at least very high, because from them we can construct the proposition
A=>B and use it for inference.
These propositions, which I shall call rules from now on, are frequently
used in our natural language, for instance as tautologies (e.g. if you go
in the water then you will get wet). Below are some examples of different
types of rules.
Inclusion – this species occurs in rainforest gullies (i.e. if the species
occurs then it is in a rainforest gully).
Causation – friction causes heat.
Probabilistic causation – if you smoke then your likelihood of getting
cancer is increased.
You can see that there are a variety of types of rules that resemble
the basic if-then pattern. The analysis of the exact relationship between
them has occupied logicians for centuries and doesn’t concern us here.
For illustrative purposes I would like to examine simple cases of rules
that can be validly derived from the range of configurations of two circles
in a square, and evaluate their value for prediction.
The first situation where the two circles overlap illustrates the calculation
of the conditional probability P(B|A) from the event space diagram.
P(B|A) = P(BA)/P(A) = #(BA)/#(A) = 0.2 (approximately)
This is much less than 1 and so the rule A=>B (read: "if A then
B") would not be inferred.
In the second situation, where the two circles do not overlap:
P(B|A) = P(BA)/P(A) = #(BA)/#(A) = 0/#(A) = 0
This is 0 and so the rule if A then not B (denoted A=>!B) would be
inferred.
The next two situations are a little more complicated.
P(B|A) = P(BA)/P(A) = P(A)/P(A) = 1
P(B|A) is 1 so the rule A=>B would be inferred. The inverse does
not hold however.
P(A|B) = P(BA)/P(B) = P(A)/P(B) = #(A)/#(B) = 0.5 (approx)
This is much less than 1 and so the rule B=>A would not be inferred.
Note, however, that this rule is adequate for predictive purposes. Consider
the situation that occurs when the factor B that predicts A envelops A.
In the spatial analogue it means that the environmental attribute (B) used
to predict A with high probability does not apply in all cases where B
is possible. So when B occurs we do not necessarily expect A to occur,
even though it is a good predictor.
The situation where A=>B and B=>A are true only holds when the
two shapes are coextensive.
This is clearly a special condition that is very rarely met when performing
overlays. This accounts for some of the power of using rules for prediction.
The rules only need to apply part of the time, or locally, rather than
all of the time, or globally. Generally more local rules with higher probability
can be found than global rules.
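The four configurations discussed above (overlapping, disjoint, A inside B, coextensive) can be sketched in Python with sets of cells standing in for the circles. The cell ids, the set sizes and the threshold of 1.0 are hypothetical; the sizes are chosen so that the probabilities match the approximate values quoted in the text.

```python
def conditional(consequent, antecedent):
    """P(consequent | antecedent) = #(both) / #(antecedent), from sets of cells."""
    return len(consequent & antecedent) / len(antecedent)

def infer_rule(antecedent, consequent, threshold=1.0):
    """True if the rule antecedent=>consequent would be inferred."""
    return conditional(consequent, antecedent) >= threshold

# Cells are identified by hypothetical integer ids.
A = set(range(0, 10))
cases = {
    "overlapping": set(range(8, 20)),  # shares 2 of A's 10 cells
    "disjoint": set(range(10, 20)),
    "A inside B": set(range(0, 20)),
    "coextensive": set(range(0, 10)),
}

results = {}
for name, B in cases.items():
    results[name] = (conditional(B, A), conditional(A, B))
    print(f"{name}: P(B|A)={results[name][0]:.2f} P(A|B)={results[name][1]:.2f} "
          f"A=>B: {infer_rule(A, B)} B=>A: {infer_rule(B, A)}")
```

Lowering the threshold below 1.0 would accept rules that hold only most of the time, which is one way to read "or at least very high" in the discussion of prediction.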
Applications to Modelling
Now let's examine a few applications of logic as they pertain to modelling.
The first is the BIOCLIM modelling technique used for modelling the distribution
of species. The basic idea behind this technique is the uncontestable notion
that living things have environmental tolerances beyond which they cannot
survive. The model is developed by enclosing the occurrences of the entity
of interest in a box or envelope defined by percentile ranges of the variables
of importance, such as climatic variables.
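The envelope idea can be sketched in Python as follows. This is a simplified illustration, not the BIOCLIM implementation: the variable names, the sample values and the choice of the 10th and 90th percentiles as limits are all assumptions.

```python
def percentile(values, q):
    """Percentile q (0-100) with linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

def fit_envelope(occurrences, lower=10, upper=90):
    """Return {variable: (low, high)} from values at the occurrence sites."""
    envelope = {}
    for name in occurrences[0]:
        values = [site[name] for site in occurrences]
        envelope[name] = (percentile(values, lower), percentile(values, upper))
    return envelope

def inside(envelope, point):
    """True if every variable of the point lies within its range."""
    return all(low <= point[name] <= high
               for name, (low, high) in envelope.items())

# Hypothetical climate values at 11 sites where the species was recorded.
temps = [10, 14, 15, 16, 17, 18, 19, 20, 21, 22, 30]
rains = [300, 800, 850, 900, 950, 1000, 1050, 1100, 1150, 1200, 2500]
sites = [{"temperature": t, "rainfall": r} for t, r in zip(temps, rains)]

envelope = fit_envelope(sites)
print(envelope)
print(inside(envelope, {"temperature": 18, "rainfall": 1000}))
print(inside(envelope, {"temperature": 25, "rainfall": 1000}))
```

Using percentiles rather than the minimum and maximum keeps extreme records, such as the 30 degree site here, from stretching the box.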
Using the x and y axes as climatic variables leads to a typical event
space diagram below.
The box defined by B is often used to predict the distribution of the
species observed at A. When the diagram is compared with the previous diagrams
it can be seen that the situation is very similar to the one where
A was completely enclosed within B. From this we inferred that A=>B
and not B=>A. That is, the environmental range B is validly predicted
from the occurrences A, rather than the occurrences being predicted from
the environmental range.
Again, consider what follows from A=>B. Its contrapositive, !B=>!A, is
valid: if a point is outside the climatic range of the species then the
species will not occur. But the converse B=>A, the inference that the
species will always occur within the climatic range, is logically invalid.
BIOCLIM therefore predicts the absence of a species from the environment,
but not the presence of a species.
When does BIOCLIM predict the presence of a species? In practice the
probability of the species occurring given points within the range approaches
one when the occurrences of the species fill the entire climatic range.
This can occur when the species has a very restricted distribution, such
as a single occurrence or unique habitat. In this case the distribution
and range are coextensive.
The second application pertains to all models where the predictive accuracy
is quoted to support the accuracy of a particular model. You see statements
such as "this model x gave a high accuracy of 0.95 and this was better
than model y which only gave 0.85." Below is a situation where the
rule predicts with a high accuracy.
Where B is the whole of the event space S the probability of B given
A is very high, greater than 0.95, one in fact. If one relied on the conditional
probability alone then one would say that A=>B is a good rule. However
it can be seen that any area within the space would give a high probability.
This is because the prior probability of B is already high.
To use the predictive accuracy as some guide to how "good"
a model is one needs an idea of how hard the task of prediction is. Predicting
B above is not difficult, its probability is one, like death and taxes.
Predicting a very rare event, one with low prior probability, is much harder.
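A small Python sketch makes the point about prior probability concrete. The data are hypothetical: B is true in 95 of 100 cells, and the "model" simply predicts B everywhere without looking at any environmental data.

```python
def accuracy(predictions, outcomes):
    """Fraction of cells where the prediction matches the outcome."""
    matches = sum(p == o for p, o in zip(predictions, outcomes))
    return matches / len(outcomes)

# Hypothetical data: B is true in 95 of 100 cells.
outcomes = [True] * 95 + [False] * 5
prior = sum(outcomes) / len(outcomes)

# A trivial rule that always predicts B, using no information at all.
always_b = [True] * len(outcomes)
baseline = accuracy(always_b, outcomes)
print(prior, baseline)
```

Any quoted accuracy is therefore more informative when set beside this baseline, since a rule that does no better than the prior has learned nothing.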
In the smoking causes cancer example, the rule is given experimental
credibility because it has been shown that smoking raises the probability
of cancer, i.e. the incidence of cancer in smokers is greater than the
incidence of cancer in the general population:
P(B|A) > P(B)
This relationship can be decomposed as follows:
P(B|A) > P(B)
P(BA)/P(A) > P(B), using the definition of conditional probability,
P(BA) > P(B)P(A), as P(A) is greater than zero,
which is familiar in the form P(BA) = P(B)P(A), an identity describing
the independence of events A and B. Thus the relation above occurs when
there is a positive dependence between events. In statistical terms this
is equivalent to a positive correlation between variables. Positive correlation
is often used as an estimate of goodness of regression models for example.
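The inequality can be checked from counts with a short Python sketch. The counts are borrowed from the geology example earlier in this manual, with the roles relabelled as an assumption for illustration: here A is "geology type 1" (20 cells) and B is "species present" (6 cells), with both true in 5 of the 36 cells.

```python
def dependence(n_ab, n_a, n_b, n):
    """Compare P(B|A) with P(B), and P(BA) with P(B)P(A), from cell counts."""
    p_a, p_b, p_ab = n_a / n, n_b / n, n_ab / n
    return {
        "P(B|A)": p_ab / p_a,
        "P(B)": p_b,
        "P(BA) - P(B)P(A)": p_ab - p_a * p_b,
    }

# A = geology type 1, B = species present (roles assumed, see lead-in).
result = dependence(n_ab=5, n_a=20, n_b=6, n=36)
positive = result["P(B|A)"] > result["P(B)"]
print(result, positive)
```

A difference P(BA) - P(B)P(A) of zero would correspond to independence, and a negative value to a negative dependence between the events.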
Interestingly, satisfying the relationship above does not necessarily
lead to accurate prediction. The diagram below illustrates a case where
P(B|A) is much greater than P(B) but P(B|A) is low. The diagram below shows
that a model with a high correlation does not neces