
by Karen Payne and D.R.B. Stockwell

## Introduction

*Welcome to the GARP Modelling System (GMS)! GARP is an acronym for Genetic Algorithm for Rule Set Production. The GMS is a set of modules primarily designed for predicting the potential distribution of biological entities from raster-based environmental and biological data. The modules perform a variety of analytical functions in an automated way, making possible rapid, unsupervised production of animal and plant distributions. This manual describes the use of the software, which has widespread application wherever a simple to use, robust and informative modelling system is needed.*

*The package you have just downloaded consists of three parts. First, this manual, which is intended to serve as a gentle introduction for those interested in using the GMS. Secondly, you have all the programs necessary for using the GMS. Finally, this package also contains a small example data set and scripts for running it, do.x and do2.x. These examples are referred to in this tutorial.*

- Administrative matters
  - copyright
  - caveat
  - revision history
  - retrieving and installing GARP
  - contacting the author
  - other sources of information
  - conventions used in this manual
  - a note on the examples provided
  - general structure of analytical systems
  - general structure of this manual
- The Database
  - parameters
  - rasteriz
  - presampl
- The Modelling Tools
  - initial
  - overview of logic and probability
  - rule types
  - explain
  - introduction to genetic algorithms
  - predict
  - verify
  - image

## Administrative matters

### Copyright

This program is the copyright and intellectual property of David Stockwell. Permission is given to use this program for evaluation. For regular use a fee may be charged.

### Caveat on the use of the GARP modelling package

The author disclaims any warranties of fitness of the programs for any particular problem.

### Revision history

This is version 1.0, the first public release version of GARP.

### Downloading and installing the GMS

GARP is available for download from biodi.sdsc.edu.

### Installation

This package is a C-coded version of an earlier system called Ttree, which was written in Turbo PROLOG for MSDOS. This version has run successfully on Sun workstations and IBM PCs (using Linux).

#### Installation in UNIX

The following commands should compile the programs. Modify the makefile for system-specific compilers and installation destination. Unzip, untar and compile:

> gunzip garp-1.0.tar.gz

> tar xovf garp-1.0.tar

> make all

#### Installation in DOS

For DOS installation use the syntax:

> pkunzip -d garp_1.0.zip

in your garp directory.

After you install GARP you should have the following files on your system:

- executables: rasteriz* initial* presampl* explain* predict* verify* image* translat*
- documentation: CAVEATS COPYRIGHT FAQ README garp.txt formats.txt rasterize.txt initial.txt presample.txt explain.txt predict.txt verify.txt image.txt translate.txt index.txt
- scripts: mod.x* multi.x*
- example directories: Example/ Example2/

### Contacting the author

Bugs, comments, money, contracts and general praise of the GMS can be directed to the author via one of the following contacts:

- David Stockwell
- davids99us at yahoo.com

### What published information and applications are available?

Biodi also has supporting documentation, in addition to links to other relevant sites, and can be viewed at: http://biodi.sdsc.edu.

You may also wish to join the GARP public mailing list by sending the message:

> subscribe garp

to the list server. The list owner is David Stockwell, who can be contacted at davids@sdsc.edu.

If you are interested in learning more about genetic algorithms you may view the Genetic Algorithms FAQ from the newsgroup comp.ai.genetic.

### A note about the notation used in this manual

File names are given in quotation marks (e.g. "test"). Commands that you would type in at your terminal are preceded with a >, indicating the machine prompt. cd means change directory.

### A note on the examples provided

Example programs are included for testing and tutorial purposes. These are contained in the directories Example and Example2. To run the GMS on these examples on UNIX or DOS machines cd to the "Example" directory and type:

> do

This example uses most of the model development tools in their typical usage. The output is an ascii map of a 20×20 distribution of Greater Glider density. To run the second example cd into "Example2" and type:

> do2

This example applies the rules developed in "do" to a 140×100 map to predict the density of the Greater Glider over a larger area. The output is a set of images in portable grey map (pgm) format. You will need an image viewer such as xv to view or convert these images.

It may be useful to examine the files "do", "do2" and "parameters" as examples of how to run GARP in batch mode. A typical batch file for a UNIX machine, the "do" file, is shown below:

    set -x
    echo "Datadir Example" > parameters
    cp Example/layer00 .
    ./presampl -prop
    ./initial
    ./explain
    cat test | ./verify
    ./predict | ./image -pnm
    ./translat
    cat predI.pgm

Each line of the batch file is a separate command to the operating system, and the commands are executed in order. The pipe symbol (|) directs the output of one program into the input of another. The redirect symbol (>) directs the output of one program into a named file.

### General structure of this manual

This manual is designed to guide you through the GARP Modelling System. The GMS contains tools for database modelling and visualisation. Provided the requisite files are available, each module can be run independently at any time. The intent is to describe how the GMS runs and explain how and why this modelling system performs the way it does.

The GMS modules, in their typical order of application, are:

- rasteriz – converts geocoded point data into raster layers
- presampl – samples the layers, controlling sample size, prior probability, and spatial coverage. It produces training and test sets for the GARP modelling system
- initial – produces a preliminary set of rules, the starting point for model development
- explain – uses a genetic algorithm to refine the models and output a set of the best models
- verify – assesses the accuracy of the models on the independent test data set
- predict – applies the rules to each layer and cell of the raster data set
- image – converts the resulting predictions into viewable images

While the main form of this analysis is statistical, it has a number of outstanding features:

- The modules are automated, allowing models to be developed rapidly and without supervision.
- The loosely coupled structure allows applications to be developed where analysis bridges the gap between databases and graphic visualisation packages.
- A variety of rule types is used, including logistic regression. These are evaluated and applied at the same time, allowing simultaneous comparison of different methods.
- Rules are tested for accuracy and significance, and then ranked according to quality. This provides a robust system which attempts to provide the best patterns that can be found in the data.

The package up to the present has been used to develop predictive models of the distribution of biological species from survey data, although many other applications are possible.

### General structure of the analytical system

The GMS is a "production line" type architecture, with a linear configuration of components which provides an efficient, simple structure. Within the range of alternative architectures for spatial information systems the GMS can be classified as a loosely coupled system (Abel et al. 1992). Loosely coupled or open systems support re-configuration and therefore ease of integration and customisation. For example, the recorded sightings can be extracted from an ORACLE database, and the results displayed in browsers such as NETSCAPE.

## The Database

### Data formats and preparation

*There are three stages to data preparation. You begin with geocoded flat files, turn these into a series of "layer" files and then sample the layers to create training and testing datasets.*

The challenge in modelling biological pattern is to take a set of site-based records of a species and produce an accurate map of the pattern of its potential distribution. The records are scattered unevenly throughout the region and points of absence may or may not be recorded (Fig 1).

Figure 1: An example of biological survey data. Points where a species occurs are shown in white and points where it doesn't occur in black.

### The "data" file

The data that you wish to use in modelling must be in a "data" file called a point coverage. The point coverage is an ascii file: the first two columns contain the geocode (longitude and latitude, or easting and northing), and the following columns contain an abundance value for a species, or the value of a variable, e.g.:

    150.775 -35.005 0 0 1178 195 169 0
    148.005 -35.005 1 3 824 204 138 5
    ...

These values can originate from any source; most database and GIS applications can output points in this form. This format is also known as a point coverage in ARCINFO or as a geocoded flat file in other parlance.

### The "parameters" file

You must also create a file called "parameters" in your working directory. This file serves two functions. First, it stores information about each of the variables you use in your experiment. Secondly, it contains parameters for controlling the options available to the programs in the GMS. The listing below shows an example of the minimal contents of a "parameters" file for two independent variables.

    Columns 0 20 2
    Rows 0 20 2
    Variable 0 ExM 0 3 c degC %2.0f
    Variable 1 Dev 0 2 c % %2.0f
    Variable 2 StC 0 1 c mm %2.0f

The meaning of the "parameters" file is:

    Columns (x min) (x max) (increment)
    Rows (y min) (y max) (increment)
    Variable (column) (name) (min) (max) (type) (units) (format)
    Variable ...

The Columns and Rows parameters control the spatial information for mapping point coverages into layers. The first number is the minimum spatial extent, the second is the maximum spatial extent, and the third is the cell size. The size of the layer is determined from the equation:

(max - min)/size
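For example, the line "Columns 250000 265750 30" used in the UTM example later in this section gives a layer width of:

(265750 - 250000)/30 = 525 cells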

Together the first values of the Rows and Columns parameters should define the upper left hand corner of the image. This means that the order of the geocodes in the "parameters" file depends on the reference grid that you are using, and correspondingly on which hemisphere you are working in. For example, if you are using easting and northing measures characteristic of UTM projections in the southern hemisphere then the order of the geocodes will look something like:

    Columns 250000 265750 30
    Rows 6069265 6053515 30

Here the values of the eastings (or columns) increase as you move east. The northings (rows) decrease as you move away from the equator in the southern hemisphere. In contrast, if you were working with latitude and longitude measures then the first value of the Rows parameter would be smaller than the second value.

The remaining parameters are details of the model variables: variable number, name (a short identifier), minimum value, maximum value, type (categorical, ordered or continuous), units of measure, and printf printing format. Note that because the first two fields of your "data" file are reserved for the geocodes, Variable 0 corresponds to the third column of your "data" file, Variable 1 corresponds to the fourth column of your "data" file, etc. The printf format determines how the variables will be printed in the final model, that is, how many digits will be represented, and follows the conventions of C programming.

A "parameters" file as described above typically resides in a directory containing all the data. It is possible to have another "parameters" file in a working directory with the line:

Datadir [absolute path name]

The Datadir entry points the applications to another directory where the full "parameters" file (as described above) is located. This feature allows data to be kept in a single location away from temporary directories where the program may be running. To reiterate, the full "parameters" file can be put in a separate directory as long as another file called "parameters" is present in the working directory and contains a line such as:

Datadir /usr/Data/Australia

where /usr/Data/Australia is the directory where the "data" file and the "parameters" file containing the definitions of the variables are kept. Alternatively, you could use the command line option

-data /usr/Data/Australia

when running a module to specify where the data is kept.

The "parameters" file in the working directory can contain other lines and flags affecting the running of the program. For example, one special parameter is the Variables list, which specifies the variables to use in the analysis, e.g.

Variables 1,2,5,6

This parameter will cause only variables 1, 2, 5 and 6 to be used. A full explanation of the options available to the GMS that are controlled by flags in the "parameters" file is given in the man pages at the end of this manual.

### rasteriz

The next step in data preparation uses the program "rasteriz" to convert your "data" file into a series of binary image files, called "layers", which have one byte value per grid cell. Typically all variables used in the modelling are layers. This format has a number of advantages, the first being compression of information. For example, a typical grid of 258×410 contains 106K points, requiring significant memory resources if stored as floating point numbers. Storing these layers with one byte per cell reduces the amount of memory needed. In practice the approximation has not been a limitation.

The program "rasteriz" maps point data into a byte-valued spatial grid at a given scale. A cell is a single byte, its value determined by linearly scaling the point value between 1 and 254. Suppose, for example, that you had a data file that recorded the absence (0) or presence (1) of a species at a series of locations. After running rasteriz over this data the output files (called "test" and "train") will have two values: 1 for records where the species is absent and 254 for records where the species is present. As mentioned, byte values are represented efficiently on computers, contributing to computational efficiency. A further advantage is that the normalizing and scaling of the variables into single bytes reduces the effects of differing magnitudes between variables, which can affect some analytical techniques.
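The rasteriz source is not reproduced in this manual, so the following is only a minimal sketch of this kind of linear byte scaling; the function name and the rounding behaviour are illustrative assumptions, not the GMS code:

```c
#include <math.h>

/* Sketch of the linear byte scaling described above: map a point value
 * (assumed to lie between min and max from the "parameters" file)
 * into the range 1..254. Illustrative only; not the GMS source. */
unsigned char scale_to_byte(double value, double min, double max)
{
    double t = (value - min) / (max - min);       /* 0.0 at min, 1.0 at max */
    return (unsigned char)(1 + round(t * 253.0)); /* 1 at min, 254 at max */
}
```

Under this mapping an absence (0) in a 0-1 presence/absence file lands on 1 and a presence (1) on 254, matching the "test" and "train" values described above, while the byte 0 is left free for empty cells.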

Mapping the data to scaled byte values also has the effect of changing all variables to a common type. Rasteriz recognises three types of variables, recorded in the type parameter of the "parameters" file. Each type is treated differently:

- for species or presence/absence data, which are denoted with an "s", a cell takes a presence value if one or more points falls within it, otherwise it remains zero.
- for categorical data, denoted with a "c", a cell takes the value of the mode of the values of the points that fall within it.
- continuous or real data are marked with an "r", and the cell takes the mean value.
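A rough sketch of how these three cases could be handled, assuming the points falling in one cell have already been gathered into an array (the function and its interface are illustrative assumptions, not the GMS source):

```c
/* Combine the point values that fall in one grid cell into a single
 * value, according to the variable type from the "parameters" file:
 * 's' species presence/absence, 'c' categorical, 'r' continuous. */
double aggregate_cell(char type, const double *vals, int n)
{
    if (n == 0)
        return 0.0;                /* no points: the cell stays zero */
    if (type == 's')
        return 1.0;                /* one or more points: mark presence */
    if (type == 'c') {             /* categorical: mode of the values */
        double mode = vals[0];
        int best = 0;
        for (int i = 0; i < n; i++) {
            int count = 0;
            for (int j = 0; j < n; j++)
                if (vals[j] == vals[i])
                    count++;
            if (count > best) {
                best = count;
                mode = vals[i];
            }
        }
        return mode;
    }
    double sum = 0.0;              /* 'r': continuous, take the mean */
    for (int i = 0; i < n; i++)
        sum += vals[i];
    return sum / n;
}
```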

The mapping also has the effect of bringing all of the data to the same scale. Spatial auto-correlations caused by localised intensive sampling or duplicates are eliminated. The magnitude of the effect can be seen in the decrease in the number of effective data points. In the example output from rasteriz below, 58 data points are read but only 31 points are recorded in the data grid:

    RASTERIZE - point coverage to gray layer
    No of data points read 58
    Raster cell size is 0.17x0.17 degrees
    Presences 31
    Absences 0

In setting up an application the rasteriz program is used to prepare the environmental layers for subsequent analysis. This has already been done for the example in the distribution package; the layers are contained in the directory Example.

Figure 2: Examples of independent environmental variables in binary layer form. The values range from the minimum value in black to the highest value in white. The layers shown above, from top left, are geology, annual temperature, annual rainfall and latitude.

A typical implementation uses over 30 environmental (predictor) data layers, each containing a variable such as temperature, rainfall, geology, or topography. These layers remain constant. They are named layer01 to layer30, and are the independent variables for the model development.

Creating "layer" files from the data can be done a number of ways:

- creating a single "layer" file one at a time
- creating a series of "layer" files from all variables in a point coverage
- creating a "layer" file from an arc grid file

Each of these three procedures is detailed below.

#### 1. creating a single "layer" file one at a time

The object of this modelling system is to take this point data, which is referred to as the dependent variable, and create a model which relates it to your suite of independent variables.

Say, for example, that you wanted to predict a tree species based on a set of field observations. You would have a geocoded data set of observations of the dependent variable which cover only some of the rasters in your dataset. Additionally, you will need a set of predictor variables.

The general, minimal form of the "parameters" file looks like:

    Columns (x min) (x max) (increment)
    Rows (y min) (y max) (increment)
    Variable (column) (name) (min) (max) (type) (units) (format)
    Variable ...

And in this instance the "data" file of observational data may look something like:

    250013 6053501
    250605 6053720
    250447 6067253
    etc.

where each location indicates a known presence (absence is not recorded). Alternatively, the "data" file may be of the form:

    250013 6053501 0
    250605 6053720 1
    250447 6067253 0
    etc.

where a 0 indicates a known absence at a site and 1 indicates a known presence.

The "parameters" file which describes this set of dependent or observational data may look something like:

    Columns 250000 265750 30
    Rows 6069265 6053515 30
    Variable 0 Eumac 0 1 s species %2.0f

Here "Eumac" indicates that the first variable (located in the third column of the "data" file) is the variable in question; it is a species variable and may have values 0 or 1. The other variable types are "c" for categorical variables and "r" for continuous variables. This allows you to model species presence or absence, abundance, or other types of variables. The position of "species" in the "parameters" file line is the slot reserved for the unit of measure.

In this case the "rasteriz" program is run using the following syntax:

> cat data | rasteriz

or

> rasteriz -file data

Each time you run "rasteriz" on a single variable you generate an output layer file called "layer00". If "layer00" already exists in your current working directory and you run "rasteriz", then "layer00" will be overwritten. "layer00" is always the dependent variable being modelled. In this example "layer00" is the layer representing the presence or absence of the Eumac species.

#### 2. creating a series of "layer" files from all variables in a point coverage

The "data" file for the independent variables may look something like:

    250000 6069250 26.20 6.5862 6
    250030 6069250 24.52 6.1418 6
    250060 6069250 22.91 6.2675 6
    etc.

In this case the first two columns are again the geocodes. The third column is elevation, a real variable ranging from -1 (no data) to 281.51 meters. The fourth column is slope, another real variable ranging from -1 (no data) to 209.3667, measured in percent. The final predictor variable is geology type, a categorical variable that may take integer values from 0 to 6.

The "parameters" file associated with the independent data may look something like:

    Columns 250000 265750 30
    Rows 6069265 6053515 30
    Variable 0 ele -1 281.51 r m %2.2f
    Variable 1 slo -1 209.3667 r perc %2.4f
    Variable 2 geo 0 6 c type %2.0f

To rasterize all of the variables in this point coverage at once use the syntax:

> rasteriz -file data -all

This will create a series of layer files: "layer00", "layer01", … "layern", one layer file for each of your predictor variables. In the next section you will create models using the dependent and independent variables. When you do this, the dependent variable being modelled will always be "layer00". For this reason, if you choose to rasterize all of your independent data at once you should rename each of the layer files so that "layer00" becomes "layer01", "layer01" becomes "layer02", etc., as shown below. This allows the "layer00" created in the previous step to be the dependent variable in your model.
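One way to carry out this renaming on a UNIX machine, for the three predictor layers of this example, is to work from the highest-numbered layer downwards so that no file is overwritten (an illustrative sequence, not part of the GMS itself):

> mv layer02 layer03

> mv layer01 layer02

> mv layer00 layer01

If the dependent "layer00" from the previous step has been overwritten, rerunning "rasteriz" on the observational "data" file recreates it.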

Additionally, you will also want to modify the "parameters" file so that it reflects the new layer names. In our example the new "parameters" file will look like:

    Columns 250000 265750 30
    Rows 6069265 6053515 30
    Variable 0 Eumac 0 1 s species %2.0f
    Variable 1 ele -1 281.51 r m %2.2f
    Variable 2 slo -1 209.3667 r perc %2.4f
    Variable 3 geo 0 6 c type %2.0f

#### 3. creating a "layer" file from an arc grid file

An ARCgrid file has a header describing the size of the file and its geographic location, followed by the data points:

    ncols (number of columns)
    nrows (number of rows)
    xllcorner (x of lower left corner)
    yllcorner (y of lower left corner)
    cellsize (cell size)
    NODATA_value (missing data value)
    0 0 0 0 0 0 0 1.2 0 5 6 6 6 7.2 0 ...

In the case where the max and min values of the ARCgrid file are not known, first run option -layer 2 to find the values. Dummy values can be used for the max and min in the "parameters" file.

    > cat data | rasteriz -layer 2
    Missing val
    Min is 0, max is 54.0
    Columns 0 20 1
    Rows 0 20 1
    No of columns 20, no of rows 20.

The output lists the max and min values, and appropriate lines for the Rows, Columns and Missingval entries in the "parameters" file. After amending the "parameters" file to reflect these values, the ARCgrid file can be input using option -layer 3:

> cat data | rasteriz -layer 3

### presample

The final step in data preparation is to take the "layer*" files and create a training set of data with which we can induce a model, and an independent test set which allows us to assess the model's performance, i.e. we can count how many mistakes the induced model makes on the "test" file. The presample program is run using:

> presampl

Note there is no "e" at the end of this command.

Running the program presample will produce two point files from the stack of layers developed by the rasteriz program. These files are called "train" and "test". The "train" file is used for training the genetic algorithm and the "test" file is used for testing its overall accuracy. As presampl runs it will print the following diagnostics to the screen:

    PRESAMPLE - from data to train and test sets
    201386 points with pred 0
    85 points with pred 1
    119 points with pred 254
    presences 119
    absences 85
    background 201386
    no_of_data 201590
    102 points to train and 102 points to test

The "test" and "train" files produced by presampl look like:

    0 0 85 254 1 102 169 85 1 254
    0 0 254 254 1 203 169 254 1 254
    0 0 254 127 1 152 85 85 1 254
    0 0 169 127 1 152 169 169 1 254
    0 0 1 124 1 1 1 1 1 254
    0 0 169 254 1 254 254 169 1 254
    0 0 254 254 254 203 254 169 1 254
    0 0 1 254 1 1 254 169 1 254
    0 0 254 254 1 152 85 254 1 254
    0 0 85 254 1 203 169 169 1 254
    ...

The columns are the x and y locations of the point, followed by the values of each of the bytes in the layers at that location, with the dependent variable, or layer00, at column 3.

The default behaviour of the presampling algorithm is to take:

- a random sample, with replacement, of
- a fixed number of points from outside the masked area,
- with equal proportions of the values of the dependent variable,
- plus a "background" value.

The GMS allows the user to choose whether or not to sample with replacement; that is, the user must decide if the "test" and "train" files are limited to the number of observations that were actually made, or if these files should be made larger by allowing single data points to be represented more than once. This option is controlled by the Resamples and Replace flags in the "parameters" file (or by the -resamples and -noreplace command line options). There are advantages and disadvantages to both approaches, and these arguments are outlined below.

#### Using the command line option -prop or the parameters flag Propflag <0,1>

There are two primary reasons why we would choose to control the proportions of the dependent values. First, a data set of sightings of a species will be composed of values in varying proportions. Varying proportions make it difficult to compare the predictive accuracy of models for different species. For instance, for a rare species, if the proportion of sightings is very low, such as less than 5% of the total, the strategy of predicting the absence of the species everywhere will have an expected accuracy of 95%. Presampling the data to an even distribution, i.e. 50% presences and 50% absences, allows consistent comparison of accuracy between species.

Secondly, we will see later that the GMS uses a measure of "significance" in determining how good a rule is and whether it should be maintained in the final model. The measure of significance used in the GMS calls on the normal approximation to the binomial distribution. This estimate becomes more inaccurate at either end of the abundance scale; that is, when a species is either very rare in the dataset (probabilities near zero) or extremely common (probabilities near one).

#### Using the command line option -noreplace or the parameters flag Replaceflag

Small numbers of records occur with rare species or those of very restricted range. When this occurs, sampling with even proportions cannot be done without replacement of the records. While presample supports sampling either with or without replacement, sampling with replacement provides generality by allowing model development across the full range of possible record frequencies, down to a single datum.

The disadvantage of sampling with replacement is that you are in essence saying that the species and its associated attributes occur more often than they actually do in the landscape. Ultimately, we would like a modelling system that performs well on rare species, common species, and those in between these two extremes. The generality of being able to compare models as if all plants occurred with the same frequency in the landscape is achieved at the expense of the assumption of independence of the probability of selecting data points.

#### Using the command line option -resamples or the parameters flag Resamples

Allowing arbitrarily large numbers of data points can lead to long computation times for little gain in predictive accuracy. The default for resampling with replacement is 2500 data points. This typically provides sufficient information for the system to have data points from throughout the range of possible configurations. If you wish to change this limit you may specify a new one in either the "parameters" file or on the command line.

A Resamples X line in the "parameters" file (or -resamples X on the command line) will write X points to each of the "test" and "train" files. If the resample option is used by itself then the "test" and "train" files will have equal occurrences of presence and absence, even if this is not the case in the original data. Specifying a Resamples X value in the "parameters" file (e.g. Resamples 500) will write X data points each to the "test" and "train" files even if you have, say, only 100 points in your data file.

In summary, the command line option -noreplace or the "parameters" file line Replaceflag 0 will set the sampling to non-replacement. When you turn the Replaceflag off (i.e. set it to 0) you should still specify a Resamples value in the "parameters" file to control how many of your points will be written to the "test" and "train" data sets. In this case the Resamples value will be less than the number of points in your data.
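Pulling these options together, a working-directory "parameters" file that points at a data directory and requests non-replacement sampling with an explicit sample size might look like this (an illustrative combination of the lines described above):

    Datadir /usr/Data/Australia
    Replaceflag 0
    Resamples 500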

## The Modelling Tools

*To model, you create a preliminary set of rules, refine them using a genetic algorithm and then apply them to your test data set to assess their performance.*

### initial

The initial program is run using:

> initial

The training set generated by presample is input to the next program, initial. This produces an initial model – a good starting point for the next stage of developing a model. The initial model, output in the file "prelim", is a set of rules.

Each rule, which is a model in itself, is an if-then statement used for making inferences about the values of the variable of interest. The sets of rules developed by the GMS are more accurately described as inferential models rather than mathematical models. Inferential modelling differs from mathematical modelling in that the models are more closely related to logic than mathematics, and the basic process is logical inference rather than calculation.

The general form of a rule is as follows:

Given that if A then B, and A is true, then predict B.

The statement denoted A is called the precondition, while the one denoted B is called the conclusion. The accuracy of a rule is determined from simple probability calculations. A set of data can be identified with the precondition of a rule (e.g. the set of data with rainfall between 600mm and 700mm). The probability of occurrence of the species can be calculated from the number of these cells in which the species occurs, divided by the number of cells selected by the precondition. For those who are unfamiliar with probability calculations a brief introduction is provided in the following section. If you are familiar with this topic you may wish to skip to the next section.

### Overview of logic and probability

Logical statements are elementary propositions which can be either true or false. These propositions are designated by symbols, usually capital letters. Compound propositions are formed from elementary propositions through modification by negation or through connectives, as listed below. Logical deduction also requires a rule of inference from which the truth value of unknown propositions can be determined. The main such rule is called modus ponens, also shown below.

| Name | Symbol | Example | Natural language |
| --- | --- | --- | --- |
| propositions | P, Q | P | x>42, x is between 0 and 5, y=3 |
| not | ! | !P | x is not between 0 and 5 |
| and | ^ | P^Q | x=4 and y=5 |
| or | v | PvQ | x=3 or y=5 |
| implies | => | P=>Q | if P then Q |
| iff | <=> | P<=>Q | P if and only if Q |
| rule of inference | \|= | P, P=>Q \|= Q | if P then Q and P is true, conclude Q is true |

### Basic elements of Probability

The basic elements of probability are sample points and events.

Sample points are observations or measurements of the world (e.g. when

a species occurs in a cell) and are generally denoted with the capital

letters E1, E2, E3, etc. These sample points can be counted.

A specific collection of sample points is referred to as an event and

is generally denoted with a single capital letter A, B, C, etc.

The sample points that compose events can be counted by summing the

observations. In this manual this is denoted: #(A) (e.g. number of cells

in which species A occurs). For example, let an "x" in the next

figure represent a recorded presence of species A in a raster dataset.

This is equivlant to saying each "x" is a sample point in the

event "species A is present."

The probability of the outcome of any event is denoted P(A) and is equal

to the sum of the probabilities of the sample points in A. A particular

event is said to occur if any sample point in the event occurs. That is

if A is observed #A times (a sample point within A occurs) then the probability

of A, P(A) is:

P(A)=#A/n

where n is the total number of data points.

In our example then:

P(A) = #A/n = 6/36 = 0.17

In general, the probability that an event A will occur is between 0

(there is no chance of the event occurring) and 1 (it is certain that the

event will occur).

If two events are affiliated in such a way that the occurrence of one indicates something about the occurrence of the other, then the two events are said to be related. The magnitude of the relationship is given by a conditional probability. A conditional probability is the probability that event A will occur given that event B has occurred, and is written P(A|B). P(A|B) can be calculated by:

P(A|B) = P(AB)/P(B)

P(B) is defined above and P(AB) is the intersection of events A and B. That is, P(AB) is the event that both A and B occur.

Any sample point that occurs simultaneously in both A and B indicates that the event AB has occurred. P(AB) can be calculated using the Multiplicative Law of Probability, which states that given two events A and B, the probability of the intersection AB is:

P(AB) = P(A)P(B|A) = P(B)P(A|B).

In the notation of this manual then:

P(A|B) = P(AB)/P(B) = #AB/#B

where #AB is the number of cells which occur in both A and B.

Continuing with our example, let's say that we have, in addition to the survey data represented in figure 1 above, an image with a value of an attribute estimated for each raster in the dataset.

You could imagine that this (simplistic!) raster image represents two geology types found in your study area. Now, in order to make a prediction, you want to look at the relationship between the geology types and the occurrence of species A. We defined the probability of A given B above:

P(A|B) = P(AB)/P(B) = (# of cells in which both A and B occur)/(# of cells where B occurs) = #AB/#B

So in the case that B=1:

P(A|B) = 5/20 = 0.25

While in the case that B=2:

P(A|B) = 1/16 = 0.06
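These counting estimates are simple to compute directly from a pair of rasterized layers. The sketch below is illustrative only; the function and the flat byte-array representation of the layers are assumptions, not GMS code:

```c
/* Estimate P(A|B) = #AB/#B from two byte layers of equal size:
 * a[i] is nonzero where species A is present, and b[i] holds an
 * attribute value such as a geology type. */
double cond_prob(const unsigned char *a, const unsigned char *b,
                 int ncells, unsigned char b_value)
{
    int nB = 0, nAB = 0;
    for (int i = 0; i < ncells; i++) {
        if (b[i] == b_value) {
            nB++;                  /* cell is in event B */
            if (a[i])
                nAB++;             /* cell is in both A and B */
        }
    }
    return nB > 0 ? (double)nAB / nB : 0.0;
}
```

Called with b_value set to 1 and then 2 over the example layers, this would reproduce the 0.25 and 0.06 figures above.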

The explanation of the above problems is presented in the language of probability theory. For those of you acquainted with statistics, this particular problem may sound like the familiar binomial theorem. In order to continue with this discussion we should first define a few terms.

A *discrete random variable* is one that can assume a countable number of values. We may denote a discrete random variable as y; in the case of our example, let y equal the number of occurrences of species A in our dataset. The probability of the occurrence of A is denoted p(y) and is equivalent to P(A) when stated according to probability theory.

A particular type of discrete random variable is the *binomially distributed variable*. This variable can take the value of 0 (in our example this may indicate the absence of species A) or 1 (presence). This is written in shorthand notation as:

y ~ (n,p)

where n is the number of trials (the number of points in your dataset) and p is the probability of the outcome of interest on each trial.

In the most general terms then, the example given in figures 1 and 2 above can also be described in statistical terms as a binomial experiment. A binomial experiment:

1. Consists of n identical trials.
2. The outcome of each trial falls into one of k classes. For a binomial experiment, k=2.
3. The probability that the outcome of a single trial will fall in a particular class is denoted pi, where i=1,2. This means that both classes have an associated probability of occurrence. Note that the sum of all pi's is 1, i.e. p1 + p2 = 1.
4. We denote by ni, where i=1,2 (e.g. n1, n2), the number of trials in which the outcome falls into class i. Note that n1 + n2 = n, the total number of data points in your dataset.
5. The probabilities pi remain the same from trial to trial. And finally,
6. The trials are independent.

The general problem of predicting the presence or absence of a species, as we have described it above, satisfies all the definitions of a binomial experiment except the last one. Due to spatial autocorrelation we cannot truly say that the trials are independent. The notion of spatial autocorrelation can generally be thought of as the observation that things (natural phenomena) are more similar to those things which are spatially close to them and more dissimilar to those things farther away. For this reason the requirement of independent trials is violated. While this situation is not strictly optimal, the problem of spatial autocorrelation is an entire subfield of geographical studies and its solution is beyond the scope of this manual.

Calculating p for our binomial distribution y ~ (n,p) when n is large is labor intensive. However, we can avoid these calculations because, under certain circumstances, the binomial distribution can be approximated by a normal or gaussian distribution. In other words, when n is large and p is not too close to zero or one, the binomial probability distribution has a shape that is closely approximated by a normal curve with mean = np and standard deviation = sqrt(npq), where q = 1-p.

The following normal approximation to the binomial distribution follows from Mendenhall et al. (1981, p. 550): with 1 degree of freedom, large n (>4), k=2, and pi not too close to zero or one, the following calculation for an outcome i is approximately standard normal:

z = (ni - n pi)/sqrt(n pi (1 - pi))

Later we will use this approximation to the binomial to evaluate the models we develop with the GMS.
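As a concrete illustration, the statistic above is a one-line computation (a sketch; the GMS source may organise this differently):

```c
#include <math.h>

/* Normal approximation to the binomial:
 * z = (ni - n*pi) / sqrt(n*pi*(1 - pi))
 * where n is the number of trials, ni the observed count in
 * class i, and pi the expected probability of class i. */
double z_score(int n, int ni, double pi)
{
    return (ni - n * pi) / sqrt(n * pi * (1.0 - pi));
}
```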

**Event Space**

Below is a diagram representing a GIS scene combined with a landcover layer. Imagine you have to determine the relationships between two attributes, say species distribution and forest cover.

Below is an idealized version of the example above.

An event space is a graphical representation of sets of events. The event space has a parallel in the spatial representation of a GIS. In a trial consisting of random sampling of points in the event space, the probability of a randomly chosen point satisfying a proposition A, say, is equal to the area of shape A.

**Inferences for Prediction**

To predict, according to the Concise Oxford Dictionary, is to foretell or prophesy. To predict accurately is to correctly foretell an outcome. If we wish to foretell an event accurately, such as whether a cell is A or B, then it would be wise to do so when the probability of that event is one. One way to do this is to use the rule of deduction. We would expect to predict accurately by the rules of logical inference: given that if A then B, and A is true, then predict B.

In probabilistic terms we would expect that if the probability of B given A is 1 (P(B|A)=1) and the probability of A is 1 (P(A)=1) then the probability of B is 1 (P(B)=1). The problem of prediction then consists of finding situations in which the conditional probability P(B|A) is one, or at least very high, because from them we can construct the proposition A=>B and use it for inference.

These propositions, which I shall call rules from now on, are frequently used in our natural language as tautologies – e.g. if you go in the water then you will get wet. Below are some examples of different types of rules:

- Inclusion – this species occurs in rainforest gullies (i.e. if the species occurs then it is in a rainforest gully),
- Causation – friction causes heat,
- Probabilistic causation – if you smoke then your likelihood of getting cancer is increased.

You can see that there are a variety of types of rules that resemble the basic if-then pattern. The analysis of the exact relationships between them has occupied logicians for centuries and doesn't concern us here. For illustrative purposes I would like to examine simple cases of rules that can be validly derived from the range of configurations of two circles in a square, and evaluate their value for prediction.

The first situation, where the two circles overlap, illustrates the calculation of the conditional probability P(B|A) from the event space diagram:

P(B|A) = P(BA)/P(A) = #(BA)/#(A) = 0.2 (approximately)

This is much less than 1 and so the rule A=>B (read: "if A then B") would not be inferred.

In the second situation, where the two circles are disjoint:

P(B|A) = P(BA)/P(A) = #(BA)/#(A) = 0/#(A) = 0

This is 0 and so the rule if A then not B (denoted A=>!B) would be inferred.

The next two situations are a little more complicated. Where A lies entirely within B:

P(B|A) = P(BA)/P(A) = P(A)/P(A) = 1

P(B|A) is 1, so the rule A=>B would be inferred. The inverse does not hold, however:

P(A|B) = P(BA)/P(B) = P(A)/P(B) = #(A)/#(B) = 0.5 (approx)

This is much less than 1 and so the rule B=>A would not be inferred. Note, however, that the rule A=>B is adequate for predictive purposes. Consider the situation that occurs when the factor B that predicts A envelopes A. In the spatial analogue it means that the environmental attribute (B) used to predict A with high probability does not apply in all cases where B is possible. So when B occurs we do not necessarily expect A to occur, even though it is a good predictor.

The situation where both A=>B and B=>A are true only holds when the two shapes are coextensive. This is clearly a special condition that is very rarely met when performing overlays. This accounts for some of the power of using rules for prediction: the rules only need to apply part of the time, or locally, rather than all of the time, or globally. Generally, more local rules with high probability can be found than global rules.

### Applications to Modelling

Now let's examine a few applications of logic as they pertain to modelling. The first is the BIOCLIM modelling technique used for modelling the distribution of species. The basic idea behind this technique is the incontestable notion that living things have environmental tolerances beyond which they cannot survive. The model is developed by enclosing the occurrences of the entity of interest in a box or envelope defined by percentile ranges of the variables of importance, such as climatic variables.

Using the x and y axes as climatic variables leads to the typical event space diagram below.

The box defined by B is often used to predict the distribution of the species observed at A. When this diagram is compared with the previous diagrams it can be seen that the situation is very similar to the one where A was completely enclosed within B. From this we inferred A=>B and not B=>A. That is, the environmental range B is validly predicted from the occurrences A, rather than the occurrences being predicted from the environmental range.

Again, consider the converse of A=>B, which is B=>A. From A=>B it follows that if a point is outside the climatic range of the species then the species will not occur. But the inference that the species will always occur within the climatic range is logically invalid. BIOCLIM therefore predicts the absence of a species from the environment, but not the presence of a species.

When does BIOCLIM predict the presence of a species? In practice the probability of the species occurring, given points within the range, approaches one when the occurrences of the species fill the entire climatic range. This can occur when the species has a very restricted distribution, such as a single occurrence or a unique habitat. In this case the distribution and range are coextensive.

The second application pertains to all models where predictive accuracy is quoted to support the quality of a particular model. You see statements such as "this model x gave a high accuracy of 0.95 and this was better than model y which only gave 0.85". Below is a situation where a rule predicts with high accuracy.

Where B is the whole of the event space S, the probability of B given A is very high – greater than 0.95; one, in fact. If one relied on the conditional probability alone then one would say that A=>B is a good rule. However it can be seen that any area within the space would give a high probability. This is because the prior probability of B is already high.

To use predictive accuracy as a guide to how "good" a model is, one needs an idea of how hard the task of prediction is. Predicting B above is not difficult; its probability is one, like death and taxes. Predicting a very rare event, one with low prior probability, is much harder.

In the smoking-causes-cancer example, the rule is given experimental credibility because it has been shown that smoking raises the probability of cancer, i.e. the incidence of cancer in smokers is greater than the incidence of cancer in the general population:

P(B|A) > P(B)

This relationship can be decomposed as follows:

P(BA)/P(A) > P(B), using the definition of conditional probability,

P(BA) > P(B)P(A), as probability is always greater than zero.

The last line is familiar from the form P(BA) = P(B)P(A), an identity describing the independence of events A and B. Thus the relation above occurs when there is a positive dependence between events. In statistical terms this is equivalent to a positive correlation between variables. Positive correlation is often used as an estimate of goodness of regression models, for example.

Interestingly, satisfying the relationship above does not necessarily lead to accurate prediction. The diagram below illustrates a case where P(B|A) is much greater than P(B) but P(B|A) is low. The diagram shows that a model with a high correlation does not necessarily predict accurately.