Examples of Ecological Niche Models from GARP

GARP (Genetic Algorithm for Rule Set Production) is an algorithm designed primarily for predicting the potential distribution of biological entities from raster-based environmental and biological data. This post gives examples of how to interpret different sets of rules developed by GARP.

Abundance of Greater Glider

The Greater Glider (Petauroides volans) is a species of gliding possum found extensively in old-growth forest regions of south-eastern Australia. It nests in hollows created by the broken limbs of eucalyptus trees, and feeds on the leaves of a variety of eucalyptus species. The species is of interest for conservation because its presence is an indicator of the presence of a suite of arboreal marsupial species.

The Waratah Creek data set maps an area of 1600 ha at Waratah Ck as a 20×20 grid. It contains eight data layers. The first is the density of Greater Gliders at four levels, while the remaining layers are forest inventory variables known to be relevant to possum density. The data set, and a comparison of the performance of a number of other Artificial Intelligence methods on it, are described in Stockwell et al. (1990). The variables are shown, in row-column order, in Figure 1.


Figure 1. The variables in the Waratah Creek data set, in row-column order, are GG density, Dev development (road corridors, pine plantations), StC stream corridor (proximity), SdC stand condition (merchantable timber), StQ site quality (productivity), FlN floristic nutrients (based on vegetation types), Slp slope, and Ero erosion potential. Dark squares are low values, and lighter squares are higher values.

The data set is a small, useful test set for comparing predictive algorithms, and is included in the distribution of the GARP program. It is particularly useful for this purpose because it contains complex combinations of ecological relationships.
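To make the idea of a rule concrete, here is a minimal sketch in Python of how a range-type rule might be applied to one grid cell. Everything in it is an assumption for illustration: the variable codes follow Figure 1, but the integer coding of the layers, the thresholds, the predicted class, and the `RangeRule` class itself are hypothetical and are not taken from GARP output or from the Waratah Creek data.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class RangeRule:
    """IF every listed variable lies inside its range THEN predict `outcome`."""
    ranges: Dict[str, Tuple[int, int]]  # variable code -> (low, high), inclusive
    outcome: int                        # predicted density class (hypothetical coding)

    def predict(self, cell: Dict[str, int]) -> Optional[int]:
        for name, (low, high) in self.ranges.items():
            if not low <= cell[name] <= high:
                return None  # the rule does not apply to this cell
        return self.outcome


# Hypothetical rule: good site quality and gentle slope imply high density.
rule = RangeRule(ranges={"StQ": (2, 3), "Slp": (0, 1)}, outcome=3)

# Two made-up grid cells, coded as small integers.
cell_a = {"StQ": 3, "Slp": 1, "Dev": 0}
cell_b = {"StQ": 1, "Slp": 0, "Dev": 2}

print(rule.predict(cell_a))  # the rule fires on this cell
print(rule.predict(cell_b))  # the rule is silent on this cell
```

Interpreting a rule set then amounts to reading off which variables each rule constrains and which cells it claims.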


Bayesian Networks

The problem with many models, from climate systems to multiple species and ecosystem processes to consumer purchasing behaviour, is that we often have very little understanding of the actual relationships between the variables in the system.

From our limited vantage point as observers of systems, not experimenters on them, we see only many weakly correlated variables, often drawn from incomplete samples and widely ranging sources.

We need an automated method of developing structure from the given data, one that explicitly quantifies our belief that a model captures the behaviour of the system. Bayesian nets, belief nets or graphical models begin to do this by assigning a level of belief to each of the possible values of the parameters. That is, while a conventional simulation of climate, say, has a single value for each parameter at each time step, a Bayesian network would represent the distribution of possible values for each parameter at each point in time.
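As a small illustration of holding a distribution of belief over parameter values rather than a single value, here is a sketch in Python. It is not a Bayesian network, only the single-parameter building block. The candidate values, the uniform prior, and the observations (7 events in 10 trials) are all made up for the example.

```python
from math import comb


def posterior(candidates, successes, trials):
    """Belief in each candidate probability after seeing the data (uniform prior)."""
    prior = 1.0 / len(candidates)
    weights = [
        prior * comb(trials, successes) * p**successes * (1 - p) ** (trials - successes)
        for p in candidates
    ]
    total = sum(weights)
    return {p: w / total for p, w in zip(candidates, weights)}


# Hypothetical parameter: the probability of some event, on a coarse grid.
candidates = [i / 10 for i in range(1, 10)]
belief = posterior(candidates, successes=7, trials=10)
most_believed = max(belief, key=belief.get)

for p, b in belief.items():
    print(f"p = {p:.1f}  belief = {b:.3f}")
```

The output is a level of belief for every candidate value, not one best estimate, which is the distinction drawn above.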

Belief net construction can involve a manual process of knowledge engineering. Examples of systems for graphically structuring models are the Ptolemy project for modeling, simulation, and design of concurrent, real-time, embedded systems, or the freely available ‘scientific workflow’ tool called Kepler where the flow of data from one analytical step to another is captured in a formal workflow language.
Recent advances in machine learning and data mining have also yielded efficient methods for creating belief nets directly from data (Cooper and Herskovits, 1992).


Quantitative Niche Market Research

Niche marketing is the process of finding and serving small but potentially profitable market segments. These small market segments can be visualized as part of a “long tail”, a term elaborated by Chris Anderson in his longtail blog.

Niche markets are important for small businesses, as they can find it profitable to serve markets too small for mainstream businesses. Anderson argues that the sales of products with low sales volume can collectively exceed those of the relatively few current bestsellers. Weblogs are an example: a relative handful have many links going into them, while “the long tail” of millions of weblogs may each have only a handful of incoming links.
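The arithmetic behind that argument can be sketched in a few lines of Python. The assumption here, which is mine and not Anderson's data, is that sales fall off in inverse proportion to rank (a Zipf-like curve). The catalogue size, the cut-off of 10 bestsellers, and the `zipf_sales` function are hypothetical.

```python
def zipf_sales(n_products, top_sales=1000.0):
    """Hypothetical sales by rank: the product at rank r sells top_sales / r."""
    return [top_sales / rank for rank in range(1, n_products + 1)]


sales = zipf_sales(10_000)
head = sum(sales[:10])   # the ten bestsellers
tail = sum(sales[10:])   # everything else, the "long tail"

print(f"head: {head:.0f}  tail: {tail:.0f}")
```

Under these assumptions the tail outsells the head, even though no single product in it sells much.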

The first advice for a small business or web site developer is to “identify your niche market.” But how? And how do you quantify what you have found? Ready sources of data exist within the domain of internet marketing, based around keywords. Advertisers pay for keywords, resulting in the placement of ads in panels based on the keywords on a web page. You can see such panels to the left of this article. Google and Yahoo have built billion-dollar businesses around this advertising model, and anyone can become a publisher and profit from the available information, if they know how.


Writing a Book Using R

Think faster than “How to write a book in 28 days”. With the freely available R language you can create a book in less than 28 seconds. Unfortunately, you still have to write the text and do the programming. What you can do is integrate the R code and text into the same files, then generate the figures and LaTeX text together. This adds a lot of flexibility and organization for highly technical productions, and avoids the hassle of cross-referencing.

In my book, “Niche Modeling”, which has finally been sent to the publishers, I incorporated many tables and figures of results on circularity and reconstructions published here over the last 6 months, almost all generated on the fly from R data structures using Sweave and xtable. A push of a button runs all the R scripts for generating plots, tables, and outputting LaTeX.

There were some technical issues and last-minute formatting glitches, though, that I want to document here for posterity.


The starting point for organizing a multipart book was this short note — LaTeX Files for a Book or Thesis. However, I put all the 12 chapters in subdirectories, and included the chapters from a master.tex file in the parent directory. I also used a single chapter.tex file, so I could work on one chapter at a time easily.


Sweave is an R package that allows ‘literate programming’, the integration of code and documentation. For example, code blocks are included in the LaTeX as shown below, and the figure is then referred to in the text as Figure~\ref{fig1}. On running Sweave, the figure is generated as a PostScript file, and the appropriate LaTeX for inserting and referencing it is added to the document. This saves a lot of annoying cross-referencing.

\begin{figure}
<<fig=TRUE>>=
... R code ...
@
\caption{This is a figure}\label{fig1}
\end{figure}

I also needed to include an option to write figures to another directory to keep them from cluttering up the chapter directories.


Sweave is run with the command Sweave("script.R").

Here is where the problems started. The publisher required all fonts to be embedded in the PDF file, including those in the figures. The default font in R figures is Helvetica, which is not available for embedding by the LaTeX compiler. I had to use ps.options(family="NimbusSan") to specify another font.

Embed fonts with ghostscript

Normally I used iTeXMac for compiling LaTeX files. For final preparation I also had to compile from the command line, then make sure all the fonts were embedded with the following ghostscript command.

pdflatex master
gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=master2.pdf master.pdf

To check that the fonts are embedded, open the file in Acrobat and look at “document properties” under “fonts”. All the fonts should say “Embedded Subset”. After doing this there was still a single font not embedded, called R1002 or R1004 depending on how I compiled it, and I could not find any information about it. The publisher's technical person found it was due to a single apostrophe in a code listing! Something to watch out for.

The publishers also required that I use their style file. This style used the LaTeX directive \tabletitle{} instead of \caption{} for table captions. As I was using the R package xtable to generate tables, I couldn’t change them by hand. xtable is really useful, producing nicely formatted LaTeX for R data structures like data frames, model output and time series. But I had to change the xtable code where it writes \caption{ to write \tabletitle{ instead, and also set it to write captions at the top of the table block by default, not below.

Another issue was that code listings in R would exceed the page width. I found that reducing the width of the console window also shortened the lines of output written to the LaTeX files.
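An alternative to editing the xtable source, which I did not use for the book, would be to post-process the generated LaTeX. The sketch below, in Python, only handles the rename of the directive and does not move the caption to the top of the table block. The function name `use_tabletitle` and the sample table are hypothetical.

```python
def use_tabletitle(latex: str) -> str:
    """Swap the standard caption directive for the publisher's \\tabletitle."""
    return latex.replace("\\caption{", "\\tabletitle{")


# A made-up fragment standing in for xtable output.
generated = "\\begin{table}\n\\caption{Results}\n\\end{table}"
converted = use_tabletitle(generated)
print(converted)
```

In practice this would be run over each generated table file before compiling the master document.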

That is about it for the moment. I wish I had another 3 months to fiddle with the figures and explain things more. But I have to get it in or it won’t be published this year.

Multiple Lines of Plausible Evidence

The process behind the Mann “hockey stick” paper, featured by Al Gore in his movie “An Inconvenient Truth”, has been damned by the Wegman Report, which was prepared for the US House Committee on Energy and Commerce. But how much difference should this make to belief in global warming?

The Wegman report addresses one of the questions asked by Barton’s office:

How central is the work of Drs. Mann, Bradley, and Hughes to the consensus on the temperature record?

Ans: MBH98/99 has been politicized by the IPCC and other public forums and has generated an unfortunate level of consensus in the public and political sectors and has been accepted to a large extent as truth. Within the scholarly community and in certain conservative sectors of the popular press, there is at least some level of skepticism.

How could we answer this question quantitatively?

Consider the status of Anthropogenic Global Warming (AGW). Naomi Oreskes claims unanimous certainty: “the scientific community believes that the Earth is warming and that human activities are the principal cause”.

Yet the IPCC claims AGW is “likely”. The IPCC defines the terms “likely” (meaning 66-90%) and “very likely” (meaning >90%). Under this definition, AGW would not rise to the level of ‘significant’ in a scientific test, which is usually based on a threshold of 95% confidence. Classical science operates through such definitive high-confidence beliefs. In this way the possibility of being wrong in a claim is reduced to a very low probability, an event in the ‘long tail’ of the probability distribution.

On listening to the general arguments in the Wegman hearing it seemed that the scientific consensus for anomalous warming of the 20th century comes by accumulating multiple lines of plausible evidence (e.g. glaciers, models, proxies, biota, etc).

Simple probability theory gives us a handle on how multiple lines of low-confidence evidence stack up against a single high-confidence claim, and it leads to some interesting observations.
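As a rough sketch of the kind of calculation involved, here are two simple readings in Python. Both assume the lines of evidence are independent, which is a strong assumption. The confidence of 0.7 per line, the count of four lines, the even prior, and the function names are all hypothetical, chosen only to show how far apart the readings can land.

```python
def all_lines_hold(p, n):
    """Probability that n independent claims, each true with probability p, are all true."""
    return p**n


def combined_belief(p, n, prior=0.5):
    """Naive Bayes reading: each line multiplies the odds of the hypothesis by p / (1 - p)."""
    odds = prior / (1 - prior) * (p / (1 - p)) ** n
    return odds / (1 + odds)


p, n = 0.7, 4
chain = all_lines_hold(p, n)
pooled = combined_belief(p, n)

print(f"all four lines hold:   {chain:.3f}")
print(f"pooled belief in claim: {pooled:.3f}")
```

If the conclusion needs every line to hold, confidence falls well below that of any single line. If each line independently supports the same conclusion, confidence rises above it. Which reading applies depends on how the evidence is actually related.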


Hwang Woo Suk, Blogs, and Mann

I’ll make it clear from the outset that the falsification of data by Hwang Woo Suk and the flawed results of Mann, Bradley and Hughes are completely different situations. There are some similarities, though. One seemed to open a door to therapeutic cloning that could benefit millions of people with debilitating illnesses such as Alzheimer’s and Parkinson’s disease. The others provided a breakthrough view of millennial climate history that seemed to prove humans were altering climate in an unprecedented way. Both have thousands of supporters wanting to give them a chance to prove the findings correct. And in both cases the problems were discovered by scrutiny outside the peer review process.

At the hearings, “Questions Surrounding the ‘Hockey Stick’ Temperature Studies: Implications for Climate Change Assessments”, the question not asked directly was how many other studies have been subjected to independent scrutiny. Certainly none by the IPCC, which only conducts a literature review. To my knowledge no other climate studies have been audited as McIntyre and McKitrick did for the hockey stick. Yet this scrutiny is probably less than might be done in the evaluation of major engineering projects, ore body reserves or any significant business venture. The hockey stick was the one climate study, among many, to be audited in depth, and it was found wanting.


How a Hockey Stick led to a Committee Hearing

According to the Wall Street Journal editorial, a semiretired Toronto minerals consultant and an economist, with about $5,000 of his own money and time, took on an apparently simple task — trying to double-check the influential graphic known as the “hockey stick” — and eventually confronted an influential scientific community before a Congressional Committee, and won. What was the sequence of events?

1990 — MWP — Based on numerous anecdotal studies, the Intergovernmental Panel on Climate Change (IPCC) in 1990 included a schematic view of the past 1000 years in which a period of elevated temperatures known as the Medieval Warm Period was followed by the Little Ice Age, and then by a new period of global warming.


Alternative options for temperature history. IPCC 1990 Figure 7.1.c (red), MBH 1999 40 year average used in IPCC TAR 2001 (blue), and Moberg et al 2005 low frequency signal (black) from Wikipedia.

1998 — MBH98 — Mann, Bradley and Hughes published a quantitative study using a new climate field methodology, showing temperatures as a hockey stick shape and eliminating the Medieval Warm Period. It flattened the fluctuations in global temperatures over most of the past millennium (the handle of the hockey stick) until the 20th century, where the rate of global warming takes off in a sharp upward surge (the blade of the hockey stick).


Avian Influenza No Spatial Change

The WHO Epidemic and Pandemic Alert and Response reports no change in the spatial distribution of human cases of H5N1 avian influenza, although the number of cases occurring in Indonesia is consistent with linear projections. Below are the most recent cases from affected countries.

11 April 2006 — The case is a 17-year-old girl who developed symptoms on 11 March. She was seriously ill with bilateral pneumonia but has since fully recovered and been discharged from hospital.

6 April 2006 — The Ministry of Health in Cambodia has confirmed the country’s sixth case of human infection with the H5N1 avian influenza virus. The case occurred in a 12-year-old boy from the south-eastern province of Prey Veng, which borders Viet Nam.

16 June 2006 — The Ministry of Health in China has confirmed the country’s 19th case of human infection with the H5N1 avian influenza virus. The patient is a 31-year-old man employed as a truck driver in Shenzhen City, Guangdong Province, near the border with Hong Kong.

12 May 2006 — The Ministry of Health in Djibouti has confirmed the country’s first case of human infection with the H5N1 avian influenza virus. The patient is a 2-year-old girl from a small rural village in Arta district. She developed symptoms on 23 April. She is presently in a stable condition with persistent symptoms.

5 May 2006 — The Ministry of Health in Egypt has announced the country’s 5th death from H5N1 avian influenza. The death occurred in a previously reported case, a 27-year-old woman from Cairo. She was hospitalized on 1 May and died on 4 May.

14 July 2006 — The Ministry of Health in Indonesia has confirmed the country’s 53rd case of human infection with the H5N1 avian influenza virus. The case, which was fatal, occurred in a 3-year-old girl from a suburb of Jakarta.

1 March 2006 — A WHO collaborating laboratory in the United Kingdom has verified H5N1 avian influenza as the cause of death in a 39-year-old Iraqi man, previously announced by the Ministry of Health.

9 December 2005 — The Ministry of Public Health in Thailand has confirmed a further case of human infection with the H5N1 avian influenza virus. The case occurred in a 5-year-old boy, who developed symptoms on 25 November, was hospitalized on 5 December, and died on 7 December. The child resided in the central province of Nakhonnayok.

30 January 2006 — A WHO collaborating laboratory in the United Kingdom has now confirmed 12 of the 21 cases of H5N1 avian influenza previously announced by the Turkish Ministry of Health. All four fatalities are among the 12 confirmed cases.

Viet Nam
25 November 2005 — The Ministry of Health in Viet Nam has confirmed a further case of human infection with H5N1 avian influenza. The case is a 15-year-old boy from Hai Phong Province. He developed symptoms on 14 November and was hospitalized on 16 November. He has been discharged from hospital and is recovering.


When r2 Regression is Very High

The coefficient of determination, or r2, is the ratio of explained variation to total variation in a regression of Y on X. It ranges from 0 to 1 (it is the square of the correlation coefficient r, which ranges from −1 to 1). A value of 1 is an exact linear relationship, with all data points lying on the same line. A value of 0 shows no linear relationship between the variables. If r2 is high, most people assume X and Y are related. That may be so if the errors are “normal”, or independent and identically distributed. But r2 may also be high when X is not related to Y, because natural series can have “spurious significance”.
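A quick simulation shows the effect. The sketch below, in Python, compares the average r2 between pairs of unrelated white-noise series with the average between pairs of unrelated random walks, which are a simple stand-in for autocorrelated natural series. The series length, number of trials, seed, and function names are arbitrary choices of mine.

```python
import random


def r_squared(x, y):
    """r2 for a simple linear regression: the squared Pearson correlation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def white_noise(n, rng):
    return [rng.gauss(0, 1) for _ in range(n)]


def random_walk(n, rng):
    total, walk = 0.0, []
    for _ in range(n):
        total += rng.gauss(0, 1)  # each step adds independent noise
        walk.append(total)
    return walk


def mean_r2(make_series, trials=200, n=200, seed=1):
    """Average r2 over many pairs of independently generated series."""
    rng = random.Random(seed)
    return sum(
        r_squared(make_series(n, rng), make_series(n, rng)) for _ in range(trials)
    ) / trials


noise_r2 = mean_r2(white_noise)
walk_r2 = mean_r2(random_walk)
print(f"white noise: {noise_r2:.3f}  random walks: {walk_r2:.3f}")
```

Neither kind of pair has any real relationship, yet the random walks routinely produce much higher r2 values than the white noise does.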

It is hard to do better than this set of articles on high correlation statistics from Steve McIntyre at ClimateAudit to explain “spurious significance”.


Simple Linear Regression of Rainfall

From forecasting the onset of the Monsoon in India, to mapping the drought in the USA and the worsening drought in Australia, insight from statistics into rainfall patterns affects everyone. The following documents contain technical information on regression models and rainfall, organized from the simplest linear regression to the more technical local spatial and temporal fitting techniques.

Statistics 301 Handout #30: Simple Linear Regression contains R code with simple regression exercises.
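For readers who want to see the simplest case before opening the handout, here is a minimal least-squares fit in Python. It is not taken from the handout. The elevation and rainfall figures are made-up numbers for illustration, and `fit_line` is a hypothetical helper.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: station elevation (m) and annual rainfall (mm).
elevation_m = [100, 200, 300, 400, 500]
rainfall_mm = [620, 700, 790, 850, 940]

slope, intercept = fit_line(elevation_m, rainfall_mm)
print(f"rainfall = {intercept:.0f} + {slope:.2f} * elevation")
```

The local fitting methods in the later documents can be thought of as refitting something like this within a moving neighbourhood instead of once for the whole data set.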

Partial Correlation Coefficients by Gerard E. Dallal, Ph.D. provides some climatic related examples of scatterplots and partial correlation coefficients.

More advanced approaches to reconstruction of rainfall fields use a form of local regression. Here a partial thin plate smoothing spline is used for Spatial Modelling of Climatic Variables on a Continental Scale.

Here is another good example where local fitting techniques have been used for Estimation of Precipitation by Kriging in the EOF Space of the Sea Level Pressure Field.

Some of the most advanced and insightful work on rainfall modeling is by Koutsoyiannis, e.g. An entropic-stochastic representation of rainfall intermittency: The origin of clustering and persistence, Water Resources Research, 42(1), W01401, 2006.

Here is the technical documentation of software SPLINA and SPLINB.

The additive regression model appears to be a practical option for analysing spatially varying effects of several predictors on observed phenomena. It is attractive from the point of view of overcoming curse of dimension problems associated with the analysis of noisy multivariate data. Moreover its implementation is a straightforward extension of standard thin plate spline
