GARP is an acronym for Genetic Algorithm for Rule Set Production. It is an algorithm designed primarily for predicting the potential distribution of biological entities from raster-based environmental and biological data. This post works through examples of interpreting different sets of rules developed by GARP.
Abundance of Greater Glider
The Greater Glider (Petauroides volans) is a species of gliding possum found extensively in old-growth forest regions of south-eastern Australia. It nests in hollows created by the broken limbs of eucalyptus trees, and feeds on the leaves of a variety of eucalyptus species. The species is of interest for conservation because its presence is an indicator of the presence of a suite of arboreal marsupial species.
The Waratah Creek data set maps an area of 1600 ha at Waratah Ck as a 20×20 grid. It contains eight data layers. The first is the density of Greater Gliders at four levels, while the remaining layers are forest inventory variables known to be relevant to possum density. The data set, together with a comparison of the performance of a number of other Artificial Intelligence methods on it, is described in Stockwell et al. (1990). The variables are shown in row-column order in Figure 1.
The data set is a useful, small test data set for comparing predictive algorithms, and is included in the distribution of the GARP program. It is particularly useful for this purpose because it contains complex combinations of ecological relationships.
The relationships of the Greater Glider to habitat have been the subject of a number of studies. High glider density is due to a combination of factors, including high-nutrient vegetation and the presence of trees of sufficient age and type to have nesting hollows (Braithwaite 1984, Kavanagh 1987, Davey 1989). The models of response are multidimensional and non-linear. There are also exceptions to the general pattern, such as the dark rectangular area on the left: areas of forestry development that have zero glider density but are otherwise located on high quality habitat.
The following tables and figures contain the results from one run of GARP on this data set. The predicted density in Figure 2a is the result of applying all the rules generated by GARP together, and can be compared with the actual density in Figure 1a. The accuracy of the model at predicting each level of abundance is shown in Table 2. Level 0 is predicted with an accuracy of 1.0, while level 3, or high density, is predicted with a low accuracy of 0.3. The overall accuracy, however, is 0.625.
Table 1. Example of final set of rules, ordered by usage.
...
TRANSLATE - model into natural language
No-rule number, Type, Prior-prior, Post-accuracy, Sig-Significance, Cov-coverage, Use-usage

No  Type  Prior  Post  Sig    Cov   Use
20  m     0.31   0.55   6.61  0.43  0.335
    IF Dev=[ 1, 2]c AND StC=[ 0, 1]c AND SdC=[ 3, 5]c AND StQ=[ 1, 3]c AND FlN=[ 2, 3]c AND Slp=[ 1, 1]c AND Ero=[ 3, 3]c
    THEN ExM= 3
13  r     0.22   0.67  10.79  0.25  0.225
    IF - Dev*0.10 c - StC*0.10 c + SdC*0.09 c + StQ*0.06 c - FlN*0.19 c + Slp*0.40 c
    THEN ExM= 1
12  d     0.28   0.91  11.42  0.17  0.160
    IF Dev=[ 1, 2]c AND StC=[ 0, 1]c AND SdC=[ 2, 5]c AND StQ=[ 1, 3]c AND FlN=[ 1, 3]c AND Slp=[ 2, 2]c
    THEN ExM= 1
 0  d     0.19   1.00  18.06  0.19  0.060
    IF Dev=[ 1, 2]c AND SdC=[ 0, 0]c AND StQ=[ 0, 3]c
    THEN ExM= 0
19  a     0.25   0.81   6.73  0.07  0.043
    IF Dev= 0 AND StC= 1 AND SdC= 2 AND StQ= 1 AND FlN= 1 AND Slp= 3 AND Ero= 1
    THEN ExM= 1
18  a     0.22   1.00   7.35  0.04  0.035
    IF SdC= 1 AND StQ= 1 AND FlN= 0
    THEN ExM= 3
25  a     0.22   0.80   5.38  0.04  0.032
    IF Dev= 0 AND StC= 0 AND SdC= 2 AND Slp= 3
    THEN ExM= 2
...
Table 2. The accuracy of the set of rules at predicting each of the levels of Greater Glider abundance, and the overall accuracy of the model.
BV  Value  Prop.  Acc.  Error(S.D.)
0   1      0.25   1.00  0.00
1   85     0.25   0.73  0.04
2   169    0.25   0.57  0.05
3   254    0.25   0.30  0.04

NoMasked      0.000
Unpredicted   0.000
Predicted     1.000
Accuracy      0.625  s.d. 0.024
Overall Acc.  0.625  s.d. 0.024
The combined model is not the end of the story, however, because each rule in the list can also be viewed in isolation. This provides the potential for a variety of ecological interpretations. The result of using each of the seven rules singly is shown in Figure 2b-h. Each of these rules contributed to the final map, since a rule was applied wherever its expected accuracy was greater than that of any other rule.
The multiple rules generated by GARP can provide alternative niche models for the given data. Choosing between these alternatives is one way for an expert to bring specialized expertise and biological insight to the problem.
In the example above, the first rule, rule 20, is an envelope rule. The area it predicts is shown in Figure 2b. It says that if the conjunction of certain ranges of all variables occurs in a data point, then predict high quality habitat with a probability of 0.55. This is the most frequently used rule in prediction, with a usage of 0.33, or one third of the test data set.
Another rule in the set predicting high density is the atomic rule 18. It predicts high-density areas (Figure 2g) using single categorical values of the variables: stand condition (low), site quality (low) and floristic nutrients (high). With a coverage of 0.04 (Table 1), this rule predicts far fewer data points than rule 20 (Figure 2b).
Thus, there are two types of explanation of high quality habitat represented by the two rules. The first covers most high quality areas with low accuracy (0.55). The second represents a particular type of high quality habitat, with a high probability (1.0).
The reason for these two explanations is as follows. The precondition of rule 20 requires a conjunction of low to high stand condition and site quality with medium to high floristic nutrients and low slope, conditions satisfied by many cells in the study area. The need for higher levels of floristic nutrients and low slope matches the general conditions for high Greater Glider density discovered via multivariate statistical analysis (Braithwaite 1984). The data points satisfying the precondition of rule 18 are located in swampy areas with Swamp Gum (Eucalyptus ovata). These trees are unsuitable for logging (low stand condition) and have low growth rates (low site quality), but they provide high floristic nutrients because of the species present.
These particular conditions of the Waratah Creek region have been described in separate observational studies (Kavanagh 1987) and contribute important glider habitat in the area.
The second most-used rule, rule 13, predicts areas of low quality habitat (Figure 2c) and is a logistic rule. That is, a linear combination of variables is used to predict low quality habitat. Examining the coefficients of this equation shows that the largest coefficient (Slp, 0.40) is positive, while the coefficient for floristic nutrients (FlN, -0.19) is negative. Thus, the rule expresses the relationship that the probability of low quality habitat increases with slope and decreases with floristic nutrients, correlates identified in a separate study (Davey 1989).
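The sign reading of the logistic rule can be checked with a small sketch. The coefficients are copied from rule 13 in Table 1. The zero intercept, the use of a plain logistic function, and the cell values are assumptions made for illustration, since the listing does not show how GARP turns the linear combination into a prediction.

```python
import math

# Coefficients copied from rule 13 in Table 1.
RULE_13 = {"Dev": -0.10, "StC": -0.10, "SdC": 0.09,
           "StQ": 0.06, "FlN": -0.19, "Slp": 0.40}


def linear_score(cell, coefficients=RULE_13, intercept=0.0):
    """Weighted sum of the variables. The intercept is an assumed placeholder."""
    return intercept + sum(w * cell[name] for name, w in coefficients.items())


def logistic(score):
    """Standard logistic function, mapping a score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical cell values, chosen only to show the direction of each effect.
base = {"Dev": 1, "StC": 1, "SdC": 3, "StQ": 2, "FlN": 2, "Slp": 1}
steeper = dict(base, Slp=3)   # same cell on a steeper slope
richer = dict(base, FlN=3)    # same cell with higher floristic nutrients

p_base = logistic(linear_score(base))
p_steeper = logistic(linear_score(steeper))
p_richer = logistic(linear_score(richer))
```

Under these assumptions `p_steeper` exceeds `p_base`, which in turn exceeds `p_richer`: raising slope pushes the rule toward predicting low quality habitat, and raising floristic nutrients pushes it away.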
Another rule with high accuracy is rule 0, predicting an abundance of zero with an accuracy of one. These areas (Figure 2e) are situated on land cleared for forestry, yet exhibit properties of high quality habitat such as low slope, high floristic nutrients and high site quality. Exceptional conditions of this type are not handled well by curve fitting, yet are well described by a specific rule covering the condition. Thus, the distribution map is built up from general and specific models, applying the most accurate rule available in each situation. The system ‘constructs’ a model from simple components according to their applicability and accuracy, that is, from individual significant and accurate models of the relationship of the species to the environment.
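A minimal sketch of this "most accurate applicable rule" scheme follows. The three rules are transcribed from Table 1. The `Rule` class, the helper names and the decision to return `None` for uncovered cells are hypothetical choices for illustration, not the GARP implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Cell = Dict[str, int]


@dataclass
class Rule:
    number: int
    accuracy: float                  # the "Post" column in Table 1
    applies: Callable[[Cell], bool]  # the IF part
    prediction: int                  # predicted ExM level, the THEN part


def in_ranges(bounds):
    """Precondition that holds when every listed variable is inside its range."""
    return lambda cell: all(lo <= cell[name] <= hi
                            for name, (lo, hi) in bounds.items())


def equals(values):
    """Precondition that holds when every listed variable has one exact value."""
    return lambda cell: all(cell[name] == v for name, v in values.items())


# Rules 20, 0 and 18 as listed in Table 1.
RULES: List[Rule] = [
    Rule(20, 0.55, in_ranges({"Dev": (1, 2), "StC": (0, 1), "SdC": (3, 5),
                              "StQ": (1, 3), "FlN": (2, 3), "Slp": (1, 1),
                              "Ero": (3, 3)}), 3),
    Rule(0, 1.00, in_ranges({"Dev": (1, 2), "SdC": (0, 0), "StQ": (0, 3)}), 0),
    Rule(18, 1.00, equals({"SdC": 1, "StQ": 1, "FlN": 0}), 3),
]


def predict(cell: Cell, rules: List[Rule] = RULES) -> Optional[int]:
    """Return the prediction of the most accurate rule whose precondition holds."""
    matching = [rule for rule in rules if rule.applies(cell)]
    if not matching:
        return None  # assumed behaviour: no rule covers this cell
    return max(matching, key=lambda rule: rule.accuracy).prediction
```

A cell with `Dev=1, SdC=0, StQ=3` is claimed by rule 0 and predicted as level 0 whatever its slope or nutrient values, which is how a specific rule overrides the general pattern.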
The example above used data on abundance levels of a species. For many species, only data on presence are available. These are collectively referred to as ‘museum data’ because museums hold a great deal of data of this sort, taken from the collection locality written on the labels of their specimens. In using the system with museum data, the training set is developed as a 50/50 sample: half from the recorded locations (presences) and half from a random sample of other locations (background).
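The 50/50 sampling can be sketched as follows. The function name, the sampling with replacement and the exclusion of presence cells from the background pool are assumptions made for illustration; the text only states that the two halves are presences and randomly sampled other locations.

```python
import random


def build_training_set(presence_cells, all_cells, n_total, seed=0):
    """Return n_total labelled cells, half PRESENT and half BACKGROUND."""
    rng = random.Random(seed)
    half = n_total // 2
    presence_lookup = set(presence_cells)
    # Background is drawn from cells that carry no presence record (assumption).
    other_cells = [cell for cell in all_cells if cell not in presence_lookup]
    # Sampling with replacement lets a handful of records still fill their half.
    presences = [rng.choice(presence_cells) for _ in range(half)]
    background = [rng.choice(other_cells) for _ in range(half)]
    return ([(cell, "PRESENT") for cell in presences]
            + [(cell, "BACKGROUND") for cell in background])


# Hypothetical 5 x 5 grid with two presence records.
grid = [(row, col) for row in range(5) for col in range(5)]
records = [(0, 0), (2, 3)]
training = build_training_set(records, grid, n_total=10)
```

Balancing the two classes this way gives every rule a prior of 0.50, which matches the Prior column of the presence-only model listed in Table 3.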
The prediction is developed using the rules and the environmental data to generate predictions of all values at each grid cell in the environmental data layers. The output of the utility has the value 254 when present and 0 when absent. The value 255 is a masked area, outside the area of interest.
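The output coding can be illustrated with a short sketch. The three byte values come from the text. The grid representation as nested lists and the function names are hypothetical.

```python
PRESENT, ABSENT, MASKED = 254, 0, 255  # byte values described in the text


def encode(predicted_present, inside_area):
    """Combine two boolean grids into one grid of output byte values."""
    return [
        [MASKED if not inside else (PRESENT if present else ABSENT)
         for present, inside in zip(present_row, inside_row)]
        for present_row, inside_row in zip(predicted_present, inside_area)
    ]


def summarize(grid):
    """Count the cells in each category of an encoded grid."""
    flat = [value for row in grid for value in row]
    return {"present": flat.count(PRESENT),
            "absent": flat.count(ABSENT),
            "masked": flat.count(MASKED)}


# Hypothetical 2 x 2 example: one cell lies outside the area of interest.
encoded = encode([[True, False], [True, True]],
                 [[True, True], [False, True]])
counts = summarize(encoded)
```

Keeping the mask as a separate value means that cells outside the area of interest are never counted as absences when the map is summarized.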
We use the Cerulean Warbler (Dendroica cerulea) as an example. Cerulean Warblers nest in extensive tracts of mature deciduous forest in mid-story canopy trees, making an open-cup nest (Curson et al. 1994). The species is a long-distance migrant, wintering in South America. It feeds primarily on insects. It is a rare species, and its conservation status is regarded as vulnerable, potentially endangered.
The records for the Cerulean Warbler were taken from the Breeding Bird Survey (BBS) (Sauer et al. 1997). The environmental variables are a number of climatic surfaces for the US, prepared by the Oregon Climate Service. These surfaces were developed through interpolation of weather station data using the PRISM method (Daly et al. 1997). The variables include annual precipitation, mean maximum temperature, mean minimum temperature, mean temperature, mean range of temperature, cooling degrees, heating degrees, growing days, and snowfall.
The prediction is in general agreement with the distribution of data points (Figure 3). There are differences, however. Firstly, there are predictions of marginal habitat in the east without sightings of the species. Secondly, areas of the state of Wisconsin are not predicted, but contain data points.
In general, over-prediction is common in models using only climatic data. While the model indicates areas that are climatically suitable, the species may not occur there for a range of reasons. One is that the species is no longer present, due to changes in habitat in that area. Another is that the areas are geographically separated and passage between them is rare for the species. Conditions may, however, have been different in the past, with the entire area occupied by one species. In this case, geographic separation and evolutionary change combine to produce similar climatic areas now occupied by relatives of a species. The area of predicted habitat in the west is not occupied by the Cerulean Warbler but could be occupied by a similar species with the same habitat requirements. Prediction of potential distributions is therefore of value for reintroducing species and for searching for new occurrences of a species.
Figure 4 shows the prediction for an alternative modeling method, using envelope rules. This method, in comparison with the evolutionary genetic algorithm, produces the model through explicit analysis of the data, by fitting an envelope around the values of the independent variables in the data points. It is expressed in the GARP system using two rules, one predicting presence and one predicting absence, with a region of overlap between the two. This results in the prediction of high, low and marginal habitat as seen in Figure 4. This illustrates the generality of using a system of rules for a model, as other models can be represented as single rules or combinations of a small number of rules.
Table 3. Listing of model using only envelope rules. Two rules are used; the first predicts presence, the second absence. The area of overlap between them indicates marginal habitat.
TRANSLATE - model into natural language
No-rule number, Type, Prior-prior, Post-accuracy, Sig-Significance, Cov-coverage, Use-usage

No  Type  Prior  Post  Sig    Cov   Use
0   m     0.50   0.76  20.65  0.63  0.574
    IF usppt=[77317,152666]mm AND ustmax=[50,77]deg AND ustmin=[29,51]deg AND ustavg=[40,64]deg AND ustrange=[35,60]deg AND uscdd=[134,2073]deg AND ushdd=[2863,9390]deg
    THEN Taxon=PRESENT
1   !     0.50   0.89  25.99  0.44  0.426
    IF NOT usppt=[77317,147085]mm AND ustmax=[52,74]deg AND ustmin=[29,49]deg AND ustavg=[41,61]deg AND ustrange=[39,58]deg AND uscdd=[178,1872]deg AND ushdd=[2988,9079]deg
    THEN Taxon=BACKGROUND
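How two envelope rules yield three habitat classes can be sketched as below. The ranges are two of the seven variables from each rule in Table 3. The min/max fitting function is a plain envelope fit, and the way the narrower envelope of rule 1 is derived is not stated here, so both envelopes are simply taken from the table. The class labels follow the Table 3 caption.

```python
def fit_envelope(points):
    """Min/max range of each variable over a list of presence points."""
    names = points[0].keys()
    return {name: (min(p[name] for p in points), max(p[name] for p in points))
            for name in names}


def inside(cell, envelope):
    """True when every variable of the cell lies within the envelope."""
    return all(lo <= cell[name] <= hi for name, (lo, hi) in envelope.items())


# Two of the seven ranges from each rule in Table 3.
RULE_0 = {"ustmax": (50, 77), "ustmin": (29, 51)}  # IF inside THEN PRESENT
RULE_1 = {"ustmax": (52, 74), "ustmin": (29, 49)}  # IF NOT inside THEN BACKGROUND


def classify(cell):
    """Combine the presence rule and the negated rule into three classes."""
    presence_fires = inside(cell, RULE_0)
    background_fires = not inside(cell, RULE_1)
    if presence_fires and background_fires:
        return "marginal"  # both rules apply: the area of overlap
    if presence_fires:
        return "high"      # only the presence rule applies
    return "low"           # only the background rule applies
```

For example, a cell with `ustmax=76` and `ustmin=40` falls inside the wider envelope but outside the narrower one, so both rules fire and it is classed as marginal.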
The envelope rules can also be used within a full GARP model containing many more rules. Some differences in the results of the two approaches can be identified. The envelope method develops a predicted region around the recorded points within their range of climate. It includes the areas between Wisconsin and Missouri where the bird has not been recorded. GARP, on the other hand, does not include the Wisconsin points. This is because GARP penalizes rules that include climatic regions where points do not occur, as it attempts to optimize an overall measure of accuracy that predicts both presence and background accurately.
Another way to express this is that the envelope rule minimizes omissions (presences predicted to be background or outside the predicted area) while GARP minimizes the sum of omissions and commissions (background points predicted to be present). GARP will therefore leave out presences that have sufficiently different climates to make them hard to include without decreasing the accuracy.
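The two error types can be made concrete with a small sketch. The label strings follow Table 3. The function name and the eight-point data set are hypothetical, invented only to show the trade-off.

```python
def error_rates(observed, predicted):
    """Return (omission, commission) as fractions of all test points."""
    pairs = list(zip(observed, predicted))
    omissions = sum(1 for o, p in pairs
                    if o == "PRESENT" and p != "PRESENT")
    commissions = sum(1 for o, p in pairs
                      if o == "BACKGROUND" and p == "PRESENT")
    return omissions / len(pairs), commissions / len(pairs)


P, B = "PRESENT", "BACKGROUND"
observed = [P, P, P, P, B, B, B, B]

# A wide model that captures every presence but also three background points.
wide = [P, P, P, P, P, P, P, B]
# A tighter model that gives up one outlying presence to avoid commissions.
tight = [P, P, P, B, B, B, B, B]

wide_omission, wide_commission = error_rates(observed, wide)
tight_omission, tight_commission = error_rates(observed, tight)
```

In this invented case the wide model has no omissions but a combined error of 0.375, while the tight model accepts an omission of 0.125 and has the lower combined error, which is the behaviour described for GARP.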
The capacity to exclude outliers or separate populations can be useful if the omissions as a group come from a different population to the rest of the points in the data set. This could be due to misclassification or confusion with a related bird, such as the Blackburnian Warbler (Dendroica fusca). In their account of D. fusca, Dunn and Garrett (1997) suggest that many features are similar to those of the Cerulean Warbler. These include postures, songs, and the preference for mature forest habitats on the breeding grounds; some aspects of immature plumages are quite similar as well. This is probably not the explanation in this case, as BBS workers are generally very qualified. The omissions could also be due to correctly identified birds captured in transit or occupying marginal habitat not ideal for the success of breeding populations, although BBS censuses are carried out late enough for migrants to be absent.
Because the predictions above are produced from the predicted value of the most accurate rule at each point, the distribution presents a sharp edge to the range. The GARP probability surface method averages the result of the best rules (Figure 5).
Genetic algorithms are a method for developing robust models for species distribution studies. The use of multiple models contributes to this robustness, as it can deal with a range of possible relationships in the data. Some modifications to the classical GA were necessary, however, to ensure efficient and robust performance. These were the use of a rule archive for maintaining a set of distinct solutions, and the initialization of rules before input to the GA to provide good starting points for the algorithm to optimize.
While GARP is one of many methods used in the field, it does seem to give reliable results for a wide range of variables and species. The major advantage appears to be the robustness of the system: curve-fitting methods can work well in particular situations, but are sensitive to data characteristics such as cross-correlations. A further advantage of the method is that it can handle a large number of irrelevant and correlated variables. This reduces the effort required of the modeler in producing models for predicting distributions, as it is not necessary to identify a subset of variables for input to a statistical method.
Testing of the method is still going on. The main problem with the genetic algorithm is that, as a stochastic method, it gives slightly different answers every time. This is a difficulty when a user wants to get the best predicted distribution. It means that users must average the results of many modeling runs. The variability can be reduced, for example by running the algorithm for longer, over more generations, until it becomes more stable. It cannot be eliminated, however, as it is in deterministic methods based on curve fitting.
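Averaging over runs can be sketched as a cell-by-cell mean of the prediction maps. The coding of each run as a grid of 1 (present) and 0 (absent) and the function name are assumptions for illustration, as are the four tiny runs shown.

```python
def average_runs(runs):
    """Cell-by-cell mean of equally sized prediction grids (1 present, 0 absent)."""
    n_runs = len(runs)
    n_rows, n_cols = len(runs[0]), len(runs[0][0])
    return [[sum(run[r][c] for run in runs) / n_runs for c in range(n_cols)]
            for r in range(n_rows)]


# Four hypothetical runs over a grid of one row and three cells.
runs = [
    [[1, 1, 0]],
    [[1, 0, 0]],
    [[1, 1, 0]],
    [[1, 0, 0]],
]
frequency = average_runs(runs)
```

Each value in `frequency` is the fraction of runs that predicted presence at that cell, so cells on which the runs disagree show up as intermediate values rather than as a sharp edge.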
On the other hand, the variation between runs provides a source of variability that enables statistical tests to be applied. This is used in exploratory and comparative studies, where the variability and the potential to suggest multiple solutions are desirable features. Studies that use a large number of predictions, such as studies of patterns of diversity, are also well suited, because the variations in individual species distributions are less influential in the overall results.
DesktopGARP and OpenModeller are two implementations of GARP.
Stockwell, D.R.B., Davey, S.M., Davis, J.R. and Noble, I.R. 1990. Using inductions of decision trees to predict greater glider density. AI Applications in Natural Resource Management, 4:4, 33-43.
Braithwaite, L.W. 1984. On identifying important habitat characteristics and planning conservation strategy for arboreal marsupials within the Eden Woodpulp Concession area. Pages 501-508 in: Possums and Gliders, A.P. Smith and I.P. Hume, editors. Australian Mammal Society and Surrey Beatty and Sons, Sydney.
Kavanagh, R.P. 1987. Floristic and phenological characteristics of a eucalyptus forest in relation to its use by arboreal marsupials. M.Sc. thesis, Department of Forestry, Australian National University, Canberra.
Davey, S.M. 1989. The environmental relationships of arboreal marsupials in a eucalypt forest: a basis for Australian forest wildlife management. Ph.D. thesis, Department of Forestry, Australian National University, Canberra.
Curson J., Quinn D., and Beadle D., 1994. Warblers of The Americas. Houghton Mifflin.
Sauer, J. R., J. E. Hines, G. Gough, I. Thomas, and B. G. Peterjohn. 1997. The North American Breeding Bird Survey Results and Analysis. Version 96.3. Patuxent Wildlife Research Center, Laurel, MD
Daly, C., G. Taylor, and W. Gibson, 1997, The PRISM Approach to Mapping Precipitation and Temperature, 10th Conf. on Applied Climatology, Reno, NV, Amer. Meteor. Soc., 10-12.
Dunn, J. and Garrett, K. 1997. A Field Guide to the Warblers of North America. Houghton Mifflin: New York.