The problem with many models, from climate systems to multi-species and ecosystem processes to consumer purchasing behaviour, is that we often have very little understanding of the actual relationships between the variables in the system.
From our limited vantage point as observers of systems rather than experimenters on them, we see only many weakly correlated variables, often drawn from incomplete samples and widely ranging sources.
We need an automated method of developing structure from the given data that explicitly quantifies our belief that a model captures the behaviour of the system. Bayesian nets, belief nets, or graphical models begin to do this by assigning a level of belief to each of the possible values of parameters. That is, while a conventional simulation of climate, say, carries at most one value for each parameter at each point in a simulation run, a Bayesian network would represent the distribution of possible values for each parameter at each point in time.
Belief net construction can involve a manual process of knowledge engineering. Examples of systems for graphically structuring models are the Ptolemy project for modeling, simulation, and design of concurrent, real-time, embedded systems, and the freely available ‘scientific workflow’ tool called Kepler, in which the flow of data from one analytical step to another is captured in a formal workflow language.
Recent advances in machine learning and data mining have also yielded efficient methods for creating belief nets directly from data (Cooper and Herskovits, 1992).
In graphical presentation, Bayes Nets are directed acyclic graphs of probabilistic relationships between the values of variables. In directed graphs the nodes are linked with directional arrows (e.g. A->B->C or A->B<-C).
Acyclic graphs do not have cycles, or feedback loops. In practice, Bayes Nets have both a graph structure and algorithms for updating probabilities as information is propagated throughout the graph. Cycles in graphs would allow infinite feedbacks and oscillations that would prevent the parameters from settling to stable values. The Bayes Net is really composed of two models: one expresses the graph structure, and the other gives the probability values of the entries in the matrices, or parameters.
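The acyclicity requirement can be checked mechanically. The following is a minimal sketch, assuming the graph is given as a list of (parent, child) pairs; the function name is_acyclic and the example graphs are illustrative and not part of any particular Bayes Net package. It repeatedly removes nodes that have no remaining incoming arrows (Kahn's algorithm), and a cycle shows up as nodes that can never be removed.

```python
from collections import deque

def is_acyclic(edges):
    """Return True if the directed graph given as (parent, child) pairs has no cycle."""
    nodes = {n for edge in edges for n in edge}
    children = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1

    # Start from nodes with no incoming arrows and peel the graph away.
    queue = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)

    # Nodes on a cycle never reach indegree 0, so they are never removed.
    return removed == len(nodes)

chain = [("A", "B"), ("B", "C")]                # A->B->C
collider = [("A", "B"), ("C", "B")]             # A->B<-C
loop = [("A", "B"), ("B", "C"), ("C", "A")]     # feedback loop, not a valid Bayes Net

print(is_acyclic(chain), is_acyclic(collider), is_acyclic(loop))
```

Both arrow patterns used as examples in the text pass the check, while the feedback loop is rejected.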
In the Bayes Net, each node contains parameters that make up a probability distribution. Nodes at the edge of the net, those with no incoming arrows, hold marginal probabilities (e.g. P(A)). Every other node holds a conditional probability function given its parents (e.g. P(B|A) or P(C|B)). The Bayes Net, together with rules for inference, allows all probabilities, including the joint probability of these three variables (P(A,B,C)), to be calculated given any state of the system.
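As a minimal sketch of how the node tables combine, consider the chain A->B->C with binary variables, where the joint probability factorises as P(A,B,C) = P(A) P(B|A) P(C|B). The probability values below are invented purely for illustration, and the brute-force enumeration in query is only practical for very small nets; real systems use more efficient propagation algorithms.

```python
from itertools import product

# Hypothetical tables for binary variables in the chain A->B->C.
p_a = {True: 0.3, False: 0.7}                      # P(A)
p_b_given_a = {True: {True: 0.8, False: 0.2},      # P(B|A), indexed [a][b]
               False: {True: 0.1, False: 0.9}}
p_c_given_b = {True: {True: 0.9, False: 0.1},      # P(C|B), indexed [b][c]
               False: {True: 0.2, False: 0.8}}

def joint(a, b, c):
    """P(A=a, B=b, C=c) as the product of the node tables."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

def query(target, evidence):
    """P(target | evidence) by summing the joint over all consistent states.

    Both arguments are dicts keyed by 'A', 'B', 'C'.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c in product([True, False], repeat=3):
        state = {"A": a, "B": b, "C": c}
        if any(state[k] != v for k, v in evidence.items()):
            continue
        p = joint(a, b, c)
        denominator += p
        if all(state[k] == v for k, v in target.items()):
            numerator += p
    return numerator / denominator

prediction = query({"C": True}, {"A": True})   # reasoning forward along the arrows
diagnosis = query({"A": True}, {"C": True})    # reasoning backward from an observed C
print(prediction, diagnosis)
```

The same tables answer both the forward (prediction) and backward (diagnosis) question, which is the sense in which any probability can be calculated given any observed state.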
The major benefit of Bayes Nets for real world applications is the capacity to represent the probability distributions of large numbers of variables using low dimensional matrices, since each node's table involves only that node and its parents. The medical diagnostic system for lymph node pathology called Pathfinder contains many hundreds of nodes (Heckerman 1988). The distributions computed from a network correspond to common inference tasks, such as prediction, abduction (explaining away), and diagnosis, and form the basis for rational decision procedures. Thus the network can be used for a variety of purposes other than exploration and understanding of the structure of variables in a data set.
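The saving from low dimensional tables can be made concrete by counting free parameters. The sketch below assumes discrete variables and compares a full joint table with a factored network; the function names and the 20-variable chain are hypothetical choices for illustration only.

```python
def full_joint_parameters(cardinalities):
    """Free parameters in one table over all variables jointly."""
    total = 1
    for r in cardinalities.values():
        total *= r
    return total - 1  # probabilities must sum to one

def network_parameters(cardinalities, parents):
    """Free parameters when each node stores a table over itself and its parents."""
    total = 0
    for node, r in cardinalities.items():
        parent_configs = 1
        for p in parents.get(node, []):
            parent_configs *= cardinalities[p]
        total += (r - 1) * parent_configs
    return total

# A chain X0->X1->...->X19 of binary variables.
names = [f"X{i}" for i in range(20)]
cardinalities = {name: 2 for name in names}
parents = {names[i]: [names[i - 1]] for i in range(1, 20)}

print(full_joint_parameters(cardinalities))        # 1048575
print(network_parameters(cardinalities, parents))  # 39
```

For this chain the full joint table needs over a million numbers, while the network needs one for the first node and two for each of the other nineteen.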
In a causal Bayesian network the directed arcs of the graph are interpreted as representing causal relations. An important configuration of variables in a causal net is a chain (e.g. A->B->C). Because of conditional independence relations, the distribution of C can be determined from the value of B alone. The variable B is proximal to C and is called a ‘screening’ variable, because its value screens off the distal variable A, making the value of A irrelevant once B is known.
This situation illustrates the notion of distal and proximal determinants in ecology. For example, vegetation type would be regarded as more proximal to abundance of an herbivorous animal than climate. Vegetation type screens climate because knowledge of the vegetation type is generally sufficient to predict animal abundance without reference to climatic regimes, since vegetation types are themselves determined by climate. The distal variable is often called a driving variable, one that determines values of the system without itself being determined by any other variable in the system. Application of these concepts provides understanding when Bayesian nets are used to model ecological systems.
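The screening effect in the vegetation example can be sketched numerically. The chain climate->vegetation->abundance and every probability below are hypothetical values chosen only to illustrate the idea; they are not estimates from any data set.

```python
from itertools import product

# Hypothetical tables for the chain climate->vegetation->abundance.
p_climate = {"wet": 0.4, "dry": 0.6}
p_vegetation = {"wet": {"forest": 0.7, "grassland": 0.3},      # indexed [climate][vegetation]
                "dry": {"forest": 0.2, "grassland": 0.8}}
p_abundance = {"forest": {"high": 0.1, "low": 0.9},            # indexed [vegetation][abundance]
               "grassland": {"high": 0.6, "low": 0.4}}

def joint(climate, vegetation, abundance):
    return (p_climate[climate]
            * p_vegetation[climate][vegetation]
            * p_abundance[vegetation][abundance])

def p_high(**given):
    """P(abundance = high | given), where given may fix climate and/or vegetation."""
    numerator = denominator = 0.0
    for c, v, a in product(p_climate, p_abundance, ["high", "low"]):
        state = {"climate": c, "vegetation": v}
        if any(state[k] != value for k, value in given.items()):
            continue
        p = joint(c, v, a)
        denominator += p
        if a == "high":
            numerator += p
    return numerator / denominator

# Without vegetation, climate is informative about abundance.
print(p_high(climate="wet"), p_high(climate="dry"))

# Once vegetation is known, climate adds nothing further.
print(p_high(vegetation="grassland", climate="wet"),
      p_high(vegetation="grassland", climate="dry"))
```

In this sketch climate changes the predicted abundance only when vegetation is unobserved; with vegetation fixed, both climates give the same answer, which is what it means for the proximal variable to screen the distal one.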
Given the many ways the Bayes Net can be used, for prediction and optimization for example, we would expect it to have a role in the integration of environmental information, e.g., remote sensing, biodiversity, and GIS data, for a variety of purposes such as monitoring and conservation planning (Stockwell et al. 1999). Additional research on identifying optimal structures would make the method even more useful, while clustering algorithms incorporated into the method may help in the analysis of continuous data.
A prototype implementation of the K2 algorithm for model induction, based on Cooper and Herskovits (1992), is available here.
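For readers unfamiliar with K2, the following is a rough sketch of the idea and is not the prototype referred to above. It assumes discrete data given as a list of dicts, a known ordering of the variables, and a cap on the number of parents per node. Each candidate parent set is scored with the logarithm of the Cooper and Herskovits (1992) metric, and parents are added greedily while the score improves. The function names, the max_parents default, and the toy data are all illustrative assumptions.

```python
from collections import Counter, defaultdict
from math import lgamma

def log_k2_score(data, node, parents, arity):
    """Log of the Cooper-Herskovits score for `node` with the given parent list."""
    counts = defaultdict(Counter)  # parent configuration -> counts of node values
    for row in data:
        config = tuple(row[p] for p in parents)
        counts[config][row[node]] += 1
    r = arity[node]
    score = 0.0
    for value_counts in counts.values():
        n_ij = sum(value_counts.values())
        score += lgamma(r) - lgamma(n_ij + r)                        # log (r-1)! / (N_ij + r - 1)!
        score += sum(lgamma(n + 1) for n in value_counts.values())   # log of the product of N_ijk!
    return score

def k2(data, order, arity, max_parents=2):
    """Greedy parent selection; a node may only take parents earlier in `order`."""
    parents = {node: [] for node in order}
    for i, node in enumerate(order):
        best = log_k2_score(data, node, parents[node], arity)
        candidates = set(order[:i])
        while candidates and len(parents[node]) < max_parents:
            scored = [(log_k2_score(data, node, parents[node] + [c], arity), c)
                      for c in sorted(candidates)]
            top_score, top = max(scored)
            if top_score <= best:
                break  # no candidate improves the score
            parents[node].append(top)
            candidates.remove(top)
            best = top_score
    return parents

# Toy data: B always copies A, while C is balanced within each value of A.
data = [{"A": a, "B": a, "C": c} for a in (0, 1) for c in (0, 1) for _ in range(5)]
arity = {"A": 2, "B": 2, "C": 2}
print(k2(data, ["A", "B", "C"], arity))  # {'A': [], 'B': ['A'], 'C': []}
```

On the toy data the search links B to A and leaves C unconnected. The result depends on the ordering supplied, which is one reason further research on identifying optimal structures is of interest.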
Cooper, G. and Herskovits, E. 1992. A Bayesian Method for the Induction of Probabilistic Networks from Data. Machine Learning Journal, 9, 309-347.
Heckerman, D.E. 1988. An empirical comparison of three inference methods. In Shachter, R., Levitt, T., Kanal, L., and Lemmer, J., editors, Uncertainty in Artificial Intelligence 4, 283–302. North-Holland, New York, 1990.
Stockwell, D.R.B., Arzberger, P., Fountain, T., and Helly, J. 1999. An Interface between Computing, Ecology and Biodiversity: Environmental Informatics. Korean J. of Ecology, 23(2): 101-106, 2000.