by David Stockwell and Bing Zhu for SRB Workshop, February 2-3, 2006, San Diego, due Dec 15th
Here we describe the use of the Storage Resource Broker (SRB) to support new data-intensive approaches to Environmental Niche Modeling (ENM) by providing access to cropped images from a remote SRB data store of almost 1000 global coverage data sets. The basic architecture of the system is illustrated in the figure below.
Figure 1. Illustration of the components and operation of the SRB WhyWhere data archive for ecological niche modeling. A large set of images and meta data are stored in a central archive. The client directs the server to crop an image in the archive using a server-side proxy operation. The cropped image is copied to the local directory and scaled by the client to the resolution required for the prediction algorithm. Illustrated is a prediction of a North American bird, the Cerulean Warbler.
ENM is a generic name for a range of geospatial modeling methods that, given the locations of a species as a list of latitudes and longitudes, return a model of the probability of the species occurring with respect to environmental variables (the ecological niche), and then use this model to project a distribution onto the landscape. In the past, ENMs were developed with small numbers of primarily climatic variables such as annual average temperature and rainfall. This approach ignores a large number of potential correlates, including monthly temperatures and rainfall, functions of these variables such as standard deviations, variables related to water availability and evapotranspiration, soil and vegetation habitat conditions, and topography. When this form of modeling is extended into the marine environment, each variable potentially exists in 3 dimensions. Combine this with remote sensing data, alternative versions of variables, different scales, and temporal factors such as time and duration, and the number of variables that one might want to examine for potential correlates expands rapidly.
The collection is called WhyWhere and can be viewed with inQ v3.3.1 (http://www.sdsc.edu/srb/software.html) using the settings: Name: testuser Host: orion.sdsc.edu Domain: seek Port: 7613 Authorization: ENCRYPT1 Password: TESTUSER. By downloading and using the WhyWhere algorithm (Stockwell, in press, [arxiv:q-bio/0511046]), one can make full use of the archive for niche modelling (http://biodi.sdsc.edu/ww_home.html).
The aim of this SRB archive was to provide a source of environmental correlates for ecological niche modeling (ENM) from a massive archive of data. The second element of the architecture is an algorithm for efficiently mining for correlates, described elsewhere (Stockwell, in press, [arxiv:q-bio/0511046]).
In this section we describe the needs and solutions that drove the environmental data component supplying the spatial mining algorithms. In particular, we describe the implementation of a server-side ‘pgm_cut’ operation. By putting ‘pgm_cut’ in the SRB server, we move the workload of running the ‘pgm_cut’ algorithm to the server side and thus avoid the overhead of downloading the whole image to the client. This remote partial file transfer capability is accomplished by treating ‘pgm_cut’ as a proxy operation within the SRB server.
Other benefits of using SRB as an archive include location encapsulation for archived files and metadata, meaning that a ‘pgm_cut’ client doesn’t need to know where the actual image and data are stored. As a distributed data management system, SRB provides virtually unlimited storage space for geospatial data and images. The constraints on the archive were as follows, and each is described in turn; the fourth required the extensions to the SRB that we describe in more detail: (1) single format for at least 1000 data variables, (2) all data sets of global extent but variable scale, (3) all data described with meta data, and (4) each variable supplied cropped and scaled to a specific size and resolution.
The format used for storing the variables was a pgm image. This is a simple gray scale image with one byte per pixel and a simple text header giving the format, the extent and the number of shades of gray in the image. For example below is the first few lines of a pgm binary file with dimension 3600×1800 and 255 colors.
While many variables, such as categorical variables describing vegetation, landscape or soil types, contained fewer than 256 values, continuous-valued variables were simply normalized between 0 and 255 according to their maximum and minimum values. While this trade-off for efficiency resulted in loss of information, it is not as problematic as it might seem, given that the algorithm is only looking for statistical associations. The WhyWhere algorithm used in mining the data set categorizes all variables into fewer than this number of categories anyway, and the reduced size provides both storage and computational efficiencies.
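The normalization and file layout described above can be sketched in a few lines of Python. This is only an illustration, assuming the standard netpbm binary layout (magic number `P5`, then width, height and maximum gray value as text, then one byte per pixel); the function names and the sample values are hypothetical and are not taken from the WhyWhere code. Under that assumption, the header of a 3600×1800 image would consist of the three text lines `P5`, `3600 1800` and `255`.

```python
def normalize_to_bytes(values, levels=256):
    """Linearly rescale numeric values to integers in 0..levels-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return bytes(len(values))  # a constant layer maps to all zeros
    scale = (levels - 1) / (hi - lo)
    return bytes(round((v - lo) * scale) for v in values)


def pgm_bytes(width, height, pixels, maxval=255):
    """Build a binary (P5) PGM: a text header followed by one byte per pixel."""
    if len(pixels) != width * height:
        raise ValueError("pixel count does not match dimensions")
    header = f"P5\n{width} {height}\n{maxval}\n".encode("ascii")
    return header + bytes(pixels)


# Illustrative values only (tenths of a degree Celsius).
temps = [-569, -135, 0, 299]
pixels = normalize_to_bytes(temps)
image = pgm_bytes(2, 2, pixels)
```

The minimum of the layer maps to 0 and the maximum to 255, so the original units can only be recovered approximately, which is the loss of information mentioned above.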
Scale and Extent
It was also decided to restrict the database to variables with the same global extent and geographic projection to make extracting information consistent. Given that all variables are global, all can potentially be used in an analysis. The scale varies between 1 degree per grid cell (i.e. a 360×180 pixel image) and 1km per grid cell, at which an image is approximately 1GByte in size.
The meta data format used is called the ‘fits’ format, as used in the Global Ecosystem Database (GED), one of the main sources of data. An example of this simple list of attribute/value pairs is shown below. While more complex meta data would be useful, this format was enough for our purposes. The meta data documents the images and provides the dimensions needed by the cropping algorithm, so that parts of the image file can be accessed by the SRB.
file title : Legates & Willmott Annual Temperature (0.1C)
data type : integer
file type : binary
columns : 720
rows : 360
ref. system : lat/long
ref. units : deg
unit dist. : 1.0000000
min. X : -180.0000000
max. X : 180.0000000
min. Y : -90.0000000
max. Y : 90.0000000
pos’n error : unknown
resolution : 0.5000000
min. value : -569
max. value : 299
value units : 0.1 degrees Celsius
value error : unknown
flag value : none
flag def’n : none
legend cats : 0
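As a rough illustration of how a client might read such a listing, the following Python sketch parses the attribute/value pairs into a dictionary and derives the cell size from the extent and the number of columns. The function name `parse_metadata` is hypothetical, and the sketch assumes every line uses the " : " separator exactly as shown in the example above.

```python
def parse_metadata(text):
    """Parse 'attribute : value' lines into a dict of stripped strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:  # skip lines that do not contain the separator
            fields[key.strip()] = value.strip()
    return fields


sample = """\
file title : Legates & Willmott Annual Temperature (0.1C)
columns : 720
rows : 360
min. X : -180.0000000
max. X : 180.0000000
min. Y : -90.0000000
max. Y : 90.0000000
resolution : 0.5000000
"""

meta = parse_metadata(sample)
width, height = int(meta["columns"]), int(meta["rows"])
# Degrees of longitude covered by one pixel.
cell = (float(meta["max. X"]) - float(meta["min. X"])) / width
```

For the example above, the derived cell size agrees with the `resolution` attribute, which gives a simple consistency check on a metadata file.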
Two main operations are needed, exemplified in the local image processing library netpbm as pgm_cut and pgm_scale. In a ‘pgm_cut’ operation, we extract the part of the image labeled ‘B’. In a ‘pgm_scale’ operation, we change the x and y extent to a given size, either increasing it (B to a) or reducing it (a to B).
Figure 2. Array of pixels for illustrating cropping and scaling operations. In a crop the array extent ‘a’ is cropped to ‘B’. In scaling the entire array ‘a’ is reduced in size to ‘B’ or the array ‘B’ is increased in size to ‘a’.
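The two operations can be sketched on a small array of pixels in Python. This is an illustration only: the function names are hypothetical, and nearest-neighbour sampling is assumed for the scaling step, which is not necessarily the method used by the netpbm tools.

```python
def crop(rows, x0, y0, width, height):
    """Return the width-by-height window whose top-left pixel is (x0, y0)."""
    return [row[x0:x0 + width] for row in rows[y0:y0 + height]]


def scale_nearest(rows, new_width, new_height):
    """Resize by nearest-neighbour sampling (an assumed, simple method)."""
    old_h, old_w = len(rows), len(rows[0])
    return [
        [rows[y * old_h // new_height][x * old_w // new_width]
         for x in range(new_width)]
        for y in range(new_height)
    ]


# A 4x4 array 'a' whose pixel values encode their row and column.
a = [[10 * r + c for c in range(4)] for r in range(4)]
b = crop(a, 1, 1, 2, 2)          # cut the central 2x2 block out of 'a'
up = scale_nearest(b, 4, 4)      # enlarge the 2x2 block to 4x4
down = scale_nearest(a, 2, 2)    # reduce the 4x4 array to 2x2
```

Cropping keeps the original pixel values and only changes the extent, whereas scaling keeps the extent and changes the number of pixels that cover it.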
We wanted to develop a generic approach to accessing the geographic data sets, paying attention to increased usage in the future. There were a number of options for balancing the client/server load. In our experience with remote file transfer software, no existing package handles the requirement of a ‘pgm_cut’ from SRB in a distributed environment.
- One approach would be to download the whole file to a local machine and operate on the image locally. This is a poor solution in terms of performance, as many files are greater than 1GByte and only part of the data is needed.
- Although grid-ftp provides partial file transfer, it cannot be done by one function call for the above case. So the client side (grid-ftp client) has to repeatedly calculate offsets and then make grid-ftp calls with new offsets and numbers of bytes for data transfer. In this case, we create a fat customized client that then assembles the resulting lines of the data to create the image. We tried this with the SRB using the SRB read library function. However initial experiments were very slow, presumably due to latency of the Internet connections that are made and broken for each line of the image.
- Finally, we extended SRB functionality by developing a cropping function on the server side, and kept the scaling function on the client side. The server side reads the header of the image file, retrieves and assembles a series of lines from the file corresponding to the area needed, then passes the result to the calling client. We found the performance of this approach to be reasonable, and it is the ‘generic’ solution currently used in the WhyWhere application.
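The server-side cropping step in the last option can be sketched as follows. This is not the SRB proxy code: it is a minimal Python illustration, assuming a binary (P5) PGM with one byte per pixel and a crop window given as x, y, width and height. The function names and the argument order are hypothetical.

```python
import io


def _next_token(stream):
    """Return the next whitespace-delimited header token, skipping comments."""
    token = b""
    while True:
        ch = stream.read(1)
        if not ch:
            raise ValueError("truncated PGM header")
        if ch == b"#":
            stream.readline()  # discard the rest of a comment line
            ch = b"\n"
        if ch.isspace():
            if token:
                return token
        else:
            token += ch


def crop_pgm(stream, x0, y0, width, height):
    """Read only the bytes inside the window and return them as a new PGM."""
    if _next_token(stream) != b"P5":
        raise ValueError("not a binary PGM")
    img_w = int(_next_token(stream))
    img_h = int(_next_token(stream))
    maxval = int(_next_token(stream))
    if maxval > 255:
        raise ValueError("only one byte per pixel is supported")
    if x0 < 0 or y0 < 0 or x0 + width > img_w or y0 + height > img_h:
        raise ValueError("crop window lies outside the image")
    data_start = stream.tell()  # first pixel byte follows the header
    out = [f"P5\n{width} {height}\n{maxval}\n".encode("ascii")]
    for row in range(y0, y0 + height):
        stream.seek(data_start + row * img_w + x0)  # jump to the row segment
        out.append(stream.read(width))
    return b"".join(out)


# A 4x3 test image held in memory; a real server would open the archived file.
source = io.BytesIO(b"P5\n4 3\n255\n" + bytes(range(12)))
cropped = crop_pgm(source, 1, 1, 2, 2)
```

Because each row segment is located with a seek and read on the machine that holds the file, the per-line network round trips of the second option are avoided, and only the assembled window crosses the network.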
The SRB connection is carried out as a proxy operation on the server through the use of a program in the client called rpgm. The WhyWhere application iterates along a list of image names to retrieve as selected by the user. For each image it retrieves a version cropped according to the latitude and longitude required. To do this, the client side calculates the pixel coordinates needed based on local metadata and makes calls of the form:
$ Spcommand "Spgmcut 0 0 20 20 /home/whywhere.seek/ei/Data/Terrestrial/a00sd1.9.pgm" > temp.pgm
$ pnm_scale -xsize $xwall -ysize $ywall temp.pgm > $RDIR$DIRt_$i.pgm
The above Spcommand sends its only command line argument to the SRB server. That argument contains the proxy program name, Spgmcut, its parameters and the file name. The result is sent to ‘stdout’ which, in the above example, is redirected to a local file, temp.pgm. The SRB S-commands in ‘landscape.sdsc.edu’ can be found in the following directory of the WhyWhere distribution: WW/cgi-bin/UTIL. While initial tests indicated adequate performance, once a number of people started using the service the performance became variable, depending on the size of the files (the biggest are 1GByte each) and the load on the server. Currently a call to Remote_All_Data is an overnight task.
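The client-side calculation of pixel coordinates from the required latitude and longitude can be sketched as below. This is an illustrative Python function, not the WhyWhere client code: the name `bbox_to_pixels` is hypothetical, the default extent is the global extent from the metadata example, and row 0 is assumed to be the northern edge of the image, as is usual for raster images.

```python
import math


def bbox_to_pixels(west, south, east, north, columns, rows,
                   min_x=-180.0, max_x=180.0, min_y=-90.0, max_y=90.0):
    """Return (x0, y0, width, height) in pixels for a lat/long bounding box.

    Assumes row 0 is the northern edge of the image.
    """
    cell_x = (max_x - min_x) / columns
    cell_y = (max_y - min_y) / rows
    # Round outwards so the window always covers the requested box.
    x0 = math.floor((west - min_x) / cell_x)
    x1 = math.ceil((east - min_x) / cell_x)
    y0 = math.floor((max_y - north) / cell_y)
    y1 = math.ceil((max_y - south) / cell_y)
    # Clamp to the image bounds.
    x0, x1 = max(x0, 0), min(x1, columns)
    y0, y1 = max(y0, 0), min(y1, rows)
    return x0, y0, x1 - x0, y1 - y0


# A box over part of North America on a 720x360 (half-degree) image.
window = bbox_to_pixels(-100.0, 30.0, -70.0, 50.0, columns=720, rows=360)
```

The same geographic box yields a different pixel window for each image resolution, which is why the client needs the per-image metadata before it requests a crop.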
The data sets were collected from various free sources on the web in a variety of formats and processed into pgm images. The following were major sources:
- GED: The Global Ecosystems Database (GED) project began in 1990 as an interagency project between the National Geophysical Data Center (NGDC) of the U.S. National Oceanic and Atmospheric Administration (NOAA), and the U.S. Environmental Protection Agency’s (EPA) Environmental Research Laboratory in Corvallis, Oregon (ERL-C). In particular, the following variables were added, many consisting of multiple layers (e.g. monthly mean and standard deviations of temperatures).
Global Geographic (longitude/latitude) Database (GLGEO)
A01: NGDC Monthly Generalized Global Vegetation Index from NESDIS NOAA-9 Weekly GVI Data (APR 1985 – DEC 1988).
A05: Olson World Ecosystems.
A06: Leemans Holdridge Life Zone Classifications.
A07: Matthews Vegetation, Land Use, and Seasonal Albedo.
A10: Wilson and Henderson Sellers Global Land Cover and Soils Data for GCMs.
B01: Fedorova, Volkova, and Varlyguin World Vegetation Cover
B02: Bazilevich Global Primary Productivity
B03: Bailey Ecoregions of the Continents (reprojected)
- WORLDCLIM is a set of global climate layers (grids) on a square kilometer grid supported by NatureServe.
The bioclimatic variables represent physiologically relevant annual trends (e.g., mean annual temperature, annual precipitation), seasonality (e.g., annual range in temperature and precipitation) and extreme or limiting environmental factors (e.g., temperature of the coldest and warmest month, and precipitation of the wet and dry quarters). A quarter is a period of three months (1/4 of the year). They are coded as follows:
BIO1 = Annual Mean Temperature
BIO2 = Mean Diurnal Range (Mean of monthly (max temp – min temp))
BIO3 = Isothermality (P2/P7) (* 100)
BIO4 = Temperature Seasonality (standard deviation *100)
BIO5 = Max Temperature of Warmest Month
BIO6 = Min Temperature of Coldest Month
BIO7 = Temperature Annual Range (P5-P6)
BIO8 = Mean Temperature of Wettest Quarter
BIO9 = Mean Temperature of Driest Quarter
BIO10 = Mean Temperature of Warmest Quarter
BIO11 = Mean Temperature of Coldest Quarter
BIO12 = Annual Precipitation
BIO13 = Precipitation of Wettest Month
BIO14 = Precipitation of Driest Month
BIO15 = Precipitation Seasonality (Coefficient of Variation)
BIO16 = Precipitation of Wettest Quarter
BIO17 = Precipitation of Driest Quarter
BIO18 = Precipitation of Warmest Quarter
BIO19 = Precipitation of Coldest Quarter
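Two of these variables are defined in terms of others. Reading P2, P5, P6 and P7 in the list above as BIO2, BIO5, BIO6 and BIO7 (an assumption about the notation), the derived values can be computed as in this small Python sketch; the function names and input values are illustrative only.

```python
def temperature_annual_range(bio5, bio6):
    """BIO7: max temperature of warmest month minus min of coldest month."""
    return bio5 - bio6


def isothermality(bio2, bio7):
    """BIO3: mean diurnal range as a percentage of the annual range."""
    return 100 * bio2 / bio7


# Illustrative inputs in degrees Celsius.
bio7 = temperature_annual_range(bio5=30.0, bio6=5.0)
bio3 = isothermality(bio2=10.0, bio7=bio7)
```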
- World Ocean Atlas 2001 (WOA01) Data for Ocean Data View
These are the objectively analyzed global ocean historical hydrographic data from the U.S. NODC World Ocean Atlas 2001. Data are on a 1×1 degree horizontal grid and at the following standard depths (in m): 0, 10, 20, 30, 50, 75, 100, 125, 150, 200, 250, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400, 1500, 1750, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500. Data are available for the following variables:
Oxygen Saturation [%]
- Continuous Fields 1 Km Tree Cover was developed by DeFries, R., Hansen, M., Townshend, J.R.G., Janetos, A.C., Loveland, T.R. (2000) at the University of Maryland. An alternative paradigm to describing land cover as discrete classes is to represent land cover as continuous fields of vegetation characteristics using a linear mixture model approach. This data set contains 1km cells estimating:
Percent tree cover
Percentage cover for two layers representing leaf longevity (evergreen and deciduous)
Percentage cover for two layers estimating leaf type (broadleaf and needleleaf)
Future needs are many if the archive is to transition into a production resource for the general research community. The following are some of the main challenges:
- It is envisaged that traffic could expand considerably, as each analysis requires each researcher to download and process a large section of the collection every time they start an analysis, so local caching is only a small saving. Meeting this need is initially envisaged via replication of SRB archives and a means in the client of selecting the most efficient (e.g. proximate) archive for connection and download.
- If the popularity expands, attention will be paid to transitioning the archive to a community of developers for maintaining, updating, cleaning and improving the archive. Currently we are talking with the openModeller project (http://sourceforge.net/projects/openmodeller/) about this, and are also maintaining a Weblog site to facilitate education and promotion about data-intensive approaches (http://landscape.sdsc.edu/~davids/enm).
- Finally, improved methods of access are needed to allow more flexible selection of variables for analysis.
- Stockwell D.R.B. (in press) Improving ecological niche models by data mining large environmental datasets for surrogate models, Ecological Modelling, [arxiv:q-bio/0511046], (http://landscape.sdsc.edu/~davids/enm/?p=13).