Streaming environmental data management

The new WhyWhere application is starting to work smoothly on large datasets now (http://landscape.sdsc.edu/ww-testform.html). I have added a list of all terrestrial data (All_Terrestrial); though it will contain some errors until I clean it up, it should be usable. I have also been thinking about how to deal with the number of data sets in a streaming data framework.

The issue with specifying environmental datasets to use in an analysis comes down to accommodating a series of options:
1. The user prepares custom datasets, with three sub-options:
1.1 upload them onto the server for others to use too
1.2 upload them for use for a fee if not public
1.3 install the program themselves and use it locally

2. Use a growing collection of global data sets, with sub-options for organization:
2.1 separate lists for high resolution and low resolution
2.2 separate lists for marine, terrestrial and freshwater (though some variables overlap)
2.3 user-generated customized lists

3. Issues to do with the kind of question asked:
3.1 what is the best predictor (or predictors) for this species across all data sets
3.2 what is the specific relationship to selected datasets (e.g. vegetation)
3.3 what variable other than a given variable (say climate) is important

4. Various modalities:
4.1 distribution modeling
4.2 invasive species
4.3 climate change
4.4 ensembles of species

Rather than allow development to go in all directions, I have the idea (vision) of a new type of data streaming web, where customized information, including model predictions, flows continuously downstream rather than being chunked into stages (data preparation and variable selection, modeling, writing, publication, etc.). So it would make sense to envision custom filters applied to the environmental data. One approach would be a text selection capability such as regular expressions. Another would be a custom editable list in the user's scratch directory, rather than a multitude of options.
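As a rough sketch of the filtering idea, here is how a regular expression or a user-maintained list could narrow the catalogue of dataset names before a run. The dataset names and the function are illustrative only, not WhyWhere's actual naming or API:

```python
import re

# Hypothetical dataset names, standing in for entries in the All_Terrestrial list.
datasets = [
    "lw_annual_temperature_01C",
    "lw_annual_precipitation_mm",
    "worldclim_tmax_30s",
    "srtm_elevation_m",
    "pct_inundation",
]

def filter_datasets(names, pattern=None, custom_list=None):
    """Select datasets by a regular expression, a user-maintained list, or both."""
    selected = names
    if pattern is not None:
        selected = [n for n in selected if re.search(pattern, n)]
    if custom_list is not None:
        selected = [n for n in selected if n in custom_list]
    return selected

# Regex filter: everything whose name mentions "temp".
print(filter_datasets(datasets, pattern=r"temp"))

# Custom-list filter, e.g. loaded from a file in the user's scratch directory.
print(filter_datasets(datasets, custom_list={"srtm_elevation_m", "pct_inundation"}))
```

Either mechanism reduces the hundreds of available layers to a working set without hard-coding a multitude of options into the interface.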

These filters could be applied to each of the channels of the image. For example, the red channel could be used for temperature and rainfall, and the green channel for another category of variable, such as transportation. This would ensure variables from the required class are incorporated. A channel could also be set to a specific variable: e.g. red is annual temperature, ensuring that this variable is used. This forcing method was a feature of the previous version, but the filtering approach would make it a specific behavior in a more general system.
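A minimal sketch of the per-channel idea, assuming hypothetical category tags on each dataset (the names and the two-element channel spec are illustrative, not WhyWhere's implementation): a channel either pins one variable ("force") or restricts the search to one class of variable ("category").

```python
# Hypothetical catalogue mapping dataset names to a category tag.
datasets = {
    "lw_annual_temperature_01C": "climate",
    "lw_annual_precipitation_mm": "climate",
    "road_density": "transportation",
    "rail_density": "transportation",
}

def candidates_for_channel(spec, catalog):
    """Return the datasets a channel may draw from.

    spec is either ("force", name) to pin one variable to the channel,
    or ("category", tag) to restrict the search to one class of variable.
    """
    kind, value = spec
    if kind == "force":
        return [value]
    return [name for name, cat in catalog.items() if cat == value]

channels = {
    "red": ("force", "lw_annual_temperature_01C"),  # forced variable
    "green": ("category", "transportation"),        # searched within a class
}

for channel, spec in channels.items():
    print(channel, candidates_for_channel(spec, datasets))
```

Forcing then falls out as the degenerate case of filtering: a category filter that happens to match exactly one dataset.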



  1. David,

    After reading your comments, I tried modeling my data again, this time in three ways.

    First, I tried to model the entire data set, ignoring species. I received a nice output from this and the best accuracy was from the Legates & Willmott Annual Temperature (0.1C) variable.

    Second, I tried to model the first species (74 points, mostly from North Florida), which you suggested might cover too fine an area for this data set. The model was unable to produce images of the map and histogram from these data, but it did give me a best accuracy from the Legates & Willmott Annual Measured Precipitation (mm/year) variable.

    Third, I tried to model the second species (28 points, mostly from South Florida and covering a smaller geographic area than the 74 points from the first species). This model created a nice output, including a map and histogram, with a best accuracy variable of Legates & Willmott Annual Temperature (0.1C).

    Given that there were 74 points for the first species and these points were more geographically separated than the 28 points for the second species, and the second species modeled fine, I am not sure that the geographical spread of the points for the first species is what is preventing the model from running properly. I hope I’m making sense, but if not, let me know and I’ll try to clarify further.

    As for creating a high res data set, I think that's an excellent idea. Of course I realize that WhyWhere is a huge project and you probably already have plenty of work to keep you busy for a long while; however, I believe a high res data set would be extremely valuable and could be a strong selling point for using the model.

    I mentioned earlier that I am taking a graduate class in the Ecology and Evolution department here at FSU about Species Distribution Modelling. Four students in the class, myself included, are each working on a project to use data we have in some of the different species distribution models available. In talking with the other students, it seems that most have point data similar to my own, covering a region of a state or maybe just one state, and in one case just covering a particular state forest. In most cases, the students are interested in answering questions at these higher resolutions and less interested in continental/global questions. I get the impression (of course I am no expert!) that these models would be even more exciting to ecologists if they could be used to address questions at these higher resolutions for which they have point data.

    I realize that it is possible to create custom data sets in some models to do this; however, we have discovered in most cases that creating these data sets is mechanistically very challenging. Many of my colleagues (faculty and grad students) simply have never worked with (or had access to!) programs such as ArcGIS. For them to use these models effectively to answer the high res questions they have requires that they first acquire and learn GIS, then try to figure out where to get the data layers they need, and then try to figure out how to create the data sets from these various layers. While again, all of this is possible to do, it can be overwhelming and is very time consuming for a new user.

    I think if you included a high res data set with WhyWhere, it would make this modelling program even more valuable and might start to reach an audience of researchers that don't have the knowledge, or simply enough time, to figure out how to create high res data sets of their own.

    I’ll get off my soapbox now! Thanks for entertaining my ideas and observations. And best of luck to you as you work to get this model working smoothly! Already I am excited about it as it seems much more straight forward than other models out there! I’m really looking forward to seeing it develop further and using it in my research. I work on invasive plant species and models such as WhyWhere have the definite potential of helping researchers tease apart the major factors influencing their spread! I am no computer programmer however and although I try to pick up as much as I can as I work, it’s people like yourself who are out there designing these models that I rely on and look to for guidance as I try to model and predict species spread.

    Thanks for your time!

    Sarah 🙂

  2. Thanks Sarah, for bearing with me during the development stage. Although it is a work in progress, some surprising results are coming out. When I ran your entire data set of points, it picked out the % inundation variable as the best predictor for your data in Florida out of over 500 data sets.

    I fixed a problem with layer registration, and now it is only the very fine scale Worldclim datasets that are misregistering in the large data set. At the moment it is only a 1D search, but useful nonetheless for identifying interesting correlative variables. ArcGrid output and many other options are still to come.

    I don't see any attempts other than WhyWhere to tackle the huge set of possible correlates that are becoming available in the context of species distribution modelling. To me, studies restricting analysis primarily to temperature and rainfall datasets should justify their exclusion of everything else. But this is not possible without testing them first, and preparing all datasets for every study is hugely inefficient. This is an attempt to address the problem. A reasonably objective approach, to me, would be to run WhyWhere first to see what pops out, even if you then go on to focus a more detailed model on just a few variables.

    For the moment, if you want to look at fine scales, run on the larger dataset, which also includes the coarse variables. If it turns out the coarse variables are better predictors, then that is good to know. If you really want to use a particular variable, I will make it possible to 'force' the inclusion of a given variable in one of the channels, and then conduct the search in the other channel.

    Perhaps in the near future I will make it possible to specify which data sets to use with a custom list or a regular expression.

