In Praise of Numeracy

Mathematical shapes can affect our lives and the decisions we make.

The hockey stick graph describing the earth's average temperature over the last millennium has been the subject of a controversial debate over the reliability of its methods of statistical analysis.

[Figure: the hockey stick graph]
From this to this …
[Figure: the long tail distribution]

The long tail is another new icon, described in "The Long Tail", a new book by Chris Anderson that was developed in the blogosphere:

Forget squeezing millions from a few megahits at the top of the charts. The future of entertainment is in the millions of niche markets at the shallow end of the bit stream. Chris Anderson explains all in a book called “The Long Tail”. Follow his continuing coverage of the subject on The Long Tail blog.

As explained in Wikipedia:

The long tail is the colloquial name for a long-known feature of statistical distributions (Zipf, power laws, Pareto distributions and/or general Lévy distributions). The feature is also known as “heavy tails”, “power-law tails” or “Pareto tails”. Such distributions resemble the accompanying graph.

In these distributions a low frequency or low-amplitude population that gradually “tails off” follows a high frequency or high-amplitude population. In many cases the infrequent or low-amplitude events—the long tail, represented here by the yellow portion of the graph—can cumulatively outnumber or outweigh the initial portion of the graph, such that in aggregate they comprise the majority.
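To make the arithmetic concrete, here is a minimal numerical sketch (the catalogue size and Zipf exponent are arbitrary assumptions, not figures from the sources quoted above):

```python
import numpy as np

# Zipf-like popularity: the item of rank k has frequency proportional to 1/k^s.
s = 1.0                              # assumed exponent; real catalogues vary
ranks = np.arange(1, 1_000_001)      # a catalogue of one million items
freq = 1.0 / ranks**s

head = freq[:100].sum()              # the "hits": the top 100 items
tail = freq[100:].sum()              # the long tail: everything else

print(f"head share: {head / freq.sum():.1%}")   # roughly a third
print(f"tail share: {tail / freq.sum():.1%}")   # roughly two thirds
```

Even though every individual tail item is tiny, the tail in aggregate outweighs the hits.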


Novel methods continue to improve prediction of species' distributions.

The recently published paper by Jane Elith, Catherine Graham et al., "Novel methods improve prediction of species' distributions from occurrence data" (EG06), is sure to be a landmark study in the field. EG06 compares 16 modeling methods on 226 well-surveyed species in 6 regions of the world. Measures of statistical skill on held-back data show a spread across a wide range of methods, from the older methods such as BIOCLIM and DOMAIN, through GARP, GLM and GAM, to the newer arrivals from machine learning, MAXENT and BRT, and the community-based method GDM, prompting the conclusion that "novel methods improve prediction". The work of a great many people went into this study, and the results will no doubt be very helpful to many biodiversity modellers in the future.

Why novel methods work

EG06 attributes the success of the newer methods to their ability to represent the complexity of the relationships between species and their environment.

One feature that they all share in common is a high level of flexibility in fitting complex responses

The same thing was found in the early 80's, when novel machine learning methods, particularly neural nets, decision trees (CART), and genetic algorithms (GARP), were first used for species distribution modeling. At a time when BIOCLIM and GLM were the only species distribution methods, these early experiments showed that heuristic approaches from machine learning would benefit the field.

The opposing view, still widely held, is that approaches should be based strictly on ecological theory, such as using BIOCLIM to represent the Hutchinsonian niche. This view is valid too, if your primary aim is to evaluate the theory rather than to maximize accuracy. It is a familiar theme in ML that real-world performance requires heuristic complexity at the expense of theoretical elegance. Happily, the species prediction problem has come to the attention of leading-edge researchers in present-day ML, and both theory and practice will benefit from the interplay.

Historic progression of models

In addition to the complexity dimension, the range of statistical skill across methods also represents an historic progression. GARP in the early 80's used genetic algorithms to combine the major methods of the time, BIOCLIM, GLM and surrogate models, into a multi-model rule-set. The strategy of using multiple models for prediction is also used by the highest-performing method in EG06, boosted regression trees (BRT). Under ideal conditions the ensemble approach in GARP would be expected to be better than the worst of the methods it uses (BIOCLIM), but no better than the best of them (GLM), and this is what EG06 shows. It is likely that other high-performing approaches such as BRT have likewise evolved from experience with earlier regression tree algorithms such as classification and regression trees (CART).
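For readers unfamiliar with the boosting idea, here is a minimal sketch using scikit-learn's gradient boosted trees as a stand-in for BRT; it is not the implementation evaluated in EG06, and the presence and background data below are invented:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Toy data: two environmental variables (say, temperature and rainfall)
# at presence points and at random background points.
presence = rng.normal(loc=[22.0, 1200.0], scale=[2.0, 150.0], size=(200, 2))
background = np.column_stack([rng.uniform(0, 35, 1000),
                              rng.uniform(0, 3000, 1000)])

X = np.vstack([presence, background])
y = np.concatenate([np.ones(len(presence)), np.zeros(len(background))])

# Boosting fits many shallow trees in sequence, each one correcting the
# residual errors of the ensemble so far: the multiple-models strategy.
brt = GradientBoostingClassifier(n_estimators=300, learning_rate=0.01,
                                 max_depth=3)
brt.fit(X, y)

# Relative suitability of a new site (temperature, rainfall).
print(brt.predict_proba([[21.0, 1100.0]])[0, 1])
```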

Unanswered issues

The major unanswered issue in species distribution modeling is environmental data selection. In EG06, the selection of environmental data reflects typical practice:

The environmental data used for each region were determined according to their relevance to the species being modeled (Austin 2002) as determined by the data provider (Tables 1 and 3).

EG06 does not address the problem of environmental data selection. No more than about 13 environmental data sets were used in each region, and no statistics were provided for the predictive power of these datasets. This in no way undermines their conclusions; however, researchers want to develop the best models possible. It has been shown in "Improving ecological niche models by data mining large environmental datasets for surrogate models" (S06) that monthly climate variables may be much more effective than annual climate averages, suggesting that the variables typically used are not the best available.

Just as GARP arose out of concern with arbitrariness in the use of functional forms and generalized over them, WhyWhere arises out of concern with arbitrariness in the selection of environmental data. This problem has only become apparent as the number of available data sets has burgeoned. There is also the large-dataset problem of modeling species distributions in the marine environment, where the depth dimension multiplies the number of possible variables enormously (e.g. nutrient levels at each depth).

Where WhyWhere fits in

One of the findings in "Effects of sample size on accuracy of species distribution models" (SP04), a major comparison of 1060 species in Mexico using logistic regression (LR), GARP and a simple surrogate model (SS), was that the old SS method performed surprisingly well. This was interesting because a very simple approach can be the basis for a very high-performance algorithm, enabling the analysis of large data sets. Rather than use a more theoretically precise approach to clustering the environment, such as k-means, which I have found to be inefficient and unreliable, a quicker, more reliable heuristic method for classifying colors in images was used to develop a practical approach to data mining on the order of thousands of datasets, called WhyWhere.
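To give a flavour of the approach, here is a rough sketch of the idea only, not the WhyWhere code: Pillow's median-cut color quantizer stands in for the image color-classification heuristic, and the environmental grids and occurrence points below are invented.

```python
import numpy as np
from PIL import Image

rng = np.random.default_rng(1)

# Three environmental grids rescaled to 0-255 and treated as the R, G, B
# channels of an image (e.g. temperature, rainfall, elevation).
grids = rng.integers(0, 256, size=(100, 100, 3), dtype=np.uint8)

# Median-cut color quantization classifies every cell into one of a
# small number of environmental "colors" (classes).
classes = np.array(Image.fromarray(grids, mode="RGB").quantize(colors=16))

# Presence cells (row, col): made-up occurrence locations.
presence = rng.integers(0, 100, size=(50, 2))
presence_classes = classes[presence[:, 0], presence[:, 1]]

# The surrogate model: compare how often each class holds a presence
# with how common that class is overall.
overall = np.bincount(classes.ravel(), minlength=16) / classes.size
occupied = np.bincount(presence_classes, minlength=16) / len(presence_classes)
suitability = np.divide(occupied, overall, out=np.zeros_like(occupied),
                        where=overall > 0)
print(suitability)   # classes with a ratio above 1 are "preferred"
```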

It was subsequently shown in S06 that a relatively simple, low-dimensional SS model searching a very large set of data could outperform a more complex model using a small set of general environmental variables. This is because some specific variables, particularly monthly climate variables, correlate well with most species, but which ones matter varies from species to species (see Surprising finding #3 for recent results). The opposing view is that variable selection should follow ecological theory, that is, that a small set of climate variables represents the ecological determinants adequately. This is rhetorically similar to the argument for ecologically-based models: valid, but only if you are willing to sacrifice maximum accuracy, preferring theoretical elegance to real-world performance.

There is also a larger agenda. Just as GARP and other algorithms demonstrated the value of machine learning approaches in the 80's, WhyWhere demonstrates the value of data mining approaches. This strategy enables the best minds in computer science to engage productively with the biodiversity field, and it raises the profile of biodiversity modeling, which is still a minor player compared with climate and population modeling.

Role of WhyWhere

WhyWhere can be used in a 'pre-modeling' stage. Points can be run through the server here just to see which variables give the greatest accuracy. After objectively determining the best variables from the 528 terrestrial variables currently available, you can include them in other approaches if required.
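In outline, this pre-modeling screening amounts to scoring each candidate layer on its own and ranking the results. The sketch below is illustrative only: the layer names, the quantile binning and the score are placeholders, not the server's actual procedure.

```python
import numpy as np

def screen_variables(layers, presence_idx, n_bins=20):
    """Rank environmental layers by how strongly the presence cells
    deviate from the background distribution of each layer."""
    scores = {}
    for name, grid in layers.items():
        values = grid.ravel()
        edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))
        background = np.histogram(values, bins=edges)[0] / values.size
        pres = np.histogram(values[presence_idx], bins=edges)[0]
        pres = pres / max(pres.sum(), 1)
        # Simple score: total variation distance between the presence
        # and background histograms (0 = no signal, 1 = perfect split).
        scores[name] = 0.5 * np.abs(pres - background).sum()
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Example with made-up layers on a flattened 100x100 grid.
rng = np.random.default_rng(2)
layers = {"tmean_07": rng.normal(20, 5, 10_000),
          "prec_01": rng.gamma(2.0, 50.0, 10_000),
          "elevation": rng.uniform(0, 3000, 10_000)}
presence_idx = rng.integers(0, 10_000, 200)
print(screen_variables(layers, presence_idx))
```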

The advantages of using WhyWhere in a pre-modelling stage are:

  1. Greater objectivity in environmental variable selection
  2. Applicability to environments with large numbers of variables (e.g. the marine environment with its added depth dimension)
  3. Generality to applications other than species distribution (e.g. house prices and climate).
  4. Potentially more accurate models using optimal datasets
  5. No need for each person to develop a new set of variables

Research questions:

1. On average, it appears the Worldclim monthly climate average datasets are most frequently the highest performers (see Surprising finding #3 for recent results). While BIOCLIM may have performed relatively poorly in EG06, these preliminary results suggest that the underlying monthly climate datasets may be very powerful. Will combining monthly climate and other variables with the novel methods increase skill?

2. Can the novel methods be run on 1000 environmental variables, many of which are categorical with 100s of categories each? For example regarding BRT:

Therefore, it is not prudent to analyze categorical dependent variables (class variables) with more than, approximately, 100 or so classes.

The approach of using low-dimensional models with a piecewise fit seems necessary for achieving reliable, high performance on large numbers of environmental data sets.

3. How should we best address the problem of burgeoning numbers of environmental correlates? This is the next nettle that must be grasped to move the field forward.

4. Are the data and algorithms in EG06 freely available for benchmarking other models?

Conclusions

EG06 provides an objective and comprehensive evaluation of the statistical skill of a wide range of methods at predicting the distributions of well-surveyed species from presence-only data using a small number of generic data sets, identifying the best methods for future studies. The study did not, however, address the important role of environmental data set selection. Results and logic suggest that a wider range of environmental data than is currently used, such as the monthly climate averages, will improve accuracy even further. People should start using them.

References

EG06 – Jane Elith, Catherine H. Graham, Robert P. Anderson, Miroslav Dudík, Simon Ferrier, Antoine Guisan, Robert J. Hijmans, Falk Huettmann, John R. Leathwick, Anthony Lehmann, Jin Li, Lucia G. Lohmann, Bette A. Loiselle, Glenn Manion, Craig Moritz, Miguel Nakamura, Yoshinori Nakazawa, Jacob McC. M. Overton, A. Townsend Peterson, Steven J. Phillips, Karen Richardson, Ricardo Scachetti-Pereira, Robert E. Schapire, Jorge Soberón, Stephen Williams, Mary S. Wisz and Niklaus E. Zimmermann, 2006. Novel methods improve prediction of species’ distributions from occurrence data. Ecography 29: 129-.

S06 – Stockwell, D.R.B., 2006. Improving ecological niche models by data mining large environmental datasets for surrogate models. Ecological Modelling 192: 188–196.

SP04 – Stockwell, D.R.B. and Peterson, A.T., 2002. Effects of sample size on accuracy of species distribution models. Ecological Modelling 148(1): 1-13.

Worldclim – Hijmans, R.J., S.E. Cameron, J.L. Parra, P.G. Jones and A. Jarvis, 2005. Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology 25: 1965-1978.

How to start a science blog (scary version)

Having run across two recent notes on science blogs by academics, and having written a post about the benefits of blogs to scientists here, I felt compelled to issue a warning to readers: while there are positives and negatives to blogs, these authors just don't get it.

One, called Environmental Science Adrift in the Blogosphere (summary free), worries that statements about the range of estimates of the number of species going extinct diverge from the 'scientific consensus', and urges scientists to get involved in order to fix them. Despite appearing in Science magazine, the piece supported its claim with paltry evidence: a Google web search. Its claim of a scientific consensus on daily extinctions was supported by a single reference to an obscure Journal of Paleontology paper.

The other, Bloggers need not apply, related the author's experience of screening job applications, where perusal of applicants' weblogs revealed personal information that killed their employment chances. The author's problem with blogs seemed to be that, without the controls of peer review, people will post embarrassing rants.

Academics be afraid

The vague anxiety about blogs displayed in these articles is misplaced. Rather — they should be afraid, be very afraid. Why?

Consider the example of Dr Hwang, the formerly prominent stem cell researcher. Credit for his takedown is given to Korean bloggers here and here.

The nation’s young scientists made the allegation at the Web site of the state-backed Biological Research Information Center (bric.postech.ac.kr), which played a pivotal role in pinpointing manipulations at Hwang’s 2005 paper on patient-specific stem cells.

The lack of data archiving and due diligence at Nature and Science that may have contributed to this fraud has been highlighted again and again here. ClimateAudit is dedicated to documenting the ongoing quasi-litigation of journals and authors in the dendroclimatology field to make their data and methods public. What? Scientists don't reveal their data and methods! Difficulties in exactly replicating the famous hockey stick theory of recent temperatures may have been instrumental in the formation of a National Academy of Sciences panel, Surface Temperature Reconstructions for the Past 2,000 Years: Synthesis of Current Understanding and Challenges for the Future.

I admit I have a dog in this fight, having tapped out a short article on simulations using random numbers that produce temperature histories remarkably similar to most reconstructions (here and here). These results show that concerns that climate histories may be affected by various forms of undocumented 'cherry-picking', such as inter-site selection, are justified. To claim that temperature proxies deliberately selected for an upturning pattern in the 20th century provide evidence that such warmth is anomalous, as was done here, is an example of the logical fallacy known as circular reasoning.
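The selection effect is easy to reproduce. The following is a toy sketch with arbitrary parameters, not the simulations reported in those articles: trendless red-noise series screened for correlation with a rising 'instrumental' trend average out to a hockey stick.

```python
import numpy as np

rng = np.random.default_rng(3)
n_series, n_years = 1000, 1000      # 1000 pseudo-proxies over 1000 "years"
phi = 0.9                           # AR(1) persistence (red noise)

# Generate trendless red-noise "proxy" series.
noise = rng.normal(size=(n_series, n_years))
proxies = np.zeros_like(noise)
for t in range(1, n_years):
    proxies[:, t] = phi * proxies[:, t - 1] + noise[:, t]

# "Calibration": keep only the series that correlate with a rising trend
# over the final 100 years (the instrumental period).
trend = np.arange(100)
recent = proxies[:, -100:]
corr = np.array([np.corrcoef(row, trend)[0, 1] for row in recent])
selected = proxies[corr > 0.5]

# The average of the selected series is roughly flat early on, with an
# upturn in the final century: a hockey stick built from trendless noise.
reconstruction = selected.mean(axis=0)
print(len(selected), reconstruction[:5], reconstruction[-5:])
```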

Self defence

Not only do blogs enable an aggressive falsification program, they enable you to defend yourself against stones thrown from ivory towers. On May 11, 2005, the day that Ross McKitrick and Steve McIntyre were presenting their results debunking the hockey stick in Washington, UCAR issued a press release announcing that one of its scientists, Caspar Ammann, and one of its former post-doc fellows, Eugene Wahl, had supposedly demonstrated that their criticisms of the hockey stick were "unfounded". McIntyre and McKitrick have used the blog medium masterfully to reveal that a crucial unfavorable r2 verification statistic was withheld from the Nature publications, showing the UCAR accusations to be not only unwarranted but themselves unfounded. Claims that scientists have been 'harassed' about archiving their data have been shown to be false by posting all the relevant correspondence on the web.

We have been given fair warning: the scrutiny that can be focused on a field of science by the blog community is so great that scientists should ensure their own house is in order. The academic authors of the two articles I mentioned at the beginning of this post should read "The World Is Flat", Thomas L. Friedman's account of the great changes taking place in our time as lightning-swift advances in technology and communications bring down a whole range of barriers and tyrannies. Academics locked in ivory towers should be anxious about the bulldozer that blogs are driving through their midst, but they seem largely unaware of the leveling taking place, or want to stand in front of it.

After blogs (AB)

How do you protect yourself from the scrutiny of the blogosphere? Vigorously communicate your views, and adopt professional standards of data archiving, reporting and openness. In a world where a freelance web journalist can walk around with a cell phone camera, instantly relaying stories back to a web site with thousands of hits an hour, perhaps a new breed of freelance scientist will emerge, replicating experiments, falsifying theories and reporting the results in real time. Or perhaps Open Science, where data are posted and contributors analyse them for free. Why change? The second wave of the Internet, now called Web 2.0, promises to have even more profound effects on our societal structures than the first, and academe will not be immune.

Streaming environmental data management

The new WhyWhere application is starting to work smoothly on large datasets now (http://landscape.sdsc.edu/ww-testform.html). I have added a list of all terrestrial data (All_Terrestrial); there will be some errors until I clean that lot up, but it should be usable. I have been thinking about how to deal with the growing number of data sets in a streaming data framework.

The issue of specifying which environmental datasets to use in an analysis comes down to accommodating a series of options:
1. User prepares custom datasets, with three sub-options:
1.1 upload them onto the server for others to use too,
1.2 upload them for a fee if they are to remain private,
1.3 install the program and run it locally.

2. Use a growing collection of global data sets, with sub-options for organization:
2.1 separate lists for high resolution and low resolution
2.2 separate lists for marine, terrestrial and freshwater (though some variables overlap)
2.3 user generates customized lists

3. Issues to do with the kind of question asked:
3.1 what are the best predictors for this species across all data sets
3.2 what is the specific relationship to selected datasets (e.g. veg)
3.3 what variables other than a given one (say climate) are important.

4. Various modalities
4.1 distribution modeling
4.2 invasive species
4.3 climate change
4.4 ensembles of species

Rather than allow development to go in all directions, I have the idea (vision) of a new type of data-streaming web, where customized information, including model predictions, just flows continuously downstream rather than being chunked into stages (data preparation and variable selection, modeling, writing, publication, etc.). So it would make sense to envision custom filters applied to the environmental data. One approach would be a text selection capability such as regular expressions. Another would be a custom editable list in the user's scratch directory, rather than a multitude of options.

These filters could be applied to each of the channels of the image. For example, the red channel could be reserved for temperature and rainfall variables and the green channel for another category of variable, such as transportation. This would ensure that variables from the required class are incorporated. A channel could also be set to a specific variable, e.g. red is annual temperature, ensuring that variable is used. This forcing method was a feature of the previous version, but the filtering approach would make it a specific behavior within a more general system.
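As a rough sketch of what such per-channel filters might look like (the dataset names and the channel-to-pattern mapping below are invented for illustration, not part of the current WhyWhere interface):

```python
import re

# Names of available environmental grids (invented examples).
available = ["wc_tmean_07", "wc_tmean_01", "wc_prec_07",
             "annual_temperature", "roads_density", "rail_density",
             "landcover", "elevation"]

# One regular-expression filter per image channel.  A channel can also
# be pinned to a single variable, forcing it into the model.
channel_filters = {
    "red":   r"tmean|prec|temperature",   # climate only
    "green": r"roads|rail",               # transportation only
    "blue":  r"^elevation$",              # forced to one variable
}

for channel, pattern in channel_filters.items():
    matches = [name for name in available if re.search(pattern, name)]
    print(channel, "->", matches)
```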