Validation of Climate Effect Models: Response to Brewer and Other

David R.B. Stockwell
February 4, 2009


A review by independent Accredited Statisticians, Brewer and Other [KB09], suggested that some claims in the report “Tests of Regional Climate Model Validity in the Drought Exceptional Circumstances Report” [DS08] were premature. Additional tests suggested by KB09 support the claim made in the original report of “no credible basis for the claims of increasing frequency of Exceptional Circumstances declarations”. The contributions of KB09 and DS08 to evaluating the skill of climate model simulations based on, arguably, weakly validated idiosyncratic statistics are discussed. These include recommendations for more rigor in evaluating the performance of climate effects simulations, such as adherence to standardized forecasting practices [AG09].


As part of a review of the support to farmers and rural communities provided under the Exceptional Circumstances (EC) arrangements and other drought programs, the Australian Federal Government Department of Agriculture, Fisheries and Forestry (DAFF) commissioned a study from the CSIRO Climate Adaptation Flagship and the Australian Bureau of Meteorology (BoM) to examine the future of EC declarations under climate change scenarios. The resulting Drought Exceptional Circumstances Report (DECR) examined the yearly percentage of area affected by exceptional temperature, rainfall, and soil moisture levels for each of seven Australian regions from 1900 to 2007, for both recorded observations and climate models (projecting, simulating or forecasting) of historic drought, concluding:

DECR: Under the high scenario, EC declarations would likely be triggered about twice as often and over twice the area in all regions.

The interpretation of such statements by their client, DAFF, is illustrated by a press release of 6 July 2008 (DAFF08/084B) stating:

DAFF: Australia could experience drought twice as often and the events will be twice as severe within 20 to 30 years, according to a new Bureau of Meteorology and CSIRO report.

After the summary data used in the DECR report was made freely available on the BoM website, an assessment of the validity of the climate models was circulated [DS08], examining the skill of the models and concluding there was “no basis for belief in the claim of increasing frequency of EC declarations”. At the initiative of Dr Ian Castles, independent accredited statisticians reviewed the DECR and DS08 [KB09]. This study provides some additional analysis in response to suggestions in KB09, and addresses other questions regarding DS08. The analysis is available in an R script [R09].

Why Validate?

A model or simulation like a global climate model (GCM) is a surrogate for an actual climate system. If the model does not provide a valid representation of the actual system, any conclusions derived from the model or simulation are likely to be erroneous and may result in poor decisions being made. Validation of models is expected practice throughout society, similar to the business concept of ‘fitness for use’. Specifically:

Validation is the process of determining the degree to which a model or simulation is an accurate representation of the real world from the perspective of the intended uses of the model or simulation.

Validation consists of comparing simulation and system output data over one or more statistics with a formal statistical procedure. Examples of statistics that might be used include the mean, the trend, and correlation or confidence intervals. It is important to specify which statistics are most important, as some statistics may be more relevant than others. In the case of forecasting effects of CO2 on drought, the overall trend is regarded as more important than the patterns of correlation, because climate is a longer term phenomenon.
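The comparison of simulation and system output over such statistics can be sketched in a few lines. The document's own analysis was done in R [R09]; the following is a hedged Python equivalent on synthetic stand-in data (the series lengths and gamma distributions are illustrative assumptions, not the DECR data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
years = np.arange(1900, 2008)

# Hypothetical yearly series (108 years, as in the 1900-2007 record):
obs = rng.gamma(shape=1.5, scale=4.0, size=108)  # stand-in for observed % area in drought
sim = rng.gamma(shape=1.5, scale=4.0, size=108)  # stand-in for a model's simulated % area

# Compare the statistics named above: mean, trend, and correlation.
mean_diff_p = stats.ttest_ind(obs, sim).pvalue   # difference of means
obs_trend = stats.linregress(years, obs).slope   # trend of observations
sim_trend = stats.linregress(years, sim).slope   # trend of simulation
r, r_p = stats.pearsonr(obs, sim)                # year-to-year correlation

print(f"mean difference p = {mean_diff_p:.3f}")
print(f"trends: obs {obs_trend:+.4f}, sim {sim_trend:+.4f} (% area per year)")
print(f"correlation r = {r:.3f} (p = {r_p:.3f})")
```

As the text notes, which statistic matters depends on the intended use: for forecasting drought under CO2, the trend comparison carries more weight than the year-to-year correlation.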

A model’s results have credibility if they satisfy additional factors: demonstration that the simulation has been validated and verified, and general understanding of and agreement with the simulation’s assumptions. If validity has not been demonstrated adequately, or the model ‘fails’ in key ways, then it is not ‘fit for use’. If it fails all tests, then it is accurately described as ‘useless’, and certainly cannot be regarded as credible.


KB09 agrees with DS08 on the need for more effective validation of models of droughts at regional scales.

6. Dr Stockwell has argued that the GCMs should be subject to testing of their adequacy using historical or external data. We agree that this should be undertaken as a matter of course by all modellers. It is not clear from the DECR whether or not any such validation analyses have been undertaken by CSIRO/BoM. If they have, we urge CSIRO/BoM make the results available so that readers can make their own judgments as to the accuracy of the forecasts. If they have not, we urge them to undertake some.

7. If any such re-evaluation is to be carried out, however, it should be done using two separate time periods, namely 1900-1950 (during which the rainfall trend was generally upwards) and 1950-2007 (where it was generally downwards.) This would allow the earlier period to provide internal validations and the later period external validations. However, if and when these analyses are repeated, the raw data used should be compiled not for the existing seven Regions, but for more homogeneous Regions, as suggested in item 1 above.

Note that the thrust of DS08 was that the models failed a range of validation tests, so there was no credible basis for the DECR claims. Additional analyses follow, using the time periods suggested in KB09 and reporting the normality of distributions and residuals. These analyses use robust tests on the mean value for each year over all regions and models, which improves the normality of the distribution by filling in most of the zero (no drought) years. Deficiencies remain in this approach: for example, the mean is taken over regions of unequal size, as we were not supplied with the grid cell data needed for true area-weighted means. Nevertheless, it is argued the result is robust.


Fig 1. The area of each of the seven regions under exceptionally low rainfall (colors), and the mean (black).

Difference of means 1900-1950 vs. 1950-2007

Table 1 is similar to the mean comparison analysis in RBHS06. For observations, the mean droughted area across all seven regions, and for model projections, the mean across all seven regions of the 13-model mean, were compared over half-century periods. The mean areal extent of observed exceptionally low rainfall decreased from 1900-1950 to 1951-2007, while the simulated area of exceptionally low rainfall increased over the same period. The p values for a non-parametric Mann-Whitney test, used because the observations are not normally distributed, indicate the differences between the periods are highly significant.

Table 1: Mean percentage area of exceptionally low rainfall over time periods suggested by KB09. A Mann Whitney rank-sum test shows significant differences between periods.

                          1900-2007   1900-1950   1951-2007   P (1900-2007 vs. 1951-2007)   P (1900-1950 vs. 1951-2007)
Observed % Area Drought   5.6±0.5     6.2±0.7     4.9±0.6     0.10                          0.004
Modelled % Area Drought   5.5±0.1     4.8±0.2     6.2±0.2     0.006                         <0.001

Both rows: Mann-Whitney rank-sum test (wilcox.test(x, y) in R).
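For readers without R, the two-sided wilcox.test(x, y) used for Table 1 corresponds to SciPy's Mann-Whitney rank-sum test. A minimal sketch on illustrative data (the gamma parameters are assumptions chosen to mimic skewed drought-area percentages, not the DECR values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Stand-ins for the two half-century samples of yearly mean droughted area (%).
area_1900_1950 = rng.gamma(2.0, 3.1, size=51)  # earlier period, higher mean
area_1951_2007 = rng.gamma(2.0, 2.4, size=57)  # later period, lower mean

# R's wilcox.test(x, y) is a two-sided rank-sum test; the SciPy
# equivalent is mannwhitneyu with alternative="two-sided".
res = stats.mannwhitneyu(area_1900_1950, area_1951_2007, alternative="two-sided")
print(f"U = {res.statistic:.1f}, p = {res.pvalue:.4f}")
```

The rank-sum test is used precisely because, as noted above, the droughted-area percentages are far from normally distributed.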

Trends in 1900-1950 vs. 1950-2007

Table 2 below shows two analyses related to trends on the entire data set, with the p-value from a Shapiro test for normality of residuals. A significant negative coefficient in LM Obs vs Exp 1900-2007 indicates an inverse relationship between observations and forecasts, while a significant p-value in the Shapiro test indicates residuals are not normally distributed. While the trend of the observations of drought area over the 1951-2007 period is not significantly different from zero in this test, the trend of the projections over the same period is positive and significant. The Shapiro tests are significant, indicating non-normality of residuals.

These results are consistent with those obtained by a different method in Table 1. The models forecast increasing drought areas, but the trend in observations of drought extent are mildly or significantly decreasing. Taking the mean of all models and regions, did not correct departure of residuals from normal due to the highly non-normal original data distribution. Normal residuals may only be obtained with a greatly improved statistical modelling approach, beyond the scope of this study.

Table 2. Linear regression tests and residual normality (Shapiro test) of (1) all observed and forecast data and (2) trends of the means of observations and forecasts:

LM Obs vs. Exp 1900-2007:   Obs = -0.6*Exp + 8.9    (r2 = 0.04, p = 0.04)
LM of Obs 1951-2007:        Obs = -0.02*Exp + 6.3   (r2 = -0.01, p = 0.78)
LM Forecast 1951-2007:      Obs = 0.04*Exp + 2.7    (r2 = 0.07, p = 0.06)
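The pairing of an OLS trend fit with a residual-normality check can be reproduced directly. This sketch uses SciPy's linregress and Shapiro-Wilk test on an assumed skewed series, not the DECR data; the document's own versions are in the R script [R09]:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
years = np.arange(1951, 2008)

# Illustrative skewed series standing in for yearly mean % area in drought.
obs = rng.gamma(1.2, 4.0, size=years.size)

# Trend: ordinary least squares of % area against year.
fit = stats.linregress(years, obs)

# Normality of residuals: Shapiro-Wilk; a small p-value indicates
# the residuals depart from a normal distribution.
residuals = obs - (fit.intercept + fit.slope * years)
shapiro_p = stats.shapiro(residuals).pvalue

print(f"slope = {fit.slope:+.4f}, r2 = {fit.rvalue**2:.3f}, "
      f"trend p = {fit.pvalue:.3f}, Shapiro p = {shapiro_p:.4f}")
```

A significant Shapiro p-value here, as in Table 2, is a warning that the OLS significance levels should be read with caution.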

Moving 30 year Averages

Another approach to evaluating climatic trends was illustrated in the DECR and in Fig 1 from DS08. Fig 2 below shows the overall 30-year running mean of percentage area of exceptionally low rainfall for observations decreasing in almost all areas, and forecasts increasing in all areas. Further visual evidence of the significance of the difference between model projections and observations is shown by the lack of overlap of the spread of results at 1990.


Figure 2. Overall average (green thick line) of the 30-year running average of percentage area of exceptionally low rainfall for observations is decreasing, in almost all areas (red lines), while models (black lines) are increasing in all areas.
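The 30-year running average used in Fig 2 is straightforward to compute; a sketch with a hypothetical series follows (the trend and noise parameters are assumptions for illustration only):

```python
import numpy as np

def running_mean(x, window=30):
    """Moving average via convolution; returns len(x) - window + 1 values."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="valid")

# Illustrative yearly % area series, 1900-2007, with a mild downward trend.
rng = np.random.default_rng(3)
years = np.arange(1900, 2008)
area = np.clip(6.0 - 0.02 * (years - 1900) + rng.normal(0, 3, years.size), 0, 100)

smoothed = running_mean(area, 30)
print(len(smoothed))  # 108 - 30 + 1 = 79 smoothed values
```

Smoothing at the 30-year scale is what makes the comparison a climatic one rather than a weather one, which matters for the weather-versus-climate objection discussed below.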

No doubt other statistics could be used to compare the difference of observed and modelled drought trends, with greater confidence if normality could be achieved. The most accurate conclusion, then, is that while it may be premature to say the models are entirely ‘useless’ at simulating drought, they have not been shown to be ‘fit for use’ for forecasting trends, and so are not credible.

Weather, Climate and Chaos

KB09 suggested there were misunderstandings of the DECR in the DS08 review, without being specific (the word is not used elsewhere in their report). Possibly KB09 refer to a distinction between ‘weather’ and ‘climate’ in the sense used by Andrew Ash below (pers. comm.).

AA: The correlation and efficiency tests are based on an assumption that the climate models are being used to hindcast historical weather. This assumption is incorrect.

Their argument is that failure of validation at the shorter time scales of weather does not preclude fitness of the model at a 30-year time scale. It was because of this distinction that more emphasis was placed on the trends in DS08 (see Fig 2), as shown by the statement in the abstract of DS08:

DS: The most worrying failure was that simulations showed increases in droughted area over the last century in all regions, while the observed trends in drought decreased in five of the seven regions identified in the CSIRO/Bureau of Meteorology report.

Further, Fig 10 performs a crude validation by showing the variability of low rainfall lies within the range simulated by the multi-model ensemble (MME). The rationale is that because the observed temperature and rainfall are random instantiations of highly chaotic trajectories, the observations cannot be compared directly with any specific model simulation, only with the ensemble range. There are a number of problems with this view:

  1. The selection of the 13 climate models is ad hoc, and hence there is no assurance the MME properly samples the relevant state space. As a result, MMEs are sometimes referred to pejoratively as “ensembles of opportunity” [PD08].
  2. Even if the MME can be regarded as ‘skillful’ by virtue of containing the observations, this test does not demonstrate skill of models at forecasting trends. For that, one would need to demonstrate the models can match the trends in the observations.
  3. If validation only requires that the observations stay within the full range of all individual model simulations, where the models are of unknown accuracy, then skillful models are indistinguishable from unskillful ones.
  4. As the correlation of trends in CO2 and temperature over the last 50 years is widely regarded as evidence of warming due to CO2, it is inconsistent to claim that a difference in the trend of warming and drought over the same time scale is inconsequential.
  5. Fig 2 suggests that the range predicted by the models and the range of drought frequency may in fact have diverged significantly.

Some of these issues are highly technical, but they require closer evaluation to see what is actually being validated in an MME. If observations such as rainfall must only lie within the range simulated by the models, all that is being tested is the ability of the models to simulate the range (or variance) of the observations. Therefore, one cannot presume such models can also successfully simulate other features, such as the mean value of the observations, the change in the mean value, the trend of the observations, and so on.
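The weakness of the envelope test can be illustrated with a toy construction. Here, thirteen hypothetical "models" all trend upward while the "observations" trend downward, yet the ensemble envelope is wide enough that the range test would typically still be passed. All series are synthetic assumptions built for the demonstration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
years = np.arange(1900, 2008)

# "Observations" with a mild downward trend.
obs = 6.0 - 0.01 * (years - 1900) + rng.normal(0, 1.5, years.size)

# Thirteen "models" with a common upward trend but widely spread offsets,
# so the ensemble envelope is broad enough to enclose almost anything.
models = np.array([
    off + 0.02 * (years - 1900) + rng.normal(0, 1.5, years.size)
    for off in np.linspace(0.0, 9.0, 13)
])

enclosed = ((obs >= models.min(axis=0)) & (obs <= models.max(axis=0))).all()
obs_slope = stats.linregress(years, obs).slope
mean_model_slope = np.mean([stats.linregress(years, m).slope for m in models])

# The envelope test can "pass" even though every model trends the wrong way.
print(f"observations inside envelope: {enclosed}")
print(f"obs slope {obs_slope:+.4f} vs mean model slope {mean_model_slope:+.4f}")
```

This is exactly the situation the text describes: containment within the MME range says nothing about skill at the trend, which is the intended use.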

It is crucial that if the intended ‘fitness for use’ of a model is to forecast trends, then validation must consist of demonstrated skill at modelling trends in historic data. This is the conventional view, and the view expressed in both DS08 and KB09, that evidence is necessary to support claims. One should also remark on the wisdom of the old saw that extraordinary claims require extraordinary evidence. The DECR made an extraordinary claim about the change in the trend of the observations:

EC declarations would likely be triggered about twice as often and over twice the area in all regions

Not even ordinary evidence of skill at their intended use has been demonstrated. Another way to say this is that a validation consisting of the MME enclosing the range of observations is a very weak test, so weak that very little can be reliably inferred from it.

Tests of Individual Models

KB09 are concerned with the force of arguments in DS08, especially the use of the word ‘significant’ where residuals may not have been normal. In retrospect, I would have qualified the word more. It is not clear to what extent lack of normality of residuals undermines the results, and departure from normality is quite common. Unfortunately, normality of residuals may be difficult to achieve with this type of data without much more sophisticated approaches.

KB09 outlined a preferred approach but performed no analysis:

9. A Possible Alternative to OLS Regression. It is at least possible that forecasting using simple ARIMA modelling [3], [4], might prove to be just as accurate and far easier to justify than OLS regression.

The DECR rainfall analysis uses a peculiar metric: the percentage of area with rainfall below the 5th percentile. In formal terms, this appears to be a ’bounded extreme value, peaks over threshold’ statistic. The distribution resembles a Pareto (power) law, but because it is bounded where the predicted extent of drought approaches 100%, the distribution becomes more like a beta distribution.
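The construction of this statistic can be made concrete. The sketch below assumes a hypothetical gridded rainfall record (the grid size and gamma rainfall distribution are invented for illustration; the actual DECR grid data were not supplied):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical yearly rainfall for one region: 108 years x 400 grid cells.
rainfall = rng.gamma(4.0, 120.0, size=(108, 400))

# DECR-style statistic: for each cell, the 5th percentile of its own
# 1900-2007 rainfall defines "exceptionally low"; the yearly statistic is
# the percentage of cells at or below that threshold.
thresholds = np.percentile(rainfall, 5, axis=0)           # one threshold per cell
pct_area = 100.0 * (rainfall <= thresholds).mean(axis=1)  # % of region per year

# Bounded on [0, 100] and, by construction, averaging about 5% over the record.
print(f"mean = {pct_area.mean():.2f}%, max = {pct_area.max():.2f}%")
```

The resulting series is bounded, strongly skewed, and dominated by values near zero, which is why standard normality-based tests struggle with it.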

KB09 believe a statistic such as average annual rainfall may preserve more information. In that case, the residuals of the standard tests in DS08 might also improve. By way of explanation for the use of standard tests: there are ‘hard yards’ in developing a formal statistical model, such as KB09 propose, for such an idiosyncratic statistic. The pragmatic approach of DS08 was to use a number of statistics, breaking the relevant elements down into the following questions:

  1. How much of the variation in the observations is explained by the models? – r2 correlation.
  2. Does the trend in drought severity and frequency match the models? – slope of linear regression.
  3. Do the models agree with the historic severity of droughts? – Nash-Sutcliffe coefficient.
  4. Do the models agree with the historic frequency of droughts? – return period.

Below are more specific notes on each of their concerns, largely framed in the form of an if-then-maybe hypothetical:

If certain changes were made to the DECR analysis, then maybe the results might not have been so bad.

DS08 assessed the DECR as received, while KB09 is more constructive and speculative. This raises a larger question of how we go about assessing an idiosyncratic model, a topic expanded on in the conclusions.


Claims of significance for idiosyncratic data with non-normal distributions and autocorrelation do need to be treated with caution. KB09 take issue with a linear fit to the whole time period from 1900 to 2007, and argue that if a shorter time frame such as the period 1950-2007 were examined, then maybe results might improve. One might also argue this approach is arbitrary and informed by prior examination, or ‘data snooping’.
In defense of using the full period, CO2 and temperatures both generally increase from 1900 to 2007; hence, any CO2- or temperature-related correlations with drought should be discoverable. Nevertheless, Fig 1 and Tables 1 and 2 show consistent results: the inverse relationship between trends in the models and observations exists at different time intervals.
Regarding the comment in KB09 that the p values seem too low: while standard deviations (s.d.) were quoted in Table 1, the standard error of the mean (s.e.) was used to calculate the p values, as stated in the caption “Table 1: t-test of difference in mean of predicted trends to the observed mean of droughted area”. This follows the practice of DCPS07, and interested readers can refer to DD07 and associated links for discussion.


Consistently small values in the r2 columns of Tables 2-8 indicate a lack of variance explained by the models. Here KB09 state:

If the implication is that the GCM-based projections do not reflect year to year changes in the drought affected percentages of the seven Regions, we do not regard this as a serious failure. It is not what the GCMs were constructed to do. They were meant to indicate long term trends.

This statement seems informed by a view prevalent in climate science that the r2 statistic only captures year-to-year variance and hence is invalid at climatic scales. However, this view is misleading. The r2 statistic robustly quantifies variation at a range of scales, including short and long term trends. R2 was used here in this capacity, as a robust detector of possible skill. As we see, only 1% of all variation (both short and long term) is explained by the models.


The average time between droughts in each region differed between observations and models. KB09 suggest that a “fallacy of composition” effect makes it more than possible that the two calculations of return period could be widely different for reasons other than lack of skill of models. This was confirmed by Andrew Ash:

AA: The observed data have the shortest return period as they have the finest spatial resolution and the model based return regions have increasingly larger mean return periods, inversely related to the spatial resolution at which they are reported.

Nevertheless, the labels on the data sets indicate the supplied data for models and observations represent comparable quantities at the same regional scale: percentage area below the 5th percentile of exceptionally low rainfall. To compare on the same grid cell basis, we would have needed access to both observed and projected rainfall data within 25 km grid cells, which we were not supplied with. Even so, this could be regarded as an “if-then-maybe” objection: if the analysis were conducted on the same scale, then maybe skill would be shown in drought frequency. Then again, maybe not.


Consistently negative values of the Nash-Sutcliffe coefficient in Tables 2-8 imply that “If averaged over time, each of the 13 GCMs’ sets of projections lies further away from the corresponding set of observed values than the simple mean of the observed values do.” KB09 suggest that if the analysis were conducted for another period where the net change in rainfall was not constant, then perhaps the result would not be so bad.

The Nash-Sutcliffe coefficient is widely used in hydrological modelling for assessing the quality of model outputs against observations. As far as I know, it should not be affected by the start and end points of the series; rather, it performs a kind of sum of squares on the difference between observed and projected values at each point. Otherwise, my comments on the choice of period of analysis also apply.
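The standard definition of the coefficient makes the interpretation of negative values clear: 1 is a perfect fit, 0 means the model does no better than the observed mean, and negative means worse than the mean. A minimal sketch (the five-value series is invented for illustration):

```python
import numpy as np

def nash_sutcliffe(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - SSE(sim) / SSE(observed mean)."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

obs = np.array([5.0, 7.0, 4.0, 9.0, 6.0])

print(nash_sutcliffe(obs, obs))                      # perfect fit: 1.0
print(nash_sutcliffe(obs, np.full(5, obs.mean())))   # mean predictor: 0.0
print(nash_sutcliffe(obs, obs + 5.0))                # biased high: strongly negative
```

The last line shows why the DECR models' consistently negative values are damning: a constant bias alone is enough to drive the coefficient below zero, regardless of start and end points.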


KB09’s main concerns may be summarized as: (1) some tests do not appear consistent with their assumptions, and (2) DS08 did not eliminate all possible explanations for poor results, attributing them entirely to lack of model skill. New tests suggested by KB09 show the strong and significant departure of model projections from the observed pattern of historic droughts, with a strong bias in favor of increased and increasing drought in Australia with increasing levels of CO2. These additional analyses agree with the findings of DS08, demonstrating their robustness. Thus it appears the claim of no credible basis for increasing droughts is not affected, and is actually vindicated, by KB09’s report.
In DS08, the KB09 recommendations regarding improved statistical models and regionalization were alluded to in the discussion:

DS: Recasting the drought modelling problem into known statistical methods might salvage some data from the DEC report. Aggregating the percentage area under drought to the whole of Australia might reduce the boundedness of the distribution, and might also improve the efficiency of the models.

While drought-biased climate simulations play well during a severe drought in the political power-bases of the country, the practice of uncritical acceptance of unvalidated or invalid models must be strongly discouraged in evidence-based science policy. Finally, to quote Luboš Motl [LM08]:

And perhaps, most people will prefer to say “I don’t know” about questions that they can’t answer, instead of emitting “courageous” but random and rationally unsubstantiated guesses.

Further Work

One avenue for further work is the development of an ARIMA or other statistical framework for areas of exceptionally low rainfall as suggested by KB09.

Preliminary split-sample analysis described at [NM08] could be developed. These results suggest GCMs cannot be ‘selected’ on the basis of their historic fit to drought at regional scales. In most areas, the models that do well in one 50-year period do poorly in another, and vice versa, further indicating ‘failure’ in external validation. The low value of GCMs for regional effects forecasting is not fully acknowledged by their promoters.

Another avenue of inquiry is robust statistics for assessing confidence in idiosyncratic models where the developers performed no detailed statistical modelling. As such studies have no assumptions conforming to more standard approaches, most standard tests will be formally invalid. These concerns argue for more agreed-upon metrics of performance, such as those proposed for forecasting [AG09].


[AG09] Armstrong, J.S. and K.C. Green, “Analysis of the U.S. Environmental Protection Agency’s Advanced Notice of Proposed Rulemaking for Greenhouse Gases”, a statement prepared for US Senator Inhofe analyzing the US EPA’s proposed policies for greenhouse gases.

[CL05] Cohn, T. A., and H. F. Lins (2005), Nature’s style: Naturally trendy, Geophys. Res. Lett., 32(23), L23402, doi:10.1029/2005GL024476.

[DCPS07] Douglass, D.H., J.R. Christy, B.D. Pearson, and S.F. Singer (2007). “A comparison of tropical temperature trends with model predictions”. International Journal of Climatology, doi:10.1002/joc.1651. Retrieved 12 May 2008.

[DD07] David Douglass’ comments.

[DECR] Drought Exceptional Circumstances Report (2008): Hennessy, K., R. Fawcett, D. Kirono, F. Mpelasoka, D. Jones, J. Bathols, P. Whetton, M. Stafford Smith, M. Howden, C. Mitchell, and N. Plummer. An assessment of the impact of climate change on the nature and frequency of exceptional climatic events. Technical report, CSIRO and the Australian Bureau of Meteorology for the Australian Bureau of Rural Sciences, 33pp.

[DS08] Stockwell, D.R.B. (2008), Tests of Regional Climate Model Validity in the Drought Exceptional Circumstances Report.

[KB09] K.R.W. Brewer and A.N. Other, (2009) Some comments on the Drought Exceptional Circumstances Report (DECR) and on Dr David Stockwell’s critique of it.

[KM07] Koutsoyiannis, D., and A. Montanari, Statistical analysis of hydroclimatic time series: Uncertainty and insights, Water Resources Research, 43 (5), W05429.1–9, 2007.

[LM08] Luboš Motl, The Reference Frame.

[NM08] Niche Modelling.

[PD08] Palmer, T.N., F.J. Doblas-Reyes, A. Weisheimer, G.J. Shutts, J. Berner, and J.M. Murphy, Towards the Probabilistic Earth-System Model, arXiv:0812.1074v2.

[R09] R script for analysis

[RBHS06] Rybski, D., A. Bunde, S. Havlin, and H. von Storch (2006), Long-term persistence in climate and the detection problem, Geophys. Res. Lett., 33, L06718, doi:10.1029/2005GL025591.

