A number of familiar tests often used to evaluate model performance (R2 correlation, Nash-Sutcliffe efficiency, and similarity of trends and return periods) were reported here. They showed little evidence of skill in the DECR models compared with observations on any of these measures. I also said what a better treatment might entail but left that for another time:
The percentage of droughted area appears to be a ‘bounded extreme value, peaks over threshold’ or bounded POT statistic. The distribution resembles a Pareto (power) law, but due to the boundedness where predicted extent of drought approaches 100% becomes more like a beta distribution (shown for SW-WA on Fig 2). Recasting the drought modeling problem into known statistical methods might salvage some data from the DEC report. Aggregating the percentage area under drought to the whole of Australia might reduce the boundedness of the distribution, and might also improve the efficiency of the models.
Aware that the tests I applied were not the last word due to the idiosyncratic nature of the data, the conclusion in the summary was slightly nuanced: that as there was no demonstration of skill at modeling drought (in both the DECR and my tests), and as validation of models is necessary for credibility, there is no credible basis:
Therefore there is no credible basis for the claims of increasing frequency of Exceptional Circumstances declarations made in the report.
What is needed to provide credibility is demonstrated evidence of model skill.
Andrew Ash, of the CSIRO Climate Flagship, sent a response on 18 Dec 2008. This was to fulfill an obligation he made on 16 Sep 2008 to provide “a formal response to your review of the Drought Exceptional Circumstances report (dated 3 Sep 2008)”, after my many requests for details of the validation of skill at modelling droughts.
I must say I was very pleased to see there was no confidentiality text at the end of the email. I feel so much more inclined to be friendly without it. I can understand confidentiality inside organizations, but to send out stock riders to outsiders is picayune. The sender should be prepared to stand by what they say in a public forum and not hide behind a legal disclaimer. Good on him for that.
The gist of the whole email is that he felt less compelled to respond due to an ongoing review I sent to the Australian Meteorological Magazine on 23 Sep 2008. As it was, I was still waiting for the review from the AMM on the 18th December when I received this response. Kevin Hennessy relayed some advice from Dr Bill Venables, a prominent CSIRO statistician, and the following didn’t add anything:
However, we have looked at the 3 Sep 2008 version of your review. The four climate model validation tests selected in your analysis are inappropriate and your conclusions are flawed.
* The trend test is invalidly applied because (i) there is a requirement that the trends are linear and (ii) the t-test assumes the residuals are normally distributed. We undertook a more appropriate statistical test. Across 13 models and seven regions, there are no significant differences (at the 5% level) between the observed and simulated trends in exceptionally low rainfall, except for four models in Queensland and one model in NW Australia.
It’s well known that different tests can give different results. It is also true that some tests may be better or more reliable than others. Without more details of their ‘more appropriate’ test it’s hard to say anything, except that lack of significance does not demonstrate skill. The variability of the climate model outputs could be so high that they allow ‘anything to be possible’, as often seems to be the case.
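To illustrate why a non-significant difference is weak evidence, here is a minimal Python sketch. It is not the test used in my review or in the CSIRO response; it is an ordinary least-squares trend with a rough two-standard-error interval, applied to two made-up series that share the same underlying trend but differ in noise. The function names, the alternating noise pattern and all the numbers are assumptions chosen for illustration only.

```python
def ols_slope_with_se(y):
    """Least-squares slope of y against its index, plus the slope's standard error."""
    n = len(y)
    x_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    sxy = sum((i - x_mean) * (yi - y_mean) for i, yi in enumerate(y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Residual sum of squares around the fitted line
    rss = sum((yi - (intercept + slope * i)) ** 2 for i, yi in enumerate(y))
    se = (rss / (n - 2) / sxx) ** 0.5
    return slope, se


def make_series(noise, n=20, true_slope=0.1):
    """Hypothetical series: a linear trend plus deterministic alternating noise."""
    return [true_slope * i + noise * (-1) ** i for i in range(n)]


slope_quiet, se_quiet = ols_slope_with_se(make_series(0.1))
slope_noisy, se_noisy = ols_slope_with_se(make_series(10.0))

for label, slope, se in [("quiet", slope_quiet, se_quiet),
                         ("noisy", slope_noisy, se_noisy)]:
    print(f"{label}: slope {slope:+.3f}, rough 95% interval +/- {2 * se:.3f}")
```

In this toy case the interval for the noisy series is wide enough to contain the true trend, zero, and a trend of the opposite sign, so failing to find a significant difference says little about whether the trend was modelled well.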
* The correlation and efficiency tests are based on an assumption that the climate models are being used to hindcast historical weather. This assumption is incorrect. As a result, the tests selected are inapplicable to the problem being addressed. This in turn leads to false conclusions.
This would be true if these were the only tests and if correlation and efficiency depended entirely on ‘short-term fluctuations’. They do not, as they capture skill at modeling both short AND long term fluctuations. This is also why I placed more emphasis on skill at modelling trends over climatically relevant time scales. He is also not specific about which conclusions are false. The conclusion of ‘no credible basis’ is not falsified by lack of evidence.
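For readers unfamiliar with the two measures, here is a minimal Python sketch of the standard definitions of the Nash-Sutcliffe efficiency and the squared Pearson correlation. The series are made up for illustration and are not DECR data; the function and variable names are my own assumptions.

```python
from statistics import mean


def nash_sutcliffe(observed, simulated):
    """NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    obs_mean = mean(observed)
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    sst = sum((o - obs_mean) ** 2 for o in observed)
    return 1.0 - sse / sst


def r_squared(observed, simulated):
    """Squared Pearson correlation between observed and simulated values."""
    mo, ms = mean(observed), mean(simulated)
    cov = sum((o - mo) * (s - ms) for o, s in zip(observed, simulated))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_s = sum((s - ms) ** 2 for s in simulated)
    return cov * cov / (var_o * var_s)


# Made-up 'percentage area in drought' values, for illustration only
observed = [0.0, 2.0, 0.0, 10.0, 1.0, 0.0, 25.0, 3.0]
shifted = [o + 5.0 for o in observed]           # right shape, constant bias
climatology = [mean(observed)] * len(observed)  # always predicts the mean

nse_perfect = nash_sutcliffe(observed, observed)
nse_climatology = nash_sutcliffe(observed, climatology)
nse_shifted = nash_sutcliffe(observed, shifted)
r2_shifted = r_squared(observed, shifted)
```

A perfect simulation scores an efficiency of 1 and a constant prediction of the observed mean scores 0. The shifted series keeps a correlation of 1 but loses efficiency, because efficiency penalizes bias while correlation does not. Neither measure singles out short-term fluctuations: both are computed over the whole series.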
It should also be noted that the DECR itself considered return periods (Tables 8 and 10), so any criticism of return periods applies equally to the DECR.
* The return period test is based on your own definition of ‘regional return period’, which is different from the definition used in the DEC report. Nevertheless, your analysis does highlight the importance of data being collected or produced at different resolutions and the effect this has on interpretations of the frequency of drought. The observed data have the shortest return period as they have the finest spatial resolution and the model based return regions have increasingly larger mean return periods, inversely related to the spatial resolution at which they are reported. We were well aware of this issue prior to the commencement of the study and spent a considerable amount of time designing an analysis that would be robust to take this effect into account.
I appreciate the explanation for the lack of skill at modelling return period, which measures drought frequency, as opposed to drought intensity measured by efficiency. Nevertheless, lack of demonstrated skill at modelling drought frequency stands.
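The resolution effect described in the response can be sketched with a toy example in Python. I assume here the simplest definition of return period, the number of years divided by the number of years in which a threshold is exceeded, which is not necessarily the definition used in either my review or the DECR. The two ‘cells’, the threshold and all values are made up.

```python
def return_period(series, threshold):
    """Mean number of years between exceedances of `threshold`.

    Returns None when the threshold is never exceeded.
    """
    events = sum(1 for x in series if x > threshold)
    if events == 0:
        return None
    return len(series) / events


# Hypothetical drought severity in two neighbouring fine-resolution cells
cell_a = [0, 30, 0, 0, 0, 30, 0, 0, 0, 0]
cell_b = [0, 0, 0, 30, 0, 30, 0, 0, 0, 30]

# A coarser-resolution view: the average of the two cells
averaged = [(a + b) / 2 for a, b in zip(cell_a, cell_b)]

threshold = 20
rp_a = return_period(cell_a, threshold)
rp_b = return_period(cell_b, threshold)
rp_avg = return_period(averaged, threshold)
print(rp_a, rp_b, rp_avg)
```

In this toy case each fine cell exceeds the threshold in two or three of the ten years, while the averaged series exceeds it only in the one year when both cells are in drought, so the coarser series has the longer return period.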
Note they continue to be unresponsive to requests for evidence that the climate models have skill at modelling droughts. Where we stand at the moment is that, irrespective of the reliability of my tests, there is still no evidence of skill to be seen at the short term or the long term, or at drought intensity or frequency, and so my claim of “no credible basis for the claims of increasing frequency of Exceptional Circumstances declarations” still stands. The Climate Flagship has steadfastly abjured presenting validation evidence. While the concerns expressed have relevance to the quality of the tests (which are widely used, but problematic due to the strange data), they were not precise about the conclusions or claims they were trying to rebut.
I came across a recent drought study that also finds no statistical significance between models and drought observations. This was actually in Ref 27 of the DECR: Sheffield, J. & Wood, E. F. Projected changes in drought occurrence under future global warming from multi-model, multi-scenario, IPCC AR4 simulations. Climate Dynamics 31, 79-105 (2008).
Although the predicted future changes in drought occurrence are essentially monotonic increasing globally and in many regions, they are generally not statistically different from contemporary climate (as estimated from the 1961-1990 period of the 20C3M simulations) or natural variability (as estimated from the PICNTRL simulations) for multiple decades, in contrast to primary climate variables, such as global mean surface air temperature and precipitation.
Below is a plot of the observations for the drought statistic: the percentage of area experiencing exceptionally low rainfall, below the 5th percentile (leading to an Exceptional Circumstances drought declaration). You can see how ‘peaky’ it is, even when the average is taken (black).
In some ways it might have been better to just knuckle down and develop a POT model right from the start, as it might have allowed me to produce a less nuanced response. I have been doing that, but have had to upgrade R. That needed a recompilation, which took all night and cost me my graphical interface to R. Even then, the package VGAM doesn’t compile for some reason, so I have to look for other packages.
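In the meantime, the basic idea of a POT model can be sketched without any packages. The Python sketch below is not the VGAM fit I am attempting in R. It swaps in a simple method-of-moments estimate of the generalized Pareto distribution (GPD), which is only valid when the shape parameter is below 1/2. The series and the threshold are made up, and the function names are my own.

```python
from statistics import mean, variance


def pot_exceedances(series, threshold):
    """Amounts by which values exceed the threshold (the 'peaks over threshold')."""
    return [x - threshold for x in series if x > threshold]


def fit_gpd_moments(excesses):
    """Method-of-moments estimates of the GPD shape and scale.

    Uses mean = scale / (1 - shape) and
    variance = scale^2 / ((1 - shape)^2 * (1 - 2 * shape)).
    """
    m = mean(excesses)
    v = variance(excesses)
    ratio = m * m / v
    shape = 0.5 * (1.0 - ratio)
    scale = 0.5 * m * (1.0 + ratio)
    return shape, scale


# Made-up yearly 'percentage area in drought' values, for illustration only
area_pct = [0, 0, 12, 0, 3, 0, 0, 40, 0, 1, 0, 7, 0, 0, 0, 22, 0, 0, 5, 0]
threshold = 2

excesses = pot_exceedances(area_pct, threshold)
shape, scale = fit_gpd_moments(excesses)
print(f"{len(excesses)} exceedances, shape {shape:.3f}, scale {scale:.3f}")

# A negative shape implies a bounded upper tail at threshold - scale / shape
if shape < 0:
    print(f"implied upper bound: {threshold - scale / shape:.1f}")
```

A negative fitted shape would correspond to the bounded case discussed above, where the drought extent cannot exceed 100%. With so few exceedances the estimates are very unstable, so this shows only the mechanics, not a result.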