A previous post, “A New Temperature Reconstruction”, used random data with long-term persistence (LTP) to illustrate the circular reasoning behind the ‘hockey stick’ reconstruction of past temperatures. This post shows the potential for false positives arising from the validation statistics used in the ‘hockey stick’. The dynamic simulation below shows future temperatures predicted using a random fractional differencing algorithm that generates realistic LTP behavior. The future temperatures and validation statistics are recalculated each time the page is reloaded. Taken at face value, one unusual statistic used in MBH98 suggests that the future can be predicted using random numbers.

*Note: This is a first version of the application; it may contain errors and could be improved considerably. The code is freely available under the GPL in order to promote open science. See The Reference Frame for more information.*

The code is written in PHP and is available for download here.

Open the application by itself in a window here.

To embed this simulation in a web page, copy and paste the following html:

But you are probably thinking — how can random numbers be a prediction? The series is generated by fractional differencing, a way of integrating random fluctuations across multiple scales. Let me explain…
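A minimal Python sketch of fractional differencing follows. It is not the PHP code behind the simulation: the differencing parameter `d`, the function names, and the starting level are assumptions chosen for illustration. The weights are the standard expansion of (1 - B)^(-d), and the random shocks are uniform over a 1-degree range, the distribution mentioned in the discussion of this script.

```python
import random


def fracdiff_weights(d, n):
    """First n weights of (1 - B)^(-d): w[0] = 1, w[k] = w[k-1] * (k - 1 + d) / k."""
    weights = [1.0]
    for k in range(1, n):
        weights.append(weights[-1] * (k - 1 + d) / k)
    return weights


def simulate_ltp(n, d=0.4, start=0.0, seed=None):
    """Generate n points of fractionally integrated noise around a starting level."""
    rng = random.Random(seed)
    weights = fracdiff_weights(d, n)
    # Assumed shock distribution: uniform on [-0.5, 0.5], i.e. a 1-degree range.
    shocks = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    series = []
    for t in range(n):
        # Each point is a weighted sum of every shock up to time t,
        # so old fluctuations keep influencing later values.
        value = sum(weights[k] * shocks[t - k] for k in range(t + 1))
        series.append(start + value)
    return series


if __name__ == "__main__":
    print(simulate_ltp(11, d=0.4, start=0.3, seed=1))
```

With `d = 0` the weights collapse to white noise, and with `d = 1` every weight equals one, which is a random walk. Values between 0 and 0.5 give a stationary series whose weights decay slowly, which is what produces the persistence.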

The validation is based on the 11 points at the end of the temperature record not used in generating the simulated points. Two statistics were calculated and can be seen on the figure:

- The R2 statistic (the squared correlation coefficient) is very widely used to quantify the strength of association between two variables. A value around 0.1 would indicate at most a mild correlation; values closer to one indicate a strong, significant association.
- The RE reduction of error statistic is used in dendroclimatology and in the ‘hockey stick’ reconstruction of MBH98, where critical values greater than zero are claimed to indicate significance of the model. RE is claimed to be superior to the R2 statistic in WA06.
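The two statistics can be written down in a few lines of Python. This is a sketch using the textbook definitions, and the PHP script may compute them differently: R2 is taken as the squared Pearson correlation, and RE as one minus the ratio of the squared prediction error to the squared error of a reference forecast equal to the calibration-period mean. The function names are illustrative.

```python
def mean(values):
    return sum(values) / len(values)


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    mo, mp = mean(observed), mean(predicted)
    sxy = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    sxx = sum((o - mo) ** 2 for o in observed)
    syy = sum((p - mp) ** 2 for p in predicted)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation is undefined for a constant series; treated as zero here
    return sxy * sxy / (sxx * syy)


def reduction_of_error(observed, predicted, calibration_mean):
    """RE = 1 - SSE / SS_ref, where the reference forecast is the calibration mean."""
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_ref = sum((o - calibration_mean) ** 2 for o in observed)
    return 1.0 - sse / ss_ref


if __name__ == "__main__":
    observed = [1.0, 2.0, 3.0]
    print(reduction_of_error(observed, [0.0, 0.0, 0.0], 0.0))  # predicting the calibration mean
    print(r_squared(observed, [3.0, 2.0, 1.0]))
```

Predicting the calibration mean itself gives RE of exactly zero, which is why zero is the natural benchmark for "no better than the reference". Note that R2 ignores the level of the predictions entirely, while RE depends on it through both sums.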

Hit reload a few times to get a feel for the typical values of the statistics. The R2 statistic is usually close to zero, indicating that the prediction has no statistical skill over the validation period. The RE statistic, however, is always greater than zero, and often greater than 0.5.

MBH98 uses an RE benchmark of zero to indicate significance. The random numbers here give RE statistics greater than the critical value of zero. Therefore, using the RE statistic with a critical value of zero would attribute statistical skill to random numbers. That is, under criteria used in MBH98, random numbers could be regarded as skillful predictors of future temperatures.

This example illustrates (if the code is correct) a situation, similar to MBH98, where the R2 statistic correctly indicates no statistical skill in the predictions, but the RE statistic erroneously indicates statistical skill.
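The mechanism can be reproduced with a small Monte Carlo experiment. The sketch below is a simplification rather than a port of the script: it uses synthetic data, independent uniform noise instead of fractional differencing, and made-up values for the offset and noise width. Both the "observed" and the "predicted" validation values scatter around a level that sits above the calibration mean, and they are generated independently of each other.

```python
import random


def mean(values):
    return sum(values) / len(values)


def r_squared(observed, predicted):
    mo, mp = mean(observed), mean(predicted)
    sxy = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    sxx = sum((o - mo) ** 2 for o in observed)
    syy = sum((p - mp) ** 2 for p in predicted)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def reduction_of_error(observed, predicted, calibration_mean):
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_ref = sum((o - calibration_mean) ** 2 for o in observed)
    return 1.0 - sse / ss_ref


def run_trials(n_trials=2000, n_valid=11, offset=0.5, half_width=0.15, seed=42):
    """Return (fraction of trials with RE > 0, mean R2) for random predictions."""
    rng = random.Random(seed)
    calibration_mean = 0.0  # assumed: anomalies are relative to the calibration mean
    re_positive = 0
    r2_total = 0.0
    for _ in range(n_trials):
        # Synthetic validation data, warmer than the calibration mean.
        observed = [offset + rng.uniform(-half_width, half_width) for _ in range(n_valid)]
        # Random "predictions" at the same level, independent of the observations.
        predicted = [offset + rng.uniform(-half_width, half_width) for _ in range(n_valid)]
        if reduction_of_error(observed, predicted, calibration_mean) > 0:
            re_positive += 1
        r2_total += r_squared(observed, predicted)
    return re_positive / n_trials, r2_total / n_trials


if __name__ == "__main__":
    fraction, mean_r2 = run_trials()
    print("fraction of trials with RE > 0:", fraction)
    print("mean R2:", round(mean_r2, 3))
```

With these particular settings RE is positive in every trial, because the predictions are always closer to the observations than the calibration mean is, while R2 stays low because the year-to-year values are unrelated. The predictions get credit from RE for being at roughly the right level and for nothing else.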

Conclusions hinge on the choice of statistic and where you set the benchmark. MM05 obtain a critical value for RE of greater than 0.5 using random red-noise data in a replication of the procedure used in MBH98. Non-existent statistical skill of the models is one of the main arguments in MM05 against the reconstruction method in MBH98.
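Setting a benchmark by simulation can be sketched as follows: compute RE for many random "predictions" and take an upper percentile of the resulting scores as the critical value. This is not the MM05 procedure, which replicates the MBH98 method on red-noise series. The noise model, the `level` parameter, the percentile, and the sample data are all assumptions for illustration.

```python
import random


def reduction_of_error(observed, predicted, reference):
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_ref = sum((o - reference) ** 2 for o in observed)
    return 1.0 - sse / ss_ref


def empirical_quantile(values, q):
    """Simple order-statistic quantile: the sorted value at position int(q * n)."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]


def re_benchmark(observed, reference, level, n_trials=1000, q=0.99, seed=0):
    """Upper quantile of RE when the predictions are pure noise around `level`."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        # Hypothetical null model: uniform noise over a 1-degree range.
        noise = [level + rng.uniform(-0.5, 0.5) for _ in observed]
        scores.append(reduction_of_error(observed, noise, reference))
    return empirical_quantile(scores, q)


if __name__ == "__main__":
    # Synthetic validation anomalies, not real temperature data.
    observed = [0.4, 0.5, 0.6, 0.5, 0.4, 0.5, 0.6, 0.5, 0.4, 0.5, 0.6]
    critical = re_benchmark(observed, reference=0.0, level=0.5)
    print("RE needed to beat 99% of random predictions:", round(critical, 3))
```

A model's RE would then count as significant only if it exceeds this simulated critical value, rather than merely exceeding zero.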

### References

WA06 — Wahl, Eugene R. and Caspar M. Ammann, 2006 (under review). Robustness of the Mann, Bradley, Hughes Reconstruction of Surface Temperatures: Examination of Criticisms Based on the Nature and Processing of Proxy Climate Evidence.

MBH98 — Mann, M.E., Bradley, R.S. and Hughes, M.K., 1998. Global-Scale Temperature Patterns and Climate Forcing Over the Past Six Centuries, Nature, 392, 779-787.

MM05 — McIntyre, S., and R. McKitrick, 2005. Hockey sticks, principal components, and spurious significance, Geophys. Res. Lett., 32, L03710, doi:10.1029/2004GL021750.

Your img link lacks a final “/” before the last “>” in order to close the “img” tag.

DavidS: Fixed

Great illustration of the interaction between r2 and RE, and how spurious relationships can be easily formed.

RE heavily weights the mean if the mean is larger than the variance of the validation period. You can get RE consistently near 1.0 by offsetting the chart away from the mean in the validation period. If you do this in “nominal degrees C”, or just use a different reference for the anomaly (e.g. the first 40 years of the CRU data), you get completely different values for RE (but the same for R2) even though you are essentially looking at the same reconstruction.

It goes to show just looking at the RE value is almost meaningless, and underlines the importance of determining a proper benchmark (as per M&M).

The other thing I would say is this is a far better approach for assessing validation statistics – the user can try lots of “real world” tests – rather than the pathological cases illustrated in Rutherford et al, and repeated in W&A’s criticism of M&M.
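The dependence on the reference level is easy to check numerically. The sketch below assumes RE is measured against a fixed reference value (the calibration mean, or the zero of whatever anomaly scale is in use) and uses four made-up points in which the predictions are uncorrelated with the observations by construction.

```python
def mean(values):
    return sum(values) / len(values)


def r_squared(observed, predicted):
    mo, mp = mean(observed), mean(predicted)
    sxy = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    sxx = sum((o - mo) ** 2 for o in observed)
    syy = sum((p - mp) ** 2 for p in predicted)
    return sxy * sxy / (sxx * syy)


def reduction_of_error(observed, predicted, reference):
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_ref = sum((o - reference) ** 2 for o in observed)
    return 1.0 - sse / ss_ref


# Made-up validation values; the deviations from the means are orthogonal.
observed = [0.1, 0.3, 0.2, 0.4]
predicted = [0.3, 0.4, 0.1, 0.2]

if __name__ == "__main__":
    # Same data, three different reference levels for the RE denominator.
    for reference in (0.25, 0.0, -10.0):
        r2 = r_squared(observed, predicted)
        re = reduction_of_error(observed, predicted, reference)
        print(f"reference={reference:6.2f}  R2={r2:.3f}  RE={re:.4f}")
```

Here RE moves from -1 (reference at the validation mean) to about 0.67 (reference at zero) to nearly 1 (reference ten degrees away), while R2 is zero in every case.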

Thanks for the additional explanation of why RE is problematic, Spence. It is a good addition to the post. It goes to show that a test that consists only of matching a mean value is not really a test at all. I want to illustrate in this (and previous) posts that the issues raised by M&M regarding MBH98 are not ‘arcane statistical points’ as claimed at RealClimate, but are relatively simple issues.

Very nice script!

My somewhat complementary explanation is here:

http://motls.blogspot.com/2006/05/predict-your-climate.html

Nice script! Could you also add MBH-style “uncertainty limits” to the graph, they are 2*the standard error, i.e., 2*sqrt($Dxy), in your code 🙂

Re 6. You can get the expected mean and range from the fractional differencing parameters. The mean is the last real temperature point, and the range is plus or minus 0.5C, as the random part of the model is drawn from a uniform distribution over a 1C range. You don’t need to run it many times and calculate the limits from the variation in the individual trajectories. It is interesting to see just how much random variation the future climate might have looking forward, if the past is any indication. It shows how easy it is to read too much into year-to-year fluctuations. And if long-term persistence really exists, then the longer trends could also be manifestations of the same random variation, only on longer time scales.

Dave,

Is it possible to produce a script with the same sort of LTP that you produced with your original “reconstruction”?

So instead of a range of + or – 0.5 degrees it could go anywhere…

Hi John, sure, it could be made to take parameters that lead to a ‘random walk’. A random walk has the unusual characteristic that its variance grows without bound over time, so it could really ‘go anywhere’. When I rewrite it I will turn the various parts of the code into functions, and build in more flexibility.

The main purpose of this script, as some have guessed, was to illustrate a case of random series with low R2 and high RE values, demonstrating the flawed reasoning behind blind reliance on the RE statistic.

Why was my previous comment deleted?

“What is the RE score over the ~130 point calibration period, the period for which the mean is calculated?

If the calibration period is shortened to 79 points and validation increased from 11 to 58 points, as is done in MBH98, does then the model, in this case noise, give high RE scores over both periods?

Regards”

A brilliant idea, and timely.
