To follow up on the last post, I have calculated the RE as well as the R2 statistics for the reconstruction from the random series. The same approach was used, i.e. generate 1000 sequences with LTP, select those with positive slope and R2 > 0.1, calibrate on a linear model, and average. Here is the reconstruction again, with the test and training periods separated by a vertical dashed line (test period to the left, training period to the right of the temperature values):
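As a rough sketch (not the original code), the generate/select/calibrate/average procedure might look like the following in Python. The AR(1) stand-in for the LTP simulations, the seed, and the made-up target series are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def persistent_series(n, phi=0.95):
    # High-persistence AR(1) series as a stand-in for an LTP process
    # (illustration only; the post's simulations use true LTP).
    x = np.zeros(n)
    e = rng.normal(size=n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + e[t]
    return x

def r2(x, y):
    # Squared correlation between two series
    return np.corrcoef(x, y)[0, 1] ** 2

# Hypothetical "instrumental" target over the calibration window
n, ncal = 1000, 100
target = rng.normal(size=ncal).cumsum() / 10 + np.linspace(0, 1, ncal)

selected = []
for _ in range(1000):
    s = persistent_series(n)
    cal = s[-ncal:]                                   # overlap with the target
    slope = np.polyfit(np.arange(ncal), cal, 1)[0]
    if slope > 0 and r2(cal, target) > 0.1:           # screening step
        b, a = np.polyfit(cal, target, 1)             # linear calibration
        selected.append(a + b * s)

recon = np.mean(selected, axis=0)                     # average the survivors
```

The screening step is what makes the exercise work: persistent series correlate spuriously with a trending target often enough that the averaged survivors look like a plausible reconstruction.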
The table below adds the RE statistic to the R2 statistics reported previously.
| | R2 | RE |
|---|---|---|
| Training period, CRU~recon | 0.56 | 0.40 |
| Test period, CRU~recon | 0.002 | -0.35 |
| Training period, CRUgs~recon | 0.91 | 0.45 |
| Test period, CRUgs~recon | 0.44 | -39 |
An indication of skill in RE is usually some positive value – I'm not sure exactly what threshold applies here. The RE is positive if the model-predicted values are somewhat better predictions than the mean value. The RE statistic (in R) is as follows, where x are the actual values and y are the predicted values (RMDS03):
RE = 1 - sum((x - y)^2) / sum((x - mean(x))^2)
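For checking, here is a direct Python translation of that R expression (the function name and the toy inputs are mine):

```python
import numpy as np

def reduction_of_error(x, y):
    # RE = 1 - sum((x - y)^2) / sum((x - mean(x))^2)
    # x: actual values, y: predicted values (per RMDS03)
    x, y = np.asarray(x, float), np.asarray(y, float)
    return 1 - np.sum((x - y) ** 2) / np.sum((x - np.mean(x)) ** 2)

# RE = 1 for perfect predictions; RE = 0 when predictions
# do no better than simply guessing the mean of the actuals.
print(reduction_of_error([1, 2, 3], [1, 2, 3]))  # 1.0
print(reduction_of_error([1, 2, 3], [2, 2, 2]))  # 0.0
```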
Unlike the R2 statistic, the RE penalizes the predicted values for deviation from the mean value. This is why over the test period the RE is negative while the R2 appears significant. The dynamics of the simulation produce a 'hook' in the reconstruction lower than the filtered measured temperature values. The similarity in shape produces a significant R2 value, while the difference in location produces negative RE values.
The RE results are positive for comparisons of the reconstruction over the training data, and negative for comparisons over the test data. In this case, both the RE and R2 statistics on the training data would validate the skill of the reconstruction, while on the test data, R2 would validate the reconstruction and RE would not.
This shows the difficulty in determining skill, as plausible results and viable statistics can be achieved with random sequences. In what is essentially an extrapolation exercise, there is no objective way to test the validity of the extrapolation outside the range of the calibration period. You could mess around with different approaches ad infinitum, but ultimately you are not going to create confidence in an extrapolation without a severe test (in the sense used by Popper). Similarly, greater confidence will not be achieved by multi-proxy studies that combine proxies of uncertain value.
Comparison of different types of proxies is closer to such a severe test. According to the review by Soon and Baliunas, a range of different types of proxies provide evidence of a LIA, a MWP, and 20th century temperatures that are not the warmest of the millennium (SB03). SB03, based on multiple types of proxy, starkly contradicts the claims made in MBH98, based on internal statistics. MBH98 is the odd man out in the SB03 test. The approach of SB03 is mindful of the great uncertainties in reconstruction of past temperatures. Excessive confidence in establishing statistical skill via internal testing is naive. Still more severe and quantitative comparative tests of reconstructions than were attempted by SB03 could be made, to see which proxies indeed pass muster and which don't.
RMDS03 – S. Rutherford, M. E. Mann, T. L. Delworth, and R. J. Stouffer, Climate Field Reconstruction under Stationary and Nonstationary Forcing, Journal of Climate, 16:462-479, 2003.
SB03 – W. Soon and S. Baliunas, Proxy climatic and environmental changes of the past 1000 years, Climate Research, 23:89-110, 2003.