The coefficient of determination, or *r*^{2}, is the ratio of explained variation to total variation of two variables X and Y. It ranges from 0 to 1 (the correlation coefficient *r* ranges from −1 to 1), where a value of 1 indicates an exact linear relationship, with all data points lying on the same line, and a value of 0 indicates no linear relationship between the variables. If *r*^{2} is high, most people assume X and Y are related. Possibly, if the errors are “normal”, or independent and identically distributed. But *r*^{2} may also be high when X is not related to Y at all, because natural series can show “spurious significance”.
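As an illustration of the trap, here is a minimal simulation sketch of two *independent* random walks that can nonetheless show a large *r*^{2} (the seed and series length are my own illustrative choices, not anything from the articles below):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two independent random walks: by construction, X tells us nothing about Y.
x = np.cumsum(rng.standard_normal(n))
y = np.cumsum(rng.standard_normal(n))

# r^2 from the ordinary sample correlation of the two series.
r = np.corrcoef(x, y)[0, 1]
print(f"r^2 between independent random walks: {r**2:.3f}")
```

Re-running with different seeds shows that persistent series frequently yield *r*^{2} values far above what independent white-noise series would give, which is exactly the spurious-significance problem discussed below.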

To explain “spurious significance”, it is hard to do better than this set of articles on high-correlation statistics from Steve McIntyre at ClimateAudit.

Spurious Significance #1

“Spurious significance” is a term in statistics used to describe a situation where a statistic returns a value which is “statistically significant” when it is impossible that there is any significance. It’s a topic that sounds easy but quickly gets difficult.

Spurious Significance #2 : Granger and Newbold 1974

This is the second of a planned series of notes on spurious significance, to give a sense of the statistical background. Granger and Newbold [1974], posted up here, is an extremely famous article that starts off the modern discussion of the problem of spurious regression. Granger is a recent Nobel laureate in economics.

Spurious Significance #3: Some DW Statistics

Granger and Newbold [1974] provided examples of spurious significance in a random walk context. This has been extended by various authors to a number of other persistent processes.

Spurious Significance #4: Phillips [1986]

One of the reasons for discussing Granger-Newbold and Phillips is to show their approach to “spurious” regression statistics (here t-statistics and F-statistics) and why other statistics need to be considered to ensure that there is no mis-specification in the model.

Spurious #5: Variance of Autocorrelated Processes

What is the standard deviation (variance) of an autocorrelated series? It sounds like an easy question, but it isn’t. This issue turns out to affect the spurious regression problem.

David,

Your “Spurious #5: Variance of Autocorrelated Process” is a reference that one should read only after a good grounding in the classical linear model of regression and the violation of that model’s zero-serial-correlation assumption by an autoregressive error term. Percival, the author of “Three Curious Properties of the Sample Variance and Autocovariance for Stationary Processes with Unknown Mean,” is making a big deal about some things that are seldom a problem.

His biggest point is that the variance of a subsegment of a series is a serious underestimate of the process variance. Yes, that is true, but we only have subsegments, i.e. finite N. If N is small relative to 1/(1-rho^2) (rho being the true autocorrelation coefficient), then there are real problems. Percival’s example has 1/(1-rho^2) = 167 and N = 10, hence the problems he shows, because N is 6% of 1/(1-rho^2). When that ratio is, say, over 10, there isn’t the extreme degree of underestimation. With the Percival rho of 0.997, using an N of 1670 will leave an underestimate of about 20-30%.
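The underestimation this comment describes can be checked with a small simulation of an AR(1) process at Percival’s rho of 0.997 with N = 10; the innovation variance and number of replications below are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(42)
rho, n_reps, N = 0.997, 2000, 10

# Process variance of an AR(1) with unit innovation variance: 1/(1 - rho^2) ~= 167.
true_var = 1.0 / (1.0 - rho**2)

# Average the sample variance over many independent short segments of length N.
sample_vars = []
for _ in range(n_reps):
    x = np.empty(N)
    # Start from the stationary distribution so each segment is representative.
    x[0] = rng.standard_normal() * np.sqrt(true_var)
    for t in range(1, N):
        x[t] = rho * x[t - 1] + rng.standard_normal()
    sample_vars.append(np.var(x, ddof=1))

mean_sample_var = np.mean(sample_vars)
print(f"true variance ~= {true_var:.0f}, mean sample variance ~= {mean_sample_var:.1f}")
```

With rho this close to 1, a segment of 10 points barely moves away from its starting level, so the within-segment variance is a tiny fraction of the process variance, which is the effect Percival highlights.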

Further, if the autocorrelation really was 0.997, then the series would show unit roots, which means either the researcher should be working with first differences or with a cointegrated model (if one can be found).

Finally, Percival “cheats” a bit in his example of a process X(t) = b*t. If that were the process, it would be easily seen; furthermore, it is not stationary, so the ACF does not have its expected meaning. Thus the example is making a point about the estimated ACF, but the example itself has no real practical purpose.

My criticisms of Percival should not be taken to mean that I don’t think the article is worth reading. Rather, it should not be read by people who don’t have a bit more than a basic understanding of time series models.

Hi Martin, thanks for that. Referring to Steve’s post excerpted from Percival, I think it is a very clear cautionary tale of exactly the problem, small N and high rho series, which is what it was trying to be. Rhos this high are seen in temperature series, so it’s not a moot point. The remedies you suggest, while things that should be done, are often not done, and that is the point of the article. When I speak to ecologists about autocorrelation, I don’t think they realize the magnitude of error possible. And N = 10 is not uncommon. Though the specifics of each situation are different and need to be looked at case by case.

David,

I shouldn’t be saying this, at least in the sense that I am supposed to be an econometrician, but there are some quick and dirty ways to look at the magnitude of the serial correlation. Simply drop a lag (Y(t-1), Y(t-2), …, where Y is the dependent variable) into the equation and look at the effect on the coefficients of the X’s. If the effect is significant (relative to the standard errors in the equation with the lag structure), then get serious about the serial correlation (Cochrane-Orcutt or Prais-Winsten for first-order autocorrelation, then move to second, etc., or just go to an ARMA estimating routine — R has them although I don’t know how well they work — and remember that the equation now includes an implicit lag structure).
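This quick-and-dirty lag check can be sketched in a few lines; the data below are synthetic (Y depends on X with AR(1) errors, and the coefficients and rho are my own illustrative choices), using plain least squares rather than an econometrics package:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Hypothetical data: Y = 2 + 1.5*X + e, where e follows an AR(1) with rho = 0.8.
x = rng.standard_normal(n)
e = np.empty(n)
e[0] = rng.standard_normal()
for t in range(1, n):
    e[t] = 0.8 * e[t - 1] + rng.standard_normal()
y = 2.0 + 1.5 * x + e

# OLS of Y on X alone.
A1 = np.column_stack([np.ones(n), x])
b1, *_ = np.linalg.lstsq(A1, y, rcond=None)

# OLS of Y on X plus one lag of Y (drop the first observation).
A2 = np.column_stack([np.ones(n - 1), x[1:], y[:-1]])
b2, *_ = np.linalg.lstsq(A2, y[1:], rcond=None)

print("beta_x without lag:", round(b1[1], 3))
print("beta_x with lagged Y:", round(b2[1], 3))
print("coefficient on Y(t-1):", round(b2[2], 3))
```

A sizeable coefficient on Y(t-1) is the signal the comment describes: it tells you the residual dynamics matter and a serial-correlation correction (or explicit lag structure) is worth pursuing.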

Steve McIntyre’s point about the Durbin-Watson statistic being cause for rejection of a regression result is not really correct. First, remember that OLS estimates are unbiased even with serial correlation (a point many of his readers apparently never understood). The issue is one of efficiency (size of variance), hence check against a model with a lag structure (or maybe better, just estimate the first-order correlation from the residuals and see if it scares you). Second, in many cases for non-time-series data, regressions produce horrible D-W statistics. Just ignore them. For time series, of course, the D-W (or even better, the Breusch-Godfrey serial correlation Lagrange multiplier test or ARCH version) should be used, but with a small sample size, such as 10, one shouldn’t be using time series. Yes, I have done it, but I was living in sin at the time, although given what I have seen in climate reconstructions, I am beginning to think I might have been a saint. 🙂
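For readers who want to compute the D-W statistic and the implied first-order correlation from residuals directly, here is a minimal sketch (the AR(1) residual series is simulated purely for illustration):

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic: values near 2 suggest no first-order serial
    correlation; values near 0 suggest strong positive autocorrelation."""
    resid = np.asarray(resid)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Hypothetical residuals following an AR(1) with rho = 0.9.
rng = np.random.default_rng(7)
n = 500
e = np.empty(n)
e[0] = rng.standard_normal()
for t in range(1, n):
    e[t] = 0.9 * e[t - 1] + rng.standard_normal()

dw = durbin_watson(e)
# Rough first-order correlation implied by D-W: rho_hat ~= 1 - DW/2.
rho_hat = 1.0 - dw / 2.0
print(f"DW = {dw:.2f}, implied rho ~= {rho_hat:.2f}")
```

The “estimate the first-order correlation from the residuals and see if it scares you” step amounts to reading off rho_hat: here, strongly autocorrelated residuals push D-W well below 2.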
