When r2 Regression is Very High

The coefficient of determination, or r2, is the ratio of explained variation to total variation in a regression of Y on X. It ranges from 0 to 1: a value of 1 indicates an exact linear relationship, with all data points lying on the same line, while a value of 0 indicates no linear relationship between the variables. If r2 is high, most people assume X and Y are related. That is reasonable if the errors are “normal”, i.e. independent and identically distributed. But r2 may also be high when X is not related to Y, because natural series are prone to “spurious significance”.
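
As a quick illustration (a hypothetical pure-Python sketch of my own, not taken from the articles below): regress one random walk on another and r2 frequently comes out large, even though the two series are generated independently. This is the Granger-Newbold effect discussed in the posts that follow.

```python
import random

random.seed(1)

def random_walk(n):
    """Cumulative sum of i.i.d. N(0, 1) steps: a non-stationary, persistent series."""
    level, walk = 0.0, []
    for _ in range(n):
        level += random.gauss(0, 1)
        walk.append(level)
    return walk

def r_squared(x, y):
    """Coefficient of determination for a simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

# Two independent walks share no causal link, yet r2 is often large.
x, y = random_walk(500), random_walk(500)
print(round(r_squared(x, y), 3))
```

Re-running with fresh seeds scatters r2 across the whole 0-1 range far more generously than independence would suggest.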

To explain “spurious significance”, it is hard to do better than this set of articles on high correlation statistics from Steve McIntyre at ClimateAudit.

Spurious Significance #1

“Spurious significance” is a term used in statistics to describe a situation in which a statistic returns a value that is “statistically significant” even though it is impossible that there is any real significance. It is a topic that sounds easy but quickly gets difficult.

Spurious Significance #2 : Granger and Newbold 1974

This is the second of a planned series of notes on spurious significance, to give a sense of the statistical background. Granger and Newbold [1974], posted up here, is an extremely famous article, which starts off the modern discussion of the problem of spurious regression. Granger is a recent Nobel laureate in economics.

Spurious Significance #3: Some DW Statistics

Granger and Newbold [1974] provided examples of spurious significance in a random walk context. This has been extended by various authors to a number of other persistent processes.
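
One diagnostic that features in these posts is the Durbin-Watson statistic, DW = sum((e_t - e_{t-1})^2) / sum(e_t^2), computed on the regression residuals. The sketch below is my own illustration (not Granger and Newbold's code): DW near 2 indicates no first-order autocorrelation, while the persistent residuals of a spurious random-walk regression drive it toward 0.

```python
import random

random.seed(42)

def random_walk(n):
    """Cumulative sum of i.i.d. N(0, 1) steps."""
    level, walk = 0.0, []
    for _ in range(n):
        level += random.gauss(0, 1)
        walk.append(level)
    return walk

def ols_residuals(x, y):
    """Residuals from regressing y on x with an intercept, by ordinary least squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    alpha = my - beta * mx
    return [b - (alpha + beta * a) for a, b in zip(x, y)]

def durbin_watson(e):
    """DW statistic: ranges from 0 to 4, with 2 meaning no first-order serial correlation."""
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    return num / sum(v * v for v in e)

resid = ols_residuals(random_walk(1000), random_walk(1000))
print(round(durbin_watson(resid), 2))  # persistent residuals pull DW well below 2
```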

Spurious Significance #4: Phillips [1986]

One of the reasons for discussing Granger-Newbold and Phillips is to show their approach to “spurious” regression statistics (here t-statistics and F-statistics) and why other statistics need to be considered to ensure that there is no mis-specification in the model.

Spurious #5: Variance of Autocorrelated Processes

What is the standard deviation (or variance) of an autocorrelated series? It sounds like an easy question, but it isn’t. This issue turns out to affect the spurious regression problem.
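
A rough, hypothetical simulation of the point (mine, not from the post, and using rho = 0.9 rather than the more extreme values discussed in the comments, so it runs quickly): for an AR(1) process with unit-variance innovations the stationary variance is 1/(1 - rho^2), but the sample variance of a short segment falls well short of it on average.

```python
import random

random.seed(7)

def ar1(n, rho, burn=500):
    """AR(1) series x_t = rho * x_{t-1} + e_t with N(0, 1) innovations.
    The burn-in discards the transient from the zero starting value."""
    x, out = 0.0, []
    for t in range(burn + n):
        x = rho * x + random.gauss(0, 1)
        if t >= burn:
            out.append(x)
    return out

def sample_var(xs):
    """Sample variance with the usual N - 1 divisor (unbiased only for i.i.d. data)."""
    m = sum(xs) / len(xs)
    return sum((v - m) ** 2 for v in xs) / (len(xs) - 1)

rho, n_seg = 0.9, 10
true_var = 1.0 / (1.0 - rho ** 2)  # stationary variance of the AR(1) process
estimates = [sample_var(ar1(n_seg, rho)) for _ in range(2000)]
ratio = sum(estimates) / len(estimates) / true_var
print(round(ratio, 2))  # well below 1: short autocorrelated segments underestimate
```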


Comments on “When r2 Regression is Very High”

  1. David,

    Your “Spurious #5: Variance of Autocorrelated Processes” is a reference that one should read only after a good grounding in the Classical Linear Model of regression and the violation of that model’s zero serial correlation assumption by an autoregressive error term. Percival, the author of “Three Curious Properties of the Sample Variance and Autocovariance for Stationary Processes with Unknown Mean,” is making a big deal about some things that are seldom a problem.

    His biggest point is that the sample variance of a subsegment of a series is a serious underestimate of the process variance. Yes, that is true, but we only have subsegments, i.e. finite N. If N is small relative to 1/(1-rho^2) (rho being the true autocorrelation coefficient), then there are real problems. Percival’s example has 1/(1-rho^2) = 167 and N = 10, hence the problems he shows, because N is only 6% of 1/(1-rho^2). When that ratio is, say, over 10, there isn’t the same extreme degree of underestimation. With the Percival rho of 0.997, using an N of 1670 would leave an underestimate of about 20-30%.

    Further, if the autocorrelation really were 0.997, then the series would be close to a unit root, which means the researcher should be working either with first differences or with a cointegrated model (if one can be found).

    Finally, Percival “cheats” a bit in his example of a process X(t) = b*t. If that were the process, it would be easily seen, and furthermore it is not stationary, so the ACF does not have its expected meaning. Thus the example makes a point about the estimated ACF, but the example itself has no real practical purpose.

    My criticisms of Percival should not be taken to mean that I don’t think the article is worth reading. Rather, it should not be read by people who have only a basic understanding of time series models.

  3. Hi Martin, thanks for that. Referring to Steve’s post excerpted from Percival, I think it is a very clear cautionary tale about exactly the problem, small-N and high-rho series, which is what it was trying to be. Rhos this high are seen in temperature series, so it’s not a moot point. The remedies you suggest, while things that should be done, are often not done, and that is the point of the article. When I speak to ecologists about autocorrelation, I don’t think they realize the magnitude of the error possible. And N = 10 is not uncommon. Though the specifics of each situation are different and need to be looked at case by case.

  5. David,
    I shouldn’t be saying this, at least in the sense that I am supposed to be an econometrician, but there are some quick and dirty ways to look at the magnitude of the serial correlation. Simply drop a lag (Y(t-1), Y(t-2), …, where Y is the dependent variable) into the equation and look at the effect on the coefficients of the X’s. If the effect is significant (relative to the standard errors in the equation with the lag structure), then get serious about the serial correlation: Cochrane-Orcutt or Prais-Winsten for first-order autocorrelation, then move to second order, etc., or just go to an ARMA estimating routine (R has them, although I don’t know how well they work), and remember that the equation now includes an implicit lag structure.
    Steve McIntyre’s point about the Durbin-Watson statistic being cause for rejection of a regression result is not really correct. First, remember that OLS estimates are unbiased even with serial correlation (a point many of his readers apparently never understood). The issue is one of efficiency (size of variance), hence the check against a model with a lag structure (or maybe better, just estimate the first-order correlation from the residuals and see if it scares you). Second, in many cases for non-time-series data, regressions produce horrible D-W statistics. Just ignore them. For time series, of course, the D-W (or, even better, the Breusch-Godfrey serial correlation Lagrange multiplier test or its ARCH version) should be used, but with a small sample size, such as 10, one shouldn’t be using time series at all. Yes, I have done it, but I was living in sin at the time, although given what I have seen in climate reconstructions, I am beginning to think I might have been a saint. 🙂
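
Martin's quick checks can be sketched in code. This is a hypothetical pure-Python illustration (the function names and the simulated data are mine, not an existing library's API): estimate the first-order correlation of the OLS residuals, and if it scares you, apply one Cochrane-Orcutt style quasi-differencing step before re-estimating.

```python
import random

random.seed(3)

def ols(x, y):
    """Intercept and slope of y on x by ordinary least squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - beta * mx, beta

def first_order_rho(e):
    """Crude first-order serial correlation of residuals: the 'does it scare you' check."""
    return sum(e[t] * e[t - 1] for t in range(1, len(e))) / sum(v * v for v in e)

def quasi_difference(z, rho):
    """Cochrane-Orcutt transform z*_t = z_t - rho * z_{t-1}; the first observation is dropped."""
    return [z[t] - rho * z[t - 1] for t in range(1, len(z))]

# Simulated data: y = 2 + 0.5 * x with strongly autocorrelated AR(1) errors.
n, rho_true = 200, 0.8
x = [random.gauss(0, 1) for _ in range(n)]
u, errors = 0.0, []
for _ in range(n):
    u = rho_true * u + random.gauss(0, 1)
    errors.append(u)
y = [2.0 + 0.5 * a + b for a, b in zip(x, errors)]

alpha, beta = ols(x, y)
resid = [b - (alpha + beta * a) for a, b in zip(x, y)]
rho_hat = first_order_rho(resid)  # well above zero here, i.e. scary
x_star, y_star = quasi_difference(x, rho_hat), quasi_difference(y, rho_hat)
alpha_star, beta_star = ols(x_star, y_star)
# beta_star re-estimates the slope; the transformed intercept estimates alpha * (1 - rho).
print(round(rho_hat, 2), round(beta_star, 2))
```

One could iterate, re-estimating rho from the new residuals until rho_hat settles, which is the full Cochrane-Orcutt procedure; Prais-Winsten additionally rescales the first observation instead of dropping it.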
