Linear Regression Example

Here is an example of doing statistics in R, illustrating a pitfall of simple linear regression, using the satellite record of global temperatures from 1979 to 2008. I have never seen the problem of ‘spurious compensation’ mentioned by name, but it is a common hazard when trying to develop a model from a set of additive factors.

The linear regression below ‘invents’ large positive and negative factors that, when added together, produce a small net result that fits the data very well. The factors themselves are pronounced ‘highly significant’. Could ‘spurious compensation’ explain the high positive forcing from CO2 and high negative forcing from aerosols found in some attribution studies?
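
As a rough illustration of the effect with synthetic data (not the satellite series; the variable names and noise levels are my own arbitrary choices), two nearly collinear predictors are enough to produce large, mutually cancelling coefficients:

set.seed(1)
x1 <- 1:100
x2 <- x1 + rnorm(100, sd = 0.5)   # almost the same predictor as x1
y  <- 0.01 * x1 + rnorm(100)      # the 'true' model uses only x1, with a small coefficient
coef(lm(y ~ x1 + x2))             # the fitted coefficients are typically much larger than 0.01
                                  # and of opposite sign, though their sum stays near 0.01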

Below I have fit the lower troposphere temperature to a linear regression on the linear, squared, and cubic powers of time:

y = a + bx + cx^2 + dx^3

The resulting curve is the red line in the figure below. Nice fit. However, each of the component terms is shown as a colored line. The linear and cubic terms are dropping faster than mortgage loan companies Fannie Mae and Freddie Mac. The squared term is rising faster than female viagra. Who would like to ‘attribute’ a physical factor to each of these components?

The summary listing indicates this is a model with high significance, overall and in its components.

> source("script.R")

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.333e-02 3.634e-02 -0.367 0.714
x -4.265e-02 1.059e-02 -4.027 6.92e-05 ***
I(x^2) 4.789e-03 8.290e-04 5.777 1.68e-08 ***
I(x^3) -1.042e-04 1.837e-05 -5.671 2.98e-08 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1693 on 351 degrees of freedom
Multiple R-Squared: 0.4559, Adjusted R-squared: 0.4512
F-statistic: 98.03 on 3 and 351 DF, p-value: < 2.2e-16
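
One way to see why the terms can compensate like this is to check how correlated the polynomial regressors themselves are. A minimal check, assuming the same time axis as the fit above (355 monthly values, so about 30 years):

x <- (1:355)/12                          # 355 monthly anomalies (351 residual df + 4 coefficients)
X <- cbind(x = x, x2 = x^2, x3 = x^3)    # the three polynomial regressors
round(cor(X), 3)                         # pairwise correlations come out around 0.92-0.99

With regressors this similar, least squares can trade a large positive coefficient on one term against a large negative coefficient on another with almost no change in the fitted curve, which is why the individual ‘highly significant’ terms should not be attributed to anything physical.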

The code for generating this plot is below. Uncomment the readRSS() line for the first run, then comment it out again so the data is not re-downloaded from the remote site each time.

# URL of the RSS lower-troposphere (TLT) monthly anomaly series
TLT <- "ftp://ftp.ssmi.com/msu/monthly_time_series/rss_monthly_msu_amsu_channel_tlt_anomalies_land_and_ocean_v03_1.txt"
# readRSS() is presumably defined in script.R; uncomment for the first run to download the data
#tlt <- readRSS(file=TLT)

do <- function(Y) {
  x <- (1:length(Y))/12                         # time in years from the start of the series
  Y1 <- as.vector(Y)
  l <- lm(Y1 ~ x + I(x^2) + I(x^3))             # cubic polynomial fit, linear in the coefficients
  plot(x, Y1, type="l", ylim=c(-1,1), ylab="Anomaly DegC", xlab="Years")
  lines(x, predict(l), col="red")               # fitted curve
  lines(x, l$coefficients[2]*x,   col="blue")   # linear component
  lines(x, l$coefficients[3]*x^2, col="green")  # squared component
  lines(x, l$coefficients[4]*x^3, col="orange") # cubic component
  print(summary(l))
  l
}
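
A hypothetical usage, assuming readRSS() (from script.R) returns the monthly anomaly series in a form that do() can take directly:

tlt <- readRSS(file=TLT)   # first run only; comment out afterwards
fit <- do(tlt)             # draws the plot and returns the fitted lm object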

Update: Here are the results for the fifth- and sixth-order polynomial regressions of the tropospheric temperatures. Note that the coefficients all have very low significance in the fifth-order version, while they have mild significance in the sixth-order version.
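
These fits can be produced with a minimal variation of do() above (assuming the same Y1 and x); the formulas match the Call lines in the listings that follow:

l5 <- lm(Y1 ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5))            # fifth-order fit
l6 <- lm(Y1 ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6))   # sixth-order fit
summary(l5)
summary(l6)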

Call:
lm(formula = Y1 ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5))

Residuals:
Min 1Q Median 3Q Max
-0.392083 -0.122727 -0.007861 0.092455 0.721341

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.179e-02 5.482e-02 -0.580 0.562
x -4.572e-02 3.715e-02 -1.231 0.219
I(x^2) 9.152e-03 7.734e-03 1.183 0.237
I(x^3) -7.418e-04 6.599e-04 -1.124 0.262
I(x^4) 3.144e-05 2.451e-05 1.283 0.200
I(x^5) -5.024e-07 3.288e-07 -1.528 0.127

Residual standard error: 0.1679 on 349 degrees of freedom
Multiple R-Squared: 0.4681, Adjusted R-squared: 0.4605
F-statistic: 61.43 on 5 and 349 DF, p-value: < 2.2e-16

Call:
lm(formula = Y1 ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6))

Residuals:
Min 1Q Median 3Q Max
-0.384645 -0.121876 -0.006046 0.093070 0.731062

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.067e-01 6.420e-02 -1.662 0.0975 .
x 5.739e-02 5.954e-02 0.964 0.3357
I(x^2) -2.531e-02 1.740e-02 -1.455 0.1465
I(x^3) 3.889e-03 2.197e-03 1.770 0.0776 .
I(x^4) -2.608e-04 1.345e-04 -1.939 0.0534 .
I(x^5) 8.160e-06 3.935e-06 2.073 0.0389 *
I(x^6) -9.733e-08 4.406e-08 -2.209 0.0278 *

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.167 on 348 degrees of freedom
Multiple R-Squared: 0.4755, Adjusted R-squared: 0.4664
F-statistic: 52.57 on 6 and 348 DF, p-value: < 2.2e-16

Finally, here is the result for the second-order regression. Note that the squared term has a small coefficient and is not significant. The linear term has a coefficient of about 0.012 degC per year, or +0.12 degC per decade.

Call:
lm(formula = Y1 ~ x + I(x^2))

Residuals:
Min 1Q Median 3Q Max
-0.431278 -0.118049 -0.007253 0.107826 0.753653

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.1504715 0.0282913 -5.319 1.86e-07 ***
x 0.0124321 0.0044039 2.823 0.00503 **
I(x^2) 0.0001537 0.0001438 1.069 0.28560

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1767 on 352 degrees of freedom
Multiple R-Squared: 0.406, Adjusted R-squared: 0.4027
F-statistic: 120.3 on 2 and 352 DF, p-value: < 2.2e-16

Without the squared term, the linear trend is +0.17 degC per decade.

Call:
lm(formula = Y1 ~ x)

Residuals:
Min 1Q Median 3Q Max
-0.418624 -0.119757 -0.006886 0.108515 0.745554

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.173087 0.018798 -9.208 <2e-16 ***
x 0.016993 0.001098 15.472 <2e-16 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1767 on 353 degrees of freedom
Multiple R-Squared: 0.4041, Adjusted R-squared: 0.4024
F-statistic: 239.4 on 1 and 353 DF, p-value: < 2.2e-16
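
A more conventional way to decide how many powers are worth keeping is to compare the nested fits directly with sequential F-tests. A brief sketch, again assuming Y1 and x as constructed inside do():

l1 <- lm(Y1 ~ x)
l2 <- lm(Y1 ~ x + I(x^2))
l3 <- lm(Y1 ~ x + I(x^2) + I(x^3))
anova(l1, l2, l3)    # tests whether each added power significantly improves the fit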

Comments

  1. “Linear” regression with a formula y = a + bx + cx^2 + dx^3?

    I was taught that “linear” relationships had to have a formula based on y = mx + c. Surely you mean “cubic” regression, or am I out of date?

  2. Musing aloud

    Suppose you took this or a much longer series (to increase the probability of repeat patterns of temp) and worked on a daily basis to get ever more numbers. Like Melbourne 1860-now.

    Then, you do the above exercise increasing the powers to quadratic, 5th, 6th, etc. Do you think that you would head off to needing a Guide to the Galaxy, or do you think that some of the components would start to come back towards the curve? I know there are more conventional ways to attack this, but I’ve never seen the exercise done.

  3. Patrick: Ah ha, I am glad you asked. No, they are called linear regressions because they are linear in the terms, as opposed to non-linear forms like y = a + bx^z. The formula you show is a ‘simple’ linear regression.

  4. David, I have to agree with Patrick there; it’s cubic regression in x. It’s true that it is linear in each of the three variables x, x^2 and x^3, but these aren’t independent.

    As you add powers, as Geoff suggests, you run into the difficulty that they all start to behave rather similarly, so it is difficult to discriminate the effects of each one. That is why people recommend fitting orthogonal polynomials (eg Chebyshev). You still just get a polynomial, but the coefficients are more meaningful.

  5. David,
    On re-reading your post, I can’t see the point of it at all. When you fit a cubic, the only claim is that the cubic fits. No-one expects the terms added to make the cubic polynomial to fit on their own. What do you expect them to do?

  6. Nick, It’s an interesting pathological example. I don’t know about ‘no-one’. The same thing can happen when the terms are not higher order. The point is that just because the significances are all high in linear regression, the individual terms cannot reliably be ‘attributed’ to anything.

    People attribute the terms of multiple regressions all the time, interpreting the significance of the variables as meaning something. The conditions under which the terms of a regression can be attributed to something are more strict.

  7. David,
    There’s nothing pathological about it. I repeat my question – how do you think the components should behave?
    If you move the origin to the centre point, you’ll get components with much less apparent variation (but exactly the same cubic). If you move it to, say, 50 years, you’ll get components looking much less like the final curve (but again, the same cubic).

  8. Well, pathological in the sense that if you were expecting the multiple regression to return sensible values for the coefficients, eg. a small positive linear term and very small, non-significant square and cubic terms, you would be disappointed.

    You would be misled if you were to conclude from this fit that there were a couple of strong exponential factors ‘forcing’ global temperatures.

    Obviously the linear regression gives the ‘right’ answer subject to its assumptions. The point is that the results it produces are not always meaningful and can be misleading.

  9. David,
    The lack of independence of the powers leads to apparently high significance which may be misleading. When you quote the coefficient d as -1.042e-04 with a std error of 1.837e-05, you test the significance of varying d keeping other coefficients fixed. But if you tested the significance of d when varied letting other coefficients take their corresponding fitted values, it would be much less. That is one reason why using orthogonal polynomials is better.

    Again, as a check, try doing the analysis just subtracting 15 from all the x values.

  10. My reasoning in post 2 is that the component curves disappear off the graph with potentially increasing speed as the power is raised. Thus, the composite final involves the subtraction of large numbers (which as an aside raises the question of significant figures in the calculations). I merely wished to see if there was a power, however high within reason, that had a very small component or coefficient and so did not streak off towards infinity so fast. Then I would wonder again if the components did have some significance other than numeric, because of the result being unusual and seeking an explanation.

  11. The quote from Wikipedia means that if the expression is not linear in x then there is not a linear regression in x. How can you not understand that? It seems pretty simple to me.

    Why not call it a cubic regression? Or do you think that to be a cubic regression in x you would need cubic variables in the coefficients of x?

  12. Patrick, polynomials of the form above are referred to as linear regressions in statistics, despite being not linear in x. The reason, I think, is that the solution is via basic linear algebra, inversion of a matrix. When the equation is non-linear then linear algebra won’t work.

  13. David,
    I ran your R script, just renumbering the years so year 15 was set as year 0. The table of coefficients printed out is now:

    Estimate Std. Error t value Pr(>|t|)
    (Intercept) 7.288e-02 1.349e-02 5.404 1.21e-07 ***
    x 3.071e-02 2.630e-03 11.676 <2e-16 ***
    I(x^2) 1.017e-04 1.381e-04 0.736 0.462
    I(x^3) -1.042e-04 1.837e-05 -5.671 2.98e-08 ***

    The graph of components looks quite different; you can see it here

    These results shouldn’t change just by choosing a different year to call zero. The reason is that, as I said above, the coefficients are highly correlated, and you should test the correlation matrix for significance. Instead, you are just testing the diagonal components separately.

    Incidentally, although there is justification for calling this regression linear in the coefficients, the term “cubic regression” gets over 9500 hits on Google. And this is what they mean.

  14. Re 14 Admin,

    Ta. That shows graphically that the components are not likely to have physical meaning, which was your first main point. I note the better fit of the composite at higher-power equations (visible in the last 5 years), mildly supported by the stats. But the better fit is probably not worth the chase in this example.

    Do you have an opinion that is repeatable here on your original comment –

    “Could ’spurious compensation’ explain high positive forcing from CO2 and high negative forcing from aerosols found in some attribution studies?”

  15. David,
    Here’s the result of using orthogonal (Legendre) polynomials, which gives the performance that you want. The fitted curve is exactly the same, but it is expressed as components of the orthogonal polynomials of increasing order. However, the polynomial contributions don’t have the divergent behaviour that you didn’t like, and in fact you can extend to high orders (as Geoff asked about) without further difficulty.

    Here’s the plot.
    Same color scheme as yours
    and here’s the R code
    do<-function(Y) {
    x <- ((1:length(Y))-177)/12
    x1 <- x*12.0/177        # rescale time to roughly [-1,1]
    x2 <- 1.5*x1^2-0.5      # Legendre P2
    x3 <- 2.5*x1^3-1.5*x1   # Legendre P3
    Y1 <- as.vector(Y)
    l <- lm(Y1 ~ x1 + x2 + x3)
    print(summary(l, correlation = TRUE))
    plot(x,Y1,type="l",ylim=c(-1,1),ylab="Anomaly DegC",xlab="Years")
    lines(x,predict(l),col="red")
    lines(x,l$coefficients[2]*x1,col="blue")
    lines(x,l$coefficients[3]*x2,col="green")
    lines(x,l$coefficients[4]*x3,col="orange")
    l
    }

  16. Geoff,
    A lot of the attribution studies use simple linear regression. I just wonder if ‘spurious compensation’ might be partly responsible for the ‘global dimming’ scare — the idea that aerosols are imposing a strong cooling keeping temperatures down that would otherwise be rising rapidly from strong GHG forcing. I know these are not polynomials but compensation can still occur, inflating the coefficients of each attributed forcing.

    I wonder how an attribution study would work out using orthogonal (Legendre) polynomials?

  17. For Nick Stokes,

    Many thanks for helping with my curiosity. Nice result. The maths in both cases is correct, but it shows a point about presentation. Care to repeat it on a longer temperature dataset, like Melbourne daily max since 1860? Just to see what washes up? I realise it’s not the optimum way, but am still keen to see why, in pictures.

  18. Geoff,
    One of my hobbyhorses in professional life is discouraging the fitting of complex curves with no underlying model in mind. I think people look at the fitted curves and invent explanations without good reason. However, I can run the program if you like. I don’t know where to find that particular dataset, though – do you have somewhere in mind?

  19. Nick,

    It is a small hobby-horse of mine too, originating a while back from some medical meta-studies.
    If you are happy to send me your email I’m happy to send the data, but only if you can really spare the time because the point of the thread has already been made.
    sherro1@optusnet.com.au

  20. What Nick said: “One of my hobbyhorses in professional life is discouraging the fitting of complex curves with no underlying model in mind.”

    Exactly. If you know the model, fine, you can fit a curve. Otherwise, all you are doing is smoothing one particular region of the data. You might as well use a spline. On the other hand, if you are trying to find and fit a model you might want to pick one which is not obviously wrong outside the fitted region. Can you fit a curve to climate data that someone somewhere won’t use for prediction?
