
Linear regression in R

The AUDPC is not the only method for summarizing disease progress; regression analyses are also often applied. Regression analysis is a statistical tool for describing the relationship between two or more quantitative variables such that one variable (the dependent or response variable) may be predicted from other variable(s) (the independent or predictor variable(s)). For example, if the relationship between the severity of disease and time is known, disease severity can be predicted at a specified time. If we have only one predictor variable and the response and the predictor variable have a linear relationship, the data can be analyzed with a simple linear model. When there is more than one predictor variable, multiple regression may be used. When linear models are not sufficient to describe the data, a nonlinear regression model may be useful. In this section, some examples of simple linear regression are presented using R.

Linear regression

Linear regression relates two variables x and y to answer the question, 'How does y change with x?' For example, what is disease severity (y) at two weeks (x)? Or, what will the expected disease severity be at the end of the growing season if no treatment is applied?

A simple linear regression model is of the form: yi = β0 + β1xi + εi, where εi is iid Normal(0, σ2); that is, the error terms are independent and identically distributed following a normal distribution with mean 0 and variance σ2. The word "linear" in the model refers to the linear influence of the parameters β0 and β1, which are the regression coefficients. Specifically, β1 is the slope of the regression line, that is, the change in y corresponding to a unit change in x. β0 is the intercept, or the y value when x = 0. The intercept has no practical meaning if the condition x = 0 cannot occur, but is necessary to specify the model.
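To make the notation concrete, data can be simulated from this model in R; the parameter values below (β0 = 2, β1 = 0.5, σ = 1) are arbitrary illustrative choices, not estimates from any data set in this section:

## Simulate data from a simple linear regression model
set.seed(1)                    # for reproducibility
b0 <- 2                        # intercept (illustrative value)
b1 <- 0.5                      # slope (illustrative value)
x <- 1:20                      # deterministic predictor
e <- rnorm(20, mean=0, sd=1)   # iid Normal(0, sigma^2) errors
y <- b0 + b1*x + e             # responses generated by the model
plot(x, y)
abline(b0, b1, lty=2)          # the true underlying line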

In simple linear regression, random variation enters the model only through the error term ε. The predictor variable, x, is considered a deterministic quantity. The validity of results from a typical linear regression model requires that the following assumptions hold:

  • the linear model is appropriate,
  • the error terms are independent,
  • the error terms are approximately normally distributed,
  • and the error terms have a common variance.

For more discussion of simple linear regression, the reader should refer to a text on regression analysis (e.g., Harrell 2001). If model checking for a simple linear regression model indicates that a linear model is not appropriate, one could consider a transformation, nonlinear regression, or nonparametric options.

How to perform linear regression in R

R provides the function lm(formula, data, subset) for fitting linear models.
For a complete description of lm, type: help(lm).
Here is a simple example of the relationship between disease severity and temperature:

## Disease severity as a function of temperature
# Response variable, disease severity
diseasesev<-c(1.9,3.1,3.3,4.8,5.3,6.1,6.4,7.6,9.8,12.4)
# Predictor variable, temperature (degrees Centigrade)
temperature<-c(2,1,5,5,20,20,23,10,30,25)
# Take a look at the data
plot(temperature,diseasesev)

Output

[Scatter plot of disease severity vs. temperature]

## For convenience, the data may be combined into a data frame
severity <- data.frame(diseasesev, temperature)
## Fit a linear model for the data and summarize
## the output from function lm()
severity.lm <- lm(diseasesev~temperature,data=severity)
## Generate a summary of the linear model
summary(severity.lm)

Output

Residuals:
    Min      1Q  Median      3Q     Max
-2.1959 -1.3584 -0.3417  0.7460  3.6957

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.66233    1.10082   2.418  0.04195 *
temperature  0.24168    0.06346   3.808  0.00518 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.028 on 8 degrees of freedom
Multiple R-Squared: 0.6445, Adjusted R-squared: 0.6001
F-statistic: 14.5 on 1 and 8 DF, p-value: 0.005175
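The estimates shown in this summary can also be extracted programmatically with standard extractor functions, for example:

## Extract the coefficient estimates and their 95% confidence intervals
coef(severity.lm)
confint(severity.lm)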


Graphical tools for testing assumptions

Common graphical tools for checking assumptions are plots that evaluate the characteristics of the residuals, the part of each observation that is not explained by the model:

  1. scatter plots of the residuals vs. x or the fitted value
  2. normal probability plots of the residuals.
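Both kinds of plot can be produced directly from the fitted model object; as a brief aside, plot() applied to an lm object displays several diagnostic plots, selected with the which argument used below:

## Display several diagnostic plots for the fitted model at once
par(mfrow=c(2,2))   # arrange plots in a 2 x 2 grid
plot(severity.lm)   # residuals vs fitted, normal QQ, scale-location, and leverage plots
par(mfrow=c(1,1))   # restore the single-plot layout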

Look for a pattern in the graph of residuals vs. fitted values that might suggest non-constant variance. If the variance is constant, the residuals should be scattered evenly around zero. Other patterns can indicate that a linear regression model may not be appropriate for the data. If the pattern indicates unequal variances, a more complicated model that specifies unequal variances may be appropriate, or a transformation of y might be useful.
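As an illustration of the transformation option (a sketch only; the model name severity.loglm is introduced here for illustration, and nothing below suggests these data require it), a log-transformed response could be fit with:

## Fit the model with a log-transformed response
severity.loglm <- lm(log(diseasesev)~temperature, data=severity)
summary(severity.loglm)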

## which=1 produces a graph of residuals vs fitted values
plot(severity.lm, which=1)

The graph shown below indicates some pattern in the residuals that could be explored further, though a pattern of this size, relative to the magnitude of change in the observed data, may be unimportant.

Output

[Plot of residuals vs. fitted values for severity.lm]

## which=2 produces a normal quantile-quantile (QQ) plot of the residuals
plot(severity.lm, which=2)

Output

[Normal QQ plot of the residuals]

Points should lie close to a straight line on the QQ plot if the residuals come from a normal distribution. The QQ plot above shows a very slight indication that the residuals might not come from a normal distribution, since the last two observations deviate farther from the fitted line than the rest.
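The graphical check can be supplemented with a formal test; as a sketch of one option (not part of the original analysis), the Shapiro-Wilk test can be applied to the residuals:

## Shapiro-Wilk test of the null hypothesis that the
## residuals come from a normal distribution
shapiro.test(resid(severity.lm))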

After fitting the linear model, the function predict() can be used to generate fitted values. The function data.frame() is used below to create a table of original data and fitted values.

options(digits=4)
## Obtain fitted values together with their standard errors
fit.with.se <- predict(severity.lm, se.fit=TRUE)
## Combine the original data, fitted values, and residuals in one table
data.frame(
  severity,
  fitted.value=predict(severity.lm),
  residual=resid(severity.lm),
  fit.with.se
)

Output

   diseasesev temperature fitted.value residual   fit se.fit df residual.scale
1         1.9           2        3.146  -1.2457 3.146 1.0004  8          2.028
2         3.1           1        2.904   0.1960 2.904 1.0499  8          2.028
3         3.3           5        3.871  -0.5707 3.871 0.8629  8          2.028
4         4.8           5        3.871   0.9293 3.871 0.8629  8          2.028
5         5.3          20        7.496  -2.1959 7.496 0.7425  8          2.028
6         6.1          20        7.496  -1.3959 7.496 0.7425  8          2.028
7         6.4          23        8.221  -1.8209 8.221 0.8545  8          2.028
8         7.6          10        5.079   2.5209 5.079 0.6920  8          2.028
9         9.8          30        9.913  -0.1127 9.913 1.1955  8          2.028
10       12.4          25        8.704   3.6957 8.704 0.9432  8          2.028
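The residual.scale column repeats the residual standard error reported by summary(severity.lm); as a quick check (a sketch):

## Residual standard error = sqrt(residual sum of squares / residual df)
sqrt(sum(resid(severity.lm)^2)/df.residual(severity.lm))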


We can also plot the data set together with the fitted line.

plot(
  diseasesev~temperature,
  data=severity,
  xlab="Temperature",
  ylab="% Disease Severity",
  pch=16
)
## Add the fitted regression line and a title
abline(severity.lm, lty=1)
title(main="Graph of % Disease Severity vs Temperature")

Output

[Scatter plot of % disease severity vs. temperature with the fitted regression line]

If the process of checking model assumptions indicates that the linear model is adequate, the linear model can be used to predict the response variable for a given predictor variable.

## Predict disease severity for three new temperatures
new <- data.frame(temperature=c(15,16,17))
predict(
  severity.lm,
  newdata=new,
  interval="confidence"
)

Output

    fit   lwr   upr
1 6.288 4.803 7.772
2 6.529 5.025 8.034
3 6.771 5.233 8.309
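Here interval="confidence" gives a confidence interval for the mean disease severity at each new temperature. An interval for a single new observation, which is wider, can be requested with interval="prediction":

## Prediction intervals for individual new observations
predict(
  severity.lm,
  newdata=new,
  interval="prediction"
)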


Linear regression with intercept and slope parameters can be used to describe straight-line relationships between variables. For disease progress measured over a limited time period, a straight line may provide a perfectly adequate description. Linear regression that includes additional parameters can be used to describe curved relationships, for example by including a squared (quadratic) term, such as time squared, in the set of predictors. More complicated relationships can readily be described using nonlinear regression.
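As an illustration of adding a quadratic term (a sketch only; the model name severity.quad is introduced here, and nothing above suggests these particular data need a curved fit), the I() wrapper allows a squared predictor inside the model formula:

## Quadratic regression: include a squared term as an additional predictor
severity.quad <- lm(diseasesev ~ temperature + I(temperature^2), data=severity)
summary(severity.quad)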


Next, nonlinear regression in R