Linear regression in R

The AUDPC is not the only method for summarizing disease progress; regression analyses are also often applied. Regression analysis is a statistical tool for describing the relationship between two or more quantitative variables such that one variable (the dependent or response variable) may be predicted from other variable(s) (the independent or predictor variable(s)). For example, if the relationship between the severity of disease and time is known, disease severity can be predicted at a specified time. If we have only one predictor variable and the response and the predictor variable have a linear relationship, the data can be analyzed with a simple linear model. When there is more than one predictor variable, multiple regression may be used. When linear models are not sufficient to describe the data, a nonlinear regression model may be useful. In this section, some examples of simple linear regression are presented using R.

Linear regression compares two variables *x* and *y* to answer the question, 'how does *y* change with *x*?' For example, what is disease severity (*y*) at two weeks (*x*)? Or, what will be the expected disease severity at the end of the growing season if no treatment is applied?

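A simple linear regression model is of the form *y_{i} = β_{0} + β_{1}x_{i} + ε_{i}*, where *y_{i}* is the response for observation *i*, *x_{i}* is the corresponding value of the predictor, *β_{0}* is the intercept, *β_{1}* is the slope, and *ε_{i}* is the random error term.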

In simple linear regression the influence of random error in the model is allowed only in the error term *ε*. The predictor variable, *x*, is considered to be a deterministic quantity. The validity of the result for a typical linear regression model requires the fulfillment of the following assumptions:

- the linear model is appropriate,
- the error terms are independent,
- the error terms are approximately normally distributed,
- and the error terms have a common variance.
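To make the model and its assumptions concrete, the following sketch simulates data that satisfy them and then recovers the parameters with `lm()`. The seed, sample size, and "true" parameter values are arbitrary assumptions chosen for illustration; they do not come from any real disease data set.

```r
## Simulate data from y = beta0 + beta1 * x + error.
## All numeric settings below are illustrative assumptions.
set.seed(1)
n     <- 200
beta0 <- 2     # assumed true intercept
beta1 <- 0.5   # assumed true slope
sigma <- 1     # assumed common standard deviation of the errors

x   <- seq(0, 30, length.out = n)        # deterministic predictor
eps <- rnorm(n, mean = 0, sd = sigma)    # independent, normal, equal variance
y   <- beta0 + beta1 * x + eps

## Fit the linear model and compare estimates with the assumed values
sim.lm <- lm(y ~ x)
coef(sim.lm)
```

Because the errors are generated exactly as the assumptions require, the estimated intercept and slope should fall close to the assumed values, and the diagnostic plots for such a fit give a rough picture of what "no problem" looks like.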

For more discussion on simple linear regression the reader should refer to a text for regression analysis (e.g. Harrell 2001). If model checking for a simple linear regression model indicates that a linear model is not appropriate, one could consider transformation, nonlinear regression, or other nonparametric options.

R makes the function `lm(formula, data, subset)` available. For a complete description of `lm`, type: `help(lm)`.

Here is a simple example of the relationship between disease severity and temperature:

```r
## Disease severity as a function of temperature

# Response variable, disease severity
diseasesev <- c(1.9,3.1,3.3,4.8,5.3,6.1,6.4,7.6,9.8,12.4)

# Predictor variable, temperature (degrees Celsius)
temperature <- c(2,1,5,5,20,20,23,10,30,25)

# Take a look at the data
plot(temperature, diseasesev)

## For convenience, the data may be formatted into a dataframe
severity <- data.frame(diseasesev, temperature)

## Fit a linear model for the data with function lm()
severity.lm <- lm(diseasesev ~ temperature, data=severity)

## Generate a summary of the linear model
summary(severity.lm)
```

```
Residuals:
    Min      1Q  Median      3Q     Max
-2.1959 -1.3584 -0.3417  0.7460  3.6957

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.66233    1.10082   2.418  0.04195 *
temperature  0.24168    0.06346   3.808  0.00518 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.028 on 8 degrees of freedom
Multiple R-Squared: 0.6445, Adjusted R-squared: 0.6001
F-statistic: 14.5 on 1 and 8 DF, p-value: 0.005175
```
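It may help to see where the estimates in the summary come from. The sketch below computes the least-squares slope and intercept directly from the data using the standard textbook formulas and compares them with the coefficients reported by `lm()`. The helper names `sxy`, `sxx`, `b0`, `b1`, and `r2` are introduced here for illustration only.

```r
## Least-squares estimates computed by hand (illustrative sketch)
diseasesev  <- c(1.9,3.1,3.3,4.8,5.3,6.1,6.4,7.6,9.8,12.4)
temperature <- c(2,1,5,5,20,20,23,10,30,25)

## Sums of cross-products and squares about the means
sxy <- sum((temperature - mean(temperature)) * (diseasesev - mean(diseasesev)))
sxx <- sum((temperature - mean(temperature))^2)

b1 <- sxy / sxx                                   # slope
b0 <- mean(diseasesev) - b1 * mean(temperature)   # intercept
c(intercept = b0, slope = b1)

## The same values are stored in the fitted model object
fit <- lm(diseasesev ~ temperature)
coef(fit)

## R-squared can also be extracted rather than read off the printout
r2 <- summary(fit)$r.squared
r2
```

Extracting values with `coef()` and `summary()$r.squared` is usually more convenient than copying numbers from the printed summary when results are needed in later calculations.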

Common graphical tools for testing assumptions include plots evaluating the characteristics of residuals, the part of the observations that is not explained by the model:

- scatter plots of the residuals vs. *x* or the fitted values
- normal probability plots of the residuals.

Look for a pattern in the graph of residuals vs. fitted values that might suggest non-constant variance. If the error variance is constant, the residuals should be dispersed evenly around zero across the range of fitted values. Other patterns can indicate that a linear regression model may not be appropriate for the data. If the pattern indicates unequal variances, a more complicated model that specifies unequal variances may be appropriate, or transformation of *y* might be useful.

```r
## which=1 produces a graph of residuals vs fitted values
plot(severity.lm, which=1)
```

The resulting graph indicates some pattern in the residuals that could be explored further, though a pattern of this size, relative to the magnitude of changes in the observed data, may be unimportant.
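One way to explore such a pattern is to refit the model with a transformed response and compare the residual plots side by side. The sketch below uses a log transformation purely as an example; the choice of transformation is an assumption made here, not a recommendation for these data, and the object name `severity.log.lm` is hypothetical.

```r
## Compare residuals for the original and a log-transformed response
## (illustrative sketch; log() is just one possible transformation)
diseasesev  <- c(1.9,3.1,3.3,4.8,5.3,6.1,6.4,7.6,9.8,12.4)
temperature <- c(2,1,5,5,20,20,23,10,30,25)
severity    <- data.frame(diseasesev, temperature)

severity.lm     <- lm(diseasesev ~ temperature, data = severity)
severity.log.lm <- lm(log(diseasesev) ~ temperature, data = severity)

## Draw the two residuals-vs-fitted plots next to each other
op <- par(mfrow = c(1, 2))
plot(fitted(severity.lm), resid(severity.lm),
     xlab = "Fitted values", ylab = "Residuals", main = "Original scale")
abline(h = 0, lty = 2)
plot(fitted(severity.log.lm), resid(severity.log.lm),
     xlab = "Fitted values", ylab = "Residuals", main = "log(diseasesev)")
abline(h = 0, lty = 2)
par(op)
```

Note that fitted values and residuals from the second model are on the log scale, so the two plots can be compared for shape but not for magnitude.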

```r
## which=2 produces a quantile-quantile (QQ) plot
plot(severity.lm, which=2)
```

Points should lie close to a straight line on the QQ plot if the residuals come from a normal distribution. The QQ plot for this model shows a very slight indication that the residuals might not come from a normal distribution, since the last two observations deviate farther from the reference line.
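A visual impression from the QQ plot can be supplemented with a formal test. The sketch below draws the QQ plot directly from the residuals and applies the Shapiro-Wilk test, which is available in base R as `shapiro.test()`. This is offered as one possible complement, not as part of the original analysis; with only ten observations such a test has little power, so a non-significant result should not be read as proof of normality.

```r
## QQ plot and Shapiro-Wilk test on the residuals (illustrative sketch)
diseasesev  <- c(1.9,3.1,3.3,4.8,5.3,6.1,6.4,7.6,9.8,12.4)
temperature <- c(2,1,5,5,20,20,23,10,30,25)
severity.lm <- lm(diseasesev ~ temperature)

res <- resid(severity.lm)

## QQ plot of the raw residuals with a reference line
qqnorm(res)
qqline(res)

## Null hypothesis: the residuals come from a normal distribution
sw <- shapiro.test(res)
sw
```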

After fitting the linear model, the function `predict()` can be used to generate fitted values. The function `data.frame()` is used below to create a table of original data and fitted values.

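```r
options(digits=4)

## Fitted values together with their standard errors
fit.with.se <- predict(severity.lm, se.fit=TRUE)

## Table of original data, fitted values, and residuals
data.frame(
  severity,
  fitted.value=predict(severity.lm),
  residual=resid(severity.lm),
  fit.with.se
)
```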

```
   diseasesev temperature fitted.value residual   fit se.fit df residual.scale
1         1.9           2        3.146  -1.2457 3.146 1.0004  8          2.028
2         3.1           1        2.904   0.1960 2.904 1.0499  8          2.028
3         3.3           5        3.871  -0.5707 3.871 0.8629  8          2.028
4         4.8           5        3.871   0.9293 3.871 0.8629  8          2.028
5         5.3          20        7.496  -2.1959 7.496 0.7425  8          2.028
6         6.1          20        7.496  -1.3959 7.496 0.7425  8          2.028
7         6.4          23        8.221  -1.8209 8.221 0.8545  8          2.028
8         7.6          10        5.079   2.5209 5.079 0.6920  8          2.028
9         9.8          30        9.913  -0.1127 9.913 1.1955  8          2.028
10       12.4          25        8.704   3.6957 8.704 0.9432  8          2.028
```
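The `se.fit` and `df` columns can be used to build confidence intervals for the fitted values: each interval is the fitted value plus or minus a *t* quantile times its standard error. The sketch below assumes a 95% confidence level and checks the hand calculation against `predict()` with `interval="confidence"`; the names `t.crit`, `ci.by.hand`, and `ci.predict` are introduced here for illustration.

```r
## Confidence intervals for fitted values from se.fit (illustrative sketch)
diseasesev  <- c(1.9,3.1,3.3,4.8,5.3,6.1,6.4,7.6,9.8,12.4)
temperature <- c(2,1,5,5,20,20,23,10,30,25)
severity.lm <- lm(diseasesev ~ temperature)

fit.with.se <- predict(severity.lm, se.fit = TRUE)

## Two-sided 95% critical value with the residual degrees of freedom
t.crit <- qt(0.975, df = fit.with.se$df)

ci.by.hand <- cbind(
  fit = fit.with.se$fit,
  lwr = fit.with.se$fit - t.crit * fit.with.se$se.fit,
  upr = fit.with.se$fit + t.crit * fit.with.se$se.fit
)

## predict() can produce the same intervals directly
ci.predict <- predict(severity.lm, interval = "confidence")
all.equal(ci.by.hand, ci.predict, check.attributes = FALSE)
```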

We can also plot the data set together with the fitted line.

```r
## Scatter plot of the data
plot(
  diseasesev~temperature,
  data=severity,
  xlab="Temperature",
  ylab="% Disease Severity",
  pch=16
)

## Add the fitted regression line and a title
abline(severity.lm, lty=1)
title(main="Graph of % Disease Severity vs Temperature")
```

If the process of checking model assumptions indicates that the linear model is adequate, the linear model can be used to predict the response variable for a given predictor variable.

```r
## Predict disease severity for three new temperatures
new <- data.frame(temperature=c(15,16,17))

predict(
  severity.lm,
  newdata=new,
  interval="confidence"
)
```

```
    fit   lwr   upr
1 6.288 4.803 7.772
2 6.529 5.025 8.034
3 6.771 5.233 8.309
```
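The intervals above describe uncertainty in the mean disease severity at each temperature. If the goal is instead to bracket a single new observation, `predict()` also accepts `interval="prediction"`, which adds the residual variation and therefore gives wider limits. The sketch below compares the two; the object names `conf.int` and `pred.int` are introduced here for illustration.

```r
## Confidence vs prediction intervals (illustrative sketch)
diseasesev  <- c(1.9,3.1,3.3,4.8,5.3,6.1,6.4,7.6,9.8,12.4)
temperature <- c(2,1,5,5,20,20,23,10,30,25)
severity    <- data.frame(diseasesev, temperature)
severity.lm <- lm(diseasesev ~ temperature, data = severity)

new <- data.frame(temperature = c(15, 16, 17))

## Interval for the mean response at each new temperature
conf.int <- predict(severity.lm, newdata = new, interval = "confidence")

## Interval for an individual future observation at each new temperature
pred.int <- predict(severity.lm, newdata = new, interval = "prediction")

## Point predictions are identical; only the limits differ
cbind(new, conf.int, pred.lwr = pred.int[, "lwr"], pred.upr = pred.int[, "upr"])
```

Which interval is appropriate depends on the question: the confidence interval suits statements about average severity, while the prediction interval suits statements about what might be observed in one particular field or plot.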

Linear regression with intercept and slope parameters can be used to describe straight-line relationships between variables. For disease progress measured over a limited time period, a straight line may provide a perfectly adequate description. Linear regression that includes additional parameters can be used to describe curved relationships, for example by including a squared term (quadratic term) in the set of predictors, such as time squared. More complicated relationships can readily be described using nonlinear regression.
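As a sketch of how a squared term is specified in R, the example below adds a quadratic term to the temperature model used in this section. The predictor here is temperature rather than time only because that is the data set at hand, and with just ten observations the data may not support the extra parameter; the object name `severity.quad.lm` is hypothetical.

```r
## Adding a quadratic term to a linear model (illustrative sketch)
diseasesev  <- c(1.9,3.1,3.3,4.8,5.3,6.1,6.4,7.6,9.8,12.4)
temperature <- c(2,1,5,5,20,20,23,10,30,25)
severity    <- data.frame(diseasesev, temperature)

## Straight-line model
severity.lm <- lm(diseasesev ~ temperature, data = severity)

## I() protects the arithmetic so that ^2 means "squared" inside a formula
severity.quad.lm <- lm(diseasesev ~ temperature + I(temperature^2),
                       data = severity)
summary(severity.quad.lm)

## F test comparing the straight-line and quadratic models
anova(severity.lm, severity.quad.lm)
```

The model is still "linear" in the statistical sense because it is linear in its parameters, which is why `lm()` can fit it even though the fitted curve is not a straight line.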