Application of multiple regression for disease forecasting

In a simple linear regression, a response variable y is regressed on a single predictor (explanatory) variable x. In the context of disease forecasting, y might denote disease severity with x denoting a predictor like mean temperature. The model is y=β_{0}+β_{1}x+ε where ε is iid Normal (0,>σ^{2}); see Sparks et al. (2008) for an introduction to linear regression with examples in R. Multiple linear regression generalizes this methodology to allow for multiple predictor variables, such as mean temperature and mean relative humidity. Notation for a model including m predictors is y=β_{0}+β_{1}x_{1}+β_{2}x_{2}+...+β_{m}x_{m}+ε. One common multiple linear regression model is the polynomial model: y=β_{0}+β_{1}x_{1}+β_{2}x_{1}^{2}+β_{3}x_{2}+β_{4}x_{2}^{2}+β_{5}x_{1}x_{2}+ε. The estimate of the parameter can be calculated by redefining variables: w_{1}=x_{1}, w_{2}=x_{1}^{2}, w_{3}=x_{2}, w_{4}=x_{2}^{2}, w_{5}=x_{1}x_{2}. and performing a multiple linear regression model using y=β_{0}+β_{1}w_{1}+β_{2}w_{2}+β_{3}w_{3}+β_{4}w_{4}+β_{5}w_{5}+ε. This now is an ordinary multiple linear regression model using w as predictor variables.

The goal of multiple linear regression may be accurate prediction, or the regression coefficients may also be of direct interest themselves (Maindonald and Braun 2003).

Example of Multiple Linear Regression using R

The function used to fit multiple regression models is the same as for simple linear regression models: lm(formula,data)

A scatterplot of all variables may be constructed using the following command:

pairs(severity.mult)

Output

This produces the following scatterplot matrix illustrating the relationship between each pair of variables:

Click to enlarge.

The patterns illustrated in the graph indicate an approximately straight-line relationship between each pair of variables. A scatterplot is a useful tool to examine not only how the potential predictor variables possibly relate to the response, but to also provide an exploratory method to determine if predictor variables are associated with one another (i.e., multicollinearity).

Next we fit a multiple linear regression model to the data with disease as a response variable, and temperature, rel.humidity, and rainfall as predictor variables.

##Multiple regression model severity.mult.lm <- lm(disease~temperature+rel.humidity+rainfall, data=severity.mult) summary(severity.mult.lm) #summary of the linear model

These commands produce the following output:

Output

Residuals: Min 1Q Median 3Q Max -1.5630 -0.7119 -0.2576 0.3774 2.9775

Residual standard error: 1.563 on 6 degrees of freedom Multiple R-Squared: 0.8415, Adjusted R-squared: 0.7623 F-statistic: 10.62 on 3 and 6 DF, p-value: 0.008172

In the output from this first analysis, the p-value for the full model is 0.008172, small enough to suggest that at least one of the predictor variables may be useful for prediction. The adjusted R^{2} is 0.7622 which indicates that the predictor variables explain 76% of the variance in the response variable. The next step is to address the apparent correlation between the predictor variables.

Multicollinearity

When using multiple linear regression, multicollinearity, or correlation among predictor variables, is an important consideration. When multicollinearity is a problem, standard errors for the estimates will tend to be large and parameter estimates may be illogical. One can calculate a variance inflation factor (VIF) for each independent variable, where a VIF of five or greater indicates the existence of multicollinearity. A function vif() is available in R version 2.4.1. and the car library (requires downloading of the package from CRAN). In our example, the VIF values were 22.9 for temperature, 21.3 for relative humidity, and 2.4 for rainfall. The scatterplots had shown that temperature and relative humidity were probably strongly related, which may explain such large VIF values. For more conclusions on multicollinearity and how to handle it, see Heiburger and Holland (2003). In the example given above, there is some indication of multicollinearity. One way to try to eliminate multicollinearity problems is by doing variable selection to select a subset of predictor variables that are not strongly correlated.

One method to select appropriate predictor variables, ones that are useful for prediction and not strongly correlated with each other, is backward elimination (Neter et al. 1996). In this approach, all predictor variables are included in the starting models, and then particular predictor variables are removed to iteratively search for the best fit. The Akaike information criterion (AIC) can be used to select among models for an optimal combination of parsimony (limiting the model to the smallest number of parameters needed) and good fit. A lower AIC indicates a better model.

Df Sum of Sq RSS AIC - rainfall 1 0.804 15.467 10.361 14.662 11.827 - rel.humidity 1 17.326 31.988 17.628 - temperature 1 28.189 42.852 20.552

Step: AIC= 10.36 disease ~ temperature + rel.humidity

Df Sum of Sq RSS AIC 15.467 10.361 - rel.humidity 1 17.425 32.892 15.906 - temperature 1 33.031 48.497 19.789

Call: lm(formula = disease ~ temperature + rel.humidity, data = severity.mult)

Coefficients: (Intercept) temperature rel.humidity -78.2751 0.8308 0.5438

This output indicates the steps taken to select the model. The first model tested includes all three predictor variables and the AIC for that model is calculated. Then the AIC for each model that excludes a single predictor variable is calculated. Excluding rainfall results in a lower AIC, so that model is selected in the next step. When either temperature or relative humidity are removed from that model, the AIC goes up, indicating that the best model includes temperature and relative humidity as predictors and excludes rainfall. The new model still has the issue of multicollinearity between the two predictors (temperature and relative humidity). Further considering may be given for other methods to deal with this effect, including Ridge or Bayesian regression models (Neter et al. 1996). In the next step illustrated below, this model is analyzed.

Residual standard error: 1.486 on 7 degrees of freedom Multiple R-Squared: 0.8328, Adjusted R-squared: 0.7851 F-statistic: 17.44 on 2 and 7 DF, p-value: 0.00191

Just as for simple linear regression (see Sparks et al., 2008), it’s important to check that the model assumptions are adequately met. The following code produces four plots to help in model evaluation. The plot function in R provides several types of diagnostic graphs that can be selected by changing the number called in which = _.

par(mfcol=c(2, 2)) #Sets the graphic to display all # four plots in one window plot(severity.mult.lm2, which=1) #residuals versus fitted plot(severity.mult.lm2, which=2) #QQ plot plot(severity.mult.lm2, which=3) #plot of square-root # standardized residuals # against fitted value plot(severity.mult.lm2, which=5) #Residuals versus leverage plot

The four plots are of: (i) Residuals vs. fitted (or predicted) values, (ii) Scale-location (square-root of the standardized residual) vs. fitted values, (iii) Normal Quantile-Quantile (Q-Q) relation, and (iv) Residuals vs. leverage (influence of a single observation on model estimates).

Output

Click to enlarge.

Ideally the first two plots will show no relationship. A straight line in the Q-Q plot indicates a normal distribution for the residuals. Ideally the leverage of each observation would be equal. If Cook's distance is > 1 the observation may be considered an outlier. To obtain a summary of influential cases, using measures such as DFFITS, DFBETAS, and Cook's distance, there is a function called influence.measures that can be used with lm or glm objects (Neter et al. 1996).

The model can also be used to predict disease severity for new values of temperature and relative humidity. The following code produces the prediction for temperature = 95 and relative humidity = 30, along with confidence limits for the prediction.

new <- data.frame(temperature=95, rel.humidity=30) #new temperature and new rel humidity predict(severity.mult.lm2, newdata=new, interval="confidence")

Output

fit lwr upr [1,] 16.96165 11.42027 22.50303

Application of maximum likelihood based methods for disease forecasting

With the development of more rapid computing technologies, it is possible to compute a large number of simulations to support new modeling methods for plant disease forecasting. The development of disease forecasting models is based on determining the value of unknown parameters for an appropriate model.

One method for estimation of parameters is maximum likelihood estimation (MLE), first developed by R.A. Fisher (Myung 2003). The basic idea behind MLE is that one can calculate the probability of observing a particular set of data given the model of interest and this is called the likelihood of a sample. The likelihood of observing the actual data for each of the potential parameter values is calculated to determine which set of parameter values maximizes the likelihood.

To illustrate the concept of MLE, let's examine a random sample from a binomial distribution with given parameter p. MLE is then used to determine the parameter estimate, as if there was no knowledge of the parameter. Finally, the parameter estimate based on MLE is compared to the true value.

Before getting into this example, some background on MLE is necessary. To begin, based on some prior knowledge, a model for the data may be developed, which may be defined as a probability distribution function, f. The likelihood, L, is defined as a function of the parameter(s) in the model given the data: L(θ|x). When the number of samples is > 1, the likelihood is a joint likelihood of observations. In practice, it is often easier computationally to work with the log-likelihood, defined as logL(θ|x). The logL is maximized by taking the first derivative with respect to the parameter in order to obtain the estimate of the parameter.

Consider a binomial sampling situation where a coin is tossed 10 times (in R: rbinom(10,1,0.5)). If using 1 to represent the heads and 0 to the tails of the coin, this is one possible outcome for those 10 events: 0 1 0 1 1 0 0 1 0 0.

Given this data, a natural question is: What is the probability that a head will be observed on the next toss? One approach may be to count the number of times that a head was observed in the initial probability experiment. Following this approach, given an observation of four heads out of 10 coin tosses, the probability of a head on the next toss may be 0.4. This probability may also be estimated using MLE as a simple example of the MLE method, assuming that the occurrence of heads follows a binomial distribution with a probability parameter p. Formally, Prob(x = 1) = p and Prob(x = 0) = 1 - p, where x is an observation. All 10 tosses of the coin are considered independent, such that the 10 observations x1, x2,...,x10 can be considered a random sample from a binomial distribution. The joint likelihood becomes:

= (1 - p) * p * (1 - p) * p * p * (1 - p) * (1 - p) * p * (1 - p) * (1 - p)

= (1 - p)^{6}p^{4}

and logL as

logL = 6log(1 - p) + 4 log( p)

Now, the goal is to maximize the logL function. This may be accomplished by taking the first derivative of logL with respect to p and setting it equal to zero. In this situation, an explicit solution for p is obtained as:

solving this equation, gives p = 0.4.

This implies that the probability of heads is 0.4. But how good is this estimate? For the example above, the observations were obtained from a random binomial distribution with probability p = 0.5. Does this imply that the MLE estimate is not useful? In this case the random sample was only of size 10, which is a fairly small sample. Sample size is an important consideration for use of MLE or other estimation methods for plant disease forecasting models. The smaller the number of observations, the bigger the difference between the estimate and the true parameters is likely to be.

Suggested Exercise

Determine the MLE estimate for the probability of heads and then compare it to p = 0.5.

In R, there are currently several libraries that enable development of logistic regression models using MLE, as well as development of expert systems (classification and regression trees and neural networks, for example). In the Stats library, generalized linear models can be fit using the glm function. Yuen (2006; http://www.apsnet.org/edcenter/advanced/topics/Pages/DerivingDecisionRules.aspx) discusses generalized linear models and shows how they may be fit using SAS in his document. Furthermore, the Design library offers regressions models, including logistic regression, Cox regression, accelerated failure time models, and others (Harrell 2001). Plant pathological examples using functions in the Design library include Esker et al. (2006). Paul and Munkvold (2004) used the Design library, as developed for S-Plus, by the same author.

For the purpose of this article, an examination of logistic regression will be handled using the glm function and to show how a model may be constructed and compared using the deviance values. For further information regarding logistic regression, consult Harrell (2001) or Agresti (2002). Briefly, however, logistic regression is applied to determine the probability that disease occurs as:

where x are predictor variables of interest and α and β represent logistic coefficients to be estimated (Agresti 2002).

#The glm function is available in the Stats library. # For more information on glm, type: ?glm

#Example data where the event is presence or absence of disease # (prevalence) (1=presence,0=absence), with associated air # temperature over 3-months, a measure of maximum snowfall, # and an indicator for whether the disease occurred the # previous year. We might assume that favorable winter # weather increases the survival of a pathogen. This # information can be used to develop a model using logistic # regression, which uses maximum likelihood to obtain the # estimates of a parameter

Call: glm(formula = prevalence ~ month1, family = binomial, data = example.forecasting)

Deviance Residuals: Min 1Q Median 3Q Max -1.301 -1.230 1.010 1.135 1.275

Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -0.7501 1.5077 -0.498 0.619 month1 -0.1812 0.3057 -0.593 0.553

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 103.64 on 74 degrees of freedom Residual deviance: 103.28 on 73 degrees of freedom AIC: 107.28

Number of Fisher Scoring iterations: 3

#To determine if month1 is a useful predictor, we can perform a # deviance test, which compares the residual deviance with the # parameter added to the null deviance. Here, this is # 103.64-103.28 with 1 df and is a chi-square test.

1-pchisq(103.64-103.28,1)

Output

1-pchisq(103.64-103.28,1) [1] 0.5485062

#We can pick a particular model, but also may use the step # function to compare models and select the model with the # best fit:

Degrees of Freedom: 74 Total (i.e. Null); 71 Residual Null Deviance: 103.6 Residual Deviance: 80.78 AIC: 88.78

Here forward selection was used to select the best model; that is, the simplest model was evaluated and predictor variables were added if they improved the fit using the AIC as criterion. Based on the above output, the following is a potential forecasting model . Click to enlarge

Suggested Exercise

When model1 was entered initially and its deviance was determined (e.g., the reduction in log-likelihood from the null model), it appeared that this explanatory variable did not improve the model. However, when using the stepwise procedure, this variable was selected for inclusion in the best model. From this, consider the development of a forecasting model where variables are added individually and compared, as well as a forecasting regression model based on a backwards selection procedure. Do you obtain the same models? The following R code can be used for performing a backward selection procedure.