APSnet Education Center Advanced Topics | Return to the Topics Index | Ecology and Epidemiology in R



Sparks, A.H., P.D. Esker, G. Antony, L. Campbell, E.E. Frank, L. Huebel, M.N. Rouse, B. Van Allen, and K.A. Garrett. 2008. Ecology and Epidemiology in R: Spatial Analysis. The Plant Health Instructor. DOI:10.1094/PHI-A-2008-0129-03.

Ecology and Epidemiology in R: Spatial Analysis

Advanced Illustration: Using the Beta-binomial Distribution

Roumagnac, P. Pruvost, O. Chiroleu, F., and Hughes, G. 2004. Spatial and temporal analyses of bacterial blight of onion caused by Xanthomonas axonopodis pv. allii. Phytopathology 94:138-146.

Four different methods for describing spatial patterns have been discussed and illustrated thus far in this document, and now we introduce the use of probability distributions to describe spatial pattern. This is a well-established method in plant pathology. Beta-binomial analysis is useful for describing aggregated patterns of plant disease (Hughes and Madden 1993).

Bacterial blight of onion (BBO) is a serious threat to onion production world-wide. The disease appears as lenticular water soaked spots on the leaves which in later stages turn into dry chlorotic lesions. In severe cases these lesions coalesce resulting in leaf dieback and reduced bulb size. A number of different Allium spp., including garlic, welsh onion, shallot, and leek, are hosts of this bacterium. Recent studies have confirmed that the pathogen is seed borne (Roumagnac et al. 2004).

Roumagnac et al. (2004) examined the temporal and spatial dynamics of BBO to hypothesize whether BBO epidemics can arise from seed borne inoculum. The experimental plots were at the CIRAD experimental station in Reunion Island, France, 1999 to 2001. A total of five plots (P1-P5) were sown with onion seed naturally contaminated with X. axonopodis pv. alli. The incidence of X. axonopodis pv. alli in seed prior to sowing was 0.04%.

Plots were sown at different sites each year to avoid inoculum from previous crop debris. In 2000 and 2001, 70 m separated the two plots and, to reduce inoculum movement from plot to plot, each plot was surrounded by seven rows of maize windbreak followed by 15 rows of onion trap plants established from an uninfested seed lot. Each plot included 8200 plants and was watered using overhead irrigation.

BBO incidence in each plot was determined at 15 day intervals after leaf emergence and 30 day intervals in onion trap outer rings. Symptom development on each plant was recorded by visual assessment. Spatial patterns were analyzed using beta-binomial distributions, binary power law analysis, and spatial autocorrelation analysis.

As described in "An introduction to the R programming environment" (Garrett et al. 2007), the major difference between beta-binomial and binomial distributions is that the probability that a plant is scored as 'diseased' is variable in the former while it is constant in the latter. Therefore, the beta-binomial distribution has a higher variance and better fits the data derived from aggregated patterns of disease incidence (Hughes and Madden 1993). The beta-binomial is fit by estimating two parameters, the mean disease incidence and the aggregation parameter that characterizes variability in the probability of infection.

With the use of R and the data from Roumagnec et al. (2004), let's look at how the beta-binomial parameter estimates change over time for this example.

Change in Beta-binomial Parameter Over Time

# Input aggregation parameter estimates and disease assessments for each plot
P1.aggregation.parameter <- c(0.0339, 0.0801, 0.2533, 0.3202, 0.3203)
P2.aggregation.parameter <- c(0.0214, 0.1877, 0.3991, 0.5125, 0.6247,0.6199)
P3.aggregation.parameter <- c(0.0071, 0.1252, 0.2023, 0.6080, 0.5709,0.4649, 0.4641)

P1.diseaseAssessment.daysAfterSowing <- c(81, 96, 109, 123, 137)
P2.diseaseAssessment.daysAfterSowing <- c(63, 77, 91, 105, 119, 133)
P3.diseaseAssessment.daysAfterSowing <- c(63, 77, 91, 105, 119, 133, 147)
plot       (P1.diseaseAssessment.daysAfterSowing,
 P1.aggregation.parameter,
  col='medium blue',
  type='b',
  lty=1,
  pch=19,
  xlab='Disease assessment (days after sowing)',
  ylab='Aggregation parameter',
  ylim=c(0,0.7),
  xlim=c(50, 150),
  main="Aggregation Parameter Over Time")
lines      ( P2.diseaseAssessment.daysAfterSowing,
 P2.aggregation.parameter,
  type='b',
  lty=1,
  pch=20,
  col='red')
lines      ( P3.diseaseAssessment.daysAfterSowing,
 P3.aggregation.parameter,
  type='b',
  lty=1,
  pch=21,
  col='dark green')
legend     ('bottomright',
  c("P1 - 1999","P2 - 2000","P3 - 2000"),
  lty=c(1,1,1),
  pch=c(19,20,21),
  col=c("medium blue", "red", "dark green"),
  inset=0.05)

Output

Click to enlarge

Fitting a beta-binomial model for disease incidence using R

This example will show how maximum likelihood and optimization may be used to obtain the estimates of the beta-binomial distribution. Our approach will closely follow the description of Hughes and Madden (1993) who illustrated the use of the beta-binomial distribution for handling overdispersed disease incidence data obtained from quadrats. The example will illustrate the use of optimization following the approach of Altham (2002). Although the Altham example was written for S-Plus it is still useful as a basis for our R program. Because the goal is to illustrate optimization of a likelihood function for plant pathology, some of the more technical details of using the beta-binomial distribution as specified in the analyses of Hughes and Madden (1993) will not be covered and the reader is recommended to consult that reference.

The beta-binomial distribution

Here, n corresponds to the number observed (i.e., plants assessed within a quadrat), x corresponds to the number of positive results (i.e., diseased plants assessed within a quadrat), and α and β are the beta-binomial parameters. The two parameters are both positive constants and can be varied to produce a wide range of distributional forms (which is why they are useful for describing overdispersed data).

Next, how to estimate α and β? R includes useful functions for numerical optimization, an approach that can help us determine parameter estimates from distributions, models, and functions. For our example, a general purpose optimization function (optim) from R will be used. To use the optim function, two key pieces are needed: (1) the log-likelihood function for the beta-binomial distribution and (2) initial starting estimates for the two parameters of the model, α and β. For further information on the optim function in R, type: ?optim or help(optim). Furthermore, for an excellent introduction to maximum likelihood and optimization in R, we recommend consulting Bolker (In Press).

In the following example, the data set, named tasmania_test_1.txt, were kindly provided by S.J. Pethybridge (Pethybridge et al. 2005), based on an assessment of the incidence of foliar symptoms due to ray blight disease of pyrethrum in 62 quadrats. The data are in three columns: column one lists the quadrat number, column two the number of samples assessed per quadrat (here = 6), and the third column gives the number of diseased samples in the quadrat (i.e., 0, 1, 2, 3, 4, 5, or 6). Save the file to your computer to use in this example. Refer to "An introduction to the R programming environment" for use of text data files in R (Garrett et al. 2007).

Importing the data into R

# The name of the dataset is (as shown here): tasmania_test_1.txt
# Be sure to modify the file location to the location where
# you saved your tasmania_test_1.txt file.

tasmania <- read.table('path_to/tasmania_test_1.txt', header=T)

##### First, a binomial model will be fit to the data using the glm function
# The assumption here is a constant probability, such that
# counti ~ independent      Binomial(group_sizei, p), 1 = i = N
# Reference, Altham 2002

attach(tasmania) # This way, column headings may be called out for analysis
names(tasmania) # What are those column headings?

#### Examine for potential overdispersion using non-statistical methods

table(tasmania[,3]) #### notice how there are more quadrats with 6 disease plants

Output


 [1] "quad"       "group_size" "count"     

 2  3  4  5  6 
 1  6  4  4 47 

Run glm on the data to check for overdispersion

binom.tas=glm(count/group_size~1, family=binomial, weights=group_size)
summary(binom.tas)

# The deviance is 117.76 on 61 degrees-of-freedom, 
# indicating that there is much variation that is still unexplained (i.e., overdispersion)

Output


Call:
glm(formula = count/group_size ~ 1, family = binomial, weights = group_size)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-3.448   1.072   1.072   1.072   1.072  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   2.2967     0.1799   12.77   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 117.76  on 61  degrees of freedom
Residual deviance: 117.76  on 61  degrees of freedom
AIC: 152.12

Number of Fisher Scoring iterations: 5

Estimate fitted values

#### An estimate of the fitted values for each quadrat may be obtained as:

binom.tas$fitted.values # For illustration, the probability value is = 0.9086022

#### Note, you could also obtain the probability as: exp(x) / (1+exp(x)), where is the 
#### parameter in the model (here, the intercept term)

Output


        1         2         3         4         5         6         7         8 
0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 
        9        10        11        12        13        14        15        16 
0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 
       17        18        19        20        21        22        23        24 
0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 
       25        26        27        28        29        30        31        32 
0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 
       33        34        35        36        37        38        39        40 
0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 
       41        42        43        44        45        46        47        48 
0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 
       49        50        51        52        53        54        55        56 
0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 
       57        58        59        60        61        62 
0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 


Fit Beta-binomial

#### Beta-Binomial fitting
# In order to obtain the estimates for alpha and beta, the log-likelihood function needs 
# to be defined.  Consult Bolker (In Press) for further information regarding the determination 
# of the likelihood.  For the purpose of this exercise, the following function for the 
# log-likelihood is used:

lbetabin = function (data, inits) {
n=data[,1] # This corresponds to the group_size
r=data[,2] # This corresponds to the data on incidence
alpha=inits[1] # This corresponds to the initial starting parameter for alpha
beta=inits[2] # This corresponds to the initial starting parameter for beta

# Because optim minimizes a function, the negative log-likelihood is used
# Also, the factorial is not necessary, as it does not depend on the parameters of interest

sum(-lgamma(alpha+r)-lgamma(beta+n-r)+lgamma(alpha+beta+n)+lgamma(alpha)+lgamma(beta)-lgamma(alpha+beta))

}

#### Define initial starting parameters:
# For alpha, use the estimated p from the binomial and for beta, use 1-p

inits=c(binom.tas$fitted.values[1],1-binom.tas$fitted.values[1])

# ?optim for further information
# The method used was BFGS = Broyden-Fletcher-Goldfarb-Shanno algorithm
# Consult Bolker (In Press) for discussion of different methods

optim.tas=optim(inits, lbetabin, method='BFGS', hessian=T, data=tasmania[,2:3])
optim.tas

# optim.tas$par provided the estimated alpha and beta, which will be used to obtain the new p

# optim.tas$value is the estimated log-likelihood value, which matches results from 
# Pethybridge using the BBD program of Hughes and Madden

Output


$par
        1         1 
1.9235961 0.1813362 

$value
[1] 96.11122

$counts
function gradient 
      24        9 

$convergence
[1] 0

$message
NULL

$hessian
          1         1
1   3.30003 -29.37454
1 -29.37454 436.27087


Estimate mean disease incidence

#### An estimate of the mean disease incidence may be obtained as:

new.p = optim.tas$par[1]/(optim.tas$par[1]+optim.tas$par[2])
new.p # =0.9138518

Output


        1 
0.9138518 


Estimate of variation in mean disease incidence

#### An estimate of the variation in pi (mean disease incidence) is:

var.pi = 1/( optim.tas$par[1]+optim.tas$par[2])
var.pi # = 0.4750747

Ouput


        1 
0.4750747 


The value, var.pi, represents an estimate of the variation in p and can be thought of as an index of aggregation. When this value is larger than 0, the variance observed in the beta-binomial distribution is larger than for that of the binomial, indicating aggregation.