Link to home

Advanced Discussion and Illustration: Beta-binomial analysis

Roumagnac, P. Pruvost, O. Chiroleu, F., and Hughes, G. 2004. Spatial and temporal analyses of bacterial blight of onion caused by Xanthomonas axonopodis pv. allii. Phytopathology 94:138-146.

Four different methods for describing spatial patterns have been discussed and illustrated thus far in this document, and now we introduce the use of probability distributions to describe spatial pattern. This is a well-established method in plant pathology. Beta-binomial analysis is useful for describing aggregated patterns of plant disease (Hughes and Madden 1993).

Bacterial blight of onion (BBO) is a serious threat to onion production world-wide. The disease appears as lenticular water soaked spots on the leaves which in later stages turn into dry chlorotic lesions. In severe cases these lesions coalesce resulting in leaf dieback and reduced bulb size. A number of different Allium spp., including garlic, welsh onion, shallot, and leek, are hosts of this bacterium. Recent studies have confirmed that the pathogen is seed borne (Roumagnac et al. 2004).

Roumagnac et al. (2004) examined the temporal and spatial dynamics of BBO to hypothesize whether BBO epidemics can arise from seed borne inoculum. The experimental plots were at the CIRAD experimental station in Reunion Island, France, 1999 to 2001. A total of five plots (P1-P5) were sown with onion seed naturally contaminated with X. axonopodis pv. alli. The incidence of X. axonopodis pv. alli in seed prior to sowing was 0.04%.

Plots were sown at different sites each year to avoid inoculum from previous crop debris. In 2000 and 2001, 70 m separated the two plots and, to reduce inoculum movement from plot to plot, each plot was surrounded by seven rows of maize windbreak followed by 15 rows of onion trap plants established from an uninfested seed lot. Each plot included 8200 plants and was watered using overhead irrigation.

BBO incidence in each plot was determined at 15 day intervals after leaf emergence and 30 day intervals in onion trap outer rings. Symptom development on each plant was recorded by visual assessment. Spatial patterns were analyzed using beta-binomial distributions, binary power law analysis, and spatial autocorrelation analysis.

As described in "An introduction to the R programming environment" (Garrett et al. 2007), the major difference between beta-binomial and binomial distributions is that the probability that a plant is scored as 'diseased' is variable in the former while it is constant in the latter. Therefore, the beta-binomial distribution has a higher variance and better fits the data derived from aggregated patterns of disease incidence (Hughes and Madden 1993). The beta-binomial is fit by estimating two parameters, the mean disease incidence and the aggregation parameter that characterizes variability in the probability of infection.

With the use of R and the data from Roumagnec et al. (2004), let's look at how the beta-binomial parameter estimates change over time for this example.

Change in Beta-binomial Parameter Over Time

# Input aggregation parameter estimates and disease
# assessments for each plot
P1.aggregation.parameter <-
c(0.0339, 0.0801, 0.2533, 0.3202, 0.3203)
P2.aggregation.parameter <-
c(0.0214, 0.1877, 0.3991, 0.5125, 0.6247,0.6199)
P3.aggregation.parameter <-
c(0.0071, 0.1252, 0.2023, 0.6080, 0.5709,0.4649, 0.4641)
P1.diseaseAssessment.daysAfterSowing <-
c(81, 96, 109, 123, 137)
P2.diseaseAssessment.daysAfterSowing <-
c(63, 77, 91, 105, 119, 133)
P3.diseaseAssessment.daysAfterSowing <-
c(63, 77, 91, 105, 119, 133, 147)
plot (P1.diseaseAssessment.daysAfterSowing,
col='medium blue',
xlab='Disease assessment (days after sowing)',
ylab='Aggregation parameter',
xlim=c(50, 150),
main="Aggregation Parameter Over Time")
lines (P2.diseaseAssessment.daysAfterSowing,
lines ( P3.diseaseAssessment.daysAfterSowing,
col='dark green')
legend ('bottomright',
c("P1 - 1999","P2 - 2000","P3 - 2000"),
col=c("medium blue", "red", "dark green"),



Click to enlarge

Fitting a beta-binomial model for disease incidence using R

This example will show how maximum likelihood and optimization may be used to obtain the estimates of the beta-binomial distribution. Our approach will closely follow the description of Hughes and Madden (1993) who illustrated the use of the beta-binomial distribution for handling overdispersed disease incidence data obtained from quadrats. The example will illustrate the use of optimization following the approach of Altham (2002). Although the Altham example was written for S-Plus it is still useful as a basis for our R program. Because the goal is to illustrate optimization of a likelihood function for plant pathology, some of the more technical details of using the beta-binomial distribution as specified in the analyses of Hughes and Madden (1993) will not be covered and the reader is recommended to consult that reference.

The beta-binomial distribution

Here, n corresponds to the number observed (i.e., plants assessed within a quadrat), x corresponds to the number of positive results (i.e., diseased plants assessed within a quadrat), and α and β are the beta-binomial parameters. The two parameters are both positive constants and can be varied to produce a wide range of distributional forms (which is why they are useful for describing overdispersed data).

Next, how to estimate α and β? R includes useful functions for numerical optimization, an approach that can help us determine parameter estimates from distributions, models, and functions. For our example, a general purpose optimization function (optim) from R will be used. To use the optim function, two key pieces are needed: (1) the log-likelihood function for the beta-binomial distribution and (2) initial starting estimates for the two parameters of the model, α and β. For further information on the optim function in R, type: ?optim or help(optim). Furthermore, for an excellent introduction to maximum likelihood and optimization in R, we recommend consulting Bolker (In Press).

In the following example, the data set, named tasmania_test_1.txt, were kindly provided by S.J. Pethybridge (Pethybridge et al. 2005), based on an assessment of the incidence of foliar symptoms due to ray blight disease of pyrethrum in 62 quadrats. The data are in three columns: column one lists the quadrat number, column two the number of samples assessed per quadrat (here = 6), and the third column gives the number of diseased samples in the quadrat (i.e., 0, 1, 2, 3, 4, 5, or 6). Save the file to your computer to use in this example. Refer to "An introduction to the R programming environment" for use of text data files in R (Garrett et al. 2007).

Importing the data into R

# The name of the dataset is (as shown here): tasmania_test_1.txt
# Be sure to modify the file location to the location where
# you saved your tasmania_test_1.txt file.
tasmania <- read.table('path_to/tasmania_test_1.txt', header=T)
##### First, a binomial model will be fit to
##### the data using the glm function
# The assumption here is a constant probability, such that
# counti ~ independent Binomial(group_sizei, p), 1 = i = N
# Reference, Altham 2002
attach(tasmania) # This way, column headings may be
# called out for analysis
names(tasmania) # What are those column headings?
#### Examine for potential overdispersion
#### using non-statistical methods
table(tasmania[,3]) #### notice how there are more
#### quadrats with 6 disease plants


 [1] "quad"       "group_size" "count"
  2  3  4  5  6
1 6 4 4 47

Run glm on the data to check for overdispersion

# The deviance is 117.76 on 61 degrees-of-freedom,
# indicating that there is much variation that is
# still unexplained (i.e., overdispersion)



glm(formula = count/group_size ~ 1, family = binomial, weights = group_size)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.448 1.072 1.072 1.072 1.072
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.2967 0.1799 12.77 <2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 117.76  on 61  degrees of freedom
Residual deviance: 117.76 on 61 degrees of freedom
AIC: 152.12
Number of Fisher Scoring iterations: 5

Estimate fitted values

#### An estimate of the fitted values for each quadrat
#### may be obtained as:
binom.tas$fitted.values # For illustration, the probability
# value is = 0.9086022
#### Note, you could also obtain the probability
#### as: exp(x) / (1+exp(x)), where is the parameter
#### in the model (here, the intercept term)


        1         2         3         4         5         6         7         8 
0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022
9 10 11 12 13 14 15 16
0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022
17 18 19 20 21 22 23 24
0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022
25 26 27 28 29 30 31 32
0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022
33 34 35 36 37 38 39 40
0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022
41 42 43 44 45 46 47 48
0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022
49 50 51 52 53 54 55 56
0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022
57 58 59 60 61 62
0.9086022 0.9086022 0.9086022 0.9086022 0.9086022 0.9086022

Fit Beta-binomial

#### Beta-Binomial fitting
# In order to obtain the estimates for alpha and beta, the
# log-likelihood function needs to be defined. Consult Bolker
#(In Press) for further information regarding the determination
# of the likelihood. For the purpose of this exercise, the
# following function for the log-likelihood is used:
lbetabin = function (data, inits) {
n=data[,1] # This corresponds to the group_size
r=data[,2] # This corresponds to the data on incidence
alpha=inits[1] # This corresponds to the initial
# starting parameter for alpha
beta=inits[2] # This corresponds to the initial
# starting parameter for beta
# Because optim minimizes a function, the
# negative log-likelihood is used
# Also, the factorial is not necessary, as it does not depend
# on the parameters of interest
#### Define initial starting parameters:
# For alpha, use the estimated p from the binomial
# and for beta, use 1-p
# ?optim for further information
# The method used was BFGS = Broyden-Fletcher-
# Goldfarb-Shanno algorithm
# Consult Bolker (In Press) for discussion of
# different methods
optim.tas=optim(inits, lbetabin, method='BFGS',
hessian=T, data=tasmania[,2:3])
# optim.tas$par provided the estimated alpha and beta,
# which will be used to obtain the new p
# optim.tas$value is the estimated log-likelihood value,
# which matches results from Pethybridge using the BBD
# program of Hughes and Madden


$par        1         1 
1.9235961 0.1813362
$value[1] 96.11122
function gradient
24 9
[1] 0
1 1
1 3.30003 -29.37454
1 -29.37454 436.27087

Estimate mean disease incidence

#### An estimate of the mean disease incidence
#### may be obtained as:
new.p = optim.tas$par[1]/
new.p # =0.9138518



Estimate of variation in mean disease incidence

#### An estimate of the variation in pi
#### (mean disease incidence) is:
var.pi = 1/( optim.tas$par[1]+optim.tas$par[2])
var.pi # = 0.4750747



The value, var.pi, represents an estimate of the variation in p and can be thought of as an index of aggregation. When this value is larger than 0, the variance observed in the beta-binomial distribution is larger than for that of the binomial, indicating aggregation.


Next, Conclusion for using spatial analysis in plant pathology