
Statistical Distributions

Statistical distributions are often used as a tool to simplify comparisons between groups, such as comparing the mean and variance of plant biomass under different experimental treatments. R includes four types of functions for working with several distributions useful in biological modeling (a brief example follows the list below):

  1. random generation of observations from the distribution,
  2. returning values of the probability density,
  3. returning percentiles (cumulative probabilities) for the distribution, and
  4. returning quantiles of the distribution.
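
For the normal distribution, for example, these correspond to the functions rnorm, dnorm, pnorm, and qnorm. The brief sketch below simply illustrates the naming pattern; each function is explored in more detail in the commands that follow.

# Each distribution uses the same four prefixes: r, d, p, and q
rnorm(n=5, mean=0, sd=1)    # random observations
dnorm(0, mean=0, sd=1)      # density at a value
pnorm(0, mean=0, sd=1)      # probability of falling below a value
qnorm(0.5, mean=0, sd=1)    # value below which a given proportion falls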

One of the most commonly used distributions is the normal distribution (also known as the Gaussian distribution, or the bell-shaped curve). The normal distribution is described completely by two parameters, the mean and the standard deviation. The following commands illustrate some properties of the normal distribution.

# Illustrate a normal distribution 
# with mean 0 and standard deviation 1
# by drawing 100,000 random observations from the distribution
hist(rnorm(n=100000,mean=0,sd=1))
# If your computer is fast enough
# you can quickly draw 10,000,000
# (or more) random observations and revel in the power
hist(rnorm(n=10000000,mean=0,sd=1))
# You can get an idea of why small sample
# sizes only supply limited
# information about a process by
# sampling just a few random
# observations at a time,
# while keeping the x axis range constant
hist(rnorm(n=3, mean=0, sd=1), xlim=c(-4,4),
xlab='Observation', main='Example')
# The mfrow argument to par adjusts how many
# graphs are illustrated at once
par(mfrow=c(2, 3))
# A loop can be used to illustrate several
# samples at once from the normal
# distribution with mean 0 and standard deviation 1
for(j in 1:6){
hist(rnorm(n=3, mean=0, sd=1), xlim=c(-4, 4),
xlab='Observation', main='Example')
}
# Compare the normal distributions
# for a range of different parameters
hist(rnorm(n=100000, mean=0, sd=1), main='Mean=0, SD=1',
xlab='Observation', xlim=c(-40, 140))
hist(rnorm(n=100000, mean=0, sd=2), main='Mean=0, SD=2',
xlab='Observation', xlim=c(-40, 140))
hist(rnorm(n=100000, mean=0, sd=10), main='Mean=0, SD=10',
xlab='Observation', xlim=c(-40, 140))
hist(rnorm(n=100000, mean=100, sd=1), main='Mean=100, SD=1',
xlab='Observation', xlim=c(-40, 140))
hist(rnorm(n=100000, mean=100, sd=2), main='Mean=100, SD=2',
xlab='Observation', xlim=c(-40, 140))
hist(rnorm(n=100000, mean=100, sd=10), main='Mean=100, SD=10',
xlab='Observation', xlim=c(-40, 140))
# For reference, generate a single histogram
# of a normal distribution
# with mean 0 and standard deviation 1 again
par(mfrow=c(1, 1))
hist(rnorm(n=100000, mean=0, sd=1), main='Mean=0, SD=1',
xlab='Observation')
# The density of the random distribution
# at any given value indicates
# how likely values in that range are
# What is the value of the density at the mean, 0?
dnorm(0, mean=0, sd=1)
# What is the value of the density
# for an unlikely part of the range, 4?
dnorm(4, mean=0, sd=1)
# The values of the density don't
# provide the probability of those
# values per se, but they give an idea
# of the relative probability
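# (As a rough illustration of relative probability: values near 0
# are roughly 3,000 times more likely than values near 4)
dnorm(0, mean=0, sd=1) / dnorm(4, mean=0, sd=1)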
# The percentile provides information
# about the probability that
# values from a normal distribution
# fall below a particular value
# What is the probability that values fall below -1?
pnorm(-1, mean=0, sd=1)
# What is the probability that values fall below 0?
# Or, phrased differently,
# what proportion of values fall below 0?
pnorm(0, mean=0, sd=1)
# In statistical analyses, a normal distribution may be used
# to represent potential observations under a null hypothesis.
# If observations fall far in the tails of the distribution, this
# suggests that the null hypothesis is not true. For example,
# the distribution might represent the mean difference between
# responses to two treatments, where the mean difference is 0
# under the null hypothesis of no difference.
# For this normal distribution, what is the probability
# of an observation below -1.96?
pnorm(-1.96, mean=0, sd=1)
# What is the probability that an observation falls above 1.96?
1 - pnorm(1.96, mean=0, sd=1)
# Note that these are frequently used cut-offs, such that the
# probability an observation falls either below -1.96 or
# above 1.96 adds up to 0.05.
# Here 0.05 represents an acceptably small probability that the
# extreme mean difference could be observed for a sample even
# if the mean difference is actually 0.
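# (As a check, the two tail probabilities together are
# approximately 0.05)
pnorm(-1.96, mean=0, sd=1) + (1 - pnorm(1.96, mean=0, sd=1))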
# The quantile function allows you to ask
# a related but different type of question
# Below what value does half the distribution fall?
qnorm(0.5, mean=0, sd=1)
# Below what value does 2.5% of the distribution fall?
qnorm(0.025, mean=0, sd=1)
# Above what value does 2.5% of the distribution fall?
qnorm((1-0.025), mean=0, sd=1)
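# Note that pnorm and qnorm are inverses of each other;
# feeding a quantile back into pnorm returns the original probability
pnorm(qnorm(0.025, mean=0, sd=1), mean=0, sd=1)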

 

Try repeating the same types of analyses for another distribution, the binomial, using the functions rbinom, dbinom, pbinom, and qbinom. The binomial distribution has two parameters, the probability of success, prob, and the number of trials, size. The binomial distribution is often illustrated in terms of a series of flips of a coin, where success might be defined in terms of the coin flip yielding 'heads', and where prob=0.5 if the coin is fair. The binomial distribution can only take on non-negative integer values, in contrast to the normal distribution, which includes any value from -∞ to ∞.

hist(rbinom(n=100000, size=10, prob=.5))
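# A few further examples, in the same spirit as the commands
# for the normal distribution above (try varying the arguments)
# What is the probability of exactly 5 heads in 10 flips of a fair coin?
dbinom(5, size=10, prob=0.5)
# What is the probability of 5 or fewer heads in 10 flips?
pbinom(5, size=10, prob=0.5)
# How many heads or fewer occur in half of all sets of 10 flips?
qbinom(0.5, size=10, prob=0.5)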

 

The Use of Statistical Distributions in Epidemiology

Applications of several distributions in ecology and epidemiology are illustrated using R in other Plant Health Instructor documents.

The exponential model can be used to describe disease progress over time or the declining probability of disease with increasing distance from an inoculum source.
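
As a minimal sketch (with an arbitrary illustrative rate parameter), the exponential density declines steadily with increasing distance:

# Exponential decline with distance from an inoculum source
# (rate=0.5 is only an illustrative value)
curve(dexp(x, rate=0.5), from=0, to=10,
xlab='Distance from source', ylab='Density')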

The uniform distribution can be used to generate a random number across a continuous range in which every value is equally likely. Drawing from a uniform distribution is used for selecting initial coordinates of infections and for selecting directions for dispersal in a spatial epidemic simulator.
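
For example (with purely illustrative ranges), runif might supply random coordinates within a square region and a random dispersal direction:

# Random x and y coordinates within a 0-100 unit square
runif(n=2, min=0, max=100)
# A random dispersal direction in degrees
runif(n=1, min=0, max=360)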

The binomial distribution can be used to describe the number of ‘successes’ out of a fixed number of trials. For example, it can be used to generate rain events in a disease simulation.
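
For instance (with illustrative values), whether or not rain occurs on each day of a simulated month could be generated as a series of single-trial binomial draws:

# Rain (1) or no rain (0) on each of 30 days,
# with probability 0.2 of rain on any given day
rbinom(n=30, size=1, prob=0.2)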

Example: The Beta-binomial Distribution for Analysis of Spatial Pattern

The binomial distribution can also be used to describe the number of diseased plants within a sampling area. For example, if a sampling quadrat contains size=40 plants and each plant has probability prob=0.1 of being infected, then the distribution of the number of plants infected within a quadrat is:

hist(rbinom(n=100000, size=40, prob=0.1),
main='Number infected')
# This could also be expressed
# in terms of the proportion
# of plants infected by dividing by size=40
hist(rbinom(n=100000, size=40, prob=0.1)/40,
main='Proportion infected')
# On average, what proportion of quadrats
# will have 10% or fewer plants
# infected (4 out of 40 or fewer)?
pbinom(4, size=40, prob=0.1)

 

But when estimates of the proportion of plants infected per quadrat are collected in the field, the binomial distribution may not always provide a good fit. For example, if disease is aggregated, the numbers of quadrats with high disease incidence and with low disease incidence will both be higher than would be expected for a binomial distribution. One way to deal with this lack of fit, and to provide a parameter that describes the degree of aggregation, is to fit disease incidence in quadrats using a beta-binomial distribution (Hughes and Madden 1993).

For the binomial distribution, the probability parameter is constant within any given realization. (That is, varying the probability parameter results in a different distribution.) For the beta-binomial distribution, the probability parameter is not constant but instead is described by the beta distribution.

The beta distribution is a very flexible distribution describing probabilities between 0 and 1 with two shape parameters. R provides four different types of functions for describing the beta distribution and a range of other distributions, as illustrated for the normal distribution above.

hist(rbeta(n=100000, shape1=1, shape2=1))
 
# The next command allows you to plot
# figures in 2 rows and 3 columns
par(mfrow=c(2,3))
# Try plotting several parameter
# combinations to compare the
# distributions for values 0.5, 1, 2, 3, and 4
# for each shape parameter
hist(rbeta(n=100000, shape1=1, shape2=1))
# Now repeat the hist command with different parameters
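# For example, a few of the possible combinations:
hist(rbeta(n=100000, shape1=0.5, shape2=0.5))
hist(rbeta(n=100000, shape1=2, shape2=4))
hist(rbeta(n=100000, shape1=4, shape2=1))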
 
# We can generate beta-binomial
# random variables in R by using
# beta random variables as parameters
# in the binomial function
par(mfrow=c(1, 1))
hist(rbinom(n=100000, size=10,
prob=rbeta(n=100000, shape1=1, shape2=3)))

 

The shape1 parameter is often notated as α and the shape2 parameter as β. Then θ = 1/(α + β) measures the variability in the number of events, where events might be, for example, the number of infected plants per quadrat. So, as the sum of the two shape parameters becomes larger, the variability decreases. You can see that this makes sense because as the shape parameters become larger, the beta distribution becomes more centralized.
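
For example, θ is much smaller for the second parameter combination compared below than for the first:

# theta = 1/(shape1 + shape2) for the two parameter sets below
1/(2 + 1)
1/(26 + 30)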

hist(rbeta(n=100000, shape1=2, shape2=1),
main='shape1=2, shape2=1',
xlim=c(0, 1))
# …as compared to…
hist(rbeta(n=100000, shape1=26, shape2=30),
main='shape1=26, shape2=30',
xlim=c(0, 1))

 

Since the beta distribution supplies the probability parameter in the beta-binomial distribution, the less variable the beta distribution is, the less variable the beta-binomial will be.

hist(rbinom(n=100000, size=10,
prob=rbeta(n=100000, shape1=2, shape2=1)),
main='shape1=2, shape2=1')
# …as compared to…
hist(rbinom(n=100000, size=10,
prob=rbeta(n=100000, shape1=26, shape2=30)),
main='shape1=26, shape2=30')
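# (As a quick numerical check, compare the sample variances
# of the two simulated beta-binomial samples)
var(rbinom(n=100000, size=10, prob=rbeta(n=100000, shape1=2, shape2=1)))
var(rbinom(n=100000, size=10, prob=rbeta(n=100000, shape1=26, shape2=30)))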

 

Next: Statistical Analyses