Statistics in R: Binomial Distribution

Statistics in R: Binomial Distribution

2018, Mar 23     alyosha

Let’s use the R and RStudio to learn more about the binomial distribution!

Downloading the Project

We will be using the ProbabilityDistributions project, which you can download in the form of a ZIP archive ProbabilityDistributions.zip. After you’ve downloaded the archive, proceed by extracting it, go into a newly created ProbabilityDistributions folder and double-click on the ProbabilityDistributions.Rproj icon. This opens up the ProbabilityDistributions project in the RStudio.


This entire paragraph was taken over from the Introduction to Statistical Thought (written by Michael Lavine; published on June 24, 2010).

Statisticians often have to consider observations of the following type:

  • A repeatable event results in either a success or a failure.
  • Many repetitions are observed
  • Successes and failures are counted
  • The number of successes helps us learn about the probability of success.

Because binomial experiments are so prevalent there is specialized language to describe them. Each repetition is called a trial; the number of trials is usually denoted N; the unknown probability of success is usually denoted either p or θ; the number of successes is usually denoted X. We write _X ∼ Bin(N, p) _. The symbol “∼” is read is distributed as; we would say “X is distributed as Binomial N p” or “X has the Binomial N, p distribution”.

Some important assumptions about binomial experiments are that N is fixed in advance, θ is the same for every trial, and the outcome of any trial does not influence the outcome of any other trial.

Generating binomial data

Let’s create a function summarize.binomial.observations, which takes the following parameters:

  • trials - a number of trials (N)
  • theta - a probability of success (θ)
  • observations - a number of observations to be made

summarize.binomial.observations <- function(trials, theta, observations) {
  sample.space <- c(1,0)
  results <- 1:observations %>%
        sample(sample.space, size = trials, replace = TRUE, prob = c(theta, 1 - theta))

Now we can use the function to generate 3000 observations of X ∼ Bin(80, 0.6):

summary <- summarize.binomial.observations(80, 0.6, 3000)
resulting.df <- data.frame(flips <- summary) # ggplot only works with data frames
names(resulting.df) <- c("flips")
#binPDF <- data.frame(x<-1:80, y<-dbinom(1:80, size=80, prob = 0.6))
#names(binPDF) <- c("flips", "prob")

ggplot(resulting.df) + 
  geom_histogram(breaks=seq(1, 80, 2), aes(x=flips, y=..density..), position="identity") + 
  geom_density(aes(x=flips, y = ..density..), colour="red")

binomial probability distribution function

It turns out that R already contains a function which produces the same data as summarize.binomial.observations. The function is called rbinom and it repeatedly performs a random draw from a binomial distribution.

drawsV1 <- summarize.binomial.observations(80, 0.6, 3000)
drawsV2 <- rbinom(3000, 80, 0.6)


[1] 48.01467
[1] 48.052
[1] 3000
[1] 3000

Probability Density

The probability of observing exactly 5 heads out of 10 flips can be estimated as

mean(rbinom(1000, 10, .5) == 5)


[1] 0.232

There is a special function in R which calculates the probability density - it is called dbinom:

dbinom(5, 10, .5)


[1] 0.2460938

Cumulative Density

The probability of observing exactly at most 4 heads out of 10 flips can be estimated as

mean(rbinom(1000, 10, .5) <= 4)


[1] 0.392

There is a special function in R which calculates the cumulative density - it is called pbinom:

pbinom(4, 10, .5)


[1] 0.3769531


Quantile works as an inverse of the cumulative density function, as illustrated below:

qbinom(pbinom(4, 10, .5), 10, .5)
[1] 4

Visit Basic Probability Distributions in R for more information.

Quantile-Quantile Plots

As described in the Q-Q Plot Tutorial, The quantile-quantile (q-q) plot is a graphical technique for determining if two data sets come from populations with a common distribution.

The following code compares observations stored in x.sample with those in y.sample:

x.sample <- rbinom(1000, 19, .5)
y.sample <- rbinom(1000, 19, .5)
qqplot(x.sample, y.sample)

qq-plot of binomial data

The same test using ggplot2:

nq <- 31
p <- (1 : nq) / nq - 0.5 / nq
ggplot() + geom_point(aes(x = quantile(x.sample, p), y = quantile(y.sample, p))) +
  geom_abline(intercept = 0, slope = 1)

qq-plot of binomial data created with the ggplot2

Do the following to compare the x.sample with a binomial distribution Bin(19, 0.5):

xdf <-data.frame(x = x.sample)
ggplot(xdf, aes(sample = x)) + 
  stat_qq(distribution = stats::qbinom, dparams = list(size = 19, prob=0.5)) + 
  geom_abline(intercept = 0, slope = 1)

qq-plot of a data sample compared to data generated by qbinom

Further Reading