# Statistics in R: Binomial Distribution

Let’s use the R and RStudio to learn more about the binomial distribution!

- Downloading the Project
- Introduction
- Generating binomial data
- Probability Density
- Cumulative Density
- Quantiles
- Quantile-Quantile Plots
- Further Reading

# Downloading the Project

We will be using the ProbabilityDistributions project, which you can download in the form of a ZIP archive ProbabilityDistributions.zip. After you’ve downloaded the archive, proceed by extracting it, go into a newly created ProbabilityDistributions folder and double-click on the ProbabilityDistributions.Rproj icon. This opens up the ProbabilityDistributions project in the RStudio.

# Introduction

This entire paragraph was taken over from the Introduction to Statistical Thought (written by Michael Lavine; published on June 24, 2010).

Statisticians often have to consider observations of the following type:

- A repeatable event results in either a success or a failure.
- Many repetitions are observed
- Successes and failures are counted
- The number of successes helps us learn about the probability of success.

Because binomial experiments are so prevalent there is specialized language to describe them. Each repetition is called a trial; the number of trials is usually denoted N; the unknown probability of success is usually denoted either *p* or *θ*; the number of successes is usually denoted *X*. We write _X ∼ Bin(N, p) _. The symbol “∼” is read is distributed as; we would say “X is distributed as Binomial N p” or “X has the Binomial N, p distribution”.

Some important assumptions about binomial experiments are that *N* is fixed in advance, *θ* is the same for every trial, and the outcome of any trial does not influence the outcome of any other trial.

# Generating binomial data

Let’s create a function *summarize.binomial.observations*, which takes the following parameters:

- trials - a number of trials (
*N*) - theta - a probability of success (
*θ*) - observations - a number of observations to be made

```
library(tidyverse)
summarize.binomial.observations <- function(trials, theta, observations) {
sample.space <- c(1,0)
results <- 1:observations %>%
map_int(function(x)
as.integer(
sum(
sample(sample.space, size = trials, replace = TRUE, prob = c(theta, 1 - theta))
)
)
)
return(results)
}
```

Now we can use the function to generate 3000 observations of X ∼ Bin(80, 0.6):

```
summary <- summarize.binomial.observations(80, 0.6, 3000)
resulting.df <- data.frame(flips <- summary) # ggplot only works with data frames
names(resulting.df) <- c("flips")
#binPDF <- data.frame(x<-1:80, y<-dbinom(1:80, size=80, prob = 0.6))
#names(binPDF) <- c("flips", "prob")
ggplot(resulting.df) +
geom_histogram(breaks=seq(1, 80, 2), aes(x=flips, y=..density..), position="identity") +
geom_density(aes(x=flips, y = ..density..), colour="red")
```

It turns out that R already contains a function which produces the same data as *summarize.binomial.observations*. The function is called *rbinom* and it repeatedly performs a random draw from a binomial distribution.

```
drawsV1 <- summarize.binomial.observations(80, 0.6, 3000)
drawsV2 <- rbinom(3000, 80, 0.6)
mean(drawsV1)
mean(drawsV2)
length(drawsV1)
length(drawsV2)
```

produces:

```
[1] 48.01467
[1] 48.052
[1] 3000
[1] 3000
```

# Probability Density

The probability of observing exactly 5 heads out of 10 flips can be estimated as

```
mean(rbinom(1000, 10, .5) == 5)
```

result:

```
[1] 0.232
```

There is a special function in R which calculates the probability density - it is called *dbinom*:

```
dbinom(5, 10, .5)
```

yields:

```
[1] 0.2460938
```

# Cumulative Density

The probability of observing exactly at most 4 heads out of 10 flips can be estimated as

```
mean(rbinom(1000, 10, .5) <= 4)
```

result:

```
[1] 0.392
```

There is a special function in R which calculates the cumulative density - it is called *pbinom*:

```
pbinom(4, 10, .5)
```

output:

```
[1] 0.3769531
```

# Quantiles

Quantile works as an inverse of the cumulative density function, as illustrated below:

```
qbinom(pbinom(4, 10, .5), 10, .5)
```

```
[1] 4
```

Visit Basic Probability Distributions in R for more information.

# Quantile-Quantile Plots

As described in the Q-Q Plot Tutorial, The quantile-quantile (q-q) plot is a graphical technique for determining if two data sets come from populations with a common distribution.

The following code compares observations stored in *x.sample* with those in *y.sample*:

```
x.sample <- rbinom(1000, 19, .5)
y.sample <- rbinom(1000, 19, .5)
qqplot(x.sample, y.sample)
abline(0,1)
```

The same test using *ggplot2*:

```
nq <- 31
p <- (1 : nq) / nq - 0.5 / nq
ggplot() + geom_point(aes(x = quantile(x.sample, p), y = quantile(y.sample, p))) +
geom_abline(intercept = 0, slope = 1)
```

Do the following to compare the *x.sample* with a binomial distribution Bin(19, 0.5):

```
xdf <-data.frame(x = x.sample)
ggplot(xdf, aes(sample = x)) +
stat_qq(distribution = stats::qbinom, dparams = list(size = 19, prob=0.5)) +
geom_abline(intercept = 0, slope = 1)
```