Statistics in R: Poisson Distribution

Statistics in R: Poisson Distribution

2018, Apr 17     alyosha

Let’s use the R and RStudio to learn more about the Poisson distribution!

Downloading the Project

We will be using the ProbabilityDistributions project, which you can download in the form of a ZIP archive ProbabilityDistributions.zip. After you’ve downloaded the archive, proceed by extracting it, go into a newly created ProbabilityDistributions folder and double-click on the ProbabilityDistributions.Rproj icon. This opens up the ProbabilityDistributions project in the RStudio.


This entire paragraph was taken over from the Introduction to Statistical Thought (written by Michael Lavine; published on June 24, 2010).

Poisson observations occur in the following situation:

  • There is a domain of study, usually a block of space or time.
  • Events arise seemingly at random in the domain.
  • There is an underlying rate at which events arise.

Such observations are called Poisson after the 19th century French mathematician Siméon- Denis Poisson. The number of events in the domain of study helps us learn about the rate.

The rate at which events occur is often called λ; the number of events that occur in the domain of study is often called X; we write X ∼ Poi(λ). Important assumptions about Poisson observations are that two events cannot occur at exactly the same location in space or time, that the occurence of an event at location \(l_1\) does not influence whether an event occurs at any other location \(l_2\), and the rate at which events arise does not vary over the domain of study.

Generating Poisson data

In the following simulation of a Poisson distribution, we are going to uniformly pick 200 points from a square of 5000x5000 points. The square can be divided into four quadrants, each having 2500x2500 points. Each of the four quadrants can be thought of as a separate domain of study and the rate with which points are picked from a given domain is \(λ = \cfrac{200}{4} = 50\)

The probability of picking exactly X points from the first quadrant follows a Poisson distribution X ~ Poi(50).

quadrants of the experiment

Let’s create a function summarize.poisson.observations, which takes the following parameters:

  • lambda - a rate with which events occur in a given domain
  • observations - a number of observations to be made

summarize.poisson.observations <- function(lambda, observations) {
  results <- results <- 1:observations %>%
          filter(data.frame(x = sample(5000, size=4*lambda, replace=FALSE), y = sample(5000, size=4*lambda, replace=FALSE)),
                 x <= 2500,
                 y <= 2500)

Now we can use the function to generate 400 observations of X ∼ Poi(50):

summary <- summarize.poisson.observations(50, 400)
resulting.df <- data.frame(o <- summary) # ggplot only works with data frames
names(resulting.df) <- c("o")
#binPDF <- data.frame(x<-1:80, y<-dbinom(1:80, size=80, prob = 0.6))
#names(binPDF) <- c("flips", "prob")

ggplot(resulting.df) + 
  geom_histogram(breaks=seq(20, 80, 2), aes(x=o, y=..density..), position="identity") + 
  geom_density(aes(x=o, y = ..density..), colour="red") # + geom_line(data=binPDF, aes(flips, prob), colour="blue")

binomial probability distribution function

It turns out that R already contains a function which produces the same data as summarize.poisson.observations. The function is called rpois and it repeatedly performs a random draw from a Poisson distribution.

drawsV1 <- summarize.poisson.observations(50, 400)
drawsV2 <- rpois(400, 50)


[1] 49.95
[1] 50.175
[1] 400
[1] 400

Probability Density

The probability of observing exactly 49 events in a given domain, given the rate is 50, can be estimated as

mean(rpois(400, 50) == 49)


[1] 0.05

There is a special function in R which calculates the probability density - it is called dpois:

dpois(49, 50)


[1] 0.05632501

Cumulative Density

The probability of observing at most 45 events in a given domain, given the rate is 50, can be estimated as

mean(rpois(400, 50) <= 45)


[1] 0.2975

There is a special function in R which calculates the cumulative density - it is called ppois:

ppois(45, 50)


[1] 0.2668665


Quantile works as an inverse of the cumulative density function, as illustrated below:

qpois(ppois(45, 50), 50)
[1] 45

Visit Basic Probability Distributions in R for more information.

Quantile-Quantile Plots

As described in the Q-Q Plot Tutorial, The quantile-quantile (q-q) plot is a graphical technique for determining if two data sets come from populations with a common distribution.

The following code compares observations stored in x.sample with those in y.sample:

x.sample <- rpois(400, 50)
y.sample <- rpois(400, 50)
qqplot(x.sample, y.sample)

qq-plot of poisson data

The same test using ggplot2:

nq <- 31
p <- (1 : nq) / nq - 0.5 / nq
ggplot() + geom_point(aes(x = quantile(x.sample, p), y = quantile(y.sample, p))) +
  geom_abline(intercept = 0, slope = 1)

qq-plot of poisson data created with the ggplot2

Do the following to compare the x.sample with a Poisson distribution Poi(50):

xdf <-data.frame(x = x.sample)
ggplot(xdf, aes(sample = x)) + 
  stat_qq(distribution = stats::qpois, dparams = list(lambda=50)) + 
  geom_abline(intercept = 0, slope = 1)

qq-plot of a data sample compared to data generated by qpois

Further Reading