×

Pastinak

VařeníMath

Pastinak

Hunger&Data

Statistics in R: Poisson Distribution

Statistics in R: Poisson Distribution

2018, Apr 17     alyosha

Let’s use the R and RStudio to learn more about the Poisson distribution!

Downloading the Project

We will be using the ProbabilityDistributions project, which you can download in the form of a ZIP archive ProbabilityDistributions.zip. After you’ve downloaded the archive, proceed by extracting it, go into a newly created ProbabilityDistributions folder and double-click on the ProbabilityDistributions.Rproj icon. This opens up the ProbabilityDistributions project in the RStudio.

Introduction

This entire paragraph was taken over from the Introduction to Statistical Thought (written by Michael Lavine; published on June 24, 2010).

Poisson observations occur in the following situation:

  • There is a domain of study, usually a block of space or time.
  • Events arise seemingly at random in the domain.
  • There is an underlying rate at which events arise.

Such observations are called Poisson after the 19th century French mathematician Siméon- Denis Poisson. The number of events in the domain of study helps us learn about the rate.

The rate at which events occur is often called λ; the number of events that occur in the domain of study is often called X; we write X ∼ Poi(λ). Important assumptions about Poisson observations are that two events cannot occur at exactly the same location in space or time, that the occurence of an event at location \(l_1\) does not influence whether an event occurs at any other location \(l_2\), and the rate at which events arise does not vary over the domain of study.

Generating Poisson data

In the following simulation of a Poisson distribution, we are going to uniformly pick 200 points from a square of 5000x5000 points. The square can be divided into four quadrants, each having 2500x2500 points. Each of the four quadrants can be thought of as a separate domain of study and the rate with which points are picked from a given domain is \(λ = \cfrac{200}{4} = 50\)

The probability of picking exactly X points from the first quadrant follows a Poisson distribution X ~ Poi(50).

quadrants of the experiment

Let’s create a function summarize.poisson.observations, which takes the following parameters:

  • lambda - a rate with which events occur in a given domain
  • observations - a number of observations to be made
library(tidyverse)

summarize.poisson.observations <- function(lambda, observations) {
  results <- results <- 1:observations %>%
    map_int(function(x)
      as.integer(
        count(
          filter(data.frame(x = sample(5000, size=4*lambda, replace=FALSE), y = sample(5000, size=4*lambda, replace=FALSE)),
                 x <= 2500,
                 y <= 2500)
        )
      )
    )
  return(results)
}

Now we can use the function to generate 400 observations of X ∼ Poi(50):

summary <- summarize.poisson.observations(50, 400)
resulting.df <- data.frame(o <- summary) # ggplot only works with data frames
names(resulting.df) <- c("o")
#binPDF <- data.frame(x<-1:80, y<-dbinom(1:80, size=80, prob = 0.6))
#names(binPDF) <- c("flips", "prob")

ggplot(resulting.df) + 
  geom_histogram(breaks=seq(20, 80, 2), aes(x=o, y=..density..), position="identity") + 
  geom_density(aes(x=o, y = ..density..), colour="red") # + geom_line(data=binPDF, aes(flips, prob), colour="blue")

binomial probability distribution function

It turns out that R already contains a function which produces the same data as summarize.poisson.observations. The function is called rpois and it repeatedly performs a random draw from a Poisson distribution.

drawsV1 <- summarize.poisson.observations(50, 400)
drawsV2 <- rpois(400, 50)
mean(drawsV1)
mean(drawsV2)
length(drawsV1)
length(drawsV2)

produces:

[1] 49.95
[1] 50.175
[1] 400
[1] 400

Probability Density

The probability of observing exactly 49 events in a given domain, given the rate is 50, can be estimated as

mean(rpois(400, 50) == 49)

result:

[1] 0.05

There is a special function in R which calculates the probability density - it is called dpois:

dpois(49, 50)

yields:

[1] 0.05632501

Cumulative Density

The probability of observing at most 45 events in a given domain, given the rate is 50, can be estimated as

mean(rpois(400, 50) <= 45)

result:

[1] 0.2975

There is a special function in R which calculates the cumulative density - it is called ppois:

ppois(45, 50)

output:

[1] 0.2668665

Quantiles

Quantile works as an inverse of the cumulative density function, as illustrated below:

qpois(ppois(45, 50), 50)
[1] 45

Visit Basic Probability Distributions in R for more information.

Quantile-Quantile Plots

As described in the Q-Q Plot Tutorial, The quantile-quantile (q-q) plot is a graphical technique for determining if two data sets come from populations with a common distribution.

The following code compares observations stored in x.sample with those in y.sample:

x.sample <- rpois(400, 50)
y.sample <- rpois(400, 50)
qqplot(x.sample, y.sample)
abline(0,1)

qq-plot of poisson data

The same test using ggplot2:

nq <- 31
p <- (1 : nq) / nq - 0.5 / nq
ggplot() + geom_point(aes(x = quantile(x.sample, p), y = quantile(y.sample, p))) +
  geom_abline(intercept = 0, slope = 1)

qq-plot of poisson data created with the ggplot2

Do the following to compare the x.sample with a Poisson distribution Poi(50):

xdf <-data.frame(x = x.sample)
ggplot(xdf, aes(sample = x)) + 
  stat_qq(distribution = stats::qpois, dparams = list(lambda=50)) + 
  geom_abline(intercept = 0, slope = 1)

qq-plot of a data sample compared to data generated by qpois

Further Reading