Lesson 39 – Discrete distributions in R: Part I

It happened again. I got a ticket for parking close to the fire hydrant. Since my first hydrant violation ticket, I have been carrying a measuring tape. I may have miscalculated the curb length this time, or the enforcer’s tape is different than mine. Either way, today’s entry for the Department of Finance’s account is +$115, and my account is -$115.

We can’t win this game. Most hydrants in New York City don’t have painted curbs. It is up to the parking enforcer’s expert judgment — also called our fate. We park and hope that our number does not get picked in the lucky draw.

I want to research the fire hydrant violation tickets in my locality. Since we are learning discrete probability distributions, the violation tickets data can serve as a neat example. New York City Open Data has this information. I will use a subset of this data: parking violations on Broadway in precinct 24 in 2017.

I am also only analyzing those unfortunate souls whose vehicles are not registered in New York and who are parked at least seven feet from the hydrant. No excuse for those who park on the hydrant. Here is a look at the 27 instances under the given criteria.

Today’s lesson includes a journey through Bernoulli trials and Binomial distribution in R and how this example fits the description. Let’s start.

First Steps

Step 1: Get the data
You can download the filtered data file here.

Step 2: Create a new folder on your computer
Let us call this folder “lesson39”. Save the data file in this folder.

Step 3: Create a new code in R
Create a new code for this lesson. “File >> New >> R script”. Save the code in the same folder “lesson39” using the “save” button or by using “Ctrl+S”. Use .R as the extension — “lesson39_code.R”. Now your R code and the data file are in the same folder.

Step 4: Choose your working directory
“lesson39” is the folder where we stored the data file. Use “setwd(“path”)” to set the path to this folder. Execute the line by clicking the “Run” button on the top right.

setwd("path to your folder")

Step 5: Read the data into R workspace
I have the filtered data in the file named “parking_data.txt“. It only contains the date when the ticket is issued. Type the following line in your code and execute it. Use header=TRUE in the command.

# Read the data to the workspace #
parking_violations = read.table("parking_data.txt",header=T)
Bernoulli Trials

There are two possibilities each day, a ticket (event occurred – success) or no ticket (event did not occur – failure). A yes or a no. These events can be represented as a sequence of 0’s and 1’s (0001100101001000) called Bernoulli trials with a probability of occurrence of p. This probability is constant over all the trials, and the trials are assumed to be independent, i.e., the occurrence of one event does not influence the occurrence of the subsequent event.

In R, we can use the command “rbinom” to create as many outcomes as we require, assuming each trial is independent.

The input arguments are the number of observations we want, the trial (1 in the case of Bernoulli) and the probability p of observing 1s.

#### Bernoulli Trials ####

# The generalized Case #
p = 0.5 # probability of success -- user defined (using 0.5 here)
rbinom(1,1,p) # create 1 random Bernoulli trial
rbinom(10,1,p) # create 10 random Bernoulli trials
rbinom(100,1,p) # create 100 random Bernoulli trials

For this example, based on our data, there are 27 parking violation tickets issues in 181 days from January 1, 2017, to June 30, 2017.

Let us first create a binary coding 0 and 1 from the data. Use the following lines to convert the day into a 1 or a 0.

# Create a Date Series #
days = seq(from=as.Date('2017/1/1'), to=as.Date('2017/6/30'), by="day")

y = as.numeric(format(days,"%y"))
m = as.numeric(format(days,"%m"))
d = as.numeric(format(days,"%d"))

binary_code = matrix(0,nrow=length(m),ncol=1)

for (i in 1:nrow(parking_violations))
{
 dummy = which(m == parking_violations[i,1] & d == parking_violations[i,2])
 
 binary_code[dummy,1] = 1
}

plot(days,binary_code,type="h",font=2,font.lab=2,xlab="Days",ylab="(0,1)")

Assuming each day is a trial where the outcome can be getting a ticket or not getting a ticket, we can estimate the probability of getting a ticket on any day as 27/181 = 0.15.

So, with p = 0.15, we can simulate 181 outcomes (0 if the ticket is not issued or 1 if the ticket is issued) using the rbinom command. An example sequence:

# For the example case #
n = 1 # it is the number of trials 
p = 0.15 # probability of the event 
nobs = 181

rbinom(nobs,n,p)

0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0
0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0
1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 
0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 
0 0 0 0 0 0

plot(rbinom(nobs,n,p),type="h",font=2,font.lab=2,xlab="Days",ylab="(0,1)")

If you simulate such sequences multiple times, on the average, you will see 27 ones.

Run this command multiple times and see how the plot changes. Each run is a new simulation of possible tickets during the 181 days. It occurs randomly, with a probability of 0.15.

Slightly advanced plotting using animation and GIF (for those more familiar with R)

In R, you can create animation and GIF based on these simulations. For example, I want to run the “rbinom” command five times and visualize the changing plots as an animation.

We first save the plots as “.png” files and then combine them into a GIF. You will need to install the “animation” and “magick” packages for this. Try the following lines to create a GIF for the changing Bernoulli plot.

######## Animation #########
# Create png files for Bernoulli sequence #
library(animation)

png(file="bernoulli%02d.png", width=600, height=300)
for (i in 1:5)
 {
 plot(rbinom(nobs,n,p),type="h",font=2,font.lab=2,xlab="Days",ylab="(0,1)")
 }
dev.off()

# Combine the png files saved in the folder into a GIF #

library(magick)

bernoulli_png1 <- image_read("bernoulli01.png","x150")
bernoulli_png2 <- image_read("bernoulli02.png","x150")
bernoulli_png3 <- image_read("bernoulli03.png","x150")
bernoulli_png4 <- image_read("bernoulli04.png","x150")
bernoulli_png5 <- image_read("bernoulli05.png","x150")

frames <- image_morph(c(bernoulli_png1, bernoulli_png2, bernoulli_png3, bernoulli_png4, bernoulli_png5), frames = 10)
animation <- image_animate(frames)

image_write(animation, "bernoulli.gif")

This is how the final GIF looks. Each frame is a simulation from the Bernoulli distribution. Depending on what version of R you are using, there may be more straightforward functions to create GIF.

Binomial Distribution

From the above, if you are interested in the number of successes (1s) in n trials, this number follows a Binomial distribution. If you assume X is a random variable that represents the number of successes in a Bernoulli sequence of n trials, then this X should follow a binomial distribution.

The number of trials is n = 181. The number of successes (getting a ticket) can be between 0 (if there are no tickets in all the 181 days) or 181
(if there is a ticket issued every day).

In R, the probability that this random variable X takes any value k (between 0 and 181), i.e., the probability of exactly k successes in n trials is computed using the command “dbinom.”

For computing the probability of exactly ten tickets in 181 days we can input:

px = dbinom(10,181,p)

For computing the probability of exactly 20 tickets in 181 days we can input:

px = dbinom(20,181,p)

To compute this probability for all possible k‘s and visualizing the probability distribution, we can use the following lines:

n = 181 # define the number of trials 
p = 0.15 # probability of the event 
x = 0:181 # number of successes varying from 0 to 181

px = dbinom(x,n,p)

plot(x,px,type="h",xlab="Random Variable X (Number of tickets in 181 days)",ylab="Probability P(X=k)",font=2,font.lab=2)

Do you know why the probability distribution is centered on 27? What is the expected value of a Binomial distribution?

If we want to compute the probability of getting more than five tickets in one month (30 days), we first calculate the probability for k = 6 to 30 (i.e., for exactly 6 tickets in 30 days to exactly 30 tickets in 30 days) with n = 30 to represent 30 trials.

n = 30 # define the number of trials 
p = 0.15 # probability of the event 
x = 6:30 # number of successes varying from 6 to 30

px = dbinom(x,n,p)
sum(px)

We add all these probability since
P(More than 5 in 30) = P(6 in 30) or P(7 in 30 ) or P(8 in 30) …

P(X > 5 in 30) = P(X = 6 in 30) + P(X=7 in 30) + ... + P(30 in 30).
Slightly advanced plotting using animation and GIF (for those more familiar with R)

I want to experiment with different values of p and check how the probability distribution changes.

I can use the animation and GIF trick to create this visualization.

Run the following lines in your code and see for yourself.

######## Animation #########
# Create png files for Binomial distribution #

png(file="binomial%02d.png", width=600, height=300)

n = 181
x = 0:181

p = 0.1
for (i in 1:5)
{
 px = dbinom(x,n,p)
 
 plot(x,px,type="h",xlab="Random Variable X (Number of tickets in 181 days)",ylab="Probability P(X=k)",font=2,font.lab=2)
 txt = paste("p=",p,sep="")
 text(150,0.04,txt,col="red",cex=2)
 p = p+0.2
}
dev.off()

# Combine the png files saved in the folder into a GIF #

library(magick)

binomial_png1 <- image_read("binomial01.png","x150")
binomial_png2 <- image_read("binomial02.png","x150")
binomial_png3 <- image_read("binomial03.png","x150")
binomial_png4 <- image_read("binomial04.png","x150")
binomial_png5 <- image_read("binomial05.png","x150")

frames <- image_morph(c(binomial_png1, binomial_png2, binomial_png3, binomial_png4, binomial_png5), frames = 15)
animation <- image_animate(frames)

image_write(animation, "binomial.gif")

You can also try changing the values for n and animate those plots.

We will return next week with more R games for the Geometric distribution, Negative Binomial distribution, and the Poisson distribution.

Meanwhile, did you know that RocketMan owes NYC $156,000 for unpaid parking tickets?

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

error

Enjoy this blog? Please spread the word :)