## Lesson 41 – Struck by a smooth function

Review lesson 32.

If you assume X is a random variable that represents the number of successes in a Bernoulli sequence of n trials, then this X follows a binomial distribution. The probability that this random variable X takes any value k, i.e., the probability of exactly k successes in n trials is:

Review lesson 33.

If we consider independent Bernoulli trials of 0s and 1s with some probability of occurrence p and assume X to be a random variable that measures the number of trials it takes to see the first success, then, X is said to be Geometrically distributed. The probability of first success in the kth trial is:

Review lesson 36.

The number of times an event occurs (counts) in an interval follows a Poisson distribution. The probability that X can take any particular value P(X = k) is:

The characteristic feature in all these distributions is that the random variable X is discrete. The possible outcomes are distinct numbers, which is why we called them discrete probability distributions.

Have you asked yourself, “what if the random variable X is continuous?” What is the probability that X can take any particular value x on the real number line which has infinite possibilities?

I am going to ask you to draw a ball at random from a box of ten balls. I am also going to ask you “what is the probability of selecting any particular ball?

Your answer will include, “not again,” and “since the balls are all identical, and there are ten in the box, the probability of selecting any particular ball is one-tenth (1/10); P(X = any ball out of ten balls) = 1/10.”

As I am about to ask my next question, you will interrupt me and give me the answer. “And if there are 20 balls, the probability will be 1/20.” You might also say, “spare your next question, because the answer is 1/100, and the visual for increasing number of balls looks like this.”

I am sure you have recognized the pattern here. As the sample size (n) becomes large, the probability of any one value approaches zero. For a continuous random variable, the number of possible outcomes is infinite, hence,

P(X = x) = 0.

For continuous random variables, the probability is defined in an interval between two values. It is computed using continuous probability distribution functions.

If you go back to lesson 15, you will recall how we made frequency plots. We partitioned the real number space into intervals or groups, recorded the number of observations (values) that fall into each group and used this grouping to build stacks.

Based on the number of observations in each interval, we can compute the probability that the random variable will occur in that intervals. For example, if there are ten observations out of 100 observations in a group, we estimate the probability that the variable occurs in this group as 10/100.

For continuous random variables, the proportion of observations in the group approaches the probability of being in the group, and the size of the group (interval range) approaches zero. For a large n, we can imagine a large number of very small intervals.

Is it too abstract?

If so, let’s take some data and observe this behavior.

We will use the same data that we used last week — daily temperature data for New York City. We have this data from 1869 to 2017, a large sample of 54227 values. We can assume that temperature data is a continuous random variable that has infinite possible values on the real number line.

I will take 500 data points at a time and place them on the number line. If there are two or more observations with the same temperature value, I will stack them. Recall that this is how we create histograms, the only difference is that I am not grouping. Each value is independent.

Observe this animation.

As the number of data points (sample size) increases, the stacks get denser and denser with overlaps. The final compact histogram can be approximated using a smooth function – a continuous probability distribution function.

Since for continuous random variables, the proportion of observations in the group approaches the probability of being in the group, the area of the interval block or the area under the curve of the smooth function is the probability that X is in that interval.

Finally, from calculus, you can see that the probability of a continuous variable in an interval a and b is:

An example area computation between -1 and 2 is shown.

The continuous probability distribution functions should obey the property of unit probability.

The limits of the integral are negative to positive infinity.

Now, we can integrate this function, f(x) up to any value x to get the cumulative distribution function (F(x)).

Since cumulative distribution function is the area under the curve up to a value of x, we are essentially computing .

Having this cumulative function is handy for computing the percentiles of the random variable.

Do you remember the concept of percentiles?

We learned in lesson 14 that percentiles are order statistics that can be used to summarize the data. A 75th percentile is that value of x which has 75% of the data less than this number. In other words, F(x) = 0.75.

Can you see how the cumulative distribution function, F(x) =  can be used to compute the percentiles?

Over the next few weeks, we will learn some special types of continuous distribution functions. Since you’ve been struck by smooth functions today, I will invite you to solve this.

If X is a random variable with a probability distribution function defined as

• What is the median of X?
• What is the probability that X is between 0.2 and 0.3?
• What is the probability that X will exceed 0.9?

No medal for solving it, but you don’t have to feel bad if you cannot. Many “modern engineering Ph.D. students” cannot solve this. Basic mathematics is no more a requirement in the new world with diverse backgrounds.

If you find this useful, please like, share and subscribe.

## Lesson 40 – Discrete distributions in R: Part II

The scarfs and gloves come out of the closet.

The neighborhood Starbucks coffee cups change red.

It’s a reminder that autumn is ending.

It’s a reminder that 4 pm is 8 pm.

It’s a reminder that winter is coming.

Today’s temperature in New York is below 30F – a cold November day.

1. Do you want to know what the probability of a cold November day is?
2. Do you want to know what the return period of such an event is?
3. Do you want to know how many such events happened in the last five years?

Get yourself some warm tea. Let the room heater crackle. We’ll dive into rest of the discrete distributions in R.

###### Get the Data

The National Center for Environmental Information (NCEI) archives weather day for most of the United States. I requested temperature data for Central Park, NYC. Anyone can go online and submit requests for data. They will deliver it to your email in your preferred file format. I filtered the data for our lesson. You can get it from here.

Preliminary Analysis

You can use the following line to read the data file into R workspace.

```# Read the temperature data file #

The data has six columns. The first three columns indicate the year, month and day of the record. The fourth, fifth and sixth columns provide the data for average daily temperature, maximum daily temperature, and minimum daily temperature. We will work with the average daily temperature data.

Next, I want to choose the coldest day in November for all the years in the record. For this, I will look through each year’s November data, identify the day with lowest average daily temperature and store it in a matrix. You can use the following lines to get this subset data.

```# Years #
years = unique(temperature_data\$Year) # Identifying unique years in the data
nyrs = length(years) # number of years of data

# November Coldest Day #
november_coldtemp = matrix(NA,nrow=nyrs,ncol=1)

for (i in 1:nyrs)
{
days = which((temperature_data\$Year==years[i]) & (temperature_data\$Month==11)) # index to find november days in each year
november_coldtemp[i,1] = min(temperature_data\$TAVG[days]) # computing the minimum of these days
}```

Notice how I am using the which command to find values.

When I plot the data, I notice that there is a long-term trend in the temperature data. In later lessons, we will learn about identifying trends and their causes. For now, let’s take recent data from 1982 – 2016 to avoid the issues that come with the trend.

```# Plot the time series #
plot(years, november_coldtemp,type="o")
# There is trend in the data #

# Take a subset of data from recent years to avoid issues with trend (for now)-- #
# 1982 - 2016
november_recent_coldtemp = november_coldtemp[114:148]
plot(1982:2016,november_recent_coldtemp,type="o")```
###### Geometric Distribution

In lesson 33, we learned that the number of trials to the first success is Geometric distribution.

If we consider independent Bernoulli trials of 0s and 1s with some probability of occurrence p and assume X to be a random variable that measures the number of trials it takes to see the first success, then, X is said to be Geometrically distributed.

In our example, the independent Bernoulli trials are years. Each year can have a cold November day (if the lowest November temperature in that year is less than 30F) or not.

The probability of occurrence is the probability of experiencing a cold November day. A simple estimate of this probability can be obtained by counting the number of years that had a temperature < 30F and dividing this number by the total sample size. In our restricted example, we chose 35 years of data (1982 – 2016) in which we see ten years with lowest November temperature less than 30F. You can see them in the following table.

Success (Cold November) can happen in the first year, in which case X will be 1. We can see the success in the second year, in which case the sequence will be 01, and X will be 2 and so on.

In R, we can compute the probability P(X=1), P(X=2), etc., using the command dgeom. The Inputs are X and p. Try the following lines to create a visual of the distribution.

```######################### GEOMETRIC DISTRIBUTION #########################

# The real data case #
n = length(november_recent_coldtemp)

cold_years = which(november_recent_coldtemp <= 30)

ncold = length(cold_years)

p = ncold/n

x = 0:n
px = dgeom(x,p)
plot((x+1),px,type="h",xlab="Random Variable X (Cold on kth year)",ylab="Probability P(X=k)",font=2,font.lab=2)
abline(v = (1/p),col="red",lwd=2)
txt1 = paste("probability of cold year (p) = ",round(p,2),sep="")
txt2 = paste("Return period of cold years = E[X] = 1/p ~ 3.5 years",sep="")
text(20,0.2,txt1,col="red",cex=1)
text(20,0.1,txt2,col="red",cex=1)```

Notice the geometric decay in the distribution. It can take X years to see the first success (or the next success from the current success). You must have seen that I have a thick red line at 3.5 years. This is the expected value of the geometric distribution. In lesson 34, we learned that the expected value of the geometric distribution is the return period of the event. On average, how many years does it take before we see the cold year again?

Did we answer the first two questions?

1. Do you want to know what the probability of a cold November day is?
2. Do you want to know what the return period of such an event is?

Suppose we want to compute the probability that the first success will occur within the next five years, we can use the command pgeom for this purpose.

pgeom computes P(X < 5) as P(X = 1) + P(X = 2) + P(X=3) + P(X = 4). Try it for yourself and verify that they both match.

Suppose the probability is higher or lower, how do you think the distribution will change?

For this, I created an animation of the geometric distribution with changing values of p. See how the distribution is wider for smaller values of p and steeper for larger values of p. A high value of p (probability of the cold November year) indicates that the event will occur more often; hence the trials to success are less in number. On the other hand, a smaller value for p suggests that the event will occur less frequently. The number of trials it takes to see the first/next success is more; creating a wider distribution.

Here is the code for creating the animation. We used similar code last week for animating the binomial distribution.

```######## Animation (varying p) #########
# Create png files for Geometric distribution #

png(file="geometric%02d.png", width=600, height=300)

n = 35 # to mimic the sample size for november cold
x = 0:n

p = 0.1
for (i in 1:5)
{
px = dgeom(x,p)

plot(x,px,type="h",xlab="Random Variable X (First Success on kth trial)",ylab="Probability P(X=k)",font=2,font.lab=2)
txt = paste("p=",p,sep="")
text(20,0.04,txt,col="red",cex=2)
p = p+0.2
}
dev.off()

# Combine the png files saved in the folder into a GIF #

library(magick)

frames <- image_morph(c(geometric_png1, geometric_png2, geometric_png3, geometric_png4, geometric_png5), frames = 15)
animation <- image_animate(frames)

image_write(animation, "geometric.gif")```

###### Negative Binomial Distribution

In lesson 35, we learned that the number of trials it takes to see the second success is Negative Binomial distribution. The number of trials it takes to see the third success is Negative Binomial distribution. More generally, the number of trials it takes to see the ‘r’th success is Negative Binomial distribution.

We can think of a similar situation where we ask the question, how many years does it take to see the third cold year from the current cold year. It can happen in year3, year 4, year 5, and so on, following a probability distribution.

You can set this up in R using the following lines of code.

```################ Negative Binomial DISTRIBUTION #########################
require(combinat)

comb = function(n, x) {
return(factorial(n) / (factorial(x) * factorial(n-x)))
}

# The real data case #
n = length(november_recent_coldtemp)

cold_years = which(november_recent_coldtemp <= 30)

ncold = length(cold_years)

p = ncold/n

r = 3 # third cold year

x = r:n

px = NA

for (i in r:n)
{
dum = comb((i-1),(r-1))*p^r*(1-p)^(i-r)
px = c(px,dum)
}

px = px[2:length(px)]

plot(x,px,type="h",xlab="Random Variable X (Third Cold year on kth trial)",ylab="Probability P(X=k)",font=2,font.lab=2)```

There is an inbuilt command in R for Negative Binomial distribution (dnbinom). I chose to write the function myself using the logic of the negative binomial distribution for a change.

The distribution has a mean of 10.5 years. The third cold year can occur approximately on the 10th year on average.

If you are comfortable so far, think about the following questions:

What happens to the distribution if you change r.

What is the probability that the third cold year will occur within seven years?

###### Poisson Distribution

Now let’s address the question: how many such events happened in the last five years?”

In lesson 36, Able and Mumble taught us about the Poisson distribution. We now know that counts, i.e., the number of times an event occurs in an interval follows a Poisson distribution. In our example, we are counting events that occur in time, and the interval is five years. Observe the data table and start counting how many events (red color rows) are there in each five-year span starting from 1982.

From 1982 – 1986, there is one event; 1987 – 1991, there are two events; 1992 – 1996, there is one event; 1997 – 2001, there is one event; 2002 – 2006, there are two events; 2007 – 2011, there is one event; 2012 – 2016, there are two events.

These counts (1, 2, 1, 1, 2, 1, 2) follow a Poisson distribution with an average rate of occurrence of 1.43 per five-years.

The probability that X can take any particular value P(X = k) can be computed using the dpois command in R.

Before we create the probability distribution, here are a few tricks to prepare the data.

Data Rearrangement

We have the data in a single vector. If we want to rearrange the data into a matrix form with seven columns of five years each, we can use the array command.

```# rearrange the data into columns of 5 years #
data_rearrange = array(november_recent_coldtemp,c(5,7))```

This rearrangement will help in computing the number of events for each column.

Counting the number of events

We can write a for loop to count the number of years with a temperature less than 30F for each column. But, R has a convenient function called apply that will perform this same analysis faster.

The apply command can be used to perform any function on the data row-wise, column-wise or both. The user can define the function.

For example, we can count the number of years with November temperature less than 30F for each column using the following one line code.

```# count the number of years in each column with temp < 30
counts = apply(data_rearrange,2,function(x) length(which(x <= 30)))```

The first argument is the data matrix; the second argument “2” indicates that the function has to be applied for the columns (1 for rows); the third argument is the definition of the function. In this case, we are counting the number of values with a temperature less than 30F. This one line code will count the number of events.

The rate of occurrence is the average of these numbers = 1.43 per five-year period.

We are now ready to plot the distribution for counts assuming they follow a Poisson distribution. Use the following line:

`plot(0:5,dpois(0:5,mean(counts)),type="h",xlab="Random Variable X (Cold events in 5 years)",ylab="Probability P(X=k)",font=2,font.lab=2)`

You can now tune the knobs and see what happens to the distribution. Remember that the tuning parameter for Poisson distribution is

I will leave you this week with these thoughts.

If we know the function f(x), we can find out the probability of any possible event from it. If the outcomes are discrete (as we see so far), the function is also discrete for every outcome.

What if the outcomes are continuous?

How does the probability distribution function look if the random variable is continuous where the possibilities are infinite?

Like the various types of discrete distributions, are there different types of continuous distributions?

I reminded you at the beginning that the autumn is ending. I am reminding you now that continuous distributions are coming.

If you find this useful, please like, share and subscribe.

## Lesson 39 – Discrete distributions in R: Part I

It happened again. I got a ticket for parking close to the fire hydrant. Since my first hydrant violation ticket, I have been carrying a measuring tape. I may have miscalculated the curb length this time, or the enforcer’s tape is different than mine. Either way, today’s entry for the Department of Finance’s account is +\$115, and my account is -\$115.

We can’t win this game. Most hydrants in New York City don’t have painted curbs. It is up to the parking enforcer’s expert judgment — also called our fate. We park and hope that our number does not get picked in the lucky draw.

I want to research the fire hydrant violation tickets in my locality. Since we are learning discrete probability distributions, the violation tickets data can serve as a neat example. New York City Open Data has this information. I will use a subset of this data: parking violations on Broadway in precinct 24 in 2017.

I am also only analyzing those unfortunate souls whose vehicles are not registered in New York and who are parked at least seven feet from the hydrant. No excuse for those who park on the hydrant. Here is a look at the 27 instances under the given criteria.

Today’s lesson includes a journey through Bernoulli trials and Binomial distribution in R and how this example fits the description. Let’s start.

###### First Steps

Step 1: Get the data

Step 2: Create a new folder on your computer
Let us call this folder “lesson39”. Save the data file in this folder.

Step 3: Create a new code in R
Create a new code for this lesson. “File >> New >> R script”. Save the code in the same folder “lesson39” using the “save” button or by using “Ctrl+S”. Use .R as the extension — “lesson39_code.R”. Now your R code and the data file are in the same folder.

Step 4: Choose your working directory
“lesson39” is the folder where we stored the data file. Use “setwd(“path”)” to set the path to this folder. Execute the line by clicking the “Run” button on the top right.

`setwd("path to your folder")`

Step 5: Read the data into R workspace
I have the filtered data in the file named “parking_data.txt“. It only contains the date when the ticket is issued. Type the following line in your code and execute it. Use header=TRUE in the command.

```# Read the data to the workspace #
###### Bernoulli Trials

There are two possibilities each day, a ticket (event occurred – success) or no ticket (event did not occur – failure). A yes or a no. These events can be represented as a sequence of 0’s and 1’s (0001100101001000) called Bernoulli trials with a probability of occurrence of p. This probability is constant over all the trials, and the trials are assumed to be independent, i.e., the occurrence of one event does not influence the occurrence of the subsequent event.

In R, we can use the command “rbinom” to create as many outcomes as we require, assuming each trial is independent.

The input arguments are the number of observations we want, the trial (1 in the case of Bernoulli) and the probability p of observing 1s.

```#### Bernoulli Trials ####

# The generalized Case #
p = 0.5 # probability of success -- user defined (using 0.5 here)
rbinom(1,1,p) # create 1 random Bernoulli trial
rbinom(10,1,p) # create 10 random Bernoulli trials
rbinom(100,1,p) # create 100 random Bernoulli trials```

For this example, based on our data, there are 27 parking violation tickets issues in 181 days from January 1, 2017, to June 30, 2017.

Let us first create a binary coding 0 and 1 from the data. Use the following lines to convert the day into a 1 or a 0.

```# Create a Date Series #
days = seq(from=as.Date('2017/1/1'), to=as.Date('2017/6/30'), by="day")

y = as.numeric(format(days,"%y"))
m = as.numeric(format(days,"%m"))
d = as.numeric(format(days,"%d"))

binary_code = matrix(0,nrow=length(m),ncol=1)

for (i in 1:nrow(parking_violations))
{
dummy = which(m == parking_violations[i,1] & d == parking_violations[i,2])

binary_code[dummy,1] = 1
}

plot(days,binary_code,type="h",font=2,font.lab=2,xlab="Days",ylab="(0,1)")```

Assuming each day is a trial where the outcome can be getting a ticket or not getting a ticket, we can estimate the probability of getting a ticket on any day as 27/181 = 0.15.

So, with p = 0.15, we can simulate 181 outcomes (0 if the ticket is not issued or 1 if the ticket is issued) using the rbinom command. An example sequence:

```# For the example case #
n = 1 # it is the number of trials
p = 0.15 # probability of the event
nobs = 181

rbinom(nobs,n,p)

0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0
0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0
1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0
0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0
0 0 0 0 0 0

plot(rbinom(nobs,n,p),type="h",font=2,font.lab=2,xlab="Days",ylab="(0,1)")```

If you simulate such sequences multiple times, on the average, you will see 27 ones.

Run this command multiple times and see how the plot changes. Each run is a new simulation of possible tickets during the 181 days. It occurs randomly, with a probability of 0.15.

###### Slightly advanced plotting using animation and GIF (for those more familiar with R)

In R, you can create animation and GIF based on these simulations. For example, I want to run the “rbinom” command five times and visualize the changing plots as an animation.

We first save the plots as “.png” files and then combine them into a GIF. You will need to install the “animation” and “magick” packages for this. Try the following lines to create a GIF for the changing Bernoulli plot.

```######## Animation #########
# Create png files for Bernoulli sequence #
library(animation)

png(file="bernoulli%02d.png", width=600, height=300)
for (i in 1:5)
{
plot(rbinom(nobs,n,p),type="h",font=2,font.lab=2,xlab="Days",ylab="(0,1)")
}
dev.off()

# Combine the png files saved in the folder into a GIF #

library(magick)

frames <- image_morph(c(bernoulli_png1, bernoulli_png2, bernoulli_png3, bernoulli_png4, bernoulli_png5), frames = 10)
animation <- image_animate(frames)

image_write(animation, "bernoulli.gif")```

This is how the final GIF looks. Each frame is a simulation from the Bernoulli distribution. Depending on what version of R you are using, there may be more straightforward functions to create GIF.

###### Binomial Distribution

From the above, if you are interested in the number of successes (1s) in n trials, this number follows a Binomial distribution. If you assume X is a random variable that represents the number of successes in a Bernoulli sequence of n trials, then this X should follow a binomial distribution.

The number of trials is n = 181. The number of successes (getting a ticket) can be between 0 (if there are no tickets in all the 181 days) or 181
(if there is a ticket issued every day).

In R, the probability that this random variable X takes any value k (between 0 and 181), i.e., the probability of exactly k successes in n trials is computed using the command “dbinom.”

For computing the probability of exactly ten tickets in 181 days we can input:

`px = dbinom(10,181,p)`

For computing the probability of exactly 20 tickets in 181 days we can input:

`px = dbinom(20,181,p)`

To compute this probability for all possible k‘s and visualizing the probability distribution, we can use the following lines:

```n = 181 # define the number of trials
p = 0.15 # probability of the event
x = 0:181 # number of successes varying from 0 to 181

px = dbinom(x,n,p)

plot(x,px,type="h",xlab="Random Variable X (Number of tickets in 181 days)",ylab="Probability P(X=k)",font=2,font.lab=2)```

Do you know why the probability distribution is centered on 27? What is the expected value of a Binomial distribution?

If we want to compute the probability of getting more than five tickets in one month (30 days), we first calculate the probability for k = 6 to 30 (i.e., for exactly 6 tickets in 30 days to exactly 30 tickets in 30 days) with n = 30 to represent 30 trials.

```n = 30 # define the number of trials
p = 0.15 # probability of the event
x = 6:30 # number of successes varying from 6 to 30

px = dbinom(x,n,p)
sum(px)```

We add all these probability since
P(More than 5 in 30) = P(6 in 30) or P(7 in 30 ) or P(8 in 30) …

`P(X > 5 in 30) = P(X = 6 in 30) + P(X=7 in 30) + ... + P(30 in 30).`
###### Slightly advanced plotting using animation and GIF (for those more familiar with R)

I want to experiment with different values of p and check how the probability distribution changes.

I can use the animation and GIF trick to create this visualization.

Run the following lines in your code and see for yourself.

```######## Animation #########
# Create png files for Binomial distribution #

png(file="binomial%02d.png", width=600, height=300)

n = 181
x = 0:181

p = 0.1
for (i in 1:5)
{
px = dbinom(x,n,p)

plot(x,px,type="h",xlab="Random Variable X (Number of tickets in 181 days)",ylab="Probability P(X=k)",font=2,font.lab=2)
txt = paste("p=",p,sep="")
text(150,0.04,txt,col="red",cex=2)
p = p+0.2
}
dev.off()

# Combine the png files saved in the folder into a GIF #

library(magick)

frames <- image_morph(c(binomial_png1, binomial_png2, binomial_png3, binomial_png4, binomial_png5), frames = 15)
animation <- image_animate(frames)

image_write(animation, "binomial.gif")```

You can also try changing the values for n and animate those plots.

We will return next week with more R games for the Geometric distribution, Negative Binomial distribution, and the Poisson distribution.

Meanwhile, did you know that RocketMan owes NYC \$156,000 for unpaid parking tickets?

If you find this useful, please like, share and subscribe.

## Lesson 38 – Correct guesses: The language of Hypergeometric distribution

Now that John prefers Pepsi let’s put him on the spot and ask him to choose it correctly. We will tell him that there is one Pepsi drink. We’ll ask him to chose it correctly when given three drinks.

These are his possible outcomes. He will pick the first, second or third can as Pepsi. If he chooses the first can, his guess is correct. So the probability that John is correctly guessing the one Pepsi can is 1/3. The probability that his guess is wrong is 2/3.

Since he is only choosing one correct Pepsi can, his outcomes are 0 or 1 right guess.

```P(X = 0) = 2/3
P(X = 1) = 1/3```

Let’s say that he liked Pepsi so much that he guessed it correctly in our first test.

We will put him through another test. This time, we will ask him to choose Pepsi correctly, when there are two cans in three.

He is to choose two cans, 1-2, 1-3, or 2-3. His guess outcomes are selecting a combination of two cans that will yield one Pepsi or both Pepsis. The only way to correctly guess both Pepsis is to choose 1-2. The probability is 1/3 (one option among three combinations).

There are two other ways (1-3, 2-3) for him to choose one Pepsi correctly. So the probability is 2/3.

```P(X = 1) = 2/3
P(X = 2) = 1/3```

John is good at this. He correctly guessed both the Pepsi cans. Let’s troll him by saying he got lucky so that he will take on the next test.

This time, he has to choose Pepsi correctly where there are two cans in four.

His picks can be 1-2, 1-3, 1-4, 2-3, 2-4, and 3-4. Six possibilities, since he is choosing two from four – 4C2. Based on his pick, he can get zero Pepsi, one Pepsi or both Pepsis. X = 0, 1, 2.

There is one possibility of both Pepsis; the combination 1-2. The probability that he chooses this combination among all others is 1/6.

`P(X = 2) = 1/6`

Similarly, there is one possibility of no Pepsi, the combination 3-4. The probability that he chooses these cans among all his options is 1/6.

`P(X = 0) = 1/6`

The other guess is choosing one Pepsi correctly. There are two Pepsi cans; he can pick either the first can or the second can as his choice, and then his second can will be the third or the fourth can.

1 with 3
1 with 4

2 with 3
2 with 4

There are 2C1 ways of choosing one Pepsi can among the two Pepsi cans. There are 2C1 ways of picking one Coke can from the two Coke cans.
The total possibilities are 2C1*2C1.

`P(X = 1) = 4/6`

Fortune favors the brave. John again correctly picked 1-2. Ask him to take one more test, and he will ask us to get out.

If you are like me, you want to play out one more scenario to understand the patterns and where we are going with this. So, let’s leave John alone and do this thought experiment ourselves. I will pretend you are John.

I will ask you to correctly pick Pepsis when there are three Pepsi cans among five.

When you pick three cans out of five, only one of them can be Pepsi, two of them can be Pepsi, or all three of them can be Pepsi. So X = 1, 2, 3.
You have a total of 10 combinations; pick three correctly from five. 5C3 ways as shown in this table.

The probability of picking all three Pepsis is 1/10.

`P(X = 3) = 1/10`

Now, let’s work out how many options we have for picking exactly one Pepsi in three cans. It has to be one Pepsi (one correct guess) with two Cokes (two wrong guesses). One Pepsi from three Pepsi cans can be selected in 3C1 ways. Two Cokes from two Cokes can be selected in 2C2 ways making it a total of 3C1*2C2 = 3 options. We can identify these options from our table as 1-4-5, 2-4-5 and 3-4-5. So,

`P(X = 1) = 3/10`

By the same logic, picking exactly two Pepsis in three cans can be done in 3C2*2C1 ways. Two Pepsi cans selected from three and one Coke chosen from two Coke cans. Six ways. Can you identify them in the table?

`P(X = 2) = 6/10`

Let us generalize.

If there are R Pepsi cans in a total of N cans (N-R Cokes) and we are asked to identify them correctly, in our choice selection of R Pepsis, we can get k = 0, 1, 2, … R Pepsis. The probability of correctly selecting k Pepsis is

X, the number of correct guesses (0, 1, 2, …, R) assumes a hypergeometric distribution.

The control parameters of the hypergeometric distribution are N and R.

The probability distribution for N = 5, and R = 2 is given.

When N =5 and R = 3.

In more generalized terms, if there are R Pepsi cans in a total of N cans (N-R Cokes) and we randomly select n cans from this lot of N and define X to be the number of Pepsis in our sample of n, then the distribution of X is hypergeometric distribution. P(X=k) in this sample is then:

The denominator NCn is the number of ways you can select n cans out of a total of N. A sample of n from N. The first term in the numerator is selecting k (correct guesses) out of R Pepsis. The second term is selecting (n-k) (wrong guesses) from the remaining (N-R) Cokes.

I want you to start visualizing how the probability distribution changes for different values of N, R, and n.

Next week, we will learn how to work with all discrete distributions in our computer programming tool R.

Hypergeometric distribution is typically used in quality control analysis for estimating the probability of defective items out of a selected lot.

The Pepsi-Coke marketing analysis is another example application. Companies can analyze the preferences of one product to other among a subset of customers in their region.

Now think of this:

There are R Republican leaning voters in a population of N. For simplicity (and for all practical purposes since LP and GP will not win); the other N-R voters are leaning Democratic. If you select a random sample of n voters, what is the probability that you will have more than k of them voting Republican? You know that the control parameters for this “election forecast” model are N, R, and n. What if you underestimated the number of R leaning voters? What if your sample of n voters is not random?

If you find this useful, please like, share and subscribe.

## Lesson 37 – Still counting: Poisson distribution

The conference table was arranged neatly with a notebook and pen at each chair. Mumble’s Macbook Air is hooked up to the projector at the other end.

He peeps over the misty window. A hazy and rainy outside did not elevate his senses. He will meet Lani from the risk team of California builders. Able invited him to New York to discuss a potential deal on earthquake insurance products.

Mumble goes over the talking points in his mind. He is to discuss the technical details with Able and Lani in the morning. Later, he will present his analysis of product lines and market studies to the executive officers.

Still facing the window, Mumble takes a deep breath, feels confident about his opening jokes, and turns around to greet Able and Lani who just entered the conference room.

Lani initiates the conversation and talks about teamwork and success. Mumble was not impressed. He has heard this teamwork mantra many a time now. Lani came across as an all talk no action guy.

Lani nods and picks up the pen to take some notes.

Able’s eyes rolled over towards Lani. He expected him to interject. Afterall, Lani’s bio says that he is a mathematician.

Lani did not interrupt. He was busy writing down points on his notepad.

Able was again expecting Lani to latch onto the equations.

“This is very good. Spending time on these details is essential. I am looking forward to our successful collaboration Mumble. I love Math too.”

Able and Mumble glanced at each other and politely concealed their emotions.

“Definitely. California Builders are at the forefront in Earthquake risk for housing projects. We can collect data for tremors in California. We work on several models for the benefit of our clients. Our team consists of many mathematicians and engineers. Mumble, I agree with Mr. Able. You did a fantastic job. We can work together on this project to produce risk models.”

Mumble has now confirmed that Lani is all talk no substance. He is just looking to delegate work and take credit.

Able has little patience for these platitudes. He is beginning to realize that Lani is a paper mathematician who may have taken one extra Math course in college and just calls himself a “mathematician” since it sounds “Einsteinian.” The deal with California Builders just became “no deal.”

If you find this useful, please like, share and subscribe.

## Lesson 36 – Counts: The language of Poisson distribution

I want you to meet my friends, Mr. Able and Mr. Mumble.

Unlike me who talks about risk and risk management for a living, they take and manage risk and make a killing.

True to his name, Mr. Mumble is soft-spoken and humble. He is the go-to person for numbers, data, and models. Like many of his contemporaries, he checked off the bucket list given to him and stepped on the ladder to success.

Some of you may be familiar with this bucket list.

Mumble was a “High School STEM intern,” was part of a “High School to College Bridge Program for STEM,” has a “STEM degree,” and completed his “STEM CAREER Development Program.” Heck, he even was called a “STEM Coder” during his brief stint with a computer programming learning center. A few more “STEMs,” and he’d be ready for stem cell research.

These “STEM initiatives” have landed him a risk officer job for an assets insurance company in the financial district.

On the contrary, Mr. Able is your quintessential American pride who has a standard high school education, was able to pay for his college through odd jobs, worked for a real estate company for some time, learned the trade and branched off to start his own business that insures properties against catastrophic risk.

He is very astute and understands the technical aspects involved in his business. Let’s say you cannot throw in some lingo into your presentation and get away without his questions. He is not your “I don’t get the equations stuff” person. He is the BOSS.

Last week, while you were learning the negative binomial distribution, Able and Mumble were planning a new hurricane insurance product. Their company would sell insurance against hurricane damages. Property owners will pay an annual premium to collect payouts in case of damages.

As you know, the planning phase involves discussion about available data and hurricane and damage probabilities. The meeting is in their 61st-floor conference room that oversees the Brooklyn Bridge.

Mumble, is there an update on the hurricane data? Do you have any thoughts on how we can compute the probabilities of a certain number of hurricanes per year?

Mr. Able, the National Oceanic and Atmospheric Administration’s (NOAA) National Hurricane Center archives the data on hurricanes and tropical storms. I could find historical information on each storm, their track history, meteorological statistics like wind speeds, pressures, etc. They also have information on the casualties and damages.

That is excellent. A good starting point. Have you crunched the numbers yet? There must be a lot of these hurricanes this year. I keep hearing they are unprecedented.

Counting Ophelia, we had ten hurricanes this year. Take a look at this table from their website. I am counting hurricanes of all categories. I recall from our last meeting that Hurin will cover all categories. By the way, I never liked the name Hurin for hurricane insurance. It sounds like aspirin.

Don’t worry about the name. Our marketing team has it covered. Funny name branding has its influence. You will learn when you rotate through the sales and marketing team. Tell me about the counts for 2016, 2015, etc. Did you count the number of hurricanes for all the previous years?

Yes. Here are the table and a plot showing the counts for each year from 1996 to 2017.

Based on this 22-year data, we see that the lowest number per year is two hurricanes and the highest number is 15 hurricanes. When we are designing the payout structure, we should have this in mind. Our claim applications will be a function of the number of hurricanes. Can we compute the probability of having more than 15 hurricanes in a year using some distribution?

Absolutely. If we assume hurricane events are independent (the occurrence of one event does not affect the probability that a second event will occur), then the counts per year can be assumed a random variable that follows a probability distribution. Counts, i.e., the number of times an event occurs in an interval follows a Poisson distributionIn our case, we are counting events that occur in time, and the interval is one year.

Let’s say the random variable is X and it can be any value zero hurricanes, one hurricane, two hurricanes, ….. What is the probability that X can take any particular value P(X = k)? What are the control parameters?

Poisson distribution has one control parameter

It is the rate of occurrence; the average number of hurricanes per year. Based on our data, lambda is 7.18 hurricanes per year. The probability P(X = k) for a unit time interval t is

The expected value and the variance of this distribution are both

We can compute the probability of having more than 15 hurricanes in a year by adding P(X = 16) + P(X = 17) + P(X = 18) and so on. Since 15 happened to be in the extreme, the probability will be small, but our risk planning should include it. Extreme events will create a catastrophic damage. I see you have more slides on your deck. Do you also have the probability distribution plotted?

Yes, I have them. I computed the P(X = k) for k = 0, 1, 2, …, 20 and plotted the distribution. It looks like this for = 7.18.

Let me show you one more probability distribution. This one is for storms originating in the western Pacific. They are reported to the Joint Typhoon Warning Center. Since we also insure assets in Asia, this data and the probability estimates will be useful to design premiums and payouts there. The rate of events is higher in Asia; an average of 14.95 typhoons per year. The maximum number of typhoons is 21.

Very impressive Mumble. You have the foresight to consider different scenarios.

As the meeting comes to closure, Mr. Able is busy checking his emails on the phone. A visibly jubilant Mumble sits in his chair and collects the papers from the table. He is happy for having completed a meeting with Mr. Able without many questions. He is already thinking of his evening drink.

The next meeting is in one week. Just as Mr. Able gets up to leave the conference room, he pauses and looks at Mumble.

“Why is it called Poisson distribution? How is this probability distribution different from the Binomial distribution? Didn’t you say in a previous meeting that exactly one landfall in the next four hurricanes is binomial?”

Mumble gets cold feet. His mind already switched over to the drinks after the last slide; he couldn’t come up with a quick answer. As he begins to mumble, Able gets sidetracked with a phone call. “See you next week Mumble.” He leaves the room.

Mumble gets up and watches over the window — bright sunny afternoon. He refills his coffee mug, takes a sip and reflects on the meeting and the question.

To be continued…

If you find this useful, please like, share and subscribe.

## Lesson 35 – Trials to ‘r’th success: The language of Negative Binomial distribution

We all know that the trials to the first success is a Geometric distribution. It can take one trial, two trials, three trials, etc., to see the first success. These trials are assumed a random variable X = {1, 2, 3, … }; they have a probability, i.e., P(X = 1), P(X = 2), P(X = 3), and so on.

However, Mr. Gardner needs more success. He has already sold his first “time machine” (aka bone density scanner). He’d have to sell more. He is looking for his second success, third success and so forth to get through.

The number of trials it takes to see the second success is Negative Binomial distribution.

The number of trials it takes to see the third success is Negative Binomial distribution.

The number of trials it takes to see the ‘r’th success is Negative Binomial distribution.

Are you paying attention to the pattern here? Negative Binomial distribution is Geometric distribution if r is 1 (trials to first success).

The Geometric distribution has one control parameter, p, the probability of success.

Since we are interested in more than the first success, r is another parameter in the Negative Binomial distribution. Together, p and r determine how the distribution looks.

Let’s take the example of Mr. Gardner. “I’d have to sell one more.” Each of his visit to a doctor’s office is a trial. They will buy the bone density scanner or show him the door. So Mr. Gardner is working with a probability of success, p. Let’s say that p is 0.25, i.e., there is a 25% chance that he will succeed in selling it.

Let’s assume that he sold one machine. He is looking for his second sale. r = 2.

His second success can occur in the second trial, the third trial, the fourth trial, etc. When r = 2, X, the random variable of the number of trials will be X = {2, 3, 4, …}.

When r = 3, X = {3, 4, 5, …}.

You correctly guessed my next line. When r = 1, i.e., for Geometric distribution, X = {1, 2, 3, …}.

There is a pattern. We are learning a distribution which is an extension of Geometric distribution.

Now let us compute the probability that X can take any integer value.

P(X = 2) is the probability that he makes his second sale (second success) on the second trial.

Remember the trials are independent. The second doctor’s decision is not dependent on what happened before. He buys or not with a 0.25 probability.

So, P(X = 2) is 0.25*0.25 = 0.0625. The first trial is a success and the second trial is a success.

P(X = 3) is the probability that he makes his second sale on the third trial. It means he must have made his first sale in either the first trial or the second trial, and then he makes his second sale on the third trial.

The probability of succeeding second time on the third trial is the probability of succeeding once in two trials, and the third trial is a success.

`P(1 success in 2 trials) * P(3rd is a success)`

The probability of the one success in two trials is computed using the Binomial distribution.

`2C1*p^1*(1-p)^2-1`

This probability is multiplied by p, the probability of success in the third trial.

`P(X = 3) = 2C1*p^1*(1-p)^2-1*p`

If this is your expression now, 😕 let’s take another case to clear up the concept.

Suppose we want P(X = 6), the probability of making the second sale on the sixth trial.

This will happen in the following way.

Mr. Gardner has to sell one machine in five trials. P(1 in 5), one success in five trials → Binomial.

`5C1*p^1(1-p)^(5-1)`

Then, the sixth trial is a sell. So we multiply the above binomial probability with p, the hit in the sixth.

You should have noticed the origin of the name → Negative Binomial.

To generalize,

I computed these probabilities for X ranging from 2 to 50. Here is the probability distribution. Remember r = 2 and p = 0.25.

Now I change the value for r to 3 and 5 to see how the Negative Binomial distribution looks. r = 3 means Mr. Gardner will sell his third machine.

r = 3

r = 5

Notice how the probability distribution shifts with increasing values of r.

The control parameters are r and p.

You can try different values of p and see what happens. For a fixed value of p and changing values of r, the tail is getting bigger and bigger. What does it mean regarding the number trials and their probability for Mr. Gardner?

Think about this. If he sets a target of selling three machines in the day, what is the probability that he can achieve his goal within 20 doctor visits?

What if he needs to sell five machines within this 20 visits to make his ends meet. Can he make it?

If you try out the probability distribution plots for p = 0.5, you will see that his chance of selling the fifth machine within 20 visits goes up tremendously. So perhaps he should learn the six principles of influence and persuasion to increase p, the probability of saying yes by the doctors.

People mostly prefer to say yes to the request of someone they know and like. I want our blog to have the 9000th user within nine months. See, I am requesting in the language of Negative Binomial distribution.

If you find this useful, please like, share and subscribe.

## Lesson 34 – I’ll be back: The language of Return Period

The average time between Arnold Schwarzenegger’s being back is the return period of his stunt.

I have to wait 5 minutes for my next bus. Some days I wait for 1 minute; some days, I wait for 15 minutes. The wait time is variable → random variable 😉 The average of these wait times is the return period of my bus.

Your recent vocabulary may include “100-year event” (happening more often), (drainage system designed for) “10-year storm,” and so on, courtesy mainstream media and news outlets.

Houston drainage grid ‘so obsolete it’s just unbelievable’

What exactly is this return period business?

Does a 10-year return period event occur diligently every ten years?

Can a 100-year event occur three times in a row?

###### THE LOGIC

Let’s visit annual maximum rainfall for Houston. If we take the daily rainfall data for each year from January 1 to December 31 and choose the maximum rainfall among these days, we call it annual maximum rainfall for that year. So this is the rainfall for the wettest day of the year. Likewise, if we do this for all the years that we have data for, we get a data series (also called time series since we are recording this in time units).

You can get the data from here if you like. You may have to register using your email, but its free. We have 79 years of recorded data from 1939 to 2017. 79 data points, one number per year as the rainfall for the wettest day in that year. You will see that five years are missing between 1942 and 1946.

I want you to understand that these numbers represent a random variable X. Each number (outcome) is assumed to be independent, i.e., the occurrence of one event in one year does not influence the occurrence of the subsequent event. In other words, 2017 rainfall does not depend on 2016 rainfall.

Now, I want you to see Brays Bayou, the lake that detains excess rainfall in Houston. Let us assume that it can store up to eight inches of rainfall on any day. If it rains more than eight inches in a day, the Bayou will overflow and cause flood — as we saw in Houston during hurricane Harvey.

So, if the rainfall is greater than eight inches, we define this as an event. Let us call him Bob. The first time we see Bob was in 1949. We started recording data in 1939. Bob happened after 11 years. The wait time for Bob (1949) is 11 years.

Then we get on with our lives, 11 years passed, Bob is not back, 22 years passed, no sign of Bob. Suddenly, after 30 years of waiting from 1949, Bob Strikes Back (1979).

Two years after this event happened, Bob wanted to greet the Millenials, so he came back in 1981. This time, the waiting period is only two years.

Then, in 1989, for no particular reason, Bob returns. The return of Bob (1989) is after eight years.

You must be thinking: “I don’t see any pattern here.” Yes, that is because there is none.

Years pass, Bob seems to be resting. At the turn of the century, Bob decided to come back. So Bob Meets the 21st Century in 2001 after 12 years since his prior occurrence. Bob re-occurs. Recurrence.

During the first decade of the 21st century, Bob re-occurs two times, once in 2006 as the Restless Bob (5-year wait time) and again in 2008 as Miss Me Yet, Bob (2-year wait time).

We all know what happened after that. Vengeant Bob (2017), aka Harvey, happened after nine years.

Now, let’s summarize all Bobs along with their recurrence times. We started with the assumption that the maximum rainfall events represent a random variable X. Let us define T as another random variable that measures the time between the event Bob (wait time or time to the next event or time to the first event since the previous event).

The return period of the event Bob, (X > 8 inches) is the expected value of T, i.e., E[T], its average measured over a large number of such occurrences.

As you can see here, in the table, the return period of Bob is approximately ten years. Bob is a 10-year return period event.

Another way of thinking about this: Since there are eight Bob events in 79 years, they occur at an average rate of 79/8. Approximately, once in 10 years. Hence originated the 10-year event concept.

Remember, they don’t happen cyclically every ten years. If we average the wait times of a lot of events, we will get approximately ten years.

Just like when you wait for the bus, you wait for short time or a long time, but you think of the average time you wait for a bus everyday, you can see events happening in a cluster or spaced out, but all average to an nyear return period.

###### The relation to Geometric distribution

Last week when we learned Geometric distribution, I told you that we would relate the expected value of the Geometric distribution to return period of an event. Let’s see how Bob relates to Geometric distribution.

I want you to convert the maximum rainfall data series into a series of independent Bernoulli trials of 0s and 1s. 0 if the rainfall is < eight inches (No Bob), 1 if the rainfall is > eight inches (Yes Bob). The 1s can occur with some probability of occurrence p. In our example, since we have 8 Bobs in 79 years the probability of occurrence p = 8/79 = 0.101.

Now, assume T to be a random variable that measures the number of trials (years) it takes to see the first success (event), or the next event from each such event. For the first event, Bob (1949), it took 11 years to occur. The probability that T = 11, P(T = 11) is (1 – p)^10*p. Similarly, the next Bob happened after 30 years and so on. T is the time to first success (next success) → Geometrically distributed.

We can derive the expected value of T using the expectation operation we learned in lesson 24.

Now, recall from your math classes that the expression inside the parenthesis looks like a power series. Ponder over it and confirm that the whole expression will reduce to

`E[T] = 1/p`

The expected value of the wait time that is Geometrically distributed is the inverse of the probability of the event. Since the probability of Bob is 0.101, the return period (expected value of the wait times) is 1/0.101 ~ ten years. A 10-year return period event.

###### The Question

We measured the probability over 79 years; n = 79. We assumed that the probability is constant over all the trials.

In other words, we are assuming that we know p and it does not change.

If I were writing this lesson last year, the probability would have been 7/78 = 0.089. Since Harvey (The Vengeant Bob), the probability became 8/79 = 0.101. There are also five missing years.

Perhaps we do not know the true value of p, and perhaps it is not constant.

How then, can you estimate the risk of anything? How then, can you predict anything? How then, can you design anything?

If I haven’t confused you enough, let me end with one of my favorite quotes from Nicholas Taleb’s book Antifragile: Things that gain from disorder.

“It is hard to explain to naive data-driven people that risk is in the future, not in the past.”

If you find this useful, please like, share and subscribe.

## Lesson 33 – Trials to first success: The language of Geometric distribution

So goes the legendary story: Bruce, Robert I, the King of Scotland, defeated the English armies on his seventh trial. He bore six successive defeats before that.

Giants are yet to win their first game of the season. Their record: L, L, … I wonder how many games until their first win.

The last deadly hurricane that hit New York City is Sandy in 2012. We are now five years through without such a dangerous event.

The common thread tying the three examples is the number of trials to the first success. This is the language of the Geometric distribution.

If we consider independent Bernoulli trials of 0s and 1s with some probability of occurrence p and assume X to be a random variable that measures the number of trials it takes to see the first success, then, X is said to be Geometrically distributed.

We can get the success on the first trial, in which case X will be 1. We can see the success on the second trial, in which case the sequence will be 01, and X will be 2. We can see the success on the third trial; the sequence will be 001 and X will be 3 and so forth. As you can guess, X = {1, 2, 3, … }, positive integers.

There is some probability that X can take any integer value. We should also figure out this probability, i.e., P(X = 1), P(X = 2), P(X = 3), and so on.

Let us take the example of a coin toss. The outcomes are head or tail. Binary outcomes → Bernoulli trial. The probability p of a head or tail is 0.5. In other words, if you toss a coin a large number of times, say 100, roughly 50 of them will be heads, and 50 of them will be tails. Let’s play. Heads you win, tails you lose.

Great, you win on the first trial. The probability of seeing head is 0.5. Hence, P(X = 1) = 0.5.

Let’s play again.

Ah, this time the first outcome is a tail and the second outcome is a head. You lose on the first trial but win on the second. It took two trials to wins. X = 2. P(X=2) is P(tail on the first toss)*P(head on the second toss) = 0.5*0.5 = 0.25. Why did we multiply? What is P(A and B) for independent events?

One more time.

Now it took three trials to win. X = 3, and P(X = 3) = P(tail on the first toss)*P(tail on the second toss)*P(head on the third toss) = 0.5*0.5*0.5 = 0.125.

For X = 4, it will be P(tail on the first toss)*P(tail on the second toss)*P(tail on the third toss)*P(head on the fourth toss) = 0.5*0.5*0.5*0.5 = 0.0625.

If we now plot X and P(X = k), k being 1, 2, 3, 4, …, we get a probability distribution like this.

The height of the line at X = 2 is 0.5 times the height of the line at X = 1. In the same way, the height of the line at X = 3 is 0.5 times the height of the line at X = 2 and so on. P(X=k) decreases in a geometric progression. Hence the name Geometric distribution.

We can generalize this for any probability p. In our game, we estimated P(X = 1) as 0.5, i.e., the probability of seeing a head p. P(X=2) is 0.5*0.5, i.e., (1-p)*p. P(X = 10) = (1-p)^9*p. First success on the tenth toss is nine tails followed by a head.

More generally,

We can derive the expected value and variance of X as:

The expected value of a Geometric distribution relates to a special concept called return period → we will look at it next week.

Meanwhile, here are some more geometric probability distributions with different values of p.

###### p = 0.9

Notice how the shape changes with changing values of p. p is the parameter that controls the shape of the distribution. The greater the value for p, the steeper the fall.

If the probability of success is close to 1, the odds of winning in the first few trials is high → notice the height of the line for p = 0.9. If the probability of success is close to 0, it takes several trials to get to greater odds of winning overall.

Have you now conceptualized the idea of geometric distribution?

Let me challenge you to a bet then.

I have a coin toss game where I give you two times your bet if you win; you get nothing if you lose. Assume we have a fair coin, would you play the game with me and bet your money? If you will, then what is your strategy, assuming you are in it to win.

Since I challenged you to a bet, I also looked into some lottery games myself at nylottery.ny.gov.

###### First observation

The odds of winning first prize in any of the games is next to 0. So if you plan to keep buying the tickets until you win the first time, and then retire, you now know that you will keep buying forever.

The chances of winning per game get better for lower prize levels. For example, in the Mega Million, if you want to win the ninth prize, the odds are 1 in 21. Still low, and will take a long time to win.

But, who wants to win the ninth prize. It is like saying “America Ninth.”

###### Second observation

I keep wondering why on earth is New York Government running a lottery business … only to reconcile that “bread and circuses” have always been up the state’s sleeves to expand.

If you find this useful, please like, share and subscribe.

## Lesson 32 – Exactly k successes: The language of Binomial distribution

I may be in one of those transit buses. Since I moved to New Jersey, I am going through this mess every day.

Well, you wanted to enjoy Manhattan skyline. It has a price tag.

D, glad you are here. It’s been a while. In our last meeting, we were discussing the concepts of variance operation and its properties. I continue to read your lessons every week. As I paused and reflected on all the lessons, I noticed that there is a systematic approach to them. You started with the basics of sets and probability, introduced lessons on visualizing, summarizing and comparing data using various statistics, then extended those ideas into random variables and probability distributions. The readership seems to have grown considerably, and people are tweeting about our classroom. Have you reached 25000 pageviews yet?

We are at 24350 pageviews now. We will certainly hit the 25k mark today 😉 I am thankful to all the readers for their time. Special thanks to all those who are spreading the word. Our classroom is a great resource for anyone starting data analysis.

So, whats on the show today?

As you correctly pointed out, we are now slowly getting into various types of probability distributions. I mentioned in lesson 31 that we would learn several discrete probability distributions that are based on Bernoulli trials. We start this adventure with Binomial distribution.

Great. Let me refresh my memory of probability distributions before we get started. We discussed the basics of probability distribution in lesson 23. Let’s assume X is a random variable, and P(X = x) is the probability that this random variable takes any value x (i.e., an outcome). Then, the distribution of these probabilities on a number line, i.e., the probability graph is called the probability distribution function f(x) for a random variable. We are now looking at various mathematical forms for this f(x).

Fantastic. Now imagine you have a Bernoulli sequence of yes or no.

Sure. It is a sequence of 0s and 1s with a probability p; 0 if the trial yields a no (failure, or event not happening) and 1 if the trial yields a yes (success, or event happening). Something like this: 00101001101

From this sequence, if you are interested in the number of successes (1s) in n trials, this number follows a Binomial distribution. If you assume X is a random variable that represents the number of successes in a Bernoulli sequence of n trials, then this X should follow a binomial distribution. The probability that this random variable X takes any value k, i.e., the probability of exactly k successes in n trials is:

The expected value of this random variable, E[X] = np, and the variance V[X] = np(1-p).

😯 Wow, that’s a fastball. Can we parse through the lingo?

Oops… Okay, let us take the example of your daily commute. Imagine buses and cars pass through the tunnel each morning. Can you guesstimate the probability of buses?

Yeah, I usually see more buses than cars in the morning. Let’s say the likelihood of seeing a bus is p=0.7.

Now let us imagine that buses and cars come in a Bernoulli sequence. Assign a 1 if it is a bus, and 0 if it is a car.

That is reasonable. The vehicle passage is usually random. If we take that as a Bernoulli sequence, there will be some 1s and some 0s with a 0.7 probability of occurrence. In the long run, you will have 70% buses and 30% cars in any order.

Correct. Now think about this. In the next four vehicles that pass through the tunnel, how many of them will be buses?

Since there is randomness in the sequence, in the next four vehicles, I can say, all of them may be buses, or none of them will be buses or any number in between.

Exactly. The number of buses in a sequence of 4 vehicles can be 0, 1, 2, 3 or 4. These are the random variables represented by X. In other words, if X is the number of buses in 4 vehicles coming at random, then X can take 0, 1, 2, 3 or 4 as the outcomes. The probability distribution of X is binomial.

I understand how we came up with X. Why is the probability distribution of X called binomial?

It originates from the idea of the binomial coefficient that you may have learned in an elementary math/combinations class. Let us continue with our logical deduction to see how the probability is derived, and you will see why.

Sure. We have X as 0, 1, 2, 3 and 4. We should calculate the probability P(X = 0), P(X = 1), P(X = 2), P(X = 3) and P(X = 4). This will give us the distribution of the probabilities.

Take an example, say 2. Let us compute P(X = 2). The probability of seeing exactly two buses in 4 vehicles. The probability of exactly k successes in n trials. If the buses and cars come in a Bernoulli sequence (1 for bus and 0 for a car) with a probability p, in how many ways can you see two buses out of 4 vehicles?

Ah, I see where we are going with this. Let me list out the possibilities. Two buses in four vehicles can occur in six ways. 0011, 0101, 1100, 1010, 1001, 0110. In each of these six possible sequences, there will be exactly two buses among four vehicles. I remember from my combinations class that this is four choose two. Four factorial divided by the product of two factorial and (four minus two) factorial. 4C2 = 4!/4!(4-2)!

For each possibility, the probability of that sequence can also be written down. Let me make a table like this:

You can see from the table that there are six possibilities. Any of the possibilities, 1 or 2 or 3 or 4 or 5 or 6 can occur. Hence, the probability of seeing two in four is the sum of these probabilities. Remember P(A or B) = P(A) + P(B). If you follow through this, you will get, 6*p*p*(1-p)*(1-p). = 6*p^2*(1-p)^(4-2). Can you see where the formula for binomial distribution comes from?

Absolutely. For each outcome of X, i.e., 0, 1, 2, 3 and 4, we should apply this logic/formula and compute the probability of the outcome. Let me finish it and make a plot.

Very nicely done. Let me jump in here and show you another plot with a different n and p. If p = 0.5 (equal probability) and n = 100; this is how the binomial distribution looks like.

Nice. It looks like an inverted bell centered around 50.

Yeah. You noticed that the distribution is centered around 50. It is the expected value of the distribution. Remember E[X] is the central tendency of the distribution. For binomial, you can derive it as np = 100 (0.5) = 50. In the same way, the variance, i.e. spread of the function around this center is np(1-p) = 100(0.5)(0.5) = 25. Or standard deviation is 5. You can see that the distribution is spread out within three standard deviations from the center. Can you now imagine how the distribution will look like for p = 0.3 or p = 0.7?

Following the same logic, those distributions will be centered on 100*0.3 = 30 and 100*0.7 = 70 with their variance. Now it all makes sense.

You see how easy it is when you go through the logic. We started with Bernoulli sequence. When we are interested in the random variable that is the number of successes in so many trials, it follows a binomial distribution. Exactly k successes” is the language of Binomial distribution. Can you think of any other examples that can be modeled as a binomial distribution?

Probability that Derek Jeter, with a batting average of 0.3, gets three hits out of the three times he comes to bat 😆  This is fun. I am glad I learned some useful concepts out of the messy commute experience. By the way, Exactly one landfall in the next four hurricanes is also binomial. With Jose coming up, I wonder if we can compute the probability of damage for New York City based on the probability of landfall.

Don’t worry Joe. Our Mayor is graciously implementing his comprehensive \$20 billion resiliency plan. NYC is safe now. Forget probability of damage. You need to worry about the probability of bankruptcy.

If you find this useful, please like, share and subscribe.