October 2017 – dataanalysisclassroom

Lesson 38 – Correct guesses: The language of Hypergeometric distribution

Now that John prefers Pepsi, let’s put him on the spot and ask him to choose it correctly. We will tell him that there is one Pepsi drink. We’ll ask him to choose it correctly when given three drinks.

These are his possible outcomes. He will pick the first, second or third can as Pepsi. Say the first can is the Pepsi; if he chooses it, his guess is correct. So the probability that John correctly guesses the one Pepsi can is 1/3. The probability that his guess is wrong is 2/3.

Since he is only choosing one correct Pepsi can, his outcomes are 0 or 1 right guess.

P(X = 0) = 2/3
P(X = 1) = 1/3

Let’s say that he liked Pepsi so much that he guessed it correctly in our first test.

We will put him through another test. This time, we will ask him to choose Pepsi correctly when there are two Pepsi cans among three.

He is to choose two cans: 1-2, 1-3, or 2-3. Say cans 1 and 2 are the Pepsis. His guess outcomes are selecting a combination of two cans that will yield one Pepsi or both Pepsis. The only way to correctly guess both Pepsis is to choose 1-2. The probability is 1/3 (one option among three combinations).

There are two other ways (1-3, 2-3) for him to choose one Pepsi correctly. So the probability is 2/3.

P(X = 1) = 2/3
P(X = 2) = 1/3

John is good at this. He correctly guessed both the Pepsi cans. Let’s troll him by saying he got lucky so that he will take on the next test.

This time, he has to choose Pepsi correctly when there are two Pepsi cans among four.

His picks can be 1-2, 1-3, 1-4, 2-3, 2-4, and 3-4. Six possibilities, since he is choosing two from four – 4C2. Based on his pick, he can get zero Pepsi, one Pepsi or both Pepsis. X = 0, 1, 2.

There is one possibility of both Pepsis; the combination 1-2. The probability that he chooses this combination among all others is 1/6.

P(X = 2) = 1/6

Similarly, there is one possibility of no Pepsi, the combination 3-4. The probability that he chooses these cans among all his options is 1/6.

P(X = 0) = 1/6

The other guess is choosing one Pepsi correctly. There are two Pepsi cans; he can pick either the first can or the second can as his choice, and then his second can will be the third or the fourth can.

1 with 3
1 with 4

2 with 3
2 with 4

There are 2C1 ways of choosing one Pepsi can among the two Pepsi cans. There are 2C1 ways of picking one Coke can from the two Coke cans.
The total possibilities are 2C1*2C1 = 4, out of the six combinations.

P(X = 1) = 4/6
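The two-in-four test is small enough to check by brute force. Here is a minimal Python sketch, assuming (as in the text) that cans 1 and 2 are the Pepsis. The variable names are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

cans = [1, 2, 3, 4]
pepsi = {1, 2}  # assumption from the text: cans 1 and 2 are the Pepsis

# Every way John can pick two cans out of four (4C2 = 6 picks).
picks = list(combinations(cans, 2))

# Count how many picks contain 0, 1 or 2 Pepsis.
counts = Counter(len(pepsi & set(pick)) for pick in picks)

# Each pick is equally likely, so probability = count / total picks.
probs = {k: Fraction(counts[k], len(picks)) for k in sorted(counts)}

for k in sorted(counts):
    print(f"P(X = {k}) = {counts[k]}/{len(picks)}")
```

Listing the picks and counting works for a handful of cans; the counting rule with combinations is what scales to larger problems.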

Fortune favors the brave. John again correctly picked 1-2. Ask him to take one more test, and he will ask us to get out.

If you are like me, you want to play out one more scenario to understand the patterns and where we are going with this. So, let’s leave John alone and do this thought experiment ourselves. I will pretend you are John.

I will ask you to correctly pick Pepsis when there are three Pepsi cans among five.

When you pick three cans out of five, only one of them can be Pepsi, two of them can be Pepsi, or all three of them can be Pepsi. So X = 1, 2, 3.
You have a total of 10 combinations; pick three from five, 5C3 ways, as shown in this table (cans 1, 2 and 3 are the Pepsis).

1-2-3   1-2-4   1-2-5   1-3-4   1-3-5
1-4-5   2-3-4   2-3-5   2-4-5   3-4-5

The probability of picking all three Pepsis is 1/10.

P(X = 3) = 1/10

Now, let’s work out how many options we have for picking exactly one Pepsi in three cans. It has to be one Pepsi (one correct guess) with two Cokes (two wrong guesses). One Pepsi from three Pepsi cans can be selected in 3C1 ways. Two Cokes from two Cokes can be selected in 2C2 ways making it a total of 3C1*2C2 = 3 options. We can identify these options from our table as 1-4-5, 2-4-5 and 3-4-5. So,

P(X = 1) = 3/10

By the same logic, picking exactly two Pepsis in three cans can be done in 3C2*2C1 ways. Two Pepsi cans selected from three and one Coke chosen from two Coke cans. Six ways. Can you identify them in the table?

P(X = 2) = 6/10
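The counting rule used here (choose the Pepsis, then choose the Cokes, then multiply) can be sketched in a few lines of Python. The constant and function names are my own, chosen for illustration.

```python
from math import comb

PEPSI, COKE, PICKS = 3, 2, 3  # three Pepsis among five cans, pick three


def ways(k):
    """Number of picks with exactly k Pepsis and (PICKS - k) Cokes."""
    return comb(PEPSI, k) * comb(COKE, PICKS - k)


total = comb(PEPSI + COKE, PICKS)  # 5C3 = 10

for k in (1, 2, 3):
    print(f"P(X = {k}) = {ways(k)}/{total}")
```

Note that `ways(0)` comes out as zero because `comb(2, 3)` is zero: there are only two Cokes, so three wrong guesses are impossible.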

Let us generalize.

If there are R Pepsi cans in a total of N cans (N-R Cokes) and we are asked to identify them correctly, in our selection of R cans, we can get k = 0, 1, 2, …, R Pepsis. The probability of correctly selecting k Pepsis is

$$P(X = k) = \frac{\binom{R}{k}\binom{N-R}{R-k}}{\binom{N}{R}}$$

where $\binom{R}{k}$ is the RCk count used above.

X, the number of correct guesses (0, 1, 2, …, R), follows a hypergeometric distribution.

The control parameters of the hypergeometric distribution are N and R.

The probability distribution for N = 5 and R = 2 is given.

When N = 5 and R = 3.

In more generalized terms, if there are R Pepsi cans in a total of N cans (N-R Cokes) and we randomly select n cans from this lot of N and define X to be the number of Pepsis in our sample of n, then X follows a hypergeometric distribution. P(X = k) in this sample is then:

$$P(X = k) = \frac{\binom{R}{k}\binom{N-R}{n-k}}{\binom{N}{n}}$$

The denominator NCn, written $\binom{N}{n}$, is the number of ways you can select n cans out of a total of N: a sample of n from N. The first term in the numerator is selecting k (correct guesses) out of R Pepsis. The second term is selecting (n-k) (wrong guesses) from the remaining (N-R) Cokes.
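To see how the distribution changes with N, R, and n, the formula can be written as a small Python function. This is a sketch using only the standard library; the function names `hypergeom_pmf` and `distribution` are my own.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(k, N, R, n):
    """P(X = k): k Pepsis in a sample of n cans drawn from N cans, R of them Pepsi."""
    if k < 0 or k > n:
        return Fraction(0)
    return Fraction(comb(R, k) * comb(N - R, n - k), comb(N, n))


def distribution(N, R, n):
    """All probabilities P(X = 0), ..., P(X = n) as exact fractions."""
    return [hypergeom_pmf(k, N, R, n) for k in range(n + 1)]


# The two guessing games from the text, where the sample size n equals R.
for N, R in [(5, 2), (5, 3)]:
    dist = distribution(N, R, n=R)
    print(f"N = {N}, R = {R}:", ", ".join(str(p) for p in dist))
```

Exact fractions are used so the results can be compared directly with the hand counts above. Switching to floating point is a one-line change if you want to plot the values.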

I want you to start visualizing how the probability distribution changes for different values of N, R, and n.

Next week, we will learn how to work with all discrete distributions in our computer programming tool R.

Hypergeometric distribution is typically used in quality control analysis for estimating the probability of defective items out of a selected lot.

The Pepsi-Coke marketing analysis is another example application. Companies can analyze the preference for one product over another among a subset of customers in their region.

Now think of this:

There are R Republican-leaning voters in a population of N. For simplicity (and for all practical purposes, since LP and GP will not win), the other N-R voters are leaning Democratic. If you select a random sample of n voters, what is the probability that you will have more than k of them voting Republican? You know that the control parameters for this “election forecast” model are N, R, and n. What if you underestimated R, the number of Republican-leaning voters? What if your sample of n voters is not random?
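One way to explore the first two questions is to sum the hypergeometric probabilities above k and repeat the calculation for a few values of R. The numbers below (N = 1000, n = 100, k = 50 and the three values of R) are hypothetical, chosen only for illustration; they do not come from any poll or from this lesson.

```python
from math import comb


def hypergeom_pmf(k, N, R, n):
    """P(X = k) Republican-leaning voters in a random sample of n from N."""
    return comb(R, k) * comb(N - R, n - k) / comb(N, n)


def prob_more_than(k, N, R, n):
    """P(X > k): sum the probabilities for k+1, ..., n."""
    return sum(hypergeom_pmf(j, N, R, n) for j in range(k + 1, n + 1))


# Hypothetical numbers for illustration only (not from the lesson).
N, n, k = 1000, 100, 50
for R in (450, 480, 500):
    print(f"R = {R}: P(X > {k}) = {prob_more_than(k, N, R, n):.4f}")
```

Comparing the rows shows how sensitive the answer is to the assumed R. The last question, a sample that is not random, is outside what this formula can tell you, since the hypergeometric model assumes every sample of n is equally likely.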

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 37 – Still counting: Poisson distribution

The conference table is arranged neatly with a notebook and pen at each chair. Mumble’s MacBook Air is hooked up to the projector at the other end.

He peers through the misty window. The hazy, rainy weather outside does not lift his spirits. He will meet Lani from the risk team of California Builders. Able invited him to New York to discuss a potential deal on earthquake insurance products.

Mumble goes over the talking points in his mind. He is to discuss the technical details with Able and Lani in the morning. Later, he will present his analysis of product lines and market studies to the executive officers.

Still facing the window, Mumble takes a deep breath, feels confident about his opening jokes, and turns around to greet Able and Lani who just entered the conference room.

Lani initiates the conversation and talks about teamwork and success. Mumble was not impressed. He has heard this teamwork mantra many a time now. Lani came across as an all-talk, no-action guy.

Lani nods and picks up the pen to take some notes.

Able’s eyes rolled over towards Lani. He expected him to interject. After all, Lani’s bio says that he is a mathematician.

Lani did not interrupt. He was busy writing down points on his notepad.

Able was again expecting Lani to latch onto the equations.

“This is very good. Spending time on these details is essential. I am looking forward to our successful collaboration, Mumble. I love Math too.”

Able and Mumble glanced at each other and politely concealed their emotions.

“Definitely. California Builders are at the forefront in Earthquake risk for housing projects. We can collect data for tremors in California. We work on several models for the benefit of our clients. Our team consists of many mathematicians and engineers. Mumble, I agree with Mr. Able. You did a fantastic job. We can work together on this project to produce risk models.”

Mumble has now confirmed that Lani is all talk, no substance. He is just looking to delegate work and take credit.

Able has little patience for these platitudes. He is beginning to realize that Lani is a paper mathematician who may have taken one extra Math course in college and just calls himself a “mathematician” since it sounds “Einsteinian.” The deal with California Builders just became “no deal.”


Lesson 36 – Counts: The language of Poisson distribution

I want you to meet my friends, Mr. Able and Mr. Mumble.

Unlike me, who talks about risk and risk management for a living, they take and manage risk and make a killing.

True to his name, Mr. Mumble is soft-spoken and humble. He is the go-to person for numbers, data, and models. Like many of his contemporaries, he checked off the bucket list given to him and stepped on the ladder to success.

Some of you may be familiar with this bucket list.

Mumble was a “High School STEM intern,” was part of a “High School to College Bridge Program for STEM,” has a “STEM degree,” and completed his “STEM CAREER Development Program.” Heck, he even was called a “STEM Coder” during his brief stint with a computer programming learning center. A few more “STEMs,” and he’d be ready for stem cell research.

These “STEM initiatives” have landed him a risk officer job for an assets insurance company in the financial district.

In contrast, Mr. Able is your quintessential American pride: he has a standard high school education, paid for his college through odd jobs, worked for a real estate company for some time, learned the trade, and branched off to start his own business that insures properties against catastrophic risk.

He is very astute and understands the technical aspects involved in his business. Let’s say you cannot throw some lingo into your presentation and get away without facing his questions. He is not your “I don’t get the equations stuff” person. He is the BOSS.

Last week, while you were learning the negative binomial distribution, Able and Mumble were planning a new hurricane insurance product. Their company would sell insurance against hurricane damages. Property owners will pay an annual premium to collect payouts in case of damages.

As you know, the planning phase involves discussion about available data and hurricane and damage probabilities. The meeting is in their 61st-floor conference room that overlooks the Brooklyn Bridge.

Mumble, is there an update on the hurricane data? Do you have any thoughts on how we can compute the probabilities of a certain number of hurricanes per year?

Mr. Able, the National Oceanic and Atmospheric Administration’s (NOAA) National Hurricane Center archives the data on hurricanes and tropical storms. I could find historical information on each storm, their track history, meteorological statistics like wind speeds, pressures, etc. They also have information on the casualties and damages.

That is excellent. A good starting point. Have you crunched the numbers yet? There must be a lot of these hurricanes this year. I keep hearing they are unprecedented.

Counting Ophelia, we had ten hurricanes this year. Take a look at this table from their website. I am counting hurricanes of all categories. I recall from our last meeting that Hurin will cover all categories. By the way, I never liked the name Hurin for hurricane insurance. It sounds like aspirin.

Don’t worry about the name. Our marketing team has it covered. Funny name branding has its influence. You will learn when you rotate through the sales and marketing team. Tell me about the counts for 2016, 2015, etc. Did you count the number of hurricanes for all the previous years?

Yes. Here are the table and a plot showing the counts for each year from 1996 to 2017.

Based on these 22 years of data, we see that the lowest number per year is two hurricanes and the highest number is 15 hurricanes. When we are designing the payout structure, we should have this in mind. Our claim applications will be a function of the number of hurricanes. Can we compute the probability of having more than 15 hurricanes in a year using some distribution?

Absolutely. If we assume hurricane events are independent (the occurrence of one event does not affect the probability that a second event will occur), then the count per year can be treated as a random variable that follows a probability distribution. Counts, i.e., the number of times an event occurs in an interval, follow a Poisson distribution. In our case, we are counting events that occur in time, and the interval is one year.

Let’s say the random variable is X and it can take any value: zero hurricanes, one hurricane, two hurricanes, and so on. What is the probability that X takes a particular value, P(X = k)? What are the control parameters?

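Poisson distribution has one control parameter, λ (lambda).

It is the rate of occurrence; the average number of hurricanes per year. Based on our data, λ is 7.18 hurricanes per year. The probability P(X = k) for a time interval t is

$$P(X = k) = \frac{e^{-\lambda t} (\lambda t)^{k}}{k!}$$

With a unit interval of one year (t = 1), this reduces to $e^{-\lambda} \lambda^{k} / k!$, and the expected value and the variance of this distribution are both λ.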

We can compute the probability of having more than 15 hurricanes in a year by adding P(X = 16) + P(X = 17) + P(X = 18) and so on. Since 15 is at the extreme of our record, the probability will be small, but our risk planning should include it. Extreme events will create catastrophic damage. I see you have more slides in your deck. Do you also have the probability distribution plotted?
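Adding P(X = 16) + P(X = 17) + … means summing infinitely many terms, so in practice it is easier to subtract the first sixteen terms from one. Here is a minimal Python sketch of that calculation, assuming the Poisson model with the rate of 7.18 per year quoted in the meeting. The function names are my own.

```python
from math import exp, factorial

LAMBDA = 7.18  # average hurricanes per year, as quoted for the 1996-2017 counts


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with rate lam over a unit interval."""
    return exp(-lam) * lam**k / factorial(k)


def prob_more_than(k, lam):
    """P(X > k) = 1 - [P(X = 0) + P(X = 1) + ... + P(X = k)]."""
    return 1.0 - sum(poisson_pmf(j, lam) for j in range(k + 1))


print(f"P(X > 15) = {prob_more_than(15, LAMBDA):.4f}")
```

The result is small, as Able anticipates, which is why it belongs in the payout planning as a rare but expensive scenario.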

Yes, I have them. I computed the P(X = k) for k = 0, 1, 2, …, 20 and plotted the distribution. It looks like this for λ = 7.18.
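A rough text version of such a plot can be produced with a few lines of Python. This is only a sketch of the calculation Mumble describes, not his actual code; the bar scaling of 200 characters per unit of probability is an arbitrary choice.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with rate lam over a unit interval."""
    return exp(-lam) * lam**k / factorial(k)


lam = 7.18
probs = [poisson_pmf(k, lam) for k in range(21)]  # k = 0, 1, ..., 20

# One row per count: the probability and a bar scaled at 200 characters per unit.
for k, p in enumerate(probs):
    print(f"{k:2d}  {p:.4f}  {'#' * round(p * 200)}")

# The count with the highest probability.
most_likely = max(range(21), key=lambda k: probs[k])
print("Most likely number of hurricanes in a year:", most_likely)
```

Changing `lam` to 14.95 gives the corresponding picture for the western Pacific typhoons, although the range of k should then be extended beyond 20.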

Let me show you one more probability distribution. This one is for storms originating in the western Pacific. They are reported to the Joint Typhoon Warning Center. Since we also insure assets in Asia, this data and the probability estimates will be useful to design premiums and payouts there. The rate of events is higher in Asia; an average of 14.95 typhoons per year. The maximum number of typhoons is 21.

Very impressive, Mumble. You have the foresight to consider different scenarios.

As the meeting comes to a close, Mr. Able is busy checking his emails on the phone. A visibly jubilant Mumble sits in his chair and collects the papers from the table. He is happy to have completed a meeting with Mr. Able without many questions. He is already thinking of his evening drink.

The next meeting is in one week. Just as Mr. Able gets up to leave the conference room, he pauses and looks at Mumble.

“Why is it called Poisson distribution? How is this probability distribution different from the Binomial distribution? Didn’t you say in a previous meeting that exactly one landfall in the next four hurricanes is binomial?”

Mumble gets cold feet. His mind already switched over to the drinks after the last slide; he couldn’t come up with a quick answer. As he begins to mumble, Able gets sidetracked with a phone call. “See you next week, Mumble.” He leaves the room.

Mumble gets up and looks out the window at a bright, sunny afternoon. He refills his coffee mug, takes a sip and reflects on the meeting and the question.

To be continued…


Lesson 35 – Trials to ‘r’th success: The language of Negative Binomial distribution

We all know that the number of trials to the first success follows a Geometric distribution. It can take one trial, two trials, three trials, etc., to see the first success. The number of trials is treated as a random variable X = {1, 2, 3, … }; each value has a probability, i.e., P(X = 1), P(X = 2), P(X = 3), and so on.

However, Mr. Gardner needs more success. He has already sold his first “time machine” (aka bone density scanner). He’d have to sell more. He is looking for his second success, third success and so forth to get through.

The number of trials it takes to see the second success follows a Negative Binomial distribution.

The number of trials it takes to see the third success follows a Negative Binomial distribution.

The number of trials it takes to see the ‘r’th success follows a Negative Binomial distribution.

Are you paying attention to the pattern here? The Negative Binomial distribution is the Geometric distribution when r is 1 (trials to first success).

The Geometric distribution has one control parameter, p, the probability of success.

Since we are interested in more than the first success, r is another parameter in the Negative Binomial distribution. Together, p and r determine how the distribution looks.

Let’s take the example of Mr. Gardner. “I’d have to sell one more.” Each of his visits to a doctor’s office is a trial. They will buy the bone density scanner or show him the door. So Mr. Gardner is working with a probability of success, p. Let’s say that p is 0.25, i.e., there is a 25% chance that he will succeed in selling it.

Let’s assume that he sold one machine. He is looking for his second sale. r = 2.

His second success can occur in the second trial, the third trial, the fourth trial, etc. When r = 2, X, the random variable of the number of trials will be X = {2, 3, 4, …}.

When r = 3, X = {3, 4, 5, …}.

You correctly guessed my next line. When r = 1, i.e., for Geometric distribution, X = {1, 2, 3, …}.

There is a pattern. We are learning a distribution which is an extension of Geometric distribution.

Now let us compute the probability that X can take any integer value.

P(X = 2) is the probability that he makes his second sale (second success) on the second trial.

Remember the trials are independent. The second doctor’s decision is not dependent on what happened before. He buys with a probability of 0.25, or he does not.

So, P(X = 2) is 0.25*0.25 = 0.0625. The first trial is a success and the second trial is a success.

P(X = 3) is the probability that he makes his second sale on the third trial. It means he must have made his first sale in either the first trial or the second trial, and then he makes his second sale on the third trial.

The probability of succeeding for the second time on the third trial is the probability of succeeding once in the first two trials and then succeeding on the third trial.

P(1 success in 2 trials) * P(3rd is a success)

The probability of the one success in two trials is computed using the Binomial distribution.

2C1*p^1*(1-p)^(2-1)

This probability is multiplied by p, the probability of success in the third trial.

P(X = 3) = 2C1*p^1*(1-p)^(2-1)*p
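Plugging in p = 0.25 makes the two cases concrete. The short Python sketch below uses exact fractions so the arithmetic is easy to follow; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

p = Fraction(1, 4)  # probability of a sale on any visit (0.25 in the text)

# Second sale on the second visit: both visits are sales.
p_x2 = p * p

# Second sale on the third visit: one sale in the first two visits (Binomial),
# then a sale on the third visit.
one_in_two = comb(2, 1) * p**1 * (1 - p) ** (2 - 1)
p_x3 = one_in_two * p

print(f"P(X = 2) = {p_x2} = {float(p_x2)}")
print(f"P(X = 3) = {p_x3} = {float(p_x3)}")
```

So P(X = 3) = 2 × 0.25 × 0.75 × 0.25 = 3/32 = 0.09375, which is higher than P(X = 2) = 1/16 = 0.0625.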

If this is your expression now, 😕 let’s take another case to clear up the concept.

Suppose we want P(X = 6), the probability of making the second sale on the sixth trial.

This will happen in the following way.

Mr. Gardner has to sell one machine in five trials. P(1 in 5), one success in five trials → Binomial.

5C1*p^1*(1-p)^(5-1)

Then, the sixth trial is a sale. So we multiply the above binomial probability by p, the hit in the sixth.

You should have noticed the origin of the name → Negative Binomial.

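To generalize, the probability that the ‘r’th success occurs on the kth trial is the binomial probability of (r-1) successes in the first (k-1) trials, multiplied by p for the success on the kth trial:

$$P(X = k) = \binom{k-1}{r-1} p^{r-1} (1-p)^{k-r} \cdot p = \binom{k-1}{r-1} p^{r} (1-p)^{k-r}, \quad k = r, r+1, r+2, \ldots$$

where $\binom{k-1}{r-1}$ is (k-1)C(r-1).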

I computed these probabilities for X ranging from 2 to 50. Here is the probability distribution. Remember r = 2 and p = 0.25.
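The same computation can be sketched in Python. The function below follows the reasoning above (a Binomial count of r-1 successes in the first k-1 trials, then one more success); the name `neg_binom_pmf` is my own and this is not necessarily the code used to make the plots.

```python
from math import comb


def neg_binom_pmf(k, r, p):
    """P(X = k): the r-th success arrives on trial k (k >= r)."""
    if k < r:
        return 0.0
    # (r-1) successes somewhere in the first (k-1) trials, then a success on trial k.
    return comb(k - 1, r - 1) * p ** (r - 1) * (1 - p) ** (k - r) * p


r, p = 2, 0.25
probs = {k: neg_binom_pmf(k, r, p) for k in range(2, 51)}

for k in (2, 3, 6):
    print(f"P(X = {k}) = {probs[k]:.4f}")
print(f"Total probability for X = 2..50: {sum(probs.values()):.6f}")
```

The total is just short of one because a small amount of probability sits beyond the 50th trial. Setting `r = 1` turns the function into the Geometric distribution.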

Now I change the value for r to 3 and 5 to see how the Negative Binomial distribution looks. r = 3 means Mr. Gardner will sell his third machine.

r = 3

r = 5

Notice how the probability distribution shifts with increasing values of r.

The control parameters are r and p.

You can try different values of p and see what happens. For a fixed value of p and changing values of r, the tail is getting bigger and bigger. What does it mean regarding the number of trials and their probability for Mr. Gardner?

Think about this. If he sets a target of selling three machines in the day, what is the probability that he can achieve his goal within 20 doctor visits?

What if he needs to sell five machines within these 20 visits to make ends meet? Can he make it?

If you try out the probability distribution plots for p = 0.5, you will see that his chance of selling the fifth machine within 20 visits goes up tremendously. So perhaps he should learn the six principles of influence and persuasion to increase p, the probability of saying yes by the doctors.
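These questions come down to adding up P(X = r) + P(X = r+1) + … + P(X = 20). Here is a hedged Python sketch of that sum for the two targets (r = 3 and r = 5) and the two success probabilities mentioned (p = 0.25 and p = 0.5). The function names are illustrative.

```python
from math import comb


def neg_binom_pmf(k, r, p):
    """P(X = k): the r-th sale happens on visit k (k >= r)."""
    return comb(k - 1, r - 1) * p**r * (1 - p) ** (k - r)


def prob_within(visits, r, p):
    """P(X <= visits): the r-th sale happens on or before the given visit."""
    return sum(neg_binom_pmf(k, r, p) for k in range(r, visits + 1))


for p in (0.25, 0.5):
    for r in (3, 5):
        print(f"p = {p}, r = {r}: P(X <= 20) = {prob_within(20, r, p):.3f}")
```

The same number can be read another way: making the r-th sale within 20 visits is the same event as making at least r sales in 20 visits, which is a Binomial tail probability.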

People mostly prefer to say yes to the request of someone they know and like. I want our blog to have the 9000th user within nine months. See, I am requesting in the language of Negative Binomial distribution.
