Lesson 38 – Correct guesses: The language of Hypergeometric distribution

Now that John prefers Pepsi let’s put him on the spot and ask him to choose it correctly. We will tell him that there is one Pepsi drink. We’ll ask him to chose it correctly when given three drinks.

These are his possible outcomes. He will pick the first, second or third can as Pepsi. If he chooses the first can, his guess is correct. So the probability that John is correctly guessing the one Pepsi can is 1/3. The probability that his guess is wrong is 2/3.

Since he is only choosing one correct Pepsi can, his outcomes are 0 or 1 right guess.

P(X = 0) = 2/3
P(X = 1) = 1/3

Let’s say that he liked Pepsi so much that he guessed it correctly in our first test.

We will put him through another test. This time, we will ask him to choose Pepsi correctly, when there are two cans in three.

He is to choose two cans, 1-2, 1-3, or 2-3. His guess outcomes are selecting a combination of two cans that will yield one Pepsi or both Pepsis. The only way to correctly guess both Pepsis is to choose 1-2. The probability is 1/3 (one option among three combinations).

There are two other ways (1-3, 2-3) for him to choose one Pepsi correctly. So the probability is 2/3.

P(X = 1) = 2/3
P(X = 2) = 1/3

John is good at this. He correctly guessed both the Pepsi cans. Let’s troll him by saying he got lucky so that he will take on the next test.

This time, he has to choose Pepsi correctly where there are two cans in four.

His picks can be 1-2, 1-3, 1-4, 2-3, 2-4, and 3-4. Six possibilities, since he is choosing two from four – 4C2. Based on his pick, he can get zero Pepsi, one Pepsi or both Pepsis. X = 0, 1, 2.

There is one possibility of both Pepsis; the combination 1-2. The probability that he chooses this combination among all others is 1/6.

P(X = 2) = 1/6

Similarly, there is one possibility of no Pepsi, the combination 3-4. The probability that he chooses these cans among all his options is 1/6.

P(X = 0) = 1/6

The other guess is choosing one Pepsi correctly. There are two Pepsi cans; he can pick either the first can or the second can as his choice, and then his second can will be the third or the fourth can.

1 with 3
1 with 4

2 with 3
2 with 4

There are 2C1 ways of choosing one Pepsi can among the two Pepsi cans. There are 2C1 ways of picking one Coke can from the two Coke cans.
The total possibilities are 2C1*2C1.

P(X = 1) = 4/6

Fortune favors the brave. John again correctly picked 1-2. Ask him to take one more test, and he will ask us to get out.

If you are like me, you want to play out one more scenario to understand the patterns and where we are going with this. So, let’s leave John alone and do this thought experiment ourselves. I will pretend you are John.

I will ask you to correctly pick Pepsis when there are three Pepsi cans among five.

When you pick three cans out of five, only one of them can be Pepsi, two of them can be Pepsi, or all three of them can be Pepsi. So X = 1, 2, 3.
You have a total of 10 combinations; pick three correctly from five. 5C3 ways as shown in this table.

The probability of picking all three Pepsis is 1/10.

P(X = 3) = 1/10

Now, let’s work out how many options we have for picking exactly one Pepsi in three cans. It has to be one Pepsi (one correct guess) with two Cokes (two wrong guesses). One Pepsi from three Pepsi cans can be selected in 3C1 ways. Two Cokes from two Cokes can be selected in 2C2 ways making it a total of 3C1*2C2 = 3 options. We can identify these options from our table as 1-4-5, 2-4-5 and 3-4-5. So,

P(X = 1) = 3/10

By the same logic, picking exactly two Pepsis in three cans can be done in 3C2*2C1 ways. Two Pepsi cans selected from three and one Coke chosen from two Coke cans. Six ways. Can you identify them in the table?

P(X = 2) = 6/10

Let us generalize.

If there are R Pepsi cans in a total of N cans (N-R Cokes) and we are asked to identify them correctly, in our choice selection of R Pepsis, we can get k = 0, 1, 2, … R Pepsis. The probability of correctly selecting k Pepsis is

X, the number of correct guesses (0, 1, 2, …, R) assumes a hypergeometric distribution.

The control parameters of the hypergeometric distribution are N and R.

The probability distribution for N = 5, and R = 2 is given.

When N =5 and R = 3.

In more generalized terms, if there are R Pepsi cans in a total of N cans (N-R Cokes) and we randomly select n cans from this lot of N and define X to be the number of Pepsis in our sample of n, then the distribution of X is hypergeometric distribution. P(X=k) in this sample is then:

The denominator NCn is the number of ways you can select n cans out of a total of N. A sample of n from N. The first term in the numerator is selecting k (correct guesses) out of R Pepsis. The second term is selecting (n-k) (wrong guesses) from the remaining (N-R) Cokes.

I want you to start visualizing how the probability distribution changes for different values of N, R, and n.

Next week, we will learn how to work with all discrete distributions in our computer programming tool R.

Hypergeometric distribution is typically used in quality control analysis for estimating the probability of defective items out of a selected lot.

The Pepsi-Coke marketing analysis is another example application. Companies can analyze the preferences of one product to other among a subset of customers in their region.

Now think of this:

There are R Republican leaning voters in a population of N. For simplicity (and for all practical purposes since LP and GP will not win); the other N-R voters are leaning Democratic. If you select a random sample of n voters, what is the probability that you will have more than k of them voting Republican? You know that the control parameters for this “election forecast” model are N, R, and n. What if you underestimated the number of R leaning voters? What if your sample of n voters is not random?

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Enjoy this blog? Please spread the word :)