Lesson 32 – Exactly k successes: The language of Binomial distribution


I may be in one of those transit buses. Since I moved to New Jersey, I am going through this mess every day.



Well, you wanted to enjoy Manhattan skyline. It has a price tag.



D, glad you are here. It’s been a while. In our last meeting, we were discussing the concepts of variance operation and its properties. I continue to read your lessons every week. As I paused and reflected on all the lessons, I noticed that there is a systematic approach to them. You started with the basics of sets and probability, introduced lessons on visualizing, summarizing and comparing data using various statistics, then extended those ideas into random variables and probability distributions. The readership seems to have grown considerably, and people are tweeting about our classroom. Have you reached 25000 pageviews yet?

We are at 24350 pageviews now. We will certainly hit the 25k mark today 😉 I am thankful to all the readers for their time. Special thanks to all those who are spreading the word. Our classroom is a great resource for anyone starting data analysis.



So, what's on the show today?



As you correctly pointed out, we are now slowly getting into various types of probability distributions. I mentioned in lesson 31 that we would learn several discrete probability distributions that are based on Bernoulli trials. We start this adventure with the Binomial distribution.


Great. Let me refresh my memory of probability distributions before we get started. We discussed the basics of probability distribution in lesson 23. Let’s assume X is a random variable, and P(X = x) is the probability that this random variable takes any value x (i.e., an outcome). Then, the distribution of these probabilities on a number line, i.e., the probability graph is called the probability distribution function f(x) for a random variable. We are now looking at various mathematical forms for this f(x).


Fantastic. Now imagine you have a Bernoulli sequence of yes or no.



Sure. It is a sequence of 0s and 1s with a probability p; 0 if the trial yields a no (failure, or event not happening) and 1 if the trial yields a yes (success, or event happening). Something like this: 00101001101


From this sequence, if you are interested in the number of successes (1s) in n trials, this number follows a Binomial distribution. If you assume X is a random variable that represents the number of successes in a Bernoulli sequence of n trials, then this X should follow a binomial distribution. The probability that this random variable X takes any value k, i.e., the probability of exactly k successes in n trials is:

P(X = k) = nCk * p^k * (1-p)^(n-k), for k = 0, 1, 2, ..., n

where nCk = n!/(k!(n-k)!) is the number of ways of choosing k successes out of n trials.


The expected value of this random variable, E[X] = np, and the variance V[X] = np(1-p).



😯 Wow, that’s a fastball. Can we parse through the lingo?



Oops… Okay, let us take the example of your daily commute. Imagine buses and cars pass through the tunnel each morning. Can you guesstimate the probability of buses?



Yeah, I usually see more buses than cars in the morning. Let’s say the likelihood of seeing a bus is p=0.7.



Now let us imagine that buses and cars come in a Bernoulli sequence. Assign a 1 if it is a bus, and 0 if it is a car.


That is reasonable. The vehicle passage is usually random. If we take that as a Bernoulli sequence, there will be some 1s and some 0s with a 0.7 probability of occurrence. In the long run, you will have 70% buses and 30% cars in any order.



Correct. Now think about this. In the next four vehicles that pass through the tunnel, how many of them will be buses?



Since there is randomness in the sequence, I can say that in the next four vehicles, all of them may be buses, none of them may be buses, or any number in between.


Exactly. The number of buses in a sequence of 4 vehicles can be 0, 1, 2, 3 or 4. These are the possible outcomes of the random variable X. In other words, if X is the number of buses in 4 vehicles coming at random, then X can take 0, 1, 2, 3 or 4 as the outcomes. The probability distribution of X is binomial.


I understand how we came up with X. Why is the probability distribution of X called binomial?


It originates from the idea of the binomial coefficient that you may have learned in an elementary math/combinations class. Let us continue with our logical deduction to see how the probability is derived, and you will see why.


Sure. We have X as 0, 1, 2, 3 and 4. We should calculate the probability P(X = 0), P(X = 1), P(X = 2), P(X = 3) and P(X = 4). This will give us the distribution of the probabilities.


Take an example, say X = 2. Let us compute P(X = 2), the probability of seeing exactly two buses in 4 vehicles, i.e., the probability of exactly k successes in n trials. If the buses and cars come in a Bernoulli sequence (1 for a bus and 0 for a car) with a probability p, in how many ways can you see two buses out of 4 vehicles?


Ah, I see where we are going with this. Let me list out the possibilities. Two buses in four vehicles can occur in six ways. 0011, 0101, 1100, 1010, 1001, 0110. In each of these six possible sequences, there will be exactly two buses among four vehicles. I remember from my combinations class that this is four choose two: four factorial divided by the product of two factorial and (four minus two) factorial. 4C2 = 4!/(2!(4-2)!) = 6

For each possibility, the probability of that sequence can also be written down. Let me make a table like this:

Possibility 1: 0011 → (1-p)(1-p)(p)(p)
Possibility 2: 0101 → (1-p)(p)(1-p)(p)
Possibility 3: 1100 → (p)(p)(1-p)(1-p)
Possibility 4: 1010 → (p)(1-p)(p)(1-p)
Possibility 5: 1001 → (p)(1-p)(1-p)(p)
Possibility 6: 0110 → (1-p)(p)(p)(1-p)

Each sequence has the same probability, p*p*(1-p)*(1-p) = p^2*(1-p)^2.

You can see from the table that there are six possibilities. Any of the possibilities, 1 or 2 or 3 or 4 or 5 or 6, can occur. Hence, the probability of seeing two in four is the sum of these probabilities. Remember P(A or B) = P(A) + P(B). If you follow through this, you will get 6*p*p*(1-p)*(1-p) = 6*p^2*(1-p)^(4-2). Can you see where the formula for the binomial distribution comes from?




Absolutely. For each outcome of X, i.e., 0, 1, 2, 3 and 4, we should apply this logic/formula and compute the probability of the outcome. Let me finish it and make a plot.
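The probabilities Joe is about to compute are available directly in R (our companion from lesson 2) through the `dbinom` function. A quick sketch, assuming the bus probability p = 0.7 from our example:

```r
# Binomial probabilities for the bus example: n = 4 vehicles, p = 0.7
n <- 4
p <- 0.7
k <- 0:4

# dbinom(k, n, p) computes choose(n, k) * p^k * (1 - p)^(n - k)
pk <- dbinom(k, n, p)
round(pk, 4)

# Check the k = 2 case against the hand calculation: 6 * p^2 * (1 - p)^2
choose(4, 2) * p^2 * (1 - p)^2

# Plot the probability distribution
plot(k, pk, type = "h", lwd = 2, xlab = "Number of buses in 4 vehicles",
     ylab = "P(X = k)")
```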


Very nicely done. Let me jump in here and show you another plot with a different n and p. If p = 0.5 (equal probability) and n = 100, this is how the binomial distribution looks.


Nice. It looks like a bell centered around 50.



Yeah. You noticed that the distribution is centered around 50. It is the expected value of the distribution. Remember E[X] is the central tendency of the distribution. For binomial, you can derive it as np = 100(0.5) = 50. In the same way, the variance, i.e., the spread of the function around this center, is np(1-p) = 100(0.5)(0.5) = 25, or a standard deviation of 5. You can see that the distribution is spread out within three standard deviations from the center. Can you now imagine how the distribution will look for p = 0.3 or p = 0.7?


Following the same logic, those distributions will be centered on 100*0.3 = 30 and 100*0.7 = 70 with their variance. Now it all makes sense.
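Joe's deduction can be checked numerically. A small R sketch that recovers E[X] = np and V[X] = np(1-p) from the distributions themselves, for n = 100:

```r
# Binomial distributions with n = 100 and different p values
n <- 100
k <- 0:n

for (p in c(0.3, 0.5, 0.7)) {
  pk <- dbinom(k, n, p)
  ev <- sum(k * pk)           # expected value; should equal n * p
  v  <- sum((k - ev)^2 * pk)  # variance; should equal n * p * (1 - p)
  cat(sprintf("p = %.1f: E[X] = %.1f, V[X] = %.1f\n", p, ev, v))
}
```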


You see how easy it is when you go through the logic. We started with a Bernoulli sequence. When we are interested in the random variable that is the number of successes in n trials, it follows a binomial distribution. "Exactly k successes" is the language of the Binomial distribution. Can you think of any other examples that can be modeled as a binomial distribution?


Probability that Derek Jeter, with a batting average of 0.3, gets three hits out of the three times he comes to bat 😆  This is fun. I am glad I learned some useful concepts out of the messy commute experience. By the way, "exactly one landfall in the next four hurricanes" is also binomial. With Jose coming up, I wonder if we can compute the probability of damage for New York City based on the probability of landfall.


Don’t worry Joe. Our Mayor is graciously implementing his comprehensive $20 billion resiliency plan. NYC is safe now. Forget probability of damage. You need to worry about the probability of bankruptcy.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 31 – Yes or No: The language of Bernoulli trials

Downtown Miami will be flooded due to hurricane Irma.

Your vehicle will pass the inspection test this year.          

Each toss of a coin results in either a head or a tail.          

Did you notice that I am looking for an answer, an outcome that is “yes” or “no.” We often summarize data as the occurrence (or non-occurrence) of an event in a sequence of trials. For example, if you are designing dikes for flood control in Miami, you may want to look at the sequence of floods over several years to analyze the number of events, and the rate at which they occur.

There are two possibilities, a hit (event occurred – success) or a miss (event did not occur – failure). A yes or a no. These events can be represented as a sequence of 0's and 1's (0001100101001000) called Bernoulli trials with a probability of occurrence of p. This probability is constant over all the trials, and the trials themselves are assumed to be independent, i.e., the occurrence of one event does not influence the occurrence of the subsequent event.

Now, imagine these outcomes, 0’s or 1’s can be represented using a random variable X. In other words, X is a random variable that can take 0 or 1 with a probability p. If in Miami, there were ten extreme flood events in the last 100 years, the sequence will have 90 0’s and 10 1’s in some order. The probability of the event is hence 0.1. If the probability is 0.5, then, in a sequence of 100 trials (coin tosses for example), you will see 50 heads on average. We can derive the expected value of X and the variance of X as follows:
E[X] = 1(p) + 0(1 - p) = p

V[X] = E[X²] - (E[X])² = (1²(p) + 0²(1 - p)) - p² = p - p² = p(1 - p)
Since the Bernoulli trials are independent, the probability of a sequence of events happening will be equal to the product of the probability of each event. For instance, the probability of observing a sequence of No Flood, No Flood, No Flood and Flood over the last four years is 0.9*0.9*0.9*0.1 = 0.0729 (assuming p = 0.1).
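Both the product rule and the long-run frequency are one-liners in R. The simulation below assumes a hypothetical 100-year record with p = 0.1:

```r
# Probability of the sequence No Flood, No Flood, No Flood, Flood (p = 0.1)
p <- 0.1
prod(c(1 - p, 1 - p, 1 - p, p))   # 0.9 * 0.9 * 0.9 * 0.1 = 0.0729

# Simulate a hypothetical 100-year Bernoulli sequence: floods (1), no floods (0)
set.seed(1)                       # for reproducibility
x <- rbinom(100, size = 1, prob = p)
mean(x)                           # long-run relative frequency, close to p
```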

Bernoulli trials form the basis for deriving several discrete probability distributions that we will learn over the next few weeks.

While you ponder over what these distributions are, their mathematical forms, and how they represent the variation in the data, I will leave you with this image of the daily rainfall data from Miami International Airport. Approximately 6.38 inches of rain (~160 mm/day) is forecast for Sunday. Notice how you can remap the data into a sequence of 0's (if rain is less than 160) and 1's (if rain is greater than 160).

After tomorrow, when you hear “unprecedented rains” in the news, keep in mind that we seek the historical sequence data like this precisely because our memory is weak.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 30 – Pause and rewind

Why am I doing this?

What you just read is my blog post on January 1, 2017. It is what happens on a new year day when you don’t have a social life.

Over the last few years, I have concluded that the largest crowd I can ever reach, if I continue delivering lectures in the usual University setting is 80 per semester. I am sure some of them are not there by choice. Since I believe that I can distill ideas into easily understandable forms, I felt the urge to spread my voice to a larger audience. Hence our Data Analysis Classroom.

I usually emphasize the importance of collecting more data for understanding the uncertainties in the system and using them in the final decision-making process. Practicing what I preach, after 29 lessons, I want to pause, reflect on the readership data and rewind what we learned.

We started this journey on the fifth day of February 2017. After 202 days, i.e., six months and 21 days, the monthly readership data is

We more than made up for the slump in July. The total page views in this time are 19051, with average monthly page views of 2721. Being very conservative, I would like to think that the blog may only capture the interest of 0.5% of the readers. That will be approximately 95 more people I could reach in the six-month period. As I said, that is easily more than I can reach through a class in a semester. I just hope these 95 don't include the same folks in the class who are already captivated!

Here is a map of the readership. If you are in one of the blue countries, thank you for your attention. Let’s get more people on board. They will like it too.

What did we learn so far?

Uncertainties surround us. Understanding where they arise from and recognizing their extent is fundamental to improving our knowledge about anything.

We started off with the fact that one needs to observe the system to understand it better. Ask for data. The more, the merrier.

Lesson 1: When you see something, say data. A total of 412 page views.

Data may be grouped in sets → collection of elements. We have various forms to visualize the sets. Unions, intersections, subsets, and their properties.

Lesson 3: The Setup. 321 page views.
Lesson 4: I am a visual person. 301 page views.

We defined probability and understood that there are some rules (axioms) of probability.

Lesson 6: It’s probably me. 377 page views.
Lesson 7: The nervousness axiom – fight or flight. 253 page views.

We learned conditional events, independent events and how the probability plays out under these events.

Lesson 9: The necessary condition for Vegas. 813 page views.
Lesson 10: The fight for independence. 641 page views.

We now know the law of total probability and the Bayes theorem.

Lesson 12: Total recall. 471 page views.
Lesson 13: Dear Mr. Bayes. 2280 page views — most popular lesson so far.

In lessons 14 through 20, you can learn the basics of exploratory data analysis: visualization techniques and summary statistics.

Lesson 14: The time has come; execute order statistics. A lesson to explain the concept of percentiles and boxplots. 327 page views.

Lesson 15: Another brick in the wall — for building histograms. 297 page views.

Lesson 16: Joe meets the average. Explains the mean of the data. 218 page views.

Lesson 17: We who deviate from the norm. Explains the idea of variance and standard deviation. 159 page views.

Lesson 18: Data comes in all shapes. A short lesson explaining the concept of skewness. 116 page views.

Lesson 19: Voice of the outliers. Outliers are significant. As Nassim Nicholas Taleb puts it, "don't be a turkey" by removing them. 87 page views.

Lesson 20: Compared to what?. Use the coefficient of variation to compare data. 115 page views.

Lessons 22 to 28 introduce the idea of random variables and probability distribution.

Lesson 22: You are so random. The basic idea of discrete and continuous random variables. 109 page views.

Lesson 23: Let’s distribution the probability. This lesson will teach the concept of the probability distribution. 126 page views.

Lesson 24: What else did you expect? 130 page views.
Lesson 25: More expectation. 95 page views. These two lessons go through the expected value of a random variable.

Lesson 26: The variety in consumption. 175 page views.
Lesson 27: More variety. 568 page views. These two lessons are for understanding the variance of a random variable concept.

Lesson 28: Apples and Oranges. How to standardize the data for comparison. 325 page views.

Lessons in R

As I mentioned in lesson 2, R is your companion in this journey through data. Computer programming is very very (cannot emphasize enough “very”) essential for data analysis. Wherever required, I have provided brief lessons on how to use R for data analysis.

Lesson 2: R is your companion. The very first lesson to get going with R. 434 page views.

Lesson 5: Let us Review. Reading data files and more fun stuff. 228 page views.

Lesson 8: The search for wifi. Learning for and if else statements in R. 194 page views.

Lesson 11: Fran, the functionin' R-bot. The essentials of writing functions in R. 332 page views.

Lesson 21: Beginners guide to summarize data in R. A step-by-step exploratory data analysis. 1786 page views.

Lesson 29: Large number games in R. 933 page views.

Where do we go from here?

A long way.

While you pause and reflect on these lessons, I will pause and come back with lesson 31 on the ninth day of September 2017. Help spread the word as we build this knowledge platform one lesson at a time while the university system becomes obsolete.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 28 – Apples and Oranges

The prices of golden delicious apples sold yesterday at selected U.S. cities’ terminal markets are

The prices of navel oranges sold yesterday at selected U.S. cities’ terminal markets are

I want to compare apples and oranges, needless to say, prices. I will assume the prices of apples to be a random variable and will plot the probability distribution function. I will consider the prices of oranges to be another random variable and will plot its probability distribution function. Since the prices can be defined on a continuous number scale, I am assuming continuous probability distribution functions that look like this

Notice the two random variables are on a different footing. Apples are centered on $44 with a standard deviation (spread) of $9. Oranges are centered on $25 with a standard deviation of $5.

If our interest lies in working simultaneously with data that are on different scales or units, we can standardize them, so they are placed on the same level or footing. In other words, we will move the distributions from their original scales to a new common scale. This transformation will enable an easy way to work with data that are related, but not strictly comparable. In our example, while both the prices data are in the same units, they are clearly on different scales.

We can re-express the random variable (data) as standardized anomalies so they can be compared or analyzed. This process of standardization can be achieved by subtracting the mean of the data and dividing by the standard deviation. For any random variable, if we subtract the mean and divide by the standard deviation, the expected value and the variance of the new standardized variable are 0 and 1. Here's why.
Let Z = (X - μ)/σ, where μ = E[X] and σ² = V[X]. Then E[Z] = (E[X] - μ)/σ = 0, and V[Z] = V[X]/σ² = 1.
The common scale is hence a mean of 0 and a standard deviation of 1. We just removed the influence of the location (center) and spread (standard deviation) from the data. Any variable, once standardized, will be on this scale regardless of the type of the random variable and shape of the distribution.

You must have observed that the units will cancel out → the standardized random variable is dimensionless. The standardized data for apples and oranges will look like this

The first step where we subtract the mean will give anomalies, i.e. differences from the mean that are now centered on zero. Positive anomalies are the values that are greater than the mean, and negative anomalies are the values that are less than the mean. The second step where we divide by the standard deviation will provide a scaling factor; unit standard deviation for the new data that enables comparison across different distributions.

The standardized scores can also be seen as a distance measure between the original data value and the mean, the units being the standard deviation. For example, the price of apples in the New York terminal market is $55, about 1.14 standard deviations from the mean of ~$44 ($44.63 + 1.14*9 ≈ $55).
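In R, this standardization takes one line (the built-in `scale` function does the same two steps). The prices below are hypothetical stand-ins, since the full terminal-market table is not reproduced here:

```r
# Standardize: subtract the mean, divide by the standard deviation.
# Hypothetical apple prices (stand-ins for the terminal market data).
apples <- c(30, 38, 42, 44, 46, 50, 52, 55)

z <- (apples - mean(apples)) / sd(apples)   # same result as scale(apples)

mean(z)   # ~0: the location (center) is removed
sd(z)     # 1: the spread is removed
```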

We will revisit this idea of standardization and use it to our advantage when we learn normal probability distributions. Until then, here are some examples of "standardize" in sentences as provided by Merriam-Webster, with my rants attached.

“The plan is to standardize the test for reading comprehension so that we can see how students across the state compare” – One size does not fit all.

“He standardized procedures for the industry” – Interns getting coffee is not one of them. So make your own.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 27 – More variety

August 5, 2017: Outside temperature 71F; Joe’s temperature 62F.

June 2, 2017: Outside temperature 77F; Joe’s temperature 60F.

April 14, 2017: Outside temperature 65F; Joe’s temperature 64F.

February 7, 2017: Outside temperature 40F; Joe’s temperature 61F.

December 11, 2016: Outside temperature 34F; Joe’s temperature 63F.


What is the point? I know you are not referring to me.


Hey Joe. Glad you are here. I have been doing this experiment for some time now. A few years ago, I noticed that Joe's, the famous coffee shop in New York City, is always pleasant – summer or winter. I wanted to see how much variation there is in the internal temperature, so I started measuring it daily. It also meant $5 a day on coffee. 😯


So what did you find at that price?


I discovered that the temperature was discrete and always between 60 and 64. Some of it may be due to the temperature app I use on my phone – it only shows whole numbers; nevertheless, the temperature seems to be pretty regulated. Here is the summary of that data shown in probabilities and a simple probability distribution plot. You must be familiar with this.

Sure, I remember our first conversations about probability. It is the long run relative frequency. So you are finding that the probability that the temperature is 60F on any day is 0.1, the probability that the temperature is 62 on any day is 0.4 and so forth.


Correct. Shall we also compute the summary statistics?


Sure, it will be useful to estimate the expected value (E[X]), variance (V[X]) and the standard deviation of this random variable based on the probability distribution function. Let me do that.

The expected value is

E[X] = 60*0.1 + 61*0.2 + 62*0.4 + 63*0.2 + 64*0.1 = 62

The average temperature is 62F.

Variance can be estimated using the following equation:

V[X] = Σ (x - E[X])² f(x)

Applying this to our data:

V[X] = (60-62)²(0.1) + (61-62)²(0.2) + (62-62)²(0.4) + (63-62)²(0.2) + (64-62)²(0.1) = 1.2

So the variance of the temperature is 1.2 deg F * deg F, or the standard deviation is 1.095 deg F.
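Joe's computation can be verified in R straight from the probability distribution:

```r
# Joe's temperature distribution: outcomes and their probabilities
x <- 60:64
f <- c(0.1, 0.2, 0.4, 0.2, 0.1)

ev <- sum(x * f)            # E[X] = sum of x * f(x)
v  <- sum((x - ev)^2 * f)   # V[X] = sum of (x - E[X])^2 * f(x)
s  <- sqrt(v)               # standard deviation

c(ev, v, s)                 # approximately 62, 1.2, 1.095
```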

Fantastic. We can see that the average temperature is 62F and not varying much. That explains why Joe’s is always crowded. Besides having great coffee, they also offer great climate.



I agree. By the way, do you know the geolocation of your readers?


Sure. Here is a map of our readership. We’ve got to get some more countries on board to cover the globe, but this is a great start. Why do you ask?


I ask because outside the USA, people follow the SI unit system and they may want to do these computations in Celsius scale.


Very interesting observation. Do you have a solution for it?


Well, the simplest way is to convert the data first into Celsius scale and re-compute. But that is boring. There’s got to be some elegant approach.


Yes there is. The properties of expected value and the variance operations can help us with this. Going back to your concern, we can convert degree F to degree C using this equation:
C = (F - 32) * 5/9 = (5/9)F - 160/9
In lesson 25, we learned that the expected value operation is linear on functions. For example, 
E[aX + b] = aE[X] + b
Using this property, we can compute the expected value of Joe’s temperature in deg C as:

E[C] = 5/9*E[F] - 160/9 = 5/9*62 - 160/9 = 16.67 deg C.

For variance, it is not linear. Let us look at the derivation.
V[aX + b] = E[(aX + b - E[aX + b])²] = E[(aX - aE[X])²] = a² E[(X - E[X])²] = a² V[X]

Very clear. Let me apply this to the Celsius equation.

V[C] = (5/9)² V[F] = (25/81)(1.2) = 0.37 deg C * deg C, and the standard deviation will be 0.61 deg C.
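The shortcut Joe used, transforming the statistics instead of the raw data, looks like this in R:

```r
# Convert Joe's temperature statistics from deg F to deg C without
# re-computing from raw data: C = (5/9) F - 160/9, a linear transformation
a <- 5 / 9
b <- -160 / 9

EF <- 62    # E[X] in deg F, from the table above
VF <- 1.2   # V[X] in deg F^2

EC <- a * EF + b   # E[aX + b] = a E[X] + b  -> ~16.67 deg C
VC <- a^2 * VF     # V[aX + b] = a^2 V[X]    -> ~0.37 deg C^2
sqrt(VC)           # standard deviation, ~0.61 deg C
```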

Excellent. Now that you have your intellectual focus going, tell me: is the variance of the sum the sum of the variances for two or more random variables?


Hmm. Going by your previous lessons, your questions always open new concepts. There is something hidden here too that you want me to discover. Can you give me a hint?


Well Joe, Noam Chomsky once said that if you can discover things, you are on your way to being an “independent” thinker.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 26 – The variety in consumption

This summer, I installed an air conditioner in my apartment to adapt to a changing climate. One month into the season, I was startled to see my electricity bill double. I checked my AC specs; sure enough, its energy efficiency ratio is 9.7 BTU/h.W — not the modern energy star baby by any standard.

I began to wonder how many such low-efficiency units contribute to increased energy usage, especially in New York City, where window ACs are the norm for residential buildings. While getting the number of older units is challenging, Local Law 84 on Municipal Energy and Water Data Disclosure requires owners of large buildings to report their energy and water consumption annually. You can get the 2015 disclosure data from here. I am showing a simple frequency plot of the weather normalized electricity intensity in kBtu per square foot of building area here.

There are 13223 buildings in the data file; 254 of them have an electricity intensity greater than 200 kBtu. I am not showing them here. The NY Stock Exchange building, Bryant Park Hotel, and Rockefeller University are notable among them.

I want to know the variety in energy use. Long time readers might guess correctly from the title that I am talking about the variability in energy usage data. We can assume that energy consumption is a random variable X, i.e. X represents the possible energy consumption values (infinite and continuous). The data we downloaded are sample observations x. We are interested in the variance of X → V[X].

In lesson 24 and lesson 25, we learned that the expected value (E[X]) is a descriptive quantity of the average (center) of a random variable X with a probability distribution function f(x). In the same way, a measure of the variability, i.e. deviation from the center, of the random variable is the variance V[X].

It is defined as the expected value of the squared deviation from the average:

V[X] = E[(X - μ)²]

μ is the expected value of the random variable → E[X]. (X - μ) measures the deviation from this value. We square these deviations and take the expected value of the squared deviations. If you remember lesson 17, this is exactly the equation for the variance of the data sample. Here, we generalize it for a random variable X.

With some derivation, we can get a useful alternative for computing the variance.
V[X] = E[X²] - (E[X])²
If we know the probability distribution function f(x) of the random variable X, we can also write the variance as
V[X] = Σ (x - μ)² f(x)
f(x) for this data might look like the thick black line on the frequency plot.

The expected value of the consumption for 2015 of the 12969 buildings is 83 kBtu/sqft; the variance is 1062 (kBtu/sqft)×(kBtu/sqft). A better way to understand this is through standard deviation, the square root of the variance — 32 kBtu/sqft. You can see from the frequency plot that the data has high variance — buildings with very low electricity intensity and buildings with high electricity intensity.
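For readers who want to try this on the Local Law 84 file, the recipe in R is below. Since the file is not bundled with this lesson, the sketch uses a simulated stand-in vector with a similar positive, right-skewed shape:

```r
# Sample estimates of center and spread for an energy-intensity vector.
# The data here are simulated stand-ins for the Local Law 84 disclosure file.
set.seed(84)
intensity <- rgamma(1000, shape = 6, scale = 14)   # positive, right-skewed, kBtu/sqft

mean(intensity)   # sample estimate of E[X]
var(intensity)    # sample estimate of V[X] (R uses the n - 1 denominator)
sd(intensity)     # square root of the variance, back in kBtu/sqft
```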

What is your building’s energy consumption this year? Does your city have this cool feature?

It came at a price though. With another local law, we can ban all the low EER AC units to solve the energy problem.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 25 – More expectation

My old high school friend is now a successful young businessman. Last week, he shared thoughts on one of his unusual stock investment schemes. Every few months, he randomly selects four stocks from four different sectors and bets equally on them. I asked him for a rationale. He said he expects two profit-making stocks on average, assuming that the probability of profit or loss for a stock is 0.5. I think since he picks them at random, he also assigns a 50-50 chance of winning or losing.

My first thought

This made a nice expected value problem of the sum of random variables.

The expected number of profit making stocks in his case is 2. We can assign X1, X2, X3, and X4 as the random variables for individual stocks with outcomes 1 if it makes a profit and 0 otherwise. We can assign Y as the total number of profit making stocks; ranging from 0 to 4. His possible outcomes are:
1111, 1110, 1101, 1100, 1011, 1010, 1001, 1000, 0111, 0110, 0101, 0100, 0011, 0010, 0001, 0000
As we can see, the total number of profit making stocks in these scenarios are 4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0. The average of these numbers is 2; the expected number of profit making stocks.

Another way of getting at the same number is to use the expected value formula we learned in lesson 24.

E[Y] = 4(1/16) + 3(4/16) + 2(6/16) + 1(4/16) + 0(1/16) = 2

An important property of expected value of a random variable is that the mean of the linear function is the linear function of the mean.

Y = X1 + X2 + X3 + X4

Y is another random variable that comes from a combination of individual random variables. For sum of random variables,

E[Y] = E[X1] + E[X2] + E[X3] + E[X4]

Detailed folks can go over the derivation.

Now, E[X1] = 0.5(1) + 0.5(0) = 0.5, since the outcomes are 1 and 0 and the probabilities are 0.5 each. Adding all of them, we get 2.

So you see, the additive property makes it easy to estimate the expected value of the sum of the random variables instead of writing down the outcomes and computing the probability distribution of Y.
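Both routes, enumerating the 16 outcomes and using the additive property, can be checked in a few lines of R:

```r
# Enumerate all 16 win/loss outcomes for four stocks (p = 0.5 each)
outcomes <- expand.grid(rep(list(c(0, 1)), 4))   # every 0/1 combination
y <- rowSums(outcomes)                           # total profit-making stocks

mean(y)   # 2: all 16 outcomes are equally likely

# Via the probability distribution of Y, as in lesson 24
py <- table(y) / length(y)        # 1/16, 4/16, 6/16, 4/16, 1/16
sum(as.numeric(names(py)) * py)   # also 2, matching E[X1] + ... + E[X4]
```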

Other simple rules when there are constants involved:

E[c] = c
E[cX] = cE[X]
E[aX + b] = aE[X] + b

Try to derive them as we did above.

My second thought

The stock market might be more complicated than a coin flip experiment. But what do I know; he clearly has a higher net worth than me. I guess since this is only one of his investment schemes, he is just playing around with his leftovers.

My final thought

I am only good for teaching probability; not using it like him. But again, most ivory tower professors only preach; don’t practice. Hey, they are protected. Why should they do real things?

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 24 – What else did you expect?

On a hot summer evening, a low energy Leo walked into his apartment building without greeting the old man smoking outside. Leo has just burnt his wallet in the Atlantic City Roulette game, and his mind has been occupied with how it all happened. He went in with $500. Ten chips of $50. His first gamble was to play safe and bet one chip at a time on red. He lost the first two times, won the third time, and lost the next two times. After five consecutive bets, he was left with $350. In the Roulette game, the payout for red or black is 1 to 1. He started getting worried. Since the payout for any single number is 35 to 1, in a hasty move, he went all in on number 20, just to realize that he was all out.

Could it be that luck was not favoring Leo on the day? Could it be that his betting strategy was wrong? Are the odds stacked against him? If he had enough chips and placed the same bet long enough, what can he expect?

Based on his first bet

Imagine Leo bets a dollar at a time on red. He will win or lose $1 each time. In American Roulette, there are 18 red, 18 black and two green (0 and 00) numbers, a total of 38 numbers. Each spin is independent, i.e., there is an equal chance of getting any number. The probability of getting a red is 18/38 (18 reds in 38 numbers). In the same way, the probability of getting a black is 18/38, and the probability of getting a green is 2/38.

If the Ivory ball ends in a red, he will win $1; if it ends in any other color, he will lose $1 - or he gets -$1. In the long run, if he keeps playing this game, a dollar at a time, his expected win for a dollar will be

E[X] = 1(18/38) + (-1)(20/38) = -2/38 = -$0.053

On average, for every $1 he bets on red, he will lose about 5 cents.
Based on his second bet

Now let us imagine Leo bets on a single number where the payout is 35 to 1. He will win $35 if the ball ends up in his number, or lose the dollar. The probability of getting any number is 1/38 (one number in 38 outcomes). Again, in the long run, if he keeps playing this game (win $35 or lose $1), his expected win for a dollar will be

E[X] = 35(1/38) + (-1)(37/38) = -2/38 = -$0.053

Although the payout is high, on average, for every $1 he bets on a single number, he will still lose about 5 cents.

This estimation we just did is called the Expected Value of a random variable. Just like how “mean” is a description of the central tendency for a sample data, the expected value (E[X]) is a descriptive quantity of the central tendency (average behavior) of a random variable X with a probability distribution function (f(x)).

In Leo’s case, X is the random variable describing his payout, x is an actual payout from the house ($1 or $35 if he wins, -$1 if he loses), and f(x) is the probability distribution of the outcomes (18/38 for red and 20/38 otherwise, or 1/38 for a single number and 37/38 otherwise). Putting these together,

E[X] = Σ x · f(x)

You will notice that this equation is exactly like the equation for the average of a sample. Imagine a very large sample with repeated values; we are adding the values and averaging over the groups.
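Leo’s two expected-value calculations can be checked with a few lines of Python. This is a minimal sketch; the `expected_value` helper and the bet lists are illustrative names, not part of the lesson.

```python
from fractions import Fraction

def expected_value(outcomes):
    """E[X] = sum of x * f(x) over all (payout, probability) pairs."""
    return sum(Fraction(x) * p for x, p in outcomes)

# Bet on red: win $1 with probability 18/38, lose $1 otherwise
red_bet = [(1, Fraction(18, 38)), (-1, Fraction(20, 38))]

# Bet on a single number: win $35 with probability 1/38, lose $1 otherwise
single_bet = [(35, Fraction(1, 38)), (-1, Fraction(37, 38))]

print(float(expected_value(red_bet)))     # -0.0526... : lose about 5.3 cents per dollar
print(float(expected_value(single_bet)))  # -0.0526... : same expected loss
```

Both bets come out to exactly -2/38 per dollar, which is why no betting strategy on this table changes the long-run outcome.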

Poor Leo expected this to happen but didn’t realize that the table was tilted and the game was rigged.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 23 – Let’s distribute the probability


Hey Joe, what are you up to these days?


Apart from visiting DC recently, life has been mellow over this summer. I am reading your lessons every week. I noticed there are several ways to visualize data and summarize it. Those were a nice set of data summary lessons.


Yes. Preliminaries in data analysis — visualize and summarize. I recently came across visuals with cute faces 🙂 I will present them at an appropriate time.


That is cool. On the way back from DC, we played the Chicago dice game. I remembered our conversation about probability while playing.



Interesting. How is the game played?


There are eleven rounds, numbered 2 – 12. In each round, we throw a pair of dice, trying to score the number of the round. For example, if on the first try I get a 1 and 1, I win a point because my first round score is 2. If I throw any number other than 2, I don’t win anything. The player with the highest total after 11 rounds wins the game.


I see. So there are 11 outcomes (2 – 12), and in each round you are trying to roll that round’s number. Do you know the probability distribution of these outcomes?


I believe you just used the question to present a new idea – “probability distribution“. Fine, let me do the Socratic thing here and ask “What is probability distribution“?

It is the distribution of the probability of the outcomes. In your Chicago dice example, you have a random outcome between 2 and 12; 2 if you roll a 1 and 1, 12 if you roll a 6 and 6. Each of these random outcomes has a probability of occurring. If you compute these probabilities and plot them, i.e., distribute the probabilities on a number line, you can see the probability distribution of this random variable.


Let me jump in here. There are 11 possible outcomes. I will tabulate the possibilities.

There are limited ways of achieving each outcome. The likelihood of an outcome is the ratio of the number of ways we can get that number to 36, the total number of equally likely rolls. An outcome of 2 can only be achieved if we get a (1,1). Hence the probability of getting 2 in this game is 1/36.
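The table of possibilities can also be built programmatically. A sketch in Python; `ways` and `f` are just illustrative names for the counts and the resulting probabilities:

```python
from collections import Counter
from fractions import Fraction

# Count the number of ways each sum 2..12 can occur with two dice
ways = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))

# f(x): probability of each outcome = ways / 36
f = {total: Fraction(count, 36) for total, count in ways.items()}

for total in range(2, 13):
    print(total, f[total])
```

The counts climb from 1 way (for 2) up to 6 ways (for 7) and back down to 1 way (for 12), which produces the triangular shape of the plot.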


Excellent, now try to plot these probabilities on a scale from 2 to 12.


Looking at the table, I can see the probability will increase as we go up from 2 to 7 and decrease from there till 12.


I like the way you named your axes. X and P(X = x). Your plot shows that there is a spike (which is the probability) for each possible outcome. The probability is 0 for all other outcomes. The spikes should add up to 1. This probability graph is called the probability distribution function f(x) for a discrete random variable.

The function can be summed to obtain the cumulative distribution function. Say you want to know the probability of getting an outcome less than 4. You can use the cumulative function, which is the sum of the probabilities over the outcomes 2 and 3. Just be watchful of the notation: the probability distribution function has a lowercase f, and the cumulative distribution function has an uppercase F.
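The running sum that gives the cumulative distribution can be sketched the same way (names here are illustrative; uppercase `F` stands for the cumulative function):

```python
from collections import Counter
from fractions import Fraction

# Rebuild f(x) for the sum of two dice
ways = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
f = {total: Fraction(count, 36) for total, count in ways.items()}

def F(x):
    """F(x) = P(X <= x): the running sum of the probability spikes."""
    return sum(p for total, p in f.items() if total <= x)

# P(X < 4) = f(2) + f(3) = 1/36 + 2/36 = 3/36
print(F(3))
```

Note that P(X < 4) is the same as F(3) here, because the outcomes are whole numbers.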


So if we know the function f(x), we can find out the probability of any possible event from it. These outcomes are discrete (2 to 12), and the function is also discrete for every outcome. What if the outcomes are continuous? How does the probability distribution function look if the random variable is continuous where the possibilities are infinite?


Okay, let us do a thought experiment. Imagine there are ten similar apples in a basket. What is the probability of taking any apple at random?


Since there are ten apples, the probability of taking one is 1/10.


What if there are n apples?



Then the probability of taking any one is 1/n. Why do you ask?


What happens to the probability if n is a very large number, i.e. if there are infinite possibilities?


Ah, I see. As n approaches infinity, the probability of picking any one of them approaches 0. So unlike discrete random variables, which have a defined probability for each outcome, for continuous random variables P(X = x) = 0. How then can we come up with a probability distribution function?


Recall how we did frequency plots. We partitioned the space into intervals or groups and recorded the number of observations that fall into each group. For continuous random variables, the proportion of observations in the group approaches the probability of being in the group. For a large n, we can imagine a large number of small intervals like this.

We can approximate this with a smooth curve and define the probability of a continuous variable falling in an interval between a and b as the area under the curve:

P(a ≤ X ≤ b) = ∫ from a to b of f(x) dx

The extension from the frequency plot to the probability distribution function is clear. Since the function is continuous, if we want the cumulative function, we integrate it:

F(x) = P(X ≤ x) = ∫ from -∞ to x of f(u) du
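To make the integral concrete, here is a sketch that numerically integrates an assumed example density, f(x) = 2x on [0, 1]. This density is my illustration, not one from the lesson; the midpoint-rule helper is likewise a hypothetical name.

```python
def f(x):
    """An example density: f(x) = 2x on [0, 1], which integrates to 1."""
    return 2 * x if 0 <= x <= 1 else 0.0

def prob(a, b, n=100_000):
    """Approximate P(a <= X <= b) = integral of f from a to b (midpoint rule)."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width

# For this density the exact answer is b^2 - a^2 = 0.25 - 0.04 = 0.21
print(prob(0.2, 0.5))
```

The same helper with a as the lower end of the support gives the cumulative function F(b).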


Great. You picked up many things today. Did you figure out the odds of scoring in your Chicago dice game, i.e., matching the round number at least once in your 11 tries?
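One way to attack that closing question is by simulation. A sketch, assuming the rules as described above: rounds 2 through 12, one throw of a pair of dice per round, counting a game as a success if at least one round’s throw matches its round number.

```python
import random

def round_matches(round_number, rng):
    """One throw of a pair of dice; True if the sum equals the round number."""
    return rng.randint(1, 6) + rng.randint(1, 6) == round_number

def at_least_one_match(rng):
    """Play rounds 2 through 12; True if any round's throw matched."""
    return any(round_matches(r, rng) for r in range(2, 13))

rng = random.Random(42)  # seeded for repeatability
trials = 100_000
hits = sum(at_least_one_match(rng) for _ in range(trials))
print(hits / trials)  # roughly 0.65
```

The exact answer is 1 minus the product of (1 - ways(r)/36) over all eleven rounds, about 0.654, and the simulation should land close to it.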


Lesson 22 – You are so random

Not just Pinkie Pie: outcomes of events that involve some level of uncertainty are also random. A random variable describes these outcomes as numbers. Random variables can take on different values, just like variables in math take different values.

If the possible outcomes are distinct numbers (e.g. counts), then these are called discrete random variables. If the possible outcomes can take on any value on the real number line, then these are called continuous random variables.

There are six possible outcomes (1, 2, 3, 4, 5 and 6) when you roll a die. Each number is distinct. We can define X as a random variable that can take any number between 1 and 6; hence it is finite and discrete. For any single roll, we can take x to be the outcome. Notice that we are using uppercase X for the random variable and lowercase x for the value it takes for a given outcome.

X is the set of possible values and x is an observation from that set.
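The uppercase/lowercase convention can be made concrete with a quick sketch (the names here are illustrative):

```python
import random

# X: the set of possible values for one roll of a six-sided die
X = [1, 2, 3, 4, 5, 6]

rng = random.Random(7)  # seeded for repeatability

# x: a single observed outcome drawn from the possibilities in X
x = rng.choice(X)

# a sample of observations is just repeated draws from X
sample = [rng.choice(X) for _ in range(10)]
print(x, sample)
```

Every observation lands somewhere in X, but X itself is the full set of possibilities, not any one draw.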

In lesson 20, we explored the rainfall data for New York City and Berkeley. Here, we can treat rainfall as a continuous random variable X on the number line. In other words, the rainfall in any year can be a random value on the line with 0 as the lower limit. Can you guess the upper limit for rainfall? The actual data we have are outcomes (x): observations, values on this random variable’s scale. Again, X is the set of possible values rainfall can take (infinite and continuous), and x is what we observed in the sample data.

In lesson 19, we looked at the SAT reading score for schools in New York City. Since the SAT reading score ranges from 200 (the participation trophy) to 800, in increments of 10, we can treat it as a finite, discrete random variable X. Any particular score we observe, for instance 670 for a student, is an observed outcome x.

If you are playing monopoly, the outcome of your roll will be a random variable between 2 and 12; discrete and finite; 2 if you get 1 and 1; 12 if you get 6 and 6, and all combinations in between.

In lesson 14, we plotted the box office revenue for STAR WARS films. We can assume this data as observations of a continuous random variable.

Do you think this random variable showing revenue can be negative? What if they lose money? Maybe not STAR WARS, but there are loads of terrible films that are negative random variables.

Can you think of other random variables that can be negative?

How about the national debt?

Are you old enough to have seen a surplus?
