September 2017 – dataanalysisclassroom

Lesson 34 – I’ll be back: The language of Return Period

The average time between Arnold Schwarzenegger’s being back is the return period of his stunt.

I have to wait 5 minutes for my next bus. Some days I wait for 1 minute; some days, I wait for 15 minutes. The wait time is variable → random variable 😉 The average of these wait times is the return period of my bus.

Your recent vocabulary may include “100-year event” (happening more often), (drainage system designed for) “10-year storm,” and so on, courtesy of mainstream media and news outlets.

Houston drainage grid ‘so obsolete it’s just unbelievable’

What exactly is this return period business?

Does a 10-year return period event occur diligently every ten years?

Can a 100-year event occur three times in a row?

THE LOGIC

Let’s visit annual maximum rainfall for Houston. If we take the daily rainfall data for each year from January 1 to December 31 and choose the maximum rainfall among these days, we call it annual maximum rainfall for that year. So this is the rainfall for the wettest day of the year. Likewise, if we do this for all the years that we have data for, we get a data series (also called time series since we are recording this in time units).
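As a rough sketch of how the annual maximum series is built, the snippet below groups daily records by year and keeps the largest value. The records are hypothetical numbers made up for illustration; they are not the Houston data.

```python
from collections import defaultdict

# Hypothetical (date, rainfall in inches) records, made up for illustration only.
daily_records = [
    ("1949-03-02", 1.2), ("1949-10-05", 8.4), ("1949-12-20", 0.3),
    ("1950-06-11", 3.7), ("1950-09-01", 2.9),
]

# Keep the largest daily rainfall seen in each year.
annual_max = defaultdict(float)
for date, inches in daily_records:
    year = int(date[:4])
    annual_max[year] = max(annual_max[year], inches)

print(dict(annual_max))
```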

You can get the data from here if you like. You may have to register using your email, but it’s free. We have 79 years of recorded data from 1939 to 2017. 79 data points, one number per year as the rainfall for the wettest day in that year. You will see that five years are missing between 1942 and 1946.

I want you to understand that these numbers represent a random variable X. Each number (outcome) is assumed to be independent, i.e., the occurrence of one event in one year does not influence the occurrence of the subsequent event. In other words, 2017 rainfall does not depend on 2016 rainfall.

Now, I want you to see Brays Bayou, the lake that detains excess rainfall in Houston. Let us assume that it can store up to eight inches of rainfall on any day. If it rains more than eight inches in a day, the Bayou will overflow and cause a flood, as we saw in Houston during Hurricane Harvey.

So, if the rainfall is greater than eight inches, we define this as an event. Let us call him Bob. The first time we saw Bob was in 1949. We started recording data in 1939. Bob happened after 11 years. The wait time for Bob (1949) is 11 years.

Then we get on with our lives, 11 years passed, Bob is not back, 22 years passed, no sign of Bob. Suddenly, after 30 years of waiting from 1949, Bob Strikes Back (1979).

Two years after this event happened, Bob wanted to greet the Millennials, so he came back in 1981. This time, the waiting period is only two years.

Then, in 1989, for no particular reason, Bob returns. The return of Bob (1989) is after eight years.

You must be thinking: “I don’t see any pattern here.” Yes, that is because there is none.

Years pass, Bob seems to be resting. At the turn of the century, Bob decided to come back. So Bob Meets the 21st Century in 2001 after 12 years since his prior occurrence. Bob re-occurs. Recurrence.

During the first decade of the 21st century, Bob re-occurs two times, once in 2006 as the Restless Bob (5-year wait time) and again in 2008 as Miss Me Yet, Bob (2-year wait time).

We all know what happened after that. Vengeant Bob (2017), aka Harvey, happened after nine years.

Now, let’s summarize all Bobs along with their recurrence times. We started with the assumption that the maximum rainfall events represent a random variable X. Let us define T as another random variable that measures the time between the event Bob (wait time or time to the next event or time to the first event since the previous event).

The return period of the event Bob (X > 8 inches) is the expected value of T, i.e., E[T], its average measured over a large number of such occurrences.

The eight wait times from the story are 11, 30, 2, 8, 12, 5, 2 and 9 years. Their average is 79/8 ≈ 9.9 years, so the return period of Bob is approximately ten years. Bob is a 10-year return period event.

Another way of thinking about this: Since there are eight Bob events in 79 years, they occur on average once every 79/8 ≈ 9.9 years. Approximately, once in 10 years. Hence originated the 10-year event concept.
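The averaging can be sketched in a few lines of Python. The event years and the 11-year first wait come from the story above; the variable names are illustrative.

```python
# Years in which Bob (annual maximum rainfall > 8 inches) occurred, as told above.
bob_years = [1949, 1979, 1981, 1989, 2001, 2006, 2008, 2017]

# The lesson counts the wait for the first Bob as 11 years from the start of the record.
first_wait = 11

# Every later wait time is the gap between consecutive Bob years.
later_waits = [b - a for a, b in zip(bob_years, bob_years[1:])]
wait_times = [first_wait] + later_waits

# The return period is the average of the wait times.
return_period = sum(wait_times) / len(wait_times)

print(wait_times)
print(return_period)
```

The wait times add up to 79 years over eight events, which is why both ways of thinking give the same answer.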

Remember, they don’t happen cyclically every ten years. If we average the wait times of a lot of events, we will get approximately ten years.

It is just like waiting for the bus: some days you wait a short time and some days a long time, but you think in terms of the average wait. In the same way, events can happen in a cluster or be spaced out, yet the wait times all average out to an n-year return period.

The relation to Geometric distribution

Last week when we learned Geometric distribution, I told you that we would relate the expected value of the Geometric distribution to return period of an event. Let’s see how Bob relates to Geometric distribution.

I want you to convert the maximum rainfall data series into a series of independent Bernoulli trials of 0s and 1s: 0 if the rainfall is ≤ eight inches (No Bob), 1 if the rainfall is > eight inches (Yes Bob). The 1s occur with some probability of occurrence p. In our example, since we have 8 Bobs in 79 years, the probability of occurrence is p = 8/79 = 0.101.

Now, assume T to be a random variable that measures the number of trials (years) it takes to see the first success (event), or the next event from each such event. For the first event, Bob (1949), it took 11 years to occur. The probability that T = 11, P(T = 11) is (1 – p)^10*p. Similarly, the next Bob happened after 30 years and so on. T is the time to first success (next success) → Geometrically distributed.
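To put numbers on these probabilities, here is a minimal sketch that evaluates (1 – p)^(k – 1)*p for the first two wait times. The function name is an assumption made for this example.

```python
p = 8 / 79  # estimated probability of Bob in any one year

def prob_first_event_at(k, p):
    """P(T = k): k - 1 years without the event, then the event in year k."""
    return (1 - p) ** (k - 1) * p

print(prob_first_event_at(11, p))  # wait time of the first Bob (1949)
print(prob_first_event_at(30, p))  # wait time of the second Bob (1979)
```

A 30-year wait is less likely than an 11-year wait, but it is far from impossible.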

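We can derive the expected value of T using the expectation operation we learned in lesson 24: E[T] = Σ k*P(T = k) = Σ k*(1 – p)^(k – 1)*p, summed over k = 1, 2, 3, …, which can be written as p*(1 + 2(1 – p) + 3(1 – p)^2 + …).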

Now, recall from your math classes that the expression inside the parenthesis looks like a power series. Ponder over it and confirm that the whole expression will reduce to

E[T] = 1/p
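If you want to check the power-series step, one way to carry it out is sketched below. The shorthand q = 1 – p is introduced here only for convenience.

```latex
\begin{aligned}
E[T] &= \sum_{k=1}^{\infty} k\,(1-p)^{k-1}\,p
      = p \sum_{k=1}^{\infty} k\,q^{k-1}, \qquad q = 1 - p \\
\sum_{k=1}^{\infty} k\,q^{k-1}
     &= \frac{d}{dq}\sum_{k=0}^{\infty} q^{k}
      = \frac{d}{dq}\left(\frac{1}{1-q}\right)
      = \frac{1}{(1-q)^{2}} = \frac{1}{p^{2}}, \qquad |q| < 1 \\
E[T] &= p \cdot \frac{1}{p^{2}} = \frac{1}{p}
\end{aligned}
```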

The expected value of the wait time that is Geometrically distributed is the inverse of the probability of the event. Since the probability of Bob is 0.101, the return period (expected value of the wait times) is 1/0.101 ~ ten years. A 10-year return period event.
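A quick numerical check of this result, assuming we are happy to cut the infinite sum off after 2000 terms (an arbitrary choice; the remaining tail is negligible for this p):

```python
p = 8 / 79

# Truncated version of E[T] = sum over k of k * (1 - p)^(k - 1) * p.
expected_wait = sum(k * (1 - p) ** (k - 1) * p for k in range(1, 2001))

print(expected_wait)  # close to 1 / p
print(1 / p)          # about 9.875 years
```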

The Question

We measured the probability over 79 years; n = 79. We assumed that the probability is constant over all the trials.

In other words, we are assuming that we know p and it does not change.

If I were writing this lesson last year, the probability would have been 7/78 = 0.090. Since Harvey (The Vengeant Bob), the probability became 8/79 = 0.101. There are also five missing years.
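To see how much one extra event moves the estimate, here is a small sketch that computes the return period from the two counts quoted above. The helper name is hypothetical.

```python
def return_period(events, years):
    """Return period 1/p, with p estimated as events / years."""
    return years / events

before_harvey = return_period(7, 78)  # record through 2016
after_harvey = return_period(8, 79)   # record through 2017

print(round(before_harvey, 1), round(after_harvey, 1))
```

A single year of data shifts the estimated return period by more than a year.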

Perhaps we do not know the true value of p, and perhaps it is not constant.

How then, can you estimate the risk of anything? How then, can you predict anything? How then, can you design anything?

If I haven’t confused you enough, let me end with one of my favorite quotes from Nassim Nicholas Taleb’s book Antifragile: Things That Gain from Disorder.

“It is hard to explain to naive data-driven people that risk is in the future, not in the past.”

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 33 – Trials to first success: The language of Geometric distribution

So goes the legendary story: Robert the Bruce (Robert I), King of Scotland, defeated the English armies on his seventh trial. He bore six successive defeats before that.

The Giants are yet to win their first game of the season. Their record: L, L, … I wonder how many games until their first win.

The last deadly hurricane that hit New York City was Sandy in 2012. We are now five years through without such a dangerous event.

The common thread tying the three examples is the number of trials to the first success. This is the language of the Geometric distribution.

If we consider independent Bernoulli trials of 0s and 1s with some probability of occurrence p and assume X to be a random variable that measures the number of trials it takes to see the first success, then, X is said to be Geometrically distributed.

We can get the success on the first trial, in which case X will be 1. We can see the success on the second trial, in which case the sequence will be 01, and X will be 2. We can see the success on the third trial; the sequence will be 001 and X will be 3 and so forth. As you can guess, X = {1, 2, 3, … }, positive integers.

There is some probability that X can take any integer value. We should also figure out this probability, i.e., P(X = 1), P(X = 2), P(X = 3), and so on.

Let us take the example of a coin toss. The outcomes are head or tail. Binary outcomes → Bernoulli trial. The probability p of a head is 0.5, and so is the probability of a tail. In other words, if you toss a coin a large number of times, say 100, roughly 50 of them will be heads, and 50 of them will be tails. Let’s play. Heads you win, tails you lose.

Great, you win on the first trial. The probability of seeing a head is 0.5. Hence, P(X = 1) = 0.5.

Let’s play again.

Ah, this time the first outcome is a tail and the second outcome is a head. You lose on the first trial but win on the second. It took two trials to win. X = 2. P(X=2) is P(tail on the first toss)*P(head on the second toss) = 0.5*0.5 = 0.25. Why did we multiply? What is P(A and B) for independent events?

One more time.

Now it took three trials to win. X = 3, and P(X = 3) = P(tail on the first toss)*P(tail on the second toss)*P(head on the third toss) = 0.5*0.5*0.5 = 0.125.

For X = 4, it will be P(tail on the first toss)*P(tail on the second toss)*P(tail on the third toss)*P(head on the fourth toss) = 0.5*0.5*0.5*0.5 = 0.0625.

If we now plot X and P(X = k), k being 1, 2, 3, 4, …, we get a probability distribution like this.

The height of the line at X = 2 is 0.5 times the height of the line at X = 1. In the same way, the height of the line at X = 3 is 0.5 times the height of the line at X = 2 and so on. P(X=k) decreases in a geometric progression. Hence the name Geometric distribution.

We can generalize this for any probability p. In our game, we estimated P(X = 1) as 0.5, i.e., the probability of seeing a head p. P(X=2) is 0.5*0.5, i.e., (1-p)*p. P(X = 10) = (1-p)^9*p. First success on the tenth toss is nine tails followed by a head.

More generally, P(X = k) = (1-p)^(k-1)*p, for k = 1, 2, 3, …

We can derive the expected value and variance of X as: E[X] = 1/p and V[X] = (1-p)/p^2.
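The pattern of k – 1 failures followed by one success can be written as a short function. The sketch below reproduces the coin-toss numbers and approximates the mean and variance by cutting the infinite sums off at 200 terms (an arbitrary cutoff); for a Geometric distribution these are known to be 1/p and (1-p)/p^2.

```python
def geometric_pmf(k, p):
    """P(X = k): first success on trial k after k - 1 failures."""
    return (1 - p) ** (k - 1) * p

p = 0.5  # fair coin
for k in range(1, 5):
    print(k, geometric_pmf(k, p))

# Approximate the mean and variance by truncating the infinite sums.
ks = range(1, 201)
mean = sum(k * geometric_pmf(k, p) for k in ks)
variance = sum((k - mean) ** 2 * geometric_pmf(k, p) for k in ks)
print(mean, variance)
```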

The expected value of a Geometric distribution relates to a special concept called return period → we will look at it next week.

Meanwhile, here are some more geometric probability distributions with different values of p.

[Figures: geometric probability distributions for p = 0.1, p = 0.3, p = 0.5, p = 0.7 and p = 0.9]

Notice how the shape changes with changing values of p. p is the parameter that controls the shape of the distribution. The greater the value for p, the steeper the fall.

If the probability of success is close to 1, the odds of winning in the first few trials are high → notice the height of the line for p = 0.9. If the probability of success is close to 0, it takes several trials to get to greater odds of winning overall.
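One way to make “odds of winning in the first few trials” concrete is the probability of at least one success within k trials, 1 – (1-p)^k. The sketch below tabulates it for the five values of p; the function name and the choice of k = 1, 3, 10 are assumptions for this example.

```python
def prob_win_within(k, p):
    """P(X <= k): at least one success in the first k trials."""
    return 1 - (1 - p) ** k

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p, [round(prob_win_within(k, p), 3) for k in (1, 3, 10)])
```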

Have you now conceptualized the idea of geometric distribution?

Let me challenge you to a bet then.

I have a coin toss game where I give you two times your bet if you win; you get nothing if you lose. Assuming we have a fair coin, would you play the game with me and bet your money? If you would, then what is your strategy, assuming you are in it to win?

Since I challenged you to a bet, I also looked into some lottery games myself at nylottery.ny.gov.

First observation

The odds of winning first prize in any of the games are next to 0. So if you plan to keep buying the tickets until you win the first time, and then retire, you now know that you will keep buying forever.

The chances of winning per game get better for lower prize levels. For example, in the Mega Millions, if you want to win the ninth prize, the odds are 1 in 21. Still low, and will take a long time to win.

But who wants to win the ninth prize? It is like saying “America Ninth.”

Second observation

I keep wondering why on earth the New York Government is running a lottery business … only to reconcile that “bread and circuses” have always been up the state’s sleeve to expand.


Lesson 32 – Exactly k successes: The language of Binomial distribution

I may be in one of those transit buses. Since I moved to New Jersey, I am going through this mess every day.

Well, you wanted to enjoy Manhattan skyline. It has a price tag.

D, glad you are here. It’s been a while. In our last meeting, we were discussing the concepts of variance operation and its properties. I continue to read your lessons every week. As I paused and reflected on all the lessons, I noticed that there is a systematic approach to them. You started with the basics of sets and probability, introduced lessons on visualizing, summarizing and comparing data using various statistics, then extended those ideas into random variables and probability distributions. The readership seems to have grown considerably, and people are tweeting about our classroom. Have you reached 25000 pageviews yet?

Great place to start learning #DataAnalysis is at https://t.co/OTGRiCsOTG by @realDevineni and thanks to @ThomasEWoods for mentioning it

— Kevin Stokes (@kevinstokes58) September 14, 2017

We are at 24350 pageviews now. We will certainly hit the 25k mark today 😉 I am thankful to all the readers for their time. Special thanks to all those who are spreading the word. Our classroom is a great resource for anyone starting data analysis.

So, what’s on the show today?

As you correctly pointed out, we are now slowly getting into various types of probability distributions. I mentioned in lesson 31 that we would learn several discrete probability distributions that are based on Bernoulli trials. We start this adventure with Binomial distribution.

Great. Let me refresh my memory of probability distributions before we get started. We discussed the basics of probability distribution in lesson 23. Let’s assume X is a random variable, and P(X = x) is the probability that this random variable takes any value x (i.e., an outcome). Then, the distribution of these probabilities on a number line, i.e., the probability graph is called the probability distribution function f(x) for a random variable. We are now looking at various mathematical forms for this f(x).

Fantastic. Now imagine you have a Bernoulli sequence of yes or no.

Sure. It is a sequence of 0s and 1s with a probability p; 0 if the trial yields a no (failure, or event not happening) and 1 if the trial yields a yes (success, or event happening). Something like this: 00101001101

From this sequence, if you are interested in the number of successes (1s) in n trials, this number follows a Binomial distribution. If you assume X is a random variable that represents the number of successes in a Bernoulli sequence of n trials, then this X should follow a binomial distribution. The probability that this random variable X takes any value k, i.e., the probability of exactly k successes in n trials is: P(X = k) = nCk*p^k*(1-p)^(n-k), where nCk = n!/(k!(n-k)!) is the binomial coefficient.

The expected value of this random variable, E[X] = np, and the variance V[X] = np(1-p).
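For readers who prefer code to lingo, here is a minimal sketch of the binomial probability and a check that its mean and variance come out as np and np(1-p). The values n = 10 and p = 0.3 are illustrative and not from the lesson.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k): exactly k successes in n independent Bernoulli trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.3  # illustrative values
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Mean and variance computed directly from the probabilities.
mean = sum(k * pk for k, pk in enumerate(probs))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(probs))

print(mean, n * p)
print(variance, n * p * (1 - p))
```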

😯 Wow, that’s a fastball. Can we parse through the lingo?

Oops… Okay, let us take the example of your daily commute. Imagine buses and cars pass through the tunnel each morning. Can you guesstimate the probability of buses?

Yeah, I usually see more buses than cars in the morning. Let’s say the likelihood of seeing a bus is p=0.7.

Now let us imagine that buses and cars come in a Bernoulli sequence. Assign a 1 if it is a bus, and 0 if it is a car.

That is reasonable. The vehicle passage is usually random. If we take that as a Bernoulli sequence, there will be some 1s and some 0s with a 0.7 probability of occurrence. In the long run, you will have 70% buses and 30% cars in any order.

Correct. Now think about this. In the next four vehicles that pass through the tunnel, how many of them will be buses?

Since there is randomness in the sequence, in the next four vehicles, I can say, all of them may be buses, or none of them will be buses or any number in between.

Exactly. The number of buses in a sequence of 4 vehicles can be 0, 1, 2, 3 or 4. These are the values that the random variable X can take. In other words, if X is the number of buses in 4 vehicles coming at random, then X can take 0, 1, 2, 3 or 4 as the outcomes. The probability distribution of X is binomial.

I understand how we came up with X. Why is the probability distribution of X called binomial?

It originates from the idea of the binomial coefficient that you may have learned in an elementary math/combinations class. Let us continue with our logical deduction to see how the probability is derived, and you will see why.

Sure. We have X as 0, 1, 2, 3 and 4. We should calculate the probability P(X = 0), P(X = 1), P(X = 2), P(X = 3) and P(X = 4). This will give us the distribution of the probabilities.

Take an example, say 2. Let us compute P(X = 2). The probability of seeing exactly two buses in 4 vehicles. The probability of exactly k successes in n trials. If the buses and cars come in a Bernoulli sequence (1 for bus and 0 for a car) with a probability p, in how many ways can you see two buses out of 4 vehicles?

Ah, I see where we are going with this. Let me list out the possibilities. Two buses in four vehicles can occur in six ways. 0011, 0101, 1100, 1010, 1001, 0110. In each of these six possible sequences, there will be exactly two buses among four vehicles. I remember from my combinations class that this is four choose two. Four factorial divided by the product of two factorial and (four minus two) factorial. 4C2 = 4!/(2!(4-2)!) = 6.

For each possibility, the probability of that sequence can also be written down. Let me make a table like this:

You can see from the table that there are six possibilities. Any of the possibilities, 1 or 2 or 3 or 4 or 5 or 6 can occur. Hence, the probability of seeing two in four is the sum of these probabilities. Remember, P(A or B) = P(A) + P(B) when A and B are mutually exclusive. If you follow through this, you will get 6*p*p*(1-p)*(1-p) = 6*p^2*(1-p)^(4-2). Can you see where the formula for binomial distribution comes from?
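The counting argument can be checked by brute force. The sketch below lists every sequence of four vehicles with exactly two buses and adds up their probabilities, using the p = 0.7 guessed earlier in the conversation.

```python
from itertools import product

p = 0.7  # probability that a vehicle is a bus

# All 0/1 sequences of four vehicles that contain exactly two buses (1s).
two_bus_sequences = [
    "".join(seq) for seq in product("01", repeat=4) if seq.count("1") == 2
]
print(two_bus_sequences)

# Each sequence has probability p*p*(1-p)*(1-p); they are mutually exclusive, so add them.
prob_two_buses = sum(p ** 2 * (1 - p) ** 2 for _ in two_bus_sequences)
print(prob_two_buses)  # same as 6 * p**2 * (1 - p)**2
```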

Absolutely. For each outcome of X, i.e., 0, 1, 2, 3 and 4, we should apply this logic/formula and compute the probability of the outcome. Let me finish it and make a plot.
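The numbers behind such a plot can be computed in a few lines. This is only a sketch for n = 4 vehicles and p = 0.7; the dictionary name is an assumption.

```python
from math import comb

n, p = 4, 0.7  # four vehicles, probability of a bus

# Probability of exactly k buses for every possible k.
distribution = {k: comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)}

for k, prob in distribution.items():
    print(k, round(prob, 4))
```

Three buses out of four is the most likely outcome, which fits a bus probability of 0.7.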

Very nicely done. Let me jump in here and show you another plot with a different n and p. If p = 0.5 (equal probability) and n = 100; this is what the binomial distribution looks like.

Nice. It looks like a bell centered around 50.

Yeah. You noticed that the distribution is centered around 50. It is the expected value of the distribution. Remember E[X] is the central tendency of the distribution. For binomial, you can derive it as np = 100 (0.5) = 50. In the same way, the variance, i.e., the spread of the function around this center, is np(1-p) = 100(0.5)(0.5) = 25, so the standard deviation is 5. You can see that the distribution is spread out within three standard deviations from the center. Can you now imagine what the distribution will look like for p = 0.3 or p = 0.7?

Following the same logic, those distributions will be centered on 100*0.3 = 30 and 100*0.7 = 70, each with a variance of 100*0.3*0.7 = 21. Now it all makes sense.
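As a rough check on the “three standard deviations” remark, the sketch below adds up the binomial probabilities between 35 and 65 for n = 100 and p = 0.5. The variable names are illustrative.

```python
from math import comb, sqrt

n, p = 100, 0.5
probs = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]

mean = n * p
sd = sqrt(n * p * (1 - p))

# Probability mass within three standard deviations of the center (35 to 65).
low, high = int(mean - 3 * sd), int(mean + 3 * sd)
within_three_sd = sum(probs[low:high + 1])

print(mean, sd, within_three_sd)
```

Nearly all of the probability sits inside that range.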

You see how easy it is when you go through the logic. We started with Bernoulli sequence. When we are interested in the random variable that is the number of successes in so many trials, it follows a binomial distribution. “Exactly k successes” is the language of Binomial distribution. Can you think of any other examples that can be modeled as a binomial distribution?

Probability that Derek Jeter, with a batting average of 0.3, gets three hits out of the three times he comes to bat 😆 This is fun. I am glad I learned some useful concepts out of the messy commute experience. By the way, Exactly one landfall in the next four hurricanes is also binomial. With Jose coming up, I wonder if we can compute the probability of damage for New York City based on the probability of landfall.

Don’t worry Joe. Our Mayor is graciously implementing his comprehensive $20 billion resiliency plan. NYC is safe now. Forget probability of damage. You need to worry about the probability of bankruptcy.


Lesson 31 – Yes or No: The language of Bernoulli trials

Downtown Miami will be flooded due to hurricane Irma.

Your vehicle will pass the inspection test this year.

Each toss of a coin results in either a head or a tail.

Did you notice that I am looking for an answer, an outcome that is “yes” or “no”? We often summarize data as the occurrence (or non-occurrence) of an event in a sequence of trials. For example, if you are designing dikes for flood control in Miami, you may want to look at the sequence of floods over several years to analyze the number of events, and the rate at which they occur.

There are two possibilities, a hit (event occurred – success) or miss (event did not occur – failure). A yes or a no. These events can be represented as a sequence of 0’s and 1’s (0001100101001000) called Bernoulli trials with a probability of occurrence of p. This probability is constant over all the trials, and the trials themselves are assumed to be independent, i.e., the occurrence of one event does not influence the occurrence of the subsequent event.

Now, imagine these outcomes, 0’s or 1’s can be represented using a random variable X. In other words, X is a random variable that takes the value 1 with probability p and 0 with probability 1 – p. If in Miami, there were ten extreme flood events in the last 100 years, the sequence will have 90 0’s and 10 1’s in some order. The probability of the event is hence 0.1. If the probability is 0.5, then, in a sequence of 100 trials (coin tosses for example), you will see 50 heads on average. We can derive the expected value of X and the variance of X as follows: E[X] = 0*(1 – p) + 1*p = p, and V[X] = E[X^2] – (E[X])^2 = p – p^2 = p(1 – p).
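The expected value and variance of a single Bernoulli trial can be worked out directly from its two outcomes. Here is a minimal sketch using p = 0.1 from the Miami example; the dictionary layout is an assumption for illustration.

```python
p = 0.1  # ten floods in 100 years, as in the Miami example

# X takes the value 1 with probability p and 0 with probability 1 - p.
outcomes = {0: 1 - p, 1: p}

expected_value = sum(x * prob for x, prob in outcomes.items())
variance = sum((x - expected_value) ** 2 * prob for x, prob in outcomes.items())

print(expected_value, variance)  # p and p * (1 - p)
```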

Since the Bernoulli trials are independent, the probability of a sequence of events happening will be equal to the product of the probability of each event. For instance, the probability of observing a sequence of No Flood, No Flood, No Flood and Flood over the last four years is 0.9*0.9*0.9*0.1 = 0.0729 (assuming p = 0.1).
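The same multiplication works for any sequence of 0’s and 1’s. A small sketch, assuming p = 0.1 and the four-year sequence from the example:

```python
from math import prod

p = 0.1  # probability of a flood in any one year
sequence = [0, 0, 0, 1]  # No Flood, No Flood, No Flood, Flood

# Independence lets us multiply the per-year probabilities.
sequence_probability = prod(p if outcome == 1 else 1 - p for outcome in sequence)

print(round(sequence_probability, 4))
```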

Bernoulli trials form the basis for deriving several discrete probability distributions that we will learn over the next few weeks.

While you ponder over what these distributions are, their mathematical forms, and how they represent the variation in the data, I will leave you with this image of the daily rainfall data from Miami International Airport. An approximate 6.38 inches of rain (~160mm/day) is forecasted for Sunday. Notice how you can remap the data into a sequence of 0’s (if rain is less than 160) and 1’s (if rain is greater than 160).
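The remapping itself is a one-liner. The rainfall values below are hypothetical, made up for illustration only; they are not the Miami International Airport record.

```python
threshold_mm = 160  # forecast amount quoted above

# Hypothetical daily rainfall values in mm, made up for illustration only.
daily_rain_mm = [12.0, 0.0, 175.3, 48.1, 0.0, 201.7, 5.4]

# 1 if the day's rain exceeds the threshold, 0 otherwise.
bernoulli_sequence = [1 if rain > threshold_mm else 0 for rain in daily_rain_mm]

print(bernoulli_sequence)
```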

After tomorrow, when you hear “unprecedented rains” in the news, keep in mind that we seek the historical sequence data like this precisely because our memory is weak.

