Let me introduce you to two friends, Joe and Devine.

Joe is a curious kid, funny, feisty and full of questions. He is sharp and engaging and always puts in an honest effort to understand things.

Devine is mostly boring, but thoughtful. He believes in reason and evidence and the scientific method of inquiry.

Joe and Devine stumbled upon our blog and are having a conversation about Lesson 5.

**Joe**: Have you seen the recent post on the data analysis classroom? There was an interesting question about how many warm days there are in February.

**Devine**: What do you mean by warm? Where is it warm?

**Joe**: Look, Devine, I know you always want to think from first principles. Can you stop being picky and cut to the chase here?

**Devine**: Okay, how many warm days are there in February?

**Joe**: Looks like there were seven warm days last month. That seems like a high number.

**Devine**: Maybe. Do you know the probability of warm days in February?

**Joe**: What is “**probability**”?

**Devine**: Let us say it is the extent to which something is probable. How likely is a warm day in February? In other words, how frequently will a warm day occur in February?

**Joe**: I can see there are seven warm days in February out of 28 days. So that means the frequency, or likelihood, is seven in 28. 7/28 = 0.25. So the probability is 25%.

**Devine**: What you just computed is for February 2017. How about February 2016, 2015, 2014, 2013, … ?

**Joe**: I see your point. When we compute the frequency, we are trying to see **how many times an event (in this case, a warm day in February) is occurring out of all possibilities for that event**.

**Devine**: Yes. Let us say there is a **set of February days**: 28 in 2017, 29 in 2016, 28 in 2015, and so on. These are **all possible February days**. Among all these days, we see **how many of them are warm days**.

**Joe**: So you mean the frequency in the long run?

**Devine**: Yes, the *probability of an event is its long-run relative frequency*.
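In symbols, Devine's definition can be sketched as below. The names $n$ (the number of February days observed) and $n_{\text{warm}}$ (how many of those days were warm) are notation introduced here for illustration; they do not appear in the original lesson.

$$
P(\text{warm day}) = \lim_{n \to \infty} \frac{n_{\text{warm}}}{n}
$$

Joe's figure of 7/28 = 0.25 is this ratio for a single year ($n = 28$), so it is an estimate from a small sample rather than the long-run value.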

**Joe**: Let me get the data for all the years and see how many warm days there are in February.

**Devine**: That is a good idea. When you get the data, try the following experiment to see **how the estimated probability approaches the true probability** as you increase the sample size.

Step 1: Compute the number of warm days in February 2017. Let us call this warmdays2017. So the probability of warm days is p = warmdays2017 divided by 28.

Step 2: Compute the number of warm days in February 2016. Let us call this warmdays2016. Using this extended sample size, we calculate the probability of warm days as p = (warmdays2017 + warmdays2016) divided by 57; 28 days in 2017 and 29 days in 2016. Here you have more outcomes (warm days) and more opportunities (February days) for these outcomes.

Step 3: Do this for as many years as you can get data, and plot the estimated probability against the growing number of years to see it approach the true probability.
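Steps 1 to 3 amount to keeping a running total of warm days and a running total of February days, then dividing one by the other each time a year is added. Here is a minimal sketch of that idea, written in Python rather than the R that Devine mentions. The function name `running_probability` is made up for this example, and apart from Joe's count of seven for 2017, the warm-day counts are invented placeholders, not real weather data.

```python
import calendar


def running_probability(warm_days_by_year):
    """Return [(year, estimate), ...], starting from the most recent year.

    Each estimate uses every year from the most recent one back to `year`:
    (warm days so far) / (February days so far).
    """
    estimates = []
    total_warm = 0
    total_days = 0
    for year in sorted(warm_days_by_year, reverse=True):
        total_warm += warm_days_by_year[year]
        # monthrange returns (weekday of the 1st, number of days): 28 or 29 for February
        total_days += calendar.monthrange(year, 2)[1]
        estimates.append((year, total_warm / total_days))
    return estimates


# 2017 uses Joe's count of seven; 2016 and 2015 are invented placeholders.
warm_days = {2017: 7, 2016: 3, 2015: 0}

for year, p in running_probability(warm_days):
    print(f"data up to {year}: p = {p:.3f}")
```

With real counts for every year back to 1949, the list this returns would hold the points for the plot in Step 3.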

**Joe**: I got the logic, but this looks like a lot of counting for me. Is there an easier way?

**Devine**: R can help you do this easily. Perhaps, you can wait till the data analysis guy posts another lesson of tricks in R. For now, let me help you.

**Joe**: Great.

**Devine**: Okay, here is what I got after running this experiment.

**Joe**: This is pretty. Let me try to interpret it. The x-axis is labeled “Years up to” and shows 2017, 2016, …, 1949. So in each step, you are considering the data up to that year: up to 2017, up to 2016, up to 2015, and so on. On the y-axis, you have the probability of warm days in February. At each step, you compute the probability with a larger sample size, so you have a better idea of the likelihood of warm days because there are more outcomes and opportunities.

**Devine**: Exactly. What else do you observe?

**Joe**: There is a red line somewhere around 0.02, and the probabilities approach this red line as the sample size grows.

**Devine**: Great, the red line is at 0.027, which is the long-run relative frequency: the **probability of warm days in February is 0.027**. Notice that the probability does not vary much and looks like a stable line after you go up to the 2000s. This is telling us that we need a large enough sample size to get a reliable measure of the probability.
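Devine's point about sample size can also be explored with a small simulation. The sketch below is in Python rather than R. It assumes every day is warm independently with probability 0.027 (the value Devine reads off the red line), which real weather will not satisfy exactly, and the function name `running_estimates` is made up for this example.

```python
import random


def running_estimates(p_true, n_days, seed=0):
    """Simulate n_days independent days, each warm with probability p_true.

    Returns the running fraction of warm days after each simulated day.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same sequence
    warm_so_far = 0
    estimates = []
    for day in range(1, n_days + 1):
        if rng.random() < p_true:
            warm_so_far += 1
        estimates.append(warm_so_far / day)
    return estimates


# Assumption: 0.027 is treated as the "true" probability for the simulation.
estimates = running_estimates(p_true=0.027, n_days=200_000)

for n in (28, 280, 2_800, 200_000):
    print(f"after {n:>7} days: estimate = {estimates[n - 1]:.4f}")
```

Under these assumptions, the early estimates jump around and the later ones settle near 0.027, which is the behavior Joe describes in the plot.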

**Joe**: What happens to the probability if we have 50 more years of data?

**Devine**: ah…

**Joe**: Wait, I thought the probability based on 2017 data is 0.25. Why is the first point on the plot at 0.15?

**Devine**: Joe, you are the curious kid. Go figure it out.

In case you are wondering who Devine is,

It’s **probably** me.

*You can also follow me on Twitter @realDevineni for updates on new lessons.*

…and I may be Joe and I don’t know it yet, because I asked myself the same question…