Lesson 30 – Pause and rewind

Why am I doing this?

What you just read is my blog post from January 1, 2017. It is what happens on New Year’s Day when you don’t have a social life.

Over the last few years, I have concluded that the largest crowd I can ever reach, if I continue delivering lectures in the usual university setting, is 80 per semester. I am sure some of them are not there by choice. Since I believe that I can distill ideas into easily understandable forms, I felt the urge to spread my voice to a larger audience. Hence our Data Analysis Classroom.

I usually emphasize the importance of collecting more data for understanding the uncertainties in the system and using them in the final decision-making process. Practicing the preachings, after 29 lessons, I want to pause, reflect on the readership data and rewind what we learned.

We started this journey on the fifth day of February 2017. After 202 days, i.e., six months and 21 days, the monthly readership data is

We more than made up for the slump in July. The total page views in this time are 19051 with average monthly page views of 2721. Being very conservative, I would like to think that the blog may only capture the interest of 0.5% of the readers. That will be approximately 95 more people I could reach in the six month period. As I said, that is easily more than who I can reach through a class in a semester. I just hope these 95 don’t include the same folks in the class who are already captivated!

Here is a map of the readership. If you are in one of the blue countries, thank you for your attention. Let’s get more people on board. They will like it too.

What did we learn so far?

Uncertainties surround us. Understanding where they arise from and recognizing their extent is fundamental to improving our knowledge about anything.

We started off with the fact that one needs to observe the system to understand it better. Ask for data. The more, the merrier.

Lesson 1: When you see something, say data. A total of 412 page views.

Data may be grouped in sets → collection of elements. We have various forms to visualize the sets. Unions, intersections, subsets, and their properties.

Lesson 3: The Setup. 321 page views.
Lesson 4: I am a visual person. 301 page views.

We defined probability and understood that there are some rules (axioms) of probability.

Lesson 6: It’s probably me. 377 page views.
Lesson 7: The nervousness axiom – fight or flight. 253 page views.

We learned conditional events, independent events and how the probability plays out under these events.

Lesson 9: The necessary condition for Vegas. 813 page views.
Lesson 10: The fight for independence. 641 page views.

We now know the law of total probability and the Bayes theorem.

Lesson 12: Total recall. 471 page views.
Lesson 13: Dear Mr. Bayes. 2280 page views — most popular lesson so far.

In lessons 14 through 20, you can learn the basics of exploratory data analysis: visualization techniques and summary statistics.

Lesson 14: The time has come; execute order statistics. A lesson to explain the concept of percentiles and boxplots. 327 page views.

Lesson 15: Another brick in the wall — for building histograms. 297 page views.

Lesson 16: Joe meets the average. Explains the mean of the data. 218 page views.

Lesson 17: We who deviate from the norm. Explains the idea of variance and standard deviation. 159 page views.

Lesson 18: Data comes in all shapes. A short lesson explaining the concept of skewness. 116 page views.

Lesson 19: Voice of the outliers. Outliers are significant. As Nassim Nicholas Taleb puts it, “Don’t be a turkey” by removing them. 87 page views.

Lesson 20: Compared to what? Use the coefficient of variation to compare data. 115 page views.

Lessons 22 to 28 introduce the idea of random variables and probability distribution.

Lesson 22: You are so random. The basic idea of discrete and continuous random variables. 109 page views.

Lesson 23: Let’s distribute the probability. This lesson teaches the concept of the probability distribution. 126 page views.

Lesson 24: What else did you expect? 130 page views.
Lesson 25: More expectation. 95 page views. These two lessons go through the expected value of a random variable.

Lesson 26: The variety in consumption. 175 page views.
Lesson 27: More variety. 568 page views. These two lessons are for understanding the concept of the variance of a random variable.

Lesson 28: Apples and Oranges. How to standardize the data for comparison. 325 page views.

Lessons in R

As I mentioned in lesson 2, R is your companion in this journey through data. Computer programming is very very (cannot emphasize enough “very”) essential for data analysis. Wherever required, I have provided brief lessons on how to use R for data analysis.

Lesson 2: R is your companion. The very first lesson to get going with R. 434 page views.

Lesson 5: Let us Review. Reading data files and more fun stuff. 228 page views.

Lesson 8: The search for wifi. Learning for and if else statements in R. 194 page views.

Lesson 11: Fran, the functionin R-bot. The essentials of writing functions in R. 332 page views.

Lesson 21: Beginners guide to summarize data in R. A step-by-step exploratory data analysis. 1786 page views.

Lesson 29: Large number games in R. 933 page views.

Where do we go from here?

A long way.

While you pause and reflect on these lessons, I will pause and come back with lesson 31 on the ninth day of September 2017. Help spread the word as we build this knowledge platform one lesson at a time while the university system becomes obsolete.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 29 – Large number games in R

“it cannot escape anyone that for judging in this way about any event at all, it is not enough to use one or two trials, but rather a great number of trials is required. And sometimes the stupidest man — by some instinct of nature per se and by no previous instruction (this is truly amazing) — knows for sure that the more observations of this sort that are taken, the less danger will be of straying from the mark.”

wrote James Bernoulli, a Swiss mathematician, in his thesis Ars Conjectandi in 1713.

Suppose we have an urn with an unknown number of white and black pebbles. Can we take a sample of 10 pebbles, count the number of white and black ones, and deduce the proportion in the container? If 10 is too small a sample, how about 20? How small is too small? James Bernoulli used this illustration to understand, through observations, the likelihood of diseases in the human body.

How many sample observations do we need before we can confidently say something is true?

In lesson 6 and lesson 23, we saw that the probability of an event is its long run relative frequency. How many observations should we get before we know the true probability?

Do data from different probability distribution functions take the same number of sample observations to show this convergence to the truth?

Do we even know if what we observe in the long run is the actual truth? Remember black swans did not exist till they did.

We will let the mathematicians and probability theorists wrestle with these questions. Today, I will show you some simple tricks in R to simulate large number games. We will use three examples:

tossing a coin,

rolling a die, and

drawing pebbles from the urn.

All the code in R is here. Let’s get to the nuts and bolts of our toy models.

Coin Toss

If you toss a fair coin, the probability of getting a head or a tail is 0.5 each, and successive tosses are independent. By now you should know that it is 0.5 in the long run. In other words, if you toss a coin many times and count the number of heads and the number of tails, the probability of a head, i.e., the number of heads/N, is approximately 0.5.

Let us create this experiment in R. Since there is a 0.5 probability of getting a head or a tail, we can use the random number generator to simulate the outcomes. Imagine we have a ruler on the scale of 0 to 1, and we simulate a random number with equal probability, i.e., any number between 0 and 1 is equally possible. If the number is less than 0.5, we assume heads; otherwise, we assume tails.

A random number between 0 and 1 with equal probability can be generated using the runif command in R.

# Generate 1 uniform random number
r = runif(1)

[1] 0.3627271

# Generate 10 uniform random numbers
r = runif(10)

 [1] 0.7821440 0.8344471 0.3977171 0.3109202 0.8300459 0.7474802 0.8777750
 [8] 0.7528353 0.9098839 0.3731529

# Generate 100 uniform random numbers
r = runif(100)

The steps for this coin toss experiment are:
1) Generate a uniform random number r between 0 and 1
2) If r < 0.5, choose H
3) If r ≥ 0.5, choose T
4) Repeat the experiment N times → N coin tosses.

Here is the code.

# coin toss experiment 
N = 100

r = runif(N)

x = matrix(NA,nrow = N,ncol = 1)
for (i in 1:N)
{
 if(r[i] < 0.5 )
 { x[i,1] = "H" } else 
 { x[i,1] = "T" }
}

# Probability of Head 
P_H = length(which(x=="H"))/N
print(P_H)

# Probability of Tail 
P_T = length(which(x=="T"))/N
print(P_T)

We first generate 100 random numbers with equal probability, look through each number, and assign H or T based on whether the number is less than 0.5 or not, and then count the number of heads and tails. Some of you remember the which and length commands from lesson 5 and lesson 8.
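The same experiment can also be written without an explicit loop. The following is a minimal sketch rather than the lesson's own code; the fixed seed is an assumption added only so the run is reproducible, and ifelse applies the H/T rule to every element of r at once.

```r
# Vectorized coin toss: a sketch, not the lesson's original code
set.seed(29)   # assumed seed, only for reproducibility
N = 100
r = runif(N)   # N uniform random numbers between 0 and 1

# ifelse() applies the rule to each element of r
x = ifelse(r < 0.5, "H", "T")

# Relative frequencies of heads and tails
P_H = sum(x == "H")/N
P_T = sum(x == "T")/N
print(c(P_H, P_T))
```

Since every toss is either H or T, the two relative frequencies always add up to 1.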

The beauty of using R is that it offers shortcuts for lengthy procedures. What we did using the for loop above can be much easier using the sample command. Here is how we do that.

We first create a vector with H and T names (outcomes of a toss) to select from.

# create a vector with H and T named toss to select from 
toss = c("H","T")

Next, we can use the sample command to draw a value from the vector. This function is used to randomly sample a value from the vector (H T), just like tossing.

# use the command sample to sample 1 value from toss with replacement 
sample(toss,1,replace=T) # toss a coin 1 time

You can also use it to sample/toss multiple times, but be sure to use replace = TRUE in the function. That way, the function replicates the process of drawing a value, putting it back and drawing again, so both H and T remain available with the same probability on every draw.
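To see why replacement matters, here is a small sketch comparing the two settings. The variable msg is a name introduced here for illustration; tryCatch is used only to capture the error message instead of stopping the script.

```r
toss = c("H","T")

# With replacement, any number of tosses is possible
x = sample(toss, 10, replace = TRUE)
print(x)

# Without replacement, the vector runs out after two draws,
# so asking for 10 values is an error; tryCatch() captures the message
msg = tryCatch(sample(toss, 10, replace = FALSE),
               error = function(e) conditionMessage(e))
print(msg)
```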

# sampling 100 times will toss a coin 100 times. use replace = T
sample(toss,100,replace=T) # toss a coin 100 times

 [1] "H" "H" "H" "H" "T" "T" "H" "T" "T" "H" "T" "H" "H" "H" "T" "T" "H" "H"
 [19] "T" "H" "H" "H" "H" "T" "T" "T" "T" "H" "H" "H" "T" "H" "H" "H" "T" "H"
 [37] "H" "T" "T" "T" "T" "T" "H" "T" "T" "H" "T" "H" "H" "T" "H" "H" "T" "H"
 [55] "H" "T" "H" "T" "H" "H" "H" "T" "T" "T" "T" "H" "H" "H" "H" "T" "H" "T"
 [73] "T" "H" "H" "T" "H" "H" "T" "T" "T" "T" "H" "T" "H" "T" "T" "H" "T" "H"
 [91] "H" "T" "T" "H" "T" "H" "T" "H" "H" "T"

To summarize the number of heads and tails, you can use the table command. It is a bean counter. With 100 tosses, you should expect roughly 50 H and 50 T; the proportions get closer to half and half as the number of tosses grows.

# to summarize the values use the table command.
x = sample(toss,100,replace=T) # toss a coin 100 times
table(x) # summarize the values in x

 H T 
51 49
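
The counts from table can be turned into relative frequencies directly. A minimal sketch, using the base R function prop.table; the names counts and props are introduced here for illustration.

```r
toss = c("H","T")
x = sample(toss, 100, replace = TRUE)  # toss a coin 100 times

counts = table(x)            # number of H and T
props = prop.table(counts)   # counts divided by the total number of tosses
print(props)
```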

A large number of simulations

Set up an experiment with N = 5, 15, 25, 35, 45, … , 19995 and record the probability of obtaining a head in each case. Type the following lines in your code and execute. See what you get.

# Large Numbers Experiment 
# Tossing Coin

toss = c("H","T")

number_toss = seq(from = 5, to = 20000,by=10) # increasing number of tosses

P_H = matrix(NA,nrow=length(number_toss),ncol=1) # create an empty matrix to fill probability each time

P_T = matrix(NA,nrow=length(number_toss),ncol=1) # create an empty matrix to fill probability each time

for (i in 1:length(number_toss))
{
 x = sample(toss,number_toss[i],replace=T)
 P_H[i,1]=length(which(x=="H"))/number_toss[i]
 P_T[i,1]=length(which(x=="T"))/number_toss[i] 
}

plot(number_toss,P_H,xlab="Number of Tosses",ylab="Probability of Head",type="l",font=2,font.lab=2)
abline(h=0.5,col="red",lwd=1)

Notice that for a small number of coin tosses, the estimated probability of a head is highly variable (not stable). But as the number of coin tosses increases, the estimate converges and stabilizes at 0.5.
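
Another way to look at the same convergence is to follow one long sequence of tosses and track the running proportion of heads, instead of repeating separate experiments of increasing size. This is a sketch of that alternative, not the lesson's own code; the seed and the name running_P_H are assumptions made for illustration.

```r
set.seed(1713)   # assumed seed, only for reproducibility
N = 20000
toss = c("H","T")
x = sample(toss, N, replace = TRUE)   # one long sequence of tosses

# Running proportion of heads after 1, 2, ..., N tosses
running_P_H = cumsum(x == "H")/(1:N)

plot(1:N, running_P_H, type = "l", xlab = "Number of Tosses",
     ylab = "Running Proportion of Heads", font = 2, font.lab = 2)
abline(h = 0.5, col = "red", lwd = 1)
```

Because each point reuses all the previous tosses, this curve is smoother than the one from separate experiments, but it tells the same story.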

Dice

You should be able to write the code for rolling a die using the same logic. Try it. Create a vector [1, 2, 3, 4, 5, 6] and sample from this vector with replacement.

No cheating by looking at my code right away.

Here is the large number experiment on this. Did you see the convergence to 1/6?
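
If you have tried it and want something to compare against, here is one possible sketch that follows the coin toss logic. It is not necessarily the same as the linked code; the names dice, number_rolls and P_1, the seed, and the choice to track the face 1 are assumptions made for illustration.

```r
# Large Numbers Experiment
# Rolling a die (a sketch following the coin toss code)
set.seed(6)   # assumed seed, only for reproducibility

dice = c(1, 2, 3, 4, 5, 6)

number_rolls = seq(from = 5, to = 20000, by = 10)  # increasing number of rolls

P_1 = numeric(length(number_rolls))  # empty vector to fill the probability each time

for (i in seq_along(number_rolls))
{
 x = sample(dice, number_rolls[i], replace = TRUE)
 P_1[i] = sum(x == 1)/number_rolls[i]   # relative frequency of rolling a 1
}

plot(number_rolls, P_1, xlab = "Number of Rolls", ylab = "Probability of 1",
     type = "l", font = 2, font.lab = 2)
abline(h = 1/6, col = "red", lwd = 1)
```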

Pebbles

Let us first create a vector of size 10 (an urn with pebbles) with six white and four black pebbles, assuming the true ratio of white to black is 3/2. Run the sampling experiment on this vector by drawing more and more samples and estimate the probability of a white pebble in the long run.

# Show that the probability of getting the color stones approaches the true probability
# Pebbles

stone = c("W","W","W","W","W","W","B","B","B","B")

number = seq(from = 5, to = 20000,by=10) # increasing number of draws

P_W = matrix(NA,nrow=length(number),ncol=1) # create an empty matrix to fill probability each time
P_B = matrix(NA,nrow=length(number),ncol=1) # create an empty matrix to fill probability each time

for (i in 1:length(number))
{
 x = sample(stone,number[i],replace=T)
 P_W[i,1]=length(which(x=="W"))/number[i]
 P_B[i,1]=length(which(x=="B"))/number[i]
}

plot(number,P_W,type="l",font=2,font.lab=2,xlab="Number of Draws with Replacement",ylim=c(0,1),ylab="Probability of a White Pebble")
abline(h=(0.6),col="red")

If there were truly three white pebbles for every two black pebbles in the world, an infinite (very very large) experiment/simulation would reveal a convergence of the probability of a white pebble to 3/5 = 0.6. This large number experiment is our quest for the truth, a process to surmise randomness.
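
To get a feel for "how small is too small", one can repeat the draw many times at a few sample sizes and see how much the estimate of the probability of a white pebble moves around. This is only a sketch; the sample sizes, the 500 repetitions, the seed, and the name spread are all assumptions chosen for illustration.

```r
set.seed(10)   # assumed seed, only for reproducibility
stone = c("W","W","W","W","W","W","B","B","B","B")

sizes = c(10, 20, 100, 1000)   # illustrative sample sizes

# For each sample size, repeat the experiment 500 times and
# record the standard deviation of the estimated probability of W
spread = sapply(sizes, function(n) {
  est = replicate(500, sum(sample(stone, n, replace = TRUE) == "W")/n)
  sd(est)
})
names(spread) = sizes
print(round(spread, 3))
```

The spread of the estimate shrinks as the sample size grows, which is the same stabilization seen in the plot.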

You can now go and look at the full code and have fun programming it yourself. I will close off with this quote from James Bernoulli’s Ars Conjectandi.

“Whence, finally, this one thing seems to follow: that if observations of all events were to be continued throughout all eternity, (and hence the ultimate probability would tend toward perfect certainty), everything in the world would be perceived to happen in fixed ratios and according to a constant law of alternation, so that even in the most accidental and fortuitous occurrences we would be bound to recognize, as it were, a certain necessity and, so to speak, a certain fate.”

I continue to wonder what that “certain fate” is for me.


Lesson 28 – Apples and Oranges

The prices of golden delicious apples sold yesterday at selected U.S. cities’ terminal markets are

The prices of navel oranges sold yesterday at selected U.S. cities’ terminal markets are

I want to compare apples and oranges, needless to say, prices. I will assume the prices of apples to be a random variable and will plot the probability distribution function. I will consider the prices of oranges to be another random variable and will plot its probability distribution function. Since the prices can be defined on a continuous number scale, I am assuming continuous probability distribution functions that look like this

Notice the two random variables are on a different footing. Apples are centered on $44 with a standard deviation (spread) of $9. Oranges are centered on $25 with a standard deviation of $5.

If our interest lies in working simultaneously with data that are on different scales or units, we can standardize them, so they are placed on the same level or footing. In other words, we will move the distributions from their original scales to a new common scale. This transformation will enable an easy way to work with data that are related, but not strictly comparable. In our example, while both the prices data are in the same units, they are clearly on different scales.

We can re-express the random variable (data) as standardized anomalies so they can be compared or analyzed. This process of standardization can be achieved by subtracting the mean of the data and dividing by the standard deviation. For any random variable, if we subtract the mean and divide by the standard deviation, the expected value and the variance of the new standardized variable is 0 and 1. Here’s why.
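
The derivation is not reproduced in this text, so here is a sketch of the standard argument. The symbols μ, σ and Z are notation assumed here; the steps use the linearity of the expected value and the rule V[aX + b] = a²V[X].

```latex
\begin{align*}
Z &= \frac{X - \mu}{\sigma}, \qquad \mu = E[X], \quad \sigma^2 = V[X] \\
E[Z] &= \frac{1}{\sigma}\left(E[X] - \mu\right) = 0 \\
V[Z] &= \frac{1}{\sigma^2}\, V[X - \mu] = \frac{V[X]}{\sigma^2} = 1
\end{align*}
```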

The common scale is hence a mean of 0 and a standard deviation of 1. We just removed the influence of the location (center) and spread (standard deviation) from the data. Any variable, once standardized, will be on this scale regardless of the type of the random variable and shape of the distribution.

You must have observed that the units will cancel out → the standardized random variable is dimensionless. The standardized data for apples and oranges will look like this

The first step where we subtract the mean will give anomalies, i.e. differences from the mean that are now centered on zero. Positive anomalies are the values that are greater than the mean, and negative anomalies are the values that are less than the mean. The second step where we divide by the standard deviation will provide a scaling factor; unit standard deviation for the new data that enables comparison across different distributions.

The standardized scores can also be seen as a distance measure between the original data value and the mean, the units being the standard deviation. For example, the price of apples in the New York terminal market is $55, about 1.14 standard deviations above the mean of ~$44: $44.63 + 1.14*$9 ≈ $55.
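
The two steps, subtracting the mean and dividing by the standard deviation, can be sketched in R. The prices below are hypothetical values made up for illustration, not the terminal market data from this lesson, and standardize is a helper name introduced here.

```r
# Hypothetical prices in dollars (illustrative values, not the lesson's data)
apples = c(55, 48, 37, 50, 33, 42, 47)
oranges = c(24, 31, 19, 27, 22, 30, 25)

# Step 1: subtract the mean (anomalies); Step 2: divide by the standard deviation
standardize = function(x) { (x - mean(x))/sd(x) }

z_apples = standardize(apples)
z_oranges = standardize(oranges)

# Both standardized variables now have mean 0 and standard deviation 1
print(round(c(mean(z_apples), sd(z_apples)), 10))
print(round(c(mean(z_oranges), sd(z_oranges)), 10))
```

Base R also offers the scale function, which performs the same centering and scaling on a vector or on the columns of a matrix.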

We will revisit this idea of standardization and use it to our advantage when we learn normal probability distributions. Until then, here are some examples of “standardize” in sentences as provided by Merriam — my rants attached.

“The plan is to standardize the test for reading comprehension so that we can see how students across the state compare” – One size does not fit all.

“He standardized procedures for the industry” – Interns getting coffee is not one of them. So make your own.


Lesson 27 – More variety

August 5, 2017: Outside temperature 71F; Joe’s temperature 62F.

June 2, 2017: Outside temperature 77F; Joe’s temperature 60F.

April 14, 2017: Outside temperature 65F; Joe’s temperature 64F.

February 7, 2017: Outside temperature 40F; Joe’s temperature 61F.

December 11, 2016: Outside temperature 34F; Joe’s temperature 63F.

 

What is the point? I know you are not referring to me.

 

Hey Joe. Glad you are here. I have been doing this experiment for some time now. A few years ago, I noticed that Joe’s, the famous coffee shop in New York City is always pleasant – summer or winter. I wanted to see how much variation there is in the internal temperature, so I started measuring it daily. It also meant $5 a day on coffee. 😯

 

So what did you find at that price?

 

I discovered that the temperature was discrete and always between 60 and 64. Some of it may be due to the temperature app I use on my phone – it only shows whole numbers; nevertheless, the temperature seems to be pretty regulated. Here is the summary of that data shown in probabilities and a simple probability distribution plot. You must be familiar with this.

Sure, I remember our first conversations about probability. It is the long run relative frequency. So you are finding that the probability that the temperature is 60F on any day is 0.1, the probability that the temperature is 62 on any day is 0.4 and so forth.

 

Correct. Shall we also compute the summary statistics?

 

Sure, it will be useful to estimate the expected value (E[X]), variance (V[X]) and the standard deviation of this random variable based on the probability distribution function. Let me do that.

The expected value is

E[X] = 60*0.1 + 61*0.2 + 62*0.4 + 63*0.2 + 64*0.1 = 62

The average temperature is 62F.

Variance can be estimated using the following equation

Applying this to our data: 

So the variance of the temperature is 1.2 deg F*deg F, or the standard deviation is 1.095 deg F.
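
These numbers can be checked in R. The sketch below assumes the usual definition of the variance of a discrete random variable, the sum of (x - E[X])^2 times P(X = x), which reproduces the values quoted above; the variable names are introduced here for illustration.

```r
temp = c(60, 61, 62, 63, 64)        # temperature values in deg F
prob = c(0.1, 0.2, 0.4, 0.2, 0.1)   # probabilities used in the expected value above

E_X = sum(temp*prob)              # expected value
V_X = sum((temp - E_X)^2*prob)    # variance
SD_X = sqrt(V_X)                  # standard deviation

print(c(E_X, V_X, SD_X))
```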

Fantastic. We can see that the average temperature is 62F and not varying much. That explains why Joe’s is always crowded. Besides having great coffee, they also offer great climate.

 

 

I agree. By the way, do you know the geolocation of your readers?

 

Sure. Here is a map of our readership. We’ve got to get some more countries on board to cover the globe, but this is a great start. Why do you ask?

 

I ask because outside the USA, people follow the SI unit system and they may want to do these computations in Celsius scale.

 

Very interesting observation. Do you have a solution for it?

 

Well, the simplest way is to convert the data first into Celsius scale and re-compute. But that is boring. There’s got to be some elegant approach.

 

Yes, there is. The properties of the expected value and variance operations can help us with this. Going back to your concern, we can convert degrees F to degrees C using this equation:

C = 5/9*F - 160/9

In lesson 25, we learned that the expected value operation is linear on functions. For example, for constants a and b,

E[a*X + b] = a*E[X] + b

Using this property, we can compute the expected value of Joe’s temperature in deg C as:

E[C] = 5/9*E[F] - 160/9 = 5/9*62 - 160/9 = 16.67 deg C.

For variance, it is not linear. Let us look at the derivation.
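
The derivation is not reproduced in this text; a sketch of the standard argument, assuming constants a and b and the definition of variance as the expected squared deviation from the mean, is:

```latex
\begin{align*}
V[aX + b] &= E\left[\left(aX + b - E[aX + b]\right)^2\right] \\
&= E\left[\left(aX + b - aE[X] - b\right)^2\right] \\
&= E\left[a^2\left(X - E[X]\right)^2\right] \\
&= a^2\, V[X]
\end{align*}
```

The additive constant b drops out, and the multiplicative constant a comes out squared.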

 

Very clear. Let me apply this to the Celsius equation.

V[C] = (5/9)^2*V[F] = 25/81*1.2 = 0.37 deg C*deg C,

and the standard deviation will be 0.6 deg C.
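
As a check, the sketch below computes the Celsius summary statistics in two ways: through the properties of expected value and variance, and through the "boring" route of converting the data first. The variable names are introduced here for illustration; the constants a and b come from the conversion C = 5/9*F - 160/9.

```r
temp_F = c(60, 61, 62, 63, 64)      # temperature values in deg F
prob = c(0.1, 0.2, 0.4, 0.2, 0.1)   # their probabilities

E_F = sum(temp_F*prob)
V_F = sum((temp_F - E_F)^2*prob)

a = 5/9
b = -160/9

# Using the properties: E[aX + b] = a*E[X] + b and V[aX + b] = a^2*V[X]
E_C = a*E_F + b
V_C = a^2*V_F
SD_C = sqrt(V_C)
print(c(E_C, V_C, SD_C))

# The direct route: convert each value to deg C and re-compute
temp_C = a*temp_F + b
E_C_direct = sum(temp_C*prob)
V_C_direct = sum((temp_C - E_C_direct)^2*prob)

# Both routes agree
print(all.equal(c(E_C, V_C), c(E_C_direct, V_C_direct)))
```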

Excellent. Now that you have your intellectual focus going, tell me: is the variance of the sum equal to the sum of the variances for two or more random variables?

 

Hmm. Going by your previous lessons, your questions always open new concepts. There is something hidden here too that you want me to discover. Can you give me a hint?

 

Well Joe, Noam Chomsky once said that if you can discover things, you are on your way to being an “independent” thinker.
