Lesson 29 – Large number games in R

“it cannot escape anyone that for judging in this way about any event at all, it is not enough to use one or two trials, but rather a great number of trials is required. And sometimes the stupidest man — by some instinct of nature per se and by no previous instruction (this is truly amazing) — knows for sure that the more observations of this sort that are taken, the less danger will be of straying from the mark.”

wrote James Bernoulli, the Swiss mathematician, in his treatise Ars Conjectandi, published in 1713.

Suppose we have an urn with an unknown number of white and black pebbles. Can we take a sample of 10 pebbles, count the number of white and black ones, and deduce the proportion in the container? If 10 is too small a sample, how about 20? How small is too small? James Bernoulli used this illustration to understand, through observations, the likelihood of diseases in the human body.

How many sample observations do we need before we can confidently say something is true?

In lesson 6 and lesson 23, we saw that the probability of an event is its long-run relative frequency. How many observations should we get before we know the true probability?

Do data from different probability distribution functions require the same number of sample observations to show this convergence to the truth?

Do we even know if what we observe in the long run is the actual truth? Remember black swans did not exist till they did.

We will let the mathematicians and probability theorists wrestle with these questions. Today, I will show you some simple tricks in R to simulate large number games. We will use three examples:

tossing a coin,

rolling a die, and

drawing pebbles from the urn.

All the code in R is here. Let’s get to the nuts and bolts of our toy models.

Coin Toss

If you toss a coin, the probability of getting a head or a tail is 0.5. The two outcomes are equally likely, and successive tosses are independent. By now you should know that it is 0.5 in the long run. In other words, if you toss a coin many times and count the number of times you get heads and the number of times you get tails, the probability of heads, i.e., the number of heads/N, approaches 0.5.

Let us create this experiment in R. Since there is a 0.5 probability of getting a head or a tail, we can use the random number generator to simulate the outcomes. Imagine we have a ruler on the scale of 0 to 1, and we simulate a random number with equal probability, i.e., any number between 0 and 1 is equally likely. If the number is less than 0.5, we call it heads; if it is greater than 0.5, we call it tails.

A random number between 0 and 1 with equal probability can be generated using the runif command in R.

# Generate 1 uniform random number 
r = runif(1)

[1] 0.3627271
# Generate 10 uniform random numbers 
r = runif(10)

[1] 0.7821440 0.8344471 0.3977171 0.3109202 0.8300459 0.7474802 0.8777750
 [8] 0.7528353 0.9098839 0.3731529
# Generate 100 uniform random numbers 
r = runif(100)

The steps for this coin toss experiment are:
1) Generate a uniform random number r between 0 – 1
2) If r < 0.5 choose H
3) If r > 0.5, choose T
4) Repeat the experiment N times → N coin tosses.

Here is the code.

# coin toss experiment 
N = 100

r = runif(N)

x = matrix(NA,nrow = N,ncol = 1)
for (i in 1:N)
{
 if(r[i] < 0.5 )
 { x[i,1] = "H" } else 
 { x[i,1] = "T" }
}

# Probability of Head 
P_H = length(which(x=="H"))/N
print(P_H)

# Probability of Tail 
P_T = length(which(x=="T"))/N
print(P_T)

We first generate 100 random numbers with equal probability, loop through each number and assign H or T based on whether the number is less than 0.5 or not, and then count the number of heads and tails. Some of you will remember the which and length commands from lesson 5 and lesson 8.

The beauty of using R is that it offers shortcuts for lengthy procedures. What we did with the for loop above can be done much more easily using the sample command. Here is how we do that.

We first create a vector with H and T names (outcomes of a toss) to select from.

# create a vector with H and T named toss to select from 
toss = c("H","T")

Next, we can use the sample command to draw a value from the vector. This function randomly samples a value from the vector (H, T), just like tossing a coin.

# use the command sample to sample 1 value from toss with replacement 
sample(toss,1,replace=T) # toss a coin 1 time

You can also use it to sample/toss multiple times, but be sure to use replace = TRUE in the function. That way, the function replicates the process of drawing a value, putting it back, and drawing again, so the same two outcomes are available on every toss.

# sample 100 times will toss a coin 100 times. use replace = T 
sample(toss,100,replace=T) # toss a coin 100 times

 [1] "H" "H" "H" "H" "T" "T" "H" "T" "T" "H" "T" "H" "H" "H" "T" "T" "H" "H"
 [19] "T" "H" "H" "H" "H" "T" "T" "T" "T" "H" "H" "H" "T" "H" "H" "H" "T" "H"
 [37] "H" "T" "T" "T" "T" "T" "H" "T" "T" "H" "T" "H" "H" "T" "H" "H" "T" "H"
 [55] "H" "T" "H" "T" "H" "H" "H" "T" "T" "T" "T" "H" "H" "H" "H" "T" "H" "T"
 [73] "T" "H" "H" "T" "H" "H" "T" "T" "T" "T" "H" "T" "H" "T" "T" "H" "T" "H"
 [91] "H" "T" "T" "H" "T" "H" "T" "H" "H" "T"

To summarize the number of heads and tails, you can use the table command. It is a bean counter. You should expect roughly 50 H and 50 T from a large number of tosses.

# to summarize the values use the table command.
x = sample(toss,100,replace=T) # toss a coin 100 times
table(x) # summarize the values in x

 H T 
51 49
A large number of simulations

Set up an experiment with N = 5, 15, 25, 35, 45, … , 20000 and
record the probability of obtaining a head in each case. Type the following lines in your code and execute. See what you get.

# Large Numbers Experiment 
# Tossing Coin

toss = c("H","T")

number_toss = seq(from = 5, to = 20000,by=10) # increasing number of tosses

P_H = matrix(NA,nrow=length(number_toss),ncol=1) # create an empty matrix to fill probability each time

P_T = matrix(NA,nrow=length(number_toss),ncol=1) # create an empty matrix to fill probability each time

for (i in 1:length(number_toss))
{
 x = sample(toss,number_toss[i],replace=T)
 P_H[i,1]=length(which(x=="H"))/number_toss[i]
 P_T[i,1]=length(which(x=="T"))/number_toss[i] 
}

plot(number_toss,P_H,xlab="Number of Tosses",ylab="Probability of Head",type="l",font=2,font.lab=2)
abline(h=0.5,col="red",lwd=1)

Notice that for a small number of coin tosses, the probability of heads is highly variable (not stable). But as the number of coin tosses increases, the probability converges and stabilizes at 0.5.

DICE

You should be able to write the code for this using the same logic. Try it. Create a vector [1, 2, 3, 4, 5, 6] and sample from this vector with replacement.

No cheating by looking at my code right away.

Here is the large number experiment on this. Did you see the convergence to 1/6?
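Once you have tried it yourself, a minimal sketch patterned on the coin-toss code above might look like this (the variable names here are mine; the full code linked in the post may differ):

# Large Numbers Experiment 
# Rolling a die

die = c(1,2,3,4,5,6)

number_rolls = seq(from = 5, to = 20000, by = 10) # increasing number of rolls

P_six = matrix(NA,nrow=length(number_rolls),ncol=1) # empty matrix to fill the probability of a six each time

for (i in 1:length(number_rolls))
{
 x = sample(die,number_rolls[i],replace=T)
 P_six[i,1] = length(which(x==6))/number_rolls[i]
}

plot(number_rolls,P_six,xlab="Number of Rolls",ylab="Probability of a Six",type="l",font=2,font.lab=2)
abline(h=1/6,col="red",lwd=1)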

Pebbles

Let us first create a vector of size 10 (an urn with pebbles) with six white and four black pebbles, i.e., a true white-to-black ratio of 3:2. Run the sampling experiment on this vector by drawing more and more samples and estimate the probability of a white pebble in the long run.

# Show that the probability of getting the color stones approaches the true probability
# Pebbles

stone = c("W","W","W","W","W","W","B","B","B","B")

number = seq(from = 5, to = 20000,by=10) # increasing number of draws

P_W = matrix(NA,nrow=length(number),ncol=1) # create an empty matrix to fill probability each time
P_B = matrix(NA,nrow=length(number),ncol=1) # create an empty matrix to fill probability each time

for (i in 1:length(number))
{
 x = sample(stone,number[i],replace=T)
 P_W[i,1]=length(which(x=="W"))/number[i]
 P_B[i,1]=length(which(x=="B"))/number[i]
}

plot(number,P_W,type="l",font=2,font.lab=2,xlab="Number of Draws with Replacement",ylim=c(0,1),ylab="Probability of a White Pebble")
abline(h=(0.6),col="red")

If there were truly three white pebbles for every two black pebbles in the world, an infinite (very, very large) experiment/simulation would reveal a convergence at this number. This large number experiment is our quest for the truth, a process for seeing through randomness.

You can now go and look at the full code and have fun programming it yourself. I will close off with this quote from James Bernoulli’s Ars Conjectandi.

“Whence, finally, this one thing seems to follow: that if observations of all events were to be continued throughout all eternity, (and hence the ultimate probability would tend toward perfect certainty), everything in the world would be perceived to happen in fixed ratios and according to a constant law of alternation, so that even in the most accidental and fortuitous occurrences we would be bound to recognize, as it were, a certain necessity and, so to speak, a certain fate.”

I continue to wonder what that “certain fate” is for me.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 28 – Apples and Oranges

The prices of golden delicious apples sold yesterday at selected U.S. cities’ terminal markets are

The prices of navel oranges sold yesterday at selected U.S. cities’ terminal markets are

I want to compare apples and oranges, needless to say, prices. I will assume the prices of apples to be a random variable and will plot the probability distribution function. I will consider the prices of oranges to be another random variable and will plot its probability distribution function. Since the prices can be defined on a continuous number scale, I am assuming continuous probability distribution functions that look like this

Notice the two random variables are on a different footing. Apples are centered on $44 with a standard deviation (spread) of $9. Oranges are centered on $25 with a standard deviation of $5.

If our interest lies in working simultaneously with data that are on different scales or units, we can standardize them so they are placed on the same level or footing. In other words, we will move the distributions from their original scales to a new common scale. This transformation enables an easy way to work with data that are related, but not strictly comparable. In our example, while both price datasets are in the same units, they are clearly on different scales.

We can re-express the random variable (data) as standardized anomalies so they can be compared or analyzed. This process of standardization can be achieved by subtracting the mean of the data and dividing by the standard deviation. For any random variable, if we subtract the mean and divide by the standard deviation, the expected value and the variance of the new standardized variable are 0 and 1. Here’s why.
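In outline, if X has mean μ and standard deviation σ, and Z = (X - μ)/σ is the standardized variable, then, using the linearity of the expected value and the scaling property of the variance:

E[Z] = E[(X - μ)/σ] = (E[X] - μ)/σ = (μ - μ)/σ = 0

V[Z] = V[(X - μ)/σ] = (1/σ²)V[X] = σ²/σ² = 1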

The common scale is hence a mean of 0 and a standard deviation of 1. We just removed the influence of the location (center) and spread (standard deviation) from the data. Any variable, once standardized, will be on this scale regardless of the type of the random variable and shape of the distribution.

You must have observed that the units will cancel out → the standardized random variable is dimensionless. The standardized data for apples and oranges will look like this

The first step, where we subtract the mean, gives anomalies, i.e., differences from the mean that are now centered on zero. Positive anomalies are values greater than the mean, and negative anomalies are values less than the mean. The second step, where we divide by the standard deviation, provides a scaling factor: the new data have unit standard deviation, which enables comparison across different distributions.

The standardized scores can also be seen as a distance measure between the original data value and the mean, in units of the standard deviation. For example, the price of apples in the New York terminal market is $55, about 1.14 standard deviations above the mean of ~$44 ($44.63 + 1.14 × $9 ≈ $55).
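As a quick illustration in R, here is a small sketch using made-up price vectors (hypothetical numbers chosen only to sit roughly around the centers and spreads quoted above; they are not the actual terminal market data):

# hypothetical apple and orange prices 
apples = c(30, 38, 42, 44, 47, 50, 51, 55)
oranges = c(18, 21, 23, 24, 25, 27, 29, 31)

# standardize: subtract the mean and divide by the standard deviation 
z_apples = (apples - mean(apples))/sd(apples)
z_oranges = (oranges - mean(oranges))/sd(oranges)

# both standardized variables now have mean ~0 and standard deviation 1 
mean(z_apples); sd(z_apples)
mean(z_oranges); sd(z_oranges)

# the scale() function does the same thing in one step 
scale(apples)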

We will revisit this idea of standardization and use it to our advantage when we learn normal probability distributions. Until then, here are some examples of “standardize” in sentences as provided by Merriam-Webster — my rants attached.

“The plan is to standardize the test for reading comprehension so that we can see how students across the state compare” – One size does not fit all.

“He standardized procedures for the industry” – Interns getting coffee is not one of them. So make your own.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 27 – More variety

August 5, 2017: Outside temperature 71F; Joe’s temperature 62F.

June 2, 2017: Outside temperature 77F; Joe’s temperature 60F.

April 14, 2017: Outside temperature 65F; Joe’s temperature 64F.

February 7, 2017: Outside temperature 40F; Joe’s temperature 61F.

December 11, 2016: Outside temperature 34F; Joe’s temperature 63F.

 

What is the point? I know you are not referring to me.

 

Hey Joe. Glad you are here. I have been doing this experiment for some time now. A few years ago, I noticed that Joe’s, the famous coffee shop in New York City, is always pleasant – summer or winter. I wanted to see how much variation there is in the internal temperature, so I started measuring it daily. It also meant $5 a day on coffee. 😯

 

So what did you find at that price?

 

I discovered that the temperature was discrete and always between 60 and 64. Some of it may be due to the temperature app I use on my phone – it only shows whole numbers; nevertheless, the temperature seems to be pretty regulated. Here is the summary of that data shown in probabilities and a simple probability distribution plot. You must be familiar with this.

Sure, I remember our first conversations about probability. It is the long run relative frequency. So you are finding that the probability that the temperature is 60F on any day is 0.1, the probability that the temperature is 62 on any day is 0.4 and so forth.

 

Correct. Shall we also compute the summary statistics?

 

Sure, it will be useful to estimate the expected value (E[X]), variance (V[X]) and the standard deviation of this random variable based on the probability distribution function. Let me do that.

The expected value is

E[X] = 60*0.1 + 61*0.2 + 62*0.4 + 63*0.2 + 64*0.1 = 62

The average temperature is 62F.

Variance can be estimated using the following equation

V[X] = Σ (x - E[X])² f(x)

Applying this to our data: 

V[X] = (60-62)²(0.1) + (61-62)²(0.2) + (62-62)²(0.4) + (63-62)²(0.2) + (64-62)²(0.1) = 1.2

So the variance of the temperature is 1.2 (deg F)², and the standard deviation is 1.095 deg F.

Fantastic. We can see that the average temperature is 62F and not varying much. That explains why Joe’s is always crowded. Besides having great coffee, they also offer great climate.

 

 

I agree. By the way, do you know the geolocation of your readers?

 

Sure. Here is a map of our readership. We’ve got to get some more countries on board to cover the globe, but this is a great start. Why do you ask?

 

I ask because outside the USA, people follow the SI unit system and they may want to do these computations in Celsius scale.

 

Very interesting observation. Do you have a solution for it?

 

Well, the simplest way is to convert the data first into Celsius scale and re-compute. But that is boring. There’s got to be some elegant approach.

 

Yes there is. The properties of the expected value and the variance operations can help us with this. Going back to your concern, we can convert degrees F to degrees C using this equation: C = (F - 32) × 5/9 = (5/9)F - 160/9

In lesson 25, we learned that the expected value operation is linear on functions. For example, E[aX + b] = aE[X] + b.

Using this property, we can compute the expected value of Joe’s temperature in deg C as:

E[C] = 5/9*E[F] - 160/9 = 5/9*62 - 160/9 = 16.66 deg C.

For variance, it is not linear. Let us look at the derivation: V[aX + b] = E[(aX + b - E[aX + b])²] = E[(aX + b - aE[X] - b)²] = E[a²(X - E[X])²] = a²V[X]. The additive constant drops out, and the multiplier comes out squared.

 

Very clear. Let me apply this to the Celsius equation.

V[C] = V[(5/9)F - 160/9] = (5/9)²V[F] = (25/81)(1.2) ≈ 0.37 (deg C)²

and the standard deviation will be about 0.6 deg C.
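For readers following along in R, Joe's numbers can be verified with a few lines (a small sketch; the probabilities are the ones read off the distribution above):

# Joe's temperature distribution (deg F) 
x = c(60, 61, 62, 63, 64)
p = c(0.1, 0.2, 0.4, 0.2, 0.1)

EX = sum(x*p)           # expected value
VX = sum((x - EX)^2*p)  # variance
print(EX)               # 62
print(VX)               # 1.2
print(sqrt(VX))         # ~1.095 deg F

# convert to Celsius using the properties above 
print((5/9)*EX - 160/9)  # ~16.67 deg C
print((5/9)^2*VX)        # ~0.37 (deg C)^2
print(sqrt((5/9)^2*VX))  # ~0.61 deg C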

Excellent. Now that you have your intellectual focus going, tell me: is the variance of the sum the sum of the variances for two or more random variables?

 

Hmm. Going by your previous lessons, your questions always open new concepts. There is something hidden here too that you want me to discover. Can you give me a hint?

 

Well Joe, Noam Chomsky once said that if you can discover things, you are on your way to being an “independent” thinker.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 26 – The variety in consumption

This summer, I installed an air conditioner in my apartment to adapt to a changing climate. One month into the season, I was startled to see my electricity bill double. I checked my AC specs; sure enough, its energy efficiency ratio is 9.7 BTU/h.W — not the modern Energy Star baby by any standard.

I began to wonder how many such low-efficiency units contribute to increased energy usage, especially in New York City, where the window AC is the norm for residential buildings. While getting the number of older units is challenging, Local Law 84 on Municipal Energy and Water Data Disclosure requires owners of large buildings to report their energy and water consumption annually. You can get the 2015 disclosure data from here. I am showing a simple frequency plot of the weather-normalized electricity intensity in kBtu per square foot of building area here.

There are 13223 buildings in the data file; 254 of them have an electricity intensity greater than 200 kBtu/sqft. I am not showing them here. The NY Stock Exchange building, the Bryant Park Hotel, and Rockefeller University are notable among them.

I want to know the variety in energy use. Long time readers might guess correctly from the title that I am talking about the variability in energy usage data. We can assume that energy consumption is a random variable X, i.e. X represents the possible energy consumption values (infinite and continuous). The data we downloaded are sample observations x. We are interested in the variance of X → V[X].

In lesson 24 and lesson 25, we learned that the expected value (E[X]) is a descriptive quantity of the average (center) of a random variable X with a probability distribution function f(x). In the same way, a measure of the variability, i.e. deviation from the center, of the random variable is the variance V[X].

It is defined as the expected value of the squared deviation from the average:

V[X] = E[(X - μ)²]

Here, μ is the expected value of the random variable → E[X], and (X - μ) measures the deviation from this value. We square these deviations and take the expected value of the squared deviations. If you remember lesson 17, this is exactly the equation for the variance of the data sample. Here, we generalize it for a random variable X.

With some derivation, we can get a useful alternative for computing the variance:

V[X] = E[X²] - (E[X])²
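A quick way to convince yourself of this identity in R, with an arbitrary sample (note that var() in R uses the n - 1 denominator, so we rescale it to the population version for comparison):

set.seed(1)
x = rnorm(10000, mean = 50, sd = 10)  # an arbitrary sample

# E[X^2] - (E[X])^2 
mean(x^2) - mean(x)^2

# R's var() uses n-1; rescale it for an apples-to-apples comparison 
var(x)*(length(x) - 1)/length(x)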

If we know the probability distribution function f(x) of the random variable X, we can also write the variance as

V[X] = Σ (x - μ)² f(x) for a discrete variable, or V[X] = ∫ (x - μ)² f(x) dx for a continuous one.

f(x) for this data might look like the thick black line on the frequency plot.

The expected value of the consumption for 2015 of the 12969 buildings is 83 kBtu/sqft; the variance is 1062 (kBtu/sqft)². A better way to understand this is through the standard deviation, the square root of the variance — 32 kBtu/sqft. You can see from the frequency plot that the data has high variance — buildings with very low electricity intensity and buildings with high electricity intensity.

What is your building’s energy consumption this year? Does your city have this cool feature?

It came at a price though. With another local law, we can ban all the low EER AC units to solve the energy problem.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 25 – More expectation

My old high school friend is now a successful young businessman. Last week, he shared thoughts on one of his unusual stock investment schemes. Every few months, he randomly selects four stocks from four different sectors and bets equally on them. I asked him for a rationale. He said he expects two profit-making stocks on average, assuming that the probability of profit or loss for a stock is 0.5. I think, since he picks them at random, he also assigns a 50-50 chance of winning or losing.

My first thought

This made a nice expected value problem of the sum of random variables.

The expected number of profit making stocks in his case is 2. We can assign X1, X2, X3, and X4 as the random variables for individual stocks with outcomes 1 if it makes a profit and 0 otherwise. We can assign Y as the total number of profit making stocks; ranging from 0 to 4. His possible outcomes are:

As we can see, the total number of profit making stocks in these scenarios are 4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0. The average of these numbers is 2; the expected number of profit making stocks.

Another way of getting at the same number is to use the expected value formula we learned in lesson 24.

E[Y] = 4(1/16) + 3(4/16) + 2(6/16) + 1(4/16) + 0(1/16) = 2

An important property of expected value of a random variable is that the mean of the linear function is the linear function of the mean.

Y = X1 + X2 + X3 + X4

Y is another random variable that comes from a combination of individual random variables. For sum of random variables,

E[Y] = E[X1] + E[X2] + E[X3] + E[X4]

Detailed folks can go over the derivation.

Now, E[X1] = 0.5(1) + 0.5(0) = 0.5, since the outcomes are 1 and 0 and the probabilities are 0.5 each. Adding all of them, we get 2.

So you see, the additive property makes it easy to estimate the expected value of the sum of the random variables instead of writing down the outcomes and computing the probability distribution of Y.
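If you want to see the number 2 emerge from simulation, here is a small sketch in R (assuming, as the friend does, a 0.5 probability of profit for each stock, independently of the others):

set.seed(7)
trials = 100000

# each trial: 4 independent stocks, each profitable with probability 0.5 
Y = rbinom(trials, size = 4, prob = 0.5)  # number of profit-making stocks

mean(Y)           # close to the expected value of 2
table(Y)/trials   # compare with 1/16, 4/16, 6/16, 4/16, 1/16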

Other simple rules when there are constants (a and b) involved:

E[a] = a

E[aX] = aE[X]

E[aX + b] = aE[X] + b

Try to derive them as we did above.

My second thought

The stock market might be more complicated than a coin-flip experiment. But what do I know; he clearly has a higher net worth than me. I guess, since this is only one of his investment schemes, he is just playing around with his leftovers.

My final thought

I am only good for teaching probability; not using it like him. But again, most ivory tower professors only preach; don’t practice. Hey, they are protected. Why should they do real things?

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 24 – What else did you expect?

On a hot summer evening, a low-energy Leo walked into his apartment building without greeting the old man smoking outside. Leo had just burnt his wallet at the Atlantic City roulette table, and his mind was occupied with how it all happened. He went in with $500: ten chips of $50. His first gamble was to play safe and bet one chip at a time on red. He lost the first two times, won the third time, and lost the next two times. After five consecutive bets, he was left with $350. In the roulette game, the payout for red or black is 1 to 1. He started getting worried. Since the payout for any single number is 35 to 1, in a hasty move, he went all in on number 20, just to realize that he was all out.

Could it be that luck was not favoring Leo on the day? Could it be that his betting strategy was wrong? Are the odds stacked against him? If he had enough chips and placed the same bet long enough, what can he expect?

Based on his first bet

Imagine Leo bets a dollar at a time on red. He will win or lose $1 each time. In American roulette, there are 18 red, 18 black and two green (0 and 00) numbers — a total of 38. Each spin is independent, and every number is equally likely. The probability of getting a red is 18/38 (18 reds in 38 numbers). In the same way, the probability of getting a black is 18/38, and the probability of getting a green is 2/38.

If the ivory ball lands on a red number, he will win $1; if it lands on any other color, he will lose $1 – or he gets -$1. In the long run, if he keeps playing this game, a dollar on and a dollar off, his expected win per dollar will be

E[X] = (1)(18/38) + (-1)(20/38) = -$0.053

On average, for every $1 he bets on red, he will lose about 5 cents.
Based on his second bet

Now let us imagine Leo bets on a single number where the payout is 35 to 1. He will win $35 if the ball lands on his number, or lose his dollar. The probability of getting any particular number is 1/38 (one number in 38 outcomes). Again, in the long run, if he keeps playing this game (win $35 or lose $1), over time, his expected win per dollar will be

E[X] = (35)(1/38) + (-1)(37/38) = -$0.053

Although the payout is high, on average, for every $1 he bets on a single number, he will still lose about 5 cents.
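A quick simulation in R (a sketch, not Leo's actual session) shows both strategies drifting toward the same average loss:

set.seed(20)
N = 1000000  # number of one-dollar bets

# betting on red: win $1 with probability 18/38, lose $1 otherwise 
red = sample(c(1,-1), N, replace = TRUE, prob = c(18/38, 20/38))
mean(red)     # about -0.053

# betting on a single number: win $35 with probability 1/38, lose $1 otherwise 
single = sample(c(35,-1), N, replace = TRUE, prob = c(1/38, 37/38))
mean(single)  # also close to -0.053, though it converges more slowly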

This estimation we just did is called the Expected Value of a random variable. Just like how “mean” is a description of the central tendency for sample data, the expected value (E[X]) is a descriptive quantity of the central tendency (average behavior) of a random variable X with a probability distribution function f(x). It is computed by weighting each possible outcome by its probability:

E[X] = Σ x f(x)

In Leo’s case, X is the random variable describing his payout, x is the actual payout from the house ($1 or $35 if he wins, or -$1 if he loses), and f(x) is the probability distribution or frequency of the outcomes (18/38 for red and 20/38 otherwise, or 1/38 for a single number and 37/38 otherwise).

You will notice that this equation is exactly like the equation for the average of a sample. Imagine there is a very large sample data with repetitions; we are adding and averaging over the groups.

Poor Leo expected this to happen but didn’t realize that the table is tilted and the game is rigged.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 23 – Let’s distribute the probability

 

Hey Joe, what are you up to these days?

 

Apart from visiting DC recently, life has been mellow over this summer. I am reading your lessons every week. I noticed there are several ways to visualize data and summarize it. Those were a nice set of data summary lessons.

 

Yes. Preliminaries in data analysis — visualize and summarize. I recently came across visuals with cute faces 🙂 I will present them at an appropriate time.

 

That is cool. On the way back from DC, we played the Chicago dice game. I remembered our conversation about probability while playing.

 

 

Interesting. How is the game played?

 

There are eleven rounds, numbered 2 – 12. In each round, we throw a pair of dice, trying to score the number of the round. For example, if on the first try I get a 1 and a 1, I win a point because my first-round score is 2. If I throw any number other than 2, I don’t win anything. The player with the highest total after 11 rounds wins the game.

 

I see. So there are 11 outcomes (2 – 12), and you are trying to get the outcome. Do you know the probability distribution of these outcomes?

 

I believe you just used the question to present a new idea – “probability distribution“. Fine, let me do the Socratic thing here and ask “What is probability distribution“?

It is the distribution of the probability of the outcomes. In your Chicago dice example, you have a random outcome between 2 and 12; 2 if you roll a 1 and 1; 12 if you roll a 6 and 6. Each of these random outcomes has a probability of occurring. If you compute these probabilities and plot them, i.e. distribute the probabilities on a number line, we can see the probability distribution of this random variable.

 

Let me jump in here. There are 11 possible outcomes. I will tabulate the possibilities.

There are limited ways of achieving each outcome. The likelihood of an outcome is the ratio of the number of ways we can get that total to 36, the total number of equally likely rolls of two dice. An outcome of 2 can only be achieved if we get a (1,1). Hence the probability of getting 2 in this game is 1/36.
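The full table can also be generated in R by enumerating all 36 equally likely rolls of the two dice (a small sketch):

# all 36 outcomes of rolling two dice 
sums = outer(1:6, 1:6, "+")

# number of ways to get each total, and the probability of each total 
table(sums)
table(sums)/36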

 

Excellent, now try to plot these probabilities on a scale from 2 to 12.

 

Looking at the table, I can see the probability will increase as we go up from 2 to 7 and decrease from there till 12.

 

I like the way you named your axes. X and P(X = x). Your plot shows that there is a spike (which is the probability) for each possible outcome. The probability is 0 for all other outcomes. The spikes should add up to 1. This probability graph is called the probability distribution function f(x) for a discrete random variable.

The probabilities can be accumulated to obtain the cumulative distribution function. Say you want to know the probability of getting an outcome less than 4. You can use the cumulative function, which adds up f(x) over the outcomes 2 and 3. Just be watchful of the notation. The probability distribution function has a lowercase f, and the cumulative distribution function has an uppercase F.

F(x) = P(X ≤ x) = Σ f(u) over all outcomes u ≤ x

For example, P(X < 4) = F(3) = f(2) + f(3) = 1/36 + 2/36 = 3/36

So if we know the function f(x), we can find out the probability of any possible event from it. These outcomes are discrete (2 to 12), and the function is also discrete for every outcome. What if the outcomes are continuous? How does the probability distribution function look if the random variable is continuous where the possibilities are infinite?

 

Okay, let us do a thought experiment. Imagine there are ten similar apples in a basket. What is the probability of taking any apple at random?

 

Since there are ten apples, the probability of taking one is 1/10.

 

What if there are n apples?

 

 

Then the probability of taking any one is 1/n. Why do you ask?

 

What happens to the probability if n is a very large number, i.e. if there are infinite possibilities?

 

Ah, I see. As n approaches infinity, the probability of seeing any one number approaches 0. So unlike discrete random variables which have a defined probability for each outcome, for continuous random variables P(X = x) = 0. How then, can we come up with a probability distribution function?

 

Recall how we did frequency plots. We partitioned the space into intervals or groups and recorded the number of observations that fall into each group. For continuous random variables, the proportion of observations in the group approaches the probability of being in the group. For a large n, we can imagine a large number of small intervals like this.

We can approximate this with a smooth curve f(x) and define the probability of a continuous variable falling in an interval between a and b as

P(a ≤ X ≤ b) = ∫ f(x) dx, integrated from a to b

The extension from the frequency plot to the probability distribution function is clear. Since the function is continuous, if we want the cumulative function, we integrate it:

F(x) = P(X ≤ x) = ∫ f(u) du, integrated from the lower limit of X up to x
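For instance, here is a small sketch in R using the standard uniform density as the f(x) (any valid density would work the same way):

# probability that a Uniform(0,1) variable falls between 0.2 and 0.5 
integrate(dunif, lower = 0.2, upper = 0.5)   # 0.3

# the same probability from the cumulative distribution function F(x) 
punif(0.5) - punif(0.2)                      # 0.3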


Great. You picked up many things today. Did you figure out the odds of getting a deal on your Chicago dice game — getting the same number as the round in your 11 tries?

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 22 – You are so random

Not just Pinkie Pie, outcomes of events that involve some level of uncertainty are also random. A random variable describes these outcomes as numbers. Random variables can take on different values; just like variables in math taking different values.

If the possible outcomes are distinct numbers (e.g. counts), then these are called discrete random variables. If the possible outcomes can take on any value on the real number line, then these are called continuous random variables.

There are six possible outcomes (1, 2, 3, 4, 5 and 6) when you roll a die. Each number is distinct. We can assume X is a random variable that can take any number between 1 and 6; hence it is finite and discrete. For any single roll, we can assume x to be the outcome. Notice that we are using uppercase X for the random variable and lowercase x for the value it takes for a given outcome.

X is the set of possible values and x is an observation from that set.

In lesson 20, we explored the rainfall data for New York City and Berkeley. Here, we can assume rain to be a continuous random variable X on the number line. In other words, the rainfall in any year can be a random value on the line with 0 as the lower limit. Can you guess the upper limit for rainfall? The actual data we have is an outcome (x); observation; a value on this random variable scale. Again, X is the possible values rainfall can take (infinite and continuous), and x is what we observed in the sample data.

In lesson 19, we looked at the SAT reading score for schools in New York City. Since the SAT reading score is between 200 (the participation trophy) and 800, in increments of 10, we can assume that it is a finite and discrete random variable X. Any particular score we observe, for instance 670 for a student, is an observed outcome x.

If you are playing Monopoly, the outcome of your roll will be a random variable between 2 and 12; discrete and finite; 2 if you get 1 and 1; 12 if you get 6 and 6, and all combinations in between.

In lesson 14, we plotted the box office revenue for STAR WARS films. We can assume this data as observations of a continuous random variable.

Do you think this random variable showing revenue can be negative? What if they lose money? Maybe not STAR WARS, but there are loads of terrible films that are negative random variables.

Can you think of other random variables that can be negative?

How about the national debt?

Are you old enough to have seen a surplus?

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 21 – Beginners guide to summarize data in R

We went nine lessons without R. During this time we had lessons that taught us how to visualize data using dotplots, boxplots, and frequency plots (lesson 14 and lesson 15). We had lessons on summarizing the data using average (lesson 16), standard deviation (lesson 17), skewness (lesson 18), and outliers (lesson 19). We also had a lesson on comparing datasets using coefficient of variation (lesson 20). Don’t you think it is time to put these things into practice and know how to use R commands to explore data?

Let’s get going. Today’s victim is the natural gas consumption data in New York City available at the zip code level. The Mayor’s office of long-term planning and sustainability provides this data.

Usual Chores

Step 1: Get the data
You can download the data file from here.

Step 2: Create a new folder on your computer
Let us call this folder “lesson21”. The downloaded data file “Natural_Gas_Consumption_by_ZIP_Code_-_2010.csv” will be saved in this folder.

Step 3: Create a new code in R
Create a new code for this lesson. “File >> New >> R script”. Save the code in the same folder “lesson21” using the “save” button or by using “Ctrl+S”. Use .R as the extension — “lesson21_code.R”. Now your R code and the data file are in the same folder.

Step 4: Choose your working directory
“lesson21” is the folder where we stored the data file. Use “setwd(“path”)” to set the path to this folder. Execute the line by clicking the “Run” button on the top right.

setwd("path to your folder")

Step 5: Read the data into R workspace
Since we have a comma separated values file (.csv), we can use the “read.csv” command to read the data file into R workspace. Type the following line in your code and execute it. Use header=TRUE in the command.

# Read the datafile #
naturalgas_data = read.csv("Natural_Gas_Consumption_by_ZIP_Code_-_2010.csv",header=TRUE)

Step 6: Let us begin the exploration
I realized that the data is classified into various building types → commercial, industrial, residential, institutional, etc. Let us extract three building types for this lesson. You can use the “which” command to do this filtering.

# extract subset of the data #
index1 = which(naturalgas_data$Building.type..service.class == "Large residential")

index2 = which(naturalgas_data$Building.type..service.class == "Commercial")

index3 = which(naturalgas_data$Building.type..service.class == "Institutional")

Residential = naturalgas_data$Consumption..GJ.[index1]
Commercial = naturalgas_data$Consumption..GJ.[index2]
Institutional = naturalgas_data$Consumption..GJ.[index3]

The “which” command will identify all the rows that belong to a class. We then extract these rows from the original data.

Now we have all large residential building consumption data under the “Residential“, all commercial building consumption data under the “Commercial” and all institutional building consumption data under the “Institutional“.

Let us look at Large residential consumption data first and then compare it to the others.

Visualize the data using dot plot

As a first step, we can order the data from smallest to largest and place the numbers as points on the line. This dot plot provides a good visual perspective on the range of the data. You can use the following command to do this.

stripchart(Residential,font=2,pch=21,cex=0.5,xlab="Consumption in GJ",font.lab=2)

You can change the values for the font to make it bold or light, pch for different shapes of the dots, and cex for changing the size of the dots. You can have customized labels using xlab.

Visualize the data using boxplot

With boxplot, we can get a nice visual of the data range and its percentiles along with detecting the outliers in the data.

boxplot(Residential, horizontal=TRUE, col="grey",add=T)

You can pick a color of your choice, and make it horizontal or vertical. I usually prefer the box to be horizontal as it aligns with the number line. You must have noticed the add=T at the end. This option tells R to add the boxplot on top of the dot plot. Did you see how the dots and the box aligned?

The box is the region with 50 percent of the data. The vertical line in the box is the 50th percentile (the middle data point — the median). Whiskers extend from the box toward the higher and lower values; this extension is usually limited to 1.5 times the length of the box. The length of the box is called the interquartile range (75th percentile minus 25th percentile). Points that cannot be reached by the whiskers are outliers.

Visualize the data using frequency plot

For making a frequency plot, we should first partition the data into groups or bins, count the number of data points that fall in each bin and stack building blocks for each bin. All this is done in R using the command:

# histogram #
hist(Residential,font=2,font.lab=2,xlab="Consumption in GJ",col="grey")

Again, you can work with font, xlab, etc. to beautify the plot.

Did you see that the data is not symmetric? There are extreme values on the right creating a positive skew. The average and 50th percentile (median) are not the same.

Summarizing the data using summary statistics

We can summarize the data using the summary statistics → average, standard deviation, and skewness. We can also compute the percentiles.

Average or mean is a measure of the central tendency (balance point). You make a pass through the numbers and add them up, then divide this total by the number of data points. In R, the command for this is:

# average #
mean(Residential,na.rm=T)
[1] 684942.3

Notice that I added an argument na.rm = T since I have some missing values in the data. na.rm is instructing R to remove the “NA” and then compute the average.

You must be thinking about the fact that mean is sensitive to outliers. Yes, we have seen that the data is skewed, so perhaps computing the middle value (median) is necessary. The median is not sensitive to outliers. 50 percent of the data points are less than this number. The command for this in R is:

# median #
median(Residential,na.rm=T)
[1] 414505

While the average provides a measure of the center of the data, variance or standard deviation provides a measure of how spread out the data is from the center. Compute the deviation of each number from the average. Square these deviations. Get the average of all these squared deviations. In R you can do this using the command:

# variance and standard deviation #
var(Residential,na.rm=T)
[1] 707890274303

sd(Residential,na.rm=T)
[1] 841362.2

As you know, the standard deviation is in the same units (gigajoules) as the average.

The third measure, the skewness can be computed as follows:

# skewness #
library(moments)
skewness(Residential,na.rm=T)
[1] 1.990704

The function skewness is part of a package called “moments“. If you do not have this library, you should first install it using:

install.packages("moments")

After installing, you can call this library using:

library(moments)

The skewness for this data is 1.99. Very high as you saw in the histogram. If the data is symmetric, i.e. evenly spread out around the center, the skewness will be 0.

The “quantile” command will give us the 0th, 25th, 50th, 75th and 100th percentiles. We can also use the “summary” command to get an overview of the data.

# quantiles and summary #
quantile(Residential,na.rm=T)

 0% 25% 50% 75% 100% 
 30 65247 414505 998733 4098510 

summary(Residential)

 Min. 1st Qu. Median Mean 3rd Qu. Max. NA's 
 30 65250 414500 684900 998700 4099000 1
Comparing the data using plots and coefficient of variation

An easy way to compare the data is to align them on the same scale and visually inspect for differences. In other words, you can plot two or more boxplots on the same scale like this:

# comparing the data # 
all_data = cbind(Residential,Commercial, Institutional)
boxplot(all_data,horizontal=T)

You must have noticed that I am first combining the data vectors into a new object using the “cbind” command. cbind will bind the data as columns (column bind — cbind). We can now plot this new object with three columns.

There are clear differences in the data. Large residential buildings have more energy consumption. Not surprising as there are more residential buildings than commercial or institutional.

Last week, we went over the coefficient of variation. Can you now tell me which consumption data (residential, commercial or institutional) has more variation compared to its mean?

Did you notice that the commercial and institutional data have more variation compared to residential? Do you know why?
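One quick way to check, using the three vectors created earlier (na.rm = TRUE removes the missing values):

# coefficient of variation = standard deviation / mean 
cv = function(x) { sd(x, na.rm = TRUE)/mean(x, na.rm = TRUE) }

cv(Residential)
cv(Commercial)
cv(Institutional)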

While you are at that, also think about why using graphical methods and summary statistics to summarize the data is an important feature for statistical modeling.

Could it be that there is a universal behavior and we are using a sample of that population?

Can a sample represent the population?

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 20 – Compared to what?

These were the top four stories when I checked “California drought” in google news.

Not surprising. We have been following the drought story in California for a while now. Governor Brown declared the end of the drought emergency; there seems to be plenty of snow ready to melt in the Sierra Nevada mountains; some communities are still feeling the effects of previous droughts. With a drought last year and good rains this year, we can see there is variability.

These were the top four stories when I typed “New York drought” into the search bar. I had to do it. Didn’t you read the title?

Nothing alarming. The fourth story is about California drought 🙄 People from the East Coast know that there is not much variability in rains here from year to year.

You must have noticed that I am trying to compare different datasets. Are there ways to achieve this? Can we visually inspect for differences? Are there any measures that we can use?

Since we started with California and New York, let us compare rainfall data for two cities, Berkeley and New York City. Fortunately, we have more than 100 years of measured rainfall data for these cities. I am using the data from 1901 to 2000.

As always, we can prepare some graphics to visualize the data. Recall from Lesson 14 that we can use boxplots to get a perspective on the data range, its percentiles, and outliers. Since we have two datasets, Berkeley and New York City, let us look at the boxplots on the same scale, one below the other, like this:

There is a clear difference in the data. New York City gets a lot more rain than Berkeley, at least two times more on average. Notice the 50th percentile (middle of the box) for Berkeley is around 600 mm and for New York City around 1200 mm.

Did you see that the minimum rainfall New York City gets per year is greater than what Berkeley gets 75% of the time?

What about variability? Is there a difference in the variability of the datasets?

I computed their standard deviations. For Berkeley, it is 228 mm, and for New York City it is 216 mm. On the face of it, 228 and 216 do not look very different. So is there no difference in the variability? Is it sufficient to just compare the standard deviations?

You know that standard deviation is the deviation from the center of the data. But in this case, the two datasets do not have the same central value (average). New York City has an average rainfall of 1200 mm and a standard deviation of 216 mm. Compared to 1200 mm, the deviation is 18% (216/1200). Berkeley has an average rainfall of 600 mm and a standard deviation of 228 mm. Compared to 600 mm, the deviation is 38%.

This measure, the relative standard deviation, is called the coefficient of variation: CV = standard deviation / mean.

It measures the amount of variability in relation to the average. It is a standardized metric that can be used to compare datasets on different scales or units. It is common to express this ratio as a percentage as we did with Berkeley and New York City.

So, we can say that New York City gets more rainfall (two times more on average) than Berkeley, and its relative variability is less (~two times less) than Berkeley. That explains why there are fewer drought stories for New York compared to California.

Next time you read a drought story in your State, ask yourself “compared to what” and check out these maps.

 

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.
