Lesson 22 – You are so random

Pinkie Pie is not the only random one. The outcomes of events that involve some level of uncertainty are also random. A random variable describes these outcomes as numbers. Just like variables in math, a random variable can take on different values.

If the possible outcomes are distinct numbers (e.g., counts), the variable is called a discrete random variable. If the possible outcomes can take on any value on the real number line, the variable is called a continuous random variable.

There are six possible outcomes (1, 2, 3, 4, 5 and 6) when you roll a die. Each outcome is distinct. We can define X as a random variable that takes any number between 1 and 6; hence it is finite and discrete. For any single roll, we call the outcome x. Notice that we use uppercase X for the random variable and lowercase x for the value it takes in a given outcome.

X is the set of possible values and x is an observation from that set.
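This notation can be sketched in a couple of lines of Python (a minimal illustration of the X-versus-x idea, not part of the lesson's data):

```python
import random

# X, the random variable: the set of values a die roll can take.
X = {1, 2, 3, 4, 5, 6}

# x, an observation: one realized outcome of a single roll.
x = random.randint(1, 6)

print(x in X)  # an observation always falls within the set of possible values
```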

In Lesson 20, we explored the rainfall data for New York City and Berkeley. Here, we can treat rainfall as a continuous random variable X on the number line. In other words, the rainfall in any year can be a random value on the line, with 0 as the lower limit. Can you guess the upper limit for rainfall? The actual data we have is an outcome x, an observation, a value on this random variable scale. Again, X is the set of possible values rainfall can take (infinite and continuous), and x is what we observed in the sample data.

In Lesson 19, we looked at SAT reading scores for schools in New York City. Since the SAT reading score runs from 200 (the participation trophy) to 800 in increments of 10, we can treat it as a finite and discrete random variable X. Any particular score we observe, for instance 670 for a student, is an observed outcome x.

If you are playing Monopoly, the sum of your roll is a discrete and finite random variable that takes values between 2 and 12: 2 if you get 1 and 1, 12 if you get 6 and 6, and all the combinations in between.

In Lesson 14, we plotted the box office revenue for STAR WARS films. We can treat these data as observations of a continuous random variable.

Do you think this revenue random variable can be negative? What if a film loses money? Maybe not STAR WARS, but there are loads of terrible films whose revenue outcomes are negative.

Can you think of other random variables that can be negative?

How about the national debt?

Are you old enough to have seen a surplus?

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 20 – Compared to what?

These were the top four stories when I searched for “California drought” in Google News.

Not surprising. We have been following the drought story in California for a while now. Governor Brown declared the end of the drought emergency; there seems to be plenty of snow ready to melt in the Sierra Nevada mountains; some communities are still feeling the effects of previous droughts. With drought last year and good rains this year, we can see that there is variability.

These were the top four stories when I typed “New York drought” into the search bar. I had to do it. Didn’t you read the title?

Nothing alarming. The fourth story is about California drought 🙄 People from the East Coast know that there is not much variability in rains here from year to year.

You must have noticed that I am trying to compare different datasets. Are there ways to achieve this? Can we visually inspect for differences? Are there any measures that we can use?

Since we started with California and New York, let us compare rainfall data for two cities, Berkeley and New York City. Fortunately, we have more than 100 years of measured rainfall data for these cities. I am using the data from 1901 to 2000.

As always, we can prepare some graphics to visualize the data. Recall from Lesson 14 that we can use boxplots to get a perspective on the data range, its percentiles, and outliers. Since we have two datasets, Berkeley and New York City, let us look at the boxplots on the same scale, one below the other, like this:

There is a clear difference between the datasets. New York City gets a lot more rain than Berkeley, at least two times more on average. Notice the 50th percentile (the middle line of the box): around 600 mm for Berkeley and around 1200 mm for New York City.

Did you see that the minimum rainfall New York City gets per year is greater than what Berkeley gets 75% of the time?

What about variability? Is there a difference in the variability of the datasets?

I computed their standard deviations. For Berkeley, it is 228 mm, and for New York City, it is 216 mm. On the face of it, 228 and 216 do not look very different. So is there no difference in the variability? Is it sufficient to compare just the standard deviations?

You know that the standard deviation measures the typical deviation from the center of the data. But in this case, the two datasets do not have the same central value (average). New York City has an average rainfall of 1200 mm and a standard deviation of 216 mm. Compared to 1200 mm, the deviation is 18% (216/1200). Berkeley has an average rainfall of 600 mm and a standard deviation of 228 mm. Compared to 600 mm, the deviation is 38% (228/600).

This measure, the relative standard deviation, is called the coefficient of variation.

It measures the amount of variability in relation to the average. It is a standardized metric that can be used to compare datasets on different scales or units. It is common to express this ratio as a percentage as we did with Berkeley and New York City.
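As a quick sketch, the coefficient of variation for the two cities can be computed from the rounded numbers quoted above (means of 600 mm and 1200 mm, standard deviations of 228 mm and 216 mm):

```python
def coefficient_of_variation(std_dev, mean):
    # Relative standard deviation: variability in relation to the average.
    return std_dev / mean

cv_berkeley = coefficient_of_variation(228, 600)   # 0.38, i.e., 38%
cv_new_york = coefficient_of_variation(216, 1200)  # 0.18, i.e., 18%

print(f"Berkeley: {cv_berkeley:.0%}, New York City: {cv_new_york:.0%}")
```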

So, we can say that New York City gets more rainfall (about two times more on average) than Berkeley, and its relative variability is about half of Berkeley’s. That explains why there are fewer drought stories for New York compared to California.

Next time you read a drought story in your State, ask yourself “compared to what” and check out these maps.

 


Lesson 19 – Voice of the outliers

Joe is considering taking the SAT. His first instinct was to take a look at some previous SAT scores. Being a regular reader of this blog, he is familiar with NYC OpenData. So he searches for the College Board SAT results and finds data for graduating seniors of 2010 in New York City schools. Among these records, he is only looking at the critical reading scores for the 93 schools that have more than 100 test takers. He is now well versed with order statistics and boxplots, so he made a boxplot of his data. This conversation happened after that.

 


Lesson 18 – Data comes in all shapes

Symmetric or not symmetric, that is the question.

Whether the data are evenly spread out around the average, producing a symmetric frequency plot, Or some data are disproportionately present on the right side or the left side of the average; thereby disturbing the symmetry.


To notice, to realize the shape, and by shape to say we understand the data and its general behavior.

To notice, to realize, perhaps to measure the shape — ah, there’s a catch; for measuring, should we first discern whether the data is right skewed or left skewed.

For who would know that a few extreme values on the right create a positive skew, and a few extremes on the left create a negative skew; that the average of skewed data is not the same as the 50th percentile; that two datasets with the same average and standard deviation can have different shapes.

That the measure of skew is the average cubed deviation from the mean, scaled by the cube of the standard deviation:

skewness = (1/n) * Σ ((xᵢ − x̄)/s)³
Thus shape does make a necessary measure to summarize the data, And thus the natural hue of data analysis includes all three summary statistics: average, variance, and skewness.
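A small sketch of this measure in Python; the two datasets here are made-up numbers to show the sign convention, not data from the lessons:

```python
def skewness(data):
    # Average cubed deviation from the mean, scaled by the standard deviation cubed.
    n = len(data)
    mean = sum(data) / n
    sd = (sum((x - mean) ** 2 for x in data) / n) ** 0.5
    return sum(((x - mean) / sd) ** 3 for x in data) / n

# A few extremes on the right pull the skew positive; on the left, negative.
right_tail = [1, 2, 2, 3, 3, 4, 15]
left_tail = [-15, 1, 2, 2, 3, 3, 4]

print(skewness(right_tail) > 0, skewness(left_tail) < 0)
```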


Lesson 17 – We who deviate from the norm

Hello, I am the Wood Duck. I am the prettiest of all water birds. I live in wooded swamps; I nest in tree holes or nest boxes along the lake. Don’t mess with me; I have strong claws, can grip bark and perch on branches.

Ornithologists look for me each year, and the migratory bird data center records my detections. These are the numbers they found over the last decade in New York State. Since you are familiar with the number line, I am showing the data as a dot plot to spare you some time. Can you see that I also show the average number of individuals (of my species) on the plot?

 

Hello, I am the Pine Warbler. I fly in the high pines. I know it is hard to find me, but hey, can’t you listen to my musical trills? I have a stout beak; I can prod clumps of needles. Here are my detections in the last decade in New York State.

 

Did you notice that my numbers are spread out on the line? Sometimes my numbers are close to the average; sometimes they are far away from the average. I am deviating from the norm.

😉 These poor ornithologists have to understand the variation in finding me.

 

🙁 I am not deviating much from the average. You see, the average number of pine warblers in this data is 65, but the maximum is 75 and the minimum is 53. I am close to the average. I don’t deviate from the norm.

 

Like the average that summarizes the center of the data, if you want to summarize the deviation, you can do the following:

Compute the deviation of each number from the average.

Square these deviations. Why? So that the positive and negative deviations do not cancel each other out.

Get the average of all these squared deviations.

This measure of the variation in the data is called variance. It is the average squared deviation from the center of the data. Do you think it is sensitive to outliers?
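The three steps above can be sketched in Python; the detection counts below are hypothetical, not the actual bird data:

```python
def variance(data):
    n = len(data)
    average = sum(data) / n
    # Step 1: deviation of each number from the average.
    deviations = [x - average for x in data]
    # Step 2: square the deviations so positives and negatives don't cancel.
    squared_deviations = [d ** 2 for d in deviations]
    # Step 3: the average of the squared deviations is the variance.
    return sum(squared_deviations) / n

def standard_deviation(data):
    # Square root of the variance; back in the same units as the data.
    return variance(data) ** 0.5

detections = [53, 58, 65, 70, 75]  # hypothetical yearly counts
print(variance(detections), standard_deviation(detections))
```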

 

Mr. Wood, you forgot to point out that if you take the square root of the variance measure, you get the standard deviation.

It is in the same units as the average. We can look at a point on the number line and see how many standard deviations away it is from the average.

In your case, you have a number 79 on the line. Since your standard deviation is around 17, the 79 point is more than two standard deviations away from your average of 42.

 

Mr. Warbler, thank you. I think our readers now have two measures to summarize the data. The average and the variance. The average provides a measure of the center of the data, and the variance or standard deviation provides a measure of how spread out the data points are from the center.

While they play around with these summary statistics, I will go back to my swimming, and I guess you can go back to your singing.

 

How many of you thought you would see data on birds here? That is the variance for you. In the data analysis classroom, we not only mean well but also provide variety.

Our friends from the sky have deviated from their norm to teach us standard deviation. Have you been deviating from the norm? If not, why not?

Were you given the index card too?


Lesson 16 – Joe meets “the average”

I read a news article this week that said the average US family has $8377 in credit card debt.

 

 

😯

 

 

I got intrigued by the word “the average.” So I tried my usual fun search. I typed the phrase “what is the average” in Google and Bing search engines.

 

The questions that pop up in the search engines are fascinating. We should find the answers.

 

 

But I want clarification about the word “average” first. What is average? What are they showing when they say “the average something is something”?

 

Okay. Let us do our usual thing. Let us take a small dataset and explore the answer to your question. Given your interest in debt, let us look at data on student debt — you may relate to it more.

 

 

Great. Back to business.

 

I got this data from the New York Fed’s Regional Household Debt and Credit Snapshots. They include data about mortgages, student loans, credit cards, auto loans, home equity lines of credit and delinquencies. I extracted a subset of this data; average student loan balance for eleven regions in the Greater New York Area.

 

Let me jump in here. The first thing we do with data is to order them from smallest to largest and place the numbers as points on the line. I read in Lesson 14 that it provides a good visual perspective on the range of the data. So the data table looks like this

Excellent. Now imagine that this number line of yours is a weighing scale and numbers are balls of equal weight. For the weighing scale to be horizontal (without tilt), where do you think the balance point should be?

 

 

Somewhere in the middle? Isn’t it like the center of gravity?

 

 

Exactly. That balance point is called the average or mean of the data. You make a pass through the numbers and add them up, then divide this total by the number of data points.

 

Got the idea. If we use the equation on our data, we get the average debt across the 11 regions to be $33,827. The balance point will be like this.

 

So you see, the average student debt in the Greater New York area is $33,827. That seems pretty high.

 

 

Yeah that seems high, but let me look at the weighing scale again. There is one ball far out around $45,000. It looks like if we remove this ball, the balance point will move back. Let me try.

Hmm. Now the balance point is at $32,760. I get a sense that this average measure is somewhat sensitive to these far-out points.

 

You are correct. The mean or centroid of the points is a good summary of the average conditions only when there are no outliers. The mean is sensitive to outliers.
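Joe's experiment with the weighing scale can be sketched with hypothetical balances (made-up numbers for illustration, not the Fed's regional data):

```python
def mean(values):
    # Add the numbers, divide by the count: the balance point of the scale.
    return sum(values) / len(values)

balances = [28000, 30000, 31000, 32000, 33000, 45000]  # hypothetical debts

print(mean(balances))        # with the far-out ball on the scale
print(mean(balances[:-1]))   # the balance point moves back once it is removed
```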

 

 

Looks like the Manhattan folks are influencing the average debt number.

Ah, these New Yorkers are always up to something.

I may want to go to college one day, but these debt numbers are scary. How on earth can I pay back such yuge debt?

 

Don’t worry Joe. In New York, you get your way, and Average Joe gets to pay. 

 


Lesson 15 – Another brick in the wall

Yes, another week, another brick, another visual technique, to build our way to surmount data analysis.

You know how to create a dot plot and boxplot for any given data. Order the data from smallest to largest and place the numbers as points on the line → dot plot. You can see how the data points are spread out using a dot plot.

Compute the percentiles of the data and connect them using box and whiskers → boxplot. You can understand where most of the data is and how far out are the extreme points.

Now, remember the times when you enjoyed playing with your building blocks; remember the fights with your siblings when they flicked the bottom card of your carefully built house of cards.

We are going to create something similar. We are going to build a frequency plot from the data. Like the dot and box plots, the frequency plot will provide a simplified view of the data.

I will use the class size data for schools in New York City because we celebrated teachers day this week → dark sarcasm.

Details of the class sizes for each school, by grade and program type for the year 2010-2011 are available. Let us use a subset of this data for our lesson today; class sizes for 9th-grade general education program from 19 schools.

Building a frequency plot involves partitioning the data into groups or bins, counting the number of data points that fall in each bin and carefully assembling the building blocks for each bin.

Let us choose a bin of class size 10, i.e., our partitions are equally sized groups or bins 0 – 10; 10 – 20; 20 – 30, and so on.

Now, look at the data table above and see how many schools have an average class size between 0 and 10.

Yes, there are 0 schools in this category. Can you now check how many schools have an average class size between 10 and 20 students?

Did you see that MURRAY HILL ACADEMY, the ACADEMY FOR HEALTH CAREERS, and the HIGH SCHOOL FOR COMMUNITY LEADERSHIP have class sizes in the 10 – 20 students category or bin? Three schools. This count is the frequency of seeing numbers between 10 and 20. Do this counting for all the bins.

Imagine we start placing bricks in each bin. As many bricks as the frequency number suggests. One on top of the other. Like this

There are zero schools in the first bin (0 – 10); so there are no bricks in that bin. There are three schools in the second bin (10 – 20); so we construct a vertical tower using three blocks. Let us do this for all the bins.

That’s it. We have just constructed a frequency plot. From this plot, we can say that there are 0 schools with class size less than 10. So the probability of finding schools with a tiny class size is 0. There are nine schools with a class size between 20 and 30. So the probability of finding schools in this category is 9/19. We get a sense of the likelihood of the most frequently occurring data and the rarely occurring data. We get a sense of how the data is distributed.
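The counting procedure can be sketched in Python. The class sizes below are hypothetical stand-ins; the actual table of 19 schools is not reproduced here:

```python
def frequency_table(data, bin_width=10):
    # Partition the number line into equal bins and count points per bin.
    counts = {}
    for value in data:
        bin_start = (value // bin_width) * bin_width
        key = (bin_start, bin_start + bin_width)
        counts[key] = counts.get(key, 0) + 1
    return counts

class_sizes = [14, 18, 19, 22, 24, 25, 27, 31, 33]  # hypothetical
table = frequency_table(class_sizes)

# The relative frequency of a bin estimates the probability of that category.
print(table[(20, 30)] / len(class_sizes))
```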

Here is a look at the frequency plot for the largest class size.

Did you notice a gap between the towers? It could mean that there are distinct groups (clusters) in the data. It could also mean that we are choosing a small bin size.

Narrow bins lead to more irregular towers, so understanding the underlying pattern may be difficult. Wider bins result in more regular (smoother) towers, but we are putting a lot of data points into one bin, losing information about individual behavior.

So you see, data analysis is about understanding this trade-off between individuals and groups. Every data point is valuable because it provides information about the group.

If you find this lesson valuable, share it with others in your group.


Lesson 14 – The time has come; execute order statistics

Commander Joe, the time has come → execute order statistics. 

No, we are not going to destroy the Jedi Order. We are going to construct useful graphics to get better insights into the data. Also known as Exploratory Data Analysis (EDA), this procedure allows us to summarize and present the data in a concise form. It is also an excellent tool to detect outliers or unusual numbers. Imagine trying to understand a table full of numbers. I bet you will be as overwhelmed and lost as I am here.

Not to worry. You can solve this puzzle using order statistics and exploratory graphics. Recall lesson four where we visualized the sets using Venn diagrams. Since our brain perceives visual information better, a useful first step is to visualize the data. I looked for a small but powerful dataset to get us started. The FORCE guided me towards box office revenues of the STAR WARS films. Here they are.

This table presents the data in the order of the release of the films, but we are trying to understand the data on the revenue collected. Look at the table and tell me: which film had the lowest revenue?

Yes, it is “The Clone Wars.” It collected $39 million in revenue. Look at the table again and tell me: which film collected the highest revenue?

Yes, the epic “STAR WARS.”

You have just ordered the data from the smallest to the largest. You can take these ordered data and place them on a number line like this.

What you are seeing is a graphic called the dot plot. Each data point is shown as a dot at its value on the number line. You can see where each point is, relative to other data points. You can also see the range of the data. For our small data set, the box office revenue ranges from $39 Million to $1331 Million.

Okay, we have arranged our data in order and constructed a simple graphic to visualize it. Now, imagine how useful it would be if we can summarize the data into a few numbers (statistics). For example, can you tell me what is the middle value in the data, i.e. what number divides the data on the dot plot into two parts?

Yes, the fifth number (The Phantom Menace – Revenue $708 Million). The fifth number divides the nine numbers into two halves; 1, 2, 3, 4 on one side and 6, 7, 8, 9 on the other side. This middle value is called the 50th percentile→ 50% of the numbers are less than this number.

I threw in the “percentile” term there. Some of you must have remembered your SAT scores. What is your percentile score? If you have a 90th percentile score, 90% of the students who took the test have a score below yours. If you have a 75th percentile score, 75% of the students have a score below yours and so on.

Percentiles, also called order statistics of the data, are a nice way to summarize big data in a few numbers. For our data, these are some order statistics. I am showing the 25th, 50th, 75th and 95th percentiles on the dot plot.

Let us take one more step and construct another useful graphic by joining these order statistics. Let us put a box around the 25th, and 75th percentiles. This box will show the region with 50 percent of the data → 25th to 75th. Half of our data will be in the box. Let us also draw a line at the 50th percentile to indicate the middle data point.

Now, let us use wings (whiskers) and extend to lower and higher percentiles. We can stretch out the whiskers up to 1.5 times the box length.

If we cannot reach a data point using the whisker extensions from the box, we give up and call the data point an outlier or unusual data point.

This graphic is called the boxplot. Like the dot plot, we get a nice visual of the data range, its percentiles or order statistics, and we can visually detect outliers in the data.
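The whisker rule can be sketched as follows. The revenue list here is illustrative: only the minimum, median, and maximum match the numbers quoted in the lesson (39, 708, 1331), the rest are made up, and `statistics.quantiles` is just one of several percentile conventions:

```python
import statistics

def boxplot_outliers(data):
    # Box edges at the 25th and 75th percentiles; whiskers reach out
    # 1.5 box-lengths (1.5 * IQR). Anything beyond is an outlier.
    q1, _, q3 = statistics.quantiles(data, n=4)
    reach = 1.5 * (q3 - q1)
    return [x for x in data if x < q1 - reach or x > q3 + reach]

# Illustrative revenues in $ Millions (not the full table from the lesson).
revenues = [39, 600, 650, 680, 708, 730, 760, 800, 1331]
print(boxplot_outliers(revenues))  # the far-out points the whiskers cannot reach
```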

The “STAR WARS” is an outlier in the data, one of a kind.

Today is not May the fourth.

It is not revenge of the fifth.

It is the graphic saber of the sixth. Use it to conquer the data.

May the FORCE be with you.


Lesson 13 – Dear Mr. Bayes

April 29, 2017

Thomas Bayes

Dear Mr. Bayes,

I am writing this letter to thank you for coming up with the Bayes Theorem. It uses logic and evidence to understand and update our knowledge. We can revise our beliefs with new data, and we have been doing that for centuries now. You have provided the formal probability rules to combine data with prior knowledge to reach a more precise understanding. This way of thinking is inherent in our decision making now. I have benefited so much from this rule. I cannot thank you enough for it.

Today, I want to share a story with you about how Joe, the curious kid used Bayes Theorem to impress his boss.

Joe works part time at the MARKET.

His usual daily routine is to receive the bags of apples from Andy’s distribution company (A) and Betsy’s distribution company (B), check for quality and report to his boss.

One day, during this routine, he stepped out of the loading dock to check his twitter feed.

When he returned, he noticed that there was a bad apple → a rotten bag of apples in 100 bags of apples.

Since the bags were identical, he could not say whether the bad apple was from Andy’s or Betsy’s. All he knew was that Andy’s was contracted to deliver 60 bags and Betsy’s was contracted for 40 bags.

Joe is a sharp kid. Although he did not see who delivered that bad apple, he knew he could assign a probability that it came from Andy’s or Betsy’s. He called me for some advice.
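Here is a sketch of the kind of update Joe can make. The 60/40 delivery split gives the prior; the rotten-bag rate for each supplier is NOT given in the story, so the two rates below are purely hypothetical assumptions for illustration:

```python
# Prior: which company delivered a randomly chosen bag.
p_andys = 60 / 100
p_betsys = 40 / 100

# Hypothetical likelihoods (NOT in the story): chance a bag is rotten.
p_bad_given_andys = 0.02
p_bad_given_betsys = 0.01

# Bayes Theorem: P(Andy's | bad) = P(bad | Andy's) * P(Andy's) / P(bad)
p_bad = p_bad_given_andys * p_andys + p_bad_given_betsys * p_betsys
p_andys_given_bad = p_bad_given_andys * p_andys / p_bad

print(p_andys_given_bad)  # under these assumptions, about 0.75
```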

Mr. Bayes, I hope my students are calculating the updated probability of getting a problem on the Normal Distribution in the midterm based on the review session. Your discovery saw the light after the invention of Monte Carlo approaches. Bayesian methods are widely applied now. You can rest assured in heaven. Your posterity has ensured the use of the Bayes Theorem for centuries to come, albeit sometimes making it a methodological fad.

Sincerely,

A posterior Bayesian from Earth


Lesson 12 – Total recall

Recall October 29, 2012: High winds during Superstorm Sandy; record-breaking high tides, mandatory evacuations, significant power outages, $1 million in estimated property damage.

Recall February 24, 2016: Strong winds; a tractor blown over on the upper level of the George Washington Bridge, a sidewalk shed collapsed on Lenox Avenue in Harlem, $100k in estimated property damage.

Recall August 18, 2009: Thunderstorm winds; a hundred trees down in Central Park, significant tree damage in western Central Park between 90th and 100th Street, $300k in estimated property damage.

Recall conditional probability rule from lesson 9.

P(A|B) = P(A ∩ B) / P(B) 

or 

P(A ∩ B) = P(A|B)*P(B)

Recall mutually exclusive and collectively exhaustive events from lesson 4.

The midterms are mutually exclusive, but the final exam is collectively exhaustive.

If you have n events E1, E2, …, En that are mutually exclusive and collectively exhaustive, and another event A that may intersect these events, then the law of total probability says that the probability of A is the sum of the probabilities of its disjoint parts:

P(A) = P(A|E1)*P(E1) + P(A|E2)*P(E2) + … + P(A|En)*P(En)

Let us cut the jargon and try to understand this law using a simple example. We have the data on property damage during storms accessible from NOAA’s storm events database. Let us take a subset of this data — wind storms in New York City. You can get this subset here.

With some screening, you will see that there are 57 events → 16 high wind events, 15 strong wind events, and 26 thunderstorm wind events. Notice that there is property damage during some of these incidents. Let us visualize this setup using a Venn diagram.

Your immediate perception after seeing this picture might be that the high winds, strong winds, and thunderstorm winds are mutually exclusive and collectively exhaustive. They don’t intersect, and together they make up the entire wind storm sample space. Damages cut across these events.

Let us first focus on the high wind events. The 16 high wind events are shown as 16 points in the picture below. Notice that 4 of these points are within the damage zone. The probability of high wind events is P(H) = 16/57, the probability of high wind events and damage is P(damage ∩ high winds) = 4/57 and the probability of damage given high wind events is

P(damage|high winds) = P(damage ∩ high winds) / P(high winds) = 4/16

Now let us add all the other points (events) onto the picture. Some of these will be in damage zone, and some of them will be out of damage zone.

We can estimate the total probability of damage by adding its disjoint parts.

P(damage) = P(damage ∩ high winds) + P(damage ∩ strong winds) + P(damage ∩ thunderstorm winds)

or

P(damage) = P(damage|high winds)*P(high winds) + P(damage|strong winds)*P(strong winds) + P(damage|thunderstorm winds)*P(thunderstorm winds)

P(damage) = (4/16)*(16/57) + (8/15)*(15/57) + (9/26)*(26/57) = 21/57

The best part is that we can use this law as a predictive equation. Suppose there is an approaching storm and the weatherman told you that there is a 10% chance that the coming storm has high winds, 30% chance that it has strong winds and 60% chance that it has thunderstorm winds, you can immediately use this law and compute the probability of damage for NYC.

Can you tell me what that damage probability is?
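A sketch of that prediction in Python, using the damage rates derived above and the weatherman's forecast:

```python
# Conditional damage probabilities from the lesson's counts.
p_damage_given = {"high": 4 / 16, "strong": 8 / 15, "thunderstorm": 9 / 26}

# Forecast probabilities for the approaching storm.
forecast = {"high": 0.10, "strong": 0.30, "thunderstorm": 0.60}

# Law of total probability: weight each conditional probability by the
# chance of that storm type, then add the disjoint parts.
p_damage = sum(p_damage_given[t] * forecast[t] for t in forecast)
print(round(p_damage, 3))
```

Under this forecast, the chance of damage comes out to roughly 39%.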

Should I wait till after your Earth Day March?

Recall that you are totally contributing your share of CO2 to the earth during the March.
