Lesson 19 – Voice of the outliers

Joe is considering taking the SAT. His first instinct was to look at some previous SAT scores. Being a regular reader of this blog, he is familiar with NYC OpenData. So he searches for the College Board SAT results and finds data for the graduating seniors of 2010 in New York City schools. Among these records, he looks only at the critical reading scores for the 93 schools that had more than 100 test takers. He is now well versed in order statistics and boxplots, so he makes a boxplot of his data. This conversation happened after that.
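For readers who want to follow along in R, here is a rough sketch of Joe's workflow. The file name and column names below are guesses, not the actual NYC OpenData schema, so adjust them to match the downloaded file.

# sketch of Joe's workflow -- file and column names are guesses #
sat = read.csv("SAT_2010_College_Board_School_Level_Results.csv", header = TRUE)

# keep the schools with more than 100 test takers
sat_large = sat[sat$Number.of.Test.Takers > 100, ]

# boxplot of the critical reading scores for these schools
boxplot(sat_large$Critical.Reading.Mean, horizontal = TRUE,
        xlab = "Critical Reading Score")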

 

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 18 – Data comes in all shapes

Symmetric or not symmetric, that is the question.

Whether the data are evenly spread out around the average, producing a symmetric frequency plot, or some data are disproportionately present on the right or the left side of the average, thereby disturbing the symmetry.

 

 

 

 

 

 

 

To notice, to realize the shape, and by shape to say we understand the data and its general behavior.

To notice, to realize, perhaps to measure the shape — ah, there’s a catch; for in measuring, we must first discern whether the data are right skewed or left skewed.

For who would know that a few extreme values on the right create a positive skew, and a few extremes on the left create a negative skew; that the average of skewed data is not the same as the 50th percentile; that two datasets with the same average and standard deviation can have different shapes.

That the measure of skew is the average of the cubed deviations from the mean, divided by the cube of the standard deviation:

skewness = average of (x − mean)³ / (standard deviation)³

Thus shape does make a necessary measure to summarize the data, and thus the natural hue of data analysis includes all three summary statistics: average, variance, and skewness.
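To try this measure in R, here is a minimal sketch on a made-up right-skewed dataset (the numbers are invented for illustration, not taken from any lesson data).

# skewness of a made-up right-skewed dataset #
x = c(2, 3, 3, 4, 4, 5, 6, 8, 15, 30)   # a few extreme values on the right

avg = mean(x)
std_dev = sqrt(mean((x - avg)^2))
skew = mean((x - avg)^3) / std_dev^3    # positive -> right (positively) skewed

avg                  # the average is pulled toward the extremes ...
quantile(x, 0.5)     # ... so it is larger than the 50th percentile
skew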

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 17 – We who deviate from the norm

Hello, I am the Wood Duck. I am the prettiest of all water birds. I live in wooded swamps and nest in tree holes or nest boxes along the lake. Don’t mess with me; I have strong claws that can grip bark and perch on branches.

Ornithologists look for me each year, and the migratory bird data center records my detections. These are the numbers they found over the last decade in New York State. Since you are familiar with the number line, I am showing the data as a dot plot to spare you some time. Can you see that I also show the average number of individuals (of my species) on the plot?

 

Hello, I am the Pine Warbler. I fly in the high pines. I know it is hard to find me, but hey, can’t you listen to my musical trills? I have a stout beak; I can prod clumps of needles. Here are my detections in the last decade in New York State.

 

Did you notice that my numbers are spread out on the line? Sometimes my numbers are close to the average; sometimes they are far away from it. I am deviating from the norm.

😉 These poor ornithologists have to understand the variation in finding me.

 

🙁 I am not deviating much from the average. You see, the average number of pine warblers in this data is 65, but the maximum is 75 and the minimum is 53. I am close to the average. I don’t deviate from the norm.

 

Like the average that summarizes the center of the data, if you want to summarize the deviation, you can do the following:

Compute the deviation of each number from the average.

Square these deviations. Why? So that the positive and negative deviations do not cancel each other out.

Get the average of all these squared deviations.

This measure of the variation in the data is called variance. It is the average squared deviation from the center of the data. Do you think it is sensitive to the outliers?

 

Mr. Wood, you forgot to point out that if you take the square root of the variance measure, you get the standard deviation.

It is in the same units as the average. We can look at a point on the number line and see how many standard deviations away it is from the average.

In your case, you have a number 79 on the line. Since your standard deviation is around 17, the 79 point is more than two standard deviations away from your average of 42.

 

Mr. Warbler, thank you. I think our readers now have two measures to summarize the data. The average and the variance. The average provides a measure of the center of the data, and the variance or standard deviation provides a measure of how spread out the data points are from the center.

While they play around with these summary statistics, I will go back to my swimming, and I guess you can go back to your singing.
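If you want to play along in R, here is the recipe above in a few lines. The detection counts below are made up for illustration; the actual numbers live in the dot plots, not in the text.

# variance and standard deviation, step by step #
counts = c(20, 26, 30, 35, 38, 44, 50, 55, 62, 79)  # made-up detection counts

avg = mean(counts)               # the center of the data
deviations = counts - avg        # deviation of each number from the average
variance = mean(deviations^2)    # average squared deviation
std_dev = sqrt(variance)         # back to the same units as the data

(79 - avg)/std_dev               # how many standard deviations away is 79?

Note that R’s built-in var() and sd() divide by n − 1 instead of n, so the step-by-step version above matches the lesson’s definition exactly.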

 

How many of you thought you would see data on birds here? That is the variance for you. In the data analysis classroom, we not only mean well but also provide variety.

Our friends from the sky have deviated from their norm to teach us standard deviation. Have you been deviating from the norm? If not, why not?

Were you given the index card too?

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 16 – Joe meets “the average”

I read a news article this week that said the average US family has $8377 in credit card debt.

 

 

😯

 

 

I got intrigued by the word “the average.” So I tried my usual fun search. I typed the phrase “what is the average” in Google and Bing search engines.

 

The questions that pop up in the search engines are fascinating. We should find the answers.

 

 

But I want clarification about the word “average” first. What is average? What are they showing when they say “the average something is something”?

 

Okay. Let us do our usual thing. Let us take a small dataset and explore the answer to your question. Given your interest in debt, let us look at data on student debt — you may relate to it more.

 

 

Great. Back to business.

 

I got this data from the New York Fed’s Regional Household Debt and Credit Snapshots. They include data about mortgages, student loans, credit cards, auto loans, home equity lines of credit, and delinquencies. I extracted a subset of this data: the average student loan balance for eleven regions in the Greater New York Area.

 

Let me jump in here. The first thing we do with data is to order them from smallest to largest and place the numbers as points on the line. I read in Lesson 14 that it provides a good visual perspective on the range of the data. So the data table looks like this

Excellent. Now imagine that this number line of yours is a weighing scale and numbers are balls of equal weight. For the weighing scale to be horizontal (without tilt), where do you think the balance point should be?

 

 

Somewhere in the middle? Isn’t it like the center of gravity?

 

 

Exactly. That balance point is called the average or mean of the data. You make a pass through the numbers and add them up, then divide this total by the number of data points.

 

Got the idea. If we use the equation on our data, we get the average debt across the 11 regions to be $33,827. The balance point will be like this.

 

So you see, the average student debt in the Greater New York area is $33,827. That seems pretty high.

 

 

Yeah, that seems high, but let me look at the weighing scale again. There is one ball far out, around $45,000. It looks like if we remove this ball, the balance point will move back. Let me try.

Hmm. Now the balance point is at $32,760. I get a sense that this average measure is somewhat sensitive to these far-out points.

 

You are correct. The mean, or centroid, of the points is a good summary of the average conditions only if there are no outliers. The mean is sensitive to outliers.
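You can see this sensitivity with a couple of lines in R. The balances below are made-up round numbers, not the actual Fed figures, but the behavior is the same.

# the balance point with and without the far-out ball #
balance = c(28000, 30000, 31000, 32000, 32500, 33000,
            34000, 34500, 35000, 36000, 46000)   # made-up balances

mean(balance)                        # balance point with the far-out value included
mean(balance[-which.max(balance)])   # drop the largest ball; the balance point moves back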

 

 

Looks like the Manhattan folks are influencing the average debt number.

Ah, these New Yorkers are always up to something.

I may want to go to college one day, but these debt numbers are scary. How on earth can I pay back such yuge debt?

 

Don’t worry Joe. In New York, you get your way, and Average Joe gets to pay. 

 

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 15 – Another brick in the wall

Yes, another week, another brick, another visual technique, to build our way to surmount data analysis.

You know how to create a dot plot and boxplot for any given data. Order the data from smallest to largest and place the numbers as points on the line → dot plot. You can see how the data points are spread out using a dot plot.

Compute the percentiles of the data and connect them using box and whiskers → boxplot. You can understand where most of the data is and how far out the extreme points are.

Now, remember the times when you enjoyed playing with your building blocks; remember the fights with your siblings when they flicked the bottom card of your carefully built house of cards.

We are going to create something similar. We are going to build a frequency plot from the data. Like the dot and box plots, the frequency plot will provide a simplified view of the data.

I will use the class size data for schools in New York City because we celebrated teachers day this week → dark sarcasm.

Details of the class sizes for each school, by grade and program type, for the year 2010-2011 are available. Let us use a subset of this data for our lesson today: class sizes for the 9th-grade general education program at 19 schools.

Building a frequency plot involves partitioning the data into groups or bins, counting the number of data points that fall in each bin and carefully assembling the building blocks for each bin.

Let us choose a bin width of 10, i.e., our partitions are equally sized groups or bins: 0 – 10, 10 – 20, 20 – 30, and so on.

Now, look at the data table above and see how many schools have an average class size between 0 and 10.

Yes, there are 0 schools in this category. Can you now check how many schools have an average class size between 10 and 20 students?

Did you see that the MURRAY HILL ACADEMY, the ACADEMY FOR HEALTH CAREERS, and the HIGH SCHOOL FOR COMMUNITY LEADERSHIP have class sizes in the 10 – 20 students category or bin? Three schools. This count is the frequency of seeing numbers between 10 and 20. Do this counting for all the bins.

Imagine we start placing bricks in each bin. As many bricks as the frequency number suggests. One on top of the other. Like this

There are zero schools in the first bin (0 – 10); so there are no bricks in that bin. There are three schools in the second bin (10 – 20); so we construct a vertical tower using three blocks. Let us do this for all the bins.

That’s it. We have just constructed a frequency plot. From this plot, we can say that there are 0 schools with class size less than 10. So the probability of finding schools with a tiny class size is 0. There are nine schools with a class size between 20 and 30. So the probability of finding schools in this category is 9/19. We get a sense of the likelihood of the most frequently occurring data and the rarely occurring data. We get a sense of how the data is distributed.
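If you would rather let R stack the bricks, here is a sketch. The class sizes below are hypothetical stand-ins chosen so the bin counts match the lesson (three schools in 10 – 20, nine in 20 – 30); the real numbers are in the data table.

# frequency plot with bins of width 10 #
class_size = c(14, 17, 19, 22, 23, 24, 25, 26, 27, 28, 28, 29,
               31, 32, 33, 34, 35, 36, 38)   # hypothetical stand-ins

bins = seq(0, 50, by = 10)                        # 0-10, 10-20, 20-30, ...
counts = table(cut(class_size, breaks = bins))    # frequency in each bin
counts

hist(class_size, breaks = bins, col = "grey",
     xlab = "Average class size", main = "Frequency plot")

counts["(20,30]"]/length(class_size)              # probability of the 20-30 bin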

Here is a look at the frequency plot for the largest class size.

Did you notice a gap between the towers? It could mean that there are distinct groups (clusters) in the data. It could also mean that we are choosing a small bin size.

Narrow bins will lead to more irregular towers, so understanding the underlying pattern may be difficult. Wider bins will result in more regular (smoother) towers, but we are putting a lot of data points into one bin, leading to a loss of information about individual behavior.

So you see, data analysis is about understanding this trade-off between individuals and groups. Every data point is valuable because it provides information about the group.

If you find this lesson valuable, share it with others in your group.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 14 – The time has come; execute order statistics

Commander Joe, the time has come → execute order statistics. 

No, we are not going to destroy the Jedi Order. We are going to construct useful graphics to get better insights into the data. Also known as Exploratory Data Analysis (EDA), this procedure allows us to summarize and present the data in a concise form. It is also an excellent tool for detecting outliers or unusual numbers. Imagine trying to understand a table full of numbers. I bet you will be as overwhelmed and lost as I am here.

Not to worry. You can solve this puzzle using order statistics and exploratory graphics. Recall lesson four where we visualized the sets using Venn diagrams. Since our brain perceives visual information better, a useful first step is to visualize the data. I looked for a small but powerful dataset to get us started. The FORCE guided me towards box office revenues of the STAR WARS films. Here they are.

This table presents the data in the order of the release of the films, but we are trying to understand the data on the revenue collected. Look at the table and tell me which film had the lowest revenue?

Yes, it is “The Clone Wars.” It collected $39 million in revenue. Look at the table again and tell me which film collected the highest revenue?

Yes, the epic “STAR WARS.”

You have just ordered the data from the smallest to the largest. You can take these ordered data and place them on a number line like this.

What you are seeing is a graphic called the dot plot. Each data point is shown as a dot at its value on the number line. You can see where each point is, relative to other data points. You can also see the range of the data. For our small data set, the box office revenue ranges from $39 Million to $1331 Million.

Okay, we have arranged our data in order and constructed a simple graphic to visualize it. Now, imagine how useful it would be if we can summarize the data into a few numbers (statistics). For example, can you tell me what is the middle value in the data, i.e. what number divides the data on the dot plot into two parts?

Yes, the fifth number (The Phantom Menace – Revenue $708 Million). The fifth number divides the nine numbers into two halves; the 1st, 2nd, 3rd, and 4th on one side and the 6th, 7th, 8th, and 9th on the other side. This middle value is called the 50th percentile → 50% of the numbers are less than this number.

I threw in the “percentile” term there. Some of you must have remembered your SAT scores. What is your percentile score? If you have a 90th percentile score, 90% of the students who took the test have a score below yours. If you have a 75th percentile score, 75% of the students have a score below yours and so on.

Percentiles, also called order statistics of the data, are a nice way to summarize big data and express it in a few numbers. For our data, these are some order statistics. I am showing the 25th, 50th, 75th, and 95th percentiles on the dot plot.

Let us take one more step and construct another useful graphic by joining these order statistics. Let us put a box around the 25th and 75th percentiles. This box will show the region with 50 percent of the data → 25th to 75th. Half of our data will be in the box. Let us also draw a line at the 50th percentile to indicate the middle data point.

Now, let us use wings (whiskers) and extend to lower and higher percentiles. We can stretch out the whiskers up to 1.5 times the box length.

If we cannot reach a data point using the whisker extensions from the box, we give up and call the data point an outlier or unusual data point.

This graphic is called the boxplot. Like the dot plot, we get a nice visual of the data range, its percentiles or order statistics, and we can visually detect outliers in the data.
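Here is how you could build the same graphics in R. Only the smallest ($39 Million), the middle ($708 Million), and the largest ($1331 Million) revenues are quoted in this lesson; the other six numbers below are stand-ins so the sketch runs.

# order statistics, dot plot, and boxplot #
revenue = c(39, 290, 380, 475, 708, 775, 850, 936, 1331)   # $ Million; six values are stand-ins

quantile(revenue, probs = c(0.25, 0.50, 0.75, 0.95))        # order statistics

stripchart(revenue, pch = 16, xlab = "Revenue ($ Million)") # dot plot
boxplot(revenue, horizontal = TRUE, range = 1.5)            # whiskers up to 1.5 times the box length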

The “STAR WARS” is an outlier in the data, one of a kind.

Today is not May the fourth.

It is not revenge of the fifth.

It is the graphic saber of the sixth. Use it to conquer the data.

May the FORCE be with you.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 13 – Dear Mr. Bayes

April 29, 2017

Thomas Bayes

Dear Mr. Bayes,

I am writing this letter to thank you for coming up with the Bayes Theorem. It uses logic and evidence to understand and update our knowledge. We can revise our belief with new data, and we have been doing that for centuries now. You have provided the formal probability rules to combine data with prior knowledge to get more precise understanding. This way of thinking is inherent in our decision making now. I have benefitted so much from this rule. I cannot thank you enough for this.

Today, I want to share a story with you about how Joe, the curious kid used Bayes Theorem to impress his boss.

Joe works part time at the MARKET.

His usual daily routine is to receive the bags of apples from Andy’s distribution company (A) and Betsy’s distribution company (B), check for quality and report to his boss.

One day, during this routine, he stepped out of the loading dock to check his Twitter feed.

When he returned, he noticed that there was a bad apple → one rotten bag among the 100 bags of apples.

Since the bags were identical, he could not say whether the bad apple was from Andy’s or Betsy’s. All he knew was that Andy’s was contracted to deliver 60 bags and Betsy’s was contracted for 40 bags.

Joe is a sharp kid. Although he did not see who delivered that bad apple, he knew he could assign a probability that it came from Andy’s or Betsy’s. He called me for some advice.
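The advice, of course, was your rule, Mr. Bayes. In R, the setup looks something like the sketch below; the 60/40 split is from the story, but the chance that each company delivers a rotten bag is not stated above, so those two rates are placeholders.

# Bayes rule for the bad apple: P(Andy's | rotten bag) #
prior_A = 0.60          # Andy's delivers 60 of the 100 bags
prior_B = 0.40          # Betsy's delivers 40 of the 100 bags

rotten_given_A = 0.02   # placeholder: chance a bag from Andy's is rotten
rotten_given_B = 0.01   # placeholder: chance a bag from Betsy's is rotten

p_rotten = rotten_given_A*prior_A + rotten_given_B*prior_B   # total probability
posterior_A = rotten_given_A*prior_A/p_rotten
posterior_B = rotten_given_B*prior_B/p_rotten

c(posterior_A, posterior_B)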

Mr. Bayes, I hope my students are calculating the updated probability of getting a problem on the Normal Distribution in the mid-term based on the review session. Your discovery saw the light of day after the invention of Monte Carlo approaches. Bayesian methods are widely applied now. You can rest assured in heaven. Your posterity has ensured the use of Bayes Theorem for centuries to come, albeit sometimes making it a methodological fad.

Sincerely,

A posterior Bayesian from Earth

 

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 12 – Total recall

Recall October 29, 2012: High winds during Superstorm Sandy; record-breaking high tides, mandatory evacuations, significant power outages, $1 million in estimated property damage.

Recall February 24, 2016: Strong winds; a tractor blown over on the upper level of the George Washington Bridge, a sidewalk shed collapsed on Lenox Avenue in Harlem, $100k in estimated property damage.

Recall August 18, 2009: Thunderstorm winds; a hundred trees down in Central Park, significant tree damage in western Central Park between 90th and 100th Street, $300k in estimated property damage.

Recall conditional probability rule from lesson 9.

P(A|B) = P(A ∩ B) / P(B) 

or 

P(A ∩ B) = P(A|B)*P(B)

Recall mutually exclusive and collectively exhaustive events from lesson 4.

The midterms are mutually exclusive, but the final exam is collectively exhaustive.

If you have n events E1, E2, … En that are mutually exclusive and collectively exhaustive, and another event A that may intersect these events, then the law of total probability says that the probability of A is the sum of the probabilities of its disjoint parts: P(A) = P(A ∩ E1) + P(A ∩ E2) + … + P(A ∩ En).

Let us cut the jargon and try to understand this law using a simple example. We have the data on property damage during storms accessible from NOAA’s storm events database. Let us take a subset of this data — wind storms in New York City. You can get this subset here.

With some screening, you will see that there are 57 events → 16 high wind events, 15 strong wind events, and 26 thunderstorm wind events. Notice that there is property damage during some of these incidents. Let us visualize this setup using a Venn diagram.

Your immediate perception after seeing this picture would have been that the high winds, strong winds, and thunderstorm winds are mutually exclusive and collectively exhaustive. They don’t intersect, and together make up the entire wind storm sample space. Damages cut across these events.

Let us first focus on the high wind events. The 16 high wind events are shown as 16 points in the picture below. Notice that 4 of these points are within the damage zone. The probability of high wind events is P(H) = 16/57, the probability of high wind events and damage is P(damage ∩ high winds) = 4/57 and the probability of damage given high wind events is

P(damage|high winds) = P(damage ∩ high winds) / P(high winds) = 4/16

Now let us add all the other points (events) onto the picture. Some of these will be in damage zone, and some of them will be out of damage zone.

We can estimate the total probability of damage by adding its disjoint parts.

P(damage) = P(damage ∩ high winds) + P(damage ∩ strong winds) + P(damage ∩ thunderstorm winds)

or

P(damage) = P(damage|high winds)*P(high winds) + P(damage|strong winds)*P(strong winds) + P(damage|thunderstorm winds)*P(thunderstorm winds)

P(damage) = (4/16)*(16/57) + (8/15)*(15/57) + (9/26)*(26/57) = 21/57

The best part is that we can use this law as a predictive equation. Suppose there is an approaching storm, and the weatherman tells you that there is a 10% chance that the coming storm has high winds, a 30% chance that it has strong winds, and a 60% chance that it has thunderstorm winds. You can immediately use this law and compute the probability of damage for NYC.

Can you tell me what that damage probability is?
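If you want to check your answer, the same law takes a few lines in R, using the conditional damage probabilities estimated from the 57 events above.

# predictive use of the law of total probability #
p_damage_given = c(high = 4/16, strong = 8/15, thunderstorm = 9/26)
p_wind = c(high = 0.10, strong = 0.30, thunderstorm = 0.60)   # weatherman's forecast

sum(p_damage_given*p_wind)   # about 0.39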

Should I wait till after your Earth Day March?

Recall that you are totally contributing your share of CO2 to the earth during the March.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 11 – Fran, the functionin’ R-bot

Hello, my name is Fran. I am a function in R.

 

 Hello, my name is D. I am a functionin’ person.

 

I can perform any task you give me.

 

Interesting, can you add two numbers?

 

Yes, I can.

 

Can you tell me more about how you work?

 

Sure, I need the inputs and the instruction, i.e. what you want me to do with the inputs.

Okay. I am giving you two numbers, 10 and 15. Can you show me how you will create a function to add them?

 

This is easy. Let me first give you the structure.

# structure of a function #
functionname = function(inputs)
{
 instructions
 
 return(output)
}

You can select these lines and hit the run button to load the function in R. Once the function is loaded, you can call it by its name with any inputs.

Let us say the two numbers are a and b. These numbers are provided as inputs. I will first assign a name to the function — “add”. Since you are asking me to add two numbers, the instruction will be y = a + b, and I will return the value of y.

Here is a short video showing how to create a function to add two numbers a and b. You can try it in your RStudio program.

 Neat. If I give you three numbers, m, x, and b, can you write a function for mx + b?

 

Yes. I believe you are asking me to write a function for a straight line: y = mx + b. I will assign “line_eq” as the name of the function, the inputs will be m, x, and b and the output will be y.

# function for line equation #
line_eq = function (m, x, b)
{
 # m is the slope of the line 
 # b is the intercept of the line
 # x is the point on the x-axis
 
 y = m*x + b # equation for the line 
 
 return(y) 
}

# test the function #
line_eq(0.5, 5, 10)
> 12.5

Can you perform more than one task? For example, if I ask you for y = mx + b and x + y, can you return both values?

 

Yes, I can. I will have two instructions. In the end, I will combine both the outputs into one vector and return the values. Here is how I do it.

# function for line equation + (x + y) #
two_tasks = function (m, x, b)
{
 # m is the slope of the line 
 # b is the intercept of the line
 # x is the point on the x-axis
 
 y = m*x + b # equation for the line 
 
 z = x + y
 
 return(c(y,z)) 
}

# test the function #
two_tasks(0.5, 5, 10)
> 12.5 17.5

Very impressive. What if some of the inputs are numbers and some of them are a set of numbers? For instance, if I give you many points on the x-axis, m and b, the slope and the intercept, can you give me the values for y?

 

No problemo. The same line_eq function will work. Let us say you give me some numbers x = [1, 2, 3, 4, 5], m = 0.5 and b = 10. I will use the same function line_eq(m, x, b).

# use on vectors #
x = c(1,2,3,4,5)
m = 0.5
b = 10

line_eq(m,x,b)
> 10.5 11.0 11.5 12.0 12.5

I am beginning to like you. But, maybe you are fooling me with simple tricks. I don’t need a robot for doing simple math.

 

Hey, my name is Fran 😡

 

Okay Fran. Prove to me that you can do more complicated things.

 

Bring it on.

 

It is springtime, and I’d love to get a Citi Bike and ride around the city. I want you to tell me how many people rented a bike on the most popular route, the Central Park Southern Loop, and what the average trip time was.

 

aargh… your obsession with the city. Give me the data.

 

Here you go. You can use the March 2017 file. They have data for the trip duration in seconds, check-out time and check-in time, start station and end station.

Alright. I will name the function “bike_analysis.” The inputs will be the data for the bike ridership for a month and the name of the station. The function will identify how many people rented bikes at the Central Park S station and returned them to the same station — completing the loop. You asked me for the total rides and the average trip time. I threw in the maximum and minimum ride time too. You can use this function with data from any month and at any station.

# function to analyze bike data #
bike_analysis = function(bike_data, station_name)
{
 dum = which(bike_data$Start.Station.Name == station_name &
             bike_data$End.Station.Name == station_name)
 total_rides = length(dum)
 
 average_time = mean(bike_data$Trip.Duration[dum])/60 # in minutes 
 max_time = max(bike_data$Trip.Duration[dum])/60 # in minutes 
 min_time = min(bike_data$Trip.Duration[dum])/60 # in minutes
 
 output = c(total_rides, average_time, max_time, min_time)
 return(output)
}

# use the function to analyze Central Park South Loop #

# bike data # 
bike_data = read.csv("201703-citibike-tripdata.csv",header=T)

station_name = "Central Park S & 6 Ave"

bike_analysis(bike_data,station_name)
> 212.000000  42.711085 403.000000   1.066667

212 trips, 42 minutes of average trip time. The maximum trip time is 403 minutes, and the minimum trip time is about 1 minute. Changed your mind?

Wow. You are truly helpful. I would have spent a lot of time if I were to do this manually. I can use your brains and spend my holiday weekend riding the bike.

 

Have fun … and Happy Easter.

 

How did you know that?

 

Machine Learning man 😉

 

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 10 – The fight for independence

I don’t have a “get out of jail free” card.

I don’t want to pay $50 to the bank because I am short on cash.

😉 I am trusting my magic dice to fight for my freedom. I know I will roll a double.

🙁 I am disappointed. As I wait for my turn, I realized that I did not kiss the dice before rolling. So I do it now and roll again.

😯 Maybe I should have kissed the dice two times since it is the second try. Oh, I did not pray before rolling. So I pray and roll the dice with optimism.

😡 I don’t believe this. My magic dice betrayed me. I will throw them away and get new ones.

Wait. The magic dice did not betray you. They are just following the probability rule for independent events. Unlike you, your magic dice have no memory. They do not know that the previous try was not a double. All they know is that the probability of getting a double on any try is 16.66%.

Assume A is the event of seeing a double, and B is a previous event, say {6,1} – not a double.

The probability of getting a double given that the last try was not a double, P(A|B) is equal to the probability of getting a double in any try, P(A). P(A) does not depend on whether or not event B has happened. B does not influence A.

For independent events A and B, 
P(A|B) = P(A)

From lesson 9, conditional probability rule, we know that

P(A|B) = P(A ∩ B)/P(B)

We can combine these two and come up with a property for independent events.

P(A ∩ B) = P(A) * P(B)

For independent events, the probability of both happening (A and B) is the product of the individual probabilities.

Let us apply this property to our example. What is the probability of not seeing a double in three consecutive rolls (with prayer 🙂 or without prayer)? In other words, what are the odds of missing three rounds of the game and paying $50 to get my freedom finally?

The probability of not seeing a double in any try is 30/36. 30 non-double outcomes in 36 possibilities. Since the events are independent, the likelihood of seeing three non-doubles is (30/36)(30/36)(30/36) ≅ 58%.
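If you do not trust the arithmetic, you can make R roll the dice many times — a quick simulation sketch (not part of the original game, just a check):

# simulate three tries to roll a double, many rounds #
set.seed(1)
rounds = 100000
no_double = replicate(rounds, {
  die1 = sample(1:6, 3, replace = TRUE)
  die2 = sample(1:6, 3, replace = TRUE)
  all(die1 != die2)    # TRUE if none of the three tries is a double
})
mean(no_double)        # close to (30/36)^3, about 58%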

I should have known that before praying.

If the events are independent, they do not influence each other. A coin toss cannot affect a dice. Torrential rain in London may have nothing to do with the severe drought in California. Your actions may not influence my actions because we are independent.

We all like being independent … or the illusion of independence!

 

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.
