May 2017 – dataanalysisclassroom

Lesson 17 – We who deviate from the norm

Hello, I am the Wood Duck. I am the prettiest of all water birds. I live in wooded swamps; I nest in tree holes or nest boxes along the lake. Don’t mess with me; I have strong claws, so I can grip bark and perch on branches.

Ornithologists look for me each year, and the migratory bird data center records my detections. These are the numbers they found for the last decade in New York State. Since you are familiar with the number line, I am showing the data as a dot plot to spare you some time. Can you see that I also show the average number of individuals (of my species) on the plot?

Hello, I am the Pine Warbler. I fly in the high pines. I know it is hard to find me, but hey, can’t you listen to my musical trills? I have a stout beak; I can prod clumps of needles. Here are my detections in the last decade in New York State.

Did you notice that my numbers are spread out on the line? Sometimes, my numbers are close to the average; sometimes they are far away from the average. I am deviating from the norm.

😉 These poor ornithologists have to understand the variation in finding me.

🙁 I am not deviating much from the average. You see, the average number of pine warblers in this data is 65, but the maximum is 75 and the minimum is 53. I am close to the average. I don’t deviate from the norm.

Like the average that summarizes the center of the data, if you want to summarize the deviation, you can do the following:

Compute the deviation of each number from the average.

Square these deviations. Why?

Get the average of all these squared deviations.

This measure of the variation in the data is called variance. It is the average squared deviation from the center of the data. Do you think it is sensitive to the outliers?
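Here is a minimal sketch of those three steps in Python. The numbers in `counts` are made up for illustration; they are not the detections shown on the dot plots, and the variable names are my own.

```python
# Hypothetical yearly detection counts, made up for illustration only
# (these are NOT the Wood Duck or Pine Warbler numbers from the plots).
counts = [20, 35, 42, 50, 63]

average = sum(counts) / len(counts)

# Step 1: deviation of each number from the average.
deviations = [x - average for x in counts]

# The raw deviations cancel out (they sum to zero), which is one reason
# to square them before averaging.
total_deviation = sum(deviations)

# Step 2: square the deviations.
squared_deviations = [d ** 2 for d in deviations]

# Step 3: average the squared deviations (dividing by n, as in the lesson).
variance = sum(squared_deviations) / len(squared_deviations)

print("average:", average)
print("sum of deviations:", total_deviation)
print("variance:", variance)
```

Note that this sketch divides by the number of data points, as the lesson describes. Python's `statistics.pvariance` follows the same convention, while `statistics.variance` divides by one less than the number of data points.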

Mr. Wood, you forgot to point out that if you take the square root of the variance measure, you get the standard deviation.

It is in the same units as the average. We can look at a point on the number line and see how many standard deviations away it is from the average.

In your case, you have a number 79 on the line. Since your standard deviation is around 17, the 79 point is more than two standard deviations away from your average of 42.
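The arithmetic behind that statement can be checked in a few lines. The three numbers come from the lesson; since the standard deviation is quoted as "around 17", treat the result as approximate.

```python
average = 42    # Wood Duck average, from the lesson
std_dev = 17    # "around 17" in the lesson, so the result is approximate
point = 79      # the far-out point on the number line

# Number of standard deviations between the point and the average.
distance = (point - average) / std_dev

print(round(distance, 2))  # 2.18
```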

Mr. Warbler, thank you. I think our readers now have two measures to summarize the data: the average and the variance. The average provides a measure of the center of the data, and the variance or standard deviation provides a measure of how spread out the data points are from the center.

While they play around with these summary statistics, I will go back to my swimming, and I guess you can go back to your singing.

How many of you thought you would see data on birds here? That is the variance for you. In the data analysis classroom, we not only mean well but also provide variety.

Our friends from the sky have deviated from their norm to teach us standard deviation. Have you been deviating from the norm? If not, why not?

Were you given the index card too?

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 16 – Joe meets “the average”

I read a news article this week that said the average US family has $8377 in credit card debt.

😯

I got intrigued by the word “the average.” So I tried my usual fun search. I typed the phrase “what is the average” in the Google and Bing search engines.

The questions that pop up in the search engines are fascinating. We should find the answers.

But I want clarification about the word “average” first. What is average? What are they showing when they say “the average something is something”?

Okay. Let us do our usual thing. Let us take a small dataset and explore the answer to your question. Given your interest in debt, let us look at data on student debt — you may relate to it more.

Great. Back to business.

I got this data from the New York Fed’s Regional Household Debt and Credit Snapshots. They include data about mortgages, student loans, credit cards, auto loans, home equity lines of credit and delinquencies. I extracted a subset of this data: the average student loan balance for eleven regions in the Greater New York Area.

Let me jump in here. The first thing we do with data is to order them from smallest to largest and place the numbers as points on the line. I read in Lesson 14 that it provides a good visual perspective on the range of the data. So the data table looks like this.

Excellent. Now imagine that this number line of yours is a weighing scale and numbers are balls of equal weight. For the weighing scale to be horizontal (without tilt), where do you think the balance point should be?

Somewhere in the middle? Isn’t it like the center of gravity?

Exactly. That balance point is called the average or mean of the data. You make a pass through the numbers and add them up, then divide this total by the number of data points.
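As a small sketch of the weighing scale idea, here is the same calculation in Python. The balances below are made up for illustration (they are not the eleven regional values), and `left_pull` and `right_pull` are names I chose for the total distance of the balls on each side of the balance point.

```python
# Hypothetical loan balances in dollars, made up for illustration only
# (these are NOT the eleven regional values from the New York Fed data).
balances = [28000, 30000, 31000, 33000, 45000]

# One pass to add the numbers up, then divide by how many there are.
mean = sum(balances) / len(balances)

# At the balance point, the pull from the left equals the pull from the right.
left_pull = sum(mean - x for x in balances if x < mean)
right_pull = sum(x - mean for x in balances if x > mean)

print("mean:", mean)
print("left pull:", left_pull, "right pull:", right_pull)
```

With these made-up numbers the mean is 33400, and the distances on either side both add up to 11600, so the scale stays horizontal.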

Got the idea. If we use the equation on our data, we get the average debt across the 11 regions to be $33,827. The balance point will be like this.

So you see, the average student debt in the Greater New York area is $33,827. That seems pretty high.

Yeah that seems high, but let me look at the weighing scale again. There is one ball far out around $45,000. It looks like if we remove this ball, the balance point will move back. Let me try.

Hmm. Now the balance point is at $32,760. I get a sense that this average measure is somewhat sensitive to these far out points.

You are correct. The mean or centroid of the points is a good summary of the average conditions if there are no outliers. Mean is sensitive to outliers.
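The two averages quoted above are enough to back out the ball that was removed. This is only a check on the lesson's numbers; both averages are rounded to the dollar, so the result is approximate.

```python
n_regions = 11
mean_all = 33827        # average with all eleven regions, from the lesson
mean_without = 32760    # average after removing the far-out ball, from the lesson

# total = mean * count, so the removed value is the difference of the two totals.
removed_value = n_regions * mean_all - (n_regions - 1) * mean_without

# How far one point out of eleven moved the balance point.
shift = mean_all - mean_without

print("implied removed value:", removed_value)
print("shift in the mean:", shift)
```

The implied value is $44,497, which agrees with the ball "far out around $45,000", and that single point shifts the average by a little over a thousand dollars.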

Looks like the Manhattan folks are influencing the average debt number.

Ah, these New Yorkers are always up to something.

I may want to go to college one day, but these debt numbers are scary. How on earth can I pay back such yuge debt?

Don’t worry Joe. In New York, you get your way, and Average Joe gets to pay.


Lesson 15 – Another brick in the wall

Yes, another week, another brick, another visual technique, to build our way to surmount data analysis.

You know how to create a dot plot and boxplot for any given data. Order the data from smallest to largest and place the numbers as points on the line → dot plot. You can see how the data points are spread out using a dot plot.

Compute the percentiles of the data and connect them using box and whiskers → boxplot. You can understand where most of the data is and how far out the extreme points are.

Now, remember the times when you enjoyed playing with your building blocks; remember the fights with your siblings when they flicked the bottom card of your carefully built house of cards.

We are going to create something similar. We are going to build a frequency plot from the data. Like the dot and box plots, the frequency plot will provide a simplified view of the data.

I will use the class size data for schools in New York City because we celebrated Teachers’ Day this week → dark sarcasm.

Details of the class sizes for each school, by grade and program type for the year 2010-2011 are available. Let us use a subset of this data for our lesson today: class sizes for the 9th-grade general education program from 19 schools.

Building a frequency plot involves partitioning the data into groups or bins, counting the number of data points that fall in each bin and carefully assembling the building blocks for each bin.

Let us choose a bin width of 10 students, i.e., our partitions are equally sized groups or bins 0 – 10; 10 – 20; 20 – 30, and so on.

Now, look at the data table above and see how many schools have an average class size between 0 and 10.

…

Yes, there are 0 schools in this category. Can you now check how many schools have an average class size between 10 and 20 students?

…

Did you see that the MURRAY HILL ACADEMY, the ACADEMY FOR HEALTH CAREERS, and the HIGH SCHOOL FOR COMMUNITY LEADERSHIP have class sizes in the 10 – 20 students category or bin? Three schools. This count is the frequency of seeing numbers between 10 and 20. Do this counting for all the bins.

Imagine we start placing bricks in each bin. As many bricks as the frequency number suggests. One on top of the other. Like this:

There are zero schools in the first bin (0 – 10); so there are no bricks in that bin. There are three schools in the second bin (10 – 20); so we construct a vertical tower using three blocks. Let us do this for all the bins.

That’s it. We have just constructed a frequency plot. From this plot, we can say that there are 0 schools with class size less than 10. So the probability of finding schools with a tiny class size is 0. There are nine schools with a class size between 20 and 30. So the probability of finding schools in this category is 9/19. We get a sense of the likelihood of the most frequently occurring data and the rarely occurring data. We get a sense of how the data is distributed.
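The counting can be sketched in Python as follows. The class sizes are made up for illustration; they are only arranged to reproduce the counts quoted in the lesson for the first three bins (0, 3 and 9 out of 19 schools), and placing the remaining seven schools in the 30 – 40 bin is my assumption. I also assume a value on a boundary goes into the bin that starts there, since the lesson does not say.

```python
from collections import Counter

# Hypothetical class sizes for 19 schools, made up for illustration only.
# They reproduce the lesson's counts for bins 0-10, 10-20 and 20-30;
# putting the other seven schools in 30-40 is an assumption.
class_sizes = [14, 17, 19, 21, 22, 23, 24, 25, 26, 27,
               28, 29, 31, 32, 33, 34, 35, 36, 38]
bin_width = 10

def bin_counts(values, width):
    """Count values per bin, keyed by the bin's left edge (left edge included)."""
    return Counter(int(v // width) * width for v in values)

counts = bin_counts(class_sizes, bin_width)

for start in range(0, 40, bin_width):
    frequency = counts[start]                    # a missing bin counts as 0
    share = frequency / len(class_sizes)         # relative frequency
    print(f"{start:>2} - {start + bin_width:<2} {'#' * frequency:<10} {share:.3f}")
```

The relative frequency for the 20 – 30 bin is 9/19, the same number the lesson reads off the towers.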

Here is a look at the frequency plot for the largest class size.

Did you notice a gap between the towers? It could mean that there are distinct groups (clusters) in the data. It could also mean that we are choosing a small bin size.

Narrow bins will lead to more irregular towers, so understanding the underlying pattern may be difficult. Wider bins will result in more regular towers (smoother), but we are putting a lot of data points into one bin leading to loss of information about individual behavior.
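To see the trade-off, here is a rough sketch that builds the towers twice from the same numbers, once with narrow bins and once with wide bins. The class sizes are again made up for illustration, and `towers` is a helper name I chose.

```python
from collections import Counter

# Hypothetical class sizes, made up for illustration only.
sizes = [14, 15, 19, 21, 22, 22, 23, 25, 25, 26,
         28, 29, 31, 32, 32, 35, 36, 36, 39]

def towers(values, width):
    """Return (bin start, count) pairs, keeping empty bins between towers."""
    counts = Counter((v // width) * width for v in values)
    first, last = min(counts), max(counts)
    return [(start, counts[start]) for start in range(first, last + width, width)]

for width in (2, 10):
    print(f"bin width {width}")
    for start, count in towers(sizes, width):
        print(f"{start:>3} - {start + width:<3} {'#' * count}")
```

With a width of 2 the towers are short and uneven, and an empty bin appears at 16 – 18; with a width of 10 the same data collapse into three smooth towers.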

So you see, data analysis is about understanding this trade-off between individuals and groups. Every data point is valuable because it provides information about the group.

If you find this lesson valuable, share it with others in your group.


Lesson 14 – The time has come; execute order statistics

Commander Joe, the time has come → execute order statistics.

No, we are not going to destruct the Jedi Order. We are going to construct useful graphics to get better insights into the data. Also known as Exploratory Data Analysis (EDA), this procedure allows us to summarize and present the data in a concise form. It is also an excellent tool to detect outliers or unusual numbers. Imagine trying to understand a table full of numbers. I bet you will be as overwhelmed and lost as I am here.

Not to worry. You can solve this puzzle using order statistics and exploratory graphics. Recall lesson four where we visualized the sets using Venn diagrams. Since our brain perceives visual information better, a useful first step is to visualize the data. I looked for a small but powerful dataset to get us started. The FORCE guided me towards box office revenues of the STAR WARS films. Here they are.

This table presents the data in the order of the release of the films, but we are trying to understand the data on the revenue collected. Look at the table and tell me which film had the lowest revenue?

…

Yes, it is “The Clone Wars.” It collected $39 million in revenue. Look at the table again and tell me which film collected the highest revenue?

…

Yes, the epic “STAR WARS.”

You have just ordered the data from the smallest to the largest. You can take these ordered data and place them on a number line like this.

What you are seeing is a graphic called the dot plot. Each data point is shown as a dot at its value on the number line. You can see where each point is, relative to other data points. You can also see the range of the data. For our small data set, the box office revenue ranges from $39 Million to $1331 Million.

Okay, we have arranged our data in order and constructed a simple graphic to visualize it. Now, imagine how useful it would be if we could summarize the data into a few numbers (statistics). For example, can you tell me what the middle value in the data is, i.e., what number divides the data on the dot plot into two parts?

…

Yes, the fifth number (The Phantom Menace – Revenue $708 Million). The fifth number divides the nine numbers into two halves; 1, 2, 3, 4 on one side and 6, 7, 8, 9 on the other side. This middle value is called the 50th percentile → 50% of the numbers are less than this number.

I threw in the “percentile” term there. Some of you must have remembered your SAT scores. What is your percentile score? If you have a 90th percentile score, 90% of the students who took the test have a score below yours. If you have a 75th percentile score, 75% of the students have a score below yours and so on.

Percentiles, also called order statistics for the data, are a nice way to summarize big data and express them in a few numbers. For our data, these are some order statistics. I am showing 25th, 50th, 75th and 95th percentiles on the dot plot.

Let us take one more step and construct another useful graphic by joining these order statistics. Let us put a box around the 25th and 75th percentiles. This box will show the region with 50 percent of the data → 25th to 75th. Half of our data will be in the box. Let us also draw a line at the 50th percentile to indicate the middle data point.
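If you want to try this yourself, here is a minimal sketch using Python's standard `statistics` module. The nine values are made up for illustration and are not the box office revenues. Software packages interpolate percentiles in slightly different ways; the "inclusive" method used here is just one convention, not necessarily the one behind the plot.

```python
import statistics

# Nine hypothetical values, made up for illustration only
# (these are NOT the STAR WARS box office revenues).
values = [47, 12, 150, 31, 63, 25, 78, 40, 55]

ordered = sorted(values)

# With nine ordered numbers, the fifth one is the middle value.
median = ordered[len(ordered) // 2]

# 25th, 50th and 75th percentiles (quartiles) with the "inclusive" method.
quartiles = statistics.quantiles(ordered, n=4, method="inclusive")

print("ordered:", ordered)
print("median:", median)
print("quartiles:", quartiles)
```

Here the median is 47 and the quartiles come out as 31, 47 and 63, so the middle quartile agrees with the fifth ordered number.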

Now, let us use wings (whiskers) and extend to lower and higher percentiles. We can stretch out the whiskers up to 1.5 times the box length beyond each end of the box.

If we cannot reach a data point using the whisker extensions from the box, we give up and call the data point an outlier or unusual data point.
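The whisker rule can be sketched as follows. The nine values are made up for illustration (they are not the box office revenues), and the quartiles use the "inclusive" method from Python's `statistics` module, which is an assumption about the convention.

```python
import statistics

# Nine hypothetical values, made up for illustration only
# (these are NOT the STAR WARS box office revenues).
values = sorted([47, 12, 150, 31, 63, 25, 78, 40, 55])

# The box runs from the 25th to the 75th percentile.
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
box_length = q3 - q1

# Whiskers may stretch at most 1.5 box lengths beyond each end of the box.
lower_limit = q1 - 1.5 * box_length
upper_limit = q3 + 1.5 * box_length

# Whiskers stop at the last data point they can reach.
reachable = [v for v in values if lower_limit <= v <= upper_limit]
whisker_low, whisker_high = min(reachable), max(reachable)

# Anything the whiskers cannot reach is called an outlier.
outliers = [v for v in values if v < lower_limit or v > upper_limit]

print("box:", q1, "to", q3)
print("whiskers:", whisker_low, "to", whisker_high)
print("outliers:", outliers)
```

With these made-up numbers the box runs from 31 to 63, the whiskers reach 12 and 78, and the value 150 is left out as an outlier.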

This graphic is called the boxplot. Like the dot plot, we get a nice visual of the data range, its percentiles or order statistics, and we can visually detect outliers in the data.

“STAR WARS” is an outlier in the data, one of a kind.

Today is not May the fourth.

It is not revenge of the fifth.

It is the graphic saber of the sixth. Use it to conquer the data.

May the FORCE be with you.

