Lesson 15 – Another brick in the wall

Yes, another week, another brick, another visual technique, as we build our way toward mastering data analysis.

You know how to create a dot plot and boxplot for any given data. Order the data from smallest to largest and place the numbers as points on the line → dot plot. You can see how the data points are spread out using a dot plot.

Compute the percentiles of the data and connect them using box and whiskers → boxplot. You can understand where most of the data is and how far out the extreme points are.

Now, remember the times when you enjoyed playing with your building blocks; remember the fights with your siblings when they flicked the bottom card of your carefully built house of cards.

We are going to create something similar. We are going to build a frequency plot from the data. Like the dot and box plots, the frequency plot will provide a simplified view of the data.

I will use the class size data for schools in New York City because we celebrated Teachers' Day this week → dark sarcasm.

Details of the class sizes for each school, by grade and program type, for the year 2010-2011 are available. Let us use a subset of this data for our lesson today: class sizes for the 9th-grade general education program from 19 schools.

Building a frequency plot involves partitioning the data into groups or bins, counting the number of data points that fall in each bin and carefully assembling the building blocks for each bin.

Let us choose a bin width of 10 students, i.e., our partitions are equally sized groups or bins: 0 – 10, 10 – 20, 20 – 30, and so on.

Now, look at the data table above and see how many schools have an average class size between 0 and 10.

…

Yes, there are 0 schools in this category. Can you now check how many schools have an average class size between 10 and 20 students?

…

Did you see that the MURRAY HILL ACADEMY, the ACADEMY FOR HEALTH CAREERS, and the HIGH SCHOOL FOR COMMUNITY LEADERSHIP have class sizes in the 10 – 20 students category or bin? That is three schools. This count is the frequency of seeing numbers between 10 and 20. Do this counting for all the bins.
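If you would rather let a computer do the counting, here is a minimal sketch in Python. The class sizes in it are made up for illustration (they are not the NYC data), and the function name `bin_frequencies` is my own. I am also assuming that a value sitting exactly on a boundary, say 20, goes into the upper bin (20 – 30); the lesson does not fix this convention.

```python
from collections import Counter

# Hypothetical average class sizes, made up for illustration (NOT the NYC data).
class_sizes = [14.5, 18.2, 19.0, 21.3, 22.8, 24.1, 25.0, 26.7, 27.4, 28.9, 31.2, 33.5]
BIN_WIDTH = 10


def bin_frequencies(values, width):
    """Count how many values fall in each bin [k*width, (k+1)*width).

    Assumption: a value exactly on a boundary belongs to the upper bin.
    """
    counts = Counter(int(v // width) for v in values)
    last_bin = max(counts)
    # Keep the empty bins too, so that gaps show up as zero counts.
    return {
        (k * width, (k + 1) * width): counts.get(k, 0)
        for k in range(last_bin + 1)
    }


frequencies = bin_frequencies(class_sizes, BIN_WIDTH)
for (low, high), count in frequencies.items():
    print(f"{low:>2} - {high:<2}: {count}")
```

Each printed line is one bin and its frequency. Swap in your own list of class sizes to get the counts for your data.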

Imagine we start placing bricks in each bin. As many bricks as the frequency number suggests. One on top of the other. Like this

There are zero schools in the first bin (0 – 10); so there are no bricks in that bin. There are three schools in the second bin (10 – 20); so we construct a vertical tower using three blocks. Let us do this for all the bins.
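The brick stacking can be sketched in a few lines of Python as well. This is only a rough text rendering, not a proper plotting routine. It uses the three bin counts stated so far in the lesson (0, 3 and 9); the function name `brick_towers` and the `[#]` brick symbol are my own choices.

```python
# Counts for the first three bins, as stated in the lesson.
counts = {"0-10": 0, "10-20": 3, "20-30": 9}


def brick_towers(counts, brick="[#]"):
    """Return text rows that stack one brick per school, top row first."""
    labels = list(counts)
    # Column wide enough for the brick and the longest label, plus padding.
    col = max(len(brick), *(len(label) for label in labels)) + 2
    rows = []
    for level in range(max(counts.values()), 0, -1):
        # A bin gets a brick at this level only if its tower is tall enough.
        row = "".join(
            (brick if counts[label] >= level else "").center(col)
            for label in labels
        )
        rows.append(row.rstrip())
    # The last row holds the bin labels under the towers.
    rows.append("".join(label.center(col) for label in labels).rstrip())
    return rows


rows = brick_towers(counts)
print("\n".join(rows))
```

The empty first column is the bin with no bricks, and the height of each tower is the frequency of that bin.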

That’s it. We have just constructed a frequency plot. From this plot, we can say that there are 0 schools with a class size below 10, so the probability of finding a school with a tiny class size in this sample is 0. There are nine schools with a class size between 20 and 30, so the probability of finding a school in this category is 9/19. We get a sense of the likelihood of the most frequently occurring data and the rarely occurring data. We get a sense of how the data is distributed.
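The arithmetic behind these fractions is just the count in a bin divided by the total number of schools. Here is a small sketch using only the numbers given in the lesson (19 schools; 0, 3 and 9 schools in the first three bins). The variable names are my own, and the share left over for the remaining bins is simply what the three stated counts do not account for.

```python
from fractions import Fraction

TOTAL_SCHOOLS = 19
# Counts stated in the lesson for the first three bins.
counts = {"0-10": 0, "10-20": 3, "20-30": 9}

# Relative frequency = bin count / total number of schools.
relative = {label: Fraction(n, TOTAL_SCHOOLS) for label, n in counts.items()}

for label, share in relative.items():
    print(f"{label}: {share} (about {float(share):.2f})")

# Whatever is left over belongs to the bins above 30.
remaining = 1 - sum(relative.values())
print(f"30 and above: {remaining} (about {float(remaining):.2f})")
```

The relative frequencies across all the bins add up to 1, which is what lets us read them as probabilities for this sample.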

Here is a look at the frequency plot for the largest class size.

Did you notice a gap between the towers? It could mean that there are distinct groups (clusters) in the data. It could also mean that the bin size we chose is too small.

Narrow bins lead to more irregular towers, which can make the underlying pattern hard to see. Wider bins give more regular (smoother) towers, but putting a lot of data points into one bin loses information about individual behavior.
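You can try this trade-off yourself. The sketch below counts the same data with three different bin widths. The class sizes are again made up for illustration (not the NYC data), and `tower_heights` is a name I chose for the helper.

```python
from collections import Counter

# Hypothetical class sizes, made up for illustration (NOT the NYC data).
sizes = [14, 18, 19, 21, 22, 24, 25, 26, 27, 28, 31, 33, 41, 42]


def tower_heights(values, width):
    """Return the tower height of every bin of the given width, starting at 0."""
    counts = Counter(v // width for v in values)
    return [counts.get(k, 0) for k in range(max(counts) + 1)]


for width in (5, 10, 20):
    print(f"bin width {width:>2}: {tower_heights(sizes, width)}")
```

With this made-up data, the narrowest bins leave an empty bin between towers, while the widest bins lump everything into just three towers. Neither is wrong; they simply show different levels of detail.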

So you see, data analysis is about understanding this trade-off between individuals and groups. Every data point is valuable because it provides information about the group.

If you find this lesson valuable, share it with others in your group.

You can also follow me on Twitter @realDevineni for updates on new lessons.