Lesson 21 – A beginner’s guide to summarizing data in R

We went nine lessons without R. During this time, we had lessons on visualizing data using dot plots, boxplots, and frequency plots (lesson 14 and lesson 15). We had lessons on summarizing data using the average (lesson 16), standard deviation (lesson 17), skewness (lesson 18), and outliers (lesson 19). We also had a lesson on comparing datasets using the coefficient of variation (lesson 20). Don’t you think it is time to put these ideas into practice and learn the R commands to explore data?

Let’s get going. Today’s victim is the natural gas consumption data for New York City, available at the zip code level. The Mayor’s Office of Long-Term Planning and Sustainability provides this data.

Usual Chores

Step 1: Get the data

Step 2: Create a new folder on your computer
Let us call this folder “lesson21”. The downloaded data file “Natural_Gas_Consumption_by_ZIP_Code_-_2010.csv” will be saved in this folder.

Step 3: Create a new script in R
Create a new script for this lesson: “File >> New >> R script”. Save it in the same folder “lesson21” using the “save” button or “Ctrl+S”. Use .R as the extension, e.g., “lesson21_code.R”. Now your R script and the data file are in the same folder.

Step 4: Choose your working directory
“lesson21” is the folder where we stored the data file. Use `setwd("path")` to set the working directory to this folder. Execute the line by clicking the “Run” button on the top right.

`setwd("path to your folder")`

Step 5: Read the data into R workspace
Since we have a comma-separated values file (.csv), we can use the “read.csv” command to read the data file into the R workspace. Type the following line in your code and execute it. Use header=TRUE in the command.

```# Read the data file #
naturalgas_data = read.csv("Natural_Gas_Consumption_by_ZIP_Code_-_2010.csv", header = TRUE)```

Step 6: Let us begin the exploration
I realized that the data is classified into various building types → commercial, industrial, residential, institutional, etc. Let us extract three building types for this lesson. You can use the “which” command to do this filtering.

```# extract subset of the data #
index1 = which(naturalgas_data$Building.type..service.class == "Large residential")
index2 = which(naturalgas_data$Building.type..service.class == "Commercial")
index3 = which(naturalgas_data$Building.type..service.class == "Institutional")

Residential = naturalgas_data$Consumption..GJ.[index1]
Commercial = naturalgas_data$Consumption..GJ.[index2]
Institutional = naturalgas_data$Consumption..GJ.[index3]```

The “which” command will identify all the rows that belong to a class. We then extract these rows from the original data.
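To see what “which” returns, here is a minimal sketch with made-up values:

```r
# which() returns the positions where the condition is TRUE
classes = c("Commercial", "Large residential", "Commercial")
which(classes == "Commercial")
# [1] 1 3
```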

Now we have all the large residential building consumption data in “Residential”, all the commercial building consumption data in “Commercial”, and all the institutional building consumption data in “Institutional”.

Let us look at Large residential consumption data first and then compare it to the others.

Visualize the data using dot plot

As a first step, we can order the data from smallest to largest and place the numbers as points on the line. This dot plot provides a good visual perspective on the range of the data. You can use the following command to do this.

`stripchart(Residential,font=2,pch=21,cex=0.5,xlab="Consumption in GJ",font.lab=2)`

You can change the values for the font to make it bold or light, pch for different shapes of the dots, and cex for changing the size of the dots. You can have customized labels using xlab.

Visualize the data using boxplot

With boxplot, we can get a nice visual of the data range and its percentiles along with detecting the outliers in the data.

`boxplot(Residential, horizontal=TRUE, col="grey",add=T)`

You can pick a color of your choice and make the plot horizontal or vertical. I usually prefer the box to be horizontal as it aligns with the number line. You must have noticed the add=T at the end. This argument tells R to add the boxplot on top of the dot plot. Did you see how the dots and the box aligned?

The box is the region with 50 percent of the data. The vertical line in the box is the 50th percentile (the middle data point, the median). Whiskers extend from the box toward the higher and lower values. This extension is usually 1.5 times the length of the box. The length of the box is called the interquartile range (75th percentile minus 25th percentile). Points that cannot be reached by the whiskers are outliers.
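You can check the 1.5 times rule by hand. A small sketch with made-up numbers:

```r
# Fences used by the default boxplot (1.5 * IQR rule)
x = c(2, 4, 5, 7, 9, 30)
q = quantile(x, c(0.25, 0.75))
iqr = q[2] - q[1]                 # interquartile range
lower = q[1] - 1.5 * iqr
upper = q[2] + 1.5 * iqr
x[x < lower | x > upper]          # points beyond the whiskers
# [1] 30
```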

Visualize the data using frequency plot

For making a frequency plot, we should first partition the data into groups or bins, count the number of data points that fall in each bin and stack building blocks for each bin. All this is done in R using the command:

```# histogram #
hist(Residential,font=2,font.lab=2,xlab="Consumption in GJ",col="grey")```

Again, you can work with font, xlab, etc. to beautify the plot.

Did you see that the data is not symmetric? There are extreme values on the right creating a positive skew. The average and 50th percentile (median) are not the same.

Summarizing the data using summary statistics

We can summarize the data using the summary statistics → average, standard deviation, and skewness. We can also compute the percentiles.

Average or mean is a measure of the central tendency (balance point). You make a pass through the numbers and add them up, then divide this total by the number of data points. In R, the command for this is:

```# average #
mean(Residential,na.rm=T)
[1] 684942.3```

Notice that I added an argument na.rm = T since I have some missing values in the data. na.rm instructs R to remove the “NA” values before computing the average.

You must be thinking about the fact that mean is sensitive to outliers. Yes, we have seen that the data is skewed, so perhaps computing the middle value (median) is necessary. The median is not sensitive to outliers. 50 percent of the data points are less than this number. The command for this in R is:

```# median #
median(Residential,na.rm=T)
[1] 414505```

While the average provides a measure of the center of the data, variance or standard deviation provides a measure of how spread out the data is from the center. Compute the deviation of each number from the average. Square these deviations. Get the average of all these squared deviations. In R you can do this using the command:

```# variance and standard deviation #
var(Residential,na.rm=T)
[1] 707890274303

sd(Residential,na.rm=T)
[1] 841362.2```

As you know, the standard deviation is in the same units (gigajoules) as the average.
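You can verify the variance recipe by hand on a toy sample. One detail glossed over above: R’s var divides by n − 1 (the sample variance), not n.

```r
# The variance recipe by hand (toy numbers)
x = c(2, 4, 6, 8)
m = mean(x)
sum((x - m)^2) / (length(x) - 1)   # matches var(x)
sqrt(var(x))                       # matches sd(x)
```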

The third measure, skewness, can be computed as follows:

```# skewness #
library(moments)
skewness(Residential,na.rm=T)
[1] 1.990704```

The function skewness is part of a package called “moments”. If you do not have this package, you should first install it using:

`install.packages("moments")`

After installing, you can call this library using:

`library(moments)`

The skewness for this data is 1.99. Very high, as you saw in the histogram. If the data is symmetric, i.e., evenly spread out around the center, the skewness will be 0.
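If you are curious what goes on inside skewness, one common definition (I believe the “moments” package uses this one, but treat the exact formula as an assumption) is the average cubed deviation from the mean, scaled by the cube of the spread:

```r
# Skewness by hand on a toy right-skewed sample (made-up numbers)
x = c(1, 2, 3, 4, 100)
m = mean(x)
mean((x - m)^3) / mean((x - m)^2)^1.5   # positive => right skew
```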

The “quantile” command will give us the 0th, 25th, 50th, 75th, and 100th percentiles. We can also use the “summary” command to get an overview of the data.

```# quantiles and summary #
quantile(Residential,na.rm=T)

0% 25% 50% 75% 100%
30 65247 414505 998733 4098510

summary(Residential)

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
30 65250 414500 684900 998700 4099000 1```

Comparing the data using plots and coefficient of variation

An easy way to compare the data is to align them on the same scale and visually inspect for differences. In other words, you can plot two or more boxplots on the same scale like this:

```# comparing the data #
all_data = cbind(Residential,Commercial, Institutional)
boxplot(all_data,horizontal=T)```

You must have noticed that I first combine the data vectors using the “cbind” command. cbind binds the data as columns (column bind, hence cbind) and returns a matrix. One caveat: if the vectors have different lengths, cbind recycles the shorter ones (with a warning); passing a named list to boxplot avoids this. We can now plot this combined data with three columns.

There are clear differences in the data. Large residential buildings consume more energy. Not surprising, as there are more residential buildings than commercial or institutional ones.

Last week, we went over the coefficient of variation. Can you now tell me which consumption data (residential, commercial or institutional) has more variation compared to its mean?
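In case you want R to answer that, the coefficient of variation is just the standard deviation divided by the mean (lesson 20). A small helper, shown here on made-up numbers, which you could apply to each consumption vector:

```r
# Coefficient of variation: spread relative to the mean
cv = function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
cv(c(10, 12, 14))   # small spread relative to the mean
cv(c(1, 12, 50))    # large spread relative to the mean
```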

Did you notice that the commercial and institutional data have more variation relative to their means than the residential data? Do you know why?

While you are at that, also think about why using graphical methods and summary statistics to summarize the data is an important feature for statistical modeling.

Could it be that there is a universal behavior and we are using a sample of that population?

Can a sample represent the population?

If you find this useful, please like, share and subscribe.

Lesson 20 – Compared to what?

These were the top four stories when I searched “California drought” on Google News.

Not surprising. We have been following the drought story in California for a while now. Governor Brown declared an end to the drought emergency; there seems to be plenty of snow ready to melt in the Sierra Nevada mountains; some communities are still feeling the effects of previous droughts. With drought last year and good rains this year, we can see there is variability.

These were the top four stories when I typed “New York drought” into the search bar. I had to do it. Didn’t you read the title?

Nothing alarming. The fourth story is about California drought 🙄 People from the East Coast know that there is not much variability in rains here from year to year.

You must have noticed that I am trying to compare different datasets. Are there ways to achieve this? Can we visually inspect for differences? Are there any measures that we can use?

Since we started with California and New York, let us compare rainfall data for two cities, Berkeley and New York City. Fortunately, we have more than 100 years of measured rainfall data for these cities. I am using the data from 1901 to 2000.

As always, we can prepare some graphics to visualize the data. Recall from Lesson 14 that we can use boxplots to get a perspective on the data range, its percentiles, and outliers. Since we have two datasets, Berkeley and New York City, let us look at the boxplots on the same scale, one below the other, like this:

There is a clear difference in the data. New York City gets a lot more rain than Berkeley, at least two times more on average. Notice the 50th percentile (the line in the middle of the box): around 600 mm for Berkeley and around 1200 mm for New York City.

Did you see that the minimum rainfall New York City gets per year is greater than what Berkeley gets 75% of the time?

What about variability? Is there a difference in the variability of the datasets?

I computed their standard deviations. For Berkeley, it is 228 mm, and for New York City it is 216 mm. On the face of it, 228 and 216 do not look very different. So is there no difference in variability? Is it sufficient to just compare the standard deviations?

You know that standard deviation is the deviation from the center of the data. But in this case, the two datasets do not have the same central value (average). New York City has an average rainfall of 1200 mm and a standard deviation of 216 mm. Compared to 1200 mm, the deviation is 18% (216/1200). Berkeley has an average rainfall of 600 mm and a standard deviation of 228 mm. Compared to 600 mm, the deviation is 38%.

This measure, the relative standard deviation, is called the coefficient of variation.

It measures the amount of variability in relation to the average. It is a standardized metric that can be used to compare datasets on different scales or units. It is common to express this ratio as a percentage as we did with Berkeley and New York City.
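The Berkeley versus New York City comparison in numbers, using the rounded values from the text:

```r
# Coefficient of variation = sd / mean
cv_nyc = 216 / 1200       # 0.18, i.e., 18%
cv_berkeley = 228 / 600   # 0.38, i.e., 38%
```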

So, we can say that New York City gets more rainfall (two times more on average) than Berkeley, and its relative variability is less (~two times less) than Berkeley. That explains why there are fewer drought stories for New York compared to California.

Next time you read a drought story in your State, ask yourself “compared to what” and check out these maps.

If you find this useful, please like, share and subscribe.

Lesson 19 – Voice of the outliers

Joe is considering taking the SAT. His first instinct was to take a look at some previous SAT scores. Being a regular reader of this blog, he is familiar with NYC OpenData. So he searches for the College Board SAT results and finds data for the graduating seniors of 2010 in New York City schools. Among these records, he is only looking at the critical reading scores for the 93 schools that have more than 100 test takers. He is now well versed with order statistics and boxplots, so he made a boxplot of his data. This conversation happened after that.

If you find this useful, please like, share and subscribe.

Lesson 18 – Data comes in all shapes

Symmetric or not symmetric, that is the question.

Whether the data are evenly spread out around the average, producing a symmetric frequency plot, Or some data are disproportionately present on the right side or the left side of the average; thereby disturbing the symmetry.

To notice, to realize the shape, and by shape to say we understand the data and its general behavior.

To notice, to realize, perhaps to measure the shape — ah, there’s a catch; for measuring, should we first discern whether the data is right skewed or left skewed.

For who would know that a few extreme values on the right create a positive skew, and a few extremes on the left create a negative skew; that the average of skewed data is not the same as the 50th percentile; that two datasets with the same average and standard deviation can have different shapes.

That the measure of skew is the average of the cubed deviations from the mean, divided by the cube of the standard deviation.

Thus shape does make a necessary measure to summarize the data, And thus the natural hue of data analysis includes all three summary statistics, average, variance, and skewness.

If you find this useful, please like, share and subscribe.