# Lesson 21 – Beginner's guide to summarizing data in R

We went nine lessons without R. During this time we had lessons that taught us how to visualize data using dotplots, boxplots, and frequency plots (lessons 14 and 15). We had lessons on summarizing data using the average (lesson 16), standard deviation (lesson 17), skewness (lesson 18), and outliers (lesson 19). We also had a lesson on comparing datasets using the coefficient of variation (lesson 20). Don't you think it is time to put these things into practice and learn how to use R commands to explore data?

Let's get going. Today's victim is the natural gas consumption data in New York City, available at the zip code level. The Mayor's Office of Long-Term Planning and Sustainability provides this data.

###### Usual Chores

Step 1: Get the data

Step 2: Create a new folder on your computer
Let us call this folder “lesson21”. The downloaded data file “Natural_Gas_Consumption_by_ZIP_Code_-_2010.csv” will be saved in this folder.

Step 3: Create a new script in R
Create a new script for this lesson: “File >> New >> R script”. Save the script in the same folder “lesson21” using the “Save” button or “Ctrl+S”. Use .R as the extension — “lesson21_code.R”. Now your R script and the data file are in the same folder.

Step 4: Choose your working directory
“lesson21” is the folder where we stored the data file. Use setwd("path") to set the working directory to this folder. Execute the line by clicking the “Run” button on the top right.

`setwd("path to your folder")`

Step 5: Read the data into R workspace
Since we have a comma separated values file (.csv), we can use the “read.csv” command to read the data file into the R workspace. Type the following line in your code and execute it. Use header=TRUE in the command so the first row is read as column names.

```
# Read the datafile #
naturalgas_data = read.csv("Natural_Gas_Consumption_by_ZIP_Code_-_2010.csv", header = TRUE)
```

Step 6: Let us begin the exploration
I realized that the data is classified into various building types → commercial, industrial, residential, institutional, etc. Let us extract three building types for this lesson. You can use the “which” command to do this filtering.

```
# extract subset of the data #
index1 = which(naturalgas_data$Building.type..service.class == "Large residential")
index2 = which(naturalgas_data$Building.type..service.class == "Commercial")
index3 = which(naturalgas_data$Building.type..service.class == "Institutional")

Residential = naturalgas_data$Consumption..GJ.[index1]
Commercial = naturalgas_data$Consumption..GJ.[index2]
Institutional = naturalgas_data$Consumption..GJ.[index3]
```

The “which” command will identify all the rows that belong to a class. We then extract these rows from the original data.

Now we have all large residential building consumption data under “Residential”, all commercial building consumption data under “Commercial”, and all institutional building consumption data under “Institutional”.

Let us look at Large residential consumption data first and then compare it to the others.

###### Visualize the data using dot plot

As a first step, we can order the data from smallest to largest and place the numbers as points on the line. This dot plot provides a good visual perspective on the range of the data. You can use the following command to do this.

`stripchart(Residential,font=2,pch=21,cex=0.5,xlab="Consumption in GJ",font.lab=2)`

You can change the values for the font to make it bold or light, pch for different shapes of the dots, and cex for changing the size of the dots. You can have customized labels using xlab.

###### Visualize the data using boxplot

With boxplot, we can get a nice visual of the data range and its percentiles along with detecting the outliers in the data.

`boxplot(Residential, horizontal=TRUE, col="grey",add=T)`

You can pick a color of your choice, and make the plot horizontal or vertical. I usually prefer the box to be horizontal as it aligns with the number line. You must have noticed the add=T at the end. This tells R to draw the boxplot on top of the dot plot. Did you see how the dots and the box aligned?

The box is the region with 50 percent of the data. The vertical line in the box is the 50th percentile (middle data point — median). The length of the box is called the interquartile range (75th percentile minus 25th percentile). Whiskers extend from the box to the farthest data points that lie within 1.5 times the length of the box. Points that cannot be reached by the whiskers are outliers.
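The whisker rule can be checked by hand. Here is a small sketch on a toy vector (the numbers are made up for illustration) that computes the quartiles, the interquartile range, and the fences beyond which points are flagged as outliers. (Note that boxplot itself uses Tukey's hinges, which can differ slightly from quantile for small samples.)

```r
# Toy data (hypothetical values, for illustration only)
x <- c(2, 3, 5, 7, 8, 9, 11, 40)

q <- quantile(x, c(0.25, 0.75))   # 25th and 75th percentiles
iqr <- unname(q[2] - q[1])        # interquartile range (length of the box)

lower_fence <- unname(q[1] - 1.5 * iqr)   # whiskers reach at most this far down
upper_fence <- unname(q[2] + 1.5 * iqr)   # ... and this far up

outliers <- x[x < lower_fence | x > upper_fence]
outliers                          # 40 lies beyond the upper fence
```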

###### Visualize the data using frequency plot

For making a frequency plot, we should first partition the data into groups or bins, count the number of data points that fall in each bin and stack building blocks for each bin. All this is done in R using the command:

```
# histogram #
hist(Residential, font=2, font.lab=2, xlab="Consumption in GJ", col="grey")
```

Again, you can work with font, xlab, etc. to beautify the plot.

Did you see that the data is not symmetric? There are extreme values on the right creating a positive skew. The average and 50th percentile (median) are not the same.

###### Summarizing the data using summary statistics

We can summarize the data using the summary statistics → average, standard deviation, and skewness. We can also compute the percentiles.

Average or mean is a measure of the central tendency (balance point). You make a pass through the numbers and add them up, then divide this total by the number of data points. In R, the command for this is:

```
# average #
mean(Residential, na.rm=T)
[1] 684942.3
```

Notice that I added an argument na.rm = T since I have some missing values in the data. na.rm is instructing R to remove the “NA” and then compute the average.
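You can see the effect of na.rm on a small made-up vector:

```r
# Hypothetical vector with one missing value
x <- c(10, 20, NA, 30)

mean(x)                # NA -- the missing value propagates
mean(x, na.rm = TRUE)  # 20 -- the NA is dropped: (10 + 20 + 30) / 3
```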

You must be thinking about the fact that mean is sensitive to outliers. Yes, we have seen that the data is skewed, so perhaps computing the middle value (median) is necessary. The median is not sensitive to outliers. 50 percent of the data points are less than this number. The command for this in R is:

```
# median #
median(Residential, na.rm=T)
[1] 414505
```

While the average provides a measure of the center of the data, variance or standard deviation provides a measure of how spread out the data is from the center. Compute the deviation of each number from the average, square these deviations, and take their average (strictly, R's var divides by n − 1, the sample variance). In R you can do this using the commands:

```
# variance and standard deviation #
var(Residential, na.rm=T)
[1] 707890274303

sd(Residential, na.rm=T)
[1] 841362.2
```

As you know, the standard deviation is in the same units (gigajoules) as the average.
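That is because the standard deviation is simply the square root of the variance: the variance is in squared units, and taking the root brings it back to the original units. A quick check on made-up numbers:

```r
# Hypothetical consumption values (GJ), for illustration only
x <- c(100, 200, 400, 700)

v <- var(x)   # sample variance (divides by n - 1), in squared units
s <- sd(x)    # sample standard deviation, in the same units as x

all.equal(s, sqrt(v))   # TRUE
```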

The third measure, skewness, can be computed as follows:

```
# skewness #
library(moments)
skewness(Residential, na.rm=T)
[1] 1.990704
```

The function skewness is part of a package called “moments”. If you do not have this library, you should first install it using:

`install.packages("moments")`

After installing, you can call this library using:

`library(moments)`

The skewness for this data is 1.99. Very high as you saw in the histogram. If the data is symmetric, i.e. evenly spread out around the center, the skewness will be 0.
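You can verify this with a perfectly symmetric toy vector. The moment-based skewness (the formula the “moments” package implements) is the average cubed deviation divided by the average squared deviation raised to the power 3/2; the helper name skew below is my own, written out by hand so you can see the formula:

```r
# Moment-based skewness, written out by hand
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / (mean(d^2))^(3/2)
}

skew(c(1, 2, 3, 4, 5))    # 0 -- symmetric data, cubed deviations cancel
skew(c(1, 1, 1, 2, 10))   # positive -- a long right tail pulls it up
```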

The “quantile” command will give us the 0th, 25th, 50th, 75th, and 100th percentiles. We can also use the “summary” command to get an overview of the data.

```
# quantiles and summary #
quantile(Residential, na.rm=T)

     0%     25%     50%     75%    100%
     30   65247  414505  998733 4098510

summary(Residential)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
     30   65250  414500  684900  998700 4099000       1
```

###### Comparing the data using plots and coefficient of variation

An easy way to compare the data is to align them on the same scale and visually inspect for differences. In other words, you can plot two or more boxplots on the same scale like this:

```
# comparing the data #
all_data = cbind(Residential, Commercial, Institutional)
boxplot(all_data, horizontal=T)
```

You must have noticed that I am first combining the data vectors into a matrix using the “cbind” command. cbind will bind the data as columns (column bind — cbind). boxplot then draws one box per column of this matrix.
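On a toy example, cbind turns separate vectors into the columns of a single matrix. One caveat: if the vectors have different lengths, R recycles the shorter ones and issues a warning, so boxplot on a list, e.g. boxplot(list(...)), is a safer alternative for unequal-length data.

```r
# Two made-up vectors of equal length
a <- c(1, 2, 3)
b <- c(4, 5, 6)

m <- cbind(a, b)   # a 3 x 2 matrix with columns named "a" and "b"
dim(m)             # 3 rows, 2 columns
colnames(m)        # "a" "b"
```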

There are clear differences in the data. Large residential buildings have more energy consumption. Not surprising as there are more residential buildings than commercial or institutional.

Last week, we went over the coefficient of variation. Can you now tell me which consumption data (residential, commercial or institutional) has more variation compared to its mean?

Did you notice that the commercial and institutional data have more variation compared to the residential data? Do you know why?
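As a hint, recall from lesson 20 that the coefficient of variation is just the standard deviation divided by the mean. Here is a sketch on made-up vectors (not the actual gas data) showing how you would compare them:

```r
# Coefficient of variation = standard deviation / mean (lesson 20)
cv <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

# Hypothetical consumption vectors, for illustration only
res <- c(300, 400, 500, 600, 700)      # tightly spread around its mean
com <- c(100, 300, 500, 900, 2200)     # widely spread around its mean

cv(res)   # smaller relative spread
cv(com)   # larger relative spread
```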

While you are at that, also think about why using graphical methods and summary statistics to summarize the data is an important feature for statistical modeling.

Could it be that there is a universal behavior and we are using a sample of that population?

Can a sample represent the population?

If you find this useful, please like, share and subscribe.