Lesson 20 – Compared to what? – dataanalysisclassroom

These were the top four stories when I searched for “California drought” in Google News.

Not surprising. We have been following the drought story in California for a while now. Governor Brown declared the end of the drought emergency; there seems to be plenty of snow ready to melt in the Sierra Nevada mountains; some communities are still feeling the effects of previous droughts. With a drought last year and good rains this year, we can see there is variability.

These were the top four stories when I typed “New York drought” into the search bar. I had to do it. Didn’t you read the title?

Nothing alarming. The fourth story is about the California drought 🙄 People from the East Coast know that rainfall here does not vary much from year to year.

You must have noticed that I am trying to compare different datasets. Are there ways to achieve this? Can we visually inspect for differences? Are there any measures that we can use?

Since we started with California and New York, let us compare rainfall data for two cities, Berkeley and New York City. Fortunately, we have more than 100 years of measured rainfall data for these cities. I am using the data from 1901 to 2000.

As always, we can prepare some graphics to visualize the data. Recall from Lesson 14 that we can use boxplots to get a perspective on the data range, its percentiles, and outliers. Since we have two datasets, Berkeley and New York City, let us look at the boxplots on the same scale, one below the other, like this:

There is a clear difference in the data. New York City gets a lot more rain than Berkeley, about twice as much on average. Notice that the 50th percentile (the middle of the box) is around 600 mm for Berkeley and around 1200 mm for New York City.

Did you see that the minimum rainfall New York City gets per year is greater than what Berkeley gets 75% of the time?

What about variability? Is there a difference in the variability of the datasets?

I computed their standard deviations. For Berkeley, it is 228 mm, and for New York City, it is 216 mm. On the face of it, 228 and 216 do not look very different. So is there no difference in the variability? Is it sufficient to just compare the standard deviations?

You know that the standard deviation measures the typical deviation from the center of the data. But in this case, the two datasets do not have the same central value (average). New York City has an average rainfall of 1200 mm and a standard deviation of 216 mm. Compared to 1200 mm, the deviation is 18% (216/1200). Berkeley has an average rainfall of 600 mm and a standard deviation of 228 mm. Compared to 600 mm, the deviation is 38% (228/600).

This measure, the relative standard deviation, is called the coefficient of variation.

It measures the amount of variability in relation to the average. It is a standardized metric that can be used to compare datasets on different scales or units. It is common to express this ratio as a percentage as we did with Berkeley and New York City.
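If you want to try this calculation yourself, here is a minimal sketch in Python. It plugs in the rounded averages and standard deviations quoted in this lesson rather than the raw rainfall records, and the function names `cv_from_summary` and `coefficient_of_variation` are my own, chosen only for illustration.

```python
from statistics import mean, stdev


def cv_from_summary(average, std_dev):
    """Coefficient of variation as a percentage, from summary statistics."""
    if average == 0:
        raise ValueError("coefficient of variation is undefined for a zero average")
    return 100.0 * std_dev / average


def coefficient_of_variation(values):
    """Coefficient of variation (%) from raw observations.

    Assumption: uses the sample standard deviation (n - 1 in the denominator).
    """
    return cv_from_summary(mean(values), stdev(values))


# Rounded figures quoted in the lesson (mm of rain per year).
berkeley_cv = cv_from_summary(600, 228)
new_york_cv = cv_from_summary(1200, 216)

print(f"Berkeley: {berkeley_cv:.0f}%")
print(f"New York City: {new_york_cv:.0f}%")
```

With the annual rainfall values themselves in a list, `coefficient_of_variation(values)` would do the same job in one step. Keep in mind that the ratio only makes sense when the average is not zero and the data are measured on a scale with a true zero, as rainfall is.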

So, we can say that New York City gets more rainfall than Berkeley (about twice as much on average), and its relative variability is roughly half that of Berkeley (18% versus 38%). That helps explain why there are fewer drought stories for New York than for California.

Next time you read a drought story in your state, ask yourself “compared to what?” and check out these maps.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.
