On the fifth day of February 2017, I embarked upon this journey to create a platform to make data analysis easy for at least a million people. When I began, and even now, I am no expert in this field. Over the years, I have learned several concepts from my mentors and other masters, and I believe I am only a conduit for sharing those with you.

Today, the fourth day of February 2018 marks the one-year milestone for us. We crossed 50,000 page views from more than 10,000 users across 136 countries. I hope my mission is underway and I have created interest in you towards data analysis 🙂

As a landmark event, I asked my mentor, Prof. Upmanu Lall from Columbia University in the City of New York to give us a lesson. He was gracious enough to agree and has taken time out of his ever busy schedule to teach us that “sometimes it is important to let the data speak.”

The man who has pioneered non-parametric methods for hydrology himself opens us to kernel density, why and what.

So, here is his gift on the occasion of our celebration of one-year together; or should I say,

Sometimes it is important to let the Master speak.

Sometimes it is important to let the data speak…

Founded in 1670, Charleston is the oldest and largest city in S. Carolina. For such a historic city and the surrounding community, reliable water supply is critical. Jill is the city manager. Charleston’s had a series of droughts in recent years, and the City Council wants to know how unusual these are, so they can think about investments to manage or capture water in the wettest months and have it available in the drier months.

Jill asked Jack, the comptroller to come up with a presentation so they could understand the situation. Jack was able to locate monthly rainfall data from 1893, and given over a 100 years of data felt that he could indeed present some ideas. After looking over the data, he first plotted a wet (June) and a dry month (January) to see how the rain changed across years.

A remarkable range of variation shows up especially in June, with many years with nearly zero rain, and others with as much as 15 inches of rain. On the other hand, the rainfall in January rarely exceeds 5 inches.

Jack first asked the question “How much rain can you count on in each of these months 10, 25, 50, 75 or 90% of the years”? The idea was to see how this compares with the amount of rain one would need to fall in each of the months to meet water needs that occur in that month.

This demand could change depending on how many people live in Charleston, and the collection area for water supply. Since the council could come up with different assumptions as to the future population or demand, Jack thought it would be wise to present his arguments using the rainfall needed per person, and the rainfall available per person.

Jack’s first take on this question was to pull together the January (June) data and re-arrange it from the smallest to the largest value for that month – using the 1893-2016 data. This was 124 values that are arranged from the lowest to the highest. Sorting the data this way let Jack identify the driest and the wettest years at a glance, and also to quickly do the calculation he was looking for.

Since Jack is R savvy, he actually just did this operation in R, using the quantile function, which reports the quantiles or percentiles of the data.

Looking at this, Jack could see that 10% of the time the rain could be less than 0.7 (1.47) inches and as much as 3.99 (6.79) inches in January (June).

He immediately wondered whether the June rainfall was likely to be low (high) in the same year that the January rainfall was high(low). Looking at their plot, it looked highly unlikely that there was a pattern. The calculated correlation was -0.03 confirming the plot.

Since presenting tables of numbers to a city council is never too exciting, Jack decided to present his findings graphically.

Again, being R savvy, he plotted the cumulative distribution function for the rain for each month.

The cumulative distribution function is just an estimate of the fraction of observations that are smaller than a given number – for instance for January (June), if we were curious about what the rainfall amount would be such that 50% of the values or a fraction 0.5 were smaller than these amounts, we could look that up with a value of the cumulative distribution function of 0.5, and find that it would be a rainfall amount of 1.95 (3.48) inches of rain.

The cumulative distribution function is then just a plot of the quantiles.

We see this in the figure below. For each of the quantiles Jack had calculated, he dropped blue (red) lines to show the corresponding rainfall for January (June). It is then easy to see that for each quantile the rain in January is less than the rain in June – both from the vertical lines and the cumulative distribution function curves.

Now refer back to Lesson 50. There the symmetry of the Normal distribution was talked about, and the area under the Normal density function between two rainfall values was interpreted as the percentage of time the values could fall in that range. We see some parallels.

If we are interested in the Probability that the January rainfall is less than 1.95 ( $P(Rain_{january} \le 1.95)$ ), we see that this is 0.5, and for the Normal distribution this would be the area from –infinity (or really 0) rain to 1.95.

We also saw that the Normal distribution is symmetric. This would suggest that the areas to a distance left of the mean (median or the 50th percentile for the normal distribution is the same as the mean since it is a symmetric function) are the same as the areas the same distance right of the mean. Does that seem to hold for the Charleston rainfall?

These don’t really look symmetric.

Normally, the reason to use a model such as the Normal distribution is that we don’t have much data (e.g., 30 years) and if I am interested in a 100 year event (i.e. something that would occur on average once in a 100 years), unless I fit a probability model to it, I really cannot say anything about a 100 year event.

So, Jack wanted to make sure that even though he had 124 years of data, he was not challenged by a statistically savvy city council model for not fitting a probability distribution model, which if it is the right one, maybe more reliable.

So far, it doesn’t look promising to use a Normal distribution model. This is not a surprise since rainfall cannot be negative and we have so many years in which it is close to zero. In fact, the average rain in January (June) is 2.24 (4.04) inches compared to a maximum of 6.29 (16.29) inches over the 124 years. So, the data is definitely not symmetric, and the Normal distribution would not be a good choice.

Sometimes, people will argue that the logs of the data could be Normally distributed, since after taking logs, small positive values can be negative, so Jack thought he would check that out too. Here is what he found after taking logs of the rainfall data.

Now there was a bit of a dilemma. Comparing Min, Mean and Max, it seems that the data is not at all symmetric.

(Mean-Min) = 2.42 (62) vs Max-Min = 1.26 (1.62).

But the min and the max may not be the best way to look at this since these are rare values and may not always be symmetric, just if one anomalous value was recorded. So Jack also checked the symmetry across the quantiles.

Still not quite there 🙄

Finally, just to reassure himself, Jack used R to plot what the data for each month looked like along with the best fit of a Normal distribution, an exponential distribution (recommended for rainfall in a book), and also a kernel density plot.

The monthly rainfall values are plotted in red; the exponential distribution fit in blue; the normal distribution fit in green; and the kernel density plot in black.

For the original scale

For the LOG scale

Looking at these plots for the original and the log-transformed data Jack was quite clear that the normal and the exponential models were very different from the kernel density plot and if the kernel density represents what the probability density of the data really looks like then one could not use the normal or the exponential model with the original data or the logs.

So what is a kernel density plot?

Aha – this needs a few lessons to master, but for now, think of it as a plot where the probability density of the variable – rainfall in our case – is estimated for any rainfall value (say rain = 3″) by looking at how many rainfall observations fell within a window of a given size (e.g., 1″) centered at the value of interest (3″).

So, the idea is that if a lot of values are close to 3″ then the probability density is high there, and conversely, if there are very few values then the probability density is low there. The result will clearly depend on the size of the window.

Also, observations that are far from the target should perhaps not be considered representative of the current value and should be down-weighted.

The weight or kernel function is then a function of the distance from the target. A common kernel function is the Normal distribution centered at each point at which an estimate is desired, with a certain specified standard deviation which functions as a window width.

For instance to estimate the density at rain =3″, we may use a Normal density function with mean equal to 3 and a standard deviation of 1. The density is then computed as the sum of the contribution of all observations at that point using this kernel.

$f(x)=\frac{1}{nh}\sum_{i=1}^{n}K(\frac{x-x_{i}}{h})$

$u=\frac{x-x_{i}}{h}$

$K(u) = \frac{1}{\sqrt{2\pi}} e^{-\frac{u^{2}}{2}}$

You can see that the kernel function $K(u)$ looks like a Normal distribution with mean 0 and standard deviation equal to 1. Of course, comparing with Lesson 50, looking at the definition of $u$ it is easy to see that the mean is $x_{i}$ and the standard deviation is $h$ .

Another way to interpret these equations is that the kernel density estimate at $x$ is the weighted contribution of all observations $x_{i}$ as being representative of what we expect at $x$ . The weights are from the Normal distribution centered at $x$ .

Thus, this is like doing a weighted moving average that is computed by moving from one value of $x$ to another. As we can see from the figures for Charleston, the result does not look like a Normal distribution that is fit to the full data.

Lesson 51 – Sometimes it is important to let the data speak

Sometimes it is important to let the data speak…

For the original scale

For the LOG scale

Enjoy this blog? Please spread the word :)