Lesson 53 – Sum of squares: The language of Chi-square distribution

The conversations with people at the local bar last night were the usual local and national politics. I usually join the pool table after a casual drink. Yesterday, upon the request of an old friend, I toyed with the game of darts.

In my heyday, I used to be good at this, so without much trouble, I could hit the bullseye after getting my sight.

Maybe it is the draught that did the trick. I felt jubilant.

A couple of tequilas and a few minutes later, I hoped to beat my previous record.

“Did you change the darts?”, “There’s something in my eye,” “Maybe they moved the target.” These were circling in my mind after my performance.

They did not move the target, they did not change the darts, but there was indeed something in my head. I decided to stop playing the next round and go home, lest someone else becomes the bullseye.

While heading back, I thought about how far off I was from the bullseye (origin). On the X-Y coordinate plane, the distance from origin is R=\sqrt{X^{2}+Y^{2}} .

The squared distance is R^{2}=X^{2}+Y^{2}.

It varies with each try. Random variable.

Today morning, while catching up on my news feed, I noticed this interesting article.

The article starts with a conversation between fans.

People bet millions on heads vs. tails. 

“…the toss is totally biased.” It reminded me of the famous article by Karl Pearson, “Science and Monte Carlo” in the fortnightly review published in 1894. He experimentally proved that the roulette wheels in Monte Carlo were biased.

He computed the error (observed – theoretical) (or squared error) and showed that the roulette outcomes in Monte Carlo could not happen by chance. He explained this based on two sets of 16,500 recorded throws of the ball in a roulette wheel during an eight week period in summer of 1892, manually, I may add. Imagine counting up 33,000 numbers, and performing various sets of hand calculations.

He also writes that he spent his vacation tossing a shilling 25,000 times for a coin-toss bias experiment.

These experiments led to the development of the relative squared error metric, the Pearson’s cumulative test statistic (Chi-square test statistic).

Some excerpts from his concluding paragraph.

“To sum up, then: Monte Carlo roulette, if judged by returns which are published apparently with the sanction of the Societe, is, if the laws of chance rule, from the standpoint of exact science the most prodigious miracle of the nineteenth century.

we are forced to accept as alternative that the random spinning of a roulette manufactured and daily readjusted with extraordinary care is not obedient to the laws of chance, but is chaotic in its manifestation!

By now you might be thinking, “What’s the point here. Maybe it’s his hangover writing.”

Hmm, the square talk is for the Chi-square distribution.

Last week, we visited Mumble’s office. He has transformed the data — log-normal distribution.

If Z follows a normal distribution, then, X = e^{Z} is a log-normal distribution because log of X is normal.

Exponentiating a normal distribution will result in log-normal distribution.

See for yourself, how Z \sim N(0,1) tranforms into X, a log-normal distribution. X is non-negative.

Similarly, squaring a normal distribution will result in a Chi-square distribution.

If Z follows a normal distribution, then, \chi = Z^{2} is a Chi-square distribution with one degree of freedom.

See how the same Z \sim N(0,1) tranforms into \chi, a Chi-square distribution, again non-negative, since we are squaring.

Let’s derive the probability distribution function for the Chi-square distribution using the fact that \chi = Z^{2}. We will assume Z is a standard normal distribution Z \sim N(0,1).

For this, we will start with the cumulative distribution function, F(\chi) and take its derivative to obtain the probability distribution function f(\chi) since f(\chi)=\frac{d}{d\chi}F(\chi).

\chi = Z^{2}

F(\chi) = P(\chi \le \chi) is the cumulative distribution function.

F_{\chi}(\chi) = P(Z^{2} \le \chi)

=P(-\sqrt{\chi} \le Z \le \sqrt{\chi})

=P(Z \le \sqrt{\chi}) - P(Z \le -\sqrt{\chi})

F_{\chi}(\chi) =F_{Z}(\sqrt{\chi}) - F_{Z}(-\sqrt{\chi})


=\frac{d}{d\chi}(F_{Z}(\sqrt{\chi}) - F_{Z}(-\sqrt{\chi}))

Applying the fundamental theorem of calculus and chain rule together, we get,

=f_{Z}(\sqrt{\chi})*\frac{1}{2\sqrt{\chi}} + f_{Z}(-\sqrt{\chi})*\frac{1}{2\sqrt{\chi}}


By now, you are familiar with the probability distribution function for Z. f_{Z}(z) = \frac{1}{\sqrt{2\pi}}e^{\frac{-z^{2}}{2}}. Let’s use this.


f_{\chi}(\chi)=\frac{1}{\sqrt{2 \pi \chi}}e^{\frac{-\chi}{2}} for \chi > 0.

As you can see, the function is only defined for \chi > 0. It is 0 otherwise.

With some careful observation, you can tell that this function is the Gamma density function with \lambda=\frac{1}{2} and r = \frac{1}{2}.

Yes, I know, it is not very obvious. Let me rearrange it for you, and you will see the pattern.

f(\chi) = \frac{1}{\sqrt{2 \pi \chi}}e^{-\frac{1}{2}\chi}



Multiply and divide by 1/2.



If \lambda=1/2

f(\chi)=\frac{\lambda*(\lambda \chi)^{-1/2}*e^{-\lambda*\chi}}{\sqrt{\pi}}

Drawing from the factorial concepts, we can replace \sqrt{\pi} with (1/2)! or (-1/2)!, which means, r = 1/2

f(\chi)=\frac{\lambda*(\lambda \chi)^{r-1}*e^{-\lambda*\chi}}{(r-1)!}

This equation, as you know is the density function for the Gamma distribution.

So, Chi-square density function is Gamma density function with \lambda=1/2 and r=1/2. It is a special case of the Gamma distribution.

Now let’s up a level. Sum of squares of two standard normals, like our squared distance (X^{2}+Y^{2}).

\chi = Z_{1}^{2} + Z_{2}^{2}.

We know from lesson 46 on convolution that if X and Y are two independent random variables with probability density functions f_{X}(x) and f_{Y}(y), their sum Z = X + Y is a random variable with a probability density function f_{Z}(z) that is the convolution of f_{X}(x) and f_{Y}(y).

f(z) = f_{X}*f_{Y}(z) = \int_{-\infty}^{\infty}f_{X}(x)f_{Y}(z-x)dx

We can use the principles of convolution to derive the probability density function for \chi = Z_{1}^{2} + Z_{2}^{2}.

Let’s assume Z_{1}^{2}=k. Then, Z_{2}^{2}=\chi - k, and

f_{\chi}(\chi)=\int_{-\infty}^{\infty}f_{Z_{1}^{2}}(k)f_{Z_{2}^{2}}(\chi - k)dk

f_{Z_{1}^{2}} = \frac{1}{\sqrt{2 \pi k}}e^{-\frac{k}{2}}. This is the same function we derived above for \chi = Z^{2}. Using this,

f_{\chi}(\chi)=\int_{-\infty}^{\infty}\frac{1}{\sqrt{2 \pi k}}e^{-\frac{k}{2}}\frac{1}{\sqrt{2 \pi (\chi - k)}}e^{-\frac{(\chi - k)}{2}}dk

=\frac{1}{2\pi}\int_{0}^{\chi}k^{-\frac{1}{2}}e^{-\frac{k}{2}}(\chi - k)^{-\frac{1}{2}}e^{-\frac{(\chi - k)}{2}}dk

=\frac{1}{2\pi}e^{-\frac{\chi}{2}}\int_{0}^{\chi}k^{-\frac{1}{2}}(\chi - k)^{-\frac{1}{2}}dk

The term with the integral integrates to 2*arcsin(\frac{\sqrt{k}}{\sqrt{\chi}}), and its definite integral is \pi since arcsin(1) = \frac{\pi}{2}. Try it for yourself. Put your calculus classes to practice. 

We are left with

f_{\chi}(\chi) = \frac{1}{2}e^{-\frac{\chi}{2}}, again for \chi > 0.

This function is a Gamma distribution with \lambda = \frac{1}{2} and r = 1.

Generalization for n random normal variables

If there are n standard normal random variables, Z_{1}, Z_{2}, ..., Z_{n}, their sum of squares is a Chi-square distribution with n degrees of freedom.

\chi = Z_{1}^{2}+Z_{2}^{2}+ ... + Z_{n}^{2}

Its probability density function is a Gamma density function with \lambda=1/2 and r=n/2. You can derive it by induction.

f(\chi)=\frac{\frac{1}{2}*(\frac{1}{2} \chi)^{\frac{n}{2}-1}*e^{-\frac{1}{2}*\chi}}{(\frac{n}{2}-1)!} for \chi > 0 and 0 otherwise.

Look at this animation for Chi-square distribution with different degrees of freedom. See how it becomes more symmetric for large values of n (degrees of freedom).

We will revisit the Chi-square distribution when we learn hypotheses testing. Till then, the sum of squares and error square should remind you of Chi-square [Square – Square], and tequila square should remind you of

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 52 – Transformation: The language of lognormal distribution

A cluttered desk with scattered papers, piles of binders, and an open reference textbook with a pencil on top welcomes Mumble to his day at work.

“Clean up the workspace,” said Mumble in angst, staring at the checklist for the day. It was that time of February when his last New Year resolution of having an ambient workspace had already taken a backburner.

He has two essential data analysis tasks and two meetings for the day. Besides, he set himself two chores. If you think Mumble is doing his miscellaneous tasks at works, you are mistaken. He is like you and me when it comes to writing checklists.

He begins his first task: Analyze flood damage data.

The 2017 flood insurance claims are starting to come in. Mumble, who was the chief designer of the product was assigned the task of corroborating the flood damage claims with the reported flood damage data. He was also assigned the task of estimating the probability of damage exceeding a certain threshold set for the year by the company.

“Open Dartmouth Flood Observatory Database. They will have the archive for recent floods, the area affected and people displaced. I can use that data to verify if the claims from these areas are valid or not,” thought Mumble as he types his password to log in.

“Time for some coding and analysis. The best part of my day. Fire up RStudio. Check. Read the file into R workspace. Check.” He completes the essentials and is now ready to analyze the data. You can also look at the data if you want.

“Let me first filter the database to select and plot only the floods for 2017.” RStudio lets him make the map too.

“Aha. The locations match the countries from where we got the claims. So let me look at the data for people displaced and develop the probability distribution function for it.” He quickly types up a few lines of code as he sips his coffee.

“There are 123 events,” he thinks as he scrolls down his monitor. “Some of these floods must be minor. There are 0’s, no people displaced in 40 such events. There are some huge numbers though. Possibly created a lot of damage. Let me plot them to see their arrangement on the number line.”

“They do not look symmetric. The few large numbers are skewing the data. I wonder how the frequency plot will look.”

“Ugly. Not sure what probability distribution fits this data. Those large numbers are dominating, but they are also most of the insurance claims.” He took another sip, leaned back, and with his hands at the back of his head, stared at the ceiling.

After about 30 seconds, it struck him that the scale of the observations is obscuring the pattern in the data.

What if I transform the data? If I re-express the data on a different scale, I can perhaps alter the distances between the points. Log scale?”

So he applied log transformation y = log(x) and plotted their frequency.

“Amazing. A simple mathematical transformation can make a skewed distribution more symmetric. The application of “logarithm” has shrunk the large numbers on the right side and moved them closer to the center. The frequency plot looks like a normal distribution, ” he thought, as he typed another line into the code to update the plot.

He realized that the log of the data is a normal distribution. “Lognormal distribution,” he said.

If Y = log(X) is a normal distribution with mean \mu_{y} and standard deviation \sigma_{y}, then X follows a lognormal distribution with a probabilty density function

f(x) = \frac{1}{x} \frac{1}{\sqrt{2\pi\sigma_{y}^{2}}} e^{\frac{-1}{2}(\frac{ln(x)-\mu_{y}}{\sigma_{y}})^{2}}

Remember, \mu_{y} and \sigma_{y} are the mean and standard deviation of the transformed variable Y.

The mean \mu_{x} and the standard deviation \sigma_{x} of X are related to \mu_{y} and \sigma_{y}.



You must have noticed already that the probabilities can be computed using the transformed variable (Y) and the usual method of standard normal distribution, i.e., convert y to z using z = \frac{y-\mu_{y}}{\sigma_{y}}=\frac{ln(x)-\mu_{y}}{\sigma_{y}}

P(X \le x)=P(e^{Y} \le x)=P(Y \le ln(x))=P(Z \le \frac{ln(x)-\mu_{y}}{\sigma_{y}})

“So the lognormal distribution can be used to model skewed and non-negative data.” He sits up in his chair excited, types a few lines of code with enthusiasm and write down the following in his notepad for the meeting later.

 P(Displaced > 300,000) = 0.0256

Can you tell how?

He then plots the flood-affected area using similar plots.

“Hmm, the original data is skewed with a large affected area to the right. But, the log-transformed data is not symmetric. It looks like there are numbers to the left that are far away from the log-transformed data.” He keeps staring at his screen as Aaron knocks on the door. “Meet now?”

“Sure.” Mumble walks out with his notepad and few other things. The flood-affected area plot is still pestering him, but he needs to take care of other issues for the day.

“Maybe logarithm is one type of transformation. Maybe there are other transformations?” He keeps thinking during his meetings.

As his 3 pm meeting comes to an end, he still has to do quality checks on the claims data. As is usual, his checklist stays incomplete. The filing taxes just got moved to the next day. He was however resolute to catch up on Transformers before Bumblebee hits the theaters this year. So transformers is on for the evening. It is timely as he transforms the data.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 51 – Sometimes it is important to let the data speak

On the fifth day of February 2017, I embarked upon this journey to create a platform to make data analysis easy for at least a million people. When I began, and even now, I am no expert in this field. Over the years, I have learned several concepts from my mentors and other masters, and I believe I am only a conduit for sharing those with you.

Today, the fourth day of February 2018 marks the one-year milestone for us. We crossed 50,000 page views from more than 10,000 users across 136 countries. I hope my mission is underway and I have created interest in you towards data analysis 🙂

As a landmark event, I asked my mentor, Prof. Upmanu Lall from Columbia University in the City of New York to give us a lesson. He was gracious enough to agree and has taken time out of his ever busy schedule to teach us that “sometimes it is important to let the data speak.”

The man who has pioneered non-parametric methods for hydrology himself opens us to kernel density, why and what.

So, here is his gift on the occasion of our celebration of one-year together; or should I say,

Sometimes it is important to let the Master speak.

Sometimes it is important to let the data speak…


Founded in 1670, Charleston is the oldest and largest city in S. Carolina. For such a historic city and the surrounding community, reliable water supply is critical. Jill is the city manager. Charleston’s had a series of droughts in recent years, and the City Council wants to know how unusual these are, so they can think about investments to manage or capture water in the wettest months and have it available in the drier months.

Jill asked Jack, the comptroller to come up with a presentation so they could understand the situation. Jack was able to locate monthly rainfall data from 1893, and given over a 100 years of data felt that he could indeed present some ideas. After looking over the data, he first plotted a wet (June) and a dry month (January) to see how the rain changed across years.

A remarkable range of variation shows up especially in June, with many years with nearly zero rain, and others with as much as 15 inches of rain. On the other hand, the rainfall in January rarely exceeds 5 inches.

Jack first asked the question “How much rain can you count on in each of these months 10, 25, 50, 75 or 90% of the years”? The idea was to see how this compares with the amount of rain one would need to fall in each of the months to meet water needs that occur in that month.

This demand could change depending on how many people live in Charleston, and the collection area for water supply. Since the council could come up with different assumptions as to the future population or demand, Jack thought it would be wise to present his arguments using the rainfall needed per person, and the rainfall available per person.

Jack’s first take on this question was to pull together the January (June) data and re-arrange it from the smallest to the largest value for that month – using the 1893-2016 data. This was 124 values that are arranged from the lowest to the highest. Sorting the data this way let Jack identify the driest and the wettest years at a glance, and also to quickly do the calculation he was looking for.

Since Jack is R savvy, he actually just did this operation in R, using the quantile function, which reports the quantiles or percentiles of the data.

Looking at this, Jack could see that 10% of the time the rain could be less than 0.7 (1.47) inches and as much as 3.99 (6.79) inches in January (June).

He immediately wondered whether the June rainfall was likely to be low (high) in the same year that the January rainfall was high(low). Looking at their plot, it looked highly unlikely that there was a pattern. The calculated correlation was -0.03 confirming the plot.

Since presenting tables of numbers to a city council is never too exciting, Jack decided to present his findings graphically.

Again, being R savvy, he plotted the cumulative distribution function for the rain for each month.

The cumulative distribution function is just an estimate of the fraction of observations that are smaller than a given number – for instance for January (June), if we were curious about what the rainfall amount would be such that 50% of the values or a fraction 0.5 were smaller than these amounts, we could look that up with a value of the cumulative distribution function of 0.5, and find that it would be a rainfall amount of 1.95 (3.48) inches of rain.

The cumulative distribution function is then just a plot of the quantiles.

We see this in the figure below. For each of the quantiles Jack had calculated, he dropped blue (red) lines to show the corresponding rainfall for January (June). It is then easy to see that for each quantile the rain in January is less than the rain in June – both from the vertical lines and the cumulative distribution function curves.

Now refer back to Lesson 50. There the symmetry of the Normal distribution was talked about, and the area under the Normal density function between two rainfall values was interpreted as the percentage of time the values could fall in that range. We see some parallels.

If we are interested in the Probability that the January rainfall is less than 1.95 (P(Rain_{january} \le 1.95)), we see that this is 0.5, and for the Normal distribution this would be the area from –infinity (or really 0) rain to 1.95.

We also saw that the Normal distribution is symmetric. This would suggest that the areas to a distance left of the mean (median or the 50th percentile for the normal distribution is the same as the mean since it is a symmetric function) are the same as the areas the same distance right of the mean. Does that seem to hold for the Charleston rainfall?

These don’t really look symmetric.

Normally, the reason to use a model such as the Normal distribution is that we don’t have much data (e.g., 30 years) and if I am interested in a 100 year event (i.e. something that would occur on average once in a 100 years), unless I fit a probability model to it, I really cannot say anything about a 100 year event.

So, Jack wanted to make sure that even though he had 124 years of data, he was not challenged by a statistically savvy city council model for not fitting a probability distribution model, which if it is the right one, maybe more reliable.

So far, it doesn’t look promising to use a Normal distribution model. This is not a surprise since rainfall cannot be negative and we have so many years in which it is close to zero. In fact, the average rain in January (June) is 2.24 (4.04) inches compared to a maximum of 6.29 (16.29) inches over the 124 years. So, the data is definitely not symmetric, and the Normal distribution would not be a good choice.

Sometimes, people will argue that the logs of the data could be Normally distributed, since after taking logs, small positive values can be negative, so Jack thought he would check that out too. Here is what he found after taking logs of the rainfall data.

Now there was a bit of a dilemma. Comparing Min, Mean and Max, it seems that the data is not at all symmetric.

(Mean-Min) = 2.42 (62) vs Max-Min = 1.26 (1.62).

But the min and the max may not be the best way to look at this since these are rare values and may not always be symmetric, just if one anomalous value was recorded. So Jack also checked the symmetry across the quantiles.

Still not quite there 🙄

Finally, just to reassure himself, Jack used R to plot what the data for each month looked like along with the best fit of a Normal distribution, an exponential distribution (recommended for rainfall in a book), and also a kernel density plot.

The monthly rainfall values are plotted in red; the exponential distribution fit in blue; the normal distribution fit in green; and the kernel density plot in black.

For the original scale

For the LOG scale 

Looking at these plots for the original and the log-transformed data Jack was quite clear that the normal and the exponential models were very different from the kernel density plot and if the kernel density represents what the probability density of the data really looks like then one could not use the normal or the exponential model with the original data or the logs.

So what is a kernel density plot?

Aha – this needs a few lessons to master, but for now, think of it as a plot where the probability density of the variable – rainfall in our case – is estimated for any rainfall value (say rain = 3″) by looking at how many rainfall observations fell within a window of a given size (e.g., 1″) centered at the value of interest (3″).

So, the idea is that if a lot of values are close to 3″ then the probability density is high there, and conversely, if there are very few values then the probability density is low there. The result will clearly depend on the size of the window.

Also, observations that are far from the target should perhaps not be considered representative of the current value and should be down-weighted.

The weight or kernel function is then a function of the distance from the target. A common kernel function is the Normal distribution centered at each point at which an estimate is desired, with a certain specified standard deviation which functions as a window width.

For instance to estimate the density at rain =3″, we may use a Normal density function with mean equal to 3 and a standard deviation of 1. The density is then computed as the sum of the contribution of all observations at that point using this kernel.



K(u) = \frac{1}{\sqrt{2\pi}} e^{-\frac{u^{2}}{2}}

You can see that the kernel function K(u) looks like a Normal distribution with mean 0 and standard deviation equal to 1. Of course, comparing with Lesson 50, looking at the definition of u it is easy to see that the mean is x_{i} and the standard deviation is h.

Another way to interpret these equations is that the kernel density estimate at x is the weighted contribution of all observations x_{i} as being representative of what we expect at x. The weights are from the Normal distribution centered at x.

Thus, this is like doing a weighted moving average that is computed by moving from one value of x to another. As we can see from the figures for Charleston, the result does not look like a Normal distribution that is fit to the full data.

More on this later…

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.