Lesson 59 – The Generalized extreme value distribution

Recap

Two weeks ago, we met Maximus Extremus Distributus. He was taking different forms depending on the origin (parent) distribution.

Last week, Mumble, Joe, and Devine met to discuss the central ideas behind the extreme value distribution. They derived its exact distribution function.

F_{Y}(y)=P(Y \le y) = P(X_{1} \le y)P( X_{2} \le y) ...P(X_{n} \le y) = [F_{X}(y)]^{n}.

From this cumulative function, we can derive the probability function.

f_{Y}(y) = n[F_{X}(y)]^{n-1}f_{X}(y)

Mumble pointed out the issue of degeneration of the exact function as n \to \infty and introduced the idea of normalizing constants to stabilize the function.

If there are two normalizing constants a_{n}>0 and b_{n}, we can create a normalized version of Y as Y^{*} = \frac{Y-b_{n}}{a_{n}}.

Before we closed off for the week, we learned that Y^{*} converges to one of three types: the Type I, Type II, and Type III extreme value distributions.

If there exist normalizing constants a_{n}>0 and b_{n}, then,

P(\frac{Y - b_{n}}{a_{n}} \le z) \to G(z) as n \to \infty.

G(z) is the non-degenerate cumulative distribution function.

Type I (Gumbel Distribution): G(z) = e^{-e^{-\frac{z-\alpha}{\beta}}}. Double exponential distribution.

Type II (Frechet Distribution): G(z) = e^{-(\frac{z-\alpha}{\beta})^{-\gamma}} for z > \alpha and 0 for z \le \alpha. This is a single exponential function.

Type III (Weibull Distribution): G(z) = e^{-[-\frac{z-\alpha}{\beta}]^{\gamma}} for z < \alpha and 1 for z \ge \alpha. This is also a single exponential distribution.

The convergence

Let’s get some intuition on why the parent distributions converge to these three types. The full mathematical foundation is much more involved. We will leave that for later lessons and build a simple intuition here.

Exponential origin: Let’s take Joe’s wait time example from last week. We assume that the arrival time between successive vehicles has an exponential distribution. Let’s call this random variable X.

X \sim Exp(\lambda)

There are n arrival times between successive vehicles that can be shown as a set of random variables.

X_{1}, X_{2}, X_{3}, ..., X_{n}

The maximum wait time is the max of these numbers. Let’s call this maximum time, Y.

Y = max(X_{1}, X_{2}, X_{3}, ..., X_{n})

The cumulative function for an exponential distribution is F_{X}(x) = 1 - e^{-\lambda x}. Hence, for Y (maximum wait time), it will be

F_{Y}(y) = [1 - e^{-\lambda y}]^{n}

For simplicity, let’s assume a value of 1 for \lambda and take the binomial series expansion for F_{Y}(y) = [1 - e^{-\lambda y}]^{n}.

F_{Y}(y) = 1 - ne^{-y} + \frac{n(n-1)}{2!}e^{-2y} - ...

As n \to \infty, this series converges to e^{-ne^{-y}}, asymptotically a double exponential function.

Norming Constants: Now let’s derive the double exponential function with the idea of norming constants. We know F_{Y}(y) = [1 - e^{-y}]^{n} for \lambda = 1.

Let’s introduce a variable z = y + ln(n) and evaluate the function F_{Y}(y) at z. We are adding a constant ln(n) to y.

F_{Y}(y+ln(n)) = [1 - e^{-(y+ln(n))}]^{n}

F_{Y}(y+ln(n)) = [1 - \frac{1}{e^{ln(n)}}e^{-y}]^{n}

F_{Y}(y+ln(n)) = [1 - \frac{1}{n}e^{-y}]^{n}

As n \to \infty, F_{Y}(y+ln(n)) converges to e^{-e^{-y}}, a double exponential function. If you observe the equation carefully, it is of the form [1+\frac{1}{n}x]^{n}, which in the limit is e^{x}. In our case, x = -e^{-y}. Hence,

F_{Y}(y+ln(n)) =e^{-e^{-y}}

If we replace y = z - ln(n), we get,

F_{Y}(z) =e^{-e^{-(z-ln(n))}}

So, with appropriate scaling (stabilization/norming), we see a double exponential function when the origin is an exponential function.
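Here is a quick numerical check in R (a sketch, not part of the derivation): plot the renormalized exact function [1 - \frac{1}{n}e^{-y}]^{n} for a few values of n against the double exponential limit e^{-e^{-y}}. The values of n used here are arbitrary choices.

# Numerical check: renormalized exact function vs. the double exponential limit #
y = seq(-3, 8, by = 0.05)

plot(y, exp(-exp(-y)), type = "l", lwd = 3, font = 2, font.lab = 2,
     xlab = "y", ylab = "F(y + ln(n))")

for (n in c(5, 20, 100))
{
 exact = pmax(0, 1 - exp(-y)/n)^n # clipped at 0 for y < -ln(n)
 lines(y, exact, col = "grey", lty = 2)
}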

Power law origin: Let’s try one more parent function. This time, it is a power law function with a cumulative distribution function F_{X}(x)=1 - x^{-a}. We did not learn this distribution, but it is like a decay function with a controlling the degree of decay. The distribution of the maximum value Y of this function is

F_{Y}(y) = [1 - y^{-a}]^{n}

We will assume a new variable z = n^{1/a}y and evaluate the function at z.

F_{Y}(n^{1/a}y) = [1 - (n^{1/a}y)^{-a}]^{n}

=[1 - n^{-1}y^{-a}]^{n}

=[1 + \frac{1}{n}(-y^{-a})]^{n}

In the limit, as n \to \infty, F_{Y}(n^{1/a}y) = e^{-y^{-a}}

Hence, F_{Y}(z)=e^{-(\frac{z}{n^{1/a}})^{-a}}

So, parent distributions with power law tails converge to the single exponential Type II Frechet distribution. Similar norming constants can be found for the distributions that converge to the Type III Weibull distribution.
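A similar numerical check for the power law origin (a sketch; a = 2 and the values of n are arbitrary choices): the renormalized [1 - \frac{1}{n}y^{-a}]^{n} approaches the single exponential limit e^{-y^{-a}}.

# Numerical check: power law origin vs. the Frechet limit #
a = 2
y = seq(0.3, 5, by = 0.05)

plot(y, exp(-y^(-a)), type = "l", lwd = 3, font = 2, font.lab = 2,
     xlab = "y", ylab = "F(y * n^(1/a))")

for (n in c(5, 20, 100))
{
 exact = pmax(0, 1 - (y^(-a))/n)^n # clipped at 0 below the support of the parent
 lines(y, exact, col = "grey", lty = 2)
}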

To summarize, the three types are e^{-e^{-\frac{z-\alpha}{\beta}}}, e^{-(\frac{z-\alpha}{\beta})^{-\gamma}}, e^{-[-\frac{z-\alpha}{\beta}]^{\gamma}}.

They are of the form e^{-e^{-y}}, e^{-y^{-\gamma}}, e^{-(-y)^{\gamma}}.

Here’s a visual of how these three distributions look.

If the right tail is of exponential type, the extreme value distribution is a Gumbel distribution. Here the parent distribution (or the distribution of X) is unbounded on the right tail. Extremes of most common exponential type distributions such as normal, lognormal, exponential and gamma distributions converge to the double exponential Gumbel distribution. It is most commonly used to model maximum streamflow, maximum rainfall, earthquake occurrence and in some cases, maximum wind speed.

The Frechet distribution, like the Gumbel distribution, is unbounded on the right tail, but its tail is much fatter. Extremes from the Pareto distribution (Power Law) and the Cauchy distribution converge to the Frechet distribution. Rainfall and streamflow extremes, air pollution and economic impacts can be modeled using this type. Notice how the red line (Frechet distribution) has a heavier tail and sits above the black line (Gumbel distribution).

If the right tail converges to a finite endpoint, it is a Weibull distribution. The parent distribution is also bounded on the right. Remember that the extremes of the uniform distribution, which is bounded, converged to a Weibull. It is most widely used for minima, as in the strength of materials and fatigue analysis. It is also used in modeling temperature extremes and sea level.

The Generalized Extreme Value Distribution (GEV)

The three types of extreme value distributions can be combined into a single function called the generalized extreme value distribution (GEV). Richard von Mises and Jenkinson independently showed this.

G(z) = e^{-[1+\xi(\frac{z-\mu}{\sigma})]^{-1/\xi}_{+}}

\mu is the location parameter. \sigma > 0 is the scale parameter. \xi is the shape parameter.

When \xi \to 0, GEV tends to a Gumbel distribution. It is the same form [1+\frac{1}{n}x]^{n}, which in the limit is e^{x}. For GEV, the exponential term goes to e^{-(\frac{z-\mu}{\sigma})} hence yielding the double exponential G(z) = e^{-e^{-(\frac{z-\mu}{\sigma})}}.

When \xi > 0, GEV tends to the Frechet distribution. Replace \xi = 1 and see for yourself what you get.

When \xi < 0, GEV tends to the Weibull distribution. Replace \xi = -1 and check.

\mu and \sigma are the surrogate norming variables and \xi controls the shape.

GEV folds all the three types into one form. The parameters \mu, \sigma and  \xi can be estimated from the data, hence negating the necessity to know which Type a parent distribution or data converges to. The function has a closed form solution to compute the quantiles and probabilities. GEV also has the max stable property about which we will learn in some later lessons.
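As a preview of playing with GEV in R, here is a minimal sketch of that closed form, written directly from the equation above. The parameter values in the example are illustrative choices, not estimates from data; dedicated R packages will estimate \mu, \sigma and \xi for you.

# GEV cumulative and quantile functions from the closed form #
pgev = function(z, mu, sigma, xi)
{
 if (xi == 0) return(exp(-exp(-(z - mu)/sigma))) # Gumbel limit
 s = pmax(0, 1 + xi*(z - mu)/sigma)              # the [ ]_+ part
 exp(-s^(-1/xi))
}

qgev = function(p, mu, sigma, xi)
{
 if (xi == 0) return(mu - sigma*log(-log(p)))    # Gumbel limit
 mu + (sigma/xi)*((-log(p))^(-xi) - 1)
}

# example: the level exceeded with probability 0.01
qgev(0.99, mu = 40, sigma = 5, xi = 0.1)
pgev(qgev(0.99, 40, 5, 0.1), 40, 5, 0.1)         # returns 0.99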

All these concepts will become more concrete once we play with some data. Aren’t you itching to do some GEV in R? You know what is coming next week.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 58 – Max (Min): The language of extreme value distribution

Mumble, Joe, and Devine meet to discuss extreme values and extreme value distribution. Extremus Distributus.

 

 

J: Extreme value distribution? Are we talking about the pictures we saw last week, the maximum of any distribution following a specific probability distribution? How does my driving experience this morning fit the discussion?

M: Can you describe a bit more about your experience?

J: Yes. I was waiting at the merging lane near GW Bridge. The arrival time between successive vehicles was pretty short; I was left stranded there.

M: We can assume that the arrival time between successive vehicles has an exponential distribution. As you have shown in your graphic, let’s call this random variable X.

X \sim Exp(\lambda)

Let’s say there are n+1 vehicles so that there are n arrival times between successive vehicles that can be shown as a set of random variables.

X_{1}, X_{2}, X_{3}, ..., X_{n}

This means your wait time to merge between successive vehicles could have been any of these n random variables. The maximum wait time is the max of these numbers. Let’s call this maximum time, Y.

Y = max(X_{1}, X_{2}, X_{3}, ..., X_{n})

Y follows an extreme value distribution and it is related to the distribution of X_{1}, X_{2}, X_{3}, ..., X_{n}.

D: Since in your illustration, you have an exponential wait time as the parent distribution, Y, the maximum of X’s will have a distribution that is related to the exponential distribution. We can derive the exact function of Y based on what we know about X.

M: In fact, we use exponential distribution as the parent distribution to model maximum wait time between earthquakes, floods, droughts, etc. to price our insurance products.

J: How can we derive the probability function of Y from X? Can we go through that?

D: Sure. Mumble showed before that Y = max(X_{1}, X_{2}, X_{3}, ..., X_{n}).

The cumulative distribution function F_{Y}(y)=P(Y \le y). Because Y is the maximum of X_{1}, X_{2}, X_{3}, ..., X_{n}, y should be greater than or equal to all these values. In other words,

F_{Y}(y)=P(Y \le y) = P(X_{1} \le y \cap X_{2} \le y \cap ... X_{n} \le y)

We assume that X_{1}, X_{2}, X_{3}, ..., X_{n} are statistically independent; hence, the cumulative distribution of Y is the product of the cumulative distribution of individual parent distributions.

F_{Y}(y)=P(Y \le y) = P(X_{1} \le y)P( X_{2} \le y) ...P(X_{n} \le y) = [F_{X}(y)]^{n}

From this cumulative function, we can derive the probability function.

f_{Y}(y) = \frac{d}{dy}F_{Y}(y) = \frac{d}{dy}[F_{X}(y)]^{n}

f_{Y}(y) = n[F_{X}(y)]^{n-1}f_{X}(y)

M: If we know the distribution of X (parent), we can substitute that in our equations here and get the distribution of Y. Do you want to try it for the exponential wait times?

J: Sure. The cumulative function for an exponential distribution is F_{X}(x) = 1 - e^{-\lambda x}. The probability function is f_{X}(x)=\lambda e^{-\lambda x}. For Y it will be

F_{Y}(y) = [1 - e^{-\lambda y}]^{n}

f_{Y}(y) = n[1 - e^{-\lambda y}]^{n-1}\lambda e^{-\lambda y}

D: Excellent. Did you also notice that the cumulative and probability functions are dependent on n, the number of independent random variables X_{1}, X_{2}, X_{3}, ..., X_{n}?

Here is how they vary for exponential parent distribution. They shift to the right with increasing values of n.
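A sketch to reproduce a plot like this, written directly from the two equations above (assuming \lambda = 1; the values of n are arbitrary choices):

# F and f of the maximum of n exponential variables, for increasing n #
lambda = 1
y = seq(0, 12, by = 0.05)

par(mfrow = c(2, 1))
plot(y, (1 - exp(-lambda*y)), type = "l", font = 2, font.lab = 2,
     xlab = "y", ylab = "F(y)")
for (n in c(5, 20, 100)) lines(y, (1 - exp(-lambda*y))^n)

plot(y, lambda*exp(-lambda*y), type = "l", font = 2, font.lab = 2,
     xlab = "y", ylab = "f(y)")
for (n in c(5, 20, 100)) lines(y, n*(1 - exp(-lambda*y))^(n-1)*lambda*exp(-lambda*y))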

J: We can derive the functions for the minimum in a similar way I suppose.

M: That is correct. If Y_{min} is the minimum of the random variables, then Y_{min} > y_{min} only if each of the independent X_{1}, X_{2}, X_{3}, ..., X_{n} is greater than y_{min}.

P(X_{1}>y_{min}, X_{2}>y_{min}, ..., X_{n}>y_{min}) = 1 - F_{Y_{min}}(y_{min})

Using this equation, we can derive F_{Y_{min}}(y_{min}) = 1 - [1-F_{X}(y_{min})]^{n} and f_{Y_{min}}(y_{min})=n[1-F_{X}(y_{min})]^{n-1}f_{X}(y_{min}).
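A quick simulation check of this result (assuming an Exp(1) parent and n = 10, both arbitrary choices): the histogram of simulated minima should match n[1-F_{X}(y)]^{n-1}f_{X}(y).

# Simulated minima vs. the derived density #
n = 10
y_min = replicate(5000, min(rexp(n, rate = 1)))

hist(y_min, prob = T, breaks = 30, main = "", font = 2, font.lab = 2,
     xlab = "Minimum of n exponential variables")
y = seq(0, max(y_min), by = 0.01)
lines(y, n*(exp(-y))^(n-1)*exp(-y), col = "red", lwd = 2) # 1 - F(y) = exp(-y), f(y) = exp(-y)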

J: This is cool. All we need to know is the function for X, and we know everything about the extremes.

M: That is true to an extent. But what will we do if we do not know the form of the parent distribution with certainty? Remember Lesson 51? Did you check what happens to the functions with increasing values of n, i.e., n tends to \infty?

J: Let me plot the cumulative function for increasing values of n and see.

D: Joe, as you can see, F_{Y}(y) = [F_{X}(y)]^{n} tends to 0 as n tends to \infty. The function degenerates to a point. It is unstable.

J: Hmm. How do we stabilize it then?

M: Good question. We can work with a normalized version of Y instead of Y.

If there are two normalizing constants a_{n}>0 and b_{n}, we can create a normalized version of Y.

Y^{*} = \frac{Y-b_{n}}{a_{n}}

Proper values of these two normalizing constants a_{n}>0 and b_{n} will stabilize Y for increasing values of n. We need to find the distribution that Y^{*} can take in the limit as n tends to \infty.

D: Look at this visual to understand what stabilizing (norming) for increasing value of n means. It is for an exponential parent, our example. I am assuming a_{n} = 1 and b_{n} varies with n.

The normalized function does not degenerate for large values of n.
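Here is a sketch of that visual for the exponential parent (assuming \lambda = 1, a_{n} = 1 and b_{n} = ln(n)): the raw function [F_{X}(y)]^{n} drifts off to the right and degenerates, while the shifted version settles onto a single curve.

# Raw vs. normalized maximum for an Exp(1) parent #
par(mfrow = c(1, 2))

y = seq(0, 15, by = 0.05)
plot(y, 1 - exp(-y), type = "l", font = 2, font.lab = 2,
     xlab = "y", ylab = "F(y)^n", main = "Raw")
for (n in c(10, 100, 1000)) lines(y, (1 - exp(-y))^n)

z = seq(-3, 8, by = 0.05)
plot(z, exp(-exp(-z)), type = "l", lwd = 3, font = 2, font.lab = 2,
     xlab = "z", ylab = "F(z + ln(n))^n", main = "Shifted by ln(n)")
for (n in c(10, 100, 1000)) lines(z, pmax(0, 1 - exp(-z)/n)^n, col = "grey")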

M: It turns out that the limit distribution of Y^{*}, the normalized version of Y, can be one of the three types, Type I, Type II and Type III.

If there exist normalizing constants a_{n}>0 and b_{n}, then,

P(\frac{Y - b_{n}}{a_{n}} \le z) \to G(z) as n \to \infty.

G(z) is the non-degenerate cumulative distribution function.

Type I (Gumbel Distribution): G(z) = e^{-e^{-\frac{z-\alpha}{\beta}}}. Double exponential distribution.

Type II (Frechet Distribution): G(z) = e^{-(\frac{z-\alpha}{\beta})^{-\gamma}} for z > \alpha and 0 for z \le \alpha. This is a single exponential function.

Type III (Weibull Distribution): G(z) = e^{-[-\frac{z-\alpha}{\beta}]^{\gamma}} for z < \alpha and 1 for z \ge \alpha. This is also a single exponential distribution.

The normalized version of Y converges to these three types, also called the family of extreme value distribution. It is said that the distribution F_{Y}(y) is in the domain of attraction of G(z). \alpha, \beta, \gamma are the location (central tendency), scale and the shape controlling parameters.

J: This convergence theorem sounds like the central limit theorem.

D: Yes, just as the sum of independent random variables converges to the normal distribution in the limit when n \to \infty, the max of independent random variables converges to one of these three types (Gumbel, Frechet, or Weibull) depending on the parent distribution. Generally, exponential tails (for example, exponential and normal parent distributions) tend to the Gumbel, and polynomial tails tend to the Frechet distribution.

You have already seen how the Gumbel looks like when exponential wait times converge. Here is an example Frechet and Weibull distribution.

Fisher and Tippett derived these three limiting distributions in the late 1920s. Emil Julius Gumbel, in chapter 5 of his book “Statistics of Extremes,” explains the fundamentals of deriving the distributions.

J: This is plentiful. Are there any properties of the three types? Can we guess which distribution is used where? The convergence of extremes to these three types seems very powerful. We can model outliers without much trouble. Some examples in R might help understand these better.

You already have a handful to digest. Let’s continue next week after you have processed it.

Be cautious. Coming up with the values for \alpha, \beta, \gamma in the real world involves uncertainty. The extreme value distribution need not be an elixir for outliers.

 

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 57 – My name is Maximus Extremus Distributus

The best school districts in New York have a name.

The name’s Extremus Distributus.

Heavy damaging winds have a name.

Extremus Distributus.

Central limit theorem tells you that the mean of the normal distribution follows a normal distribution in the limit.

Centerus Normalus Distributus.

The maxima of normal distribution also follow a distinct distribution.

Maximus Extremus Distributus

The maxima of several other distributions also follow a Maximus Extremus Distributus.

Tertia pars sequetur …

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 56 – Continuous distributions in R: Part II

The sun breaks, the clock rolls, and the temperature warms marking the end of the hibernation.

Last year, when we rolled back into the winter, we were learning discrete distributions in R using November temperature data.

It behooves that today, inviting the warmth, we continue learning continuous distributions using March temperature.

It’s 40 degrees Fahrenheit with 40% humidity today in New York. Pleasant enough to take a nice walk without much effort. It comes after a stormy week.

Last week, before leaving, I asked you whether we will see a gusty event before today. Indeed, midweek, we had another windy Wednesday. The expected wait time is 19 days. It does not happen every 19 days. It may happen earlier, or we may have to wait longer. They average out to 19 days.

Since it is warm and relatively humid, we should analyze the data for March temperature.

Get the data from our source. Go through the chores of setting up the code. Coffee is a must.

FIRST STEPS

Step 1: You can download the filtered data file here.

Step 2: Create a new folder, “lesson56,” and save the data file in this folder.

Step 3: Create a new code for this lesson, “lesson56_code.R,” and save it in the same folder.

Step 4: Choose your working directory. “lesson56” is the folder where we stored the data file. Use “setwd(“path”)” to set the path to this folder.

setwd("path to your folder")

Step 5: Read the data into R workspace

I have the data in the file named “cp_temperature.csv.”

There are six columns in this data file. Year, Month, Day, TAVG (the average temperature for the day), TMAX (the maximum temperature for the day) and TMIN (the minimum temperature for the day).

We have data from 1869 onwards. Type the following line in your code and execute it. Use header=TRUE in the command.

# Read the file #
cp_data = read.csv("cp_temperature.csv",header=TRUE)
Let’s Understand the Central Limit Theorem

I want to take the data for TAVG in March and check its distribution. Since we are only nine days into March 2018, we will take the 31 days data for March 2017. You can use the which command to extract data.

# Extract March 2017 Daily Temperature Data #
index = which((cp_data[,2]==3) & (cp_data[,1] == 2017))
march_2017 = cp_data[index,4]

Through this command, you are looking for the days when the second column (Month) is 3 (March) and the first column (Year) is 2017. So “index” will be the row index in the data matrix. Then, we extract the fourth column (TAVG) of these rows.

This is how the probability distribution of these 31 values looks. Use the following lines to make this plot.

library(locfit)
hist(march_2017,prob=T,main="",font=2,font.lab=2,xlab="March 2017 Daily Temperature")
lines(locfit(~march_2017))

Doesn’t it look crazy? Let’s look at 2016 March daily data also.

These values are much more skewed.

Recall central limit theorem from lesson 48.

If S_{n} is the sum or average of n independent random variables, then the distribution function of S_{n} can be well-approximated by a continuous function known as the normal density function.

Let’s develop a simple code to see this manifest. These are the steps.

  1. Take the 31 days TAVG data from March of each year. For simplicity, I will do it from 1951 to 2017. 67 years.
  2. Plot the distribution of these 31 values. They may be skewed.
  3. Compute the average monthly temperature for March. AVG is the average of the 31 values in March in that year. AVG = \frac{1}{n}\sum_{i=1}^{i=31}TAVG_{i}.
  4. Analyze the distribution of these averages (AVG). Because AVG is the average of n random variables, in the limit, it should converge to a normal distribution. In our case, n = 31; we create 67 values of AVG (1951 to 2017) and verify the probability distribution.

Here is the view of the manifestation of the central limit theorem. I set it up like an animation to run in the plot domain.

In the top panel, you will see the distribution of TAVG for March for each year unfolding. Each time I am marking the average of these 31 values with a red color triangle. Then, in the second panel, these red triangles are plotted as a distribution.

Here is the code for it. You can try it yourself.

## Code to demonstrate the Central Limit Theorem ##
yr1 = 1951
yr2 = 2017
n = yr2 - yr1 + 1

index = which((cp_data[,2]==3) & (cp_data[,1] == yr1)) 
march_daily_temperature = cp_data[index,4]

par(mfrow=c(2,1))
plot(locfit(~march_daily_temperature),main="Average Daily Temperature",col="grey",xlim=c(10,80),ylim=c(0,0.1),font=2,font.lab=2,xlab="Temperature (deg F)")
points(mean(march_daily_temperature,na.rm=T),0,col="red",pch=25)

avg = matrix(NA,nrow=n,ncol=1)
for (i in 1: n)
{
 yr = 1950 + i
 index2 = which((cp_data[,1]==yr) & (cp_data[,2]==3))
 daily = cp_data[index2,4]

lines(locfit(~daily),col="grey")
 
 avg[i,1] = mean(cp_data[index2,4],na.rm=T)
 points(avg[i,1],0,col="red",pch=25)
 Sys.sleep(0.5) 
}
plot(locfit(~avg),main="Average Monthly Temperature",col="white",lwd=2,xlim=c(10,80),font=2,font.lab=2,xlab="Temperature (deg F)")
lines(locfit(~avg),col="red",lwd=2)

In lesson 48, I also showed how the normal distribution emerges from the convolution of Uniform and Poisson distribution. Here is the code for it.

# Uniform Dist #
x1 = runif(20000)
x2 = runif(20000)
x3 = runif(20000)
x4 = runif(20000)
x5 = runif(20000)

z1 = x1
z2 = x1 + x2
z3 = x1 + x2 + x3
z4 = x1 + x2 + x3 + x4
z5 = x1 + x2 + x3 + x4 + x5

plot(density(x1),xlim=c(0,5),ylim=c(0,2),font=2,font.lab=2,main="",xlab="Sn")
text(0.2,1.1,"Sn = X1")
Sys.sleep(1)

lines(density(z2),col="green",lwd=2)
text(1.2,1.1,"Sn = X1+X2",col="green")
Sys.sleep(1)

lines(density(z3),col="pink",lwd=2)
text(2.2,0.8,"Sn = X1+X2+X3",col="pink")
Sys.sleep(1)

lines(density(z4),col="grey",lwd=2)
text(3.1,0.6,"Sn = X1+X2+X3+X4",col="grey")
Sys.sleep(1)

lines(density(z5),col="yellow",lwd=2)
text(3.6,0.3,"Sn = X1+X2+X3+X4+X5",col="yellow")
Sys.sleep(1)

# Poisson Dist #
lambda = 1
x1 = rpois(10000,lambda)
x2 = rpois(10000,lambda)
x3 = rpois(10000,lambda)
x4 = rpois(10000,lambda)
x5 = rpois(10000,lambda)
x6 = rpois(10000,lambda)
x7 = rpois(10000,lambda)
x8 = rpois(10000,lambda)
x9 = rpois(10000,lambda)
x10 = rpois(10000,lambda)

z1 = x1
z2 = x1 + x2
z3 = x1 + x2 + x3
z4 = x1 + x2 + x3 + x4
z5 = x1 + x2 + x3 + x4 + x5
z6 = x1 + x2 + x3 + x4 + x5 + x6
z7 = x1 + x2 + x3 + x4 + x5 + x6 + x7
z8 = x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8
z9 = x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
z10 = x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10

Sys.sleep(1) 
hist(x1,xlim=c(0,20),ylim=c(0,5000),col="red",breaks=50,main="",font=2,font.lab=2,xlab="Sn")
Sys.sleep(1)

hist(z2,add=T,col="green",breaks=50)
Sys.sleep(1)

hist(z3,add=T,col="pink",breaks=50)
Sys.sleep(1)

hist(z4,add=T,col="grey",breaks=50)
Sys.sleep(1)

hist(z5,add=T,col="yellow",breaks=50)
Sys.sleep(1)

hist(z6,add=T,col="orange",breaks=50)
Sys.sleep(1)

hist(z7,add=T,col="blue",breaks=50)
Sys.sleep(1)

hist(z8,add=T,col="violet",breaks=50)
Sys.sleep(1)

hist(z9,add=T,col="magenta",breaks=50)
Sys.sleep(1)

hist(z10,add=T,col="black",breaks=50)
text(15,1100,"Sn = X1 + X2 + ... + X10")

Normal Distribution

We now know that AVG is a normal distribution.

What is the probability that the average March monthly temperature AVG will be less than or equal to 40?

.

.

.

I know you are reaching for your standard normal tables to compute

 P(AVG \le 40) = P(\frac{AVG-\mu}{\sigma} \le \frac{40-42.197}{3.113}) = P(Z \le -0.706) = 0.24

While you can do this on paper, in R, you can just use the pnorm command like this.

# Probability Computations #
pnorm(40,mean(avg),sd(avg))

If you want to compute the probability that AVG will exceed 50 deg F, you can do it like this.

1 - pnorm(50,mean(avg),sd(avg))

Now, think about the qnorm command.
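As a hint, qnorm is the inverse of pnorm: it takes a non-exceedance probability and returns the corresponding temperature value.

# Quantile computations #
qnorm(0.24, mean(avg), sd(avg)) # roughly 40 deg F, reversing the computation above
qnorm(0.5, mean(avg), sd(avg))  # the median, which equals the mean for a normal distribution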

Lognormal Distribution

If AVG follows a normal distribution, then, X = e^{AVG} is a lognormal distribution because log of X is normal.

We know that exponentiating a normal distribution will result in lognormal distribution.

The average monthly March temperature, AVG, if exponentiated, will transform into a lognormal distribution.

Instead of merely taking the numbers and showing the distribution of the exponentials, I think it will be fun to see where in the real world we do exponential of temperature.

Have you ever wondered if air temperature influences the amount of water vapor it can hold?

The water holding capacity (vapor) increases as the temperature increases. The higher the temperature, the stronger will be the bond between water and air molecules.

We can measure the vapor pressure using this simple equation.

VP = 0.61078e^{\frac{17.27T}{T+243.05}}

VP is the vapor pressure in kilopascals, T is the temperature in deg C. This is the famous Clausius-Clapeyron approximation used in meteorology and climatology to compute the vapor pressure.

You can check the equation and see that the water vapor pressure changes approximately exponentially with temperature. The water-holding capacity of the atmosphere increases by about 7% for every 1 °C rise in temperature.

Let’s compute the vapor pressure for these 67 values of temperature and check out the distribution of vapor pressure, which is roughly exponential of a normal distribution.

## Vapor Pressure ##
Tc = (avg-32)*(5/9) 
vp = 0.61078*exp((17.27*Tc)/(Tc+243.05))

# Plotting and Comparing #
par(mfrow=c(2,1))
hist(Tc,prob=T,main="Average Monthly Temperature",lwd=2,font=2,font.lab=2,xlab="Temperature (deg C)")
lines(locfit(~Tc),col="grey")

hist(vp,prob=T,main="Vapor Pressure ~ exp(T)",lwd=2,font=2,font.lab=2,xlab="Vapor Pressure (kpa)")
lines(locfit(~vp),col="grey")

You should see this plot in your plot domain.

The top panel is the temperature data expressed in deg C.

The bottom panel is the vapor pressure. See how the data becomes skewed. We transformed the data from normal to a lognormal distribution.

You must have noticed that I am using a line par(mfrow). This command is used to make multiple plots as panels. par(mfrow = c(2,1)) will split the plot space into two rows and one column so that we can make two plots, one below the other.

Think about the square of the temperature. X = AVG^{2}. What distribution will it follow? AVG square – Chi-square?

A standard rule of thumb for distributions in R is to use the letters p and q in front of the name of the distribution.

Exponential distribution
pexp to compute the probability.
qexp to compute the quantile.

Gamma distribution
pgamma to compute the probability.
qgamma to compute the quantile.

Normal distribution
pnorm to compute the probability.
qnorm to compute the quantile.

Lognormal distribution
plnorm to compute the probability.
qlnorm to compute the quantile.

Chi-square distribution
pchisq to compute the probability.
qchisq to compute the quantile.

You should have guessed the command if you want to draw random variables from a distribution.
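Yes, the prefix is r. For example, for the lognormal distribution:

# The same naming pattern, with the r prefix for random draws #
plnorm(2, meanlog = 0, sdlog = 1)    # probability P(X <= 2)
qlnorm(0.75, meanlog = 0, sdlog = 1) # 75th percentile
rlnorm(5, meanlog = 0, sdlog = 1)    # five random draws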

And so, as the beautiful warm day ends, we end our discussion of continuous distributions for now.

As the clocks rolleth, the days breaketh soon for you to relish the spring warmeth. Relish the warmeth, for the “extremum” will cometh.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 55 – Continuous distributions in R: Part I

Another one of those pharmacy umbrellas is resting in peace after yesterday’s Nor’easter. On a day when a trailer toppled on the Verrazano-Narrows Bridge, I must be happy that it is only my umbrella that blew away and not me. These little things are so flimsy they can at best handle a happy drizzle.

As per the National Weather Service, winds (gusts) were up to 35 miles per hour yesterday. We are still under a wind advisory as I write this.

My mind is on exploring how many such gusty events happened in the past, what is the wait time between them and more.

I have requested the daily wind gust data for New York City from NOAA’s Climate Data Online.

So, get your drink for the day and fire up your R consoles. Let’s recognize this Nor’easter with some data analysis before it fades away as a statistic. We don’t know how long we have to wait for another of such event. Or do we?

FIRST STEPS

Step 1: Get the data
You can download the filtered data file here.

Step 2: Create a new folder on your computer
Let us call this folder “lesson55.” Save the data file in this folder.

Step 3: Create a new code in R
Create a new code for this lesson. “File >> New >> R script.” Save the code in the same folder “lesson55” using the “save” button or by using “Ctrl+S.” Use .R as the extension — “lesson55_code.R.” Now your R code and the data file are in the same folder.

Step 4: Choose your working directory
“lesson55” is the folder where we stored the data file. Use “setwd(“path”)” to set the path to this folder. Execute the line by clicking the “Run” button on the top right.

setwd("path to your folder")

Step 5: Read the data into R workspace
I have the data in the file named “cp_wind.csv.” I am only showing the wind gust data for the day. The column name (according to NOAA CDO) is WSFG.

WSFG = Peak gust wind speed (miles per hour or meters per second as per user preference)

Type the following line in your code and execute it. Use header=TRUE in the command.

#### Read the file ####
cp_data = read.csv("cp_wind.csv",header=TRUE)
Counts – Poisson Distribution

There is data from 1984 to 2017, 34 years.

Execute the following line in your code to count the total number of days when wind gust exceeded 35 mph.

# Select Wind Gust data > 35 mph from the file #
gust_events = length(which(cp_data$WSFG > 35))

The which command outputs the index of the row that is greater than 35 mph. The length command counts the number of such days.

There are 653 such events in 34 years, an average of 19.20 events per year.

Let’s count the events each year from 1984 to 2017.

Write the following lines in your code.

# Count the number of events each year #
years = range(cp_data$Year)

number_years = diff(years) + 1

gust_counts_yearly = matrix(NA,nrow=number_years,ncol=1)

for (i in 1:number_years)
{
 year = 1983 + i
 index = which(cp_data$Year == year)
 gust_counts_yearly[i,1] = length(which(cp_data$WSFG[index] > 35))
}

plot(1984:2017,gust_counts_yearly,type="o",font=2,font.lab=2,xlab="Years",ylab="Gust Counts")

hist(gust_counts_yearly)
lambda = mean(gust_counts_yearly)

Assuming that we can model the number of gust events each year as a Poisson process with the rate you estimated above (19.2 per year), we can generate Poisson random numbers to replicate the number of gust events (or counts) in a year.

If we generate 34 numbers like this, we just created one possible scenario of yearly gust events (past or future).

Run the following line in your code.

# create one scenario #
counts_scenario = rpois(34,lambda)

Now, you must be wondering how the command rpois works? What is the mechanism for creating these random variables?

To understand this, we have to start with uniform distribution and the cumulative density functions.

F(x) is an increasing function of x, and we know that 0 \le F(x) \le 1. In fact, if X is a random variable with cumulative distribution function F, then F(X) is uniformly distributed on [0, 1].

For any uniform random number u on [0, 1], we can therefore set F(x) = u (for the uniform distribution, the cumulative distribution function is F(u) = u).

Using this relation, we can obtain x using the inverse of the function F. x = F^{-1}(u).

If we know the cumulative distribution function F of a random variable X, we can obtain a specific value from this distribution by solving the inverse function for a uniform random variable u. The following graphic should make it clear as to how a uniform random variable u maps to a random variable x from a distribution. rpois implicitly does this.
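Here is a small sketch of this inverse transform idea using the exponential distribution, whose inverse F^{-1}(u) = -\frac{1}{\lambda}ln(1-u) has a simple closed form. (For a discrete distribution like the Poisson, the inverse is found by stepping through the cumulative probabilities, but the principle is the same.)

# Inverse transform sampling for the exponential distribution #
lambda = 19.2
u = runif(10000)         # uniform random numbers on [0, 1]
x = -log(1 - u)/lambda   # mapped through the inverse of F(x) = 1 - exp(-lambda*x)

par(mfrow = c(2, 1))
hist(x, breaks = 50, main = "Inverse transform", font = 2, font.lab = 2, xlab = "x")
hist(rexp(10000, lambda), breaks = 50, main = "rexp", font = 2, font.lab = 2, xlab = "x")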

Uniform Distribution

Do you remember lesson 42? Uniform distribution is a special case of the beta distribution with a standard form

f(x) = cx^{a-1}(1-x)^{b-1} for 0 < x < 1.

a and b are the parameters that control the shape of the distribution. They can take any positive real numbers; a > 0 and b > 0.

c is called the normalizing constant. It ensures that the pdf integrates to 1. The area under the graph of x^{a-1}(1-x)^{b-1} between 0 and 1 is called the beta function, and the normalizing constant c is its reciprocal. For integer values of a and b, this constant c is defined using the factorial function.

 c = \frac{(a + b - 1)!}{(a-1)!(b-1)!}

Substitute a = 1 and b = 1 in the standard function.

 c = \frac{(1 + 1 - 1)!}{(1-1)!(1-1)!} = 1

f(x) = 1x^{1-1}(1-x)^{1-1} = 1

A constant value 1 for all x. This flat function is called the uniform distribution.

In lesson 42, I showed you an animation of how the beta function changes for different values of a and b. Here is the code for that. You can try it for yourself.

I am first holding b constant at 1 and varying a from 0.1 to 2 to see how the function changes. Then, I keep a at 2 and increase b from 1 to 2. If you select the lines altogether and run the code, you will see the plot domain running this animation. The Sys.sleep(1) is the control to hold the animation/loop speed.

### Beta / Uniform Distribution ###

### Animation ###

a = seq(from = 0.1, to = 2, by = 0.1)
b = 1

x = seq(from = 0, to = 1, by = 0.01)
for (i in 1:length(a))
{
 c = factorial(a[i]+b-1)/(factorial(a[i]-1)*factorial(b-1))
 fx = c*x^(a[i]-1)*(1-x)^(b-1)
 
 plot(x,fx,type="l",font=2,font.lab=2,ylab="f(x)",ylim=c(0,5),main="Beta Family")
 txt1 = paste("a = ",a[i])
 txt2 = paste("b = ",b)
 
 text(0.6,4,txt1,col="red",font=2)
 text(0.6,3.5,txt2,col="red",font=2)
 
 Sys.sleep(1) 
}

a = 2
b = seq(from = 1, to = 2, by = 0.1)

x = seq(from = 0, to = 1, by = 0.01)
for (i in 1:length(b))
{
 c = factorial(a+b[i]-1)/(factorial(a-1)*factorial(b[i]-1))
 fx = c*x^(a-1)*(1-x)^(b[i]-1)
 
 plot(x,fx,type="l",font=2,font.lab=2,ylab="f(x)",ylim=c(0,5),main="Beta Family")
 txt1 = paste("a = ",a)
 txt2 = paste("b = ",b[i])
 
 text(0.6,4,txt1,col="red",font=2)
 text(0.6,3.5,txt2,col="red",font=2)
 
 Sys.sleep(1) 
}

You should see this running in your plot domain.

Exponential Distribution

Let’s get back to the gusty winds.

Remember what we discussed in lesson 43 about the exponential distribution. The time between these gusty events (wind gusts > 35 mph) follows an exponential distribution.

 P(T > t) = P(N = 0) = \frac{e^{-\lambda t}(\lambda t)^{0}}{0!} = e^{-\lambda t}

Here is the code to measure the wait time (in days) between these 653 events, plot them as a density plot and measure the average wait time between the events (expected value of the exponential distribution).

### Exponential Distribution ###

# Select the indices of the Wind Gust data > 35 mph from the file #
library(locfit)

gust_index = which(cp_data$WSFG > 35)

gust_waittime = diff(gust_index)

x = as.numeric(gust_waittime)
hist(x,prob=T,main="",xlim=c(0.1,200),lwd=2,xlab="Wait time in days",ylab="f(x)",font=2,font.lab=2)
lines(locfit(~x),xlim=c(0.1,200),col="red",lwd=2,xlab="Wait time in days",ylab="f(x)",font=2,font.lab=2)

ave_waittime = mean(gust_waittime)

The first event recorded was on the 31st day from the beginning of the data record, January 31, 1984. A wait time of 31 days. The second event occurred again on day 55, February 24, 1984. A wait time of 24 days. If we record these wait times, we get a distribution of them like this.

Now, let’s generate 34 random numbers that replicate the time between the gusty events.

# Simulating wait times #
simulated_waittimes = rexp(34,lambda)*365

plot(locfit(~simulated_waittimes),xlim=c(0.1,200),col="red",lwd=2,xlab="Wait time in days",ylab="f(x)",font=2,font.lab=2)

rexp is the command to generate random exponentially distributed numbers with a certain rate of occurrence lambda. In our case, lambda is 19.2 per year. I multiplied them by 365 to convert them to wait time in days.

What is the probability of waiting more than 20 days for the next gusty event?

Yes, you’d compute P(T > 20) = e^{-20\lambda} = e^{-20/19} = 0.3488.

In Rstudio, you can use this line.

# Probability of wait time greater than 20 days #
lambda = 1/ave_waittime

1 - pexp(20, lambda)

pexp computes P(T \le 20).

Now, imagine we waited for 20 days, what is the probability of waiting another 20 days?

Yes, no calculation required. It is 34.88%. Memoryless property. We learned this in lesson 44.

I also showed you an animation for how it worked based on the idea of shifting the distribution by 20. Here is the code for that. Try it.

## Memoryless property ##

# animation plot to understand the memoryless property #
Sys.sleep(1)
t = seq(from=0,to=200,by=1)
ft = lambda*exp(-lambda*t)
plot(t,ft,type="l",font=2,font.lab=2,xlab="Wait Time to Gusty Events (days)",ylab="f(t)")

t = seq(from=20,to=200,by=1)

Sys.sleep(1)

abline(v=ave_waittime,col="red")

Sys.sleep(1)

ft = lambda*exp(-lambda*t)
lines(t,ft,col="red",lty=2,lwd=2,type="o")

Sys.sleep(1)

t = seq(from = 20, to = 200, by =1)
ft = lambda*exp(-lambda*(t-20))
lines(t,ft,col="red",type="o")
Gamma Distribution

The wait time for the ‘r’th event follows a Gamma distribution. In this example, wait time for the second gusty event from now, the third gusty event from now, etc. follows a Gamma distribution.

Check out this animation code for it.

r = seq(from=1,to=5,by=1)
t = seq(from=0,to=200,by=1)

for (i in 1:length(r))
{
 gt = (1/factorial(r[i]-1))*(lambda*exp(-lambda*t)*(lambda*t)^(r[i]-1))
 
 plot(t,gt,type="l",font=2,font.lab=2,xlab="Wait time",ylab="f(t)")
 txt1 = paste("lambda = ",round(lambda,2),sep="")
 txt2 = paste("r = ",r[i],sep="")
 text(50,0.002,txt1,col="red",font=2,font.lab=2)
 text(50,0.004,txt2,col="red",font=2,font.lab=2)
 Sys.sleep(1) 
}

Alternately, you can also create random numbers from Gamma distribution using the “rgamma” function and plot the density function.

# alternate way #
x1 = rgamma(10000,shape=1,scale=ave_waittime)
plot(locfit(~x1),font=2,font.lab=2,xlab="Wait time",ylab="f(t)")
txt1 = paste("lambda = ",round(lambda,2),sep="")
txt2 = paste("r = ",1,sep="")
text(50,0.002,txt1,col="red",font=2,font.lab=2)
text(50,0.004,txt2,col="red",font=2,font.lab=2)

x2 = rgamma(10000,shape=2,scale=ave_waittime)
plot(locfit(~x2),font=2,font.lab=2,xlab="Wait time",ylab="f(t)")
txt1 = paste("lambda = ",round(lambda,2),sep="")
txt2 = paste("r = ",2,sep="")
text(50,0.002,txt1,col="red",font=2,font.lab=2)
text(50,0.004,txt2,col="red",font=2,font.lab=2)

x3 = rgamma(10000,shape=3,scale=ave_waittime)
plot(locfit(~x3),font=2,font.lab=2,xlab="Wait time",ylab="f(t)")
txt1 = paste("lambda = ",round(lambda,2),sep="")
txt2 = paste("r = ",3,sep="")
text(50,0.002,txt1,col="red",font=2,font.lab=2)
text(50,0.004,txt2,col="red",font=2,font.lab=2)

Can you now tell me the probability of seeing the next gusty event before you see the next gusty lesson (part II)?

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 54 – The Bernoulli gene

D: That thought you have going must be bothering you. It gave you a long face.

J: Hello. Yes, it is bugging me. We are racing ahead with the lessons, rolling on one distribution after another. I follow each lesson and can understand the distributions as they are. But there are many, and I am guessing these are just the major ones we are learning. How do I make sense of them? Is there any way to summarize the distributions we learned so far?

D: Hmm. If I tell you that there is a Bernoulli connection to the major distributions we learned, can you trace it back?

J: I am not following.

D: Okay, let’s try this. I will show you a graphic that connects the distributions we learned thus far. Can you work with me to trace the links? That will be an efficient way, both to summarize and remember the distributions.

J: I like the thought and the exercise. Let’s do it. Show me the graphic.

D: Here. I put the distributions in one place and connected the dots. It’s a high-resolution image. Click on it to open a full page graphic. I want you to walk me through it.

J: Cool. The Bernoulli trial (0 1) is the origin. Lesson 31. There are two possibilities; event occurred – success or event did not occur – failure. There is a sequence (Bernoulli gene code) of 0’s and 1’s (010110010000), n Bernoulli trials. There is a probability of occurrence of p that is constant over all the trials, and the trials are assumed to be independent.

D: Good. Now, what is the first distribution we learned?

J: The binomial distribution. Lesson 32. The number of successes in a Bernoulli sequence of n trials is a binomial distribution.

D: Correct. Can you write down the probability distribution function for the Binomial distribution?

J: Sure.  P(X = x) = \frac{n!}{(n-x)!x!}p^{x}(1-p)^{n-x}

D: Now, refer to the recent lesson we did on the normal distribution and the central limit theorem. It is lesson 48.

J: Ah, I see how you connected the Binomial to the Normal distribution in the graphic. The Binomial distribution can be estimated very accurately using the normal density function. I went through the looong derivation you had.

 P(X = x) = \frac{n!}{(n-x)!x!}p^{x}(1-p)^{n-x} = f(x) = \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{\frac{-1}{2}(\frac{x-\mu}{\sigma})^{2}}

Then, from the recent lessons, I know that the standard normal, the log-normal and the Chi-square distributions are transformations of the normal distribution.

The standard normal distribution is derived by subtracting from the distribution X, the mean (\mu) and dividing by the standard deviation (\sigma). Lesson 50.

 Z = \frac{X - \mu}{\sigma}

 Y = e^{Z} is a log-normal distribution because the log of Y is normal, standard normal. Lesson 52.

Last week was Chi-square distribution. Lesson 53. If Z follows a normal distribution, then, \chi = Z^{2} is a Chi-square distribution with one degree of freedom. If there are n standard normal random variables, Z_{1}, Z_{2}, ..., Z_{n} , their sum of squares is a Chi-square distribution with n degrees of freedom.

\chi = Z_{1}^{2}+Z_{2}^{2}+ ... + Z_{n}^{2}

I now see the flow from Bernoulli to Binomial to Normal to Standard Normal to Log-normal and Chi-square. Amazing how deep the Bernoulli gene code is.

D: Indeed. Now go through the rest of the graphic. You will see the connections of the other distributions.

J: Yes. I am looking at the Poisson distribution now. If we are counting the number of 1’s of the Bernoulli sequence in an interval, then these counts follow a Poisson distribution. Lessons 36 and 37.

P(X = x) = \frac{e^{-\lambda t}(\lambda t)^{x}}{x!}

\lambda is the rate of occurrence; the average number of 1’s per interval. This function is derived as an approximate function from the Binomial distribution with a large number of trials in an interval. Lesson 37.

D: That is correct. Now, do you remember how the Poisson distribution is connected to the Exponential distribution?

J: Yes. The Poisson distribution represents the number of events in an interval of time, and the exponential distribution represents the time between these events.

The time to arrival exceeds some value t, only if N = 0 within t, i.e., if there are no events in an interval [0, t].

P(T > t) = P(N = 0) = \frac{e^{-\lambda t}(\lambda t)^{0}}{0!} = e^{-\lambda t}

The probability density function f(t) is

f(t) = \frac{d}{dt}F(t) = \frac{d}{dt}(1-e^{-\lambda t}) = \lambda e^{-\lambda t}

Let me keep going.

In the discrete sense, if you measure the number of trials it takes to see the next 1 in the Bernoulli sequence, then it is a Geometric distribution. Lesson 33.

P(1^{st} \text{ 1 on } k^{th} \text{ trial}) = (1-p)^{k-1}p

The exponential distribution is the continuous analog of the discrete Geometric distribution.

Sticking to that thought, if we measure the trials or time to r^{th} 1 or event, we are looking at the Negative Binomial distribution (Lesson 35) for discrete and Gamma distribution (Lesson 45) for the continuous case.

P(r^{th} \text{ 1 on } k^{th} \text{ trial}) = \binom{k-1}{r-1}(1-p)^{k-r}p^{r} for the Negative Binomial, and

 f(t) = \frac{\lambda e^{-\lambda t}(\lambda t)^{r-1}}{(r-1)!} for the Gamma distribution.

The time to ‘r’th arrival T_{r} = t_{1} + t_{2} + t_{3} + ... + t_{r} , (sum of exponentials) and we derived the probability density function (Gamma) of T_{r} using the convolution of the individual exponential random variables  t_{1}, t_{2}, ... t_{r}.

Like the Geometric and Exponential, the Negative Binomial and Gamma are discrete and continuous analogs.
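A quick simulation sketch of that convolution statement (\lambda = 1/2 and r = 3 are arbitrary choices): the sum of r exponential wait times should follow the Gamma density above.

# Sum of r exponential wait times vs. the Gamma density #
lambda = 0.5
r = 3
t_r = replicate(10000, sum(rexp(r, rate = lambda)))

hist(t_r, prob = T, breaks = 50, main = "", font = 2, font.lab = 2,
     xlab = "Time to the rth arrival")
t = seq(0, max(t_r), by = 0.1)
lines(t, lambda*exp(-lambda*t)*(lambda*t)^(r-1)/factorial(r-1), col = "red", lwd = 2)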

D: And, what did we learn last week about Chi-square?

J: That its probability density function is a Gamma density function with \lambda=1/2 and r=n/2.

f(\chi)=\frac{\frac{1}{2}*(\frac{1}{2} \chi)^{\frac{n}{2}-1}*e^{-\frac{1}{2}*\chi}}{(\frac{n}{2}-1)!} for \chi > 0 and 0 otherwise.

Wow. 

D: See how it all comes together. Bernoulli gene is running through the distribution line to connect them.

J: Yes. I can see that very clearly, now that I walked through each one in this order. Thank you for creating this fantastic graphic. It helps me in remembering the distributions.

D: You are welcome. Anything to make data analysis easy and fun.

J: What is up next? I am filled with new energy.

 

Dust off your computer. We will learn some tricks in R for the continuous distributions. Time to crank some code.

 

 

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 53 – Sum of squares: The language of Chi-square distribution

The conversations with people at the local bar last night were the usual local and national politics. I usually join the pool table after a casual drink. Yesterday, upon the request of an old friend, I toyed with the game of darts.

In my heyday, I used to be good at this, so without much trouble, I could hit the bullseye after getting my sight.

Maybe it is the draught that did the trick. I felt jubilant.

A couple of tequilas and a few minutes later, I hoped to beat my previous record.

“Did you change the darts?”, “There’s something in my eye,” “Maybe they moved the target.” These were circling in my mind after my performance.

They did not move the target, they did not change the darts, but there was indeed something in my head. I decided to stop playing the next round and go home, lest someone else becomes the bullseye.

While heading back, I thought about how far off I was from the bullseye (origin). On the X-Y coordinate plane, the distance from origin is R=\sqrt{X^{2}+Y^{2}} .

The squared distance is R^{2}=X^{2}+Y^{2}.

It varies with each try. Random variable.

Today morning, while catching up on my news feed, I noticed this interesting article.

The article starts with a conversation between fans.

People bet millions on heads vs. tails. 

“…the toss is totally biased.” It reminded me of the famous article by Karl Pearson, “Science and Monte Carlo” in the fortnightly review published in 1894. He experimentally proved that the roulette wheels in Monte Carlo were biased.

He computed the error (observed – theoretical) (or squared error) and showed that the roulette outcomes in Monte Carlo could not happen by chance. He explained this based on two sets of 16,500 recorded throws of the ball in a roulette wheel during an eight week period in summer of 1892, manually, I may add. Imagine counting up 33,000 numbers, and performing various sets of hand calculations.

He also writes that he spent his vacation tossing a shilling 25,000 times for a coin-toss bias experiment.

These experiments led to the development of the relative squared error metric, the Pearson’s cumulative test statistic (Chi-square test statistic).

Some excerpts from his concluding paragraph.

“To sum up, then: Monte Carlo roulette, if judged by returns which are published apparently with the sanction of the Societe, is, if the laws of chance rule, from the standpoint of exact science the most prodigious miracle of the nineteenth century.

we are forced to accept as alternative that the random spinning of a roulette manufactured and daily readjusted with extraordinary care is not obedient to the laws of chance, but is chaotic in its manifestation!

By now you might be thinking, “What’s the point here. Maybe it’s his hangover writing.”

Hmm, the square talk is for the Chi-square distribution.

Last week, we visited Mumble’s office. He has transformed the data — log-normal distribution.

If Z follows a normal distribution, then, X = e^{Z} is a log-normal distribution because log of X is normal.

Exponentiating a normal distribution will result in log-normal distribution.

See for yourself how Z \sim N(0,1) transforms into X, a log-normal distribution. X is non-negative.

Similarly, squaring a normal distribution will result in a Chi-square distribution.

If Z follows a normal distribution, then, \chi = Z^{2} is a Chi-square distribution with one degree of freedom.

See how the same Z \sim N(0,1) transforms into \chi, a Chi-square distribution, again non-negative, since we are squaring.

Let’s derive the probability distribution function for the Chi-square distribution using the fact that \chi = Z^{2}. We will assume Z is a standard normal distribution Z \sim N(0,1).

For this, we will start with the cumulative distribution function, F(\chi) and take its derivative to obtain the probability distribution function f(\chi) since f(\chi)=\frac{d}{d\chi}F(\chi).

\chi = Z^{2}

Let F_{\chi}(\chi) denote the cumulative distribution function of \chi.

F_{\chi}(\chi) = P(Z^{2} \le \chi)

=P(-\sqrt{\chi} \le Z \le \sqrt{\chi})

=P(Z \le \sqrt{\chi}) - P(Z \le -\sqrt{\chi})

F_{\chi}(\chi) =F_{Z}(\sqrt{\chi}) - F_{Z}(-\sqrt{\chi})

f_{\chi}(\chi)=\frac{d}{d\chi}(F_{\chi}(\chi))

=\frac{d}{d\chi}(F_{Z}(\sqrt{\chi}) - F_{Z}(-\sqrt{\chi}))

Applying the fundamental theorem of calculus and chain rule together, we get,

=f_{Z}(\sqrt{\chi})*\frac{1}{2\sqrt{\chi}} + f_{Z}(-\sqrt{\chi})*\frac{1}{2\sqrt{\chi}}

=\frac{1}{2\sqrt{\chi}}*(f_{Z}(\sqrt{\chi})+f_{Z}(-\sqrt{\chi}))

By now, you are familiar with the probability distribution function for Z. f_{Z}(z) = \frac{1}{\sqrt{2\pi}}e^{\frac{-z^{2}}{2}}. Let’s use this.

f_{\chi}(\chi)=\frac{1}{2\sqrt{\chi}}*(\frac{1}{\sqrt{2\pi}}e^{\frac{-(\sqrt{\chi})^{2}}{2}}+\frac{1}{\sqrt{2\pi}}e^{\frac{-(-\sqrt{\chi})^{2}}{2}})

f_{\chi}(\chi)=\frac{1}{\sqrt{2 \pi \chi}}e^{\frac{-\chi}{2}} for \chi > 0.

As you can see, the function is only defined for \chi > 0. It is 0 otherwise.

With some careful observation, you can tell that this function is the Gamma density function with \lambda=\frac{1}{2} and r = \frac{1}{2}.

Yes, I know, it is not very obvious. Let me rearrange it for you, and you will see the pattern.

f(\chi) = \frac{1}{\sqrt{2 \pi \chi}}e^{-\frac{1}{2}\chi}

=\sqrt{\frac{1}{2}}*\sqrt{\frac{1}{\chi}}*\frac{1}{\sqrt{\pi}}*e^{-\frac{1}{2}\chi}

=\frac{(1/2)^{1/2}(\chi)^{-1/2}e^{-\frac{1}{2}\chi}}{\sqrt{\pi}}

Multiply and divide by 1/2.

=\frac{(1/2)*(1/2)^{1/2}*(\chi)^{-1/2}*e^{-\frac{1}{2}\chi}}{(1/2)\sqrt{\pi}}

=\frac{(1/2)*(1/2)^{-1/2}*(\chi)^{-1/2}*e^{-\frac{1}{2}\chi}}{\sqrt{\pi}}

If \lambda=1/2

f(\chi)=\frac{\lambda*(\lambda \chi)^{-1/2}*e^{-\lambda*\chi}}{\sqrt{\pi}}

Drawing from the generalized factorial concepts, we can replace \sqrt{\pi} with (-\frac{1}{2})!, which is (r-1)! with r = 1/2.

f(\chi)=\frac{\lambda*(\lambda \chi)^{r-1}*e^{-\lambda*\chi}}{(r-1)!}

This equation, as you know is the density function for the Gamma distribution.

So, Chi-square density function is Gamma density function with \lambda=1/2 and r=1/2. It is a special case of the Gamma distribution.

Now let’s up a level. Sum of squares of two standard normals, like our squared distance (X^{2}+Y^{2}).

\chi = Z_{1}^{2} + Z_{2}^{2}.

We know from lesson 46 on convolution that if X and Y are two independent random variables with probability density functions f_{X}(x) and f_{Y}(y), their sum Z = X + Y is a random variable with a probability density function f_{Z}(z) that is the convolution of f_{X}(x) and f_{Y}(y).

f(z) = f_{X}*f_{Y}(z) = \int_{-\infty}^{\infty}f_{X}(x)f_{Y}(z-x)dx

We can use the principles of convolution to derive the probability density function for \chi = Z_{1}^{2} + Z_{2}^{2}.

Let’s assume Z_{1}^{2}=k. Then, Z_{2}^{2}=\chi - k, and

f_{\chi}(\chi)=\int_{-\infty}^{\infty}f_{Z_{1}^{2}}(k)f_{Z_{2}^{2}}(\chi - k)dk

f_{Z_{1}^{2}} = \frac{1}{\sqrt{2 \pi k}}e^{-\frac{k}{2}}. This is the same function we derived above for \chi = Z^{2}. Using this,

f_{\chi}(\chi)=\int_{-\infty}^{\infty}\frac{1}{\sqrt{2 \pi k}}e^{-\frac{k}{2}}\frac{1}{\sqrt{2 \pi (\chi - k)}}e^{-\frac{(\chi - k)}{2}}dk

Since the density of Z_{1}^{2} is zero for k \le 0 and the density of Z_{2}^{2} is zero for k \ge \chi, the integral reduces to the interval [0, \chi].

=\frac{1}{2\pi}\int_{0}^{\chi}k^{-\frac{1}{2}}e^{-\frac{k}{2}}(\chi - k)^{-\frac{1}{2}}e^{-\frac{(\chi - k)}{2}}dk

=\frac{1}{2\pi}e^{-\frac{\chi}{2}}\int_{0}^{\chi}k^{-\frac{1}{2}}(\chi - k)^{-\frac{1}{2}}dk

The integrand has antiderivative 2\arcsin(\frac{\sqrt{k}}{\sqrt{\chi}}), so the definite integral from 0 to \chi is \pi since arcsin(1) = \frac{\pi}{2}. Try it for yourself. Put your calculus classes to practice.

We are left with

f_{\chi}(\chi) = \frac{1}{2}e^{-\frac{\chi}{2}}, again for \chi > 0.

This function is a Gamma distribution with \lambda = \frac{1}{2} and r = 1.

Generalization for n random normal variables

If there are n standard normal random variables, Z_{1}, Z_{2}, ..., Z_{n}, their sum of squares is a Chi-square distribution with n degrees of freedom.

\chi = Z_{1}^{2}+Z_{2}^{2}+ ... + Z_{n}^{2}

Its probability density function is a Gamma density function with \lambda=1/2 and r=n/2. You can derive it by induction.

f(\chi)=\frac{\frac{1}{2}*(\frac{1}{2} \chi)^{\frac{n}{2}-1}*e^{-\frac{1}{2}*\chi}}{(\frac{n}{2}-1)!} for \chi > 0 and 0 otherwise.
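A quick simulation check (n = 5 is an arbitrary choice): the sum of squares of n standard normals should match this Gamma density, which is R’s built-in dchisq with n degrees of freedom.

# Sum of squares of n standard normals vs. the Chi-square (Gamma) density #
n = 5
chi = replicate(10000, sum(rnorm(n)^2))

hist(chi, prob = T, breaks = 50, main = "", font = 2, font.lab = 2,
     xlab = "Sum of squares of n standard normals")
x = seq(0, max(chi), by = 0.1)
lines(x, dchisq(x, df = n), col = "red", lwd = 2)
lines(x, dgamma(x, shape = n/2, rate = 1/2), col = "blue", lty = 2, lwd = 2) # same curve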

Look at this animation for Chi-square distribution with different degrees of freedom. See how it becomes more symmetric for large values of n (degrees of freedom).

We will revisit the Chi-square distribution when we learn hypotheses testing. Till then, the sum of squares and error square should remind you of Chi-square [Square – Square], and tequila square should remind you of

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 52 – Transformation: The language of lognormal distribution

A cluttered desk with scattered papers, piles of binders, and an open reference textbook with a pencil on top welcomes Mumble to his day at work.

“Clean up the workspace,” said Mumble in angst, staring at the checklist for the day. It was that time of February when his last New Year resolution of having an ambient workspace had already taken a backburner.

He has two essential data analysis tasks and two meetings for the day. Besides, he set himself two chores. If you think Mumble is doing his miscellaneous tasks at work, you are mistaken. He is like you and me when it comes to writing checklists.

He begins his first task: Analyze flood damage data.

The 2017 flood insurance claims are starting to come in. Mumble, who was the chief designer of the product, was assigned the task of corroborating the flood damage claims with the reported flood damage data. He was also assigned the task of estimating the probability of damage exceeding a certain threshold set for the year by the company.

“Open Dartmouth Flood Observatory Database. They will have the archive for recent floods, the area affected and people displaced. I can use that data to verify if the claims from these areas are valid or not,” thought Mumble as he types his password to log in.

“Time for some coding and analysis. The best part of my day. Fire up RStudio. Check. Read the file into R workspace. Check.” He completes the essentials and is now ready to analyze the data. You can also look at the data if you want.

“Let me first filter the database to select and plot only the floods for 2017.” RStudio lets him make the map too.

“Aha. The locations match the countries from where we got the claims. So let me look at the data for people displaced and develop the probability distribution function for it.” He quickly types up a few lines of code as he sips his coffee.

“There are 123 events,” he thinks as he scrolls down his monitor. “Some of these floods must be minor. There are 0’s, no people displaced in 40 such events. There are some huge numbers though. Possibly created a lot of damage. Let me plot them to see their arrangement on the number line.”

“They do not look symmetric. The few large numbers are skewing the data. I wonder how the frequency plot will look.”

“Ugly. Not sure what probability distribution fits this data. Those large numbers are dominating, but they are also most of the insurance claims.” He took another sip, leaned back, and with his hands at the back of his head, stared at the ceiling.

After about 30 seconds, it struck him that the scale of the observations is obscuring the pattern in the data.

“What if I transform the data? If I re-express the data on a different scale, I can perhaps alter the distances between the points. Log scale?”

So he applied log transformation y = log(x) and plotted their frequency.
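
In R, the whole experiment is a couple of lines. A minimal sketch, assuming the people-displaced counts sit in a vector called displaced (my name, not necessarily the column name in the Dartmouth database); the zeros are dropped because log(0) is undefined:

displaced_positive <- displaced[displaced > 0]    # 40 events had zero people displaced
par(mfrow = c(1, 2))
hist(displaced_positive, main = "Original scale", xlab = "People displaced")
hist(log(displaced_positive), main = "Log scale", xlab = "log(People displaced)")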

“Amazing. A simple mathematical transformation can make a skewed distribution more symmetric. The application of “logarithm” has shrunk the large numbers on the right side and moved them closer to the center. The frequency plot looks like a normal distribution, ” he thought, as he typed another line into the code to update the plot.

He realized that the log of the data is a normal distribution. “Lognormal distribution,” he said.

If Y = log(X) is a normal distribution with mean \mu_{y} and standard deviation \sigma_{y}, then X follows a lognormal distribution with a probability density function

f(x) = \frac{1}{x} \frac{1}{\sqrt{2\pi\sigma_{y}^{2}}} e^{\frac{-1}{2}(\frac{ln(x)-\mu_{y}}{\sigma_{y}})^{2}}

Remember, \mu_{y} and \sigma_{y} are the mean and standard deviation of the transformed variable Y.

The parameters (mean \mu_{x} and the standard deviation \sigma_{x}) of X are related to \mu_{y} and \sigma_{y}.

\mu_{x}=e^{\mu_{y}+\frac{\sigma_{y}^{2}}{2}}

\sigma_{x}^{2}=\mu_{x}^{2}(e^{\sigma_{y}^{2}}-1)

From \sigma_{x}^{2}=\mu_{x}^{2}(e^{\sigma_{y}^{2}}-1), we can also deduce that \sigma_{y}^{2}=ln(1+CV_{x}^{2}). CV_{x} is the coefficient of variation of the lognormal variable X. You may recall from Lesson 20 that coefficient of variation is the ratio of the standard deviation and the mean. CV_{x}=\frac{\sigma_{x}}{\mu_{x}}.
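
A small numerical check of these relations, with made-up values for \mu_{y} and \sigma_{y} (illustrative only, not estimates from the flood data):

mu_y <- 10     # mean of Y = log(X); illustrative value
sd_y <- 1.5    # standard deviation of Y = log(X); illustrative value
mu_x  <- exp(mu_y + sd_y^2/2)         # mean of the lognormal variable X
var_x <- mu_x^2 * (exp(sd_y^2) - 1)   # variance of X
cv_x  <- sqrt(var_x) / mu_x           # coefficient of variation of X
sqrt(log(1 + cv_x^2))                 # recovers sd_y = 1.5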

You must have noticed already that the probabilities can be computed using the transformed variable (Y) and the usual method of standard normal distribution, i.e., convert y to z using z = \frac{y-\mu_{y}}{\sigma_{y}}=\frac{ln(x)-\mu_{y}}{\sigma_{y}}

P(X \le x)=P(e^{Y} \le x)=P(Y \le ln(x))=P(Z \le \frac{ln(x)-\mu_{y}}{\sigma_{y}})

“So the lognormal distribution can be used to model skewed and non-negative data.” He sits up in his chair excited, types a few lines of code with enthusiasm, and writes down the following in his notepad for the meeting later.

 P(Displaced > 300,000) = 0.0256

Can you tell how?
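
Here is one way Mumble could have arrived at a number like that, sketched with the same assumed vector displaced as before; the exact value of 0.0256 depends on the 2017 data he used:

y <- log(displaced[displaced > 0])   # work with the log-transformed counts
mu_y <- mean(y)                      # mean of the transformed variable
sd_y <- sd(y)                        # standard deviation of the transformed variable
1 - pnorm((log(300000) - mu_y) / sd_y)             # P(Displaced > 300,000) via the Z-scale
1 - plnorm(300000, meanlog = mu_y, sdlog = sd_y)   # same answer with R's lognormal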

He then plots the flood-affected area using similar plots.

“Hmm, the original data is skewed with a large affected area to the right. But the log-transformed data is not symmetric. It looks like there are numbers to the left that are far away from the rest of the log-transformed data.” He keeps staring at his screen as Aaron knocks on the door. “Meet now?”

“Sure.” Mumble walks out with his notepad and a few other things. The flood-affected area plot is still pestering him, but he needs to take care of other issues for the day.

“Maybe the logarithm is just one type of transformation. Maybe there are other transformations?” He keeps thinking during his meetings.

As his 3 pm meeting comes to an end, he still has to do quality checks on the claims data. As usual, his checklist stays incomplete. Filing taxes just got moved to the next day. He was, however, resolute to catch up on Transformers before Bumblebee hits the theaters this year. So Transformers is on for the evening. It is timely, as he transforms the data.


Lesson 51 – Sometimes it is important to let the data speak

On the fifth day of February 2017, I embarked upon this journey to create a platform to make data analysis easy for at least a million people. When I began, and even now, I am no expert in this field. Over the years, I have learned several concepts from my mentors and other masters, and I believe I am only a conduit for sharing those with you.

Today, the fourth day of February 2018 marks the one-year milestone for us. We crossed 50,000 page views from more than 10,000 users across 136 countries. I hope my mission is underway and I have created interest in you towards data analysis 🙂

As a landmark event, I asked my mentor, Prof. Upmanu Lall from Columbia University in the City of New York to give us a lesson. He was gracious enough to agree and has taken time out of his ever busy schedule to teach us that “sometimes it is important to let the data speak.”

The man who has pioneered non-parametric methods for hydrology himself opens us to kernel density, why and what.

So, here is his gift on the occasion of our celebration of one-year together; or should I say,

Sometimes it is important to let the Master speak.

Sometimes it is important to let the data speak…

by Prof. Upmanu Lall

Founded in 1670, Charleston is the oldest and largest city in S. Carolina. For such a historic city and the surrounding community, a reliable water supply is critical. Jill is the city manager. Charleston has had a series of droughts in recent years, and the City Council wants to know how unusual these are, so they can think about investments to manage or capture water in the wettest months and have it available in the drier months.

Jill asked Jack, the comptroller, to come up with a presentation so they could understand the situation. Jack was able to locate monthly rainfall data going back to 1893, and with over 100 years of data, he felt that he could indeed present some ideas. After looking over the data, he first plotted a wet month (June) and a dry month (January) to see how the rain changed across years.

A remarkable range of variation shows up especially in June, with many years with nearly zero rain, and others with as much as 15 inches of rain. On the other hand, the rainfall in January rarely exceeds 5 inches.

Jack first asked the question, “How much rain can you count on in each of these months in 10, 25, 50, 75, or 90% of the years?” The idea was to see how this compares with the amount of rain that would need to fall in each of the months to meet the water needs that occur in that month.

This demand could change depending on how many people live in Charleston, and the collection area for water supply. Since the council could come up with different assumptions as to the future population or demand, Jack thought it would be wise to present his arguments using the rainfall needed per person, and the rainfall available per person.

Jack’s first take on this question was to pull together the January (June) data and re-arrange it from the smallest to the largest value for that month, using the 1893-2016 data. These are 124 values arranged from the lowest to the highest. Sorting the data this way let Jack identify the driest and the wettest years at a glance, and also quickly do the calculation he was looking for.

Since Jack is R savvy, he actually just did this operation in R, using the quantile function, which reports the quantiles or percentiles of the data.
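
Something along these lines, assuming the monthly totals are in vectors jan_rain and june_rain (my names):

probs <- c(0.10, 0.25, 0.50, 0.75, 0.90)
quantile(jan_rain,  probs = probs)    # January percentiles, in inches
quantile(june_rain, probs = probs)    # June percentiles, in inches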

Looking at this, Jack could see that 10% of the time the rain could be less than 0.7 (1.47) inches and as much as 3.99 (6.79) inches in January (June).

He immediately wondered whether the June rainfall was likely to be low (high) in the same year that the January rainfall was high(low). Looking at their plot, it looked highly unlikely that there was a pattern. The calculated correlation was -0.03 confirming the plot.

Since presenting tables of numbers to a city council is never too exciting, Jack decided to present his findings graphically.

Again, being R savvy, he plotted the cumulative distribution function for the rain for each month.

The cumulative distribution function is just an estimate of the fraction of observations that are smaller than a given number. For instance, for January (June), if we were curious about the rainfall amount such that 50% of the values, or a fraction of 0.5, were smaller than it, we could look that up at a cumulative distribution function value of 0.5 and find that it is a rainfall amount of 1.95 (3.48) inches of rain.

The cumulative distribution function is then just a plot of the quantiles.

We see this in the figure below. For each of the quantiles Jack had calculated, he dropped blue (red) lines to show the corresponding rainfall for January (June). It is then easy to see that for each quantile the rain in January is less than the rain in June – both from the vertical lines and the cumulative distribution function curves.
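
With the same assumed vectors, such a figure can be drawn from the empirical cumulative distribution functions; a rough sketch:

plot(ecdf(jan_rain), col = "blue", main = "Cumulative distribution of monthly rain",
     xlab = "Rain (inches)", ylab = "Fraction of years below")
lines(ecdf(june_rain), col = "red")
quantile(jan_rain, 0.5)    # the January rain with a cumulative value of 0.5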

Now refer back to Lesson 50. There the symmetry of the Normal distribution was talked about, and the area under the Normal density function between two rainfall values was interpreted as the percentage of time the values could fall in that range. We see some parallels.

If we are interested in the Probability that the January rainfall is less than 1.95 (P(Rain_{january} \le 1.95)), we see that this is 0.5, and for the Normal distribution this would be the area from –infinity (or really 0) rain to 1.95.

We also saw that the Normal distribution is symmetric. This would suggest that the area within a given distance to the left of the mean (for the normal distribution, the median, or 50th percentile, is the same as the mean, since it is a symmetric function) is the same as the area within the same distance to the right of the mean. Does that seem to hold for the Charleston rainfall?

These don’t really look symmetric.

Normally, the reason to use a model such as the Normal distribution is that we don’t have much data (e.g., 30 years). If I am interested in a 100-year event (i.e., something that would occur on average once in 100 years), then unless I fit a probability model to the data, I really cannot say anything about a 100-year event.

So, Jack wanted to make sure that even though he had 124 years of data, he was not challenged by a statistically savvy city council member for not fitting a probability distribution model, which, if it is the right one, may be more reliable.

So far, it doesn’t look promising to use a Normal distribution model. This is not a surprise since rainfall cannot be negative and we have so many years in which it is close to zero. In fact, the average rain in January (June) is 2.24 (4.04) inches compared to a maximum of 6.29 (16.29) inches over the 124 years. So, the data is definitely not symmetric, and the Normal distribution would not be a good choice.

Sometimes, people will argue that the logs of the data could be Normally distributed, since after taking logs, small positive values can be negative, so Jack thought he would check that out too. Here is what he found after taking logs of the rainfall data.

Now there was a bit of a dilemma. Comparing Min, Mean and Max, it seems that the data is not at all symmetric.

(Mean-Min) = 2.42 (62) vs (Max-Mean) = 1.26 (1.62).

But the min and the max may not be the best way to look at this, since these are rare values and may not be symmetric if just one anomalous value happened to be recorded. So Jack also checked the symmetry across the quantiles.

Still not quite there 🙄

Finally, just to reassure himself, Jack used R to plot what the data for each month looked like along with the best fit of a Normal distribution, an exponential distribution (recommended for rainfall in a book), and also a kernel density plot.

The monthly rainfall values are plotted in red; the exponential distribution fit in blue; the normal distribution fit in green; and the kernel density plot in black.

For the original scale

For the LOG scale 

Looking at these plots for the original and the log-transformed data, Jack was quite clear that the normal and the exponential models were very different from the kernel density plot. If the kernel density represents what the probability density of the data really looks like, then one could not use the normal or the exponential model with the original data or the logs.
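
One way such a comparison figure can be produced, again with the assumed vector jan_rain; the exponential fit uses the standard rate of 1/mean, and the same recipe applies to June and to the logs:

hist(jan_rain, freq = FALSE, border = "red",
     main = "January rain", xlab = "Rain (inches)")                  # the data, in red
curve(dexp(x, rate = 1/mean(jan_rain)), add = TRUE, col = "blue")    # exponential fit
curve(dnorm(x, mean = mean(jan_rain), sd = sd(jan_rain)),
      add = TRUE, col = "green")                                     # normal fit
lines(density(jan_rain), col = "black")                              # kernel density plot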

So what is a kernel density plot?

Aha – this needs a few lessons to master, but for now, think of it as a plot where the probability density of the variable – rainfall in our case – is estimated for any rainfall value (say rain = 3″) by looking at how many rainfall observations fell within a window of a given size (e.g., 1″) centered at the value of interest (3″).

So, the idea is that if a lot of values are close to 3″ then the probability density is high there, and conversely, if there are very few values then the probability density is low there. The result will clearly depend on the size of the window.

Also, observations that are far from the target should perhaps not be considered representative of the current value and should be down-weighted.

The weight or kernel function is then a function of the distance from the target. A common kernel function is the Normal distribution centered at each point at which an estimate is desired, with a certain specified standard deviation which functions as a window width.

For instance to estimate the density at rain =3″, we may use a Normal density function with mean equal to 3 and a standard deviation of 1. The density is then computed as the sum of the contribution of all observations at that point using this kernel.

f(x)=\frac{1}{nh}\sum_{i=1}^{n}K(\frac{x-x_{i}}{h})

u=\frac{x-x_{i}}{h}

K(u) = \frac{1}{\sqrt{2\pi}} e^{-\frac{u^{2}}{2}}

You can see that the kernel function K(u) looks like a Normal distribution with mean 0 and standard deviation equal to 1. Of course, comparing with Lesson 50, looking at the definition of u it is easy to see that the mean is x_{i} and the standard deviation is h.

Another way to interpret these equations is that the kernel density estimate at x is the weighted contribution of all observations x_{i} as being representative of what we expect at x. The weights are from the Normal distribution centered at x.

Thus, this is like doing a weighted moving average that is computed by moving from one value of x to another. As we can see from the figures for Charleston, the result does not look like a Normal distribution that is fit to the full data.
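
To make the equations concrete, here is a bare-bones Gaussian-kernel estimate written directly from the formula above and compared with R’s density() function; the window width h = 1″ is the arbitrary choice used in the discussion:

kde_at <- function(x_grid, obs, h) {
  sapply(x_grid, function(x) {
    u <- (x - obs) / h          # distances from the target point, in units of h
    mean(dnorm(u)) / h          # (1/(n*h)) * sum of K(u) over all observations
  })
}
x_grid <- seq(0, 16, by = 0.1)
f_hat  <- kde_at(x_grid, june_rain, h = 1)    # density of June rain, 1-inch window
plot(x_grid, f_hat, type = "l")
lines(density(june_rain, bw = 1, kernel = "gaussian"), col = "gray")   # built-in version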

More on this later…


Lesson 50 – The Standard Normal

Tom works in FedEx. He is responsible for weighing the parcels and estimating the cost of shipping. He verified the previous data on daily packages and found that the probability distribution resembled a normal distribution with a mean of 12 lbs and standard deviation of 3.5 lbs. There is an extra charge if the package weighs more than 20 lbs. Tom wants to know the probability that the parcels he weighs are within 20 lbs.

 P(X \le 20) = \int_{-\infty}^{20}\frac{1}{\sqrt{2 \pi 3.5^{2}}} e^{\frac{-1}{2}(\frac{x-12}{3.5})^{2}}dx

Perhaps you can help Tom. Did you solve the integral of the normal density function from last week?

Dick is excited to start his first job as a construction site engineer for H&H Constructions. He was told that the pH of the soil on this site follows a normal distribution with a mean pH of 6 and a standard deviation of 0.1. Yesterday he collected a soil sample with the following question in mind: “What is the probability that the pH of this soil sample is between 5.90 and 6.15?”

 P(5.90 \le X \le 6.15) = \int_{5.90}^{6.15}\frac{1}{\sqrt{2 \pi 0.1^{2}}} e^{\frac{-1}{2}(\frac{x-6}{0.1})^{2}}dx

Can you answer Dick’s question? Did you solve the integral of the normal density function from last week?

Harry, a regular reader of our blog, is new at the Hamilton Grange. He has an idea to attract customers on Saturday. He suggested offering a free drink to anyone not seated within  x minutes. Mr. Hamilton thought this was a cool idea and agreed to offer free drinks to 1% of the customers.

Can you work with Harry to compute x, the wait time, after which the customers get a free drink?

Before you help Tom, Dick, and Harry, what is your experience trying to solve the integral  P(X \le x) = \int_{-\infty}^{x}\frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{\frac{-1}{2}(\frac{t-\mu}{\sigma})^{2}}dt ?

Could you not solve it?

Hmm, don’t feel bad. This integral has no closed-form solution.

How then can we compute these probabilities?

I am sure you are now thinking about some form of a numerical integration method for the definite integral. Maybe the trapezoidal method.

Maybe. That is the point. You can approximate the integral with reasonable accuracy.
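
For instance, Tom’s probability can be approximated with nothing more than the trapezoidal rule applied to the normal density; a rough sketch, with the lower limit of -\infty replaced by a point far in the left tail:

mu <- 12; sigma <- 3.5
x  <- seq(mu - 8*sigma, 20, length.out = 10000)        # start far into the left tail
fx <- (1/sqrt(2*pi*sigma^2)) * exp(-0.5*((x - mu)/sigma)^2)
dx <- diff(x)
sum(dx * (head(fx, -1) + tail(fx, -1)) / 2)            # ~ 0.989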

But don’t you think it is a bit tedious to do this each time there is a normal distribution problem involved?

Enter Z, the Standard Normal Distribution

Let’s travel back to lesson 28 where we learned standardizing the data.

We can move the distribution from its original scale to a new scale, the Z-scale.

This process of standardization can be achieved by subtracting the mean of the data from the distribution and dividing by the standard deviation. We have seen that when we subtract the mean and divide by the standard deviation, the expected value and the variance of the new standardized variable are 0 and 1.

Z = \frac{X - \mu}{\sigma}

Z = \frac{X}{\sigma} - \frac{\mu}{\sigma}

Let’s find the expected value and the variance of this new random variable Z.

E[Z] = E[\frac{X}{\sigma}] - E[\frac{\mu}{\sigma}]

E[Z] = \frac{1}{\sigma}E[X] - E[\frac{\mu}{\sigma}]

E[Z] = \frac{\mu}{\sigma} - \frac{\mu}{\sigma}

E[Z] = 0

The expected value of the new standardized variable Z is 0. In other words, it is centered on 0.

Now, for the variance.

V[Z] = V[\frac{X}{\sigma}] + V[\frac{\mu}{\sigma}]

V[Z] = \frac{1}{\sigma^{2}}V[X] + 0

V[Z] = \frac{1}{\sigma^{2}}\sigma^{2}

V[Z] = 1

The variance of the new standardized variable Z is 1. The standard deviation is 1.

The Standard Normal Z has a mean of 0 and a standard deviation of 1.
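
If you want to see this at work rather than take the algebra on faith, simulate any normal variable, standardize it, and check the two moments; a quick sanity check using Tom’s parameters:

set.seed(1)
x <- rnorm(100000, mean = 12, sd = 3.5)   # simulated package weights
z <- (x - 12) / 3.5                       # standardize with the known mean and sd
mean(z)   # ~ 0
sd(z)     # ~ 1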

We just removed the influence of the location (center) and spread (standard deviation) from the original distribution. We are moving it from the original scale to the new Z-scale.

What are the units of Z?

Tom’s distribution can be moved like this.

Dick’s and Harry’s normal distribution will also look like this Z after transforming.

Subtracting the mean will give anomalies, i.e., differences from the mean centered on zero. Positive anomalies are the values greater than the mean, and negative anomalies are the values less than the mean.

Dividing by the standard deviation will provide a scaling factor; unit standard deviation for Z.

A weight of 15.5 lbs will have a Z-score of 1; 12 + 1*(3.5).

A package with a weight of 8.5 lbs will have a Z-score of -1; 12 – 1*(3.5).

Hence, the standardized scores (Z-scores) are the distance measures between the original data value and the mean, the units being the standard deviation. A package with a weight of 15 lbs is 0.857 standard deviations right of the mean 12 lbs.

Now, look at Tom’s question. What is the probability that the parcels he weighs are within 20 lbs?

P(X \le 20)

Let’s subtract the mean and divide X by its standard deviation.

P(X \le 20) = P(\frac{X-\mu}{\sigma} \le \frac{20-12}{3.5})

P(X \le 20) = P(Z \le 2.285)

Dick’s question. What is the probability that the pH of this soil sample is between 5.90 and 6.15?

P(5.90 \le X \le 6.15) = P(\frac{5.90-6}{0.1} \le \frac{X-\mu}{\sigma} \le \frac{6.15-6}{0.1})

P(5.90 \le X \le 6.15) = P(-1 \le Z \le 1.5)

P(5.90 \le X \le 6.15) = P(Z \le 1.5) - P(Z \le -1)

Harry’s problem. Compute x, the wait time, after which 1% of the customers get a free drink.

P(X \ge x) = 1\% = 0.01

Harry knows that on Saturday night, a customer’s wait time X for a table is normally distributed with a mean of 20 minutes and a standard deviation of 3.876 minutes.

P(\frac{X-\mu}{\sigma} \ge \frac{x-20}{3.876}) = 1\% = 0.01

P(Z \ge \frac{x-20}{3.876}) = 1\% = 0.01

For Tom and Dick, we are computing the probability. In the case of Harry, we are computing the value for x that will give a specific probability.

You must have observed that the common thread in all the cases is its transformation to the standard normal (Z).

In all the cases, it is sufficient to approximate the integral of Z numerically.

P(X \le 20) = P(Z \le 2.285) = \int_{-\infty}^{2.285}\frac{1}{\sqrt{2 \pi}} e^{\frac{-1}{2}(z)^{2}}dz

The standard normal tables you find in most appendices of statistics textbooks or online are the numerical integral approximations of the standard normal distribution Z. An example is shown here.

The first column z represents the standardized normal variable (z) and the columns after that represent the probability  P(Z \le z). Notice that at z = 0, P(Z \le z) = 0.5 indicating that  P(Z \le 0) is 50%.

The probability ranges from about 0.0003 to 0.9997 as z goes from -3.4 to 3.4, i.e., as we move from the left of the scale to the right. The first column is the random variable z, and the subsequent columns are the probabilities or the areas corresponding to the values of z. This animation will make it clear.

Look at this example on how to read the probability for a z-score of 2.28, Tom’s case.

P(X \le 20) = P(Z \le 2.285) = 0.9887 .

Dick’s question. What is the probability that the pH of this soil sample is between 5.90 and 6.15?

Answer: 0.7745. Did you check?

Harry has to equate \frac{x-20}{3.876} to the Z-score that gives a probability of 0.99.

P(Z \ge \frac{x-20}{3.876}) = 1\% = 0.01

P(Z \le \frac{x-20}{3.876}) = 1 - 0.01 = 0.99

\frac{x-20}{3.876} = 2.33.

Read from the table. The probability for a Z-score of 2.33 is 0.9901.

x = 20 + 2.33*3.876 = 29.03

You get a free drink if your seating wait time is greater than 29 minutes. 😉 Get the drink and continue to wait. It’s Saturday night on Amsterdam Avenue.
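
If you have R at hand, pnorm and qnorm reproduce the table lookups directly; a quick cross-check of the three answers above:

pnorm(20, mean = 12, sd = 3.5)               # Tom: P(X <= 20) ~ 0.9889
pnorm(6.15, 6, 0.1) - pnorm(5.90, 6, 0.1)    # Dick: P(5.90 <= X <= 6.15) ~ 0.7745
qnorm(0.99, mean = 20, sd = 3.876)           # Harry: wait time ~ 29 minutes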

While you are waiting, think about the 68-95-99.7 rule of the normal distribution. Also, guess how many page views and users we have on our blog. We’ve traveled for close to a year now. Thank you so so much for this wonderful journey 🙂

