Lesson 56 – Continuous distributions in R: Part II

The sun breaks, the clock rolls, and the temperature warms marking the end of the hibernation.

Last year, when we rolled back into the winter, we were learning discrete distributions in R using November temperature data.

It behooves that today, inviting the warmth, we continue learning continuous distributions using March temperature.

It’s 40-degree Fahrenheit with 40% humidity today in New York. Pleasant enough to take a nice walk without much effort. It comes after a stormy week.

Last week, before leaving, I asked you whether we will see a gusty event before today. Indeed, midweek, we had another windy Wednesday. The expected wait time is 19 days. It does not happen every 19 days. It may happen earlier, or we may have to wait longer. They average out to 19 days.

Since it is warm and relatively humid, we should analyze the data for March temperature.

Get the data from our source. Go through the chores of setting up the code. Coffee is a must.

FIRST STEPS

Step 1: You can download the filtered data file here.

Step 2: Create a new folder, “lesson56,” and save the data file in this folder.

Step 3: Create a new code for this lesson, “lesson56_code.R,” and save it in the same folder.

Step 4: Choose your working directory. “lesson56” is the folder where we stored the data file. Use “setwd(“path”)” to set the path to this folder.

setwd("path to your folder")

Step 5: Read the data into R workspace

I have the data in the file named “cp_temperature.csv.”

There are six columns in this data file. Year, Month, Day, TAVG (the average temperature for the day), TMAX (the maximum temperature for the day) and TMIN (the minimum temperature for the day).

We have data from 1869 onwards. Type the following line in your code and execute it. Use header=TRUE in the command.

# Read the file #
cp_data = read.csv("cp_temperature.csv",header=TRUE)
Let’s Understand the Central Limit Theorem

I want to take the data for TAVG in March and check its distribution. Since we are only nine days into March 2018, we will take the 31 days data for March 2017. You can use the which command to extract data.

# Extract March 2017 Daily Temperature Data #
index = which((cp_data[,2]==3) & (cp_data[,1] == 2017))
march_2017 = cp_data[index,4]

Through this command, you are looking for the days when the second column (Month) is 3 (March) and the first column (Year) is 2017. So “index” will be the row index in the data matrix. Then, we extract the fourth column (TAVG) of these rows.

This is how the probability distribution of these 31 values looks. Use the following lines to make this plot.

library(locfit)
hist(march_2017,prob=T,main="",font=2,font.lab=2,xlab="March 2017 Daily Temperature")
lines(locfit(~march_2017))

Doesn’t it look crazy? Let’s look at 2016 March daily data also.

These values are much more skewed.

Recall central limit theorem from lesson 48.

If S_{n} is the sum or average of n independent random variables, then the distribution function of S_{n} can be well-approximated by a continuous function knows as the normal density function.

Let’s develop a simple code to see this manifest. These are the steps.

  1. Take the 31 days TAVG data from March of each year. For simplicity, I will do it from 1951 to 2017. 67 years.
  2. Plot the distribution of these 31 values. They may be skewed.
  3. Compute the average monthly temperature for March. AVG is the average of the 31 values in March in that year. AVG = \frac{1}{n}\sum_{i=1}^{i=31}TAVG_{i}.
  4. Analyze the distribution of these averages (AVG). Because AVG is the average of n random variables, in the limit, it should converge to a normal distribution. In our case, n = 31; we create 67 values of AVG (1951 to 2017) and verify the probability distribution.

Here is the view of the manifestation of the central limit theorem. I set it up like an animation to run in the plot domain.

In the top panel, you will see the distribution of TAVG for March for each year unfolding. Each time I am marking the average of these 31 values with a red color triangle. Then, in the second panel, these red triangles are plotted as a distribution.

Here is the code for it. You can try it yourself.

## Code to demonstrate the Central Limit Theorem ##
yr1 = 1951
yr2 = 2017
n = yr2 - yr1 + 1

index = which((cp_data[,2]==3) & (cp_data[,1] == yr1)) 
march_daily_temperature = cp_data[index,4]

par(mfrow=c(2,1))
plot(locfit(~march_daily_temperature),main="Average Daily Temperature",col="grey",xlim=c(10,80),ylim=c(0,0.1),font=2,font.lab=2,xlab="Temperature (deg F)")
points(mean(march_daily_temperature,na.rm=T),0,col="red",pch=25)

avg = matrix(NA,nrow=n,ncol=1)
for (i in 1: n)
{
 yr = 1950 + i
 index2 = which((cp_data[,1]==yr) & (cp_data[,2]==3))
 daily = cp_data[index2,4]

lines(locfit(~daily),col="grey")
 
 avg[i,1] = mean(cp_data[index2,4],na.rm=T)
 points(avg[i,1],0,col="red",pch=25)
 Sys.sleep(0.5) 
}
plot(locfit(~avg),main="Average Monthly Temperature",col="white",lwd=2,xlim=c(10,80),font=2,font.lab=2,xlab="Temperature (deg F)")
lines(locfit(~avg),col="red",lwd=2)

In lesson 48, I also showed how the normal distribution emerges from the convolution of Uniform and Poisson distribution. Here is the code for it.

# Uniform Dist #
x1 = runif(20000)
x2 = runif(20000)
x3 = runif(20000)
x4 = runif(20000)
x5 = runif(20000)

z1 = x1
z2 = x1 + x2
z3 = x1 + x2 + x3
z4 = x1 + x2 + x3 + x4
z5 = x1 + x2 + x3 + x4 + x5

plot(density(x1),xlim=c(0,5),ylim=c(0,2),font=2,font.lab=2,main="",xlab="Sn")
text(0.2,1.1,"Sn = X1")
Sys.sleep(1)

lines(density(z2),col="green",lwd=2)
text(1.2,1.1,"Sn = X1+X2",col="green")
Sys.sleep(1)

lines(density(z3),col="pink",lwd=2)
text(2.2,0.8,"Sn = X1+X2+X3",col="pink")
Sys.sleep(1)

lines(density(z4),col="grey",lwd=2)
text(3.1,0.6,"Sn = X1+X2+X3+X4",col="grey")
Sys.sleep(1)

lines(density(z5),col="yellow",lwd=2)
text(3.6,0.3,"Sn = X1+X2+X3+X4+X5",col="yellow")
Sys.sleep(1)

# Poisson Dist #
lambda = 1
x1 = rpois(10000,lambda)
x2 = rpois(10000,lambda)
x3 = rpois(10000,lambda)
x4 = rpois(10000,lambda)
x5 = rpois(10000,lambda)
x6 = rpois(10000,lambda)
x7 = rpois(10000,lambda)
x8 = rpois(10000,lambda)
x9 = rpois(10000,lambda)
x10 = rpois(10000,lambda)

z1 = x1
z2 = x1 + x2
z3 = x1 + x2 + x3
z4 = x1 + x2 + x3 + x4
z5 = x1 + x2 + x3 + x4 + x5
z6 = x1 + x2 + x3 + x4 + x5 + x6
z7 = x1 + x2 + x3 + x4 + x5 + x6 + x7
z8 = x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8
z9 = x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
z10 = x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10

Sys.sleep(1) 
hist(x1,xlim=c(0,20),ylim=c(0,5000),col="red",breaks=50,main="",font=2,font.lab=2,xlab="Sn")
Sys.sleep(1)

hist(z2,add=T,col="green",breaks=50)
Sys.sleep(1)

hist(z3,add=T,col="pink",breaks=50)
Sys.sleep(1)

hist(z4,add=T,col="grey",breaks=50)
Sys.sleep(1)

hist(z5,add=T,col="yellow",breaks=50)
Sys.sleep(1)

hist(z6,add=T,col="orange",breaks=50)
Sys.sleep(1)

hist(z7,add=T,col="blue",breaks=50)
Sys.sleep(1)

hist(z8,add=T,col="violet",breaks=50)
Sys.sleep(1)

hist(z9,add=T,col="magenta",breaks=50)
Sys.sleep(1)

hist(z10,add=T,col="black",breaks=50)
text(15,1100,"Sn = X1 + X2 + ... + X10")

Normal Distribution

We now know that AVG is a normal distribution.

What is the probability that the average March monthly temperature AVG will be less than or equal to 40?

.

.

.

I know you are fetching for your standard normal tables to compute

 P(AVG \le 40) = P(\frac{AVG-\mu}{\sigma} \le \frac{40-42.197}{3.113}) = P(Z \le -0.706) = 0.24

While you can do this on paper, in R, you can just use the pnorm command like this.

# Probability Computations #
pnorm(40,mean(avg),sd(avg))

If you want to compute the probability that AVG will exceed 50 deg F, you can do it like this.

1 - pnorm(50,mean(avg),sd(avg))

Now, think about the qnorm command.

Lognormal Distribution

If AVG follows a normal distribution, then, X = e^{AVG} is a lognormal distribution because log of X is normal.

We know that exponentiating a normal distribution will result in lognormal distribution.

The average monthly March temperature, AVG, if exponentiated, will transform into a lognormal distribution.

Instead of merely taking the numbers and showing the distribution of the exponentials, I think it will be fun to see where in the real world we do exponential of temperature.

Have you ever wondered if air temperature influences the amount of water vapor it can hold?

The water holding capacity (vapor) increases as the temperature increases. The higher the temperature, the stronger will be the bond between water and air molecules.

We can measure the vapor pressure using this simple equation.

VP = 0.61078e^{\frac{17.27T}{T+243.05}}

VP is the vapor pressure in kilopascals, T is the temperature in deg C. This is the famous Clausius-Clapeyron approximation used in meteorology and climatology to compute the vapor pressure.

You can check the equation and see that the water vapor pressure changes approximately exponentially with temperature. The water-holding capacity of the atmosphere increases by about 7% for every 1 °C rise in temperature.

Let’s compute the vapor pressure for these 67 values of temperature and check out the distribution of vapor pressure, which is roughly exponential of a normal distribution.

## Vapor Pressure ##
Tc = (avg-32)*(5/9) 
vp = 0.61078*exp((17.27*Tc)/(Tc+243.05))

# Plotting and Comparing #
par(mfrow=c(2,1))
hist(Tc,prob=T,main="Average Monthly Temperature",lwd=2,font=2,font.lab=2,xlab="Temperature (deg C)")
lines(locfit(~Tc),col="grey")

hist(vp,prob=T,main="Vapor Pressure ~ exp(T)",lwd=2,font=2,font.lab=2,xlab="Vapor Pressure (kpa)")
lines(locfit(~vp),col="grey")

You should see this plot in your plot domain.

The top panel is the temperature data expressed in deg C.

The bottom panel is the vapor pressure. See how the data becomes skewed. We transformed the data from normal to a lognormal distribution.

You must have noticed that I am using a line par(mfrow). This command is used to make multiple plots as panels. par(mfrow = c(2,1)) will split the plot space into two rows and one column so that we can make two plots, one below the other.

Think about the square of the temperature. X = AVG^{2}. What distribution will it follow? AVG square – Chi-square?

A standard rule of thumb for distributions in R is to use the letters p and q in front of the name of the distribution.

Exponential distribution
pexp to compute the probability.
qexp to compute the quantile.

Gamma distribution
pgamma to compute the probability.
qgamma to compute the quantile.

Normal distribution
pnorm to compute the probability.
qnorm to compute the quantile.

Lognormal distribution
plnorm to compute the probability.
qlnorm to compute the quantile.

Chi-square distribution
pchisq to compute the probability.
qchisq to compute the quantile.

You should have guessed the command if you want to draw random variables from a distribution.

And so, as the beautiful warm day ends, we end our discussion of continuous distributions for now.

As the clocks rolleth, the days breaketh soon for you to relish the spring warmeth. Relish the warmeth, for the “extremum” will cometh.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

error

Enjoy this blog? Please spread the word :)