Lesson 58 – Max (Min): The language of extreme value distribution

Mumble, Joe, and Devine meet to discuss extreme values and extreme value distribution. Extremus Distributus.

J: Extreme value distribution? Are we talking about the pictures we saw last week, where the maximum of any distribution follows a specific probability distribution? How does my driving experience this morning fit the discussion?

M: Can you describe a bit more about your experience?

J: Yes. I was waiting at the merging lane near the GW Bridge. The arrival times between successive vehicles were pretty short; I was left there stranded.

M: We can assume that the arrival times between successive vehicles have an exponential distribution. As you have shown in your graphic, let’s call this random variable X.

X \sim Exp(\lambda)

Let’s say there are n+1 vehicles so that there are n arrival times between successive vehicles that can be shown as a set of random variables.

X_{1}, X_{2}, X_{3}, ..., X_{n}

This means your wait time to merge between successive vehicles could have been any of these n random variables. The maximum wait time is the maximum of these values. Let’s call this maximum time Y.

Y = max(X_{1}, X_{2}, X_{3}, ..., X_{n})

Y follows an extreme value distribution and it is related to the distribution of X_{1}, X_{2}, X_{3}, ..., X_{n}.

D: Since your illustration has an exponential wait time as the parent distribution, Y, the maximum of the X’s, will have a distribution that is related to the exponential distribution. We can derive the exact function of Y based on what we know about X.

M: In fact, we use the exponential distribution as the parent distribution to model the maximum wait time between earthquakes, floods, droughts, etc., to price our insurance products.

J: How can we derive the probability function of Y from X? Can we go through that?

D: Sure. Mumble showed before that Y = max(X_{1}, X_{2}, X_{3}, ..., X_{n}).

The cumulative distribution function is F_{Y}(y)=P(Y \le y). Because Y is the maximum of X_{1}, X_{2}, X_{3}, ..., X_{n}, the event Y \le y occurs only when every one of these values is at most y. In other words,

F_{Y}(y)=P(Y \le y) = P(X_{1} \le y \cap X_{2} \le y \cap ... \cap X_{n} \le y)

We assume that X_{1}, X_{2}, X_{3}, ..., X_{n} are statistically independent; hence, the cumulative distribution of Y is the product of the cumulative distribution of individual parent distributions.

F_{Y}(y)=P(Y \le y) = P(X_{1} \le y)P( X_{2} \le y) ...P(X_{n} \le y) = [F_{X}(y)]^{n}

From this cumulative function, we can derive the probability function.

f_{Y}(y) = \frac{d}{dy}F_{Y}(y) = \frac{d}{dy}[F_{X}(y)]^{n}

f_{Y}(y) = n[F_{X}(y)]^{n-1}f_{X}(y)

M: If we know the distribution of X (parent), we can substitute that in our equations here and get the distribution of Y. Do you want to try it for the exponential wait times?

J: Sure. The cumulative function for an exponential distribution is F_{X}(x) = 1 - e^{-\lambda x}. The probability function is f_{X}(x)=\lambda e^{-\lambda x}. For Y it will be

F_{Y}(y) = [1 - e^{-\lambda y}]^{n}

f_{Y}(y) = n[1 - e^{-\lambda y}]^{n-1}\lambda e^{-\lambda y}

D: Excellent. Did you also notice that the cumulative and probability functions depend on n, the number of independent random variables X_{1}, X_{2}, X_{3}, ..., X_{n}?

Here is how they vary for an exponential parent distribution. They shift to the right with increasing values of n.
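
If you want to reproduce this behavior in R, here is a minimal sketch, assuming an illustrative rate of \lambda = 1 and n = 2, 5, 10, and 50:

# Sketch: cumulative and probability functions of the maximum for an exponential parent #
# lambda = 1 and the n values are illustrative choices #
lambda = 1
y = seq(from = 0, to = 12, by = 0.01)
n_values = c(2, 5, 10, 50)

par(mfrow = c(2,1))

# Cumulative function F_Y(y) = [1 - exp(-lambda*y)]^n #
plot(y, (1 - exp(-lambda*y))^n_values[1], type = "l", font = 2, font.lab = 2, xlab = "y", ylab = "F(y)")
for (n in n_values[-1])
{
 lines(y, (1 - exp(-lambda*y))^n, col = "grey")
}

# Probability function f_Y(y) = n*[1 - exp(-lambda*y)]^(n-1)*lambda*exp(-lambda*y) #
plot(y, n_values[1]*(1 - exp(-lambda*y))^(n_values[1]-1)*lambda*exp(-lambda*y), type = "l", font = 2, font.lab = 2, xlab = "y", ylab = "f(y)")
for (n in n_values[-1])
{
 lines(y, n*(1 - exp(-lambda*y))^(n-1)*lambda*exp(-lambda*y), col = "grey")
}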

J: We can derive the functions for the minimum in a similar way I suppose.

M: That is correct. If Y_{min} is the minimum of the random variables, then Y_{min} exceeds y_{min} only when each of these independent X_{1}, X_{2}, X_{3}, ..., X_{n} is greater than y_{min}.

P(X_{1}>y_{min}, X_{2}>y_{min}, ..., X_{n}>y_{min}) = 1 - F_{Y_{min}}(y_{min})

Using this equation, we can derive F_{Y_{min}}(y_{min}) = 1 - [1-F_{X}(y_{min})]^{n} and f_{Y_{min}}(y_{min})=n[1-F_{X}(y_{min})]^{n-1}f_{X}(y_{min}).
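
For the exponential parent in Joe’s example, these formulas give a tidy result: substituting F_{X}(y) = 1 - e^{-\lambda y} yields F_{Y_{min}}(y_{min}) = 1 - e^{-n\lambda y_{min}}, so the minimum of n independent exponential variables is itself exponential with rate n\lambda. A quick simulation check, with illustrative \lambda = 1 and n = 10:

# Check: the minimum of n Exp(lambda) variables is Exp(n*lambda) #
# lambda and n are illustrative values #
lambda = 1
n = 10
y_min = replicate(10000, min(rexp(n, lambda)))

hist(y_min, prob = T, breaks = 50, main = "", font = 2, font.lab = 2, xlab = "Minimum wait time")
y = seq(from = 0, to = max(y_min), by = 0.001)
lines(y, n*lambda*exp(-n*lambda*y), col = "red", lwd = 2)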

J: This is cool. All we need to know is the function for X, and we know everything about the extremes.

M: That is true to an extent. But what will we do if we do not know the form of the parent distribution with certainty? Remember Lesson 51? Did you check what happens to the functions with increasing values of n, i.e., as n tends to \infty?

J: Let me plot the cumulative function for increasing values of n and see.
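
Something like this, with an illustrative \lambda = 1; each grey curve is [F_{X}(y)]^{n} for a larger n:

# [F_X(y)]^n for increasing n; exponential parent with lambda = 1 (illustrative) #
lambda = 1
y = seq(from = 0, to = 25, by = 0.01)
plot(y, 1 - exp(-lambda*y), type = "l", font = 2, font.lab = 2, xlab = "y", ylab = "F(y)")
for (n in c(10, 100, 1000, 10000))
{
 lines(y, (1 - exp(-lambda*y))^n, col = "grey")
 Sys.sleep(1)
}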

D: Joe, as you can see, for any fixed y, F_{Y}(y) = [F_{X}(y)]^{n} tends to 0 as n tends to \infty. The function degenerates to a point. It is unstable.

J: Hmm. How do we stabilize it then?

M: Good question. We can work with a normalized version of Y instead of Y.

If there are two normalizing constants a_{n}>0 and b_{n}, we can create a normalized version of Y.

Y^{*} = \frac{Y-b_{n}}{a_{n}}

Proper values of these two normalizing constants a_{n}>0 and b_{n} will stabilize Y for increasing values of n. We need to find the distribution that Y^{*} can take in the limit as n tends to \infty.

D: Look at this visual to understand what stabilizing (norming) for increasing values of n means. It is for an exponential parent, our example. I am assuming a_{n} = 1 while b_{n} varies with n.

The normalized function does not degenerate for large values of n.
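
Here is a minimal R sketch of that norming for our exponential parent, assuming \lambda = 1 (so a_{n} = 1) and the standard choice b_{n} = log(n). The grey curves are the normalized cumulative functions for increasing n; the red curve is the limiting shape they settle on:

# Norming the maximum of exponentials: a_n = 1, b_n = log(n), lambda = 1 (illustrative) #
# Grey curves: P((Y - b_n)/a_n <= z) for increasing n; red: their limiting shape #
lambda = 1
z = seq(from = -3, to = 8, by = 0.01)
plot(z, exp(-exp(-z)), type = "l", col = "red", lwd = 2, font = 2, font.lab = 2, xlab = "z", ylab = "F(z)")
for (n in c(10, 100, 1000))
{
 b_n = log(n)/lambda
 Fx = pmax(0, 1 - exp(-lambda*(z + b_n)))
 lines(z, Fx^n, col = "grey")
 Sys.sleep(1)
}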

M: It turns out that the limit distribution of Y^{*}, the normalized version of Y, can be one of the three types, Type I, Type II and Type III.

If there exist normalizing constants a_{n}>0 and b_{n}, then,

P(\frac{Y - b_{n}}{a_{n}} \le z) \to G(z) as n \to \infty.

G(z) is the non-degenerate cumulative distribution function.

Type I (Gumbel Distribution): G(z) = e^{-e^{-\frac{z-\alpha}{\beta}}}. This is a double exponential function.

Type II (Frechet Distribution): G(z) = e^{-(\frac{z-\alpha}{\beta})^{-\gamma}} for z > \alpha and 0 for z \le \alpha. This is a single exponential function.

Type III (Weibull Distribution): G(z) = e^{-[-\frac{z-\alpha}{\beta}]^{\gamma}} for z < \alpha and 1 for z \ge \alpha. This is also a single exponential function.

The normalized version of Y converges to one of these three types, which together are called the family of extreme value distributions. It is said that the distribution F_{Y}(y) is in the domain of attraction of G(z). \alpha, \beta, and \gamma are the location (central tendency), scale, and shape parameters.

J: This convergence theorem sounds like the central limit theorem.

D: Yes. Just as the sum of independent random variables converges to the normal distribution in the limit as n \to \infty, the maximum of independent random variables converges to one of these three types, Gumbel, Frechet, or Weibull, depending on the parent distribution. Generally, parents with exponential tails, for example, the exponential and normal distributions, tend to the Gumbel, and parents with polynomial tails tend to the Frechet distribution.

You have already seen how the Gumbel looks when exponential wait times converge. Here is an example of the Frechet and Weibull distributions.
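
If you want to sketch them yourself, here is a minimal example that plots the three densities directly from the G(z) functions above, with illustrative parameters \alpha = 0, \beta = 1, and \gamma = 2:

# Densities of the three limit types; alpha = 0, beta = 1, gamma = 2 (illustrative) #
par(mfrow = c(3,1))

# Type I: Gumbel, g(z) = exp(-z)*exp(-exp(-z)) #
z = seq(from = -3, to = 6, by = 0.01)
plot(z, exp(-z)*exp(-exp(-z)), type = "l", lwd = 2, font = 2, font.lab = 2, ylab = "g(z)", main = "Type I: Gumbel")

# Type II: Frechet, g(z) = gamma*z^(-gamma-1)*exp(-z^(-gamma)) for z > 0 #
z = seq(from = 0.01, to = 6, by = 0.01)
plot(z, 2*z^(-3)*exp(-z^(-2)), type = "l", lwd = 2, font = 2, font.lab = 2, ylab = "g(z)", main = "Type II: Frechet")

# Type III: Weibull, g(z) = gamma*(-z)^(gamma-1)*exp(-(-z)^gamma) for z < 0 #
z = seq(from = -3, to = -0.01, by = 0.01)
plot(z, 2*(-z)*exp(-(-z)^2), type = "l", lwd = 2, font = 2, font.lab = 2, ylab = "g(z)", main = "Type III: Weibull")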

Fisher and Tippett derived these three limiting distributions in the late 1920s. Emil Julius Gumbel, in chapter 5 of his book “Statistics of Extremes,” explains the fundamentals of deriving the distributions.

J: This is plentiful. Are there any properties of the three types? Can we guess which distribution is used where? The convergence of extremes to these three types seems very powerful. We can model outliers without much trouble. Some examples in R might help understand these better.

You already have a handful to digest. Let’s continue next week after you have processed it.

Be cautious. Coming up with the values for \alpha, \beta, \gamma in the real world involves uncertainty. The extreme value distribution need not be an elixir for outliers.

 


Lesson 57 – My name is Maximus Extremus Distributus

The best school districts in New York have a name.

The name’s Extremus Distributus.

Heavy damaging winds have a name.

Extremus Distributus.

Central limit theorem tells you that the mean of any distribution follows a normal distribution in the limit.

Centerus Normalus Distributus.

The maxima of the normal distribution also follow a distinct distribution.

Maximus Extremus Distributus

The maxima of several other distributions also follow a Maximus Extremus Distributus.

Tertia pars sequetur …


Lesson 56 – Continuous distributions in R: Part II

The sun breaks, the clock rolls, and the temperature warms, marking the end of the hibernation.

Last year, when we rolled back into the winter, we were learning discrete distributions in R using November temperature data.

It is fitting that today, inviting the warmth, we continue learning continuous distributions using March temperature.

It’s 40 degrees Fahrenheit with 40% humidity today in New York. Pleasant enough to take a nice walk without much effort. It comes after a stormy week.

Last week, before leaving, I asked you whether we would see a gusty event before today. Indeed, midweek, we had another windy Wednesday. The expected wait time is 19 days, but an event does not happen every 19 days. It may happen earlier, or we may have to wait longer; the wait times average out to 19 days.

Since it is warm and relatively humid, we should analyze the data for March temperature.

Get the data from our source. Go through the chores of setting up the code. Coffee is a must.

FIRST STEPS

Step 1: You can download the filtered data file here.

Step 2: Create a new folder, “lesson56,” and save the data file in this folder.

Step 3: Create a new code for this lesson, “lesson56_code.R,” and save it in the same folder.

Step 4: Choose your working directory. “lesson56” is the folder where we stored the data file. Use setwd("path") to set the path to this folder.

setwd("path to your folder")

Step 5: Read the data into R workspace

I have the data in the file named “cp_temperature.csv.”

There are six columns in this data file. Year, Month, Day, TAVG (the average temperature for the day), TMAX (the maximum temperature for the day) and TMIN (the minimum temperature for the day).

We have data from 1869 onwards. Type the following line in your code and execute it. Use header=TRUE in the command.

# Read the file #
cp_data = read.csv("cp_temperature.csv",header=TRUE)

Let’s Understand the Central Limit Theorem

I want to take the TAVG data for March and check its distribution. Since we are only nine days into March 2018, we will take the 31 days of data from March 2017. You can use the which command to extract the data.

# Extract March 2017 Daily Temperature Data #
index = which((cp_data[,2]==3) & (cp_data[,1] == 2017))
march_2017 = cp_data[index,4]

Through this command, you are looking for the days when the second column (Month) is 3 (March) and the first column (Year) is 2017. So “index” will hold the row indices in the data matrix. Then, we extract the fourth column (TAVG) of these rows.

This is how the probability distribution of these 31 values looks. Use the following lines to make this plot.

library(locfit)
hist(march_2017,prob=T,main="",font=2,font.lab=2,xlab="March 2017 Daily Temperature")
lines(locfit(~march_2017))

Doesn’t it look crazy? Let’s look at 2016 March daily data also.

These values are much more skewed.

Recall central limit theorem from lesson 48.

If S_{n} is the sum or average of n independent random variables, then the distribution function of S_{n} can be well-approximated by a continuous function known as the normal density function.

Let’s develop a simple code to see this manifest. These are the steps.

  1. Take the 31 days TAVG data from March of each year. For simplicity, I will do it from 1951 to 2017. 67 years.
  2. Plot the distribution of these 31 values. They may be skewed.
  3. Compute the average monthly temperature for March. AVG is the average of the 31 TAVG values in March of that year: AVG = \frac{1}{31}\sum_{i=1}^{31}TAVG_{i}.
  4. Analyze the distribution of these averages (AVG). Because AVG is the average of n random variables, in the limit, it should converge to a normal distribution. In our case, n = 31; we create 67 values of AVG (1951 to 2017) and verify the probability distribution.

Here is the view of the manifestation of the central limit theorem. I set it up like an animation to run in the plot domain.

In the top panel, you will see the distribution of TAVG for March for each year unfolding. Each time I am marking the average of these 31 values with a red color triangle. Then, in the second panel, these red triangles are plotted as a distribution.

Here is the code for it. You can try it yourself.

## Code to demonstrate the Central Limit Theorem ##
yr1 = 1951
yr2 = 2017
n = yr2 - yr1 + 1

# Daily March temperature for the first year #
index = which((cp_data[,2]==3) & (cp_data[,1] == yr1))
march_daily_temperature = cp_data[index,4]

# Top panel: daily distributions; bottom panel: distribution of the yearly averages #
par(mfrow=c(2,1))
plot(locfit(~march_daily_temperature),main="Average Daily Temperature",col="grey",xlim=c(10,80),ylim=c(0,0.1),font=2,font.lab=2,xlab="Temperature (deg F)")
points(mean(march_daily_temperature,na.rm=T),0,col="red",pch=25)

avg = matrix(NA,nrow=n,ncol=1)
for (i in 1:n)
{
 yr = yr1 - 1 + i
 index2 = which((cp_data[,1]==yr) & (cp_data[,2]==3))
 daily = cp_data[index2,4]

 # Distribution of the 31 daily values for this year #
 lines(locfit(~daily),col="grey")

 # Mark this year's average with a red triangle #
 avg[i,1] = mean(cp_data[index2,4],na.rm=T)
 points(avg[i,1],0,col="red",pch=25)
 Sys.sleep(0.5)
}
# Distribution of the 67 yearly averages #
plot(locfit(~avg),main="Average Monthly Temperature",col="white",lwd=2,xlim=c(10,80),font=2,font.lab=2,xlab="Temperature (deg F)")
lines(locfit(~avg),col="red",lwd=2)

In lesson 48, I also showed how the normal distribution emerges from the convolution (sums) of Uniform random variables and of Poisson random variables. Here is the code for it.

# Uniform Dist #
x1 = runif(20000)
x2 = runif(20000)
x3 = runif(20000)
x4 = runif(20000)
x5 = runif(20000)

z1 = x1
z2 = x1 + x2
z3 = x1 + x2 + x3
z4 = x1 + x2 + x3 + x4
z5 = x1 + x2 + x3 + x4 + x5

plot(density(x1),xlim=c(0,5),ylim=c(0,2),font=2,font.lab=2,main="",xlab="Sn")
text(0.2,1.1,"Sn = X1")
Sys.sleep(1)

lines(density(z2),col="green",lwd=2)
text(1.2,1.1,"Sn = X1+X2",col="green")
Sys.sleep(1)

lines(density(z3),col="pink",lwd=2)
text(2.2,0.8,"Sn = X1+X2+X3",col="pink")
Sys.sleep(1)

lines(density(z4),col="grey",lwd=2)
text(3.1,0.6,"Sn = X1+X2+X3+X4",col="grey")
Sys.sleep(1)

lines(density(z5),col="yellow",lwd=2)
text(3.6,0.3,"Sn = X1+X2+X3+X4+X5",col="yellow")
Sys.sleep(1)

# Poisson Dist #
lambda = 1
x1 = rpois(10000,lambda)
x2 = rpois(10000,lambda)
x3 = rpois(10000,lambda)
x4 = rpois(10000,lambda)
x5 = rpois(10000,lambda)
x6 = rpois(10000,lambda)
x7 = rpois(10000,lambda)
x8 = rpois(10000,lambda)
x9 = rpois(10000,lambda)
x10 = rpois(10000,lambda)

z1 = x1
z2 = x1 + x2
z3 = x1 + x2 + x3
z4 = x1 + x2 + x3 + x4
z5 = x1 + x2 + x3 + x4 + x5
z6 = x1 + x2 + x3 + x4 + x5 + x6
z7 = x1 + x2 + x3 + x4 + x5 + x6 + x7
z8 = x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8
z9 = x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
z10 = x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10

Sys.sleep(1) 
hist(x1,xlim=c(0,20),ylim=c(0,5000),col="red",breaks=50,main="",font=2,font.lab=2,xlab="Sn")
Sys.sleep(1)

hist(z2,add=T,col="green",breaks=50)
Sys.sleep(1)

hist(z3,add=T,col="pink",breaks=50)
Sys.sleep(1)

hist(z4,add=T,col="grey",breaks=50)
Sys.sleep(1)

hist(z5,add=T,col="yellow",breaks=50)
Sys.sleep(1)

hist(z6,add=T,col="orange",breaks=50)
Sys.sleep(1)

hist(z7,add=T,col="blue",breaks=50)
Sys.sleep(1)

hist(z8,add=T,col="violet",breaks=50)
Sys.sleep(1)

hist(z9,add=T,col="magenta",breaks=50)
Sys.sleep(1)

hist(z10,add=T,col="black",breaks=50)
text(15,1100,"Sn = X1 + X2 + ... + X10")

Normal Distribution

We now know that AVG follows a normal distribution.

What is the probability that the average March monthly temperature AVG will be less than or equal to 40?

.

.

.

I know you are reaching for your standard normal tables to compute

 P(AVG \le 40) = P(\frac{AVG-\mu}{\sigma} \le \frac{40-42.197}{3.113}) = P(Z \le -0.706) = 0.24

While you can do this on paper, in R, you can just use the pnorm command like this.

# Probability Computations #
pnorm(40,mean(avg),sd(avg))

If you want to compute the probability that AVG will exceed 50 deg F, you can do it like this.

1 - pnorm(50,mean(avg),sd(avg))

Now, think about the qnorm command.
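
As a hint, qnorm is the inverse of pnorm: give it a probability, and it returns the matching quantile. A minimal sketch using the avg values from above:

# qnorm inverts pnorm: which temperature is not exceeded with 24% probability? #
qnorm(0.24, mean(avg), sd(avg))

# Round trip: pnorm of the 24% quantile recovers 0.24 #
pnorm(qnorm(0.24, mean(avg), sd(avg)), mean(avg), sd(avg))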

Lognormal Distribution

If AVG follows a normal distribution, then X = e^{AVG} follows a lognormal distribution because the log of X is normal.

In other words, exponentiating a normal random variable produces a lognormal one. The average monthly March temperature, AVG, if exponentiated, will transform into a lognormal distribution.

Instead of merely taking the numbers and showing the distribution of their exponentials, I think it will be fun to see where in the real world we take the exponential of temperature.

Have you ever wondered if air temperature influences the amount of water vapor it can hold?

The water holding capacity (vapor) increases as the temperature increases. The higher the temperature, the stronger will be the bond between water and air molecules.

We can compute the vapor pressure using this simple equation.

VP = 0.61078e^{\frac{17.27T}{T+243.05}}

VP is the vapor pressure in kilopascals, and T is the temperature in deg C. This is the famous Clausius-Clapeyron approximation used in meteorology and climatology to compute the vapor pressure.

You can check the equation and see that the water vapor pressure changes approximately exponentially with temperature. The water-holding capacity of the atmosphere increases by about 7% for every 1 °C rise in temperature.
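
You can convince yourself of this rule of thumb by taking the ratio of vapor pressures one degree apart. A minimal sketch (the test points of 0 and 15 deg C are arbitrary choices):

# Rough check of the ~7% per deg C rule of thumb #
vapor_pressure = function(tc) 0.61078*exp((17.27*tc)/(tc + 243.05))

vapor_pressure(1)/vapor_pressure(0) - 1   # about 7.3% at 0 deg C
vapor_pressure(16)/vapor_pressure(15) - 1 # about 6.5% at 15 deg C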

Let’s compute the vapor pressure for these 67 values of temperature and check out the distribution of vapor pressure, which is roughly exponential of a normal distribution.

## Vapor Pressure ##
Tc = (avg-32)*(5/9) 
vp = 0.61078*exp((17.27*Tc)/(Tc+243.05))

# Plotting and Comparing #
par(mfrow=c(2,1))
hist(Tc,prob=T,main="Average Monthly Temperature",lwd=2,font=2,font.lab=2,xlab="Temperature (deg C)")
lines(locfit(~Tc),col="grey")

hist(vp,prob=T,main="Vapor Pressure ~ exp(T)",lwd=2,font=2,font.lab=2,xlab="Vapor Pressure (kpa)")
lines(locfit(~vp),col="grey")

You should see this plot in your plot domain.

The top panel is the temperature data expressed in deg C.

The bottom panel is the vapor pressure. See how the data becomes skewed. We transformed the data from normal to a lognormal distribution.

You must have noticed that I am using the line par(mfrow). This command is used to make multiple plots as panels. par(mfrow = c(2,1)) will split the plot space into two rows and one column so that we can make two plots, one below the other.

Think about the square of the temperature. X = AVG^{2}. What distribution will it follow? AVG square – Chi-square?
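
As a hint, it is the square of a standardized normal variable that follows a chi-square distribution with one degree of freedom. Here is a minimal sketch to explore the question, standardizing avg first; an informal visual check, not a formal test:

# Square of the standardized averages against the chi-square (1 df) density #
z = (as.numeric(avg) - mean(avg))/sd(avg)
z2 = z^2

hist(z2, prob = T, breaks = 20, main = "", font = 2, font.lab = 2, xlab = "Squared standardized temperature")
x = seq(from = 0.01, to = max(z2), by = 0.01)
lines(x, dchisq(x, df = 1), col = "red", lwd = 2)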

A standard rule of thumb for distributions in R is to use the letters p and q in front of the name of the distribution.

Exponential distribution
pexp to compute the probability.
qexp to compute the quantile.

Gamma distribution
pgamma to compute the probability.
qgamma to compute the quantile.

Normal distribution
pnorm to compute the probability.
qnorm to compute the quantile.

Lognormal distribution
plnorm to compute the probability.
qlnorm to compute the quantile.

Chi-square distribution
pchisq to compute the probability.
qchisq to compute the quantile.

You should have guessed the command if you want to draw random variables from a distribution.
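
Yes, it is the letter r. A minimal sketch, with illustrative parameter values:

# The r prefix draws random numbers from a distribution #
rnorm(5, mean = 42, sd = 3)   # five draws from a normal distribution
rexp(5, rate = 1/19)          # five draws from an exponential distribution
rchisq(5, df = 1)             # five draws from a chi-square distribution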

And so, as the beautiful warm day ends, we end our discussion of continuous distributions for now.

As the clocks rolleth, the days breaketh soon for you to relish the spring warmeth. Relish the warmeth, for the “extremum” will cometh.


Lesson 55 – Continuous distributions in R: Part I

Another one of those pharmacy umbrellas is resting in peace after yesterday’s Nor’easter. On a day when a trailer toppled on the Verrazzano Bridge, I must be happy it is only my umbrella that blew away and not me. These little things are so flimsy they can at best handle a happy drizzle.

As per the National Weather Service, winds (gusts) were up to 35 miles per hour yesterday. We are still under a wind advisory as I write this.

My mind is on exploring how many such gusty events happened in the past, what the wait time between them is, and more.

I have requested the daily wind gust data for New York City from NOAA’s Climate Data Online.

So, get your drink for the day and fire up your R consoles. Let’s recognize this Nor’easter with some data analysis before it fades away as a statistic. We don’t know how long we have to wait for another such event. Or do we?

FIRST STEPS

Step 1: Get the data
You can download the filtered data file here.

Step 2: Create a new folder on your computer
Let us call this folder “lesson55.” Save the data file in this folder.

Step 3: Create a new code in R
Create a new code for this lesson. “File >> New >> R script.” Save the code in the same folder “lesson55” using the “save” button or by using “Ctrl+S.” Use .R as the extension — “lesson55_code.R.” Now your R code and the data file are in the same folder.

Step 4: Choose your working directory
“lesson55” is the folder where we stored the data file. Use setwd("path") to set the path to this folder. Execute the line by clicking the “Run” button on the top right.

setwd("path to your folder")

Step 5: Read the data into R workspace
I have the data in the file named “cp_wind.csv.” I am only showing the wind gust data for the day. The column name (according to NOAA CDO) is WSFG.

WSFG = Peak gust wind speed (miles per hour or meters per second as per user preference)

Type the following line in your code and execute it. Use header=TRUE in the command.

#### Read the file ####
cp_data = read.csv("cp_wind.csv",header=TRUE)

Counts – Poisson Distribution

There is data from 1984 to 2017, 34 years.

Execute the following line in your code to count the total number of days when wind gust exceeded 35 mph.

# Select Wind Gust data > 35 mph from the file #
gust_events = length(which(cp_data$WSFG > 35))

The which command outputs the indices of the rows where the wind gust is greater than 35 mph. The length command counts the number of such days.

There are 653 such events in 34 years, an average of 19.20 events per year.

Let’s count the events each year from 1984 to 2017.

Write the following lines in your code.

# Count the number of events each year #
years = range(cp_data$Year)

number_years = diff(years) + 1

gust_counts_yearly = matrix(NA,nrow=number_years,ncol=1)

for (i in 1:number_years)
{
 year = 1983 + i
 index = which(cp_data$Year == year)
 gust_counts_yearly[i,1] = length(which(cp_data$WSFG[index] > 35))
}

plot(1984:2017,gust_counts_yearly,type="o",font=2,font.lab=2,xlab="Years",ylab="Gust Counts")

hist(gust_counts_yearly)
lambda = mean(gust_counts_yearly)

Assuming that we can model the number of gust events each year as a Poisson process with the rate you estimated above (19.2 per year), we can generate Poisson random numbers to replicate the number of gust events (or counts) in a year.

If we generate 34 numbers like this, we just created one possible scenario of yearly gust events (past or future).

Run the following line in your code.

# create one scenario #
counts_scenario = rpois(34,lambda)

Now, you must be wondering how the command rpois works. What is the mechanism for creating these random variables?

To understand this, we have to start with uniform distribution and the cumulative density functions.

F(x) is an increasing function of x, and we know that 0 \le F(x) \le 1. Moreover, for a continuous random variable X, the transformed variable F(X) is uniformly distributed on [0, 1]; this is the probability integral transform.

So, for any uniform random variable u, we can set F(x) = u, since P(U \le u) = u for the uniform distribution.

Using this relation, we can obtain x using the inverse of the function F: x = F^{-1}(u).

If we know the cumulative distribution function F of a random variable X, we can obtain a specific value from this distribution by solving the inverse function for a uniform random variable u. The following graphic should make it clear how a uniform random variable u maps to a random variable x from a distribution. rpois implicitly does this.
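
To see the mechanism in a case where the inverse has a closed form, here is a minimal sketch using the exponential distribution: F(x) = 1 - e^{-\lambda x}, so x = F^{-1}(u) = -log(1-u)/\lambda. (For a discrete distribution like the Poisson, the inverse is found by stepping through the cumulative probabilities.) The rate of 1/19 per day, roughly our gust rate, is illustrative:

# Inverse transform sampling for the exponential distribution #
# F(x) = 1 - exp(-lambda*x), so x = F^-1(u) = -log(1 - u)/lambda #
lambda = 1/19
u = runif(10000)
x = -log(1 - u)/lambda

# Compare with R's built-in generator #
par(mfrow = c(2,1))
hist(x, prob = T, breaks = 50, main = "Inverse Transform", font = 2, font.lab = 2, xlab = "Wait time in days")
hist(rexp(10000, lambda), prob = T, breaks = 50, main = "rexp", font = 2, font.lab = 2, xlab = "Wait time in days")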

Uniform Distribution

Do you remember lesson 42? Uniform distribution is a special case of the beta distribution with a standard form

f(x) = cx^{a-1}(1-x)^{b-1} for 0 < x < 1.

a and b are the parameters that control the shape of the distribution. They can take any positive real numbers; a > 0 and b > 0.

c is called the normalizing constant. It ensures that the pdf integrates to 1. Its reciprocal, 1/c, is the beta function, defined as the area under the graph of x^{a-1}(1-x)^{b-1} between 0 and 1. For integer values of a and b, the constant c is defined using the generalized factorial function.

 c = \frac{(a + b - 1)!}{(a-1)!(b-1)!}

Substitute a = 1 and b = 1 in the standard function.

 c = \frac{(1 + 1 - 1)!}{(1-1)!(1-1)!} = 1

f(x) = 1x^{1-1}(1-x)^{1-1} = 1

A constant value 1 for all x. This flat function is called the uniform distribution.

In lesson 42, I showed you an animation of how the beta function changes for different values of a and b. Here is the code for that. You can try it for yourself.

I am first holding b constant at 1 and varying a from 0.1 to 2 to see how the function changes. Then, I keep a at 2 and increase b from 1 to 2. If you select the lines altogether and run the code, you will see the plot domain running this animation. The Sys.sleep(1) controls the animation/loop speed.

### Beta / Uniform Distribution ###

### Animation ###

a = seq(from = 0.1, to = 2, by = 0.1)
b = 1

x = seq(from = 0, to = 1, by = 0.01)
for (i in 1:length(a))
{
 c = factorial(a[i]+b-1)/(factorial(a[i]-1)*factorial(b-1))
 fx = c*x^(a[i]-1)*(1-x)^(b-1)
 
 plot(x,fx,type="l",font=2,font.lab=2,ylab="f(x)",ylim=c(0,5),main="Beta Family")
 txt1 = paste("a = ",a[i])
 txt2 = paste("b = ",b)
 
 text(0.6,4,txt1,col="red",font=2)
 text(0.6,3.5,txt2,col="red",font=2)
 
 Sys.sleep(1) 
}

a = 2
b = seq(from = 1, to = 2, by = 0.1)

x = seq(from = 0, to = 1, by = 0.01)
for (i in 1:length(b))
{
 c = factorial(a+b[i]-1)/(factorial(a-1)*factorial(b[i]-1))
 fx = c*x^(a-1)*(1-x)^(b[i]-1)
 
 plot(x,fx,type="l",font=2,font.lab=2,ylab="f(x)",ylim=c(0,5),main="Beta Family")
 txt1 = paste("a = ",a)
 txt2 = paste("b = ",b[i])
 
 text(0.6,4,txt1,col="red",font=2)
 text(0.6,3.5,txt2,col="red",font=2)
 
 Sys.sleep(1) 
}

You should see this running in your plot domain.

Exponential Distribution

Let’s get back to the gusty winds.

Remember what we discussed in lesson 43 about the exponential distribution. The time between these gusty events (wind gusts > 35 mph) follows an exponential distribution.

 P(T > t) = P(N = 0) = \frac{e^{-\lambda t}(\lambda t)^{0}}{0!} = e^{-\lambda t}

Here is the code to measure the wait time (in days) between these 653 events, plot them as a density plot and measure the average wait time between the events (expected value of the exponential distribution).

### Exponential Distribution ###

# Select the indices of the Wind Gust data > 35 mph from the file #
library(locfit)

gust_index = which(cp_data$WSFG > 35)

gust_waittime = diff(gust_index)

x = as.numeric(gust_waittime)
hist(x,prob=T,main="",xlim=c(0.1,200),lwd=2,xlab="Wait time in days",ylab="f(x)",font=2,font.lab=2)
lines(locfit(~x),xlim=c(0.1,200),col="red",lwd=2,xlab="Wait time in days",ylab="f(x)",font=2,font.lab=2)

ave_waittime = mean(gust_waittime)

The first event recorded was on the 31st day from the beginning of the data record, January 31, 1984. A wait time of 31 days. The second event occurred again on day 55, February 24, 1984. A wait time of 24 days. If we record these wait times, we get a distribution of them like this.

Now, let’s generate 34 random numbers that replicate the time between the gusty events.

# Simulating wait times #
simulated_waittimes = rexp(34,lambda)*365

plot(locfit(~simulated_waittimes),xlim=c(0.1,200),col="red",lwd=2,xlab="Wait time in days",ylab="f(x)",font=2,font.lab=2)

rexp is the command to generate exponentially distributed random numbers with a certain rate of occurrence lambda. In our case, lambda is 19.2 per year. I multiplied them by 365 to convert the wait times to days.

What is the probability of waiting more than 20 days for the next gusty event?

Yes, you’d compute P(T > 20) = e^{-20\lambda} = e^{-20/19} = 0.3488.

In Rstudio, you can use this line.

# Probability of wait time greater than 20 days #
lambda = 1/ave_waittime

1 - pexp(20, lambda)

pexp computes P(T \le 20).

Now, imagine we waited for 20 days. What is the probability of waiting another 20 days?

Yes, no calculation required. It is 34.88%. Memoryless property. We learned this in lesson 44.

I also showed you an animation for how it worked based on the idea of shifting the distribution by 20. Here is the code for that. Try it.

## Memoryless property ##

# animation plot to understand the memoryless property #
Sys.sleep(1)
t = seq(from=0,to=200,by=1)
ft = lambda*exp(-lambda*t)
plot(t,ft,type="l",font=2,font.lab=2,xlab="Wait Time to Gusty Events (days)",ylab="f(t)")

t = seq(from=20,to=200,by=1)

Sys.sleep(1)

abline(v=ave_waittime,col="red")

Sys.sleep(1)

ft = lambda*exp(-lambda*t)
lines(t,ft,col="red",lty=2,lwd=2,type="o")

Sys.sleep(1)

t = seq(from = 20, to = 200, by =1)
ft = lambda*exp(-lambda*(t-20))
lines(t,ft,col="red",type="o")

Gamma Distribution

The wait time for the r-th event follows a Gamma distribution. In this example, the wait time for the second gusty event from now, the third gusty event from now, etc., follows a Gamma distribution.

Check out this animation code for it.

r = seq(from=1,to=5,by=1)
t = seq(from=0,to=200,by=1)

for (i in 1:length(r))
{
 gt = (1/factorial(r[i]-1))*(lambda*exp(-lambda*t)*(lambda*t)^(r[i]-1))
 
 plot(t,gt,type="l",font=2,font.lab=2,xlab="Wait time",ylab="f(t)")
 txt1 = paste("lambda = ",round(lambda,2),sep="")
 txt2 = paste("r = ",r[i],sep="")
 text(50,0.002,txt1,col="red",font=2,font.lab=2)
 text(50,0.004,txt2,col="red",font=2,font.lab=2)
 Sys.sleep(1) 
}

Alternatively, you can also create random numbers from the Gamma distribution using the “rgamma” function and plot the density function.

# alternate way #
x1 = rgamma(10000,shape=1,scale=ave_waittime)
plot(locfit(~x1),font=2,font.lab=2,xlab="Wait time",ylab="f(t)")
txt1 = paste("lambda = ",round(lambda,2),sep="")
txt2 = paste("r = ",1,sep="")
text(50,0.002,txt1,col="red",font=2,font.lab=2)
text(50,0.004,txt2,col="red",font=2,font.lab=2)

x2 = rgamma(10000,shape=2,scale=ave_waittime)
plot(locfit(~x2),font=2,font.lab=2,xlab="Wait time",ylab="f(t)")
txt1 = paste("lambda = ",round(lambda,2),sep="")
txt2 = paste("r = ",2,sep="")
text(50,0.002,txt1,col="red",font=2,font.lab=2)
text(50,0.004,txt2,col="red",font=2,font.lab=2)

x3 = rgamma(10000,shape=3,scale=ave_waittime)
plot(locfit(~x3),font=2,font.lab=2,xlab="Wait time",ylab="f(t)")
txt1 = paste("lambda = ",round(lambda,2),sep="")
txt2 = paste("r = ",3,sep="")
text(50,0.002,txt1,col="red",font=2,font.lab=2)
text(50,0.004,txt2,col="red",font=2,font.lab=2)

Can you now tell me the probability of seeing the next gusty event before you see the next gusty lesson (part II)?

