Lesson 69 – The quest for truth: Learning estimators in R, part I

For the fifteenth time, Mr. Alfred Quality looks at the two candies in front of him and shakes his head. He is visibly disgruntled. He has spent 50 years selling quality candies in North Dakota. The kids love them. Quality Candies are local and classic North Dakotan. Now, this new “Classic Series Pineapple Candy” seems to have deviated from his expectations.

There is a knock on the door. “That must be Wilfred,” Mr. Quality thinks, as he says in a raised voice, “Come on in.”

Wilfred Quality, son of Alfred Quality, gently opens the door and walks in. He is well dressed and looks around 50 years old. He notices his father’s displeasure as he steps closer.

“Son, I have noticed a lapse in quality in this Classic Series Candy. There is a candy with no candy. What would our customers think? We need immediate action on this. I want to know what proportion of our candies in the stores are dummies. We have to fix this before the customers accuse us of cheating by selling empty candies. Is this a packing machine error, or is it a trick to adjust for inflation?”

“I will look into it. It must be a machine error. There are no such inflation adjustment tricks, I assure you,” said Wilfred, the current CEO of Quality Candies. He has been running the business very efficiently since his father took up golf.

“I will ask Junior to collect the data and give us the answers,” he said.

∞∞∞

Wilfred Quality Jr. is an enthusiastic lad, always on a quest for the truth. This summer, he is interning at one of his father’s offices in Hillsboro. He is unhappy about the quality issue and is determined to find the answers. In his mind, he is also thinking this would make a neat little data analysis project.

His father and grandfather gave him three questions to investigate.

What proportion of their candies is wrongly packed – dummies?

Is it a one-off error or is it systematic?

What is the true nature of our candies, i.e., what are the population characteristics?

He had 30 days, enough monetary resources (it’s his company after all), and RStudio at his disposal.

∞∞∞

For the next 25 days, he spent his time visiting around 40 local stores a day. From each store, he purchased a pack (100 candies) of “Classic Series Pineapple Candy.”

He also hired Keneth, the kid next door, to help him with some experiments. Keneth’s job was to take the candies out of the candy pack and separate the candies from the no-candies, like this.

He would also weigh them and enter the readings into a data table. Some of those candies become his prizes after he is done with the experiment.

Wilfred Jr. then analyzes them in RStudio.

All in all, they visited 1000 stores and prepared this data file. The data file is a matrix with 100 rows and 1000 columns. Each column is the data of the weight of the candies from each store.

They take the data from the first store and make a simple dot chart.

“See, the two dummies I found,” said Ken after looking at the plot on Wilfred’s screen.

“So that is like 2 in 100, a 2% error. Let’s take the second store and see what comes up,” said Wilfred.

“Hmm. There are no dummies in this sample. That’s 2 in 200 then – 1% error.”

“Should we plot a few stores and see if there are variations?” said Ken.

“Yes, we can do that. Let me take the first 12 stores and make the plots.” Wilfred types the following lines in RStudio and gets the plot.

# increasing number of stores #
par(mfrow=c(4,3))
for (i in 1:12)
{
  store_sample = candy_samples[,i]
  stripchart(store_sample,font=2,cex=1.25,pch=21,cex.axis=1.5)
}

They see some variation. Some candy samples have dummies, some do not. “It looks like the dummies are not a one-off incident. It is a pattern. There is some proportion that are dummies,” thought Wilfred.

They decided to dig deeper.

They also decided to use 1.5 grams as the cutoff weight to screen for dummies.

If a candy weighs less than 1.5 grams, they will call it a dummy and compute the proportion of candies that are dummies.
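The screening rule is easy to express in R. Here is a minimal sketch; since the data file itself is not shown here, a simulated store sample stands in for one column of the candy_samples matrix (the 2% dummy rate and the weight values are assumptions for illustration only).

```r
# Hypothetical stand-in for one store's pack: 100 candies, each a dummy
# (an empty wrapper weighing ~0.2 g) with probability 0.02; real candies
# weigh ~3 g. These numbers are assumed for illustration.
set.seed(1)
is_dummy = rbinom(100, 1, 0.02)
store_sample = ifelse(is_dummy == 1, rnorm(100, 0.2, 0.05), rnorm(100, 3, 0.3))

# The screening rule: weights below 1.5 grams are dummies.
dummies = which(store_sample < 1.5)
p_store = length(dummies) / length(store_sample) # proportion of dummies
```

With the real data, store_sample = candy_samples[, i] replaces the simulated vector.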

Wilfred Jr. knows that the maximum likelihood estimator of p, the true proportion of dummies, is \hat{p}=\frac{r}{n}.

r is the number of dummies in a sequence of n independent and identically distributed random samples. He is using this knowledge to estimate the proportion of dummies.

The more the samples, the better his estimate will be.

Each store he visits gives him a new sample. The total number of candies keeps increasing. The more samples he gets, the closer he gets to the true population.

As the samples increase, the estimate gets closer to the truth.

“Let’s compute two estimates. \hat{p_{1}} as something that keeps changing as we add more samples; getting better as n increases. \hat{p_{2}} as the estimate for the store. Since we are visiting 1000 stores, \hat{p_{2}} is a distribution of size 1000. It represents the overall distribution of the proportion of dummies across all our stores.”

Wilfred seems to be speaking to himself. Ken was only interested in counting the candies.

Seeing him indifferent, Wilfred showed him the following video on his screen.

Ken’s face lit up. “It is wobbly at the beginning, but seems to stabilize at 0.02,” he said.

“Yes, as the sample size increases, the estimated proportion of dummies is better represented. So, the estimate gets closer to the true value. It looks like our true proportion of dummies is somewhere around 2%. 0.02024 to be precise.”

This is the consistency property of estimators.

\displaystyle{\lim_{n\to\infty} P(|T_{n}(\theta)-\theta|>\epsilon)} = 0

Estimators are a function of sample size. As n approaches infinity, the sample estimate approaches the true parameter.
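You can see consistency in action with a quick simulation, independent of the candy data. Each draw below is a Bernoulli trial with an assumed true dummy proportion of 0.02; the running estimate \hat{p} wobbles for small n and settles near the truth as n grows.

```r
# Consistency: the running proportion approaches the true p as n grows.
set.seed(7)
p_true = 0.02
x = rbinom(100000, 1, p_true)            # 1 = dummy, 0 = candy
phat_running = cumsum(x) / seq_along(x)  # estimate after each new candy

phat_running[10]      # early estimates jump around
tail(phat_running, 1) # the final estimate sits close to 0.02
```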

Fisher put it this way.

Wilfred also showed another graphic, the distribution of \hat{p_{2}}.

This is the frequency distribution of the estimated proportion from each sample. For example, \hat{p_{2}} for the first store is 0.02, for the second store is 0 and so on.

Wilfred pointed out that the estimate \frac{r}{n} is an unbiased estimate of the true proportion of dummies. The distribution of the estimate \hat{p_{2}} will be centered on the true proportion p.

He is showing the expected value of the distribution, 0.02024, as the red triangle in the plot.

E[\hat{p_{2}}] = 0.02024, the same value we got from the large sample (population).

While there is variability from store to store, the average proportion of dummies is 0.02024.

Keneth is now curious about how to do these calculations and make such visuals.

Wilfred showed him his code.

# Proportions #

nstores = ncol(candy_samples) # number of stores
ncandies = nrow(candy_samples) # number of candies per store

increase_sample = NULL # an empty vector that grows as more and more samples are added --> approaches the population.
n = NULL # an empty vector that will be filled by the growing sample size.

phat1_bad = NULL # proportion changes over samples... and stabilizes: Consistency or Ergodicity
phat2_bad = matrix(NA,nrow=nstores,ncol=1) # proportion value for each store. The expected value of this vector should be the population proportion E[phat]

for (i in 1:nstores)
{
  store_sample = candy_samples[,i]
  increase_sample = c(increase_sample,store_sample)
  n = c(n,length(increase_sample))

  p1 = length(which(increase_sample<1.5))/n[i]
  phat1_bad = c(phat1_bad,p1)

  p2 = length(which(store_sample<1.5))/ncandies
  phat2_bad[i,1] = p2
}

# plot showing the consistency property #
sam = seq(from=1,to=1000,by=20)
for(i in 1:length(sam))
{
  plot(0,0,col="white",xlim=c(0,100000),ylim=c(0.005,0.03),font=2,cex.lab=1.25,font.lab=2,xlab="Sample Size",ylab="Proportion")
  lines(n[1:sam[i]],phat1_bad[1:sam[i]])
  txt = paste("n = ",n[sam[i]])
  text(10000,0.03,txt,col="red")
  Sys.sleep(0.5)
}

# Histogram of phat2_bad # 
hist(phat2_bad,font=2,main="",xlab="Proportion",font.lab=2,col="grey")
points(mean(phat2_bad),0.0001,pch=17,cex=2.5,col="red")

While Ken and Wilfred start exploring the candy characteristics – true nature, you can set up today’s experiment yourself and understand the property of consistency and bias more closely using the data. See you in one week. Happy Coding.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 68 – Bias joins Variance

In the last week’s “guess the age” contest, Madam Chairman Yellen was the clear winner. The expected value of the estimated age was 69.69 years, a nominal bias of 1.31 years from her true age.

However, the great Thomas Sowell is the clear winner of our hearts. May he always remain a generation young, like his ideas.

Look at the table once again.

Can you notice the variation in the estimates of the groups?

Yes, Chairman Yellen’s estimated age ranges from 62 years to 78 years, while Dr. Sowell’s estimated age ranges from 55 years to 70 years.

Based on the data from the 16 groups, I computed the variance (and the standard deviation) of the estimated ages.

For Chairman Yellen, the variance of the estimated age is 22.10 and the standard deviation of the estimated age is \sqrt{22.10}=4.7 years.

For Dr. Sowell, the variance is 23.98 and the standard deviation is \sqrt{23.98}=4.9 years.

The variance of the estimator is an important property. It tells us about the spread of the estimate, i.e., how widely the estimate is distributed.

For any estimator \hat{\theta}, we compute the expected value E[\hat{\theta}] as a measure of the central tendency and how far it could be from the truth – Bias, and V[\hat{\theta}] as a measure of the spread of the distribution.

An estimator should be understood as a probability distribution.

The standard deviation of the estimator is called the standard error of an estimator. In our age guessing experiment, the standard error of Dr. Sowell’s estimated age is 4.9 years.

The standard error of \hat{\theta} is \sqrt{V[\hat{\theta}]}.


In lesson 67, while learning the concept of bias, we deduced that the sample mean \hat{\mu} = \bar{x} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} is an unbiased estimate of the population mean \mu.

In other words, we derived that E[\hat{\mu}] = \mu .

Now, let’s derive the variance of the sample mean V[\hat{\mu}] or V[\bar{x}].

For this, let us consider that we have a random sample x_{1}, x_{2}, ..., x_{n} of size n from a population with a mean \mu and variance \sigma^{2}.

Since x_{1}, x_{2}, ..., x_{n} are independent and identically distributed random samples from the population with true parameters \mu and \sigma^{2}, they will have the same expected value and variance as the population. E[x_{i}]=\mu and V[x_{i}]=\sigma^{2}.

\hat{\mu} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} is the sample mean.

V[\hat{\mu}] = V[\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}]

V[\hat{\mu}] = V[\frac{1}{n}(x_{1}+x_{2}+...+x_{n})]

V[\hat{\mu}] = \frac{1}{n^{2}}V[(x_{1}+x_{2}+...+x_{n})]

Where did that come from? Was it lesson 27? V[aX]=a^{2}V[X]

V[\hat{\mu}] = \frac{1}{n^{2}}(V[x_{1}]+V[x_{2}]+...+V[x_{n}])

Because the samples are independent, the covariance terms do not exist in the summation.

 V[\hat{\mu}] = \frac{1}{n^{2}}(\sigma^{2}+\sigma^{2}+...+\sigma^{2})=\frac{1}{n^{2}}*n \sigma^{2}=\frac{\sigma^{2}}{n}

The variance of the sample mean V[\hat{\mu}] = V[\bar{x}] =\frac{\sigma^{2}}{n}.

The standard error of the sample mean is \frac{\sigma}{\sqrt{n}}. It is a measure of the precision of the estimator.
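A short simulation verifies this result. The numbers (\mu = 10, \sigma = 2, n = 25) are assumptions for illustration; the empirical variance of the sample mean over many repetitions should land near \sigma^{2}/n = 0.16.

```r
# Verify V[xbar] = sigma^2 / n by repeated sampling.
set.seed(11)
mu = 10; sigma = 2; n = 25
xbar = replicate(20000, mean(rnorm(n, mu, sigma)))

var(xbar)   # empirical variance of the sample mean
sigma^2 / n # theoretical value, 0.16
```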


Joe has a question

Do you remember last week’s dilemma?

Joe wants to choose a path; an estimator for the true population mean.

We tried to help him by computing the bias of each of the three estimators. The idea was to choose the one that has no bias. But it turned out that all three estimators are unbiased.

How then, can we select the best estimator?

Look at the estimators again.

\hat{\mu_{1}} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}

\hat{\mu_{2}} = x_{[1]}+\frac{x_{[n]}-x_{[1]}}{2}

\hat{\mu_{3}} = median(x_{1},x_{2},...,x_{n})

I told you that estimators should be understood as probability distributions and that we can compute the expected value E[\hat{\theta}] for the bias, and V[\hat{\theta}] as a measure of the spread of the distribution.

Let’s compute the variance of these three estimators.

V[\hat{\mu_{1}}] = V[\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}]

\hat{\mu_{1}} is the sample mean. We just derived its variance as V[\hat{\mu_{1}}] = \frac{\sigma^{2}}{n}.

So let’s work on the second estimator.

V[\hat{\mu_{2}}] = V[x_{[1]}+\frac{x_{[n]}-x_{[1]}}{2}]

V[\hat{\mu_{2}}] = V[ x_{[1]} + \frac{1}{2}x_{[n]} - \frac{1}{2}x_{[1]}] = V[\frac{1}{2}x_{[1]} + \frac{1}{2}x_{[n]}]

V[\hat{\mu_{2}}] = \frac{1}{4}\sigma^{2} + \frac{1}{4}\sigma^{2}

V[\hat{\mu_{2}}] = \frac{1}{2}\sigma^{2}

The variance of the third estimator \hat{\mu_{3}} will be \sigma^{2} if the sample size is odd-numbered, or \frac{1}{2}\sigma^{2} if the sample size is even-numbered.

So, the three estimators \hat{\mu_{1}}, \hat{\mu_{2}}, \hat{\mu_{3}} are all unbiased. Their variances are as follows:

V[\hat{\mu_{1}}]=\frac{\sigma^{2}}{n}

V[\hat{\mu_{2}}]=\frac{\sigma^{2}}{2}

V[\hat{\mu_{3}}]=\sigma^{2} or \frac{\sigma^{2}}{2}

The three estimator probability distributions are centered on the truth (\mu), but their spreads are different. If the sample size n is greater than 2 (n>2), estimator \hat{\mu_{1}} has the lowest variance.

They will look like this visually.

Among all the estimators that are unbiased, we choose the one which has minimum variance. \hat{\mu_{1}} is the minimum variance unbiased estimator. It is more likely to produce an estimate close to \mu, the truth.

Among all estimators which are unbiased, choose the one which has minimum variance. This chosen \hat{\theta} is called the minimum variance unbiased estimator (MVUE) of \theta, and it is most likely among all unbiased estimators to produce an estimate close to the truth. This principle is called the principle of minimum variance unbiased estimation.


You must be wondering what if an estimator has low variance, but is biased. Like this.

Isn’t \hat{\theta_{1}} more likely to produce an estimate close to the truth?

Perhaps.

So there should be a way to combine bias and variance into one measure to help us with the selection.

The mean squared error is that measure. Let’s see how to derive this measure.

We have an estimator \hat{\theta} for the true parameter \theta.

The error of this estimator from the truth is \hat{\theta}-\theta.

The squared error is (\hat{\theta}-\theta)^{2}.

The mean squared error (MSE) is the expected value of the squared error.

E[(\hat{\theta}-\theta)^{2}]

MSE = E[(\hat{\theta}-\theta)^{2}]

MSE = V[\hat{\theta}-\theta] + (E[\hat{\theta}-\theta])^{2}

Can you tell how we wrote the above expression?

MSE = V[\hat{\theta}] + V[\theta] + (E[\hat{\theta}] - E[\theta])^{2}

MSE = V[\hat{\theta}] + (E[\hat{\theta}] - \theta)^{2}

This is because, E[\theta]=\theta and V[\theta]=0.

V[\hat{\theta}] is the variance of the estimator.

E[\hat{\theta}] - \theta is the bias of the estimator.

So, MSE = Variance + Bias^{2}

For unbiased estimators, the mean squared error is equal to the variance of the estimator.

We can use this measure to compare two estimators. If MSE(\hat{\theta_{1}})<MSE(\hat{\theta_{2}}) we can say that \hat{\theta_{1}} is a better estimator, a more efficient estimator in producing the truth.

The ratio \frac{MSE(\hat{\theta_{1}})}{MSE(\hat{\theta_{2}})} is called relative efficiency of \hat{\theta_{2}} to \hat{\theta_{1}}.
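A simulation can make this comparison concrete. For normal samples, the sample mean and the sample median are both unbiased for \mu, but the median has a larger variance (about \pi\sigma^{2}/(2n) for large n), so its MSE is larger. The parameter values below are assumptions for illustration.

```r
# Compare the MSE of two unbiased estimators of mu: sample mean vs. median.
set.seed(3)
mu = 5; sigma = 1; n = 50
est = replicate(20000, { x = rnorm(n, mu, sigma); c(mean(x), median(x)) })

mse_mean   = mean((est[1, ] - mu)^2) # near sigma^2/n = 0.02
mse_median = mean((est[2, ] - mu)^2) # near pi*sigma^2/(2n), about 0.031
mse_mean / mse_median                # ratio of the MSEs: less than 1
```

Since both estimators are unbiased here, each MSE is just the estimator's variance, and the sample mean wins.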

If you can create an estimator that has low variance, but you have to compromise on the bias, you can go for it, as long as the reduced variance is greater than the squared bias.

In estimating true parameters, there is always a tradeoff between bias and variance.

To close off for the week, let me ask you three questions.

If \frac{MSE(\hat{\theta_{1}})}{MSE(\hat{\theta_{2}})} is the relative efficiency, what is consistency?

Is the estimator for variance \hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}} unbiased? Did you do your homework?

When was the last time you were on RStudio?


Lesson 67 – Bias

“Nooooo,” “No way…” “Ooooh,” “Yessss,” “I told ya,” “I knew it.”

These are some of the funny reactions from our high-school students when the true age is finally revealed; because the bias in their estimation is also revealed.

Every year, I start off the data analysis boot camp with this game from the bag of tricks for teaching statistics by Andrew Gelman.

The students sit in groups of four and look at rolling pictures to guess their ages. They have their own formulas (estimators) for the age!

What you saw in the opening image are the guessed ages of the former Chairman of the Federal Reserve, Janet Yellen, and the great economist Thomas Sowell.

Do you see anything peculiar?

Yes. The estimated age of Chairman Yellen is somewhat close to her true age (as per Wikipedia) and the estimated age of Dr. Sowell is not even close (again, as per Wikipedia).

Let’s compute the average of all the guesses, the expected value of the estimated age.

For Chairman Yellen it is 69.69 years, a difference of 1.31 years from her true age.

Pretty close to the truth. The students’ estimate has very little bias.

For Dr. Sowell, it is 61.88 years, a difference of 25.12 years from his true age.

Far from the truth. The students’ estimate is biased.

Let’s spend a little more time on the estimated ages.

There are 16 groups that guessed the ages based on a picture of the person. Let us use a notation \hat{\theta} for this estimate. So, after heavy thinking and vehement deliberation within the group, the estimate from group A is \hat{\theta}=73 for Chairman Yellen.

Group B is also filled with thinkers and negotiators, so their estimate for Chairman Yellen’s age is \hat{\theta}=70.

Likewise, the 16 groups provided a distribution of estimates. The expected value E[\hat{\theta}] of this distribution is 69.69 years.

The true age \theta is 71 years.

The bias is E[\hat{\theta}]-\theta=-1.31

Look at this visual. I am taking the 16 estimates and plotting them as a histogram (or a probability density). The true age is also shown in the histogram. The estimates are distributed about the true age. The mean of this distribution is very close to 71.

Dr. Sowell on the other hand, somehow seemed younger in the minds of the high-school students. The expected value of the estimate E[\hat{\theta}] is 61.88 years. The true age is 87 years, a bias (E[\hat{\theta}]-\theta) of around 25 years. The bias could be because of the picture I am using, his pleasant smile or his aura!

This idea of measuring the closeness of the estimate to the true parameter is called the bias of the estimate.

An estimator (\hat{\theta}) (i.e., a formulation to compute the true population parameter) is unbiased if, on average, it produces values that match the truth: E[\hat{\theta}]=\theta. In other words, if we have a distribution of the estimates, and its mean is close to (or centered on) the truth, the estimator is unbiased.

♦Θ♦


In lesson 63, we learned that the maximum likelihood estimator of p, the true probability of success for a Binomial distribution is \hat{p}=\frac{r}{n}, where r is the number of successes in a sequence of n independent and identically distributed Bernoulli random numbers. Do you think \hat{p} is unbiased?

Let’s find out.

The bias of \hat{p} is defined by E[\hat{p}]-p.

E[\hat{p}]=E[\frac{r}{n}]=\frac{1}{n}E[r]=\frac{1}{n}*np=p

The key point in the above derivation is that the expected value of the number of successes (r) for a binomial distribution is np.

The estimate \hat{p}=\frac{r}{n} is an unbiased estimate of the true probability of success.

The distribution of the estimate \hat{p} will be centered on the true probability p.
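You can confirm this unbiasedness with a few lines of R. Here, 50,000 replicates of a Binomial experiment (the values p = 0.3 and n = 40 are assumed for illustration) give a distribution of \hat{p}=r/n whose mean sits on the true p.

```r
# E[phat] = p: the average of r/n across many replicates recovers p.
set.seed(5)
p = 0.3; n = 40
phat = rbinom(50000, size = n, prob = p) / n # 50,000 estimates of p

mean(phat) # centered on the true probability 0.3
```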

§⇔§

The maximum likelihood estimators for the Normal distribution are \hat{\mu} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} and \hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}. Do you think they are unbiased?

\hat{\mu} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} is the sample mean. Let’s compute the bias of this estimate.

E[\hat{\mu}] = E[\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}]

E[\hat{\mu}] = E[\frac{1}{n}(x_{1}+x_{2}+...+x_{n})]

E[\hat{\mu}] = \frac{1}{n}E[(x_{1}+x_{2}+...+x_{n})]

E[\hat{\mu}] = \frac{1}{n}(E[x_{1}]+E[x_{2}]+...+E[x_{n}])

Since x_{1}, x_{2}, ..., x_{n} are random samples from the population with true parameters \mu and \sigma, E[x_{i}]=\mu and V[x_{i}]=\sigma^{2}. In other words, since x_{i}s are independent and identically distributed, they will have the same expected value and the variance as the population.

So, E[\hat{\mu}] = \frac{1}{n}(\mu+\mu+...+\mu)=\frac{1}{n}*n \mu=\mu

The sample mean \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} is an unbiased estimate of the population mean \mu.

What about the variance estimator \hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}?

You know the chore now. Compute the bias of this estimator and find out for yourself. Passive minds have to wait until next week.

♦Θ♦


Joe is a smart chap. The other day, he asked me if he could use any other estimator to get the population mean (truth). I probed him further, and he came up with three estimators to estimate the true mean \mu.

\hat{\mu_{1}} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}

\hat{\mu_{2}} = x_{[1]}+\frac{x_{[n]}-x_{[1]}}{2}

\hat{\mu_{3}} = median(x_{1},x_{2},...,x_{n})

The first estimator \hat{\mu_{1}} is known to all of us. It is the sample mean.

The second estimator \hat{\mu_{2}} is computing the central tendency as the sum of the smallest value x_{[1]} and half the range of the data \frac{x_{[n]}-x_{[1]}}{2}. It is a reasonable way to tell where the center will be.

The third estimator \hat{\mu_{3}} is the median of the sample.

Can you help Joe choose the correct path?

Your immediate reaction would be, “compute the bias for each estimator and select the one that has the smallest bias.”

Fair enough. Let’s try that then.

We already showed that the first estimator \hat{\mu_{1}} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} is unbiased.

Employing the bias computation for the second estimator, we can see that

E[\hat{\mu_{2}}] = E[x_{[1]}+\frac{x_{[n]}-x_{[1]}}{2}] = E[x_{[1]}] + \frac{1}{2}(E[x_{[n]}]-E[x_{[1]}])

E[\hat{\mu_{2}}] = \mu + \frac{1}{2}(\mu-\mu)=\mu

The second estimator is also unbiased.

Now, depending on whether there is an even number of samples or an odd number of samples, you can derive that the third estimator is also unbiased.

How then, can Joe select the best estimator?

To be continued…


Lesson 66 – Spare a moment

Hello! Do you have a moment?

What is the first thing that comes to your mind when you hear the word “moment?”

For some of you, it is “let me capture and share the moment.”

For some of you, it is the coffee cup cliches — “live in the moment,” “enjoy every moment,” and “be happy for this moment.”

Some of you may actually be believing in the “power of the present moment” or “don’t wait for the perfect moment, take the moment and make it perfect” stuff!

How many of you remembered torque from your high school physics lesson?

The torque of a force, or moment of a force about the axis of rotation?

\Gamma = \textbf{r} * \textbf{F}

Moment of a force is the product of the force and its distance from an axis. It is a measure of the tendency of the force to cause a body to rotate.

I am sure you are thinking that the moment you spared for me is not worth it. You don’t want to recollect those moments from the horrid high school days.

Bear with me for a few more moments. You will see why we are talking moments in data analysis.

You know by now that we can use a sample to estimate the true value of the parameter of the population. The formulation used to get that estimate is called an estimator, and there are methods of estimation of these parameters. In lesson 63, lesson 64 and lesson 65, we dabbled with the method of Maximum Likelihood Estimation (MLE).

Another method of estimating population parameters from the sample is the Method of Moments.

Let’s see how the idea of the moment from physics is related to the Method of Moments.

Assume that we have a sample of five data points x_{1}, x_{2}, x_{3}, x_{4}, x_{5} from a population with a probability density function f_{X}(x). Let’s place them on a number line like this.

Imagine that each number is represented by an equal-weighted ball. Since each data point is independent and identically distributed, i.e., equally likely, it is reasonable to assume that the probability of their individual occurrences (\frac{1}{n}) is the mass of the data point.

For any data point, the torque, or the moment about the axis of rotation is \Gamma = \textbf{r} * \textbf{F}.

For example, a data point at a distance of x_{i} from the axis has a moment of x_{i}*\frac{1}{n}. I am dropping the standard gravity term, and you will see why shortly.

Since there is more than one force acting on the sample number line, we can compute the total torque as \Gamma = {\displaystyle \sum_{i=1}^{n}r_{i} * F_{i}}.

In other words, the total moment of the sample data about the axis of rotation is {\displaystyle \sum_{i=1}^{n}x_{i}*\frac{1}{n}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}.

Did you notice that this is the equation for the sample mean? \bar{x}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}. The centroid of the data points.

The population has a probability density function f_{X}(x).

We saw in Lesson 64 how to divide the space into equal intervals of length dx. The probability that a given sample data point x_{i} falls in the range dx is f_{X}(x_{i})dx for an infinitesimally small range dx. As in the sample case, we can assume this is the mass of any subdivision. Hence, the moment of this infinitesimal body about the rotation axis is x_{i}*f_{X}(x_{i})dx.

For the entire population, the total moment is {\displaystyle \int_{- \infty}^{\infty}x*f_{X}(x)dx}.

I know. You will tell me that this is the expected value E[X], or the centroid of the probability density function.

E[X] = {\displaystyle \int_{- \infty}^{\infty}x*f_{X}(x)dx}

If the five data points x_{1}, x_{2}, x_{3}, x_{4}, x_{5} are originating from a population with a probability density function f_{X}(x), it is reasonable to equate the total moment of the sample data to the total moment of the population data. The centroid of the sample is equal to the centroid of the probability density functions.

{\displaystyle \int_{- \infty}^{\infty}x*f_{X}(x)dx} =\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}


E[X] = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}


Let me pester you again!

What is the moment of inertia?

No worries. Let’s activate your deep memory neurons. Imagine that the sample data points are rotating on the plane about the axis with a velocity v = r\omega, where \omega is the constant angular velocity of the plane. In our example of five data points, the velocity of any given point will be v_{i} = x_{i}\omega.

If the body is rotating, it will be subject to a tangential force of m\frac{dv}{dt}. For our sample point which is at a radial distance of x_{i} from the axis, the tangential force will be \frac{1}{n}\frac{dv}{dt} = \frac{1}{n}x_{i}\frac{d \omega}{dt}.

To simplify life, let’s call \frac{d \omega}{dt}=\alpha, the angular acceleration of the point.

The tangential force is \frac{1}{n}x_{i}\alpha, and the torque of the tangential force applied on one point is x_{i}*\frac{1}{n}x_{i}\alpha = \frac{1}{n}x_{i}^{2}\alpha.

The total torque of all the tangential forces = {\displaystyle \sum_{i=1}^{n}\frac{1}{n}x_{i}^{2}\alpha}.

The term {\displaystyle \sum_{i=1}^{n}\frac{1}{n}x_{i}^{2}} is called the moment of inertia of the points.

Just looking at the equation, you can say that the moment of inertia of points far apart is larger than the moment of inertia of the points closer to the axis.

This moment of inertia is called the second moment and it gives us a sense of how spread out the data points are about the axis.

Sample moment of inertia = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{2}}.

By extension to the population, we can write the population moment of inertia, or the population second moment as

E[X^{2}]={\displaystyle \int_{- \infty}^{\infty}x^{2}f_{X}(x)dx}.

Notice the generalization E[X^{2}] for the second moment, just like E[X]={\displaystyle \int_{- \infty}^{\infty}x*f_{X}(x)dx} is the first moment.

Population moments are the expected values of the powers of the random variable X.

I can sense that you are charged up by the word “spread.” Yes, the second moment will be used to get the second parameter, the variance of the distribution.

We can equate the sample moment of inertia to the population moment of inertia.


E[X^{2}] = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{2}}


This form of equating sample moments to the population moments is called the Method of Moments for parameter estimation.

It was developed by Karl Pearson in the early 1900s. Pearson demonstrated this method by fitting different forms of frequency curves (probability density functions) to data by calculating as many moments from the sample data as there are parameters of the probability density function.

A generalized representation of the Method of Moments can be done like this.


E[X^{k}] = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{k}}


To estimate the k parameters of the probability density function, we can equate the k population moments to the k sample moments.

If the probability density function has one parameter, the first moment equation is sufficient. If the function has two parameters, we have the first and the second moment equations — two unknowns, two equations to solve. If the function has three parameters, we go for three moment equations.

Time for some action

Let’s solve for the parameters of the normal distribution.

There is a sample of n data points x_{1}, x_{2}, x_{3}, ..., x_{n} that we think originated from a normal distribution (population) with a probability density function f_{X}(x) = \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{\frac{-1}{2}(\frac{x-\mu}{\sigma})^{2}}.

x_{1}, x_{2}, x_{3}, ..., x_{n} \sim N(\mu, \sigma)

\mu and \sigma are the parameters of the normal probability density function. Let’s apply the Method of Moments to estimate these parameters from the sample. In other words, let’s formulate the equations for \mu and \sigma.

The first population moment is E[X^{1}]. The first sample moment is \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{1}}.

For the normal distribution, we know that the expected value E[X] is the mean \mu.

\mu = E[X] = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}

The second population moment is E[X^{2}]. The second sample moment is \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{2}}.

E[X^{2}]. Hmm. You have to go down the memory lane, to Lesson 26 to realize that E[X^{2}]=V[X] + (E[X])^{2}.

.
.
.
Did you see that? Yes, so E[X^{2}]=\sigma^{2} + \mu^{2}. Equating this to the sample second moment, we get

\sigma^{2} + \mu^{2} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{2}}

\sigma^{2} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{2}} - \mu^{2}

\sigma^{2} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{2}} - (\bar{x})^{2}

\sigma^{2} = \frac{1}{n} \Big ( \sum x_{i}^{2} - n\bar{x}^{2}\Big )

\sigma^{2} = \frac{1}{n} \Big ( \sum x_{i}^{2} - 2n\bar{x}^{2} + n \bar{x}^{2} \Big )

\sigma^{2} = \frac{1}{n} \Big ( \sum x_{i}^{2} - 2\bar{x}n\bar{x} + n \bar{x}^{2} \Big )

\sigma^{2} = \frac{1}{n} \Big ( \sum x_{i}^{2} - 2\bar{x}\sum x_{i} + n \bar{x}^{2} \Big )

\sigma^{2} = \frac{1}{n} \Big ( \sum x_{i}^{2} - \sum 2\bar{x} x_{i} + n \bar{x}^{2} \Big )

\sigma^{2} = \frac{1}{n} \Big ( \sum x_{i}^{2} - \sum 2\bar{x} x_{i} + \sum \bar{x}^{2} \Big )

\sigma^{2} = \frac{1}{n} \sum (x_{i}^{2} - 2\bar{x} x_{i} + \bar{x}^{2})

\sigma^{2} = \frac{1}{n} \sum (x_{i} - \bar{x})^{2}
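In R, the two method-of-moments equations reduce to the sample mean and the biased (divide-by-n) sample variance. A minimal sketch, using a simulated sample in place of real data:

```r
# Method of Moments for the normal distribution
# mu_hat     : first moment equation  -> sample mean
# sigma2_hat : second moment equation -> (1/n) * sum(x^2) - x_bar^2
set.seed(1)
x <- rnorm(1000, mean = 10, sd = 2)   # stand-in for an observed sample

n          <- length(x)
mu_hat     <- mean(x)
sigma2_hat <- mean(x^2) - mu_hat^2    # same as (1/n) * sum((x - mu_hat)^2)

c(mu_hat = mu_hat, sigma2_hat = sigma2_hat)
```

Note that R's `var()` divides by n - 1, so the method-of-moments variance is `var(x) * (n - 1) / n`.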


Let’s spend a few more moments and solve for the parameters of the Gamma distribution.

The probability density function of the Gamma distribution is

f(x) = \frac{\lambda e^{-\lambda x}(\lambda x)^{r-1}}{(r-1)!}

It has two parameters, r and \lambda.

\lambda is the rate parameter (the inverse of the scale parameter). It controls the spread of the distribution, and r is called the shape parameter because it controls the shape of the distribution.

We know that the expected value of the Gamma distribution E[X] = \frac{r}{\lambda}. If you forgot this, solve the integral \int xf(x)dx for the Gamma distribution. Equating this first population moment to the sample moment, we can see that

\frac{r}{\lambda} = \frac{1}{n}\sum x_{i} = \bar{x}

r = \lambda \bar{x}

Then, we know that V[X] for the Gamma distribution is \frac{r}{\lambda^{2}}.

Using this with the second population moment, E[X^{2}]=V[X] + (E[X])^{2}, we can see that E[X^{2}] = \frac{r(r+1)}{\lambda^{2}}.

We can now equate the second population moment to the second sample moment.

E[X^{2}] = \frac{1}{n}\sum x_{i}^{2}

\frac{r(r+1)}{\lambda^{2}} = \frac{1}{n}\sum x_{i}^{2}

\frac{\lambda \bar{x} (\lambda \bar{x}+1)}{\lambda^{2}} = \frac{1}{n}\sum x_{i}^{2}

\frac{\bar{x}(\lambda \bar{x}+1)}{\lambda} = \frac{1}{n}\sum x_{i}^{2}

 \bar{x} + \lambda \bar{x}^{2} = \frac{\lambda}{n}\sum x_{i}^{2}

\lambda (\frac{1}{n}\sum x_{i}^{2} - \bar{x}^{2}) = \bar{x}

\lambda = \frac{\bar{x}}{\frac{1}{n}\sum(x_{i}-\bar{x})^{2}}

So,

r = \frac{\bar{x}^{2}}{\frac{1}{n}\sum(x_{i}-\bar{x})^{2}}
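To see the Gamma equations in action, here is a short R sketch. The sample is simulated for illustration; `rgamma`'s `rate` argument matches \lambda in the density above.

```r
# Method of Moments for the Gamma distribution
# lambda_hat = x_bar / [(1/n) * sum((x - x_bar)^2)]
# r_hat      = lambda_hat * x_bar
set.seed(2)
x <- rgamma(5000, shape = 3, rate = 0.5)   # true r = 3, true lambda = 0.5

x_bar      <- mean(x)
s2_biased  <- mean((x - x_bar)^2)          # (1/n) * sum((x - x_bar)^2)
lambda_hat <- x_bar / s2_biased
r_hat      <- lambda_hat * x_bar

c(r_hat = r_hat, lambda_hat = lambda_hat)  # should land near 3 and 0.5
```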

Very tiring, I know. You will rarely find all this algebra worked out in textbooks. So, these moments are worth it.

I asked you to spare a moment. Now you are going “this is quite a moment.” What do you think? Would you remember the Method of Moments next time you spare a moment?

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 65 – Most Likelihood

“Maybe there is one estimate.”

Jill is staring at the plot on his desktop. The 6×6 cubicle and the gloomy weather outside is certainly not helping his concentration. Jill, the new intern at the Department of Environmental Protection, is working on developing estimates for the probability of water quality violations in summer at Pier 84, NYC’s Hudson River Park. To toy around with the idea, his boss gave him a ‘sample data’ for seven days during August 1 – August 7, 2016.

“I should perhaps develop two estimates; the probability of violation during the day and probability of violation during the night.”

Dissolved oxygen (DO) is used as an indicator by the New York State Department of Environmental Conservation to measure ecological health. A typical standard on acceptable water quality is a DO greater than 5 mg/L.

Jill saved the excel file, unplugged the pen-drive from his computer, tossed it in his backpack along with a few other folders, and did something on his phone. He took a deep breath, got up and walked out to the front door of his office.

It was around 6 pm when he stepped out.

♦♦♦

A black Ford with an Uber sticker on the front pane stopped in front of an apartment building on 96th street and Amsterdam Avenue. Jill is back at his apartment. Joe, his roommate, has already ordered Chinese. They plan to work on Jill’s problem over the evening along with writing the end of the semester project reports. The floor in the living room is filled with books, papers of all sorts, pens, and markers. Along the glazed window facing Amsterdam Avenue, there are half-burnt candles. It is their finals week.

“I think two separate estimates will be appropriate,” said Jill to Joe as they both look at the plot on Jill’s laptop.

“The sample data I have has 672 data points of measured DO every 15 minutes. I have to develop estimates for p = P(DO < 5). I am thinking of dividing the data into daytime observations and nighttime observations and providing p_{day} and p_{night}. But a single estimate for all the data, p_{full}, may also be okay.”

“Is there reason to believe that there can be two separate estimates?” asked Joe.

“Yes. Did you see that DO is higher during the day and is low during the night? I have done some study on this and found that such a daily cycle is due to the day-night change in phytoplankton growth. The DO increases at sunrise as phytoplankton begins photosynthesis and produces oxygen. During later afternoon and evening, the chlorophyll starts decreasing, and during the night, it is low.”

“I started making some simple assumptions on the day and night measurements. Morning 6 am to evening 6 pm is daytime and 6 pm to 6 am in the morning is nighttime. In the ‘sample data,’ there are 364 daytime measurements and 308 nighttime measurements. During the daytime, there are 185 violations, i.e., when DO is less than 5 mg/L. During the nighttime, I found 233 violations. So,

p_{day} = \frac{185}{364} = 0.51

p_{night} = \frac{233}{308} = 0.76.

However, a small part of me is saying, I count all data and provide

p_{full} = \frac{418}{672} = 0.62,

since the overall violations are 418 for the 672 measurements.

I am not able to decide.”

“Why don’t you test both your assumptions using the maximum likelihood approach?” Joe suggested. “Since the maximum likelihood estimation method computes the parameters of the population distribution that agree most closely with the sample observations, you can test your data with these two formulations. The formulation that yields the higher likelihood wins. What say?”

“That is an interesting thought. How do we set up such a test?” asked Jill as he cracks open his fortune cookie.

“Simple. Since you are looking for violations, it is safe to assume a Bernoulli distribution for the data: X = 1 if DO < 5 mg/L and X = 0 if DO > 5 mg/L, with a parameter p, the probability of violation. Your sample data [x_{1}, x_{2}, ..., x_{n}] can be used to find the optimal parameter that maximizes the likelihood (joint probability of occurrence) of observing the sample.”

As Joe stretches to grab a piece of paper, Jill speaks up.

“Okay, let’s say there are two different estimators to estimate the probability of violation; p_{full} assumes that all the data comes from the same Bernoulli distribution, and p_{day},p_{night} assumes that there are different Bernoulli distributions for nighttime samples and daytime samples. You mean to say, we can test these two theories?”

“Precisely,” said Joe as he shows his scribble on the paper.

Assumption 1: Full data comes from the same population.

Assumption 2: Partition the data into night and day.

Jill seems to be excited. He is already visualizing himself presenting this testing framework to his boss. “Let’s compute the log-likelihood for the first assumption.” He pulls up his laptop to open lesson 63 of the data analysis classroom where Joe and Devine discussed the likelihood function for the Bernoulli distribution.

“The log-likelihood function for Bernoulli distribution is ln(L) = rln(p)+(n-r)ln(1-p). n is the number of samples, 672 in our study, r is the number of 1s in the sample data, 418 total violations, and p is the parameter, the probability of violation. p_{full}=0.62.”

“Yes, and the log-likelihood is -445.59,” said Joe as he completes the equation on the paper.

Assumption 1: Full data comes from the same population.

ln(L_{1}) = r_{full}ln(p_{full})+(n_{full}-r_{full})ln(1-p_{full})

ln(L_{1}) = 418*ln(0.62)+(672 - 418)ln(0.38)

ln(L_{1}) = -445.59

Jill gives an agreeable nod. “How can we write the log-likelihood equation for the second assumption?” he asks immediately.

Joe wastes no time. He pens down a few quick lines.

L_{2} = {\displaystyle \prod_{i=1}^{n_{day}}f(x_{i}|p_{day})}{\displaystyle \prod_{i=1}^{n_{night}}f(x_{i}|p_{night})}

L_{2} = p_{day}^{r_{day}}*(1-p_{day})^{n_{day}-r_{day}}*p_{night}^{r_{night}}*(1-p_{night})^{n_{night}-r_{night}}

“If the data is partitioned into night and day, there will be two functions, one for the day and one for the night. If the sample is independent and identically distributed, the overall likelihood is the product of the two equations,” he says. “Be watchful with the assumption of independence, though. You have to check for it.”

Jill stares at it for a bit, looks up at the ceiling, and seems to concur with the thought.

“Let me write the log-likelihood function for this,” he said as he starts writing the equation on the paper.

ln(L_{2}) = r_{day}ln(p_{day}) + (n_{day}-r_{day})ln(1 - p_{day}) + r_{night}ln(p_{night}) + (n_{night}-r_{night})ln(1 - p_{night}) 

ln(L_{2}) = 185ln(0.51) + (364 - 185)ln(0.49) + 233ln(0.76) + (308 - 233)ln(0.24)

ln(L_{2}) = -423.24

“So, ln(L_{1}) = -445.59 and ln(L_{2}) = -423.24. ln(L_{2}) > ln(L_{1}). Clearly, Assumption 2 seems to be more appropriate based on the data and the maximum likelihood method.”
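Jill's comparison takes only a few lines of R. The counts below are the ones from his sample (185 of 364 daytime and 233 of 308 nighttime violations); the exact log-likelihoods differ slightly from hand calculations with rounded p values.

```r
# Log-likelihood of a Bernoulli sample: r*ln(p) + (n - r)*ln(1 - p)
bern_loglik <- function(r, n, p) r * log(p) + (n - r) * log(1 - p)

# Assumption 1: one population for all 672 measurements
p_full <- 418 / 672
lnL1   <- bern_loglik(418, 672, p_full)

# Assumption 2: separate day and night populations
p_day   <- 185 / 364
p_night <- 233 / 308
lnL2    <- bern_loglik(185, 364, p_day) + bern_loglik(233, 308, p_night)

c(lnL1 = lnL1, lnL2 = lnL2)   # lnL2 > lnL1 favors the day/night split
```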

“Yes. You can show this work tomorrow and ask for more data from your boss,” said Joe as he grabs some papers from the floor.

“Absolutely. I will prepare a few graphics with data and equations and show it to him. I think he will like it.”

So, as the night grows, Joe starts his project report, and Jill is preparing himself for his meeting tomorrow.

♦♦♦

A thought on the precision of the estimate has come to mind as Jill is typing up his equations.

“The maximum likelihood estimation method seems to be an elegant way to come up with parameters for probability distributions. The likelihood itself can be used to test assumptions, as we did here. But, are we confident enough that the maximization produces precise estimates? I mean, maximization is like finding the peak of a function. How can we be confident that the value where we find the peak is very precise?” asked Jill.

“Well, since you have 672 data points, your estimate will be precise. Precision is a function of sample size. Smaller sample sizes yield less precise estimates,” said Joe, without looking up from his writing.

“How do you mean?”

“Okay. Let’s take your example: the log-likelihood function for the Bernoulli distribution, ln(L) = rln(p)+(n-r)ln(1-p). \hat{p}=\frac{r}{n} is the maximum likelihood estimate of p. For a given sample X = [x_{1}, x_{2}, ..., x_{n}], \hat{p} maximizes the likelihood (joint probability of occurrence) of observing the sample. However, for smaller sample sizes, the parameter is not estimated very precisely.”

“Can you illustrate it?”

“Sure. Give me a second.” Joe types up a few lines of code in his RStudio console and creates a figure. “Look at this,” he said.

“This graphic shows the log-likelihood (L_{1}) for various values of p and a sample size n = 10. Imagine you had a sample of 10 observations with 6 or 7 violations. The peak of the function is at p = 0.6 or so, but around that region, the function is flat. So, pinpointing the peak at 0.62 is not possible. Less precise estimate.”

“To see this better, I computed the slope of the points from the maximum value of the log-likelihood. The difference in the log-likelihood at every point from the peak. This difference (slope) is another function that shows the precision. At the best estimate, 0.6 ish, the slope function should be very steep to say that the estimate is precise. The function is peaky at the best estimate. If the slope function is flat, the estimate is less precise.”

“I created this difference function for n = 20, 50, and 200 to see how the slope changes with increasing sample size. This is the graphic you see here with four different function lines. Can you see any pattern?”
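The figure can be reproduced with a few lines of base R. This is a sketch of the kind of code Joe might have typed, not the original script; the grid of p values and the assumed violation proportion of 0.62 are illustrative.

```r
# Log-likelihood of a Bernoulli sample as a function of p, for several
# sample sizes, keeping the proportion of violations near 0.62
p_grid <- seq(0.01, 0.99, by = 0.01)

loglik_curve <- function(n, p_hat = 0.62) {
  r <- round(n * p_hat)                      # violations in a sample of size n
  r * log(p_grid) + (n - r) * log(1 - p_grid)
}

# Difference from the peak, so the curves are comparable across n
drop_from_peak <- function(ll) ll - max(ll)

plot(p_grid, drop_from_peak(loglik_curve(10)), type = "l", ylim = c(-20, 0),
     xlab = "p", ylab = "log-likelihood minus its maximum")
for (n in c(20, 50, 200)) lines(p_grid, drop_from_peak(loglik_curve(n)))
# The n = 200 curve is the steepest near p = 0.62: a more precise estimate
```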

Jill squints as he looks at the figures. “Yes, I see the slope function is flat for n = 10 and gets peaky as n increases. It is very steep at 0.62 for n = 200. The more data we have, the more precise the estimates we can get.”

“Should I include this too when I talk tomorrow?”

“Why not. The more information you present, the better. It shows your thought process and rigor.”

“Also, show that the maximum likelihood estimation method has an invariance property.”

“What is that?”

“If \hat{\theta} is the maximum likelihood estimator for a parameter \theta, then the maximum likelihood estimator of any function g(\theta) is g(\hat{\theta}), the same function g of the estimator.”

“Think about variance and standard deviation for example. We know that the maximum likelihood estimator for the variance is \hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}. The standard deviation, \sigma = \sqrt{\sigma^{2}}, is a function of the variance. So, the maximum likelihood estimator for the standard deviation is \hat{\sigma}=\sqrt{\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}}.”

“Since you are having large sample sizes, while you discuss properties, you can throw in the fact that the maximum likelihood method produces estimates (\hat{\theta}) that are approximately normally distributed, and approximately unbiased estimates with minimum variance.”

“Wait, you just threw in a few new words like bias and minimum variance, and you are asking me to throw it on my boss!”

Jill looks clearly lost. It may also be the late hour. They both go back to their work, hoping to sail through hurricane FINALS.


Lesson 64 – More Likelihood

J: The maximum likelihood estimation method provides the formulation for computing the parameters of the population distribution which agree most closely with the sample observations. For a given sample X = [x_{1}, x_{2}, ..., x_{n}], the optimal parameter maximizes the likelihood (joint probability of occurrence) of observing the sample. Last week, we derived the parameters for the Binomial and the Poisson, both discrete distributions. I wonder if we can use the same logic for continuous distributions?

D: Hello Joe. You look recharged and ready to go. So let’s cut to the chase. Yes, we can apply the same joint probability logic to the continuous probability functions. Let’s take a random sample X = [x_{1}, x_{2}, ..., x_{n}] from a population with a probability density function f_{X}(x) as in this figure.

Assume we can divide the space into equal intervals of length dx. Can you tell me the probability that a given sample data point x_{1} falls in the range dx?

J: That would be the integral of f_{X}(x_{1})dx over the range.

D: Correct. For an infinitesimally small range dx, we can approximate this area as f_{X}(x_{1})dx. Since the sample observations are independent and identically distributed, the joint probability of observing them, or the likelihood function is

L = f_{X}(x_{1})dx*f_{X}(x_{2})dx*...*f_{X}(x_{n})dx

L = {\displaystyle \prod_{i=1}^{n}f_{X}(x_{i})}(dx)^{n}

Maximizing the above function is equivalent to maximizing L = {\displaystyle \prod_{i=1}^{n}f_{X}(x_{i})} since we have equal intervals of length dx.

J: I see. So, it reduced to the same likelihood we did last week; the product of the probability density function for different values of the sample. We can choose the value of the parameter(s) that maximizes this quantity.

D: Precisely. Here are Fisher’s own words from his paper.

J:  Shall we try some continuous distributions today?

D: We shall. Begin with Exponential Distribution.

J: Let me solve it. The probability function for the exponential distribution is f_{X}(x) = \lambda e^{- \lambda x}. So the likelihood function is

L = {\displaystyle \prod_{i=1}^{n}\lambda e^{- \lambda x_{i}}}

L = \lambda^{n}e^{-\lambda (x_{1}+x_{2}+...+x_{n})}

The log-likelihood will be

ln(L) = nln(\lambda) - \lambda (x_{1}+x_{2}+...+x_{n})

\lambda is the parameter we are trying to estimate by maximum likelihood. So take the derivative of the function ln(L) with respect to \lambda and equate it to 0 to solve for \lambda.

\frac{d ln(L)}{d \lambda}=0

\frac{n}{\lambda} - \sum_{i=1}^{n}x_{i}=0

\lambda = \frac{1}{\frac{1}{n} \sum_{i=1}^{n}x_{i}} = \frac{1}{\bar{x}}

The maximum likelihood estimator for the exponential distribution is \hat{\lambda} = \frac{1}{\bar{x}}.
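A quick numerical check in R, under a simulated sample: maximize ln(L) directly with `optimize` and compare it with the closed form 1/\bar{x}. The true rate of 1.5 is an assumed value for the simulation.

```r
# MLE for the exponential distribution: lambda_hat = 1 / mean(x)
set.seed(3)
x <- rexp(2000, rate = 1.5)                 # simulated sample, true lambda = 1.5

loglik <- function(lambda) length(x) * log(lambda) - lambda * sum(x)

lambda_closed <- 1 / mean(x)
lambda_numer  <- optimize(loglik, interval = c(0.01, 10), maximum = TRUE)$maximum

c(closed = lambda_closed, numerical = lambda_numer)   # both near 1.5
```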

D: Excellent. Another continuous function you can try is the Rayleigh density. It is the probability distribution of the distance r = \sqrt{x^{2}+y^{2}} from the origin to a point with coordinates (x, y) in an x-y plane, where X and Y are independent zero-mean normal random variables with equal variance.

f(r) = \frac{r}{\theta}e^{-\frac{r^{2}}{2 \theta}}

\theta is the parameter.

Do you recognize this function? We learned a probability distribution which is similar to this.

J: Aaah… Chi-square? Sum of squares of normals.

D: Correct. Rayleigh is the distribution of the distance. The square root of the Chi-square distribution.

J: Let me go through the steps of estimating \theta using the maximum likelihood method.

L = {\displaystyle \prod_{i=1}^{n}\frac{r_{i}}{\theta}e^{-\frac{r_{i}^{2}}{2\theta}}}

L = \frac{1}{\theta^{n}} e^{- \frac{1}{2 \theta}\sum_{i=1}^{n}r_{i}^{2}} {\displaystyle \prod_{i=1}^{n}r_{i}}

ln(L) = -nln(\theta) + {\displaystyle \sum_{i=1}^{n}ln(r_{i})}-\frac{1}{2 \theta}{\displaystyle \sum_{i=1}^{n}r_{i}^{2}}

We have one parameter to estimate, so \frac{d ln(L)}{d \theta}=0.

-\frac{n}{\theta}+\frac{1}{2 \theta^{2}}{\displaystyle \sum_{i=1}^{n}r_{i}^{2}}=0

\theta = \frac{1}{2n}{\displaystyle \sum_{i=1}^{n}r_{i}^{2}}

The maximum likelihood estimator for the Rayleigh distribution is \hat{\theta} = \frac{1}{2n}{\displaystyle \sum_{i=1}^{n}r_{i}^{2}}
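In R, sampling distances whose coordinates are independent zero-mean normals gives a Rayleigh sample, and the estimator is one line. A sketch with an assumed \sigma:

```r
# MLE for the Rayleigh distribution: theta_hat = (1/(2n)) * sum(r^2)
set.seed(4)
sigma <- 2                                   # theta = sigma^2 = 4
r <- sqrt(rnorm(5000, 0, sigma)^2 + rnorm(5000, 0, sigma)^2)

theta_hat <- sum(r^2) / (2 * length(r))
theta_hat                                    # should be near 4
```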

D: Very well done. Now let’s flex our muscles. How can we estimate two or more parameters?

J: Like for example, the parameters of the normal distribution, \mu and \sigma?

D: Yes.

J:

D: Simple extension. Since there will be two parameters in the likelihood function of a normal distribution, we have to solve the set of equations derived through the partial derivates. Look at these steps for the normal distribution.

f_{X}(x) = \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^{2}}

L = {\displaystyle \prod_{i=1}^{n}} \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{-\frac{1}{2}(\frac{x_{i}-\mu}{\sigma})^{2}}

L = (2\pi)^{-n/2}\sigma^{-n}e^{-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(x_{i}-\mu)^{2}}

ln(L)=-\frac{n}{2}ln(2\pi) - nln(\sigma) - \frac{1}{2\sigma^{2}}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}

Now we can have two equations based on the partial derivatives of the log-likelihood function.

\frac{\partial ln(L)}{\partial \mu}=0

\frac{\partial ln(L)}{\partial \sigma}=0

Let’s take the first partial derivative.

\frac{\partial ln(L)}{\partial \mu}=0

\frac{\partial}{\partial \mu}(- \frac{1}{2\sigma^{2}}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}) =0

\frac{\partial}{\partial \mu}(- \frac{1}{2\sigma^{2}}({\displaystyle \sum_{i=1}^{n} x_{i}^{2}} + {\displaystyle \sum_{i=1}^{n}\mu^{2}} - 2\mu{\displaystyle \sum_{i=1}^{n}x_{i}})) = 0

-\frac{1}{2\sigma^{2}} {\displaystyle \sum_{i=1}^{n}2\mu} + \frac{1}{\sigma^{2}} {\displaystyle \sum_{i=1}^{n}x_{i}}=0

n\mu = {\displaystyle \sum_{i=1}^{n}x_{i}}

\mu = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}

Now, take the second partial derivative.

\frac{\partial ln(L)}{\partial \sigma}=0

\frac{\partial}{\partial \sigma}(-\frac{n}{2}ln(2\pi) - nln(\sigma) - \frac{1}{2\sigma^{2}}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}) = 0

-\frac{n}{\sigma}+\frac{1}{\sigma^{3}}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}=0

\sigma^{2} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}

The maximum likelihood estimators for the Normal distribution are \hat{\mu} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} and \hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}
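The two-parameter solution can be verified numerically in R. This sketch minimizes the negative log-likelihood with `optim` and compares it against the closed-form estimators; the sample and its parameters are simulated for illustration.

```r
# MLE for the normal distribution
set.seed(5)
x <- rnorm(1000, mean = 47, sd = 5)
n <- length(x)

mu_hat     <- mean(x)
sigma2_hat <- sum((x - mu_hat)^2) / n        # note: divides by n, not n - 1

# numerical check: minimize -ln(L) over (mu, sigma)
negloglik <- function(par) {
  if (par[2] <= 0) return(Inf)               # sigma must stay positive
  -sum(dnorm(x, mean = par[1], sd = par[2], log = TRUE))
}
fit <- optim(c(median(x), mad(x)), negloglik)

c(mu_hat, sqrt(sigma2_hat))                  # closed form
fit$par                                      # numerical, should agree closely
```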

J: That makes it very clear. There is some work involved in solving, but the procedure is straightforward.

D: You can try estimating the parameters of the lognormal distribution for practice.

J: Well, lognormal has a very similar probability density function. Let me try Gumbel. It has a double exponential distribution. Will be fun to solve.

D: Okay. Go for it.

J: The cumulative function for the Gumbel distribution is F(z)=e^{-e^{-\frac{z-\alpha}{\beta}}}. \alpha and \beta are the parameters.

Taking the derivative of this function, we can get the probability density function for the Gumbel distribution as

f(z)=\frac{1}{\beta}e^{-\frac{z-\alpha}{\beta}}e^{-e^{-\frac{z-\alpha}{\beta}}}

The likelihood function is

L = {\displaystyle \prod_{i=1}^{n} \frac{1}{\beta}e^{-\frac{z_{i}-\alpha}{\beta}}e^{-e^{-\frac{z_{i}-\alpha}{\beta}}}}

L = \beta^{-n} e^{-{\displaystyle \sum_{i=1}^{n} (\frac{z_{i}-\alpha}{\beta} + e^{-\frac{z_{i}-\alpha}{\beta}}) }}

ln(L) = -nln(\beta) -{\displaystyle \sum_{i=1}^{n} (\frac{z_{i}-\alpha}{\beta} + e^{-\frac{z_{i}-\alpha}{\beta}}) }

We can get the partial derivatives with respect to \alpha and \beta and solve the system of equations.

\frac{\partial ln(L)}{\partial \alpha} = \frac{n}{\beta} - \frac{1}{\beta}{\displaystyle \sum_{i=1}^{n}e^{-\frac{z_{i}-\alpha}{\beta}}}=0

\frac{\partial ln(L)}{\partial \beta} = -\frac{n}{\beta} + \frac{1}{\beta^{2}}{\displaystyle \sum_{i=1}^{n}(z_{i}-\alpha)} - \frac{1}{\beta^{2}}{\displaystyle \sum_{i=1}^{n}(z_{i}-\alpha) e^{-\frac{z_{i}-\alpha}{\beta}}}=0

D: Yes, I know. There is no easy solution for this system. You have to use iterative numerical techniques. Do you remember the Newton-Raphson method?

J:

D: It’s okay if you do not remember. We will let R do it for us. The moral is that the MLE method is not a panacea. We also run into issues of discontinuity in likelihood functions sometimes. Think of the uniform distribution, for example. As you know, we cannot use calculus methods for discontinuous functions. A combination of the MLE method with numerical techniques and approximations is a good approach.
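For the Gumbel system, a short R sketch shows this numerical route: simulate a sample with assumed \alpha and \beta via the inverse CDF, then let `optim` minimize -ln(L). The starting values and parameter choices are illustrative.

```r
# Numerical MLE for the Gumbel distribution (no closed-form solution)
set.seed(6)
alpha <- 47; beta <- 5
z <- alpha - beta * log(-log(runif(3000)))   # inverse of F(z) = exp(-exp(-(z-a)/b))

negloglik <- function(par) {
  a <- par[1]; b <- par[2]
  if (b <= 0) return(Inf)                    # scale must stay positive
  length(z) * log(b) + sum((z - a) / b + exp(-(z - a) / b))
}

fit <- optim(c(mean(z), sd(z)), negloglik)   # crude but workable starting values
fit$par                                      # should land near (47, 5)
```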

J: I can see how things are building up. Are there any such pitfalls I need to pay attention to?

D: Yes, there are some properties, limitations and real-world applications we can learn. Do you want to do it now?

J: Let’s do it next week. I am afraid these equations are going to be my sweet dreams tonight

To be continued…


Lesson 63 – Likelihood: modus operandi for estimation

D: Though often used interchangeably with probability (guilty as charged), the term likelihood is reserved for inferring the parameters of the population.

J: Are we talking about using a sample to estimate the true value of the parameter? You said last week that there are methods of estimation for these parameters. I was able to grasp that there are sample records, i.e., the data we have at hand, and there is a hypothetical population from which these samples originate.

D: That is correct. As you remember, there was an example of extreme wind speed data (sample). Our goal is to understand this bulk data using some probability function (GEV in this case) with few control parameters. The probability function represents the entire information contained in the sample data. It is the hypothetical population you mentioned, and the data we observe is a random sample from this population.

If you compute the mean (\bar{x}) and variance (s^{2}) of the sample data, they are good guesses (estimates or estimators) of the mean (\mu) and variance (\sigma^{2}) of the population. The formulation used to get the estimate is called an estimator.

J: Then what is likelihood? You are saying that it is a well-known method for estimation.

D: It is the joint probability of occurrence of the observations in any distribution. But before we delve into the mathematical exposé, I think it will help if we take a simple example and work it out step by step using some graphics.

J: I like that idea. It helps if we know how the concept builds up, so we can go back to its roots when required.

D: So let’s begin. Imagine you ordered ten shirts from Amazon. Just imagine; don’t even think of actually doing it.

J:

D: On arrival, you find that the fourth (red shirt), fifth (black shirt) and eighth (white shirt) is defective.

J:

D: Can you convert these ten observations into a Bernoulli sequence?

J: Sure. I will use 1 for defective shirts and 0 for shirts that are not defective. Then, we have

X = [0, 0, 0, 1, 1, 0, 0, 1, 0, 0].

D: Great. Now, if these numbers (sample shirts) were coming from a hypothetical population (Bernoulli trials), we can say that there is a certain probability of defective items. Let’s call it p.

J: So, p will be the true probability of occurrence for this Bernoulli trial. What we have is a sample of ten values of 0’s and 1’s from this population.

D: Correct. Can you estimate p using the sample data?

J: Based on the ten data points one guess would be 0.3 since there are three defective items in ten. But, how do we know if 0.3 is the best estimate of the true probability?

D: Good question. For that, we need to know about likelihood. The joint probability of observing the sample we have.

J: Can you elaborate?

D: You told me that the Bernoulli sequence of these ten observations is X = [0, 0, 0, 1, 1, 0, 0, 1, 0, 0]. Can you tell me what the joint probability of observing this sequence is?

J: Sure. If we assume p for the probability of success, P(X = 1) = p and P(X = 0) = 1-p. Then, the joint probability of seeing X = [0, 0, 0, 1, 1, 0, 0, 1, 0, 0] is

P(X=0)*P(X=0)*P(X=0)*P(X=1)*P(X=1)*P(X=0)*P(X=0)*P(X=1)*P(X=0)*P(X=0)=p^{3}(1-p)^{7}.

They are independent and identically distributed, so the joint probability is the product of the individual probabilities.

D: Great. Let’s write it concisely as L = p^{3}(1-p)^{7}.

I am using L to describe the likelihood.

We can now choose a value for p that will maximize L, i.e., the value of p that yields the largest L. The brute-force way of doing it is to substitute possible values of p, compute L, and find the value of p where L is the highest.

J: Yes. Let me try that first.

L is maximum at \hat{p}=0.3.
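The brute-force search takes only a couple of lines of R:

```r
# Brute force: evaluate L = p^3 * (1 - p)^7 on a grid of p values
p <- seq(0.01, 0.99, by = 0.01)
L <- p^3 * (1 - p)^7

p[which.max(L)]   # 0.3, the proportion of defective shirts
```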

D: That is correct. Now let’s try solving it as a maximization problem. You have the equation L = p^{3}(1-p)^{7}. How do you find p that maximizes L?

J: We can differentiate L. At the maximum or minimum, the slope \frac{dL}{dp}=0. We can use this property to solve for p.

D: Yes. But it is easier to differentiate ln(L).

J: Why is that?

D: Because ln(L) is a monotonic function of L. The value of p that gives the largest L also gives the largest ln(L).

Finding p that maximizes ln(L) is equivalent to finding p that maximizes L. Moreover, taking the logarithm of the likelihood function converts the product terms into summation terms.

ln(L) = 3ln(p)+7ln(1-p)

Now we can find the derivative of the function and equate it to 0 to find the answer.

\frac{3}{p} - \frac{7}{1-p}=0

3 - 3p = 7p

p = \frac{3}{10}

Can you generalize this for any number of samples?

J: Yes, I can do that. Assuming there are r 1’s in a sample of n, the likelihood function will be L = p^{r}(1-p)^{n-r}

Log-likelihood function is ln(L) = rln(p)+(n-r)ln(1-p)

\frac{dln(L)}{dp} = 0

\frac{r}{p}-\frac{n-r}{1-p}=0

p=\frac{r}{n}

\frac{d^{2}ln(L)}{dp^{2}} = -[\frac{r}{p^{2}} + \frac{n-r}{(1-p)^{2}}]

A negative second derivative is sufficient to ensure maxima of the function.

D: Very nice. \hat{p}=\frac{r}{n} is called the maximum likelihood estimate of p. For a given sample X = [x_{1}, x_{2}, ..., x_{n}], \hat{p} maximizes the likelihood (joint probability of occurrence) of observing the sample.

R. A. Fisher pioneered this method in the 1920s. The opening image on likelihood is from his 1922 paper “On the Mathematical Foundations of Theoretical Statistics” in the Philosophical Transactions of the Royal Society of London where he first proposed this method.

Here is how L and ln(L) compare when finding the value of p that gives the most likelihood or log-likelihood.

J: That makes it very clear. Some form of visualization always helps better grasp the abstractions. So, this is for Bernoulli sequence, or by extension for the Binomial distribution. Shall we try this maximum likelihood estimation method for one more distribution?

D: The floor is yours. Try it for the Poisson distribution.

J: I will start with the fact that there is a sample of n observations x_{1}, x_{2}, ..., x_{n} that is assumed to have originated from a Poisson distribution with the probability function f_{X}(x) = P(X=x) = \frac{e^{-\lambda t}(\lambda t)^{x}}{x!}. Our sample is counts, i.e., the number of times an event occurs in an interval. Let’s assume a unit time interval, t = 1.

The likelihood L of observing the sample x_{1}, x_{2}, ..., x_{n} is P(X=x_{1})*P(X=x_{2}) ... P(X=x_{n}). We can represent this concisely using the product operator \prod.

L = {\displaystyle \prod_{i=1}^{n} \frac{e^{-\lambda}(\lambda)^{x_{i}}}{x_{i}!}}

L = e^{-n\lambda}{\displaystyle \prod_{i=1}^{n} \frac{(\lambda)^{x_{i}}}{x_{i}!}}

ln(L) = -n\lambda + {\displaystyle \sum_{i=1}^{n} (x_{i}ln(\lambda) - ln(x_{i}!))}

\frac{dln(L)}{d\lambda} = 0

-n + {\displaystyle \sum_{i=1}^{n}\frac{x_{i}}{\lambda}} = 0

\lambda = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}

\frac{d^{2}ln(L)}{d\lambda^{2}} < 0

The maximum likelihood estimator is \hat{\lambda} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}. It is the mean of the n observations in the sample. Ah, I see a connection. The expected value of the Poisson distribution E[X] is its parameter \lambda.
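A numerical cross-check in R with a simulated sample (the true \lambda of 4 is an assumed value): `optimize` lands on the sample mean.

```r
# MLE for the Poisson distribution: lambda_hat = mean of the sample
set.seed(7)
x <- rpois(2000, lambda = 4)

lambda_closed <- mean(x)
loglik <- function(lam) -length(x) * lam + sum(x) * log(lam) - sum(lfactorial(x))
lambda_numer  <- optimize(loglik, interval = c(0.1, 20), maximum = TRUE)$maximum

c(closed = lambda_closed, numerical = lambda_numer)   # both near 4
```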

D: Very clearly described. Do you have the energy for one more?

J: Most likely no.

To be continued…


Lesson 62 – Knowing the unknown: Inferentia

Think about your recent wine tasting retreat. While you were relaxing and refreshing, you were also tasting/testing wine samples and guessing the quality of the product. You were knowing the unknown using samples and the process of induction. In other words, you were inferring about the true quality of the product based on the taste of the wine sample. A method of generalization from observations.

 

Think about your recent visit to the theater after being impressed by the trailer. You must have judged how good the movie will be, based on the sample they showed you. If you are like me, you would have told yourself that this is the last time you are watching a movie based on the trailer. But you keep going back. Here again, you are inferring about the film based on the sample teaser. Knowing the unknown via inference.

 

Do those robocalls for election surveys or ‘get out the vote’ campaigns bother you? Be nice to them. They are taking a sample survey to estimate/guess the election results. They are trying to know the unknown, inferring the true outcome from the sample data.

 

Abraham Wald in 1942 used the sample of fighter planes that returned with bullet holes to judge/infer where to put the armor plates: where the holes aren’t. The truth is that the planes with bullet holes in the wings returned, while the others did not.

Statistical Inference

Deriving general conclusion from what we observe (sample data) is central to increasing our knowledge about the world. Statistical inference helps us with this.

Imagine we wish to know about the incomes or IQ levels of the adult population in the US. The ideal thing to do is to test every adult for IQ and record their income levels. In other words, get the data for the entire population. But this is not practical. We can, however, take a sample that represents the population and record the IQ and incomes of this sample. We can then make a general statement about the population from the sample. The statement could be “the average IQ of adults in the US is 100.” We get this based on a large enough representative sample. The average IQ from the sample is an estimate of the true IQ for the population. Since we are basing it on the sample, the estimate will be sensitive to how much data (sample size) we have.

Imagine we wish to know about the climate of a region we want to relocate to: what is the temperature distribution, how much does it rain, and how many hurricanes does it get per year? Proper measurements of rainfall and temperature are a recent phenomenon; maybe the last 80 years are of good quality. We don't have an observational record of what happened before that. But we can take the data we have (a sample of climate observations), estimate the temperature distribution and the hurricane counts, and make a statement about the overall climate of this place so you can relocate. The generalization can be "the average temperature of this place is 60F with four hurricanes per year." Don't complain after you move that the area is getting more hurricanes. The sample may not have represented the climate history. Our inference is based on what we had. We can always update it.

Come 2020, the media and polling companies will predict the outcome of the presidential election. They will make voter calls or take online surveys to infer which way the election will swing. Ask for sample breakdowns and survey questions. You will know what biases are built into them.

In the most recent lessons on the generalized extreme value distribution, we assumed that the data on extreme wind speed (block maxima) would most likely follow an extreme value distribution, since maximum values converge to an extreme value distribution in the limit. So the data we observed is a sample that represents the true GEV functional form:

G(z) = e^{-[1+\xi(\frac{z-\mu}{\sigma})]^{-1/\xi}_{+}}

This compact GEV function is the population. A true form of all extreme wind values in this place. The data we observed for a few years is a sample that represents this function (population). We can use the sample to draw conclusions (inference) about this GEV function.

The true location, scale, and shape parameters of this GEV function are \mu, \sigma and \xi. Based on the data we observed, the estimates (good guesses) of the location, scale, and shape parameters are 47.44, 5.03 and 0. We are drawing inference about the true function from the sample data we have at hand.

Hence, the objective of inference is to use a ‘sample’ to compute a number which represents a good guess for the true value of the parameter. The true population parameter is unknown, and what we estimate from the sample will enable us to obtain the closest answer in some sense.

The general rule, or a formula which is used to get the estimate, is called an estimator.

For example, if you compute the mean (\bar{x}) and variance (s^{2}) of the data, they are good guesses (estimates) of the mean (\mu) and variance (\sigma^{2}) of the population. The formulas used to compute them are the estimators.

A different sample will yield a different estimate for the parameter. So we have to think of the estimate as a range of values, an interval or a probability distribution instead of a single value. The truth may be in this interval if we have good representative samples.
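A minimal sketch of this idea in R, using the same hypothetical IQ population as before: repeat the sampling many times and look at the spread of the estimates instead of any single value.

```r
# Each repetition draws a fresh sample of 100 and computes its mean.
# The 1000 sample means form a distribution of estimates around the truth.
set.seed(2)
estimates = replicate(1000, mean(rnorm(100, mean = 100, sd = 15)))

# An interval that covers 95% of the sample means
quantile(estimates, probs = c(0.025, 0.975))
```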

Over the next few lessons, we will learn some methods of estimation from data, how reasonable the estimates are, or the criteria for being good, their range or intervals, and many more concepts.

In summary, statistical inference is about understanding the entire population from samples and sample statistics.

The sample has to be representative of the population for good inference.

In other words, the sample distribution must be similar to the population distribution. Extreme wind data, for example, is better represented by an extreme value distribution than by a normal distribution.

 

 

Joe and his girlfriend broke up. They inferred their future based on the sample dates. Is it a good inference?

 

 

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 61 – Resto


It is a journey, long haul. Let’s take an exit, stretch our legs, get a coffee and trace our tracks.

Seeing Data and Making Friends with R

Lesson 1: When you see something, say data

Lesson 2: R is your companion

Sets and Properties

Lesson 3: Sets

Lesson 4: Visualizing Sets

Probability, Axioms, and Rules

Lesson 6: Defining Probability

Lesson 7: Axioms of Probability

Lesson 9: Conditional Probability

Lesson 10: Independent Events

Lesson 12: Total Probability

Lesson 13: Bayes Rule

More R: Shortcuts, for, if-else, and functions
Lesson 5: Let us 'R'eview

Lesson 8: The search 'for' w'if'i

Lesson 11: Fran, the functionin' R-bot

Exploratory Data Analysis

Lesson 14: Order Statistics

Lesson 15: Frequency Plots

Lesson 16: Average

Lesson 17: Variance

Lesson 18: Skewness

Lesson 19: Outliers

Lesson 20: Coefficient of Variation

Random Variables, Expectation and others

Lesson 22: Random Variable

Lesson 23: Probability Distribution

Lesson 24: Expectation Operation

Lesson 25: More Expectation

Lesson 26: Variance Operation

Lesson 27: More Variance

Lesson 28: Standardization

More R: data exploration and large number games
Lesson 21: Beginners guide to summarize data in R

Lesson 29: Large number games in R

Discrete Probability Distributions

Lesson 31: Bernoulli Trials

Lesson 32: Binomial Distribution

Lesson 33: Geometric Distribution

Lesson 34: Return Period

Lesson 35: Negative Binomial Distribution

Lesson 36: Poisson Distribution

Lesson 37: Still Poisson Distribution

Lesson 38: Hypergeometric Distribution

Continuous Probability Distributions

Lesson 41: Continuous Functions

Lesson 42: Beta Distribution

Lesson 43: Exponential Distribution

Lesson 44: Memoryless Property

Lesson 45: Gamma Distribution

Lesson 46: Convolution

Lesson 47: Central Limit Theorem Visuals

Lesson 48: Central Limit Theorem Derivation

Lesson 49: Normal Distribution

Lesson 50: The Standard Normal Distribution

Lesson 51: Guest Blog: Prof. Upmanu Lall from Columbia University

Lesson 52: Lognormal Distribution

Lesson 53: Chi-square Distribution

Lesson 54: The Connection of Probability Distributions to Bernoulli

More R: Discrete and Continuous Distributions
Lesson 39: Discrete Distributions in R Part I

Lesson 40: Discrete Distributions in R Part II

Lesson 55: Continuous Distributions in R Part I

Lesson 56: Continuous Distributions in R Part II

Extreme Value Distributions

Lesson 57: Visual Understanding of Extreme Values

Lesson 58: Basics of Extreme Value Distributions

Lesson 59: Generalized Extreme Value Distribution

More R: Extreme Value Distributions
Lesson 60: Extreme Value Distributions in R

Inference

This is a classroom – Use it.

This is a resource – Take it.

This is a free source – Bookmark it.

This is knowledge, share it.


 

Lesson 60 – Extreme value distributions in R

“In theory, there is no difference between theory and practice. In practice, there is.” — ??

By now, you recognize the pattern in this classroom. We follow up theory with practice. A set of lessons with a common theme will culminate with some experience in R. Being true to our trend, today, we leap into the world of extRemes. Put on your hard hat and bring your tools and machinery. Don’t forget your drink of the day.

On the seventeenth day of March, two thousand eighteen, we met Maximus Extremus Distributus. He was taking different forms depending on the origin (parent) distribution. You can create him too using a few simple lines of code in R.

Here is the creation code for normal origins. We generate N = 1000 normally distributed random variables with a zero mean and unit standard deviation, select the maximum value out of these 1000 values, and repeat the process 1000 times to get 1000 maximum values. These maximum values converge to the Type I extreme value distribution – Gumbel (e^{-e^{-y}}). The code runs like an animation. You can control the speed by changing the number in Sys.sleep(). There is a plot that shows the convergence of the average values to a Normal distribution. Do you know why?

library(locfit)
library(logspline)

# Normal Origins #
par(mfrow=c(3,1))
x = rnorm(10000,0,1)
plot(logspline(x),xlim=c(-5,5),ylim=c(0,0.5),font=2,cex.lab=2,font.lab=2,xlab="Normal Distribution")

N = 1000
ave = matrix(NA,nrow=N,ncol=1)
max1 = matrix(NA,nrow=N,ncol=1)

for (i in 1:N)
{
 x = rnorm(N,0,1)
 
 lines(locfit(~x),col="grey")
 points(mean(x),0,col="blue",pch=17)
 points(max(x),0,col="red",pch=17)
 ave[i] = mean(x)
 max1[i] = max(x)
 Sys.sleep(0.01) 
}

Sys.sleep(1)
plot(locfit(~ave),xlim=c(-5,5),ylim=c(0,9),ylab="",cex.lab=2,font=2,font.lab=2,xlab="Normal Distribution",col="white")
lines(locfit(~ave),lwd=2,col="blue")

Sys.sleep(1)
plot(locfit(~max1),xlim=c(-5,5),font=2,font.lab=2,ylab="",xlab="Extreme Value Distribution (Gumbel)",cex.lab=2,col="white")
lines(locfit(~max1),lwd=2,col="red")


 

The creation code for exponential origins has the same procedure. We generate N = 1000 exponentially distributed random variables with \lambda = 0.5 as the parent. The maximum values of an exponential distribution again converge to the Gumbel distribution.

## EVD from Exponential## 
par(mfrow=c(3,1))
x = rexp(10000,0.5)
hist(x,prob=T,xlim=c(0,25),ylab="",main="",font=2,cex.lab=2,font.lab=2,xlab="Exponential Distribution")
plot(logspline(x),add=T)

N = 1000

ave = matrix(NA,nrow=N,ncol=1)
max1 = matrix(NA,nrow=N,ncol=1)

for (i in 1:N)
{
 x = rexp(N,0.5)
 
 lines(locfit(~x),col="grey")
 points(mean(x),0,col="blue",pch=17)
 points(max(x),0,col="red",pch=17)
 ave[i] = mean(x)
 max1[i] = max(x)
 Sys.sleep(0) 
}

Sys.sleep(1)
plot(locfit(~ave),xlim=c(0,25),ylim=c(0,4),ylab="",cex.lab=2,font=2,font.lab=2,xlab="Normal Distribution",col="white")
lines(locfit(~ave),lwd=2,col="blue")

Sys.sleep(1)
plot(locfit(~max1),xlim=c(0,25),font=2,font.lab=2,ylab="",xlab="Extreme Value Distribution (Gumbel)",cex.lab=2,col="white")
lines(locfit(~max1),lwd=2,col="red")

 

Here is the creation code for uniform origins. This time the maximum values from the uniform distribution converge to a different type of extreme value distribution, the Type III Weibull distribution (e^{-(-y)^{\gamma}}).

# Uniform Origins #
x = runif(100000,0,1)
#hist(x,prob=T,xlim=c(0,1),ylab="",main="",font=2,cex.lab=2,font.lab=2,xlab="Uniform Distribution")
plot(density(x),xlim=c(0,1),ylab="",main="",font=2,cex.lab=2,font.lab=2,xlab="Uniform Distribution")

N = 200

ave = matrix(NA,nrow=N,ncol=1)
max1 = matrix(NA,nrow=N,ncol=1)

for (i in 1:N)
{
 x = runif(N,0,1)
 
 # lines(locfit(~x),col="grey")
 points(mean(x),0,col="blue",pch=17)
 points(max(x),0,col="red",pch=17)
 ave[i] = mean(x)
 max1[i] = max(x)
 Sys.sleep(0) 
}

Sys.sleep(1)
plot(locfit(~ave),xlim=c(0,1),ylim=c(0,30),ylab="",cex.lab=2,font=2,font.lab=2,xlab="Normal Distribution",col="white")
lines(locfit(~ave),lwd=2,col="blue")

Sys.sleep(1)
plot(locfit(~max1),xlim=c(0,1),ylim=c(0,65),font=2,font.lab=2,ylab="",xlab="Extreme Value Distribution (Weibull)",cex.lab=2,col="white")
lines(density(max1),lwd=2,col="red")

I want you to experiment with other types of origin distributions.

The Three Types (Gumbel, Frechet, Weibull) and the Generalized Extreme Value Distribution (GEV)

The three types of extreme value distributions have double exponential and single exponential forms. The maxima of independent random variables converge (in the limit when n \to \infty) to one of the three types, Gumbel (e^{-e^{-y}}), Frechet (e^{-y^{-\gamma}}) or Weibull (e^{-(-y)^{\gamma}}) depending on the parent distribution.

We saw last week that these three types could be combined into a single function called the generalized extreme value distribution (GEV).

G(z) = e^{-[1+\xi(\frac{z-\mu}{\sigma})]^{-1/\xi}_{+}}

\mu is the location parameter. \sigma > 0 is the scale parameter. \xi controls the shape of the distribution (shape parameter).

When \xi \to 0, GEV tends to a Gumbel distribution.

When \xi > 0, GEV tends to the Frechet distribution.

When \xi < 0, GEV tends to the Weibull distribution.

It's okay if you don't know the origin distribution for an extreme dataset. GEV folds all three types into one form, and the parameters \mu, \sigma, and \xi can be estimated from the data. The function has a closed form solution to compute the quantiles and probabilities.
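Since the function has a closed form, it is simple enough to code directly. Here is a minimal base-R sketch of the CDF; the name gev_cdf is made up for illustration, not a package function.

```r
# A sketch of the GEV cumulative distribution function G(z)
gev_cdf = function(z, mu, sigma, xi)
{
 if (abs(xi) < 1e-10)
 {
  # Gumbel limit as the shape parameter tends to 0
  return(exp(-exp(-(z - mu)/sigma)))
 }
 t = pmax(1 + xi*(z - mu)/sigma, 0)  # the [.]_+ truncation in the formula
 exp(-t^(-1/xi))
}

gev_cdf(1, mu = 0, sigma = 1, xi = 0)  # Gumbel: exp(-exp(-1)), about 0.692
```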

GEV IN R

Let’s play with some data and use GEV in R. We will use two datasets: NYC temperature and cycles to fatigue of steel. The NYC temperature data is the same file we used in lesson 56. The cycles to fatigue data is from our labs, where we measured the maximum number of cycles before failure due to fatigue for ten steel specimens.

Read the files into the workspace using the following lines.

# fatigue data
fatigue = read.table("cycles_fatigue.txt",header=F)

# temperature data 
temperature_data = read.csv("cp_temperature.csv",header=T)

The extRemes Package

We need to install a package in R called extRemes. Type the following lines in your code.

install.packages("extRemes")
library(extRemes)

This package has functions built for GEV distribution. Let’s try a few simple things first by generating random variables of the three types. Type the following lines in your code.

x = revd(10000,loc=0,scale=1,shape=0)

This command (revd) will generate 10000 GEV random variables with a location of 0, scale of 1 and shape of 0. So, this is a Gumbel distribution.

Type these lines and see what you get.

# Generate Gumbel Random numbers
x = revd(10000,loc=0,scale=1,shape=0)
hist(x,prob=T,xlab="Random Variables from Gumbel (location = 0,scale = 1, shape =0)",main="Gumbel Distribution",ylab="f(x)",font=2,family="serif",font.lab=2,cex.lab=1.5)
plot(logspline(x),add=T)

Now hold the shape parameter constant at 0 and alter the location and scale parameters. A change in the location parameter will shift the distribution; a change in the scale parameter will stretch or shrink the distribution.
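One way to see these two effects without sampling noise is through the Gumbel quantile function q(p) = \mu - \sigma ln(-ln(p)), sketched here in base R (gumbel_q is a made-up name for illustration; revd() draws random values from the same distribution).

```r
# Quantiles of the Gumbel distribution for a few probabilities
gumbel_q = function(p, loc, scale) loc - scale*log(-log(p))

p = c(0.1, 0.5, 0.9)

gumbel_q(p, loc = 0, scale = 1)  # baseline
gumbel_q(p, loc = 2, scale = 1)  # every quantile shifts right by 2
gumbel_q(p, loc = 0, scale = 3)  # quantiles spread out by a factor of 3
```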

I guess you know the chores now.

Yes, we experiment with positive and negative shape parameters to generate the Frechet and Weibull distributions.

Generate Frechet Random numbers

# Frechet distribution plot
y = revd(10000,loc=0,scale=1,shape=0.2)
hist(y,prob=T,ylim=c(0,0.4),xlab="Random Variables from Frechet (location = 0, scale = 1, shape = 0.2)",main="Frechet Distribution",ylab="f(x)",font=2,family="serif",font.lab=2,cex.lab=1.5)
plot(logspline(y),add=T,col="red")
plot(logspline(x),add=T)

Generate Weibull Random numbers

## Weibull Distribution Plot 
z = revd(10000,loc=0,scale=1,shape=-0.6)
hist(z,prob=T,ylim=c(0,0.5),xlab="Random Variables from Weibull (location = 0, scale = 1, shape = -0.6)",main="Weibull Distribution",ylab="f(x)",font=2,family="serif",font.lab=2,cex.lab=1.5)
plot(logspline(z),add=T,col="red")

If you are comfortable with this, it is time to get your hands dirty with real data.

Fitting GEV distribution to data

Let’s examine the maximum cycles to fatigue data. We do not know which extreme value distribution it follows. If we fit a GEV and observe the shape parameter, we can say with some confidence whether the data follows a Type I, Type II, or Type III distribution.

For this, we can use the fevd command. Let’s use the fevd function to fit a GEV to the cycles to fatigue data.

fatigue_cycles = as.matrix(fatigue)

fit = fevd(fatigue_cycles,type="GEV")
summary(fit)

[1] "Estimation Method used: MLE"

Negative Log-Likelihood Value: 132.8045

Estimated parameters:
 location scale shape 
 5.765016e+05 1.680629e+05 -5.231015e-01

Standard Error Estimates:
 location scale shape 
6.276363e+04 6.709826e+04 4.199849e-01

Estimated parameter covariance matrix.
 location scale shape
location 3.939273e+09 -527662537.18 -9.654182e+03
scale -5.276625e+08 4502177071.08 -2.306339e+04
shape -9.654182e+03 -23063.39 1.763873e-01

AIC = 271.6089

BIC = 272.5167

Pay attention to the estimated parameters in the output for now. The shape parameter is -0.52 (\xi < 0). So, a Weibull distribution fits the data with high likelihood.

Block Maxima

Let’s examine the temperature data. Assume we are interested in analyzing the data for the maximum temperature each year. Remember, we only care about the extremes. For this, we should first extract the annual maximum temperature values. This idea is called block maxima. Each year is a block, and we take the maximum for each year.

Type the following lines in your code to get the annual maximum temperature values from 1951 to 2017.

# Annual Maximum Ave Temperature #
yr1 = 1951
yr2 = 2017
n = yr2 - yr1 + 1

annmax = matrix(NA,nrow=n,ncol=1)
for (i in 1:n)
{
 yr = 1950 + i
 index = which((temperature_data[,1] == yr))
 temperature_yr = temperature_data[index,4]
 annmax[i,1] = max(temperature_yr,na.rm=T)
}

hist(annmax,xlab="Annual Maximum Temperature",font=2,main="",font.lab=2)

There is one value very different and far away from all the other values. This is a feature of extreme values.

Rare events never seen before can occur. Using a model (e.g., the GEV function) for these "unknowns" comes with uncertainty.

Now let’s fit GEV to the temperature data and look at a few things.

fit_temperature = fevd(annmax,type="GEV")

The fevd function will fit a GEV distribution to the data. The location, scale and shape parameters of the function are estimated based on the data.

Now type the following line in your code.

plot(fit_temperature)

You should see the following figure appear in the plot window.

For now, focus on the bottom two images. The one on the bottom left shows how well the GEV function (blue line) matches the observed data (black line). Pretty reasonable.

Can you tell me what the image on the bottom right is?

Okay, let me ask you this. What is the probability that the annual maximum temperature will be less than or equal to 92 degrees F?

Yes, you can compute this from the cumulative distribution function of the GEV distribution.

G(z) = P(Z \le z) = e^{-[1+\xi(\frac{z-\mu}{\sigma})]^{-1/\xi}_{+}}

Substitute z = 92, \mu = 86.15, \sigma = 2.65, \xi = -0.03 in the equation and you get P(Z \le z) = 0.903.

I am sure you know why I substituted  \mu = 86.15, \sigma = 2.65, \xi = -0.03. These are the estimated parameters you get from the summary of the fit.

So, the probability that the annual maximum temperature will be less than or equal to 92 degrees F is 0.9.

Can I say that the probability of exceeding 92 degrees F is 0.1?

P(Z > 92) = 0.1

A temperature of 92 degrees F is exceeded 10% of the time.

What do you know about exceedance probability and return period? If a random variable is exceeded with 10% probability, what is the frequency of its occurrence? Lesson 34? The language of return period?

Return Period T = \frac{1}{P(exceedance)}

So if 92 degrees F is the temperature that is exceeded 10% of the time, its return period is ten years.

On average, a daily temperature as high as 92 degrees F will occur once every ten years in New York City.

In equation form, Return Period of a quantile z is T = \frac{1}{1 - G(z)}.
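As a check on the numbers above, here is the arithmetic in base R using the closed-form GEV CDF and the rounded parameter estimates (pevd() in extRemes gives the same probability).

```r
# Rounded parameter estimates from the GEV fit to annual maximum temperature
mu = 86.15; sigma = 2.65; xi = -0.03

# P(Z <= 92) from the GEV CDF
G = exp(-(1 + xi*(92 - mu)/sigma)^(-1/xi))
round(G, 3)  # about 0.903

# Return period of 92 degrees F
T_return = 1/(1 - G)
round(T_return, 1)  # about 10 years
```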

Now, can you answer this?

What temperature (z) occurs once in 50 years?

I know you are thinking of inverting the return period equation. That is the way.

G(z) = 1 - \frac{1}{T}

e^{-[1+\xi(\frac{z-\mu}{\sigma})]^{-1/\xi}} = 1 - \frac{1}{T}

[1+\xi(\frac{z-\mu}{\sigma})]^{-1/\xi} = -ln(1 - \frac{1}{T})

1+\xi(\frac{z-\mu}{\sigma}) = (-ln(1 - \frac{1}{T}))^{-\xi}

z = \mu + \frac{\sigma}{\xi}((-ln(1 - \frac{1}{T}))^{-\xi} - 1)

Now if you substitute T = 50 years and  \mu = 86.15, \sigma = 2.65, \xi = -0.03, you get z = 95.95.

The 50 year return period temperature is 95.95 degrees F.
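The same inversion in base R, using the full-precision estimates from summary(fit_temperature); qevd() in extRemes performs this computation directly.

```r
# Full-precision parameter estimates from the fit
mu = 86.14530247; sigma = 2.64886498; xi = -0.02706872

# Invert G(z) = 1 - 1/T for the T-year return level
return_level = function(T_years) mu + (sigma/xi)*((-log(1 - 1/T_years))^(-xi) - 1)

round(return_level(50), 2)  # about 95.95 degrees F
```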

This is what is being computed internally within the extRemes package to create the plot you see. For a quantile z, the extRemes package has the qevd() function, where you input the probability p and the other parameters. For a probability P(Z \le z), it is pevd(), where you input the quantile z and the other parameters.

You can read these relations off the plot.

What are those dashed lines in the return period plot?

The 500 year return period temperature is 101.3 degrees F. What does it even mean when they say the return period is 500 years?

How can something happen once in 500 years when we do not have 500 years of data?

Do you want to believe it? How confident are you in this estimate?

If this does not spin your head, let me add more. Look at the summary of the model once again.

summary(fit_temperature)

fevd(x = annmax, type = "GEV")

[1] "Estimation Method used: MLE"

 Negative Log-Likelihood Value: 169.1602

 Estimated parameters:
 location scale shape 
86.14530247 2.64886498 -0.02706872

Standard Error Estimates:
 location scale shape 
0.3552756 0.2500867 0.0663774

Estimated parameter covariance matrix.
 location scale shape
location 0.126220730 0.028413268 -0.006875094
scale 0.028413268 0.062543342 -0.003859995
shape -0.006875094 -0.003859995 0.004405960

AIC = 344.3204

BIC = 350.9344

The shape parameter is negative but very close to 0. Can we say that the GEV is a Gumbel distribution? Why not Weibull?

Did you ask yourself how these parameters are estimated from the data? How much data is required for this?

Did you ask yourself whether the shape parameter value is significantly different from 0 to say it is not Gumbel?

You can see so many new terms, “Estimation Method used: MLE,” standard error, and so on. What are these? What about the other two images in the fevd plot? Aren’t you curious about them? Well, hold on to that suspense for a few weeks. Very soon, we will start a new journey of inference.

