Lesson 68 – Bias joins Variance

In last week’s “guess the age” contest, Madam Chairman Yellen was the clear winner. The expected value of the estimated age was 69.69 years, a nominal bias of 1.31 years from her true age.

However, the great Thomas Sowell is the clear winner of our hearts. May he always remain a generation young, like his ideas.

Look at the table once again.

Can you notice the variation in the estimates of the groups?

Yes, Chairman Yellen’s estimated age ranges from 62 years to 78 years, while Dr. Sowell’s estimated age ranges from 55 years to 70 years.

Based on the data from the 16 groups, I computed the variance (and the standard deviation) of the estimated ages.

For Chairman Yellen, the variance of the estimated age is 22.10 and the standard deviation of the estimated age is \sqrt{22.10}=4.7 years.

For Dr. Sowell, the variance is 23.98 and the standard deviation is \sqrt{23.98}=4.9 years.

The variance of the estimator is an important property. It tells us about the spread of the estimate, i.e., how widely the estimate is distributed.

For any estimator \hat{\theta}, we compute the expected value E[\hat{\theta}] as a measure of the central tendency (and, through the bias, of how far it could be from the truth), and V[\hat{\theta}] as a measure of the spread of the distribution.

An estimator should be understood as a probability distribution.

The standard deviation of the estimator is called the standard error of the estimator. In our age-guessing experiment, the standard error of Dr. Sowell’s estimated age is 4.9 years.

The standard error of \hat{\theta} is \sqrt{V[\hat{\theta}]}.


In lesson 67, while learning the concept of bias, we deduced that the sample mean \hat{\mu} = \bar{x} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} is an unbiased estimate of the population mean \mu.

In other words, we derived that E[\hat{\mu}] = \mu .

Now, let’s derive the variance of the sample mean V[\hat{\mu}] or V[\bar{x}].

For this, let us consider that we have a random sample x_{1}, x_{2}, ..., x_{n} of size n from a population with a mean \mu and variance \sigma^{2}.

Since x_{1}, x_{2}, ..., x_{n} are independent and identically distributed random samples from the population with true parameters \mu and \sigma^{2}, they will have the same expected value and variance as the population. E[x_{i}]=\mu and V[x_{i}]=\sigma^{2}.

\hat{\mu} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} is the sample mean.

V[\hat{\mu}] = V[\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}]

V[\hat{\mu}] = V[\frac{1}{n}(x_{1}+x_{2}+...+x_{n})]

V[\hat{\mu}] = \frac{1}{n^{2}}V[(x_{1}+x_{2}+...+x_{n})]

Where did that come from? Was it lesson 27? V[aX]=a^{2}V[X]

V[\hat{\mu}] = \frac{1}{n^{2}}(V[x_{1}]+V[x_{2}]+...+V[x_{n}])

Because the samples are independent, the covariance terms do not exist in the summation.

 V[\hat{\mu}] = \frac{1}{n^{2}}(\sigma^{2}+\sigma^{2}+...+\sigma^{2})=\frac{1}{n^{2}}*n \sigma^{2}=\frac{\sigma^{2}}{n}

The variance of the sample mean V[\hat{\mu}] = V[\bar{x}] =\frac{\sigma^{2}}{n}.

The standard error of the sample mean is \frac{\sigma}{\sqrt{n}}. It is a measure of the precision of the estimator.
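This result is easy to check by simulation. Here is a minimal Python sketch (the population parameters \mu = 50, \sigma = 10 and the sample size n = 25 are hypothetical choices, not from the lesson) that draws many samples of size n and compares the spread of the sample means with \frac{\sigma}{\sqrt{n}}:

```python
import random
import statistics

random.seed(42)

# Hypothetical population: mu = 50, sigma = 10; samples of size n = 25
mu, sigma, n = 50.0, 10.0, 25
trials = 20000

# Draw many independent samples of size n and record each sample mean
sample_means = [
    statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(trials)
]

# The spread of the sample means should match sigma / sqrt(n) = 10 / 5 = 2
empirical_se = statistics.stdev(sample_means)
theoretical_se = sigma / n ** 0.5
print(round(empirical_se, 2), theoretical_se)
```

With 20,000 replications, the empirical standard error lands very close to the theoretical value of 2 years.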


Joe has a question

Do you remember last week’s dilemma?

Joe wants to choose a path; an estimator for the true population mean.

We tried to help him by computing the bias of each of the three estimators. The idea was to choose the one that has no bias. But it turned out that all three estimators are unbiased.

How then, can we select the best estimator?

Look at the estimators again.

\hat{\mu_{1}} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}

\hat{\mu_{2}} = x_{[1]}+\frac{x_{[n]}-x_{[1]}}{2}

\hat{\mu_{3}} = median(x_{1},x_{2},...,x_{n})

I told you that estimators should be understood as probability distributions and that we can compute the expected value E[\hat{\theta}] for the bias, and V[\hat{\theta}] for the spread of the distribution.

Let’s compute the variance of these three estimators.

V[\hat{\mu_{1}}] = V[\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}]

\hat{\mu_{1}} is the sample mean. We just derived its variance as V[\hat{\mu_{1}}] = \frac{\sigma^{2}}{n}.

So let’s work on the second estimator.

V[\hat{\mu_{2}}] = V[x_{[1]}+\frac{x_{[n]}-x_{[1]}}{2}]

V[\hat{\mu_{2}}] = V[ x_{[1]} + \frac{1}{2}x_{[n]} - \frac{1}{2}x_{[1]}] = V[\frac{1}{2}x_{[1]} + \frac{1}{2}x_{[n]}]

V[\hat{\mu_{2}}] = \frac{1}{4}\sigma^{2} + \frac{1}{4}\sigma^{2}

V[\hat{\mu_{2}}] = \frac{1}{2}\sigma^{2}

The variance of the third estimator \hat{\mu_{3}} will be \sigma^{2} if the sample size is odd, or \frac{1}{2}\sigma^{2} if the sample size is even.

So, the three estimators \hat{\mu_{1}}, \hat{\mu_{2}}, \hat{\mu_{3}} are all unbiased. Their variances are as follows:

V[\hat{\mu_{1}}]=\frac{\sigma^{2}}{n}

V[\hat{\mu_{2}}]=\frac{\sigma^{2}}{2}

V[\hat{\mu_{3}}]=\sigma^{2} or \frac{\sigma^{2}}{2}

The three estimator probability distributions are centered on the truth (\mu), but their spreads are different. If the sample size n is greater than 2 (n>2), estimator \hat{\mu_{1}} has the lowest variance.

They will look like this visually.

Among all the estimators that are unbiased, we choose the one which has minimum variance. \hat{\mu_{1}} is the minimum variance unbiased estimator. It is more likely to produce an estimate close to \mu, the truth.

Among all estimators which are unbiased, choose the one which has minimum variance. This chosen \hat{\theta} is called the minimum variance unbiased estimator (MVUE) of \theta, and it is most likely among all unbiased estimators to produce an estimate close to the truth. This principle is called the principle of minimum variance unbiased estimation.
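Joe's three estimators can be put to the test in a short simulation. This sketch uses hypothetical population values (\mu = 100, \sigma = 15, odd n = 11 so the median is a data point); the exact variances of the midrange and the median depend on the parent distribution, but the key fact holds: all three are centered near \mu, and the sample mean has the smallest spread.

```python
import random
import statistics

random.seed(1)

# Hypothetical population: mu = 100, sigma = 15; odd n so the median is a data point
mu, sigma, n = 100.0, 15.0, 11
trials = 10000

est1, est2, est3 = [], [], []
for _ in range(trials):
    x = sorted(random.gauss(mu, sigma) for _ in range(n))
    est1.append(statistics.fmean(x))         # sample mean
    est2.append(x[0] + (x[-1] - x[0]) / 2)   # smallest value plus half the range
    est3.append(x[n // 2])                   # sample median

# All three are centered near mu, but the sample mean shows the smallest variance
print([round(statistics.fmean(e), 1) for e in (est1, est2, est3)])
print([round(statistics.variance(e), 1) for e in (est1, est2, est3)])
```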


You must be wondering what if an estimator has low variance, but is biased. Like this.

Isn’t \hat{\theta_{1}} more likely to produce an estimate close to the truth?

Perhaps.

So there should be a way to combine bias and variance into one measure to help us with the selection.

The mean squared error is that measure. Let’s see how to derive this measure.

We have an estimator \hat{\theta} for the true parameter \theta.

The error of this estimator from the truth is \hat{\theta}-\theta.

The squared error is (\hat{\theta}-\theta)^{2}.

The mean squared error (MSE) is the expected value of the squared error.

E[(\hat{\theta}-\theta)^{2}]

MSE = E[(\hat{\theta}-\theta)^{2}]

MSE = V[\hat{\theta}-\theta] + (E[\hat{\theta}-\theta])^{2}

Can you tell how we wrote the above expression?

MSE = V[\hat{\theta}] + V[\theta] + (E[\hat{\theta}] - E[\theta])^{2}

MSE = V[\hat{\theta}] + (E[\hat{\theta}] - \theta)^{2}

This is because E[\theta]=\theta and V[\theta]=0, since the true parameter \theta is a constant.

V[\hat{\theta}] is the variance of the estimator.

E[\hat{\theta}] - \theta is the bias of the estimator.

So, MSE = Variance + Bias^{2}
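The decomposition can be verified numerically. In this sketch, the shrinkage estimator 0.8\bar{x} and the population values are hypothetical choices made up to produce a deliberately biased, low-variance estimator:

```python
import random
import statistics

random.seed(7)

# Hypothetical population: mu = 10, sigma = 4; samples of size n = 20
mu, sigma, n = 10.0, 4.0, 20
trials = 50000

# A deliberately biased but low-variance estimator: shrink the sample mean toward zero
estimates = [
    0.8 * statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(trials)
]

mse = statistics.fmean((t - mu) ** 2 for t in estimates)
variance = statistics.variance(estimates)
bias = statistics.fmean(estimates) - mu

# The two sides of MSE = Variance + Bias^2 should agree
print(round(mse, 3), round(variance + bias ** 2, 3))
```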

For unbiased estimators, the mean squared error is equal to the variance of the estimator.

We can use this measure to compare two estimators. If MSE(\hat{\theta_{1}})<MSE(\hat{\theta_{2}}), we can say that \hat{\theta_{1}} is a better, more efficient estimator of the truth.

The ratio \frac{MSE(\hat{\theta_{1}})}{MSE(\hat{\theta_{2}})} is called the relative efficiency of \hat{\theta_{2}} with respect to \hat{\theta_{1}}.

If you can create an estimator that has lower variance but some bias, you can go for it, as long as the reduction in variance is greater than the increase in the squared bias.

In estimating true parameters, there is always a tradeoff between bias and variance.

To close off for the week, let me ask you three questions.

If \frac{MSE(\hat{\theta_{1}})}{MSE(\hat{\theta_{2}})} is the relative efficiency, what is consistency?

Is the estimator for variance \hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}} unbiased? Did you do your homework?

When was the last time you were on RStudio?

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 67 – Bias

“Nooooo,” “No way…” “Ooooh,” “Yessss,” “I told ya,” “I knew it.”

These are some of the funny reactions from our high-school students when the true age is finally revealed; because the bias in their estimation is also revealed.

Every year, I start off the data analysis boot camp with this game from the bag of tricks for teaching statistics by Andrew Gelman.

The students sit in groups of four and look at rolling pictures to guess their ages. They have their own formulas (estimators) for the age!

What you saw in the opening image are the guessed ages of the former Chairman of the Federal Reserve, Janet Yellen, and the great economist Thomas Sowell.

Do you see anything peculiar?

Yes. The estimated age of Chairman Yellen is somewhat close to her true age (as per Wikipedia) and the estimated age of Dr. Sowell is not even close (again, as per Wikipedia).

Let’s compute the average of all the guesses, the expected value of the estimated age.

For Chairman Yellen it is 69.69 years, a difference of 1.31 years from her true age.

Pretty close to the truth. The students’ estimate has very little bias.

For Dr. Sowell, it is 61.88 years, a difference of 25.13 years from his true age.

Far from the truth. The students’ estimate is biased.

Let’s spend a little more time on the estimated ages.

There are 16 groups that guessed the ages based on a picture of the person. Let us use a notation \hat{\theta} for this estimate. So, after heavy thinking and vehement deliberation within the group, the estimate from group A is \hat{\theta}=73 for Chairman Yellen.

Group B is also filled with thinkers and negotiators, so their estimate for Chairman Yellen’s age is \hat{\theta}=70.

Likewise, the 16 groups provided a distribution of estimates. The expected value E[\hat{\theta}] of this distribution is 69.69 years.

The true age \theta is 71 years.

The bias is E[\hat{\theta}]-\theta=-1.31

Look at this visual. I am taking the 16 estimates and plotting them as a histogram (or a probability density). The true age is also shown in the histogram. The estimates are distributed about the true age. The mean of this distribution is very close to 71.

Dr. Sowell, on the other hand, somehow seemed younger in the minds of the high-school students. The expected value of the estimate E[\hat{\theta}] is 61.88 years. The true age is 87 years, a bias (E[\hat{\theta}]-\theta) of around 25 years. The bias could be because of the picture I am using, his pleasant smile, or his aura!

This idea of measuring the closeness of the estimate to the true parameter is called the bias of the estimate.

An estimator (\hat{\theta}) (i.e., a formulation to compute the true population parameter) is unbiased if, on average, it produces the true value: E[\hat{\theta}]=\theta. In other words, if we have a distribution of the estimates, and its mean is centered on the truth, the estimator is unbiased.

♦Θ♦


In lesson 63, we learned that the maximum likelihood estimator of p, the true probability of success for a Binomial distribution is \hat{p}=\frac{r}{n}, where r is the number of successes in a sequence of n independent and identically distributed Bernoulli random numbers. Do you think \hat{p} is unbiased?

Let’s check it out.

The bias of \hat{p} is defined by E[\hat{p}]-p.

E[\hat{p}]=E[\frac{r}{n}]=\frac{1}{n}E[r]=\frac{1}{n}*np=p

The key point in the above derivation is that the expected value of the number of successes (r) for a binomial distribution is np.

The estimate \hat{p}=\frac{r}{n} is an unbiased estimate of the true probability of success.

The distribution of the estimate \hat{p} will be centered on the true probability p.
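A quick simulation sketch illustrates this (the true p = 0.3 and n = 50 are hypothetical values chosen for the demonstration): over many replications, the distribution of \hat{p}=\frac{r}{n} is centered on the true p.

```python
import random
import statistics

random.seed(3)

# Hypothetical Bernoulli experiment: true p = 0.3, with n = 50 trials per sample
p_true, n = 0.3, 50
trials = 20000

# Each replication: count the successes r in n Bernoulli draws, estimate p_hat = r / n
p_hats = [
    sum(random.random() < p_true for _ in range(n)) / n
    for _ in range(trials)
]

# The average of the estimates is centered on the true p
print(round(statistics.fmean(p_hats), 3))
```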

§⇔§

The maximum likelihood estimators for the Normal distribution are \hat{\mu} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} and \hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}. Do you think they are unbiased?

\hat{\mu} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} is the sample mean. Let’s compute the bias of this estimate.

E[\hat{\mu}] = E[\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}]

E[\hat{\mu}] = E[\frac{1}{n}(x_{1}+x_{2}+...+x_{n})]

E[\hat{\mu}] = \frac{1}{n}E[(x_{1}+x_{2}+...+x_{n})]

E[\hat{\mu}] = \frac{1}{n}(E[x_{1}]+E[x_{2}]+...+E[x_{n}])

Since x_{1}, x_{2}, ..., x_{n} are random samples from the population with true parameters \mu and \sigma, E[x_{i}]=\mu and V[x_{i}]=\sigma^{2}. In other words, since x_{i}s are independent and identically distributed, they will have the same expected value and the variance as the population.

So, E[\hat{\mu}] = \frac{1}{n}(\mu+\mu+...+\mu)=\frac{1}{n}*n \mu=\mu

The sample mean \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} is an unbiased estimate of the population mean \mu.

What about the variance estimator \hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}?

You know the chore now. Compute the bias of this estimator and find out for yourself. Passive minds have to wait until next week.

♦Θ♦


Joe is a smart chap. The other day, he asked me if he could use any other estimator to get the population mean (truth). I probed him further, and he came up with three estimators to estimate the true mean \mu.

\hat{\mu_{1}} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}

\hat{\mu_{2}} = x_{[1]}+\frac{x_{[n]}-x_{[1]}}{2}

\hat{\mu_{3}} = median(x_{1},x_{2},...,x_{n})

The first estimator \hat{\mu_{1}} is known to all of us. It is the sample mean.

The second estimator \hat{\mu_{2}} is computing the central tendency as the sum of the smallest value x_{[1]} and half the range of the data \frac{x_{[n]}-x_{[1]}}{2}. It is a reasonable way to tell where the center will be.

The third estimator \hat{\mu_{3}} is the median of the sample.

Can you help Joe choose the correct path?

Your immediate reaction would be, “compute the bias for each estimator and select the one that has the smallest bias.”

Fair enough. Let’s try that then.

We already showed that the first estimator \hat{\mu_{1}} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} is unbiased.

Employing the bias computation for the second estimator, we can see that

E[\hat{\mu_{2}}] = E[x_{[1]}+\frac{x_{[n]}-x_{[1]}}{2}] = E[x_{[1]}] + \frac{1}{2}(E[x_{[n]}]-E[x_{[1]}])

E[\hat{\mu_{2}}] = \mu + \frac{1}{2}(\mu-\mu)=\mu

The second estimator is also unbiased.

Now, depending on whether there is an even or an odd number of samples, you can derive that the third estimator is also unbiased.

How then, can Joe select the best estimator?

To be continued…


Lesson 66 – Spare a moment

Hello! Do you have a moment?

What is the first thing that comes to your mind when you hear the word “moment?”

For some of you, it is “let me capture and share the moment.”

For some of you, it is the coffee cup cliches — “live in the moment,” “enjoy every moment,” and “be happy for this moment.”

Some of you may actually be believing in the “power of the present moment” or “don’t wait for the perfect moment, take the moment and make it perfect” stuff!

How many of you remembered torque from your high school physics lesson?

The torque of a force, or moment of a force about the axis of rotation?

\Gamma = \textbf{r} * \textbf{F}

Moment of a force is the product of the force and its distance from an axis. It is a measure of the tendency of the force to cause a body to rotate.

I am sure you are thinking that the moment you spared for me is not worth it. You don’t want to recollect those moments from the horrid high-school days.

Bear with me for a few more moments. You will see why we are talking about moments in data analysis.

You know by now that we can use a sample to estimate the true value of the parameter of the population. The formulation used to get that estimate is called an estimator, and there are methods of estimation of these parameters. In lesson 63, lesson 64 and lesson 65, we dabbled with the method of Maximum Likelihood Estimation (MLE).

Another method of estimating population parameters from the sample is the Method of Moments.

Let’s see how the idea of the moment from physics is related to the Method of Moments.

Assume that we have a sample of five data points x_{1}, x_{2}, x_{3}, x_{4}, x_{5} from a population with a probability density function f_{X}(x). Let’s place them on a number line like this.

Imagine that each number is represented by an equal-weighted ball. Since the data points are independent and identically distributed, i.e., equally likely, it is reasonable to take the probability of each individual occurrence (\frac{1}{n}) as the mass of the data point.

For any data point, the torque, or the moment about the axis of rotation is \Gamma = \textbf{r} * \textbf{F}.

For example, a data point at a distance of x_{i} from the axis has a moment of x_{i}*\frac{1}{n}. I am dropping the standard gravity term, and you will see why shortly.

Since there is more than one force acting on the sample number line, we can compute the total torque as \Gamma = {\displaystyle \sum_{i=1}^{n}r_{i} * F_{i}}.

In other words, the total moment of the sample data about the axis of rotation is {\displaystyle \sum_{i=1}^{n}x_{i}*\frac{1}{n}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}.

Did you notice that this is the equation for the sample mean? \bar{x}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}. The centroid of the data points.
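The equivalence is immediate in code. Here is a tiny sketch with five hypothetical data points, each carrying equal mass \frac{1}{n}:

```python
# Five hypothetical data points, each carrying equal mass 1/n
x = [2.0, 3.5, 5.0, 7.5, 9.0]
n = len(x)

total_moment = sum(xi * (1 / n) for xi in x)  # sum of (distance * mass) about the origin
sample_mean = sum(x) / n                      # the centroid of the points

print(total_moment, sample_mean)
```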

The population has a probability density function f_{X}(x).

We saw in Lesson 64 on how to divide the space into equal intervals of length dx. The probability that a given sample data point x_{i} falls in the range dx is f_{X}(x_{i})dx for an infinitesimally small range dx. Like in the sample case, we can assume this is the mass of any subdivision. Hence, the moment of this infinitesimal body about the rotation axis is x_{i}*f_{X}(x_{i})dx.

For the entire population, the total moment is {\displaystyle \int_{- \infty}^{\infty}x*f_{X}(x)dx}.

I know. You will tell me that this is the expected value E[X], or the centroid of the probability density function.

E[X] = {\displaystyle \int_{- \infty}^{\infty}x*f_{X}(x)dx}

If the five data points x_{1}, x_{2}, x_{3}, x_{4}, x_{5} are originating from a population with a probability density function f_{X}(x), it is reasonable to equate the total moment of the sample data to the total moment of the population data. The centroid of the sample is equal to the centroid of the probability density function.

{\displaystyle \int_{- \infty}^{\infty}x*f_{X}(x)dx} =\frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}


E[X] = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}


Let me pester you again!

What is the moment of inertia?

No worries. Let’s activate your deep memory neurons. Imagine that the sample data points are rotating on the plane about the axis with a velocity v = r\omega, where \omega is the constant angular velocity of the plane. In our example of five data points, the velocity of any given point will be v_{i} = x_{i}\omega.

If the body is rotating, it will be subject to a tangential force of m\frac{dv}{dt}. For our sample point which is at a radial distance of x_{i} from the axis, the tangential force will be \frac{1}{n}\frac{dv}{dt} = \frac{1}{n}x_{i}\frac{d \omega}{dt}.

To simplify life, let’s call \frac{d \omega}{dt}=\alpha, the angular acceleration of the point.

The tangential force is \frac{1}{n}x_{i}\alpha, and the torque of the tangential force applied on one point is x_{i}*\frac{1}{n}x_{i}\alpha = \frac{1}{n}x_{i}^{2}\alpha.

The total torque of all the tangential forces = {\displaystyle \sum_{i=1}^{n}\frac{1}{n}x_{i}^{2}\alpha}.

The term {\displaystyle \sum_{i=1}^{n}\frac{1}{n}x_{i}^{2}} is called the moment of inertia of the points.

Just looking at the equation, you can say that the moment of inertia of points far apart is larger than the moment of inertia of the points closer to the axis.

This moment of inertia is called the second moment and it gives us a sense of how spread out the data points are about the axis.

Sample moment of inertia = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{2}}.

By extension to the population, we can write the population moment of inertia, or the population second moment as

E[X^{2}]={\displaystyle \int_{- \infty}^{\infty}x^{2}f_{X}(x)dx}.

Notice the generalization E[X^{2}] for the second moment, just like E[X]={\displaystyle \int_{- \infty}^{\infty}x*f_{X}(x)dx} is the first moment.

Population moments are the expected values of the powers of the random variable X.

I can sense that you are charged up by the word “spread.” Yes, the second moment will be used to get the second parameter, the variance of the distribution.

We can equate the sample moment of inertia to the population moment of inertia.


E[X^{2}] = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{2}}


This form of equating sample moments to the population moments is called the Method of Moments for parameter estimation.

It was developed by Karl Pearson in the 1890s. Pearson demonstrated this method by fitting different forms of frequency curves (probability density functions) to data, calculating as many moments from the sample data as there are parameters of the probability density function.

A generalized representation of the Method of Moments can be done like this.


E[X^{k}] = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{k}}


To estimate the k parameters of the probability density function, we can equate the k population moments to the k sample moments.

If the probability density function has one parameter, the first moment equation is sufficient. If the function has two parameters, we have the first and the second moment equations — two unknowns, two equations to solve. If the function has three parameters, we go for three moment equations.

Time for some action

Let’s solve for the parameters of the normal distribution.

There is a sample of n data points x_{1}, x_{2}, x_{3}, ..., x_{n} that we think originated from a normal distribution (population) with a probability density function f_{X}(x) = \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{\frac{-1}{2}(\frac{x-\mu}{\sigma})^{2}}.

x_{1}, x_{2}, x_{3}, ..., x_{n} \sim N(\mu, \sigma)

\mu and \sigma are the parameters of the normal probability density function. Let’s apply the Method of Moments to estimate these parameters from the sample. In other words, let’s formulate the equations for \mu and \sigma.

The first population moment is E[X^{1}]. The first sample moment is \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{1}}.

For the normal distribution, we know that the expected value E[X] is the mean \mu.

\mu = E[X] = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}

The second population moment is E[X^{2}]. The second sample moment is \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{2}}.

E[X^{2}]. Hmm. You have to go down the memory lane, to Lesson 26 to realize that E[X^{2}]=V[X] + (E[X])^{2}.

.
.
.
Did you see that? Yes, so E[X^{2}]=\sigma^{2} + \mu^{2}. Equating this to the sample second moment, we get

\sigma^{2} + \mu^{2} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{2}}

\sigma^{2} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{2}} - \mu^{2}

\sigma^{2} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}^{2}} - (\bar{x})^{2}

\sigma^{2} = \frac{1}{n} \Big ( \sum x_{i}^{2} - n\bar{x}^{2}\Big )

\sigma^{2} = \frac{1}{n} \Big ( \sum x_{i}^{2} - 2n\bar{x}^{2} + n \bar{x}^{2} \Big )

\sigma^{2} = \frac{1}{n} \Big ( \sum x_{i}^{2} - 2\bar{x}n\bar{x} + n \bar{x}^{2} \Big )

\sigma^{2} = \frac{1}{n} \Big ( \sum x_{i}^{2} - 2\bar{x}\sum x_{i} + n \bar{x}^{2} \Big )

\sigma^{2} = \frac{1}{n} \Big ( \sum x_{i}^{2} - \sum 2\bar{x} x_{i} + n \bar{x}^{2} \Big )

\sigma^{2} = \frac{1}{n} \Big ( \sum x_{i}^{2} - \sum 2\bar{x} x_{i} + \sum \bar{x}^{2} \Big )

\sigma^{2} = \frac{1}{n} \sum (x_{i}^{2} - 2\bar{x} x_{i} + \bar{x}^{2})

\sigma^{2} = \frac{1}{n} \sum (x_{i} - \bar{x})^{2}
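The two moment equations for the normal distribution can be checked numerically. In this sketch, the population values \mu = 20 and \sigma = 3 are hypothetical; the method-of-moments estimates are recovered from the first and second sample moments alone:

```python
import random

random.seed(11)

# Hypothetical sample from a normal population with mu = 20, sigma = 3
sample = [random.gauss(20.0, 3.0) for _ in range(100000)]
n = len(sample)

m1 = sum(sample) / n                    # first sample moment
m2 = sum(xi * xi for xi in sample) / n  # second sample moment

mu_hat = m1                 # from E[X] = mu
sigma2_hat = m2 - m1 ** 2   # from E[X^2] = sigma^2 + mu^2

# Estimates should land near mu = 20 and sigma^2 = 9
print(round(mu_hat, 2), round(sigma2_hat, 2))
```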


Let’s spend a few more moments and solve for the parameters of the Gamma distribution.

The probability density function of the Gamma distribution is

f(x) = \frac{\lambda e^{-\lambda x}(\lambda x)^{r-1}}{(r-1)!}

It has two parameters, r and \lambda.

\lambda is the rate parameter (the reciprocal of the scale). It controls the width of the distribution, and r is called the shape parameter because it controls the shape of the distribution.

We know that the expected value of the Gamma distribution E[X] = \frac{r}{\lambda}. If you forgot this, solve the integral \int xf(x)dx for the Gamma distribution. Equating this first population moment to the sample moment, we can see that

\frac{r}{\lambda} = \frac{1}{n}\sum x_{i} = \bar{x}

r = \lambda \bar{x}

Then, we know that V[X] for the Gamma distribution is \frac{r}{\lambda^{2}}.

Using this with the second population moment, E[X^{2}]=V[X] + (E[X])^{2}, we can see that E[X^{2}] = \frac{r(r+1)}{\lambda^{2}}.

We can now equate the second population moment to the second sample moment.

E[X^{2}] = \frac{1}{n}\sum x_{i}^{2}

\frac{r(r+1)}{\lambda^{2}} = \frac{1}{n}\sum x_{i}^{2}

\frac{\lambda \bar{x} (\lambda \bar{x}+1)}{\lambda^{2}} = \frac{1}{n}\sum x_{i}^{2}

\frac{\bar{x}(\lambda \bar{x}+1)}{\lambda} = \frac{1}{n}\sum x_{i}^{2}

 \bar{x} + \lambda \bar{x}^{2} = \frac{\lambda}{n}\sum x_{i}^{2}

\lambda (\frac{1}{n}\sum x_{i}^{2} - \bar{x}^{2}) = \bar{x}

\lambda = \frac{\bar{x}}{\frac{1}{n}\sum(x_{i}-\bar{x})^{2}}

So,

r = \frac{\bar{x}^{2}}{\frac{1}{n}\sum(x_{i}-\bar{x})^{2}}
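Here is a sketch that checks the algebra on simulated data. The parameter values r = 3 and \lambda = 2 are hypothetical; note that Python's random.gammavariate takes the shape and the scale \frac{1}{\lambda}:

```python
import random

random.seed(5)

# Hypothetical Gamma parameters: shape r = 3, lambda = 2
r_true, lam_true = 3.0, 2.0

# random.gammavariate takes the shape and the scale (1 / lambda)
sample = [random.gammavariate(r_true, 1 / lam_true) for _ in range(200000)]
n = len(sample)

xbar = sum(sample) / n
s2 = sum((xi - xbar) ** 2 for xi in sample) / n  # (1/n) * sum of squared deviations

lam_hat = xbar / s2        # lambda = xbar / ((1/n) sum (x_i - xbar)^2)
r_hat = xbar ** 2 / s2     # r = lambda * xbar = xbar^2 / ((1/n) sum (x_i - xbar)^2)

# Estimates should land near r = 3 and lambda = 2
print(round(r_hat, 2), round(lam_hat, 2))
```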

Very tiring, I know. You will rarely find all the algebra in textbooks. So, the moments are worth it.

I asked you to spare a moment. Now you are going “this is quite a moment.” What do you think? Would you remember the Method of Moments next time you spare a moment?


Lesson 65 – Most Likelihood

“Maybe there is one estimate.”

Jill is staring at the plot on his desktop. The 6×6 cubicle and the gloomy weather outside are certainly not helping his concentration. Jill, the new intern at the Department of Environmental Protection, is working on developing estimates for the probability of water quality violations in summer at Pier 84, NYC’s Hudson River Park. To toy around with the idea, his boss gave him ‘sample data’ for seven days, August 1 – August 7, 2016.

“I should perhaps develop two estimates; the probability of violation during the day and probability of violation during the night.”

Dissolved oxygen (DO) is used as an indicator by the New York State Department of Environmental Conservation to measure ecological health. A typical standard on acceptable water quality is a DO greater than 5 mg/L.

Jill saved the Excel file, unplugged the pen drive from his computer, tossed it in his backpack along with a few other folders, and did something on his phone. He took a deep breath, got up, and walked out to the front door of his office.

It was around 6 pm when he stepped out.

♦♦♦

A black Ford with an Uber sticker on the windshield stopped in front of an apartment building at 96th Street and Amsterdam Avenue. Jill is back at his apartment. Joe, his roommate, has already ordered Chinese. They plan to work on Jill’s problem over the evening along with writing the end of the semester project reports. The floor in the living room is filled with books, papers of all sort, pens, and markers. Along the glazed window facing Amsterdam Avenue, there are half-burnt candles. It is their finals week.

“I think two separate estimates will be appropriate,” said Jill to Joe as they both look at the plot on Jill’s laptop.

“The sample data I have has 672 data points of measured DO every 15 minutes. I have to develop estimates for p = P(DO < 5). I am thinking of dividing the data into daytime observations and nighttime observations and provide p_{day} and p_{night}. But, a single estimate for all the data, p_{full} may also be okay.

“Is there reason to believe that there can be two separate estimates?” asked Joe.

“Yes. Did you see that DO is higher during the day and is low during the night? I have done some study on this and found that such a daily cycle is due to the day-night change in phytoplankton growth. The DO increases at sunrise as phytoplankton begins photosynthesis and produces oxygen. During later afternoon and evening, the chlorophyll starts decreasing, and during the night, it is low.”

“I started making some simple assumptions on the day and night measurements. Morning 6 am to evening 6 pm is daytime and 6 pm to 6 am in the morning is nighttime. In the ‘sample data,’ there are 364 daytime measurements and 308 nighttime measurements. During the daytime, there are 185 violations, i.e., when DO is less than 5 mg/L. During the nighttime, I found 233 violations. So,

p_{day} = \frac{185}{364} = 0.51

p_{night} = \frac{233}{308} = 0.76.

However, a small part of me is saying that I should count all the data and provide

p_{full} = \frac{418}{672} = 0.62,

since the overall violations are 418 for the 672 measurements.

I am not able to decide.”

“Why don’t you test both your assumptions using the maximum likelihood approach?” Joe suggested. “Since the maximum likelihood estimation method computes the parameters of the population distribution that agree most closely with the sample observations, you can test your data with these two formulations. The formulation that gives the higher likelihood will win. What say?”

“That is an interesting thought. How do we set up such a test?” asked Jill as he cracks up his fortune cookie.

“Simple. Since you are looking for violations, it is safe to assume a Bernoulli distribution for the data. X \in \{0,1\}: 0 if DO > 5 mg/L and 1 if DO < 5 mg/L, with a parameter p, the probability of the violation. Your sample data [x_{1}, x_{2}, ..., x_{n}] can be used to find the optimal parameter that maximizes the likelihood (joint probability of occurrence) of observing the sample.”

As Joe stretches to grab a piece of paper, Jill speaks up.

“Okay, let’s say there are two different estimators to estimate the probability of violation; p_{full} assumes that all the data comes from the same Bernoulli distribution, and p_{day},p_{night} assumes that there are different Bernoulli distributions for nighttime samples and daytime samples. You mean to say, we can test these two theories?”

“Precisely,” said Joe as he shows his scribble on the paper.

Assumption 1: Full data comes from the same population.

Assumption 2: Partition the data into night and day.

Jill seems to be excited. He is already visualizing himself presenting this testing framework to his boss. “Let’s compute the log-likelihood for the first assumption.” He pulls up his laptop to open lesson 63 of the data analysis classroom where Joe and Devine discussed the likelihood function for the Bernoulli distribution.

“The log-likelihood function for Bernoulli distribution is ln(L) = rln(p)+(n-r)ln(1-p). n is the number of samples, 672 in our study, r is the number of 1s in the sample data, 418 total violations, and p is the parameter, the probability of violation. p_{full}=0.62.”

“Yes, and the log-likelihood is -445.59,” said Joe as he completes the equation on the paper.

Assumption 1: Full data comes from the same population.

ln(L_{1}) = r_{full}ln(p_{full})+(n_{full}-r_{full})ln(1-p_{full})

ln(L_{1}) = 418*ln(0.62)+(672 - 418)ln(0.38)

ln(L_{1}) = -445.59

Jill gives an agreeable nod. “How can we write the log-likelihood equation for the second assumption?” he asks immediately.

Joe wastes no time. He pens down a few quick lines.

L_{2} = {\displaystyle \prod_{i=1}^{n_{day}}f(x_{i}|p_{day})}{\displaystyle \prod_{i=1}^{n_{night}}f(x_{i}|p_{night})}

L_{2} = p_{day}^{r_{day}}*(1-p_{day})^{n_{day}-r_{day}}*p_{night}^{r_{night}}*(1-p_{night})^{n_{night}-r_{night}}

“If the data is partitioned into night and day, there will be two likelihood functions, one for the day and one for the night. If the samples are independent and identically distributed, the overall likelihood is the product of the two,” he says. Be watchful with the assumption of independence; you have to check for it.

Jill stares at it for a bit, looks up at the ceiling, and seems to concur with the thought.

“Let me write the log-likelihood function for this,” he says as he starts writing the equation on the paper.

ln(L_{2}) = r_{day}ln(p_{day}) + (n_{day}-r_{day})ln(1 - p_{day}) + r_{night}ln(p_{night}) + (n_{night}-r_{night})ln(1 - p_{night}) 

ln(L_{2}) = 185ln(0.51) + (364 - 185)ln(0.49) + 233ln(0.76) + (308 - 233)ln(0.24)

ln(L_{2}) = -423.24

“So, ln(L_{1}) = -445.59 and ln(L_{2}) = -423.24. ln(L_{2}) > ln(L_{1}). Clearly, Assumption 2 is more appropriate based on the data and the maximum likelihood method.”
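Joe's scribbles can be reproduced in a few lines. The counts come from the story (672 samples with 418 violations; 364 daytime samples with 185 violations; the nighttime counts follow by subtraction). The classroom works in R, but this Python sketch carries the same logic.

```python
from math import log

def bernoulli_loglik(r, n):
    """Bernoulli log-likelihood r*ln(p) + (n-r)*ln(1-p), evaluated at the MLE p = r/n."""
    p = r / n
    return r * log(p) + (n - r) * log(1 - p)

# Counts from the story; the nighttime counts follow by subtraction
n_full, r_full = 672, 418
n_day, r_day = 364, 185
n_night, r_night = n_full - n_day, r_full - r_day

ll_one = bernoulli_loglik(r_full, n_full)  # Assumption 1: one population
ll_two = bernoulli_loglik(r_day, n_day) + bernoulli_loglik(r_night, n_night)  # Assumption 2

print(round(ll_one, 2), round(ll_two, 2))  # ll_two is larger, favoring the split
```

Because each Bernoulli piece is evaluated at its own maximum likelihood estimate, the partitioned model can only match or beat the pooled one on log-likelihood.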

“Yes. You can show this work tomorrow and ask for more data from your boss,” said Joe as he grabs some papers from the floor.

“Absolutely. I will prepare a few graphics with data and equations and show it to him. I think he will like it.”

So, as the night grows late, Joe starts his project report, and Jill prepares for his meeting tomorrow.

♦♦♦

A thought on the precision of the estimate comes to Jill’s mind as he types up his equations.

“The maximum likelihood estimation method seems to be an elegant way to come up with parameters for probability distributions. The likelihood itself can be used to test assumptions, as we did here. But, are we confident enough that the maximization produces precise estimates? I mean, maximization is like finding the peak of a function. How can we be confident that the value where we find the peak is very precise?” asked Jill.

“Well, since you have over 600 data points, your estimate will be precise. Precision is a function of sample size. Smaller sample sizes will yield less precise estimates,” Joe said without looking up from his writing.

“How do you mean?”

“Okay. Let’s take your example, the log-likelihood function for the Bernoulli distribution: ln(L) = rln(p)+(n-r)ln(1-p). \hat{p}=\frac{r}{n} is the maximum likelihood estimate of p. For a given sample X = [x_{1}, x_{2}, ..., x_{n}], \hat{p} maximizes the likelihood (joint probability of occurrence) of observing the sample. However, for smaller sample sizes, the parameter is not estimated very precisely.”

“Can you illustrate it?”

“Sure. Give me a second.” Joe types up a few lines of code in his RStudio console and creates a figure. “Look at this,” he said.

“This graphic shows the log-likelihood (L_{1}) for various values of p and a sample size n = 10. Imagine you had a sample of 10 observations with 6 or 7 violations. The peak of the function is at p = 0.6 or so, but around that region, the function is flat. So, pinpointing the peak at 0.62 is not possible. Less precise estimate.”

“To see this better, I computed the difference in the log-likelihood at every point from its value at the peak. This difference (slope) function shows the precision. At the best estimate, 0.6-ish, the slope function should be very steep for the estimate to be precise; the function is peaky at the best estimate. If the slope function is flat, the estimate is less precise.”

“I created this difference function for n = 20, 50, and 200 to see how the slope changes with increasing sample size. This is the graphic you see here with four different function lines. Can you see any pattern?”

Jill squints as he looks at the figures. “Yes, I see the slope function is flat for n = 10 and gets peaky as n increases. It is very steep at 0.62 for n = 200. The more samples we have, the more precise the estimates we can get.”
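The pattern Jill sees can be reproduced numerically. Below is a short Python sketch (hypothetical, holding the peak at \hat{p} = 0.62 and treating r = n\hat{p} as a smooth stand-in even when it is not an integer) that measures how much log-likelihood is lost a small step away from the peak, for each sample size.

```python
from math import log

def drop_from_peak(p, n, p_hat=0.62):
    """How far the Bernoulli log-likelihood at p falls below its peak at p_hat,
    for a sample of size n with r = n * p_hat violations (smooth stand-in)."""
    ll = lambda q: n * (p_hat * log(q) + (1 - p_hat) * log(1 - q))
    return ll(p) - ll(p_hat)

# The same small step away from the peak costs more log-likelihood as n grows,
# i.e., the function gets peakier and the estimate more precise
for n in (10, 20, 50, 200):
    print(n, round(drop_from_peak(0.52, n), 2))
```

The drop grows in proportion to n, which is exactly the flattening-versus-peakiness Jill describes.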

“Should I include this too when I talk tomorrow?”

“Why not? The more information you present, the better. It shows your thought process and rigor.”

“Also, show that the maximum likelihood estimation method has an invariance property.”

“What is that?”

“If \hat{\theta} is the maximum likelihood estimator for a parameter \theta, then the maximum likelihood estimator of any function g(\theta) is g(\hat{\theta}), the same function g of the estimator.”

“Think about variance and standard deviation, for example. We know that the maximum likelihood estimator for the variance is \hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}. The standard deviation, \sigma = \sqrt{\sigma^{2}}, is a function of the variance. So, the maximum likelihood estimator for the standard deviation is \hat{\sigma}=\sqrt{\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}}.”
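The invariance property can be checked numerically. In this sketch (hypothetical sample values, with \mu held at the sample mean), the normal log-likelihood is maximized by brute force once over the variance and once over the standard deviation; the square root of the first answer matches the second.

```python
from math import log, sqrt

x = [4.2, 5.1, 3.8, 6.0, 4.9, 5.5]    # hypothetical sample
n = len(x)
mu = sum(x) / n
ss = sum((xi - mu) ** 2 for xi in x)  # sum of squared deviations

def loglik_var(s2):
    # Normal log-likelihood as a function of the variance (constants dropped)
    return -n / 2 * log(s2) - ss / (2 * s2)

grid = [i / 10000 for i in range(1, 40000)]
s2_hat = max(grid, key=loglik_var)                   # maximize over sigma^2
s_hat = max(grid, key=lambda s: loglik_var(s * s))   # maximize over sigma

print(round(sqrt(s2_hat), 2), round(s_hat, 2))  # the two agree to grid precision
```

Maximizing over \sigma instead of \sigma^{2} lands on \sqrt{\hat{\sigma^{2}}}, which is the invariance property in action.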

“Since you have large sample sizes, while you discuss the properties, you can throw in the fact that the maximum likelihood method produces estimates (\hat{\theta}) that are approximately normally distributed, and approximately unbiased with minimum variance.”

“Wait, you just threw in a few new terms like bias and minimum variance, and you are asking me to throw them at my boss!”

Jill looks clearly lost. It may also be the late night. They both go back to their work, hoping to sail through hurricane FINALS.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 64 – More Likelihood

J: The maximum likelihood estimation method provides the formulation for computing the parameters of the population distribution which agree most closely with the sample observations. For a given sample X = [x_{1}, x_{2}, ..., x_{n}], the optimal parameter maximizes the likelihood (joint probability of occurrence) of observing the sample. Last week, we derived the parameters for the Binomial and the Poisson, both discrete distributions. I wonder if we can use the same logic for continuous distributions?

D: Hello Joe. You look recharged and ready to go. So let’s cut to the chase. Yes, we can apply the same joint probability logic to the continuous probability functions. Let’s take a random sample X = [x_{1}, x_{2}, ..., x_{n}] from a population with a probability density function f_{X}(x) as in this figure.

Assume we can divide the space into equal intervals of length dx. Can you tell me the probability that a given sample data point x_{1} falls in the range dx?

J: That would be the integral of f_{X}(x) over the range dx.

D: Correct. For an infinitesimally small range dx, we can approximate this area as f_{X}(x_{1})dx. Since the sample observations are independent and identically distributed, the joint probability of observing them, or the likelihood function is

L = f_{X}(x_{1})dx*f_{X}(x_{2})dx*...*f_{X}(x_{n})dx

L = {\displaystyle \prod_{i=1}^{n}f_{X}(x_{i})}(dx)^{n}

Maximizing the above function is equivalent to maximizing L = {\displaystyle \prod_{i=1}^{n}f_{X}(x_{i})} since we have equal intervals of length dx.

J: I see. So, it reduces to the same likelihood we computed last week; the product of the probability density function evaluated at the sample values. We can choose the value of the parameter(s) that maximizes this quantity.

D: Precisely. Here are Fisher’s own words from his paper.

J:  Shall we try some continuous distributions today?

D: We shall. Begin with Exponential Distribution.

J: Let me solve it. The probability density function for the exponential distribution is f_{X}(x) = \lambda e^{- \lambda x}. So the likelihood function is

L = {\displaystyle \prod_{i=1}^{n}\lambda e^{- \lambda x_{i}}}

L = \lambda^{n}e^{-\lambda (x_{1}+x_{2}+...+x_{n})}

The log-likelihood will be

ln(L) = nln(\lambda) - \lambda (x_{1}+x_{2}+...+x_{n})

\lambda is the parameter we are trying to estimate by maximum likelihood. So take the derivative of the function ln(L) with respect to \lambda and equate it to 0 to solve for \lambda.

\frac{d ln(L)}{d \lambda}=0

\frac{n}{\lambda} - \sum_{i=1}^{n}x_{i}=0

\lambda = \frac{1}{\frac{1}{n}\sum_{i=1}^{n}x_{i}} = \frac{1}{\bar{x}}=\frac{1}{E[X]}

The maximum likelihood estimator for the exponential distribution is \hat{\lambda} = \frac{1}{\bar{x}}.
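To double-check the algebra, the sketch below (Python, with a hypothetical rate \lambda = 2) compares the closed-form estimate 1/\bar{x} with the brute-force grid search of lesson 63 applied to the log-likelihood.

```python
from math import log
import random

random.seed(42)
x = [random.expovariate(2.0) for _ in range(5000)]  # hypothetical exponential sample
n, sx = len(x), sum(x)

lam_closed_form = n / sx  # the MLE 1/x-bar derived above

# Brute force: scan the log-likelihood n*ln(lam) - lam*sum(x) over a fine grid
def loglik(lam):
    return n * log(lam) - lam * sx

lam_grid = max((i / 1000 for i in range(1, 5000)), key=loglik)

print(round(lam_closed_form, 3), round(lam_grid, 3))
```

The grid answer matches the closed form to the grid resolution, confirming that the peak of ln(L) sits at 1/\bar{x}.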

D: Excellent. Another continuous function you can try is the Rayleigh density. It is the probability distribution of the distance r = \sqrt{x^{2}+y^{2}} from the origin to a point with coordinates (x , y) in an x-y plane. Both X and Y should have a normal distribution.

f(r) = \frac{r}{\theta}e^{-\frac{r^{2}}{2 \theta}}

\theta is the parameter.

Do you recognize this function? We learned a probability distribution which is similar to this.

J: Aaah… Chi-square? Sum of squares of normals.

D: Correct. Rayleigh is the distribution of the distance. The square root of the Chi-square distribution.

J: Let me go through the steps of estimating \theta using the maximum likelihood method.

L = {\displaystyle \prod_{i=1}^{n}\frac{r_{i}}{\theta}e^{-\frac{r_{i}^{2}}{2\theta}}}

L = \frac{1}{\theta^{n}} e^{- \frac{1}{2 \theta}\sum_{i=1}^{n}r_{i}^{2}} {\displaystyle \prod_{i=1}^{n}r_{i}}

ln(L) = -nln(\theta) + {\displaystyle \sum_{i=1}^{n}ln(r_{i})}-\frac{1}{2 \theta}{\displaystyle \sum_{i=1}^{n}r_{i}^{2}}

We have one parameter to estimate, so \frac{d ln(L)}{d \theta}=0.

-\frac{n}{\theta}+\frac{1}{2 \theta^{2}}{\displaystyle \sum_{i=1}^{n}r_{i}^{2}}=0

\theta = \frac{1}{2n}{\displaystyle \sum_{i=1}^{n}r_{i}^{2}}

The maximum likelihood estimator for the Rayleigh distribution is \hat{\theta} = \frac{1}{2n}{\displaystyle \sum_{i=1}^{n}r_{i}^{2}}
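A quick simulation backs this up. Here, hypothetical (x, y) coordinates are drawn from normal distributions with variance 4, so in the lesson's parameterization the true \theta is 4; the estimator \frac{1}{2n}\sum r_{i}^{2} should land close to it.

```python
import random
random.seed(1)

# Distances from the origin, with x and y drawn from N(0, sigma^2), sigma = 2
# For this construction the Rayleigh parameter theta equals sigma^2 = 4
r = [(random.gauss(0, 2.0) ** 2 + random.gauss(0, 2.0) ** 2) ** 0.5
     for _ in range(20000)]

theta_hat = sum(ri * ri for ri in r) / (2 * len(r))  # MLE from the derivation above
print(round(theta_hat, 2))  # close to 4
```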

D: Very well done. Now let’s flex our muscles. How can we estimate two or more parameters?

J: Like for example, the parameters of the normal distribution, \mu and \sigma?

D: Yes.

J:

D: It is a simple extension. Since there will be two parameters in the likelihood function of a normal distribution, we have to solve the set of equations derived through the partial derivatives. Look at these steps for the normal distribution.

f_{X}(x) = \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^{2}}

L = {\displaystyle \prod_{i=1}^{n}} \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{-\frac{1}{2}(\frac{x_{i}-\mu}{\sigma})^{2}}

L = (2\pi)^{-n/2}\sigma^{-n}e^{-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(x_{i}-\mu)^{2}}

ln(L)=-\frac{n}{2}ln(2\pi) - nln(\sigma) - \frac{1}{2\sigma^{2}}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}

Now we can have two equations based on the partial derivatives of the log-likelihood function.

\frac{\partial ln(L)}{\partial \mu}=0

\frac{\partial ln(L)}{\partial \sigma}=0

Let’s take the first partial derivative.

\frac{\partial ln(L)}{\partial \mu}=0

\frac{\partial}{\partial \mu}(- \frac{1}{2\sigma^{2}}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}) =0

\frac{\partial}{\partial \mu}(- \frac{1}{2\sigma^{2}}({\displaystyle \sum_{i=1}^{n} x_{i}^{2}} + {\displaystyle \sum_{i=1}^{n}\mu^{2}} - 2\mu{\displaystyle \sum_{i=1}^{n}x_{i}})) = 0

-\frac{1}{2\sigma^{2}} {\displaystyle \sum_{i=1}^{n}2\mu} + \frac{1}{\sigma^{2}} {\displaystyle \sum_{i=1}^{n}x_{i}}=0

n\mu = {\displaystyle \sum_{i=1}^{n}x_{i}}

\mu = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}

Now, take the second partial derivative.

\frac{\partial ln(L)}{\partial \sigma}=0

\frac{\partial}{\partial \sigma}(-\frac{n}{2}ln(2\pi) - nln(\sigma) - \frac{1}{2\sigma^{2}}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}) = 0

-\frac{n}{\sigma}+\frac{1}{\sigma^{3}}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}=0

\sigma^{2} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}

The maximum likelihood estimators for the Normal distribution are \hat{\mu} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}} and \hat{\sigma^{2}}=\frac{1}{n}{\displaystyle \sum_{i=1}^{n} (x_{i}-\mu)^{2}}
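As a sanity check, this sketch computes \hat{\mu} and \hat{\sigma}^{2} for a small hypothetical sample and verifies that nudging either parameter away from these values only lowers the log-likelihood. Note the variance divides by n, not n-1.

```python
from math import log, pi

x = [2.1, 3.4, 1.9, 4.2, 2.8, 3.1, 2.5, 3.9]  # hypothetical sample
n = len(x)

mu_hat = sum(x) / n
s2_hat = sum((xi - mu_hat) ** 2 for xi in x) / n  # divide by n, not n - 1

def loglik(mu, s2):
    return (-n / 2 * log(2 * pi) - n / 2 * log(s2)
            - sum((xi - mu) ** 2 for xi in x) / (2 * s2))

peak = loglik(mu_hat, s2_hat)
for nudge in (-0.1, 0.1):
    assert loglik(mu_hat + nudge, s2_hat) < peak  # moving mu lowers it
    assert loglik(mu_hat, s2_hat + nudge) < peak  # moving sigma^2 lowers it

print(round(mu_hat, 4), round(s2_hat, 4))
```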

J: That makes it very clear. There is some work involved in solving, but the procedure is straightforward.

D: You can try estimating the parameters of the lognormal distribution for practice.

J: Well, lognormal has a very similar probability density function. Let me try Gumbel. It has a double exponential distribution. Will be fun to solve.

D: Okay. Go for it.

J: The cumulative function for the Gumbel distribution is F(z)=e^{-e^{-\frac{z-\alpha}{\beta}}}. \alpha and \beta are the parameters.

Taking the derivative of this function, we can get the probability density function for the Gumbel distribution as

f(z)=\frac{1}{\beta}e^{-\frac{z-\alpha}{\beta}}e^{-e^{-\frac{z-\alpha}{\beta}}}

The likelihood function is

L = {\displaystyle \prod_{i=1}^{n} \frac{1}{\beta}e^{-\frac{z_{i}-\alpha}{\beta}}e^{-e^{-\frac{z_{i}-\alpha}{\beta}}}}

L = \beta^{-n} e^{-{\displaystyle \sum_{i=1}^{n} (\frac{z_{i}-\alpha}{\beta} + e^{-\frac{z_{i}-\alpha}{\beta}}) }}

ln(L) = -nln(\beta) -{\displaystyle \sum_{i=1}^{n} (\frac{z_{i}-\alpha}{\beta} + e^{-\frac{z_{i}-\alpha}{\beta}}) }

We can get the partial derivatives with respect to \alpha and \beta and solve the system of equations.

\frac{\partial ln(L)}{\partial \alpha} = \frac{n}{\beta} - \frac{1}{\beta}{\displaystyle \sum_{i=1}^{n}e^{-\frac{z_{i}-\alpha}{\beta}}}=0

\frac{\partial ln(L)}{\partial \beta} = -\frac{n}{\beta} + \frac{1}{\beta^{2}}{\displaystyle \sum_{i=1}^{n}(z_{i}-\alpha)} - \frac{1}{\beta^{2}}{\displaystyle \sum_{i=1}^{n}(z_{i}-\alpha) e^{-\frac{z_{i}-\alpha}{\beta}}}=0

D: Yes, I know. There is no closed-form solution for this system. You have to use iterative numerical techniques. Do you remember the Newton-Raphson method?

J:

D: It’s okay if you do not remember. We will let R do it for us. The moral is that the MLE method is not a panacea. We also run into issues of discontinuity in likelihood functions sometimes. Think of the uniform distribution, for example. As you know, we cannot use calculus methods on discontinuous functions. A combination of the MLE method with numerical techniques and approximations is a good approach.
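To make the iterative idea concrete, here is a sketch (hypothetical Gumbel data, and a simple fixed-point iteration rather than full Newton-Raphson) for the two equations above. Substituting the \alpha-equation into the \beta-equation reduces the system to \beta = \bar{z} - \frac{\sum z_{i} e^{-z_{i}/\beta}}{\sum e^{-z_{i}/\beta}}, which is iterated; \alpha then follows in closed form.

```python
import math, random
random.seed(7)

# Hypothetical Gumbel(alpha = 10, beta = 3) sample via the inverse CDF:
# z = alpha - beta * ln(-ln(u)), u uniform on (0, 1)
alpha_true, beta_true = 10.0, 3.0
z = [alpha_true - beta_true * math.log(-math.log(random.random()))
     for _ in range(5000)]
n, zbar = len(z), sum(z) / len(z)

# Fixed-point iteration on beta; start from the sample standard deviation
beta = (sum((zi - zbar) ** 2 for zi in z) / n) ** 0.5
for _ in range(200):
    w = [math.exp(-zi / beta) for zi in z]
    beta = zbar - sum(zi * wi for zi, wi in zip(z, w)) / sum(w)

# With beta fixed, the alpha-equation gives alpha in closed form
alpha = -beta * math.log(sum(math.exp(-zi / beta) for zi in z) / n)

print(round(alpha, 2), round(beta, 2))  # near the true (10, 3)
```

The iteration usually converges quickly from a reasonable start; Newton-Raphson would do the same job with fewer iterations.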

J: I can see how things are building up. Are there any such pitfalls I need to pay attention to?

D: Yes, there are some properties, limitations and real-world applications we can learn. Do you want to do it now?

J: Let’s do it next week. I am afraid these equations are going to be my sweet dreams tonight.

To be continued…


Lesson 63 – Likelihood: modus operandi for estimation

D: Often used interchangeably with probability (guilty as charged), the term likelihood is reserved for inferring the parameters of the population.

J: Are we talking about using a sample to compute the true value of the parameter? You said last week that there are methods of estimation of these parameters. I was able to grasp that there are sample records, i.e., the data we have at hand, and there is a hypothetical population from where these samples originate.

D: That is correct. As you remember, there was an example of extreme wind speed data (sample). Our goal is to understand this bulk data using some probability function (GEV in this case) with few control parameters. The probability function represents the entire information contained in the sample data. It is the hypothetical population you mentioned, and the data we observe is a random sample from this population.

If you compute the mean (\bar{x}) and variance (s^{2}) of the sample data, they are good guesses (estimates or estimators) of the mean (\mu) and variance (\sigma^{2}) of the population. The formulation used to get the estimate is called an estimator.

J: Then what is likelihood? You are saying that it is a well-known method for estimation.

D: It is the joint probability of occurrence of the observations in any distribution. But before we delve into the mathematical expose, I think it will help if we take a simple example and work it out step by step using some graphics.

J: I like that idea. It helps if we know how the concept builds up, so we can go back to its roots when required.

D: So let’s begin. Imagine you ordered ten shirts from Amazon. Just imagine. Don’t even think of actually doing it.

J:

D: On arrival, you find that the fourth (red shirt), fifth (black shirt) and eighth (white shirt) are defective.

J:

D: Can you convert these ten observations into a Bernoulli sequence?

J: Sure. I will use 1 for defective shirts and 0 for shirts that are not defective. Then, we have

X = [0, 0, 0, 1, 1, 0, 0, 1, 0, 0].

D: Great. Now, if these numbers (sample shirts) were coming from a hypothetical population (Bernoulli trials), we can say that there is a certain probability of defective items. Let’s call it p.

J: So, p will be the true probability of occurrence for this Bernoulli trial. What we have is a sample of ten values of 0’s and 1’s from this population.

D: Correct. Can you estimate p using the sample data?

J: Based on the ten data points, one guess would be 0.3 since there are three defective items in ten. But how do we know if 0.3 is the best estimate of the true probability?

D: Good question. For that, we need to know about likelihood. The joint probability of observing the sample we have.

J: Can you elaborate?

D: You told me that the Bernoulli sequence of these ten observations is X = [0, 0, 0, 1, 1, 0, 0, 1, 0, 0]. Can you tell me what the joint probability of observing this sequence is?

J: Sure. If we assume p for the probability of success, P(X = 1) = p and P(X = 0) = 1-p. Then, the joint probability of seeing X = [0, 0, 0, 1, 1, 0, 0, 1, 0, 0] is

P(X=0)*P(X=0)*P(X=0)*P(X=1)*P(X=1)*P(X=0)*P(X=0)*P(X=1)*P(X=0)*P(X=0)=p^{3}(1-p)^{7}.

They are independent and identically distributed, so the joint probability is the product of the individual probabilities.

D: Great. Let’s write it concisely as L = p^{3}(1-p)^{7}.

I am using L to describe the likelihood.

We can now choose the value of p that maximizes L, i.e., the value of p that gives the largest L. The brute-force way of doing it is to substitute possible values of p, compute L and find the value of p where L is the highest.

J: Yes. Let me try that first.

L is maximum at \hat{p}=0.3.
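The brute-force search is a one-liner in code. This sketch scans a fine grid of p values and keeps the one with the largest L = p^{3}(1-p)^{7}.

```python
def likelihood(p, r=3, n=10):
    # L = p^r * (1-p)^(n-r) for r defective shirts out of n
    return p ** r * (1 - p) ** (n - r)

p_grid = [i / 1000 for i in range(1, 1000)]  # p from 0.001 to 0.999
p_hat = max(p_grid, key=likelihood)
print(p_hat)  # 0.3
```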

D: That is correct. Now let’s try solving it as a maximization problem. You have the equation L = p^{3}(1-p)^{7}. How do you find p that maximizes L?

J: We can differentiate L. At the maximum or minimum, the slope \frac{dL}{dp}=0. We can use this property to solve for p.

D: Yes. But it is easier to differentiate ln(L).

J: Why is that?

D: Because ln(L) is a monotonic function of L. The largest value of L and the largest value of ln(L) occur at the same value of p.

Finding p that maximizes ln(L) is equivalent to finding p that maximizes L. Moreover, taking the logarithm of the likelihood function converts the product terms into summation terms.

ln(L) = 3ln(p)+7ln(1-p)

Now we can find the derivative of the function and equate it to 0 to find the answer.

\frac{3}{p} - \frac{7}{1-p}=0

3 - 3p = 7p

p = \frac{3}{10}

Can you generalize this for any number of samples?

J: Yes, I can do that. Assuming there are r 1’s in a sample of n, the likelihood function will be L = p^{r}(1-p)^{n-r}

Log-likelihood function is ln(L) = rln(p)+(n-r)ln(1-p)

\frac{dln(L)}{dp} = 0

\frac{r}{p}-\frac{n-r}{1-p}=0

p=\frac{r}{n}

\frac{d^{2}ln(L)}{dp^{2}} = -[\frac{r}{p^{2}} + \frac{n-r}{(1-p)^{2}}]

A negative second derivative is sufficient to ensure maxima of the function.

D: Very nice. \hat{p}=\frac{r}{n} is called the maximum likelihood estimate of p. For a given sample X = [x_{1}, x_{2}, ..., x_{n}], \hat{p} maximizes the likelihood (joint probability of occurrence) of observing the sample.

R. A. Fisher pioneered this method in the 1920s. The opening image on likelihood is from his 1922 paper “On the Mathematical Foundations of Theoretical Statistics” in the Philosophical Transactions of the Royal Society of London where he first proposed this method.

Here is how L and ln(L) compare when finding the value of p that gives the most likelihood or log-likelihood.

J: That makes it very clear. Some form of visualization always helps better grasp the abstractions. So, this is for Bernoulli sequence, or by extension for the Binomial distribution. Shall we try this maximum likelihood estimation method for one more distribution?

D: The floor is yours. Try it for the Poisson distribution.

J: I will start with the fact that there is a sample of n observations x_{1}, x_{2}, ..., x_{n} that is assumed to have originated from a Poisson distribution with the probability function f_{X}(x) = P(X=x) = \frac{e^{-\lambda t}(\lambda t)^{x}}{x!}. Our sample is counts, i.e., the number of times an event occurs in an interval. Let’s assume a unit time interval, t = 1.

The likelihood L of observing the sample x_{1}, x_{2}, ..., x_{n} is P(X=x_{1})*P(X=x_{2}) ... P(X=x_{n}). We can represent this concisely using the product operator \prod.

L = {\displaystyle \prod_{i=1}^{n} \frac{e^{-\lambda}(\lambda)^{x_{i}}}{x_{i}!}}

L = e^{-n\lambda}{\displaystyle \prod_{i=1}^{n} \frac{(\lambda)^{x_{i}}}{x_{i}!}}

ln(L) = -n\lambda + {\displaystyle \sum_{i=1}^{n} (x_{i}ln(\lambda) - ln(x_{i}!))}

\frac{dln(L)}{d\lambda} = 0

-n + {\displaystyle \sum_{i=1}^{n}\frac{x_{i}}{\lambda}} = 0

\lambda = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}

\frac{d^{2}ln(L)}{d\lambda^{2}} < 0

The maximum likelihood estimator is \hat{\lambda} = \frac{1}{n}{\displaystyle \sum_{i=1}^{n}x_{i}}. It is the mean of the n observations in the sample. Ah, I see a connection. The expected value of the Poisson distribution E[X] is its parameter \lambda.
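The result is easy to verify by simulation. This sketch draws hypothetical Poisson counts with \lambda = 4 (using Knuth's product-of-uniforms method) and checks that the sample mean recovers the parameter.

```python
import math, random
random.seed(3)

def poisson_sample(lam):
    # Knuth's method: multiply uniforms until the product drops below e^(-lam)
    threshold, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= random.random()
        if prod <= threshold:
            return k
        k += 1

x = [poisson_sample(4.0) for _ in range(10000)]
lam_hat = sum(x) / len(x)  # the MLE is the sample mean
print(round(lam_hat, 2))   # close to 4
```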

D: Very clearly described. Do you have the energy for one more?

J: Most likely no.

To be continued…


Lesson 62 – Knowing the unknown: Inferentia

Think about your recent wine tasting retreat. While you were relaxing and refreshing, you were also tasting/testing wine samples and guessing the quality of the product. You were knowing the unknown using samples and the process of induction. In other words, you were inferring about the true quality of the product based on the taste of the wine sample. A method of generalization from observations.


Think about your recent visit to the theater after being impressed by the trailer. You must have judged how good the movie will be, based on the sample they showed you. If you are like me, you would have told yourself that this is the last time you are watching a movie based on the trailer. But you keep going back. Here again, you are inferring about the film based on the sample teaser. Knowing the unknown via inference.


Do those robocalls for election surveys or ‘get out the vote’ campaigns bother you? Be nice to them. They are taking a sample survey to estimate/guess the election results. They are trying to know the unknown, inferring the true outcome from the sample data.


Abraham Wald, in 1942, used the sample of fighter planes that returned with bullet holes to judge/infer where to put the armor plates. It is where the holes aren’t. The truth is that the planes with bullet holes in the wings returned, while the others did not.

Statistical Inference

Deriving general conclusion from what we observe (sample data) is central to increasing our knowledge about the world. Statistical inference helps us with this.

Imagine we wish to know about the incomes or IQ levels of the adult population in the US. The ideal thing to do is to test every adult for IQ and record their income. In other words, get the data for the entire population. But this is not practical. We can, however, take a sample that represents the population and record the IQ and income of this sample. We can then make a general statement about the population from the sample. The statement could be “the average IQ of adults in the US is 100.” We get this based on a large enough representative sample. The average IQ from the sample is an estimate of the true IQ of the population. Since we are basing it on the sample, the estimate will be sensitive to how much data (sample size) we have.

Imagine we wish to know about the climate of a region we want to relocate to, i.e., what is the temperature distribution and how much does it rain or how many hurricanes does it get per year. Proper measurements of rainfall and temperature are a recent phenomenon. Maybe the last 80 years with good quality. We don’t have an observational record of what happened before that. But, we can take the data we have (sample of climate observations) and estimate the temperature distribution and the hurricane counts and make a statement about the overall climate in this place so you can relocate. The generalization can be “the average temperature of this place is 60F with four hurricanes per year.” Don’t complain after you move that the area is getting more hurricanes. The sample may not have represented the climate history. Our inference is based on what we had. We can always update it.

Come 2020, the media and polling companies will predict the outcome of the presidential election. They will make voter calls or take online surveys to infer which side the election will swing. Ask for sample breakdowns and survey questions. You will know what biases are built into it.

In the most recent lessons on the generalized extreme value distribution, we assumed that the data on extreme wind speed (block maxima) would most likely follow an extreme value distribution since maximum values converge to an extreme value distribution in the limit. So the data we observed is a sample that represents the true GEV functional form

G(z) = e^{-[1+\xi(\frac{z-\mu}{\sigma})]^{-1/\xi}_{+}}

This compact GEV function is the population. A true form of all extreme wind values in this place. The data we observed for a few years is a sample that represents this function (population). We can use the sample to draw conclusions (inference) about this GEV function.

The true location, scale and shape parameters for this GEV function are \mu, \sigma and \xi. Based on the data we observed, an estimate (a good guess) of this location, scale and shape parameters is 47.44, 5.03 and 0. We are drawing inference about the true function from the sample data we have at hand.

Hence, the objective of inference is to use a ‘sample’ to compute a number which represents a good guess for the true value of the parameter. The true population parameter is unknown, and what we estimate from the sample will enable us to obtain the closest answer in some sense.

The general rule, or a formula which is used to get the estimate, is called an estimator.

For example, if you compute the mean (\bar{x}) and variance (s^{2}) of the data, they are good guesses (estimates or estimators) of the mean (\mu) and variance (\sigma^{2}) of the population.

A different sample will yield a different estimate for the parameter. So we have to think of the estimate as a range of values, an interval or a probability distribution instead of a single value. The truth may be in this interval if we have good representative samples.
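That last point, that an estimate should be thought of as a distribution, shows up in a small simulation. Here, many hypothetical samples are drawn from the same population (IQ-like, mean 100, standard deviation 15), and the sample mean is computed for each; the estimates scatter around the truth.

```python
import random
random.seed(0)

estimates = []
for _ in range(2000):
    sample = [random.gauss(100, 15) for _ in range(50)]  # one sample of size 50
    estimates.append(sum(sample) / len(sample))          # its estimate of the mean

# Each sample gives a different estimate; together they form a distribution
print(round(min(estimates), 1), round(max(estimates), 1))
```

The spread of these 2000 estimates is exactly the interval-style thinking described above: the truth sits somewhere inside the cloud of plausible estimates.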

Over the next few lessons, we will learn some methods of estimation from data, how reasonable the estimates are, or the criteria for being good, their range or intervals, and many more concepts.

In summary, statistical inference is about understanding the entire population from samples and sample statistics.

The sample has to be representative of the population for good inference.

In other words, the sample distribution must be similar to the population distribution. Extreme winds data would be a good representation of an extreme value function as population, not normal distribution.

Joe and his girlfriend broke up. They inferred their future based on the sample dates. Is it a good inference?


Lesson 61 – Resto


It is a journey, long haul. Let’s take an exit, stretch our legs, get a coffee and trace our tracks.

Seeing Data and Making Friends with R

Lesson 1: When you see something, say data

Lesson 2: R is your companion

Sets and Properties

Lesson 3: Sets

Lesson 4: Visualizing Sets

Probability, Axioms, and Rules

Lesson 6: Defining Probability

Lesson 7: Axioms of Probability

Lesson 9: Conditional Probability

Lesson 10: Independent Events

Lesson 12: Total Probability

Lesson 13: Bayes Rule

More R: Shortcuts, for, if-else, and functions
Lesson 5: Let us 'R'eview

Lesson 8: The search 'for' w'if'i

Lesson 11: Fran, the functionin' R-bot
Exploratory Data Analysis

Lesson 14: Order Statistics

Lesson 15: Frequency Plots

Lesson 16: Average

Lesson 17: Variance

Lesson 18: Skewness

Lesson 19: Outliers

Lesson 20: Coefficient of Variation

Random Variables, Expectation and others

Lesson 22: Random Variable

Lesson 23: Probability Distribution

Lesson 24: Expectation Operation

Lesson 25: More Expectation

Lesson 26: Variance Operation

Lesson 27: More Variance

Lesson 28: Standardization

More R: data exploration and large number games
Lesson 21: Beginners guide to summarize data in R

Lesson 29: Large number games in R
Discrete Probability Distributions

Lesson 31: Bernoulli Trials

Lesson 32: Binomial Distribution

Lesson 33: Geometric Distribution

Lesson 34: Return Period

Lesson 35: Negative Binomial Distribution

Lesson 36: Poisson Distribution

Lesson 37: Still Poisson Distribution

Lesson 38: Hypergeometric Distribution

Continuous Probability Distributions

Lesson 41: Continuous Functions

Lesson 42: Beta Distribution

Lesson 43: Exponential Distribution

Lesson 44: Memoryless Property

Lesson 45: Gamma Distribution

Lesson 46: Convolution

Lesson 47: Central Limit Theorem Visuals

Lesson 48: Central Limit Theorem Derivation

Lesson 49: Normal Distribution

Lesson 50: The Standard Normal Distribution

Lesson 51: Guest Blog: Prof. Upmanu Lall from Columbia University

Lesson 52: Lognormal Distribution

Lesson 53: Chi-square Distribution

Lesson 54: The Connection of Probability Distributions to Bernoulli

More R: Discrete and Continuous Distributions
Lesson 39: Discrete Distributions in R Part I

Lesson 40: Discrete Distributions in R Part II

Lesson 55: Continuous Distributions in R Part I

Lesson 56: Continuous Distributions in R Part II
Extreme Value Distributions

Lesson 57: Visual Understanding of Extreme Values

Lesson 58: Basics of Extreme Value Distributions

Lesson 59: Generalized Extreme Value Distribution

More R: Extreme Value Distributions
Lesson 60: Extreme Value Distributions in R
Inference

This is a classroom – Use it.

This is a resource – Take it.

This is a free source – Bookmark it.

This is knowledge – Share it.


Lesson 59 – The Generalized extreme value distribution

Recap

Two weeks ago, we met Maximus Extremus Distributus. He was taking different forms depending on the origin (parent) distribution.

Last week, Mumble, Joe, and Devine met to discuss the central ideas behind extreme value distribution. They derived the exact function for it.

F_{Y}(y)=P(Y \le y) = P(X_{1} \le y)P( X_{2} \le y) ...P(X_{n} \le y) = [F_{X}(y)]^{n}.

From this cumulative function, we can derive the probability function.

f_{Y}(y) = n[F_{X}(y)]^{n-1}f_{X}(y)

Mumble pointed out the issue of degeneration of the exact function as n \to \infty and introduced the idea of normalizing constants to stabilize the function.

If there are two normalizing constants a_{n}>0 and b_{n}, we can create a normalized version of Y as Y^{*} = \frac{Y-b_{n}}{a_{n}}.

Before we closed off for the week, we knew that Y^{*} converges to one of three types: Type I, Type II and Type III extreme value distributions.

If there exist normalizing constants a_{n}>0 and b_{n}, then,

P(\frac{Y - b_{n}}{a_{n}} \le z) \to G(z) as n \to \infty.

G(z) is the non-degenerate cumulative distribution function.

Type I (Gumbel Distribution): G(z) = e^{-e^{-\frac{z-\alpha}{\beta}}}. Double exponential distribution.

Type II (Frechet Distribution): G(z) = e^{-(\frac{z-\alpha}{\beta})^{-\gamma}} for z > \alpha and 0 for z \le \alpha. This is a single exponential function.

Type III (Weibull Distribution): G(z) = e^{-[-\frac{z-\alpha}{\beta}]^{\gamma}} for z < \alpha and 1 for z \ge \alpha. This is also a single exponential distribution.

The convergence

Let’s get some intuition on why the parent distributions converge to these three types. The mathematical foundation is much more in-depth. We will skip that for some later lessons and get a simple intuition here.

Exponential origin: Let’s take Joe’s wait time example from last week. We assume that the arrival times between successive vehicles have an exponential distribution. Let’s call this random variable X.

X \sim Exp(\lambda)

There are n arrival times between successive vehicles that can be shown as a set of random variables.

X_{1}, X_{2}, X_{3}, ..., X_{n}

The maximum wait time is the max of these numbers. Let’s call this maximum time, Y.

Y = max(X_{1}, X_{2}, X_{3}, ..., X_{n})

The cumulative function for an exponential distribution is F_{X}(x) = 1 - e^{-\lambda x}. Hence, for Y (maximum wait time), it will be

F_{Y}(y) = [1 - e^{-\lambda y}]^{n}

For simplicity, let’s assume a value of 1 for \lambda and take the binomial series expansion for F_{Y}(y) = [1 - e^{-\lambda y}]^{n}.

F_{Y}(y) = 1 - ne^{-y} + \frac{n(n-1)}{2!}e^{-2y} - ...

As n \to \infty, this series converges to e^{-ne^{-y}}, an asymptotic double exponential function.
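This convergence can be checked numerically. Here is a minimal sketch (in Python, though the code lessons on this blog use R) comparing the exact CDF [1 - e^{-y}]^{n} with the asymptotic approximation e^{-ne^{-y}} for \lambda = 1:

```python
import math

def exact_cdf(y, n):
    """CDF of the max of n iid Exp(1) wait times: [1 - e^{-y}]^n."""
    return (1.0 - math.exp(-y)) ** n

def approx_cdf(y, n):
    """Asymptotic double exponential approximation: e^{-n e^{-y}}."""
    return math.exp(-n * math.exp(-y))

# The two agree more and more closely as n grows
for n in [10, 100, 1000]:
    y = math.log(n) + 1.0   # a point in the bulk of the maximum's distribution
    print(n, exact_cdf(y, n), approx_cdf(y, n))
```

By n = 1000 the two curves are indistinguishable to several decimal places.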

Norming Constants: Now let’s derive the double exponential function with the idea of norming constants. We know F_{Y}(y) = [1 - e^{-y}]^{n} for \lambda = 1.

Let’s introduce a variable z = y + ln(n) and evaluate the function F_{Y}(y) at z. We are adding a constant ln(n) to y.

F_{Y}(y+ln(n)) = [1 - e^{-(y+ln(n))}]^{n}

F_{Y}(y+ln(n)) = [1 - \frac{1}{e^{ln(n)}}e^{-y}]^{n}

F_{Y}(y+ln(n)) = [1 - \frac{1}{n}e^{-y}]^{n}

As n \to \infty, F_{Y}(y+ln(n)) converges to e^{-e^{-y}}, a double exponential function. If you observe the equation carefully, it is of the form [1+\frac{1}{n}x]^{n}, which in the limit is e^{x}. In our case, x = -e^{-y}. Hence,

F_{Y}(y+ln(n)) =e^{-e^{-y}}

If we replace y = z - ln(n), we get,

F_{Y}(z) =e^{-e^{-(z-ln(n))}}

So, with appropriate scaling (stabilization/norming), we see a double exponential function when the origin is an exponential function.
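A quick numeric sketch of the norming idea (in Python, assuming \lambda = 1): shifting the maximum by b_{n} = ln(n), with a_{n} = 1, keeps the CDF from degenerating and pins it to the double exponential.

```python
import math

def cdf_max_exp(y, n):
    """CDF of the max of n iid Exp(1) variables: [1 - e^{-y}]^n."""
    return (1.0 - math.exp(-y)) ** n

def gumbel_cdf(y):
    """Standard Gumbel (double exponential) CDF: e^{-e^{-y}}."""
    return math.exp(-math.exp(-y))

# With a_n = 1 and b_n = ln(n), F_Y(y + ln n) no longer degenerates;
# it settles on the double exponential as n grows.
for n in [10, 1000, 100000]:
    print(n, cdf_max_exp(0.5 + math.log(n), n), gumbel_cdf(0.5))
```

Without the shift, cdf_max_exp(0.5, n) collapses to 0 as n grows; with it, the value stabilizes.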

Power law origin: Let’s try one more parent function. This time, it is a power law function with a cumulative distribution function F_{X}(x)=1 - x^{-a}. We did not learn this distribution, but it is like a decay function with a controlling the degree of decay. The distribution of the maximum value Y of this function is

F_{Y}(y) = [1 - y^{-a}]^{n}

We will assume a new variable z = n^{1/a}y and evaluate the function at z.

F_{Y}(n^{1/a}y) = [1 - (n^{1/a}y)^{-a}]^{n}

=[1 - n^{-1}y^{-a}]^{n}

=[1 + \frac{1}{n}(-y^{-a})]^{n}

In the limit, as n \to \infty, F_{Y}(n^{1/a}y) = e^{-y^{-a}}

Hence, F_{Y}(z)=e^{-(\frac{z}{n^{1/a}})^{-a}}

So, parent functions with power law tails converge to the single exponential Type II Frechet distribution. Similar norming constants can be found for other distributions that converge to the Type III Weibull distribution.
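The same kind of numeric check works for the power law parent. The norming constant here is n^{1/a}; the exponent 1/a is what makes (n^{1/a}y)^{-a} = n^{-1}y^{-a}, setting up the e^{x} limit. A Python sketch, with a = 2 as an arbitrary choice:

```python
import math

a = 2.0   # decay exponent of the power law parent F_X(x) = 1 - x^{-a} (arbitrary choice)

def cdf_max_power(y, n):
    """CDF of the max of n iid power law variables: [1 - y^{-a}]^n."""
    return (1.0 - y ** (-a)) ** n

def frechet_cdf(y):
    """Frechet CDF with shape a: e^{-y^{-a}}."""
    return math.exp(-y ** (-a))

# Scaling by n^(1/a) makes (n^(1/a) * y)^(-a) = y^(-a) / n, so the
# stabilized CDF converges to the single exponential e^{-y^{-a}}
for n in [10, 1000, 100000]:
    print(n, cdf_max_power(n ** (1.0 / a) * 1.5, n), frechet_cdf(1.5))
```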

To summarize, the three types are e^{-e^{-\frac{z-\alpha}{\beta}}}, e^{-(\frac{z-\alpha}{\beta})^{-\gamma}}, e^{-[-\frac{z-\alpha}{\beta}]^{\gamma}}.

They are of the form e^{-e^{-y}}, e^{-y^{-\gamma}}, e^{-(-y)^{\gamma}}.

Here’s a visual of how these three distributions look.

If the right tail is of exponential type, the extreme value distribution is a Gumbel distribution. Here the parent distribution (or the distribution of X) is unbounded on the right tail. Extremes of most common exponential type distributions such as normal, lognormal, exponential and gamma distributions converge to the double exponential Gumbel distribution. It is most commonly used to model maximum streamflow, maximum rainfall, earthquake occurrence and in some cases, maximum wind speed.

The Frechet distribution, like the Gumbel distribution, is unbounded on the right tail, but its tail is much fatter. Extremes from the Pareto (power law) and Cauchy distributions converge to the Frechet distribution. Rainfall and streamflow extremes, air pollution and economic impacts can be modeled using this type. Notice how the red line (Frechet distribution) has a heavy tail and sits above the black line (Gumbel distribution).

If the right tail converges to a finite endpoint, it is a Weibull distribution. The parent distribution is also bounded on the right. Remember that the extremes of the uniform distribution, which is bounded, converged to a Weibull. It is most widely used for minima, such as the strength of materials and fatigue analysis. It is also used in modeling temperature extremes and sea level.

The Generalized Extreme Value Distribution (GEV)

The three types of extreme value distributions can be combined into a single function called the generalized extreme value distribution (GEV). Richard von Mises and Jenkinson independently showed this.

G(z) = e^{-[1+\xi(\frac{z-\mu}{\sigma})]^{-1/\xi}_{+}}

\mu is the location parameter. \sigma > 0 is the scale parameter. \xi is the shape parameter.

When \xi \to 0, GEV tends to a Gumbel distribution. It is the same form [1+\frac{1}{n}x]^{n}, which in the limit is e^{x}. For GEV, the exponential term goes to e^{-(\frac{z-\mu}{\sigma})} hence yielding the double exponential G(z) = e^{-e^{-(\frac{z-\mu}{\sigma})}}.

When \xi > 0, GEV tends to the Frechet distribution. Replace \xi = 1 and see for yourself what you get.

When \xi < 0, GEV tends to the Weibull distribution. Replace \xi = -1 and check.

\mu and \sigma are the surrogate norming variables and \xi controls the shape.

GEV folds all three types into one form. The parameters \mu, \sigma and \xi can be estimated from the data, removing the need to know which Type the parent distribution or data converges to. The function has a closed form solution to compute the quantiles and probabilities. GEV also has the max stable property, about which we will learn in some later lessons.
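To make the closed form concrete: inverting G(z) = p gives the quantile z_{p} = \mu + \frac{\sigma}{\xi}[(-ln\,p)^{-\xi} - 1], with the Gumbel limit \mu - \sigma\,ln(-ln\,p) as \xi \to 0. Here is a hedged Python sketch (the parameter values are arbitrary illustrations, not fitted to any data):

```python
import math

def gev_quantile(p, mu, sigma, xi):
    """Invert G(z) = p in closed form."""
    if abs(xi) < 1e-12:
        # Gumbel limit as xi -> 0
        return mu - sigma * math.log(-math.log(p))
    return mu + (sigma / xi) * ((-math.log(p)) ** (-xi) - 1.0)

def gev_cdf(z, mu, sigma, xi):
    """G(z) = exp(-[1 + xi*(z - mu)/sigma]^(-1/xi)), Gumbel limit at xi = 0."""
    if abs(xi) < 1e-12:
        return math.exp(-math.exp(-(z - mu) / sigma))
    t = 1.0 + xi * (z - mu) / sigma
    if t <= 0:
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))

# Round trip: the p = 0.99 quantile maps back to probability 0.99
z99 = gev_quantile(0.99, mu=10.0, sigma=2.0, xi=0.2)
print(z99, gev_cdf(z99, 10.0, 2.0, 0.2))
```

With \xi = 0.2 (Frechet type) the 1-in-100 level sits farther out than the corresponding Gumbel (\xi = 0) level, which is the practical consequence of the heavy tail.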

All these concepts will become more concrete once we play with some data. Aren’t you itching to do some GEV in R? You know what is coming next week.


Lesson 58 – Max (Min): The language of extreme value distribution

Mumble, Joe, and Devine meet to discuss extreme values and extreme value distribution. Extremus Distributus.

 

 

J: Extreme value distribution? Are we talking about the pictures we saw last week? The maximum of a sample from any distribution follows its own specific probability distribution. How does my driving experience this morning fit the discussion?

M: Can you describe a bit more about your experience?

J: Yes. I was waiting at the merging lane near the GW Bridge. The time between successive vehicles was pretty short; I was left there stranded.

M: We can assume that the arrival times between successive vehicles have an exponential distribution. As you have shown in your graphic, let’s call this random variable X.

X \sim Exp(\lambda)

Let’s say there are n+1 vehicles so that there are n arrival times between successive vehicles that can be shown as a set of random variables.

X_{1}, X_{2}, X_{3}, ..., X_{n}

This means your wait time to merge between successive vehicles could have been any of these n random variables. The maximum wait time is the max of these numbers. Let’s call this maximum time, Y.

Y = max(X_{1}, X_{2}, X_{3}, ..., X_{n})

Y follows an extreme value distribution and it is related to the distribution of X_{1}, X_{2}, X_{3}, ..., X_{n}.

D: Since in your illustration, you have an exponential wait time as the parent distribution, Y, the maximum of X’s will have a distribution that is related to the exponential distribution. We can derive the exact function of Y based on what we know about X.

M: In fact, we use exponential distribution as the parent distribution to model maximum wait time between earthquakes, floods, droughts, etc. to price our insurance products.

J: How can we derive the probability function of Y from X? Can we go through that?

D: Sure. Mumble showed before that Y = max(X_{1}, X_{2}, X_{3}, ..., X_{n}).

The cumulative distribution function F_{Y}(y)=P(Y \le y). Because Y is the maximum of X_{1}, X_{2}, X_{3}, ..., X_{n}, the event Y \le y occurs only when every one of these variables is at most y. In other words,

F_{Y}(y)=P(Y \le y) = P(X_{1} \le y \cap X_{2} \le y \cap ... X_{n} \le y)

We assume that X_{1}, X_{2}, X_{3}, ..., X_{n} are statistically independent; hence, the cumulative distribution of Y is the product of the cumulative distribution of individual parent distributions.

F_{Y}(y)=P(Y \le y) = P(X_{1} \le y)P( X_{2} \le y) ...P(X_{n} \le y) = [F_{X}(y)]^{n}

From this cumulative function, we can derive the probability function.

f_{Y}(y) = \frac{d}{dy}F_{Y}(y) = \frac{d}{dy}[F_{X}(y)]^{n}

f_{Y}(y) = n[F_{X}(y)]^{n-1}f_{X}(y)

M: If we know the distribution of X (parent), we can substitute that in our equations here and get the distribution of Y. Do you want to try it for the exponential wait times?

J: Sure. The cumulative function for an exponential distribution is F_{X}(x) = 1 - e^{-\lambda x}. The probability function is f_{X}(x)=\lambda e^{-\lambda x}. For Y it will be

F_{Y}(y) = [1 - e^{-\lambda y}]^{n}

f_{Y}(y) = n[1 - e^{-\lambda y}]^{n-1}\lambda e^{-\lambda y}
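As an aside, Joe’s derived CDF can be checked by simulation. A small Python sketch (the code lessons on this blog use R; the rate, block size, and sample count below are arbitrary choices):

```python
import math
import random

random.seed(42)
lam, n = 1.0, 10   # exponential rate and the number of wait times per block

# Simulate many block maxima, each the max of n iid Exp(lam) draws
maxima = [max(random.expovariate(lam) for _ in range(n)) for _ in range(100000)]

# Empirical P(Y <= y) vs the derived CDF [1 - e^{-lam*y}]^n at y = 3
y = 3.0
empirical = sum(m <= y for m in maxima) / len(maxima)
theoretical = (1.0 - math.exp(-lam * y)) ** n
print(empirical, theoretical)
```

The empirical fraction lands within Monte Carlo error of the closed form.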

D: Excellent. Did you also notice that the cumulative and probability functions depend on n, the number of independent random variables X_{1}, X_{2}, X_{3}, ..., X_{n}?

Here is how they vary for exponential parent distribution. They shift to the right with increasing values of n.

J: We can derive the functions for the minimum in a similar way I suppose.

M: That is correct. If Y_{min} is the minimum of the random variables, then Y_{min} exceeds y_{min} only when each of the independent X_{1}, X_{2}, X_{3}, ..., X_{n} is greater than y_{min}.

P(X_{1}>y_{min}, X_{2}>y_{min}, ..., X_{n}>y_{min}) = 1 - F_{Y_{min}}(y_{min})

Using this equation, we can derive F_{Y_{min}}(y_{min}) = 1 - [1-F_{X}(y_{min})]^{n} and f_{Y_{min}}(y_{min})=n[1-F_{X}(y_{min})]^{n-1}f_{X}(y_{min}).
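For an exponential parent, the minimum formula has a neat closed form: plugging F_{X}(y) = 1 - e^{-\lambda y} in gives F_{Y_{min}}(y_{min}) = 1 - e^{-n\lambda y_{min}}, i.e., the minimum of n exponentials is itself exponential with rate n\lambda. A quick Python check (parameter values arbitrary):

```python
import math

# Minimum of n iid Exp(lam): F_Ymin(y) = 1 - [1 - F_X(y)]^n
def min_cdf(y, lam, n):
    survival_x = 1.0 - (1.0 - math.exp(-lam * y))   # 1 - F_X(y) = e^{-lam*y}
    return 1.0 - survival_x ** n

def exp_cdf(y, rate):
    return 1.0 - math.exp(-rate * y)

# The minimum of n Exp(lam) variables is Exp(n*lam)
lam, n, y = 2.0, 5, 0.3
print(min_cdf(y, lam, n), exp_cdf(y, lam * n))
```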

J: This is cool. All we need to know is the function for X, and we know everything about the extremes.

M: That is true to an extent. But what will we do if we do not know the form of the parent distribution with certainty? Remember Lesson 51? Did you check what happens to the functions with increasing values of n, i.e., n tends to \infty?

J: Let me plot the cumulative function for increasing values of n and see.

D: Joe, as you can see, F_{Y}(y) = [F_{X}(y)]^{n} tends to 0 as n tends to \infty. The function degenerates to a point. It is unstable.

J: Hmm. How do we stabilize it then?

M: Good question. We can work with a normalized version of Y instead of Y.

If there are two normalizing constants a_{n}>0 and b_{n}, we can create a normalized version of Y.

Y^{*} = \frac{Y-b_{n}}{a_{n}}

Proper values of these two normalizing constants a_{n}>0 and b_{n} will stabilize Y for increasing values of n. We need to find the distribution that Y^{*} can take in the limit as n tends to \infty.

D: Look at this visual to understand what stabilizing (norming) for increasing values of n means. It is for an exponential parent, our example. I am assuming a_{n} = 1 and b_{n} varies with n.

The normalized function does not degenerate for large values of n.

M: It turns out that the limit distribution of Y^{*}, the normalized version of Y, can be one of the three types, Type I, Type II and Type III.

If there exist normalizing constants a_{n}>0 and b_{n}, then,

P(\frac{Y - b_{n}}{a_{n}} \le z) \to G(z) as n \to \infty.

G(z) is the non-degenerate cumulative distribution function.

Type I (Gumbel Distribution): G(z) = e^{-e^{-\frac{z-\alpha}{\beta}}}. Double exponential distribution.

Type II (Frechet Distribution): G(z) = e^{-(\frac{z-\alpha}{\beta})^{-\gamma}} for z > \alpha and 0 for z \le \alpha. This is a single exponential function.

Type III (Weibull Distribution): G(z) = e^{-[-\frac{z-\alpha}{\beta}]^{\gamma}} for z < \alpha and 1 for z \ge \alpha. This is also a single exponential distribution.

The normalized version of Y converges to these three types, also called the family of extreme value distribution. It is said that the distribution F_{Y}(y) is in the domain of attraction of G(z). \alpha, \beta, \gamma are the location (central tendency), scale and the shape controlling parameters.

J: This convergence theorem sounds like the central limit theorem.

D: Yes, just as the sum of independent random variables converges to the normal distribution in the limit when n \to \infty, the max of independent random variables converges to one of these three types, Gumbel, Frechet or Weibull, depending on the parent distribution. Generally, parents with exponential-type tails, for example the exponential and normal distributions, tend to Gumbel, and polynomial (power law) tails tend to the Frechet distribution.

You have already seen what the Gumbel looks like when exponential wait times converge. Here are example Frechet and Weibull distributions.

Fisher and Tippett derived these three limiting distributions in the late 1920s. Emil Julius Gumbel, in chapter 5 of his book “Statistics of Extremes”, explains the fundamentals of deriving the distributions.

J: This is plenty. Are there any properties of the three types? Can we guess which distribution is used where? The convergence of extremes to these three types seems very powerful. We can model outliers without much trouble. Some examples in R might help understand these better.

You already have a handful to digest. Let’s continue next week after you have processed it.

Be cautious. Coming up with the values for \alpha, \beta, \gamma in the real world involves uncertainty. The extreme value distribution need not be an elixir for outliers.

 

