Lesson 52 – Transformation: The language of lognormal distribution

A cluttered desk with scattered papers, piles of binders, and an open reference textbook with a pencil on top welcomes Mumble to his day at work.

“Clean up the workspace,” said Mumble in angst, staring at the checklist for the day. It was that time of February when his last New Year's resolution of keeping an orderly workspace had already been put on the back burner.

He has two essential data analysis tasks and two meetings for the day. He has also set himself two chores. If you think Mumble actually gets to his miscellaneous tasks at work, you are mistaken. He is like you and me when it comes to writing checklists.

He begins his first task: Analyze flood damage data.

The 2017 flood insurance claims are starting to come in. Mumble, who was the chief designer of the product, was assigned the task of corroborating the flood damage claims with the reported flood damage data. He was also asked to estimate the probability of damage exceeding a certain threshold the company set for the year.

“Open the Dartmouth Flood Observatory database. They will have the archive for recent floods, the area affected, and the people displaced. I can use that data to verify whether the claims from these areas are valid,” thinks Mumble as he types his password to log in.

“Time for some coding and analysis. The best part of my day. Fire up RStudio. Check. Read the file into the R workspace. Check.” He completes the essentials and is now ready to analyze the data. You can also look at the data if you want.
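If you want to follow along, here is a minimal sketch of this step. The file name FloodArchive.csv and the csv format are assumptions; the Dartmouth Flood Observatory archive can be downloaded as a spreadsheet and saved as a csv file.

# read the flood archive into the R workspace (file name is an assumption)
flood_data = read.csv("FloodArchive.csv", header = TRUE)

# inspect the available columns
str(flood_data)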

“Let me first filter the database to select and plot only the floods for 2017.” RStudio lets him make the map too.
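A sketch of the filtering and mapping step, assuming the archive has a Year column and longitude/latitude columns named long and lat; if there is no Year column, the year can be extracted from the flood start date. The maps package provides the base world map.

library(maps)

# keep only the 2017 flood events (column names are assumptions)
flood_2017 = flood_data[flood_data$Year == 2017, ]

# plot the event locations on a world map
map("world")
points(flood_2017$long, flood_2017$lat, pch = 19, col = "red", cex = 0.7)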

“Aha. The locations match the countries from where we got the claims. So let me look at the data for people displaced and develop the probability distribution function for it.” He quickly types up a few lines of code as he sips his coffee.

“There are 123 events,” he thinks as he scrolls down his monitor. “Some of these floods must be minor. There are zeros in 40 such events, meaning no people were displaced. There are some huge numbers, though, which possibly created a lot of damage. Let me plot them to see their arrangement on the number line.”
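One way to reproduce these checks and the number-line plot, assuming the people-displaced counts are in a column named Displaced:

# people displaced in the 2017 events (column name is an assumption)
displaced = flood_2017$Displaced

length(displaced)        # number of 2017 flood events
sum(displaced == 0)      # events with no people displaced

# arrangement of the observations on the number line
stripchart(displaced, pch = 1, xlab = "People Displaced")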

“They do not look symmetric. The few large numbers are skewing the data. I wonder how the frequency plot will look.”
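A simple frequency plot of the raw counts, made with hist, shows the long right tail he is describing.

# frequency plot of the raw data
hist(displaced, main = "People Displaced in 2017 Floods", xlab = "People Displaced")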

“Ugly. Not sure what probability distribution fits this data. Those large numbers are dominating, but they also account for most of the insurance claims.” He took another sip, leaned back, and, with his hands behind his head, stared at the ceiling.

After about 30 seconds, it struck him that the scale of the observations is obscuring the pattern in the data.

“What if I transform the data? If I re-express the data on a different scale, I can perhaps alter the distances between the points. Log scale?”

So he applied the log transformation y = log(x) and plotted the frequency of the transformed values.
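A sketch of this transformation step; dropping the zero-displacement events before taking logs is an assumption, since log(0) is undefined.

# log-transform the positive values (log of zero is undefined)
y = log(displaced[displaced > 0])

# frequency plot of the log-transformed data
hist(y, main = "Log-Transformed Data", xlab = "log(People Displaced)")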

“Amazing. A simple mathematical transformation can make a skewed distribution more symmetric. Applying the logarithm has shrunk the large numbers on the right side and moved them closer to the center. The frequency plot looks like a normal distribution,” he thought as he typed another line of code to update the plot.

He realized that the log of the data follows a normal distribution. “Lognormal distribution,” he said.

If Y = log(X) follows a normal distribution with mean \mu_{y} and standard deviation \sigma_{y}, then X follows a lognormal distribution with the probability density function

f(x) = \frac{1}{x} \frac{1}{\sqrt{2\pi\sigma_{y}^{2}}} e^{\frac{-1}{2}(\frac{ln(x)-\mu_{y}}{\sigma_{y}})^{2}}

Remember, \mu_{y} and \sigma_{y} are the mean and standard deviation of the transformed variable Y.
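To see that this density is the same one R works with, here is a quick check against the built-in dlnorm function; the values \mu_{y} = 10 and \sigma_{y} = 2 are only illustrative, not estimates from the flood data.

mu_y = 10
sigma_y = 2

# evaluate the density written above over a range of x
x = seq(1000, 500000, by = 1000)
f_manual = (1 / x) * (1 / sqrt(2 * pi * sigma_y^2)) *
  exp(-0.5 * ((log(x) - mu_y) / sigma_y)^2)

# compare with R's built-in lognormal density; the two curves coincide
plot(x, f_manual, type = "l", ylab = "f(x)")
lines(x, dlnorm(x, meanlog = mu_y, sdlog = sigma_y), col = "red", lty = 2)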

The parameters of X (the mean \mu_{x} and the standard deviation \sigma_{x}) are related to \mu_{y} and \sigma_{y} as follows.

\mu_{x}=e^{\mu_{y}+\frac{\sigma_{y}^{2}}{2}}

\sigma_{x}^{2}=\mu_{x}^{2}(e^{\sigma_{y}^{2}}-1)

From \sigma_{x}^{2}=\mu_{x}^{2}(e^{\sigma_{y}^{2}}-1), dividing both sides by \mu_{x}^{2} gives CV_{x}^{2}=e^{\sigma_{y}^{2}}-1, so we can also deduce that \sigma_{y}^{2}=ln(1+CV_{x}^{2}). CV_{x} is the coefficient of variation of the lognormal variable X. You may recall from Lesson 20 that the coefficient of variation is the ratio of the standard deviation to the mean, CV_{x}=\frac{\sigma_{x}}{\mu_{x}}.
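A small numeric check of these relations; \mu_{y} and \sigma_{y} below are illustrative values, not estimates from the flood data.

mu_y = 10
sigma_y = 0.5

# moments of X from the parameters of Y
mu_x = exp(mu_y + sigma_y^2 / 2)
sigma_x_sq = mu_x^2 * (exp(sigma_y^2) - 1)

# the coefficient of variation identity recovers sigma_y^2 = 0.25
cv_x = sqrt(sigma_x_sq) / mu_x
log(1 + cv_x^2)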

You must have noticed already that the probabilities can be computed using the transformed variable (Y) and the usual method of the standard normal distribution, i.e., by converting y to z using z = \frac{y-\mu_{y}}{\sigma_{y}}=\frac{ln(x)-\mu_{y}}{\sigma_{y}}.

P(X \le x)=P(e^{Y} \le x)=P(Y \le ln(x))=P(Z \le \frac{ln(x)-\mu_{y}}{\sigma_{y}})

“So the lognormal distribution can be used to model skewed and non-negative data.” He sits up in his chair, excited, types a few lines of code with enthusiasm, and writes down the following in his notepad for the meeting later.

 P(Displaced > 300,000) = 0.0256

Can you tell how?
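Here is a sketch of how such a number could be computed, assuming the positive displaced counts from earlier are used to estimate \mu_{y} and \sigma_{y}; the exact value depends on the data.

# estimate the parameters of Y = log(X) from the positive counts
y = log(displaced[displaced > 0])
mu_y = mean(y)
sigma_y = sd(y)

# P(X > 300000) = P(Z > (ln(300000) - mu_y) / sigma_y)
1 - pnorm((log(300000) - mu_y) / sigma_y)

# equivalently, using the built-in lognormal distribution function
1 - plnorm(300000, meanlog = mu_y, sdlog = sigma_y)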

He then makes similar plots for the flood-affected area.

“Hmm, the original data is skewed, with a large affected area to the right. But the log-transformed data is not symmetric. It looks like there are numbers to the left that are far away from the rest of the log-transformed data.” He keeps staring at his screen as Aaron knocks on the door. “Meet now?”

“Sure.” Mumble walks out with his notepad and a few other things. The flood-affected area plot is still pestering him, but he needs to take care of other issues for the day.

“Maybe the logarithm is just one type of transformation. Are there other transformations?” he keeps thinking during his meetings.

As his 3 pm meeting comes to an end, he still has to do quality checks on the claims data. As usual, his checklist stays incomplete. Filing taxes just got moved to the next day. He was, however, resolute about catching up on Transformers before Bumblebee hits the theaters this year. So Transformers is on for the evening. Fitting, for a day spent transforming data.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

