Thursday, August 1, 2013

Tutorial: How Fat-Tailed Probability Distributions Defy Common Sense and How to Handle Them

This post is related to the Grey Swans post, but is a good topic to present on its own.

For random time series, we often ask general questions to learn something about the probability distribution we are dealing with:
  1. What's average?  What's typical?
  2. How much does it vary?  How wide is the "spread"?  Is it "skewed" to one side?
  3. How extreme can the outcomes be?
  4. How good are our estimates, given the sample size?  Do we have enough samples?
If we have a good-sized sample of data, common sense tells us that "average" is somewhere in the middle of the sample values and that the "spread" and "extreme" of the sample are about the same as those of the underlying distribution.  Finally, common sense tells us that after we have good estimates, we don't need to gather any more sample data because it won't change our estimates much.

It turns out that these common-sense answers could all be flat wrong, depending on how "fat" the tail of the distribution is.  Now that's surprising!

The Tail(s) of Three Distributions

To demonstrate this, I'm going to walk you through a simulated example with three time series of 200 samples each:

Here are the three generating distributions:
  1. Truncated Normal distribution with mean = 0, standard deviation = 54, truncated below 10
  2. Truncated Log-normal distribution with mean = 3.49, standard deviation = 0.9, truncated below 10
  3. Truncated Pareto distribution, Type 1, with a = 1, b = 1, truncated below 10
("Truncated" just means that the minimum value of the time series is 10.  The probability distribution curve is "cut off" at 10 and then the whole distribution is raised in probability so that the total probability sums to 1. This is a common approach when one wants to focus attention on the tail of the distributions.)

I chose these three distributions on purpose.  The Normal distribution (Gaussian) is the most famous "thin-tailed distribution" and is assumed in many common (frequentist) statistical methods.  The Log-normal distribution qualifies as a "fat-tailed distribution" that appears in many empirical settings, plus it has a "family relationship" with the Normal distribution.  Finally, the Pareto distribution is probably the most famous of the "really-fat-tailed distributions".  It holds the most surprises compared to our common sense.

I chose the parameter values of these three distributions so that they'd be fairly comparable with small sample sizes.  More on that, below.
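
For readers who want to generate series like these themselves, here is a minimal Python sketch of one way to draw from the three truncated distributions. It is not the code behind the figures in this post. The function names and the seed are made up for illustration, and I am assuming that the Log-normal mean and standard deviation describe log(X) and that the Pareto "a" is the shape parameter. The Normal and Log-normal use rejection sampling (redraw until the value is at least 10), and the Pareto uses the inverse-CDF trick.

```python
import random
import statistics

LOWER = 10.0  # truncation point used throughout the post


def draw_truncated_normal(rng, mu=0.0, sigma=54.0, lower=LOWER):
    """Rejection sampling: redraw until the value lands at or above `lower`."""
    while True:
        x = rng.gauss(mu, sigma)
        if x >= lower:
            return x


def draw_truncated_lognormal(rng, mu=3.49, sigma=0.9, lower=LOWER):
    """Same idea; mu and sigma are assumed to describe log(X)."""
    while True:
        x = rng.lognormvariate(mu, sigma)
        if x >= lower:
            return x


def draw_truncated_pareto(rng, shape=1.0, lower=LOWER):
    """Inverse-CDF draw.

    A Type 1 Pareto conditioned on X >= lower is again a Pareto whose
    minimum is `lower`, so the original scale drops out (assuming the
    original scale is no larger than `lower`).
    """
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids dividing by zero
    return lower / u ** (1.0 / shape)


def make_series(n=200, seed=2013):
    """Return one series of length n per distribution (seed is arbitrary)."""
    rng = random.Random(seed)
    return {
        "normal": [draw_truncated_normal(rng) for _ in range(n)],
        "lognormal": [draw_truncated_lognormal(rng) for _ in range(n)],
        "pareto": [draw_truncated_pareto(rng) for _ in range(n)],
    }


if __name__ == "__main__":
    for name, values in make_series().items():
        print(f"{name:>9}: mean={statistics.mean(values):8.1f}  "
              f"stdev={statistics.stdev(values):8.1f}  max={max(values):9.1f}")
```

With only 200 draws per series, the printed summaries change noticeably when you change the seed, especially for the Pareto series.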

Visualizing Differences in Probability Distributions, Especially the Tails

To show the differences between these distributions, look at the following figures.  I generated them based on closed-form formulas and parameters.  The top chart uses a linear scale and the bottom chart uses a log-log scale.

The value of the log-log plot becomes apparent when you look at the tails of the distributions.  In the upper chart with the linear scale, they all look alike.  Common sense might tell us that above x = 200, we can pretty much ignore the rest of the distribution in all three cases.

But looking at the log-log chart shows that, if we look out to x > 500, there is indeed a very big difference.  The Normal distribution approaches zero very fast.  The Log-normal distribution approaches zero more slowly.  And the Pareto distribution never curves downward on the log-log scale, so it approaches zero far more slowly than the other two.
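
To put rough numbers on that difference, here is a small Python sketch that computes the tail probability P(X > x) for each truncated distribution from closed-form expressions. This is a related view, not a reproduction of the density charts: the function names are mine, and I am again assuming the Log-normal parameters describe log(X) and the Pareto shape is 1.

```python
import math


def normal_sf(z):
    """Upper-tail probability of a standard Normal, via the error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def tail_normal(x, mu=0.0, sigma=54.0, lower=10.0):
    """P(X > x) for the Normal truncated below `lower` (valid for x >= lower)."""
    return normal_sf((x - mu) / sigma) / normal_sf((lower - mu) / sigma)


def tail_lognormal(x, mu=3.49, sigma=0.9, lower=10.0):
    """P(X > x) for the Log-normal truncated below `lower`."""
    z = (math.log(x) - mu) / sigma
    z_lower = (math.log(lower) - mu) / sigma
    return normal_sf(z) / normal_sf(z_lower)


def tail_pareto(x, shape=1.0, lower=10.0):
    """P(X > x) for the Pareto truncated below `lower`: a pure power law."""
    return (lower / x) ** shape


if __name__ == "__main__":
    print(f"{'x':>7}  {'normal':>10}  {'lognormal':>10}  {'pareto':>10}")
    for x in (200, 500, 1000, 10000):
        print(f"{x:>7}  {tail_normal(x):10.3e}  "
              f"{tail_lognormal(x):10.3e}  {tail_pareto(x):10.3e}")
```

Under these assumptions the Pareto tail probability is simply 10/x, so it shrinks by a factor of ten only when x grows by a factor of ten, while the Normal tail collapses far faster.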

But what does this mean to our four questions, above?  Let's investigate!

Comparing Distributions of Sample Data

Here are histograms of the sample data, shown above as time series. What do you notice first?

First, they all have similar sample means (about 50) and they are all skewed right.  The Normal and the Pareto have very similar standard deviations. Also the Log-normal and Pareto both have some extreme values, while the Normal distribution has none.  Common sense might tell us to toss them out as "outliers", especially the single datum of 1413 in the Pareto time series (bottom).

If 200 samples were all we had for these three and we had to make decisions based on the summary statistics, it might be tempting to call them all "approximately equal", or at least the last two.  To find out if this is prudent, let's see what happens to our estimates if we get more sample data -- a LOT more data.

What Happens If We Get More Data?

The following charts show how our estimates of mean, standard deviation, and maximum value change as the sample size ranges from 10 to 100,000,000 = 10^8 = one hundred million data points!  The estimated statistic is on the vertical axis and the sample size is on the horizontal axis.  Each of these is plotted on a log-log scale to make it easier to see what happens when sample sizes grow really large.  What do you notice on each of these?

First, it's obvious that common sense is right about the Normal distribution and mostly right about the fat-tailed Log-normal distribution.  For both Normal and Log-normal, estimates of mean and standard deviation don't change much after getting a decent sample size.  "Decent sample size" appears to be somewhere between 500 and 1000 data points.  However, for the maximum datum in the sample, the Normal and Log-normal look somewhat different.  For the Log-normal, more samples mean that the largest value will gradually increase.

Thus, the thin-tailed Normal distribution and the somewhat fat-tailed Log-normal seem "well behaved" to our common sense.

Second, notice the region below 100 samples.   The orange line of the Pareto Distribution is below the purple Log-normal line.  Thus, if you had a small sample, you might be led to believe that the third stochastic process had a lower mean, lower standard deviation, and lower "worst case".  Such an inference would be completely wrong!

Finally, notice that the Pareto distribution is a very different beast!  In all three charts, the estimated parameter keeps growing as our sample size gets bigger, and (except for "bumpiness") doesn't really look like it will level out.  If our estimates grow without bound as we get more sample data, does that imply that there is no stable estimate for them?  YES!  I didn't say this earlier, but both the mean and standard deviation of the Pareto Distribution are undefined for the chosen parameter values.  (They are "undefined" because the sums that define them grow without limit instead of settling on a finite value, loosely analogous to the way dividing by zero is undefined in ordinary arithmetic.)
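
If you want to watch this happen on your own machine, the following Python sketch tracks the running mean, standard deviation, and maximum at a few sample-size checkpoints. It is a scaled-down illustration, not the code behind the charts: it stops at 100,000 draws rather than one hundred million, the names and seed are invented for the example, and the Pareto shape of 1 and the log-scale reading of the Log-normal parameters are assumptions.

```python
import math
import random


def pareto_draw(rng, shape=1.0, lower=10.0):
    """Inverse-CDF draw from a Pareto whose minimum is `lower`."""
    return lower / (1.0 - rng.random()) ** (1.0 / shape)


def lognormal_draw(rng, mu=3.49, sigma=0.9, lower=10.0):
    """Rejection-sampled Log-normal truncated below `lower`."""
    while True:
        x = rng.lognormvariate(mu, sigma)
        if x >= lower:
            return x


def running_stats(draw, checkpoints, seed=0):
    """Return (n, mean, stdev, max) at each checkpoint, in one pass.

    Mean and variance are updated with Welford's online algorithm so the
    whole sample never has to be stored.
    """
    rng = random.Random(seed)
    wanted = set(checkpoints)
    n, mean, m2, largest = 0, 0.0, 0.0, -math.inf
    rows = []
    for _ in range(max(checkpoints)):
        x = draw(rng)
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        largest = max(largest, x)
        if n in wanted:
            stdev = math.sqrt(m2 / (n - 1)) if n > 1 else 0.0
            rows.append((n, mean, stdev, largest))
    return rows


if __name__ == "__main__":
    checkpoints = [10 ** k for k in range(1, 6)]  # 10 up to 100,000
    for name, draw in (("lognormal", lognormal_draw), ("pareto", pareto_draw)):
        print(name)
        for n, mean, stdev, largest in running_stats(draw, checkpoints, seed=1):
            print(f"  n={n:>7}  mean={mean:10.1f}  "
                  f"stdev={stdev:12.1f}  max={largest:14.1f}")
```

Try a few different seeds. The Log-normal rows should settle down quickly, while the Pareto rows tend to jump whenever a new extreme value arrives.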

This leads to the following anti-common-sense answers to our four questions, listed at the start of this post:
  1. There is no "average" or "typical" value for a Pareto Distribution*.
  2. We can't say how much it varies if we only consider the "spread" measured by standard deviation.  There is no standard deviation.
  3. We can't say what the maximum value will be if the sample size continues to grow.
  4. We can never get enough data to arrive at a stable estimate for these summary statistics.
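*Note: These statements apply to a subset of Pareto Distributions, namely those with small enough exponents.  In terms of the shape parameter, the mean is infinite when the shape is 1 or less, and the standard deviation is infinite when the shape is 2 or less.  Above those critical levels, we might say the tail is "fat but not too fat", while below them tails are "really fat"!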

Proper Handling

So... if we face a time series generated by a Pareto Distribution, are we screwed?

No.  Emphatically: NO!

It is not that we can't know anything about these types of stochastic processes.  They are not completely unknown unknowns.  Instead, we just have to avoid the "unruly" aspects through smarter handling.  Come to think of it, handling very fat tailed distributions is a lot like handling poisonous snakes.  With the right tools, experience, and precautions, you can handle them safely and effectively.

I'm reminded of this narration line from the movie The Gods Must Be Crazy:
"In this world of theirs, nothing is bad or evil. Even a poisonous snake is not bad. You just have to keep away from the sharp end. Actually, a snake is very good - in fact, it's delicious. And the skin makes a fine pouch."
The main difference is that with poisonous snakes, it's the head you have to worry about.  In our case, it's the tails.  <groans and uneasy laughter>

Here's a list of things that analysts and decision makers can do to successfully cope with the unruliness of very fat tailed probability distributions:
  1. To the method of frequentist statistical analysis of historical data, add other methods and other data.  Simulations, laboratory experiments, and subjective probability estimates by calibrated experts are just three alternative methods that can fill in for the limitations of frequentist methods with limited sample data.
  2. Resist using colloquial terms like "average", "typical", "spread", or even "worst case".  Using them will only add to confusion, misunderstanding, and mis-set expectations.
  3. Communicate and decide using quantiles, not the usual summary statistics (mean, standard deviation, etc.).  If any summary statistics are used as decision criteria or in models, use quantiles.
  4. Balance cautiousness with expediency.  With any limited sample of data, be careful about tossing out "outliers".  But also avoid the opposite error of being too cautious.  Not every stochastic process has a heavy tailed distribution, and even those that do aren't necessarily "very fat" like the Pareto Distribution.
  5. Avoid any statistical methods that assume an underlying normal distribution or thin tails.   There are other methods that make few or no assumptions about the underlying distribution.  They aren't as powerful, aren't as well known, and have other assumptions, but they are still useful.
  6. Put in some effort to estimate the "fatness" of the tail, either parametrically or non-parametrically.  Even a not-very-good fat tail model is much better than one based on thin tails.  There are ways to test how good the alternative models are.   In my opinion, the best academic paper on this is "Power-law distributions in empirical data".
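
To make items 3 and 6 a little more concrete, here is a minimal Python sketch that summarizes a sample with quantiles and makes a rough estimate of tail "fatness" with the Hill estimator. The Hill estimator is my choice for illustration; this sketch does not reproduce the fitting and testing procedure of the paper mentioned above. The function names and the choice of k (how many of the largest values to use) are assumptions, and in practice the estimate can be sensitive to k.

```python
import math
import random
import statistics


def quantile_summary(data):
    """Median, 90th and 99th percentiles from the sample."""
    cuts = statistics.quantiles(data, n=100)  # 99 percentile cut points
    return {"p50": cuts[49], "p90": cuts[89], "p99": cuts[98]}


def hill_tail_index(data, k):
    """Hill estimate of the tail (shape) index from the k largest values.

    Smaller values mean fatter tails. Requires positive data, and assumes
    the k largest values are not all tied with the threshold.
    """
    if not 0 < k < len(data):
        raise ValueError("k must satisfy 0 < k < len(data)")
    top = sorted(data, reverse=True)[: k + 1]
    threshold = top[k]  # the (k+1)-th largest value
    mean_log_excess = sum(math.log(x / threshold) for x in top[:k]) / k
    return 1.0 / mean_log_excess


if __name__ == "__main__":
    rng = random.Random(42)  # arbitrary seed
    # Synthetic Pareto sample with minimum 10 and shape 1 (assumed values).
    sample = [10.0 / (1.0 - rng.random()) for _ in range(20000)]
    print("quantiles:", quantile_summary(sample))
    print("Hill tail index (k=2000):", round(hill_tail_index(sample, 2000), 3))
```

The quantiles stay meaningful even when the mean and standard deviation do not, and a Hill estimate near or below 1 is a warning sign that the sample mean should not be trusted.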

That's all for this tutorial.  Hope it helps!

1 comment:

  1. It is easier to visualize fat tails by using normal quantile-quantile plots than by using histograms or density plots. As you note, the tails are near zero in all cases (even for fat tails), so it is hard to compare them visually. On the other hand, it is very easy to see the differences when you use normal quantile-quantile plots because the tail behavior is greatly amplified.