Why are normal distributions so normal? Part of the answer is selective reporting. In physical systems, an indefinite number of random (as in “unaccounted for”) factors combine to affect an outcome. These combinations are sometimes, but not always, additive. For example, height can be seen as the sum of genetic and dietary contributions. The combined effect is then the same as the sum of the separate effects, as if you had been exposed to each factor on its own. Such systems are called “linear”, and they dominate education and our intuitions because they are so simple and learnable. In nature, phenomena typically behave linearly only over restricted ranges: eating may increase height, but only until obesity starts affecting your posture. To the extent that such a quantification makes sense at all, only a small subset of all effects is linear.
The rest are non-linear: an increase in input does not have a proportionate effect, because internal system boundaries cut the linear range short. “Synergistic” and “antagonistic” effects are greater and less than the sum of the separate effects, respectively, as if the diet could meddle with your DNA to enhance or reduce its own influence on your height. Non-linear systems are therefore often hard to predict, and frequently not a tractable object for scientific study.
Therefore, if normal distributions seem pervasive out in the wild, it is because the underlying system’s single, strong phase space attractor makes them salient to us. It gives them a characteristic scale in which extreme outcomes are extremely rare (i.e. the difference between the largest and smallest values is quite small, which is not true for e.g. power-law distributions), and, crucially, a symmetrical unimodal shape with a meaningful average to which the system regresses over time.
The mathematical idealization of the normal distribution can be understood by considering a quincunx, a mechanical device developed by Sir Francis Galton in the 1800s. Balls are dropped onto a triangular array of pins on a vertical board, so that each ball bounces either left or right with 50/50 probability as it hits a pin. After the last level, each ball falls into a bin, where the balls stack up. The resulting stacks will roughly form a binomial distribution. This is because the 2^n possible paths through the system (n being the number of levels), though of equal length, differ in the number of lefts and rights, and there are far more possible paths than there are bins: a ball’s bin depends only on how many lefts and rights it took, not on their order, so n levels give only n+1 bins.
If we represent L and R by 0 and 1, we may think of these combinations as sums. For example, only one path, consisting of only rights, leads to the rightmost bin, but many different paths have half L, half R. The binomial distribution gives us the expected frequencies of all these possible combinations (sums). Each pin in a path (i.e. each level, the number of which is n) is a trial with a particular probability (p) of going right. The distribution is defined by these two parameters, n and p, and is written B(n,p). As n increases, the proportions of balls in each bin for B(n,0.5) come ever closer to the bell shape that we call the normal distribution. The mathematics of the binomial distribution is explained below.
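The path-counting argument can be checked by brute force for a small board. The following is only a sketch of an idealized quincunx, assuming exactly two equally likely directions at every pin; the function name `bin_counts` and the choice of 6 levels are illustrative and not part of Galton’s design.

```python
from itertools import product
from math import comb


def bin_counts(levels):
    """Count how many left/right paths end in each bin of an idealized quincunx."""
    counts = [0] * (levels + 1)
    # Encode left as 0 and right as 1; the bin index is the number of rights.
    for path in product((0, 1), repeat=levels):
        counts[sum(path)] += 1
    return counts


levels = 6  # illustrative board size
counts = bin_counts(levels)
total_paths = 2 ** levels

for k, paths in enumerate(counts):
    # comb(levels, k) is the binomial coefficient "levels choose k".
    # With p = 0.5 every path is equally likely, so paths / total_paths
    # is the share of balls expected in bin k.
    print(k, paths, comb(levels, k), paths / total_paths)
```

The enumerated path counts should match the binomial coefficients column for column, with the middle bins collecting the most paths.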
The quincunx provides physical evidence for a general statistical fact about systems in which many factors contribute to a quantity additively, known as the “Central Limit Theorem”. It states that, for independent draws of any random variable with finite variance, as the number of draws that are summed becomes large, the distribution of their sums will approach a normal distribution. To understand this, consider how, for a discrete random variable such as a die, some sums will be more common than others, as a consequence of there being more ways in which they can occur. Thus, for two fair dice, 7 is a more likely sum than 12: six of the 36 equally likely rolls add up to 7, but only (6,6) adds up to 12.
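The dice example can be verified by listing every possible roll. This is a minimal sketch for fair six-sided dice; the helper name `sum_counts` is made up for illustration.

```python
from collections import Counter
from itertools import product


def sum_counts(faces=(1, 2, 3, 4, 5, 6), dice=2):
    """Count, for each possible total, how many equally likely rolls produce it."""
    return Counter(sum(roll) for roll in product(faces, repeat=dice))


counts = sum_counts()
for total in sorted(counts):
    # One '#' per way of rolling this total.
    print(total, counts[total], "#" * counts[total])
```

The printed bars already hint at the peaked, symmetric shape that becomes smoother as more dice are added.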
This holds, perhaps counter-intuitively, regardless of the initial probability distribution. The die may be so biased that, in 95% of all cases, 6 will come up, and 1% each for the rest, making (6,6,6) an abundantly likely outcome. Nevertheless, as the number n of samples that you sum increases, the other outcomes will begin to assert themselves. For example, if you sum 100 samples, 1, 2, 3, 4 and 5 will each appear once on average, resulting in a sum slightly less than 6*100 (about 585 on average). Sometimes your n=100 sum will be more or less than this, but sums close to this value will be the most common. They form the middle of the approximately normal distribution that results as you collect more and more n=100 sums, because they are the most frequent in the space of all possible n=100 sums, just as 7 is the most common n=2 sum for fair dice.
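For the biased die, the distribution of the n=100 sum can be computed exactly rather than simulated, by repeatedly combining the single-roll distribution with itself. The sketch below assumes independent rolls; `sum_distribution` and `biased_die` are names chosen for this illustration. The expected value of one roll is 0.95*6 + 0.01*(1+2+3+4+5) = 5.85, so the expected sum of 100 rolls is 585.

```python
def sum_distribution(pmf, n):
    """Exact distribution of the sum of n independent draws.

    pmf maps each face value to its probability.
    """
    dist = {0: 1.0}  # before any roll, the sum is 0 with certainty
    for _ in range(n):
        nxt = {}
        for total, p_total in dist.items():
            for face, p_face in pmf.items():
                nxt[total + face] = nxt.get(total + face, 0.0) + p_total * p_face
        dist = nxt
    return dist


# The heavily biased die from the text: 6 comes up 95% of the time.
biased_die = {1: 0.01, 2: 0.01, 3: 0.01, 4: 0.01, 5: 0.01, 6: 0.95}

dist = sum_distribution(biased_die, 100)
mean = sum(total * p for total, p in dist.items())
mode = max(dist, key=dist.get)

print("mean of the n=100 sum:", round(mean, 2))
print("most frequent n=100 sum:", mode)
print("probability of rolling only sixes (sum 600):", dist[600])
```

Plotting `dist` for increasing n would show the single spike at 6 gradually turning into a bell-like hump slightly below 6n.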
This emergence, over time and as additive effects accumulate, of a clear central value from a phase space of possible outcomes is closely related to the more famous “Law of large numbers”. It states that the average of a larger sample is less likely to be affected by random variation, since fluctuations tend to cancel each other out. It is like superimposing many diffuse images, so that the randomness is averaged away and the signal pierces through. So there are two distinct reasons why statistics as a discipline is so strongly associated with the normal distribution. Confusing the two can cause a lot of headaches:
- It is common, but by no means universal, in Nature “out in the wild”, because the variables that we find salient and are curious about are naturally those with a characteristic scale and typical value, which generally are the result of the additive effects of many random events.
- Statisticians may, for a distribution that is not normal, repeatedly collect samples and sum (or average) them, and the distribution of these sums will be approximately normal, provided that each sum contains enough samples.
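The second point can be illustrated with a small simulation. The sketch below draws from an exponential distribution, chosen here only as an example of a clearly skewed, non-normal distribution, and measures how lopsided the distribution of sums is as more draws go into each sum. The helper names, the seed and the sample sizes are arbitrary choices for illustration.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: 0 for a symmetric distribution, positive for a long right tail."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


def sums_of_draws(n, how_many, rng):
    """Return `how_many` sums, each adding up n draws from an exponential distribution."""
    return [sum(rng.expovariate(1.0) for _ in range(n)) for _ in range(how_many)]


rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (1, 5, 50):
    sums = sums_of_draws(n, 20000, rng)
    print("draws per sum:", n, "skewness of the sums:", round(skewness(sums), 2))
```

The skewness should shrink towards zero as n grows, meaning that the sums look increasingly symmetric and bell-shaped even though the individual draws do not.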