Populations and variables
There would not be much point to science unless it produced knowledge that was otherwise not available to us. The goal of science is to generalize into the past, present and future so as to reliably infer things that we cannot observe directly. Yes, some areas of science are of a localized, fine-grained variety, but the goal is ultimately to use this in order to detect patterns that latch onto system regularities, allowing us to re-engineer Nature in adaptive, self-serving ways.
As part of this pragmatic means-end analysis, and in a truly astounding feat of human cognition, we mentally parse events into meaningful categories, called populations, upon which we project variables – possibility spaces of potential states. “Mankind” is an example of a population, with a variable such as “gender”, but so are “upper-class Norwegians”, “swordfish”, “all earthquakes” and “all theoretically possible coin tosses”. Moreover, member states can be distinguished categorically – based on how you interact with them in qualitatively different ways – or numerically, if they differ on a quantitative dimension. Statisticians have tried to erect complex taxonomies for the different types of variables, but since variables are products of our brains, there is a degree of arbitrariness in how we frame them. For example, the variable “color” can be considered either categorical (including blue, green, red…) or numerical, based on electromagnetic frequency. Indeed, any continuous measurement can be chunked into separate groups, like “tall” and “short”, to simplify analysis, at the cost of less fine-grained results.
In a population, a variable’s possible states may vary in their relative frequencies. Based on an observed subset of a population, known as a sample, we hope to find a way to predict what state a system will assume, or, at the very least, how certain we should be about a particular outcome, to support our decision-making. We are, in other words, looking for probability distributions and how different factors change them.
The shape of a probability distribution reflects – albeit very indirectly and cryptically – logical properties about the generating mechanism underneath. An economist may infer from a national income distribution whether the economic system is socialist or capitalist. A gambler may infer that a roulette wheel is biased. Some shapes are rather ubiquitous and mathematically elegant, indicating general organizing principles in Nature. Two populations could generate a probability distribution of the same shape, though one could be wider and the other taller, and they may apply to different types of scales. Quantities that define the particulars of a distribution, apart from its general shape, are known as parameters. Based on corresponding quantities of the sample, known as statistics, the hope is to infer the population parameter, which, given the shape, holds the key to the probability distribution we are looking for.
Given a set of samples of a variable, we are curious about whether or not we should re-carve our reality and regard them as separate categories from different populations. As usual, whether this is meaningful or not depends on whether it affords us any predictive power. Because to be a member of a different category means having different parameters, parameters can be thought of as manipulable knobs, whose settings remain fixed as the system changes dynamically. An experiment effectively asks whether an observed change is attributable to different knob-settings.
Indeed, the metaphor of a system as a man-made contraption has a strong appeal: you may imagine yourself as an archeologist who unearths a complex, mechanical device that has no obvious purpose. You search for ways of adjusting it (you distinguish and alter its “parameters”) and observe its effects (the population is its behavior at each instant in time). You may turn a knob and find that the machine emits a sound, suggesting that the alteration implied a change in parameter values. Or, it could be due to some other change: maybe, just as you turned the knob, a confluence of events independent of your action caused the sound. Ideally, therefore, you would like to rewind time and see if it would happen in the absence of your turning the knob, but because the laws of spacetime won’t allow such counterfactual exploration, you assume that the times that you do not turn the knob represent this scenario.
This idea – of creating “fake counterfactuals” – is central to experimental design. The manipulation made by the researcher is known as the “independent variable”, while the results are measured along a “dependent variable”. To fake parallel universes, either the same subjects undergo all the different treatments at different points in time, making it a “within-subject variable”, or different subjects undergo only one treatment each, which can be done simultaneously, making it a “between-subject variable” where there are no dependencies between data. Which design is preferred depends on the variables – in general, the former leads to less noise (since the subject is constant across the conditions) but the latter has fewer confounds (since there are no practice effects and the like).
While it is practically impossible to account for all conceivable influences, our failure to do so won’t pose a problem if their aggregated influence is balanced across all conditions. This is what randomization aims for: an experimenter assigns subjects to conditions using some form of random number generator, so that each subject has an equal probability of ending up in any condition. No category of subjects will be systematically biased to receive one treatment over the other. Uncontrolled, “extraneous” differences in gender, mood, height, et cetera, would thus tend to be cancelled out by chance. Hence, the mere existence of an uncontrolled extraneous variable is not, by itself, a valid criticism of a randomized experiment. Balancing-by-randomness, however, is not guaranteed, so group allocations are typically examined afterwards to make sure that they are not grossly imbalanced on the most plausible confounds. Sometimes random assignment is impossible. If, for example, gender is used as a between-subject variable, height and other variables will co-vary with it, since men are taller on average, making the design “quasi-experimental”. If height is a priori judged to be a potential confound, the groups will therefore have to be matched on it instead.
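As a rough illustration of balancing-by-randomness, the sketch below assigns hypothetical subjects to two conditions at random and then compares the groups on an extraneous variable. The subject labels, the height values and the function name `randomize` are all invented for this example.

```python
import random

def randomize(subjects, conditions, seed=None):
    """Shuffle the subjects, then deal them out to the conditions in turn."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {c: [] for c in conditions}
    for i, subject in enumerate(shuffled):
        groups[conditions[i % len(conditions)]].append(subject)
    return groups

# Hypothetical subjects with one uncontrolled, extraneous variable (height in cm).
height_rng = random.Random(0)
heights = {f"s{i}": height_rng.gauss(170, 10) for i in range(200)}

groups = randomize(list(heights), ["treatment", "control"], seed=1)

# Check afterwards that the groups are not grossly imbalanced on height.
for name, members in groups.items():
    mean_height = sum(heights[m] for m in members) / len(members)
    print(name, len(members), round(mean_height, 1))
```

With groups this large the two mean heights should usually land close together, but nothing forces them to, which is why the post-hoc check is worth running.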
For a better understanding of aggregated randomness, let us now focus on natural populations with the most famous distribution of all – the normal distribution.
The concept of additive effects
Why are normal distributions so normal? The answer is partly because of selective reporting. In physical systems, an indefinite number of random (as in “unaccounted for”) factors combine to affect an outcome. These combinations are sometimes, but not always, additive. For example, height can be seen as the sum of genetic and dietary contributions. The combined effect is the same as the sum of separate effects, as if you were exposed to both separately. Such systems are called “linear” and they dominate education and our intuitions because they are so simple and learnable. In nature, phenomena typically behave linearly only over restricted ranges: eating may increase height, but only until obesity starts affecting your posture. To the extent that such a quantification makes sense at all, only a small subset of all effects are linear.
Others are non-linear, in which an increase in input will not have a proportionate effect, because internal system boundaries truncate linearities. “Synergistic” and “antagonistic” effects are greater and less than the sum, respectively, as if the diet could meddle with your DNA to enhance or reduce its own influence on your height. Non-linear systems are thus characteristically unpredictable, and often not a tractable object for scientific study. Therefore, if normal distributions seem pervasive out in the wild, it is because the underpinning system’s single, strong phase space attractor makes them salient to us. It gives them a characteristic scale in which extreme outcomes are extremely rare (i.e. the difference between the largest and smallest values is quite small, which is not true for e.g. power-law distributions), and, crucially, a symmetrical unimodal shape with a meaningful average to which the system regresses over time.
The central limit theorem
The mathematical idealization of the normal distribution can be understood by considering a Quincunx, a mechanical device developed by Sir Francis Galton in the 1800s. Balls are dropped on a triangular array of pins on a vertical board, so that a ball bounces either left or right with 50/50 probability as it hits a pin. After the last level, each ball falls into a bin, where the balls stack up. The stacks that result will roughly form a binomial distribution. This is because the 2^n possible paths through the system (n being the number of levels), though of equal length, differ in the number of lefts and rights, and there are more possible paths than there are possible L-R combinations. If we represent L and R by 0 and 1, we may think of combinations as sums. For example, only one path, corresponding to only rights, leads to the rightmost bin, but many different paths have half L, half R. The binomial distribution gives us the expected frequencies of all these possible combinations (sums). Each pin in a path (i.e. level, the number of which is n) is a trial with a particular probability (p). The distribution is defined by these two parameters – n and p – with a notation of B(n,p). As n increases, the proportions of balls in each bin for B(n,0.5) approach what we call the normal distribution. The mathematics of the binomial distribution is explained below.
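A small simulation may make the quincunx concrete. The sketch below drops balls through n levels, counting rights as 1 and lefts as 0, and compares the resulting bin proportions with the binomial probabilities for B(n,p). The function names and the choice of 10 levels and 20,000 balls are assumptions made for illustration.

```python
import random
from math import comb

def quincunx(n_levels, n_balls, p_right=0.5, seed=None):
    """Return ball counts per bin; bin k holds the balls that went right k times."""
    rng = random.Random(seed)
    bins = [0] * (n_levels + 1)
    for _ in range(n_balls):
        rights = sum(rng.random() < p_right for _ in range(n_levels))
        bins[rights] += 1
    return bins

def binomial_pmf(k, n, p):
    """Probability of exactly k rights in n levels: (number of paths) * (path probability)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, balls = 10, 20_000
counts = quincunx(n, balls, seed=42)

# Observed proportion per bin next to the binomial expectation.
for k, count in enumerate(counts):
    print(k, round(count / balls, 3), round(binomial_pmf(k, n, 0.5), 3))
```

The `comb(n, k)` factor is the number of distinct paths that end in bin k, which is why the middle bins fill up fastest.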
The quincunx provides physical evidence for a general statistical fact about systems in which many factors contribute to a quantity additively, known as the “Central Limit Theorem”. It states that, for any random variable (with finite variance), as the number of independent outcomes that are summed becomes large, the distribution of their sums will approach a normal distribution. To understand this, consider how, for a discrete random variable such as a die, some sums will be more common than others, as a consequence of there being more ways in which they can occur. Thus, for two fair dice, 7 is a more likely sum than 12. This holds, maybe counter-intuitively, regardless of the initial probability distribution. The die may be so biased that, in 95% of all cases, 6 will come up, with 1% for each of the other faces, making (6,6,6) an abundantly likely outcome. Nevertheless, as the number n of samples that you sum increases, the other outcomes will begin to assert themselves. For example, if you sum 100 samples, 1, 2, 3, 4 and 5 will each appear once on average, resulting in a sum slightly less than 6*100 (585 on average). Sometimes, your n=100 sum will be more or less than this, but this will be roughly the most common sum, and form the middle value in the approximately normal distribution that results as you collect more and more n=100 sums, for it is about the most frequent sum in the space of all possible n=100 sums, just as 7 is the most common n=2 sum for fair dice.
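The biased-die example can be tried out directly. The sketch below is a simulation under the stated probabilities (95% for a six, 1% for each other face); the number of repetitions and the variable names are arbitrary choices for illustration.

```python
import random
from statistics import mean, median

faces = [1, 2, 3, 4, 5, 6]
weights = [0.01, 0.01, 0.01, 0.01, 0.01, 0.95]  # heavily biased towards 6

# Expected value of a single roll: 0.01*(1+2+3+4+5) + 0.95*6
expected_single = sum(f * w for f, w in zip(faces, weights))

def sum_of_rolls(n, rng):
    """Sum of n independent rolls of the biased die."""
    return sum(rng.choices(faces, weights=weights, k=n))

rng = random.Random(0)
sums = [sum_of_rolls(100, rng) for _ in range(5000)]

print("expected n=100 sum:", 100 * expected_single)
print("mean of simulated sums:", mean(sums))
print("median of simulated sums:", median(sums))
```

A histogram of `sums` should look far more bell-like than the single-roll distribution, though with n=100 some leftover skew towards lower sums is to be expected, since no sum can exceed 600.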
The fact that a phase space of possible outcomes translates, over time, as additive effects accumulate, to a clear, central value is more famous as the “Law of large numbers”. It states that a larger sample is less likely to be affected by random variation, since fluctuations will cancel each other out. It is like superimposing many diffuse images, so that the randomness is averaged away and the signal pierces through. So there are two distinct reasons why statistics as a discipline is so strongly associated with the normal distribution. Confusing the two can cause a lot of headaches:
- It is common, but by no means universal, in Nature “out in the wild”, because the variables that we find salient and are curious about are naturally those with a characteristic scale and typical value, which generally is the result of additive effects of many random events.
- Statisticians may, for a distribution that is not normal, collect samples and sum them, and the distribution of these sums will be approximately normal.
The parameters of the normal distribution
What, then, are the parameters of the normal distribution? An idealized, normal distribution is fully determined given two quantities: its midpoint and its width. As already mentioned, normal distributions are special, because their symmetry and short tails mean that they have a “central tendency” that can be used as a model to predict values and summarize a dataset. When the distribution is skewed, it becomes useful to distinguish between different measures of central tendency (mode, median and mean), but for a bell-shaped symmetrical one, these are the same, and coincide with the mid-point. However, because empirically derived distributions are never perfect, the arithmetic mean, which takes the most information into account, is the one used, and therefore the mid-point population parameter is the mean, even though it equals the others.
The mean, effectively, is like the fulcrum of a balanced pair of scales poised at the center of all deviations from itself. It is defined as the point where deviations from it sum to zero. The width of the distribution, a kind of average deviation from the mean (strictly, the square root of the average squared deviation), is called the “standard deviation”. The wider the distribution, the poorer the mean will be as a predictor, and the more noise in the data. The intuitive explanation for its formula leaves it slightly under-determined – it is motivated by the mathematical equation of the normal distribution – but is shown below.
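For reference, the definitions alluded to here can be written out as follows. The symbols are notation assumed for this sketch rather than taken from the text: x_i for the N values in the population, μ for the mean, σ for the standard deviation and f for the normal density.

```latex
% The mean is the point about which the deviations sum to zero.
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad\Longleftrightarrow\qquad
\sum_{i=1}^{N} (x_i - \mu) = 0

% The standard deviation is the root of the mean squared deviation.
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}

% Density of the normal distribution with parameters \mu and \sigma.
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,
       \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
```

The squared deviation in the exponent of the density is what motivates squaring the deviations in σ, rather than, say, averaging their absolute values.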
Given a population that is known to be normal, statisticians can use the mean and the standard deviation to calculate a density function that associates a particular value-range with a probability (a single point-value has an infinitesimally small probability, so we can only consider intervals). Because, regardless of the parameters, the same location of a range relative to the mean (expressed in terms of standard deviations – a length unit, remember) has the same probability, it makes sense to re-express a certain value in terms of “distance from the mean” on a standard normal distribution. It is the equivalent of two tradesmen translating the value of their goods to a shared currency, and the re-expressed value is called a z-score.
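The shared-currency idea can be sketched in a few lines using Python's standard library. The two populations below (heights with mean 170 and SD 10, and a test score with mean 100 and SD 15) are hypothetical numbers chosen for illustration, as are the function names.

```python
from statistics import NormalDist

STANDARD = NormalDist(mu=0.0, sigma=1.0)  # the standard normal distribution

def z_score(x, mean, sd):
    """Distance of x from the mean, measured in standard deviations."""
    return (x - mean) / sd

def prob_between(lo, hi, mean, sd):
    """Probability that a normal variable with the given mean and SD falls in [lo, hi]."""
    return STANDARD.cdf(z_score(hi, mean, sd)) - STANDARD.cdf(z_score(lo, mean, sd))

# Two differently scaled populations, same range relative to the mean (within 1 SD).
print(prob_between(160, 180, mean=170, sd=10))
print(prob_between(85, 115, mean=100, sd=15))
```

Both calls go through the same standard normal distribution, so ranges that sit at the same z-scores get the same probability whatever the original units were.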
Now recall that the Law of Large Numbers implies that large and unbiased samples will resemble the population from which they come. The probability that any such sample will deviate greatly from the population is low. Now imagine a weirdly distributed population, of known mean and SD, real or just a theoretical phantom, and, for a certain sample size, consider the space of all possible samples (i.e. subsets) that could be drawn from it. Take the mean of each of those samples, erect a histogram, and call this the “sample distribution of the mean” (more commonly called the sampling distribution of the mean). You may regard it as a population in its own right. Because of the central limit theorem (some averages are more common than others), this distribution will be more bell-shaped than the original distribution. The bigger the sample size, the smoother it will be, and the better any theoretically based estimate will be.
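The construction can be approximated by simulation. In the sketch below, the "weird" population is a right-skewed one drawn from an exponential distribution, and the space of all possible samples is stood in for by a few thousand random samples per sample size. Both choices, along with the crude `skewness` helper, are assumptions made for illustration.

```python
import random
from statistics import mean, stdev

def skewness(xs):
    """Crude skewness: near 0 for a symmetric shape, positive for a long right tail."""
    m, s = mean(xs), stdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

rng = random.Random(0)
population = [rng.expovariate(1.0) for _ in range(50_000)]  # strongly right-skewed
pop_mean = mean(population)

def sample_means(n, reps=4000):
    """Means of `reps` random samples of size n (a stand-in for all possible samples)."""
    return [mean(rng.sample(population, n)) for _ in range(reps)]

results = {n: sample_means(n) for n in (2, 10, 50)}

print("population:", round(pop_mean, 3), round(skewness(population), 2))
for n, means in results.items():
    # The mean of the sample means stays near the population mean,
    # while the skewness shrinks as the sample size grows.
    print(n, round(mean(means), 3), round(skewness(means), 2))
```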
For reasons that are intuitively clear, the population of the means of all possible samples drawn from a population distribution will have the same mean as the population itself. This means that there is a high likelihood that the samples you draw will have statistics that are similar to the parameters. Therefore, if you have your sample and know the population parameters, you may estimate how likely it is that the sample is indeed drawn from this population. If your sample mean falls far from the population mean, the conditional probability P(score this far away | sample drawn from this population) is low.
Now suppose instead that the population distribution is unknown. Usually, your sample is the only information you have. Again, our goal is to construct a sample distribution of the mean to gauge where our sample falls on it. To do this, we need to know its width, that is, how much sample means are expected to vary, in other words, the standard deviation of the sample means. We know that:
- For a large sample size, there will be less variability (since noise will be cancelled out, causing means to cluster tightly).
- For a large population SD, there will be more variability (since how much it is expected to vary depends on how much it actually varies). As said, the population SD is not available, so we have to base it on sample SD instead.
This calls for a mathematical formula with the sample SD as numerator (so that a big SD causes a bigger value) and sample size as denominator (so that a big N means a smaller value), and it goes by the name “standard error of the mean” (which has the added quirk that the square root of the denominator is taken, because of the mathematics of the normal distribution): SE = SD/√N.
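In code, the standard error of the mean is the sample SD divided by the square root of the sample size. The sketch below assumes the usual sample SD (with N−1 in its denominator, as computed by `statistics.stdev`); the data values and function names are made up for illustration.

```python
from math import sqrt
from statistics import stdev

def sem_from(sd, n):
    """Standard error of the mean, given an SD and a sample size."""
    return sd / sqrt(n)

def standard_error(sample):
    """Standard error of the mean, estimated from the sample itself."""
    return sem_from(stdev(sample), len(sample))

sample = [4.1, 5.0, 5.6, 4.8, 5.3, 4.4, 5.9, 5.1]  # hypothetical measurements
print(standard_error(sample))

# Because of the square root, quadrupling the sample size only halves the error.
print(sem_from(10, 100), sem_from(10, 400))
```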
To find how well parameter values predict a certain sample, we need to calculate the equivalent of the z-score. However, this time, since our evidential support is weaker (we, for example, use the sample SD to estimate the population SD), modelling the sample distribution of the mean as a normal distribution would give us overconfidence in our estimates. We need a higher burden of proof. For this purpose, statisticians have come up with a distribution that has fatter tails (implying a larger critical value than z*), known as Student’s t-distribution, which, cleverly, approaches the normal distribution for large sample sizes.
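The overconfidence can be checked by simulation. The sketch below draws many small samples from a standard normal population, builds a 95% interval around each sample mean using the sample SD, and counts how often the interval contains the true mean. It compares the normal critical value with the tabulated critical value of the fatter-tailed distribution (Student's t) for 4 degrees of freedom, 2.776. The function name, sample sizes and repetition count are assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 95% normal critical value, about 1.96
T_CRIT_DF4 = 2.776                    # tabulated two-sided 95% t value, 4 degrees of freedom

def coverage(n, crit, reps=10_000, seed=0):
    """Fraction of samples whose interval mean +/- crit*SE contains the true mean (0)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        m = sum(sample) / n
        sd = sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))  # sample SD
        if abs(m) <= crit * sd / sqrt(n):
            hits += 1
    return hits / reps

print("n=5,  normal critical value:", coverage(5, Z_CRIT))      # expected to fall short of 0.95
print("n=5,  t critical value:     ", coverage(5, T_CRIT_DF4))  # expected to be close to 0.95
print("n=50, normal critical value:", coverage(50, Z_CRIT))     # the gap largely closes
```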
Thus, we have a way for estimating the conditional P(data|hypothesis) in a way that depends on sample size. It is from this point onwards that the different statistical procedures diverge in their prescriptions.