The Bayesian Approach
A Bayesian analysis of the data from a simple experimental design is straightforward. Recall that Bayes’ theorem implies that the posterior is proportional to the likelihood weighted by the prior. Because a hypothesis corresponds to a hypothesized probability distribution, you therefore have to multiply each probability in your expected distribution by the corresponding probability in your obtained distribution. For example, when trying to estimate a population mean, a “hypothesis” could be “height is normally distributed and the mean height is 170 cm and SD=5”. This would entail that, for example, 5 randomly sampled people all being above 200 cm has a very small probability, such that the likelihood would cause the posterior distribution to shift the mean to the right. Let us take it step by step:
- Suppose you have a between-group design with a treatment group, to which you administered a certain dose of a drug, and a control group, which received no potent substance.
- You assume a priori, based on previous literature, that the effect will be roughly normally distributed, so that values vary in their probability in a symmetrical manner. You also believe, a priori, that the central value will be 3 and that a value within ±1 of it would have a 68% chance of occurring (i.e. the standard deviation is 1). This defines your prior distribution. (Note that prior distributions are in practice difficult to construct. Psychological theories, for example, rarely imply a particular P(effect|theory) distribution.)
- You obtain your sample distribution, with its own mean and standard deviation (the standard error). This is your likelihood distribution. Each value on its horizontal axis is its own hypothesis of the mean, so the evidence clearly favors the hypothesis population mean=sample mean the most.
- You multiply prior and likelihood for every parameter value (and adjust to make sure that the posterior distribution’s area is equal to 1). This is your posterior distribution. Note that there are simple formulae available to skip the hassle of calculating posteriors for each parameter value.
- The posterior distribution can be summarized by a “credibility interval” – the range that has a 95% probability of including the true effect. If we assume the interval is centered on the posterior mean M1, and the posterior has standard deviation S1, it runs from (M1 – 1.96 × S1) to (M1 + 1.96 × S1).
- If we are interested in different hypotheses – that is, different possible mean values – we can compare them using a ratio: P(H1|D) / P(H0|D) = P(D|H1) / P(D|H0) × P(H1) / P(H0). The middle term, P(D|H1) / P(D|H0), is the “Bayes factor”: it converts the prior odds into the posterior odds. If it is more than 1, the data support the numerator hypothesis more than the denominator hypothesis, nudging your belief in the former’s direction. It is often informative to report how the Bayes factor depends on different priors, so that other researchers can choose the Bayes factor value based on their own priors.
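The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the prior (mean 3, SD 1) comes from the text, while the sample mean and standard error are made-up numbers, and the function names are hypothetical. The first function uses the “simple formulae” for a normal prior and a normal likelihood; the second does the brute-force multiplication over a grid of parameter values, to show that the two agree.

```python
from statistics import NormalDist

def normal_posterior(prior_mean, prior_sd, sample_mean, standard_error):
    """Shortcut formula: normal prior times normal likelihood gives a normal posterior."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = 1.0 / standard_error ** 2
    post_precision = prior_precision + data_precision
    post_mean = (prior_precision * prior_mean + data_precision * sample_mean) / post_precision
    post_sd = post_precision ** -0.5
    return post_mean, post_sd

def grid_posterior(prior, likelihood, grid, step):
    """Brute force: multiply prior and likelihood at each value, then rescale to area 1."""
    unnormalised = [prior.pdf(x) * likelihood.pdf(x) for x in grid]
    area = sum(unnormalised) * step
    return [u / area for u in unnormalised]

# Prior from the text; the sample values below are invented for illustration.
prior_mean, prior_sd = 3.0, 1.0
sample_mean, standard_error = 2.0, 0.5

m1, s1 = normal_posterior(prior_mean, prior_sd, sample_mean, standard_error)
credibility_interval = (m1 - 1.96 * s1, m1 + 1.96 * s1)

# Same posterior mean, obtained by multiplying the two curves point by point.
step = 0.001
grid = [i * step for i in range(-2000, 8001)]
density = grid_posterior(NormalDist(prior_mean, prior_sd),
                         NormalDist(sample_mean, standard_error), grid, step)
grid_mean = sum(x * d for x, d in zip(grid, density)) * step
```

With these numbers the posterior mean lands between the prior mean and the sample mean, closer to the sample mean because the data are more precise than the prior, and the posterior SD is smaller than either input.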
Before proceeding, we should highlight a few conceptual points:
- Lacking any prior means that you have no idea of the mean. In terms of the prior distribution, this means that its standard deviation should be infinite. We call this a “flat prior”. There is a mathematical caveat to this, which is that when we apply a non-linear transformation to that variable (for example, its inverse) it will not remain flat. For example, if our variable is the male:female ratio of a country’s population, and we assume it to be flat out of complete ignorance, then the female:male ratio will not be uniformly distributed. Hence, complete ignorance should mean an inability to assign any kind of prior (and in science, information corresponding to the prior is typically unavailable). There is an approach called “objective Bayesian” that tries to solve this.
- If your prior is wide (i.e. diffuse), the posterior will be dominated by the likelihood, and if it were flat, the posterior would be the same as the likelihood.
- The credibility interval of the posterior will normally be narrower than that of the prior, and will continue to narrow as you collect more data. With more data, you therefore gain in precision, and different priors will converge on the same posterior.
- A Bayes factor around 1 means that the experiment was not very effective in differentiating between the hypotheses, that is, in picking up a difference (it is insensitive). This may be a result of vague priors. Intuitively, this penalty makes sense: a vague theory can be supported by data very different from what it predicts, and because the Bayes factor is low, you will not be inclined to accept it.
- Your posterior only concerns hypotheses you have conceived of. As we learned in the introduction, this does not exhaust the space of possible theories.
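The claims that the posterior narrows with more data, and that different priors converge on the same posterior, can be checked numerically. The sketch below assumes a normal prior, a known population SD and a fixed sample mean; all numbers and names are hypothetical and chosen only for illustration.

```python
from math import sqrt

def posterior(prior_mean, prior_sd, sample_mean, sigma, n):
    """Normal prior combined with the mean of n observations (population SD sigma assumed known)."""
    standard_error = sigma / sqrt(n)
    w_prior = 1.0 / prior_sd ** 2          # precision of the prior
    w_data = 1.0 / standard_error ** 2     # precision of the data
    mean = (w_prior * prior_mean + w_data * sample_mean) / (w_prior + w_data)
    sd = (w_prior + w_data) ** -0.5
    return mean, sd

sample_mean, sigma = 2.0, 5.0              # made-up data summary
sample_sizes = [10, 100, 1000, 10000]

gaps, widths = [], []
for n in sample_sizes:
    mean_a, sd_a = posterior(3.0, 1.0, sample_mean, sigma, n)   # prior from the text
    mean_b, sd_b = posterior(0.0, 2.0, sample_mean, sigma, n)   # a second, different prior
    gaps.append(abs(mean_a - mean_b))      # disagreement between the two posteriors
    widths.append(2 * 1.96 * sd_a)         # width of the 95% credibility interval
```

Both `gaps` and `widths` shrink as n grows: the two analysts end up with practically the same posterior, and that posterior becomes more precise.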
But the most important take-home point is that, in the Bayesian approach, beliefs are adjusted continuously. You may continue to collect data for as long as you want to, until the credibility interval has the precision you require. The only information you need to update your belief is the likelihood (a controversial idea known as the “likelihood principle”). As a consequence, you can quantify the subjective probability of the hypotheses you have considered, but have no black-and-white criteria to support your decision making. However, you could stipulate, somewhat arbitrarily, that “to confirm a theory, the Bayes factor must be 4” and then collect data until you have 4 (or ¼, in support of the other hypothesis).
The Likelihood Analysis Approach
The Bayes factor assisted you in optimal guesswork: how much you should, subjectively, favor one hypothesis over the other, in light of the available evidence. We also saw that assigning priors is theoretically and, above all, in practice very difficult. An important point here is that, irrespective of priors, we could quantify the obtained data’s impact on how we update probabilities based only on the likelihood ratio. In other words, we can sidestep the prior-problem by simply considering how much the evidence favors one hypothesis over the other, and this measure, the relative evidential weight, is conceptually distinct from that of subjective beliefs.
For example, let us say that we have two different hypotheses concerning the probability distribution of some population. One of them is that the proportion of males is 50%, the other hypothesis states that it is 100%. These are your hypotheses – there are no priors involved now. You take 5 random samples, all of which are male. You proceed to calculate the likelihood ratio P(5 male | 100% male) / P(5 male | 50% male) = 1/(0.5^5) = 32. This means that the evidence favors the numerator hypothesis 32 times more than it favors the denominator hypothesis, because that hypothesis predicted the data 32 times more strongly. According to Richard Royall, a strong proponent of the likelihood approach, 32 happens to be a reasonable threshold for “strong relative evidence”, with 8 being “fairly strong”.
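The arithmetic in this example is small enough to verify directly. A minimal sketch (the function name is our own):

```python
def likelihood(p_male, n_male, n_total):
    """Probability of this particular sample if the true proportion of males is p_male."""
    return p_male ** n_male * (1 - p_male) ** (n_total - n_male)

# Five people sampled, all male.
ratio = likelihood(1.0, 5, 5) / likelihood(0.5, 5, 5)
strong_evidence = ratio >= 32   # Royall's suggested threshold for strong relative evidence
```

Note that no prior appears anywhere in the calculation: only the two likelihoods are needed.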
We may plot a hypothesis’ likelihood to see how it changes as a function of sample size (for example, if the above example had a hypothesis of 0.8, this function would be the Bernoulli function 0.8^n × 0.2^(m-n), where the sample has size m and n of them are the outcome believed to have probability 80%), and given sample size, we can plan how many samples need to have a certain outcome in order to get a certain likelihood ratio. For example, given a sample of 20 people, to get a likelihood ratio greater than 8, 15 people need to get the 80% outcome to favor that hypothesis ((0.8^15 × 0.2^5) / (0.5^15 × 0.5^5) > 8). For two hypotheses that differ by more than 0.3, fewer subjects would be required for this strength of evidence.
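The planning step in the 20-person example can be sketched as a simple search: for each possible number of people showing the outcome, compute the likelihood ratio of the 0.8 hypothesis against the 0.5 hypothesis, and stop at the first count that exceeds the threshold. The function names are hypothetical.

```python
def likelihood(p, n_hits, n_total):
    """Likelihood of n_hits outcomes out of n_total under outcome probability p."""
    return p ** n_hits * (1 - p) ** (n_total - n_hits)

def min_hits_for_ratio(p_favored, p_rival, n_total, threshold):
    """Smallest number of hits whose likelihood ratio (favored vs. rival) exceeds threshold."""
    for n_hits in range(n_total + 1):
        ratio = likelihood(p_favored, n_hits, n_total) / likelihood(p_rival, n_hits, n_total)
        if ratio > threshold:
            return n_hits
    return None  # the threshold cannot be reached with this sample size

needed = min_hits_for_ratio(0.8, 0.5, 20, 8)
ratio_at_needed = likelihood(0.8, needed, 20) / likelihood(0.5, needed, 20)
```

Because the ratio grows with every additional hit when the favored probability is the larger one, the first count that passes the threshold is also the minimum.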
We could also hold sample size fixed, and let the parameter-values vary (like the function θ^5(1-θ)^5). Given this latter likelihood function, we could calculate the equivalent of a credibility interval. Going by Royall’s recommendation that a ratio of 32 constitutes strong evidence, then all the θ-hypotheses with a likelihood value of more than 1/32 of the highest likelihood would have roughly the same strength of evidence. The “likelihood interval” therefore indicates the range of strongly favored hypotheses.
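As a rough sketch, such a likelihood interval can be found by scanning a grid of θ values and keeping those whose likelihood is more than 1/32 of the maximum. The example assumes 5 outcomes of each kind in a sample of 10; the grid resolution and the names are arbitrary choices.

```python
def likelihood(theta, n_hits=5, n_misses=5):
    """Likelihood of theta for a fixed sample of 5 hits and 5 misses."""
    return theta ** n_hits * (1 - theta) ** n_misses

def likelihood_interval(k=32, step=0.001):
    """Range of theta values whose likelihood exceeds 1/k of the maximum likelihood."""
    grid = [i * step for i in range(int(round(1 / step)) + 1)]
    values = [likelihood(theta) for theta in grid]
    cutoff = max(values) / k
    inside = [theta for theta, value in zip(grid, values) if value > cutoff]
    return min(inside), max(inside)

lower, upper = likelihood_interval()
```

For this sample the interval runs from roughly 0.15 to 0.85, symmetric around the best-supported value of 0.5. With only 10 observations, a wide range of hypotheses remains reasonably well supported.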
Again, there are a couple of conceptual points to highlight:
- The ratio only requires two bits of information: the two likelihoods. It is unaffected by other changes in the population distribution (which is not the case for the Neyman-Pearson approach we will soon describe).
- Again, you can continue to collect data for as long as you want, since likelihoods only depend on the product of each event’s probability.
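- The evidence, of course, can always be misleading, even though we interpret it in an appropriate way. However, the probability of obtaining a ratio of at least k in favor of the false hypothesis is bounded by 1/k (this follows because, when the other hypothesis is true, the expected value of that ratio is 1), so strong misleading evidence is rare.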
- If a likelihood interval does not include a hypothesis, this is not simply evidence against it, since it is relative to some other hypothesis, so we cannot reject it on those grounds.
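The 1/k bound on misleading evidence can be checked exactly for a small case. The sketch below assumes the true proportion is 0.5, the false rival hypothesis is 0.8 and the sample size is 20 (values borrowed from the earlier example; the function names are our own). It adds up the probability of every outcome that would yield a likelihood ratio of at least k in favor of the false hypothesis.

```python
from math import comb

def likelihood_ratio(p_false, p_true, hits, n):
    """Likelihood ratio in favor of the false hypothesis for a given number of hits."""
    numerator = p_false ** hits * (1 - p_false) ** (n - hits)
    denominator = p_true ** hits * (1 - p_true) ** (n - hits)
    return numerator / denominator

def prob_misleading(p_true, p_false, n, k):
    """Exact probability, under the true hypothesis, of a ratio >= k for the false one."""
    total = 0.0
    for hits in range(n + 1):
        if likelihood_ratio(p_false, p_true, hits, n) >= k:
            total += comb(n, hits) * p_true ** hits * (1 - p_true) ** (n - hits)
    return total

results = {k: prob_misleading(0.5, 0.8, 20, k) for k in (8, 32)}
within_bound = all(prob <= 1 / k for k, prob in results.items())
```

In this case the actual probabilities are well below the bound, which is typical: 1/k is a worst-case limit, not an estimate.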
The Neyman-Pearson Approach
The basic logic
We have seen the Bayesian schema inform us about how to update our beliefs, and the likelihood analysis schema disregard the priors and instead tell us about relative evidential strength. While it is possible to set up decision criteria thresholds, such as “Bayes factor above 4” or “likelihood ratio above 32”, an argument could be made that these approaches do not support behavioral, binary decisions very well – such as confirming or rejecting a hypothesis – because, conceptually, wise decision-making should not necessarily be based on beliefs or relative evidence. Rather, the top priority could be to control how often, on average, we will make the wrong decision. This is the logical basis of the methodologically more complex yet by far most widespread statistical procedure, known as the Neyman-Pearson approach, or “Null hypothesis significance testing”. It was institutionalized in the 1950s and often described as a backbone of statistics, when, in reality, it is just one instrument of many in the toolbox that is statistics.
Central to this logic is the notion of the testing procedure itself being part of an infinite population, called a “reference class”. For example, the toss of a coin could be considered as a sample from a collective of all possible outcomes. If, in this reference class, “head” occupies 40%, then this will be reflected in a 40% long-run frequency of heads. This long-run frequency is what defines probability, such that a single event cannot meaningfully be said to have one – only the reference class has a probability attached to it. This idea is often described as meaning that probability is “objective”, but how the reference class is conceived is dependent on our subjective information. If we knew the aerodynamics surrounding the coin, the reference class would be narrowed down to the indefinite number of realizable tosses with those particular wind conditions. More importantly, this conceptualization implies that, because a hypothesis has no reference class (it is not part of a well-defined space of possible hypotheses), it does not have an objective probability. It is either true or false. Our test’s reference class meanwhile, has one, so our decision as to whether a hypothesis is true or false will have a long-run error attached to it. In essence, therefore, we calculate “If we were to conduct the exact same test and procedure again, with the same amount of subjective ignorance and information, an infinite number of times, how often would the true/false test give the wrong answer?”.
The true/false test in question is performed not on the hypothesis we are curious about, called the “alternative hypothesis”, but on the “null hypothesis”. This is occasionally defined as the one that is most costly to reject falsely – our default explanation that we don’t abandon unless we have to, because it is simple and approximates well. In practice, it is often, but far from always, the hypothesis of no difference (a nil hypothesis such as “all the samples from different conditions come from the same population”). As such, the test is somewhat equivalent to a mathematical type of proof known as reductio ad absurdum, in which a statement is assumed to be true, and if valid manipulation of it results in self-contradiction, this means that the original statement must have been false. Except, instead of self-contradiction, the rejection-criterion is now based on extreme improbability. In other words, if the outcome is very improbable by happenstance, i.e. to have occurred as a result of chancy fluctuations, then this suggests that the experimental manipulation (or some other, unknown nonrandom influence) must be held accountable for it. We may say that an observed difference is significant, meaning that we reject the null, without directly quantifying the probability of the alternative hypothesis, since – given that only reference classes have probabilities – this would be meaningless.
We thus actively seek disconfirming evidence, something that we intuitively are very poor at, and place the burden on the alternative explanation, which costs more in terms of parsimony compared to the default (“it’s all due to chance” is more parsimonious). Rejecting the default hypothesis on the basis of low probability means that we again are concerned with a conditional probability. However, whereas in Bayesian and likelihood approaches, we calculated the likelihoods P(observed difference | hypothesis is true), holding the observed difference constant and letting the hypothesis vary, we are now concerned with the conditional probability P(getting as extreme or more extreme data | null is true), in which the hypothesis is fixed as null so that we don’t consider every single parameter value, and the “possible data” is allowed to vary.
Moreover, we are no longer interested in the heights, but in areas, since the probability of the obtained data for a continuous density function is infinitesimally small, so we are interested in a low-probability range. Somewhat arbitrarily, this “rejection region” has by convention been set to 5%. This refers either to the two extreme 2.5% in both directions, in which case the test is two-tailed, or the most extreme 5% in one direction, in which case the test is one-tailed, and any effect in the other direction won’t be picked up. This makes the calculated probability dependent on unobserved things (choice of test), so that identical data can result in different conclusions. In effect, these are two different ways of calculating the conditional probability P(getting as extreme or more extreme data | null is true).
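To see how identical data can lead to different conclusions depending on the choice of test, consider a standardized test statistic under a standard normal null distribution. The value z = 1.8 below is an arbitrary, made-up observation, and the function name is hypothetical.

```python
from statistics import NormalDist

def p_values(z):
    """One-tailed (effect predicted in the positive direction) and two-tailed p-values."""
    null = NormalDist()                      # null distribution of the test statistic
    one_tailed = 1 - null.cdf(z)             # area beyond z in the predicted direction
    two_tailed = 2 * (1 - null.cdf(abs(z)))  # area beyond |z| in both directions
    return one_tailed, two_tailed

one_tailed, two_tailed = p_values(1.8)
reject_one_tailed = one_tailed < 0.05
reject_two_tailed = two_tailed < 0.05
```

Here the one-tailed test rejects the null while the two-tailed test does not, although the observed data are exactly the same.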
We – given the population distribution – calculate the probability of obtaining our sample or a more extreme one. This probability is called the p-value. If we reject the null whenever it is less than 0.05, then, in cases where the null is actually true, we will make this mistake in only 5% of future replications of this test. The 5% level, called alpha, is therefore a decision criterion set by ourselves in advance that gives us the proportion of false alarms we would get if we were to perform this test again and again if the null were true. It is an objective probability. The long-term behavior, alpha, is everything we know. Note that it is not correct to define alpha as “the probability of false alarm error”, since this could be interpreted to mean the probability of false alarm after rejecting the null. If you obtain p < 0.011, it is a mistake to say that “My false alarm rate is less than 1.1%” or that “98.9% of all replications would get a significant result” or that “There is a higher probability that a replication will be significant”, since a single experiment cannot have a false alarm rate. Alpha is a property of the procedure, not of one particular experiment.
Confusing the p-value with the probability of a fluke is called the “base rate fallacy”. It ignores the fact that the probability of a significant result being a fluke depends on the prior distribution of real effects. If you perform 100 tests at alpha=0.05, then you would expect about 5 false positives. Suppose that, in total, you obtained 15 significant results: the fraction of them that are false positives (a number known as the “false discovery rate”) would then be 5/15=33%. The lower the base rate (the fewer cases in which a real effect exists), the larger the share of significant results that are false positives. Therefore, in data-heavy research areas like genomics or early drug trials, because the base rate is so low, the vast majority of significant results are flukes.
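The dependence of the false discovery rate on the base rate can be made concrete with a small calculation. The base rates and the power of 0.8 used below are assumptions chosen for illustration, not figures from any particular field, and the function name is our own.

```python
def false_discovery_rate(n_tests, base_rate, power, alpha=0.05):
    """Expected fraction of significant results that are false positives.

    base_rate: fraction of tested hypotheses for which a real effect exists.
    power:     probability of detecting a real effect when there is one.
    """
    n_real = n_tests * base_rate
    n_null = n_tests - n_real
    true_positives = n_real * power
    false_positives = n_null * alpha
    return false_positives / (true_positives + false_positives)

text_example = 5 / 15                                # the 5-out-of-15 case from the text
fdr_common = false_discovery_rate(100, 0.10, 0.8)    # real effects in 10% of tests
fdr_rare = false_discovery_rate(100, 0.01, 0.8)      # real effects in 1% of tests
```

With real effects in 10% of tests, about a third of the significant results are flukes. When real effects are present in only 1% of tests, flukes make up the large majority, even though alpha is 0.05 in both cases.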
There is a similar risk of failing to detect real effects, called beta and expressed as P(accepting the null | null is false). Like alpha, beta is an objective probability decided upon in advance of the experiment. Given the beta, the expected effect size, and the expected amount of noise, you can calculate the sample size required to keep the beta at this predetermined level. Because large sample sizes normally are expensive, the relationship between alpha and beta is usually a tradeoff: we can reduce the probability of false alarms by requiring a p-value below 1%, but only by increasing the probability of missing a true effect. Though in practice they often are, alpha and beta are not meant to be picked as a mindless ritual, but carefully chosen based on an experiment-specific cost-benefit analysis. False alarms are typically presumed to be more costly, but this is not always the case: in quality control, failure to detect malfunctioning is often a higher priority.
The complement of beta (1 – beta) is the probability of picking up an effect of the expected size. It is called “power” and can be thought of as the procedure’s sensitivity. Usually, a power of 0.8 or higher is considered sufficient, but in practice it is seldom achieved or even calculated. We mentioned previously that the hypotheses tested in the most common statistical tests tend to be low in content, because they only predict “there will be a difference” without specifying the effect size. The effect size is therefore not only needed for power-calculations, but also makes the theory more falsifiable. In neuroscience, the median study has a power of only 0.2, with the hope that meta-studies aggregating the results will compensate for it. A non-significant result in an underpowered study is meaningless, because the study never stood a chance of finding what it was looking for. Underpowered studies also run the risk of “truth inflation” – observed effect sizes vary randomly, and if the power is low, only effects that by chance are very large will be statistically significant. By reporting such an effect size, you over-estimate the magnitude that future studies will find, as it regresses towards the mean.
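A power calculation of this kind can be sketched with the common normal-approximation formula for comparing two group means, n per group = 2 × ((z for alpha + z for power) / d)^2, where d is the expected difference in units of the standard deviation. This formula is an assumption brought in for illustration (exact t-test calculations give slightly larger numbers), and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(effect_size_d, alpha=0.05, power=0.8):
    """Approximate subjects per group for a two-tailed, two-group comparison."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for the chosen alpha
    z_power = NormalDist().inv_cdf(power)           # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)

n_medium = sample_size_per_group(0.5)               # expected effect of half a standard deviation
n_small = sample_size_per_group(0.2)                # smaller expected effect
n_strict = sample_size_per_group(0.5, alpha=0.01)   # stricter alpha, same power
```

The tradeoff described above shows up directly: a smaller expected effect or a stricter alpha both raise the required sample size if power is to stay at 0.8.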
Details of the reference class
The essence of null-hypothesis testing thus is to control long-term error, and to do this, the test procedure needs to be specified in finest detail, so that the reference class – the infinite number of replications that never happened – is well-defined. This includes the number of tests for different hypotheses we perform as part of the whole testing procedure – the “family” of tests. Each test does not stand on its own. The reference class then comprises an infinite number of replications of this whole family, so if we still wish to have false alarms in only 5% of our replications, then we need to provide for the fact that several tests increase the chance of a misleading significant result being found somewhere in the procedure. This inflation of the “family-wise error rate” can be understood as follows: if, for each test in the family, the probability of not making an error is (1-alpha), then the probability of making at least 1 error in k independent tests is the complement of (1-alpha)^k, i.e. 1-(1-alpha)^k, which increases with k. Thus, without correction, the overall long-term false alarm rate is no longer controlled. Intuitively, by exposing ourselves to more chance events for each trial, there is more opportunity for chance to yield a significant result, and to curb this, we enforce much more conservative criteria. The most common and least sophisticated way to control this error rate is to let each test’s alpha be 0.05/k, something known as Bonferroni correction.
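The growth of the family-wise error rate, and the effect of the Bonferroni correction, can be tabulated in a few lines. The sketch assumes the k tests are independent and that the null is true in every one of them; the function names are our own.

```python
def family_wise_error_rate(alpha, k):
    """Probability of at least one false alarm in k independent tests of true nulls."""
    return 1 - (1 - alpha) ** k

def bonferroni_alpha(alpha, k):
    """Per-test alpha that keeps the family-wise rate at or below alpha."""
    return alpha / k

family_sizes = [1, 5, 10, 20]
uncorrected = [family_wise_error_rate(0.05, k) for k in family_sizes]
corrected = [family_wise_error_rate(bonferroni_alpha(0.05, k), k) for k in family_sizes]
```

Without correction, a family of 10 tests already has roughly a 40% chance of producing at least one false alarm. With Bonferroni-corrected thresholds the family-wise rate stays at or just below 5%, at the price of a much stricter criterion for each individual test.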
The presence of multiple comparisons can be subtle. Consider, for example, how in neuroscientific studies, you test for increased activity in every 3D-pixel of the brain – the required multiple comparison correction is massive. There is, furthermore, an intuitively disturbing aspect to the concept of a test family. Bonferroni correction reduces statistical power dramatically. If the researcher had planned a priori to only perform one comparison (and pre-registered her statistical protocol), then the reference class can be re-defined as “Collect the data, perform the planned comparison at the set alpha and then perform the other tests for curiosity’s sake, Bonferroni-corrected”. Also, if the same comparisons had been part of different experiments, there would be no need for correctives. Again, the reason for this eerie feature is that the reference class needs to be well-defined.
Another reference class specification is that of when to stop testing, “stopping rules”. Suppose we recruit subjects gradually and perform tests as we go along – a procedure known as “sequential analysis”. On the one hand, testing until you have a significant effect is obviously not a good stopping rule, since you are guaranteed to get a significant effect if you wait long enough. On the other hand, in medical research sequential testing is often ethically required, and certain correctives are used to account for multiple testing. Still, we cannot tell whether we stopped due to luck or due to an actual effect, so the effect is likely to be inflated. Notably, whereas in the Bayesian and likelihood approaches, we could continue to collect data for as long as we wanted, now, if we get a non-significant result for our 40 subjects and decide to test another 10 subjects, the p-value would refer to “Run 40 subjects, test, if not significant, run another 10”. The false alarm rate of this procedure is bound to be above 0.05, since at n=40, there was a 5% chance of false alarm, and at n=50 there was another opportunity for false positives. Meanwhile, if a colleague decides in advance to test 50 patients, there is no need for adjustments, because his test is a member of another reference class. Different stopping rules that coincidentally agree to stop at the same size can thus lead to different conclusions.
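The effect of the “run 40, test, if not significant run another 10” rule on the long-run false alarm rate can be estimated by simulation. The sketch below assumes the null is true, scores are normally distributed with a known SD of 1, and a two-tailed z-test is used at each look; the seed and number of runs are arbitrary, and the function name is hypothetical.

```python
import random
from math import sqrt

def false_alarm_rate(n_first=40, n_extra=10, z_crit=1.96, runs=20000, seed=1):
    """Share of simulated null experiments that end with a 'significant' result."""
    rng = random.Random(seed)
    false_alarms = 0
    for _ in range(runs):
        # Under the null each score is N(0, 1), so the sum of n scores is N(0, n).
        total = rng.gauss(0.0, sqrt(n_first))
        significant = abs(total / sqrt(n_first)) > z_crit
        if not significant and n_extra > 0:
            # Not significant: recruit the extra subjects and test again on all of them.
            total += rng.gauss(0.0, sqrt(n_extra))
            significant = abs(total / sqrt(n_first + n_extra)) > z_crit
        false_alarms += significant
    return false_alarms / runs

optional_stopping = false_alarm_rate(n_first=40, n_extra=10)
fixed_sample = false_alarm_rate(n_first=50, n_extra=0)
```

The colleague who fixed n=50 in advance gets a false alarm rate close to the nominal 5%, while the two-look procedure ends up noticeably above it, even though both may stop with exactly 50 subjects.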
The p-value is the probability of obtaining such data (or more extreme data), the probability of the t-statistic. Therefore, a very small p-value does not indicate a larger effect size. A small difference can be significant while a large difference is insignificant. Nor is the p-value a measure of how important the effect is. Another frequent mistake is to interpret a p-value as a Bayesian posterior (as a probability of the hypothesis being true), or as an indicator of evidential strength. Both the virtue and the weakness of the Neyman-Pearson approach lie in just how mercilessly binary it is. If p=0.051, accept null. If p=0.049, reject null. If p=0.00.., reject null too. There is no such thing as “different degrees of significance”, and you are not justified in switching alpha to 0.001 if you get a p-value below that value, as this would undermine its purpose. For the Neyman-Pearson framework, the p-value contains no information other than whether it is past the critical threshold. Because it depends on things other than strength, such as stopping rules and whether the test is one-tailed or two-tailed, it cannot be used as an index for evidential strength. If you decide to test more after a non-significant result, the p-value would increase, indicating less evidential strength, even though intuitively evidential strength would also increase.
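That a small difference can be significant while a large one is not follows from the role of sample size in the test statistic. A minimal sketch, assuming two equally sized groups, a known SD of 1 and a two-tailed z-test (the effect sizes and group sizes are invented, and the function name is our own):

```python
from math import sqrt
from statistics import NormalDist

def two_tailed_p(mean_difference, sigma, n_per_group):
    """Two-tailed p-value for a difference between two group means (known SD)."""
    standard_error = sigma * sqrt(2 / n_per_group)
    z = mean_difference / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

p_small_effect = two_tailed_p(0.1, 1.0, 2000)   # tiny difference, very large groups
p_large_effect = two_tailed_p(0.8, 1.0, 8)      # large difference, very small groups
```

The tiny difference comes out clearly significant and the large one does not, which is why the p-value cannot be read as a measure of effect size or importance.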
Another important aspect is that two tests, say A vs. C and B vs. C, cannot be interpreted as “A was significantly better than C, while B was not significantly better than C, thus A is better than B” without directly comparing A and B. This is because the alpha-level is arbitrary (one result could be slightly below it, the other slightly above) and because statistical power is limited.
There is an alternative, more informative and straightforward reporting strategy known as “confidence intervals”. By calculating the set of mean values that are non-significantly different from your sample mean at your chosen alpha, mean ± 1.96 S.E., you will obtain an interval that, 95% of the times that you replicate the procedure, will include the true population mean. In essence, it gives you the range of answers consistent with your data. If it includes both the null-predicted mean and the alternative-predicted mean, the result is non-significant, but it also tells you directly that the sensitivity is insufficient. Moreover, the width of the interval indicates your precision. If it is narrow and includes zero (null), the effect is likely to be small, while if it is wide the procedure may be too imprecise to make any inference. Widths, like data, vary randomly, but at the planning stage of the experiment, it is possible to calculate the required sample size so that the interval will be of a desired width 95% of the time (a number known as “assurance”). Confidence intervals are particularly useful for comparing differently sized groups, since the certainty of a large sample size will be reflected in that interval’s precision. For these reasons, confidence intervals are preferred, but even in a journal like Nature, only 10% of the articles report confidence intervals, perhaps because the finding may seem less exciting in light of its width, or because of pressures to conform.
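Computing such an interval from raw data takes only the sample mean and its standard error. The scores below are invented for illustration and the function name is hypothetical. The multiplier 1.96 follows the text; it is a large-sample approximation, and with a sample this small a t-based multiplier would give a somewhat wider interval.

```python
from math import sqrt
from statistics import mean, stdev

def confidence_interval(sample, z=1.96):
    """Approximate 95% confidence interval for the population mean: mean +/- z * S.E."""
    sample_mean = mean(sample)
    standard_error = stdev(sample) / sqrt(len(sample))
    return sample_mean - z * standard_error, sample_mean + z * standard_error

# Hypothetical difference scores (treatment minus control) for ten subjects.
scores = [0.4, 1.3, -0.2, 0.9, 1.1, 0.3, 0.8, -0.1, 0.6, 1.0]
lower, upper = confidence_interval(scores)
includes_null = lower <= 0.0 <= upper   # is "no difference" consistent with the data?
width = upper - lower                   # a direct indication of precision
```

Reporting `lower` and `upper` conveys both the decision (zero lies outside the interval, so the result is significant at the 5% level) and the precision of the estimate, which a bare “p < 0.05” does not.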
Comparisons between approaches
In his book “Rationality for Mortals”, psychologist Gerd Gigerenzer compared the three approaches to the three Freudian selves in unconscious conflict. The Bayesian approach corresponds to the instinctual Id, who longs for an epistemic interpretation and wishes to consider evidence in terms of the hypothesis-probabilities. The likelihood (or “Fisher”) approach corresponds to the pragmatic Ego, which, in order to get papers published, ritualistically applies an internally incoherent hybrid-approach that ignores beta and power-calculations and determines samples by some rule of thumb. Finally, the Neyman-Pearson approach corresponds to the purist Superego, which conscientiously sets alpha and beta in advance, reports p-values as p<0.05 rather than p=0.0078 and is aware that the p-value does not reflect a degree of confidence, but rather serves as a decision-supporting quality control.
The most important conceptual difference between Bayes/likelihood and Neyman-Pearson is that the former espouse “the likelihood principle” – the likelihood contains all the information required to update a belief. It shouldn’t matter how the researcher mentally groups tests together, or what sample size he plans. Stopping rule, multiple testing and timing of explanation don’t matter, and credibility or likelihood intervals do not need to be modulated by them. The likelihood interval will cluster around the true value as data is gathered, regardless of the stopping rule. However, standards of good experimental practice should still apply. Suppose, for example, your Bayesian brain hypothesized “It is dark in the room” and sampled information only when the eyes are closed. Due to a failure to randomize sampling and differentiate between the two hypotheses “Eyelids down” and “Dark outside”, it is likely to arrive at the wrong beliefs.
Importantly, the approaches sometimes lead to very different results. As shown below, the Neyman-Pearson approach could accept the null hypothesis in cases where the evidence clearly supports the alternative hypothesis.
Selection effects in scientific practice
A number of criticisms have been raised against the Neyman-Pearson approach. One is that an internally incoherent hybrid-version is often taught, in which p-values are reported as p=0.0001, even if the alpha a priori was set at 0.05, which is the only thing that should matter. Another is that the hypotheses it tests are typically low in content. A third is that it invites cheating, since the Bonferroni penalties for collecting more data after a non-significant result mean that a lot of expensive data may be wasted. A fourth is that it is often poorly implemented, without prior power-calculations. Combined with the “truth inflation” phenomenon, this means that real effects whose reported sizes were exaggerated may be unfairly dismissed, because the replication will have a power based on the initial effect size, not the degraded one, and consequently be underpowered.
A fifth is that the binary nature of the Neyman-Pearson approach also invites “publication bias”, in which both scientists and journals prefer unexpected and significant results, as a kind of confirmation bias writ large. If an experiment with alpha=0.05 is replicated 20 times when the null is true, on average 1 of the replications will be significant due to chance. If 20 independent research teams performed the same experiment, one would be expected to get a significant result simply by chance, and that team would certainly not feel that it had merely been lucky. If journals choose only to publish significant results, neglecting the 19 insignificant replications, it fosters something like a collective illusion. Moreover, it is not unusual to tweak, shoehorn and wiggle results – or to be selective about what data to collect in the first place – to make it past that 5% threshold, “torturing data until it confesses”.
How can we unravel these illusions quicker? How can we attain maximum falsifiability, maximum evolvability, to accelerate progress?
- Meta-research: There are statistical tools for detecting data-wiggling, and a lot of initiatives aiming to weed out dubious results.
- Improving peer-review: Studies in which deliberately error-ridden articles have been sent out for peer-review indicate that it may not be a very reliable way to detect errors. Data is rarely re-analyzed from scratch.
- Triple-blindedness: Let those who perform the statistical tests be unaware of the hypothesis.
- Restructuring incentives: Replications are extremely rare because they are so thankless in an industry biased towards pioneering work. Science isn’t nearly as self-corrective as it wishes it were. We must reward scientists for high-quality and successfully replicated research, and not the quantity of published studies.
- Transparency: To avoid hidden multiple comparisons, encourage code- and data-sharing policies.
- Patience: Always remain critical of significant results until they have been robustly and extensively validated. Always try not to get too swept away by the hype.
- Flexibility: Consider other statistical approaches than Neyman-Pearson.
But ultimately, we have to accept that there is no mindless, algorithmic slot-machine path to approximate truth. Science writer Jonah Lehrer put it as follows: “When the experiments are done, we still have to choose what to believe.”