The Post Hoc and the A Priori
If you ever find yourself forced to summarize the ideas of Western philosophy in a single metaphor, the concept of a “self-modifying filter” may be your safest bet. Observation and measurement, according to this idea, are acts of filtering. A filter has a built-in bias – it lets some elements pass through but not others – and the categories embodied in this bias are the source of features and constancies in a universe that ultimately bathes in undifferentiated, structureless flux.
In the philosophical literature, the filter metaphor hides under the distinction between “a priori”/“deduction” and “a posteriori”/“induction”. The former involves facts that are definitional in nature, like mathematical proofs or the fact that all bachelors are unmarried, while the latter refers to facts derived from experience. For an appreciation of just how central these notions are, consider the following intellectual heavyweights:
- René Descartes, of “cogito ergo sum” fame, is known as a “rationalist” for emphasizing the importance of pre-formed knowledge in the acquisition of new knowledge.
- George Berkeley, in the early 1700s, advocated a view in which external reality did not exist unless perceived, and this stimulated different versions of “idealism”, according to which we actively create our world through mind-dependent categories.
- David Hume, writing a little later, pondered in his Treatise of Human Nature how sense impressions are separate events in the mind, with causal relationships not directly perceived but projected onto them. His “empiricist” brand is known for rejecting a priori principles as a source of knowledge about the world.
- Immanuel Kant’s prescient “Critique of Pure Reason” from 1781 is about how the mind’s a priori intuitions of space and time organize our sensations.
- Noam Chomsky entered the academic consciousness in the 1960s with his idea that babies are born with knowledge of how to transform linguistic expressions, and that their mother tongue merely sets the parameters that fine-tune this innate grammar.
The 20th century brought more sophisticated tools for studying the learning process in detail, and under such scrutiny, the deduction-induction dichotomy works best if we consider it a snapshot in time of what is actually a continuous feedback loop. From a psychological viewpoint, a priori knowledge could be said to refer to the offline mental manipulations we sometimes perform on our cognitive entities to lay bare facts latent in the way our concepts are stored. For example, if we, via environmental interactions, form the category “Men” with the core property of being mortal, and then proceed to categorize Socrates as a man, then the mortality associated with men will also pertain to Socrates. If Socrates turns out to be immortal, it becomes a matter of either adjusting the category’s properties or brushing the anomaly aside.
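The Socrates example can be caricatured as lookup in stored categories (the names and properties are merely illustrative): the “a priori” fact sits latent in the data structure, recoverable without any new observation.

```python
# A toy of "offline" a priori inference: a fact latent in how concepts are
# stored, deduced rather than observed. (Categories and properties are
# illustrative, not a model of actual memory.)
categories = {"Men": {"mortal": True}}
membership = {"Socrates": "Men"}

def infer(individual, prop):
    """Look up a property via the category the individual is filed under."""
    return categories[membership[individual]].get(prop)

print(infer("Socrates", "mortal"))  # True: deduced from storage, not observed

# A counterexample then forces a post hoc choice: adjust the category's
# properties, or re-file the individual under a different category.
```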
The most natural way to think of a filter is as a passive separator, but nothing prevents us from reversing figure and ground and conceptualizing the filter as an active inquirer. If so, the bias can be thought of as a hypothesis that the filter puts to the impinging dynamic, which feeds back a response indicating what category it belongs to. And importantly, there is no limit to the categorizations a filter could perform. Just like the mesh size of a fishing net determines the size of the fish caught, and the polarity of a cell membrane determines what particles may pass, we may divide mankind into genders or ethnicities, or maybe “people who like broccoli” and “people who don’t”, and in an infinite number of other ways.
For an inquirer, the range of hypotheses it can pose is equally limitless. There is an indefinite number of random variables whose outcome frequencies we may keep track of, of data we could collect, of relationships to potentially explore – but we cannot track them all, and should not, for that would undermine the filter’s very purpose. It would be equivalent to a filter that lets everything pass through and as a result accomplishes nothing – the filtrate would possess no more structure than the dynamic in its raw form.
Data therefore are selection effects, restricted by the finite number of hypotheses we select for our inquiries. This is a very, very important point, central to most of science, because believing our hypotheses to be exhaustive of the space of all conceivable hypotheses has historically led to fallacious conclusions, on matters that span everything from the atomic to the cosmic and theological. So, before we address the more mundane matters, like statistical procedures used at our own scale of existence, we might as well start off on a grandiose note.
The ill-defined possibility space
Firstly, selection effects have the potential to resolve much of the disquietude we feel regarding quantum nondeterminism. If a quantum experiment is repeated – such as firing a photon at a screen with two slits it could pass through – then the relative frequencies of different outcomes (left slit versus right slit) can be predicted, but each individual instance appears to be irreducibly governed by chance, with any “hidden variables” cleverly ruled out. The idea is that the particle passes through both slits in a state of “superposition”, described by a mathematical entity called the “wavefunction”, which states the probability of it being found anywhere in the universe. The wavefunction unfolds deterministically according to the Schrödinger equation. Then, upon observation (e.g. by turning on a particle detector behind the slits), the Copenhagen interpretation of Bohr and Heisenberg states that the wavefunction “collapses” into a unique state.
However, in 1957 Hugh Everett pointed out that nothing in the mathematics implies a collapse – Schrödinger’s equation could simply continue to evolve the wavefunction. The universe does not metaphysically “split” into non-interacting branches; the superposition remains a single wavefunction. But as soon as a superposed system transfers information to something else, like an air molecule, the superposition in effect becomes unobservable. By becoming correlated with the environment, it “decoheres”, and because the particle superpositions in neurons decohere faster than the neurons fire, we cannot experience parallelism at macro-scales.
To adapt an analogy by physicist Max Tegmark, we could imagine ourselves being unknowingly cloned into ten copies while asleep, with each clone waking up in a room with a different number on the wall (ranging from 0 to 9). Upon waking up, the number 6 on the wall would seem subjectively random, but if we had access to the parallel worlds, finding that each numeral is represented would make it feel deterministic. Similarly, in quantum experiments we only see one out of all the logically conceivable outcomes contained in the wavefunction. While the “many worlds” interpretation may not yet be empirically testable, it (bizarrely) has the virtue of parsimony, and it has become increasingly respectable.
Telescoping up to cosmic scales, awareness of selection effects forms one of the key methodological principles of cosmology. The cosmological origin story is that the early Universe was maximally simple, symmetric, and – at least in some places – low in entropy, but that expansion and consequent temperature fall caused these symmetries to break. According to the “inflationary hypothesis”, a special form of matter accelerated the expansion, causing some regions to inflate more than others. Therefore, in a different kind of multiverse theory, there could logically be an infinite number of other universes beyond the visible horizon, and ours just happened to be one that inflated enough to allow sentient, carbon-based life to evolve.
The point is: regardless of how intrinsically improbable it is, we would necessarily find ourselves in a Universe that can support us. According to what is known as the “Weak Anthropic Principle”, the observed universe should not be considered as coming from some unconstrained space of possible universes, but from the life-supporting subset thereof. If this principle is neglected, erroneous conclusions will be drawn. For example, Paul Dirac wanted to revise the law of gravitation in light of a coincidence between a constant of Nature and the age of the Universe, but without that coincidence, there would have been no Paul Dirac around to be preoccupied with such fine-tunings!
Selection effects also figure heavily in discussions regarding the eerie nature of mathematics and what physicist Eugene Wigner called its “unreasonable effectiveness” in physical predictions. Newtonian physics gives the impression that we live in a universe of perfect spheres and parabolas. There are many examples of mathematical curiosities that have been collecting dust for decades and suddenly find themselves elegantly applied to some newly discovered phenomenon, and high-energy physics portrays the fundamental laws as astoundingly tidy and integer-laden.
However, the Universe is more than deep symmetries – it would not be fully specified without its initial conditions. Something in their interaction appears to have generated a cosmos of perplexingly rich structure, dynamical systems and nested hierarchies, and unlike the deep symmetries themselves, their outcome is far from mathematically elegant. With the advent of computers and big data, there is a growing appreciation for just how disorderly the universe is. In biology and sociology, attempts at mathematical formalisms are taken as cartoonish simplifications.
Philosopher Reuben Hersh argues that this vision of a clockwork universe is an illusion arising from how we disproportionately focus on phenomena that are amenable to mathematical modelling. The concept of cardinal numbers (“put 5 balls, then 3 more, into a container. The prediction is that it will contain 8 balls”) would break down for water drops or gases – the prediction holds only because humans went on to invent the concepts of mass and volume. The tools were selected based on their predictive abilities. As the saying goes: “To a man with a hammer, everything looks like a nail.”
The problem is that there is no way for us to quantify all the “phenomena” in order to calculate what proportion of them are “orderly”. Nor can we conceive of a possibility space of different, logically consistent laws of physics, less tidy and integer-laden than ours, to see just how probable our universe is. Instead, we are restricted to the illusion selected for us – by quantum decoherence, post-Big Bang inflation, and our own perceptual affinity for mathematical elegance – forever to wonder what’s on the other side of the filter.
We are already familiar with Bayes, and with the theorem that, simply put, expresses the probability that a hypothesis is true as its fit with the evidence weighted by its prior probability.
Importantly, when we lack a clear idea of possible explanations (of the other side of the filter) and try to apply Bayes’ theorem to our reasoning, we are faced with the fact that there is no such thing as an unbiased prior. There is a more sophisticated mathematical reason for this, but we can see it more intuitively in theological arguments for the existence of God.
I was once told by a creationist: “If you were to shake a handful of sand and throw it all up in the air, and if it is all mindlessly random – as you say – then obviously it would be improbable to the point of impossible for it to fall by pure happenstance into something as well-organized as an organism?” The reasoning behind this watchmaker argument goes: “Given that there is no god, the structure we see would be improbable. Therefore, because the world is so stunningly assembled, we must conclude that there is a conscious God.” Here, the observation that we do exist is our data. Because we have no priors, we have to invoke equiprobable priors, such that “God exists” and “God does not exist” get 50% each. As for likelihoods, a creationist would say that P(humans|no God) is vanishingly small, say one in a trillion, while P(humans|God) is significantly higher, maybe one in a million.
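Plugging the argument’s own numbers into Bayes’ theorem (the likelihoods are rhetorical inventions, not measurements) shows how completely the conclusion is driven by those assumptions:

```python
# Bayes' theorem applied to the watchmaker argument, using the made-up
# numbers from the text: equiprobable priors, and the creationist's
# rhetorical likelihoods of one-in-a-million vs one-in-a-trillion.

def posterior(prior_h, likelihood_h, likelihood_not_h):
    """P(H|D) = P(D|H)P(H) / [P(D|H)P(H) + P(D|~H)P(~H)]."""
    prior_not_h = 1.0 - prior_h
    evidence = likelihood_h * prior_h + likelihood_not_h * prior_not_h
    return likelihood_h * prior_h / evidence

# P(God) = P(no God) = 0.5; P(humans|God) = 1e-6; P(humans|no God) = 1e-12.
p = posterior(0.5, 1e-6, 1e-12)
print(round(p, 6))  # 0.999999 -- the verdict follows almost entirely from
                    # the assumed likelihoods and the biased 50/50 partition
```

The computation is trivially valid; the critique below targets the inputs, not the arithmetic.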
The argument is problematic because when we assigned 50% each to God/no God, we partitioned the vast space of possible hypotheses in an inevitably biased way. We could fragment “no god” into an indefinite number of alternative hypotheses, such as “Hindu gods”, or the hypothesis that we are a quantum computer simulation, which would be compatible with an orderly universe too. To this type of fundamental question, where we have no good conception of the possibility space, probability theory cannot be meaningfully applied.
In philosophy of science, the fact that the hypothesis space is never exhaustively specified is known as the “under-determination of theory by data”. This refers to the idea that we can never know whether another theory would account for the evidence equally well. Hence, science cannot be said to constitute truth. This is consistent with Bayesian reasoning, where only hypotheses deemed worthy a priori are investigated. Yes, a demon tampering with your mind could explain your current sensory experience, but its prior probability is so low that it becomes negligible.
Interestingly, a similar argument has been advanced by philosopher Hilary Putnam in defense of science, called the “No Miracles” argument: science’s successes are unlikely to be due to luck, and therefore science must deal in truth. The argument hinges on a conditional probability, where P(science’s success | science is unrelated to truth) is considered low. Bayes’ theorem tells us that this is meaningless unless we consider the base rate – the relative frequency of false theories – but again, this incidence cannot be quantified. However, if an evolutionary view of science is taken, the “No Miracles” argument amounts to “survival of the fittest”: science is successful precisely because hypotheses that do not perform well are eliminated.
Confirmation bias in terms of Bayes
If we let unconscious cognitive processes constitute a filter, we find biases not only in how the brain collects data, but also in which hypotheses it focuses its evidence acquisition on. That is: even when we are aware of a whole set of hypotheses, we have a tendency to gather information about only one of them. In the literature, this is known as “confirmation bias”. It has been explained in various ways, like how we feel good about being right and value consistency, but it is perhaps most elegantly accounted for with reference to Bayes.
According to German psychologist Gerd Gigerenzer, scientific theories tend to be metaphors of the tools used to discover and justify them (the “tools-to-theories heuristic”) – the tools change the Fragestellung, the way questions are framed. For example, the cognitive revolution was fueled by the advent of computers and statistical techniques, which soon became theories of mind. Thus, the brain is often thought of as a homunculus statistician that collects data and plugs them iteratively into Bayes’ theorem. And as already stated, the brain cannot collect all kinds of data – a completely Bayesian mind is impossible, for the number of computations required for optimization (i.e. finding the best hypothesis) would increase exponentially with the number of variables. The brain must therefore allocate computational resources dynamically, based on how promising a hypothesis seems. By limiting the sample size (the working memory and attentional window), we become more likely to detect correlations, because small samples tend to exaggerate contingencies. Presumably, these benefits outweigh the dangers of confirmation bias.
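The claim that narrow samples exaggerate contingencies can be checked with a toy simulation (sample sizes and repetition counts are arbitrary, standard-library Python): two truly independent variables show, on average, a larger spurious correlation in small samples than in large ones.

```python
# Illustrative simulation: with two genuinely independent variables, small
# samples produce larger spurious correlations on average than large ones.
import random

def abs_correlation(n, rng):
    """Absolute Pearson correlation of two independent uniform samples."""
    xs = [rng.random() for _ in range(n)]
    ys = [rng.random() for _ in range(n)]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return abs(cov / (vx * vy) ** 0.5)

rng = random.Random(0)
small = sum(abs_correlation(5, rng) for _ in range(2000)) / 2000
large = sum(abs_correlation(100, rng) for _ in range(2000)) / 2000
print(small > large)  # True: a narrow attentional window "sees" contingencies
```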
In Bayesian terms, confirmation bias means that we fail to take the likelihood ratio P(D|H1)/P(D|H2) properly into account, particularly when H1 and H2 are each other’s opposites. The fact that our evidence acquisition is partial means that, for the hypotheses we favor a priori, we over-weigh the prior (our preconceptions), while for the hypotheses we initially disbelieve, we over-weigh the likelihood – being either too conservative or too keen in our belief revisions. Because this tendency persists over time, the hypothesis is positively reinforced, leading to what Gestalt psychologists called “fixedness” and “mental set” – the inability to think “outside the box” – which has been implicated in many clinical conditions, from depression to paranoia. Iterated Bayesian inference is like a self-modifying filter in which our belief in a hypothesis is continually revised in light of incoming data. The posterior probabilities correspond to the widths of the filter’s meshes: the more coarse-meshed the filter, the more receptive we are to the hypothesis, and the bigger its impact on future interpretations.
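This asymmetric weighting can be sketched with invented numbers (a caricature, not a model of any experiment): two coin hypotheses, and an updater that dampens likelihood ratios speaking against its favored hypothesis. A perfectly balanced data stream then still entrenches the favored belief.

```python
# Confirmation bias as asymmetric Bayesian updating: evidence against the
# favored hypothesis H1 is under-weighted (its likelihood ratio is damped),
# so balanced data still reinforce H1. All numbers are illustrative.

def update(p_h1, lr):
    """One Bayesian step: posterior odds = prior odds * likelihood ratio."""
    odds = (p_h1 / (1.0 - p_h1)) * lr
    return odds / (1.0 + odds)

P_HEADS_H1, P_HEADS_H2 = 0.6, 0.4
data = ["H", "T"] * 20                # balanced stream: favors neither

fair, biased = 0.5, 0.5               # equal priors on H1
for d in data:
    lr = (P_HEADS_H1 / P_HEADS_H2) if d == "H" \
         else ((1 - P_HEADS_H1) / (1 - P_HEADS_H2))
    fair = update(fair, lr)
    # The biased agent dampens ratios below 1 (evidence against H1):
    biased = update(biased, lr if lr >= 1 else lr ** 0.5)

print(round(fair, 3), round(biased, 3))  # 0.5 0.983: entrenchment from nothing
```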
Our biased evidence collection is most evident in the type of experiment pioneered by Peter Wason in 1960. Subjects are given a triplet of numbers, such as “2, 4, 6”, hypothesize the generating rule, and ask the experimenter about the correctness of other triplets in order to infer it. What subjects tend to do is state the hypothesis “numbers increasing by two” and then look only for confirmatory evidence by asking about, for example, “8, 10, 12”. As a consequence, they will never infer a generating rule for which these triplets are a subset, for example “any even numbers” or “numbers in increasing order”. By analogy, if you want to know whether a person is still alive, you immediately go for the pulse, rather than, say, checking whether the eyelids are closed, because the pulse is a better differentiator. We want evidence that surprises us, which in information-theoretical terms has a high self-information content. It would have been more rational, more efficient, if the subject instead asked about triplets that distinguish between the working hypothesis and the more general candidates, like “4, 5, 7”, in order to, as Richard Feynman put it, “prove yourself wrong as quickly as possible”. Instead, we are like the drunkard looking for his key under the streetlight, because that is where he can see.
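In the classic version of Wason’s task the seed triplet is “2, 4, 6”. A minimal sketch (the three candidate rules are illustrative) of why a confirmatory query cannot separate the candidates, while a differentiating one can:

```python
# Toy version of Wason's 2-4-6 task: three candidate generating rules,
# a confirmatory query, and a differentiating query.
def plus_two(t):   return t[1] - t[0] == 2 and t[2] - t[1] == 2
def increasing(t): return t[0] < t[1] < t[2]
def all_even(t):   return all(n % 2 == 0 for n in t)

rules = [plus_two, increasing, all_even]

confirmatory    = (8, 10, 12)  # fits the favored "+2" hypothesis
differentiating = (4, 5, 7)    # increasing, but neither "+2" nor even

print([r(confirmatory) for r in rules])     # [True, True, True]: nothing ruled out
print([r(differentiating) for r in rules])  # [False, True, False]: only one survives
```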
This insight – that science progresses faster if it focuses on differentiating evidence – is profound, and arguably the most important concept in the philosophy of science. It is known to many as “falsificationism” and associated with Sir Karl Popper, who argued that good scientific practice is defined by presenting theories alongside the criteria by which they can be rejected, so that all effort can be focused on trying to refute them by testing the potentially falsifying predictions derived from them. A strong theory is a theory that points towards its own Achilles heel, and because (according to Popper) there are no conceivable “ugly facts” that could show the theories of Freud and Marx to be incorrect, those theories are closed to criticism and worthless as predictors.
Science essentially serves to gauge the predictive power of a theory. According to “instrumentalism”, this is all there is to it – theories are nothing but means of making predictions. A relatively good theory therefore is one that:
- Is precise in its predictions, so as to forbid relatively more outcomes, thus providing more scenarios by which it can be rejected. For example, if you have two dots on a Cartesian plane, the hypothesis “the relationship is linear” is easier to falsify (just test a new dot not aligned with the first two!) than “the relationship is quadratic”, because a third dot is consistent with more quadratic data patterns. A vaguer theory is less strengthened (“corroborated”) by confirmatory data than a more specific one, because its Bayesian likelihood P(data|hypothesis) is lower.
- Makes predictions across a wide range of domains. For example, psychological theories are much less precise than physical ones, and the hypotheses derived from them are typically modest statements like “the condition a participant is under will make a difference”, without specifying the size or direction of this difference. However, psychological theories can often give rise to a wider variety of predictions, like how the theory of confirmation bias may partly predict both depression and paranoia.
The two criteria may appear to be contradictory, but they are really just about grain and extent, the same old filter-dimensions we have spoken of before. A good theory is fine-grained and wide-eyed.
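The dots-on-a-plane example above can be made concrete (the specific points are hypothetical): through the dots (0, 0) and (1, 1) the line is unique, while a whole family of quadratics y = a*x^2 + (1-a)*x passes through both, and the free parameter a can always be chosen to hit any third dot.

```python
# Why "linear" forbids more than "quadratic": through (0,0) and (1,1) the
# line y = x is unique, so an off-line third dot refutes it. But every
# quadratic y = a*x^2 + (1-a)*x also passes through both points, and a can
# be solved to hit ANY third dot with x3 not in {0, 1}.
def line(x):
    return x

def quadratic_hitting(x3, y3):
    a = (y3 - x3) / (x3 * x3 - x3)      # solve a*x3^2 + (1-a)*x3 = y3
    return lambda x: a * x * x + (1 - a) * x

x3, y3 = 2.0, 7.3                        # an arbitrary, hypothetical third dot
quad = quadratic_hitting(x3, y3)

print(line(x3) == y3)                    # False: "linear" is falsified
print(abs(quad(x3) - y3) < 1e-9)         # True: some quadratic always survives
```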
Moreover, in order to truly gauge how well a theory predicts, as opposed to how accurately it fits the evidence, it matters whether the theory is stated given the evidence (“post hoc”) or in ignorance of the evidence (“a priori” – typically, but not necessarily, before the evidence is gathered). A parallel distinction is often drawn between “data fitting”, such as finding a mathematical function that approximates a given dataset, and “ex ante prediction”, where the function is stated in advance. This matters because with post hoc explanations it is impossible to know how much of the data is influenced by random factors – noise – and if the explanation takes these into account, it will fare worse as a predictor, since in future circumstances the noise will be different. If a prediction fails, the theory must be modified – according to Popper, by suggesting new tests, so as to increase its falsifiability.
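The contrast can be sketched in a toy simulation (the linear “true signal” and the noise level are assumed for illustration): a maximally flexible post hoc curve fits its own data perfectly, yet the modest a priori model usually predicts fresh draws better, because the flexible curve has absorbed the noise.

```python
# Post hoc fitting vs ex ante prediction: a polynomial flexible enough to
# pass through every noisy observation also "explains" the noise -- and the
# noise is different next time. Assumed true signal: y = 2x, Gaussian noise.
import random

def interpolant(points):
    """Exact (Lagrange) polynomial through all given points."""
    def f(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if i != j:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return f

def trial(seed):
    rng = random.Random(seed)
    truth = lambda x: 2.0 * x                       # a priori model
    noisy = lambda x: truth(x) + rng.gauss(0, 1)
    train = [(float(x), noisy(x)) for x in range(5)]
    post_hoc = interpolant(train)                   # zero error on its own data
    fresh = [(x + 0.5, noisy(x + 0.5)) for x in range(4)]
    err = lambda f, pts: sum((f(x) - y) ** 2 for x, y in pts)
    return err(truth, fresh) < err(post_hoc, fresh)

wins = sum(trial(s) for s in range(200))
print(wins)  # the a priori model wins well over half of the 200 trials
```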
In evolution, if the ecological structure changes for a species, an adaptation may no longer be beneficial, no matter how useful it has been in the past. There is a similar concept called the “problem of induction”, associated with David Hume, which states that any empirically based generalization (such as “all swans are white”) is vulnerable to refutation simply by counter-example (finding a black swan). With such a disquieting asymmetry, a theory would be hard to build but easy to dismantle. Popper too argued that how well a theory has generalized in the past does not matter for its correctness, and if it ever fails, it is no longer serviceable.
However, falsifying evidence should not be accepted without caution. Philosopher Pierre Duhem pointed out that a theory can never be conclusively rejected, since the apparatus or the background assumptions could have caused the result. When in 2011 physicists at CERN detected neutrinos apparently travelling faster than light, they did not immediately declare Einstein’s special theory of relativity false, but instead – based on the theory’s previous track record – tenaciously sought sources of experimental error (which they eventually found). The interpretation of any hypothesis presupposes facts that are themselves conjectural in nature, causing a web-like regress of uncertainty, but, Bayesian reasoning goes, pre-existing knowledge does and should matter in how we update our beliefs.
We may thus think of science as having a hierarchy of priors. If we are highly confident in a theory a priori, we won’t let an anomalous result refute it before the anomalies have accumulated to a degree so high that they can no longer be ignored. Given the positive feedback of confirmation bias this tends to engender, and the volumes of discrepant results required to outweigh it, the occasional theoretical discontinuities seen in the history of science – dubbed “paradigm shifts” by Thomas Kuhn – seem intrinsic to the dynamic of an intelligent system. But in the short term, this raises vexing questions about the rational course of action: for how long should we persevere in a once-promising-but-now-challenged belief? When does the in-hindsight-so-romanticized persevering maverick become a fanatic madman?
Confirmatory and exploratory
We saw with the brain-as-Bayesian-statistician that the number of hypotheses we can entertain is limited, and we saw with falsificationism that the best hypotheses are those that differentiate between candidate theories. According to science writer Howard Bloom, for intelligent behavior to emerge, a system needs not only “conformity enforcers” (to coordinate the system), “inner judges” (to test hypotheses), “resource shifters” (to reward successful hypotheses) and “intergroup tournaments” (to ensure that adaptations benefit the entire system), but also “diversity generators”: we must make sure that new hypotheses are continually generated.
In the brain as in science (scientists have brains), this can be thought of as random, combinatorial play. Activity spreads stochastically, with varying degrees of constraint, through neural networks to find ideas to associate, and scientists are similarly, via their social environment, exposed to random ideas that they encode in their own neural networks. Institutionally, to safeguard against theoretical blindness and confirmation bias, science encourages a free market of ideas and the habit of conceiving alternative explanations in an article’s closing discussion.
Different research methods vary along a continuum in how constrained the observational filter is. In “qualitative” research, such as interviews, the prior probabilities are weak, and hypotheses emerge over time as promising leads are picked up and shadowy hunches in the minds of the researchers are gradually reinforced. This exploratory, data-driven, “bottom-up” kind of research is necessary in the absence of robust theories. But when we do have high-prior hypotheses available, we may test them using quantitative methods, such as experiments, which by comparison are deductive, confirmatory and “top-down”. The overall process can be described as a “parallel terraced scan”.
If we need to scrap the hypothesis following expensive, focused experimentation, we may have to revert to square one – the cheap and unfocused information processing of qualitative research. Just as a brain’s attention can be concentrated or vigilant, science needs both modes. The important thing, following exploration, is not to double-dip in the same data: since the hypothesis was selected by virtue of its fit with that dataset, gauging its predictive power requires a fresh dataset.
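Double-dipping can be sketched with pure noise (the sample sizes and the number of candidate predictors are arbitrary): select, from many random “predictors”, the one best correlated with a dataset, and its apparent effect almost always shrinks on a fresh sample.

```python
# Why double-dipping flatters a hypothesis: among pure-noise predictors, the
# one selected for best fit on a dataset looks impressive on that same
# dataset, but its apparent effect shrinks when gauged on fresh data.
import random

def corr(xs, ys):
    """Pearson correlation of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def trial(seed, n=30, k=40):
    rng = random.Random(seed)
    outcome_a = [rng.random() for _ in range(n)]   # exploration sample
    outcome_b = [rng.random() for _ in range(n)]   # fresh sample
    predictors = [[rng.random() for _ in range(n)] for _ in range(k)]
    best = max(predictors, key=lambda p: abs(corr(p, outcome_a)))
    return abs(corr(best, outcome_a)) > abs(corr(best, outcome_b))

shrunk = sum(trial(s) for s in range(100))
print(shrunk)  # nearly all of the 100 trials: selection had inflated the fit
```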
Different statistical frameworks
Intuitive as Bayesian conceptualizations seem, a fascinating twist to this story is that the Bayesian nature of science is implicit. Scientists do not literally quantify their priors and plug the relevant numbers into Bayes’ equation – you are extremely unlikely to find this in a research article. The influence of priors is manifest only in the introductory and concluding sections of research articles, in which researchers rationalize why they have performed this particular query (because it had a high prior) and interpret the query result through the lens of the pre-existing belief system.
Instead, science in general, and experimental psychology in particular, is dominated by a different framework, called “null-hypothesis significance testing” or “the Neyman-Pearson approach”. To undergraduate students it is often presented as the only approach – as a single, objective structure – when it is actually heavily criticized within statistics and only one approach among many. But Bayes and Neyman-Pearson are not necessarily in conflict: each is internally coherent, but they conceptualize probability differently to serve different purposes. Nor should we hasten to conclude that both types of computations cannot occur alongside each other in the brain. Neyman-Pearson too has been used for theories of the mind, like Harold Kelley’s causal attribution theory and signal detection theory. And just as in science, the brain is capable of one-trial learning without sampling, so we should be wary of presenting either as some grand general-purpose mechanism.
There is meanwhile a third approach, in the minority but gaining popularity, called “likelihood analysis”. We will now consider the nuts and bolts of these procedures, but to do so, we will first need a good understanding of probability distributions.