OTOOP Part II.II: From Hierarchies to Networks


Towards the close of childhood, there typically comes a moment of realization that our time is finite while the knowledge out there isn’t. It becomes harder and harder to nourish our pet hobbies, and sustain any youthful obsessions with dinosaurs, pirates or Egyptology. Days pass faster, and we realize that, to absorb so much as a droplet of those surging swells of information, we would either have to drill deep into something narrow, or dabble in the shallow expanses for a superficial grasp.

Scientific knowledge grows exponentially…

The Earth’s population is estimated to have generated around 8 zettabytes of data in 2015 (8 trillion gigabytes), a figure that doubles every year. About 100 million books have been printed so far, with 1 million new books added per year. The Library of Congress adds more than 10,000 objects daily. Meanwhile, the growth of scientific productivity is proportionately explosive. Derek John de Solla Price, who founded scientometrics in 1963, pointed out an empirical law that still applies: the scientific literature grows exponentially. There are about 100,000 journals currently in the US, and this number doubles every 15 years. The number of universities doubles every 20 years.

More scientists are alive today than have ever been before and they face a life in fierce competition for a decreasing number of postgraduate positions. With all the low-hanging fruit discoveries gone, reaped in the long-gone days of the lone genius, squeezing facts out of reality is today largely a team effort, with scientists networking at international congresses, assembling teams on the fly, and churning out publications in order not to perish.

… while our cognitive faculties remain constant.

Meanwhile, the capacities of human cognition remain constant, and cannot keep pace with the spiraling orders of magnitude at which the scholarly output grows. Computational implants may one day become reality, but until then, we are confined to an attentional aperture that on average can keep at most 7 chunks of information in register at a time. The result is an information overload that makes it extremely difficult to keep abreast of research frontiers and coordinate science, with the consequence that progress could hit a sort of carrying capacity plateau: the rate of research may increase, but the rate of progress may not.

What we see is a tremendous waste of resources: repeated wheel-reinvention and the same old ideas re-emerging again and again under new neologisms. Given how replete science history is with examples of multiple independent discovery (evolutionary theory, oxygen, preferential attachment and integral calculus all had several discoverers) the nagging worry of being scooped in some potentially career-enhancing discovery undoubtedly looms large.

Moreover, with the specialization of scholarship, this means that findings on different sides of an academic department wall may never come into contact with each other. One article could contain “A implies B” and another “B implies C”, but the leap that “A implies C” may forever remain latent, because the two ideas will never flicker across the same human cortex.

Knowledge visualization is part of the solution.

In order to increase the chance of the right idea hitting the right mind at the right time, information has to be managed. This has many aspects to it, such as standardization of methods and nomenclature for easier communication, and serendipity-supporting apps like StumbleUpon that randomly present content in order to prevent intellectual myopia. But the most important aspect of information management is to organize information in a way that minimizes the effort that the reader must exert in order to understand it, and thereby reduces the reader’s computational burden, or “cognitive load”. This is where knowledge visualization comes in.

An important tool for this is hierarchical categorical systems.

For a long time, knowledge visualization was dominated by the archetype of knowledge as a tree – which, by dint of its recursive branching schema, represents a hierarchical ordering, called a “taxonomy”. Why does this metaphor feel so natural to us? Of course, we have already discussed hierarchies as a general organizing principle, like Herbert A. Simon’s concept of “near-decomposability”, Arthur Koestler’s “arborization”, as well as the notion of “algorithmic probability”. In the brain, category hierarchies are believed to originate in how the distributed representations of neural networks may overlap. However, non-overlapping networks may still be connected to each other, and to say that human categorization is strictly hierarchical would be a gross over-simplification. Categorization is known to be an extremely complex affair, in which perceivers flexibly re-carve their reality, constructing new ontologies on the fly to serve present goals and contexts. Nevertheless, some knowledge hierarchies are remarkably universal. Anthropologists have, for example, found that preliterate cultures across the world independently of each other have developed a biological classification system of seven levels, just like Linnaeus.


Tree charts, in essence, show unity split into multiplicity. We find them in genealogical kinship portrayals, and in feudally flavored depictions of all things as having a “natural order”, with inanimate minerals at the bottom and humans (or God) at the top (an idea called scala naturae, or the “Great Chain of Being”, which was first codified by Aristotle and illustrated by Porphyry). Charles Darwin, via his famous notebook sketch from 1837 as well as illustrations in the first edition of his Origin of Species, made the tree the go-to diagram for evolutionary relationships, which not only shows splitting, but also explains the mechanism that induces speciation. This image was then enduringly popularized by the artwork of German naturalist Ernst Haeckel. Furthermore, “cladistics” – the idea of classifying organisms based on shared characteristics traceable to a most recent common ancestor – derives from the Greek word for “branch”, and our language is full of tree metaphors.

The medieval monks were obsessed with knowledge combinatorics.

Tree visualization was particularly common in medieval manuscripts, not only because of its allegorical role in Genesis, but because Christian monks were keen practitioners of the ars memorativa (the art of memory), where trees served as a mnemonic device. By imposing on information what modern eyes would consider contrived, top-down metaphysical orderings, they made the knowledge easier to retrieve. Most prominently, Majorcan polymath Raimundus Lullus published an influential encyclopedia in 1296 called Arbor scientiae, where he mapped out knowledge domains as sixteen trees of science – an image that lives on to this day in phrases like “scientific branches”.

Lullus, interestingly, is also considered a father of computation theory, by virtue of the many hypothetical devices he invented that mechanically aid both the retrieval of old knowledge and the generation of new knowledge. Among them is the “Lullian circle”, in which discs inscribed with symbols around the circumference, representing elemental truths, could be rotated to generate new combinations. Inspired by Lullus, Gottfried Leibniz imagined a diagrammatic language (characteristica universalis) that represented atomic concepts by pictograms that can form compounds which, by a logical algebraic system (called calculus ratiocinator), could mechanically be determined as either true or false. It would constitute what he called an “alphabet of human thought”. Clearly, it was understood already in medieval times that strict hierarchies may isolate knowledge from combining in fruitful ways.


Although medieval knowledge representations may strike a modern reader as archaic, naïve and a tad megalomaniac, the problem they address – that of information management – is a serious one, and the ontological challenge faced by librarians – to index and anticipate the space of all possible subjects – is of equally epic proportions. Alphabetical indexes have been around since the Renaissance, but there is still the issue of how to classify a work by subject. The first widespread standardized cataloging system was devised by Melvil Dewey in 1876 and is called the Dewey Decimal Classification. Like those of his medieval forerunners, Dewey’s system is hierarchical, composed of ten classes that are each made up of ten subdivisions, with fractional decimals for further detail. However, it is also “faceted” in how it allows categories to be combined, and is therefore not restricted to a fixed taxonomy.

Some have taken the idea of spatial metaphor literally.

Interestingly, Dewey also imagined libraries to base their architectural layout on his ontology. This notion has been entertained by many others. Giulio Camillo, a sixteenth-century philosopher, wrote a book that described a “Theater of memory” – a seven-tiered amphitheater that one could enter and whose interior was full of boxes that could be opened by a gear apparatus to reveal words and visual metaphors inside. The boxes would be arranged in a hierarchical manner of increasing abstraction to show how the facts are conceptually connected. While the theatre was never completed, informatics pioneer Paul Otlet came closer to achieving it with his Mundaneum, established in Brussels in 1910. Described by scholars as an “analog World Wide Web”, it was a museum intended as a world encyclopedia, filled with drawers of index cards for every piece of intellectual property in the world. Otlet sought to organize its architecture around a central core of grand organizing principles, marking the unification of all knowledge, with colonnades radiating from it, leading to narrower subdomains.


Dewey, Camillo, and Otlet’s ideas may be regarded as physical realizations of the old Greek mind palace technique (“method of loci”), in which a person, in his mind’s eye, associates chunks of information with familiar physical locations in a certain order. By exploiting the fact that our spatial memory system is much more powerful than our semantic memory, the summoning of these faux episodic memories makes it easier to retrieve information from our own brains. It is interesting how reasoning about knowledge in general seems near-impossible without invoking spatial metaphor. Concepts appear to inhabit a topography, where they can be close or far away from each other, within-domain or between-domain, narrow or wide in area, high or low in abstraction. It invites notions of exploration and navigation, of salient landmarks and terra incognita, and hints at a possibility of uniting isolated islands and drifted continents into a primordial Pangaea.

Today, informatics is dominated by networks.

The method of loci, though orderly and hierarchical in nature, ultimately works by association – by increasing the number of retrieval cues for the idea you wish to memorize by embedding it in a context of which you already have a very rich mental representation in your neural networks. Our brains may obsess over information compression and squirt dopamine at the prospect of some grand synthesis of ideas, but wherever we look, hierarchies have given way to networks in information storage. For example, the first databases, developed by IBM in the 1960s, had a hierarchical, tree-like structure, such that relationships could only be one-to-many (a child node can only have one parent) and, to retrieve data, the whole tree would have to be traversed from the root. Today, ontologies in informatics are primarily modeled with Entity-Relationship diagrams and stored as networks of tables called “relational databases”. Similarly, programming languages generally support ways for objects to be represented in faceted, multidimensional ways (e.g. Java’s classes and interfaces). In fact, not even evolution itself can be characterized as a strict tree anymore, given the lateral gene transfer observed in bacteria.
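
The contrast between the two storage models can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node names, the table names (work, subject, work_subject) and the helper functions are hypothetical, and the in-memory SQLite database merely stands in for a relational system.

```python
import sqlite3

# Hierarchical model: every node hangs under exactly one parent,
# so a work about two subjects must be filed under only one of them.
tree_children = {
    "science": ["biology", "chemistry"],
    "biology": [],
    "chemistry": ["biochemistry primer"],  # forced to choose a single parent
    "biochemistry primer": [],
}

def find_path(target, node="science", path=()):
    """Depth-first search from the root; returns the path to target or None."""
    path = path + (node,)
    if node == target:
        return list(path)
    for child in tree_children[node]:
        found = find_path(target, child, path)
        if found:
            return found
    return None

# Relational model: a link table allows many-to-many relationships.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE work (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE subject (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE work_subject (work_id INTEGER, subject_id INTEGER);
    INSERT INTO work VALUES (1, 'biochemistry primer');
    INSERT INTO subject VALUES (1, 'biology');
    INSERT INTO subject VALUES (2, 'chemistry');
    INSERT INTO work_subject VALUES (1, 1);
    INSERT INTO work_subject VALUES (1, 2);
""")

def subjects_of(title):
    """Return every subject linked to a work, without walking any tree."""
    rows = conn.execute(
        """SELECT subject.name FROM work
           JOIN work_subject ON work_subject.work_id = work.id
           JOIN subject ON subject.id = work_subject.subject_id
           WHERE work.title = ? ORDER BY subject.name""",
        (title,),
    ).fetchall()
    return [name for (name,) in rows]

print(find_path("biochemistry primer"))
print(subjects_of("biochemistry primer"))
```

In the tree, the primer is reachable only through “chemistry”, whereas the link table lets it belong to “biology” and “chemistry” at once.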

For example, the Internet lacks a hierarchical architecture…

The World Wide Web is a minimally managed platform. While, at its core, it breaks data into smaller units stored on a server, its superstructure fundamentally lacks any hierarchical backbone. Nor does it have bidirectional links, or a single entry point per item, and with its meshwork of hyper-links and cross-references, it is as if many branches lead to the same leaf. People in the Linked Data movement see this “free-for-all” approach as a virtue: owing to the processing power of modern computers, ontologies can be inferred from the disorderly data itself, so as to programmatically derive themes and categories, instead of ordering the data upon insertion. Many databases opt for user-driven classification systems (sometimes called “folksonomies”), where users can associate materials with open-ended tags, like Twitter hashtags. This way, the data in its stored state is completely disorganized, and it becomes the responsibility of the retrieval mechanism to structure it. Thus, classification boundaries become dynamic and self-organizing, and the distinction between data and metadata dissolves. Philosopher David Weinberger, in his book “Everything is Miscellaneous”, has described it as “Filter on the way out, not on the way in”. For example, Google, a meta-application most of us use for retrieving information, is based on keyword analysis rather than a categorical scheme.
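
A minimal sketch of “filter on the way out” might look as follows. The item names, the tags and the two helper functions are invented for illustration; the point is only that nothing is classified at insertion time and all structure is derived at retrieval time.

```python
from collections import defaultdict

# Stored state: a flat, unordered collection of items with open-ended tags.
# (Hypothetical data; no category is assigned when an item is inserted.)
items = {
    "post-1": {"dinosaurs", "evolution"},
    "post-2": {"evolution", "bacteria"},
    "post-3": {"pirates", "history"},
    "post-4": {"egyptology", "history"},
}

def build_index(tagged_items):
    """Derive a tag -> items index from the data itself, at retrieval time."""
    index = defaultdict(set)
    for item, tags in tagged_items.items():
        for tag in tags:
            index[tag].add(item)
    return index

def retrieve(index, *tags):
    """Return the items carrying all of the requested tags."""
    matches = [index.get(tag, set()) for tag in tags]
    return sorted(set.intersection(*matches)) if matches else []

index = build_index(items)
print(retrieve(index, "history"))
print(retrieve(index, "evolution", "bacteria"))
```

Each query carves out its own ad hoc category, so the same item can appear under as many groupings as it has tags.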

… but is complemented by semantic networks

Tim Berners-Lee proposed the WWW while working as a researcher at CERN and was motivated by a concern with inefficient communication among scientists. He therefore went on to propose a more centralized, rigid version, called the Semantic Web, which imposed standard models, known as the Resource Description Framework (RDF), for encoding data into subject-predicate-object relationships. This facilitates a more dynamic interaction with knowledge as well as the merging of data from different sources in a way that, for example, Wikipedia – which bases data presentation around the metaphor of a physical page and is meant to be read by humans rather than computers – does not support. Parallel computing expert Danny Hillis has presented a similar scheme for extracting “meaning” from static documents, and an implementation of it called “Freebase” was sold to Google in 2010, and is used for the Wikipedia-like entries that pop up to the right for certain searches.
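
To make the subject-predicate-object idea concrete, here is a small sketch that models triples as plain Python tuples rather than using a real RDF library. The two “sources”, the predicate name “implies” and the assumption that it is transitive are all illustrative; RDF itself only prescribes the triple structure, and inference rules of this kind would have to be supplied separately.

```python
# Two independent sources, each holding one subject-predicate-object triple.
# (Hypothetical content, echoing the "A implies B" / "B implies C" example.)
source_one = {("A", "implies", "B")}
source_two = {("B", "implies", "C")}

# Because both sources share the same triple format, merging is a set union.
merged = source_one | source_two

def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )

def infer_transitive(triples, predicate):
    """Add (x, predicate, z) whenever (x, predicate, y) and (y, predicate, z) hold.

    Treating the predicate as transitive is an assumption made for this sketch.
    """
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        links = [(s, o) for s, p, o in inferred if p == predicate]
        for s1, o1 in links:
            for s2, o2 in links:
                if o1 == s2 and (s1, predicate, o2) not in inferred:
                    inferred.add((s1, predicate, o2))
                    changed = True
    return inferred

closure = infer_transitive(merged, "implies")
print(match(closure, subject="A"))
```

Neither source contains “A implies C” on its own; the link only becomes derivable once the two are merged into a single machine-readable store.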

What does informatics tell us about the brain?

As pointed out by writer Alex Wright, the Internet makes increasing use of the “stream” metaphor of ephemeral content rather than static pages, which bears upon the century-old idea of our collectively intelligent knowledge network as a “global brain”, of bits flickering through its veins like a vital fluid. Science fiction writer H. G. Wells conceived of an “all-human cerebellum”, while philosopher Pierre Teilhard de Chardin entertained his somewhat obscure notion of the “noösphere”, and popular science writers Ian Stewart and Jack Cohen have a similar concept of “extelligence”.

The analogy of the Internet as a brain may seem a bit too quasi-mystical, but the insights gained from studying how humans have optimized information management outside of the brain – from naïve hierarchies to disorganized networks – may give us important clues as to how it is done inside of the brain. Somehow, in the brain, hierarchies and networks coexist and give rise to each other…



Alex Wright’s Glut (2008) and Cataloging the World (2014) overlap in content and cover the history of information management, with the latter focused on Paul Otlet’s Mundaneum.

Manuel Lima’s Book of Trees (2014) is a beautifully curated coffee-table type of book about tree diagrams, from Ancient Mesopotamia to Big Data visualizations.

Katy Börner’s Atlas of Science (2010) and Atlas of Knowledge (2015) are about computer-generated visualizations of scientometric data.

Samuel Arbesman’s The Half-Life of Facts (2013) is a popular science book about scientometrics.



OTOOP Part II.I: The Redundant Scientometrician


Within a library’s soaring atrium, beneath its latticed skylight roof and the crown lace of a sycamore, young scholars lay supine in its dappled, purple shade, careful not to bother the familiar presence of an elderly woman, who, perched on a tall ladder, stood with her small hands cupped around one of the sycamore’s innumerable twigs. And so it was a violation of rules when a young man, undeterred by the silence, shouted to the woman from the ground:

“Excuse me, Ma’am, but could you please point me towards the Department of Scientometrics?”

Rare as these interruptions were, it immediately brought to her mind memories from that other time, some two decades less than a lifetime ago, when another man had approached her at this very site. There was no tree here back then, only an unshaded bed of greenery, where she, one day, had sat and examined the flora, when the cast shadow of a hunched old man emerged from behind, and a voice asked her, timidly:

“Excuse me, Miss, but could you please tell me what it is that you study?”

“I study arborsculpture, sir – the art of tree shaping.”

“I was hoping you were,” the man said.

“Why so?”

“I have for many months now been looking for a young set of green fingers – coupled to a pair of ears that are unflinching to lofty ideals, and eyes that won’t tire, even when that idealism inevitably wears off.”

He sat down beside her. “I hope you won’t mind me disburdening myself a little bit. I happen to be the manager of this library, and, as is customary I suppose – now with my time here on Earth running out – I have lately begun to doubt the causes I have been committed to, the fruition of my labors, and ponder the possibility, that maybe, after all, my life has been an utter mis-investment. You see, I was the one who seeded this collection, who – as libraries began to sprout up across our country – ensured the unrestricted influx of literature, and saw it swell to its present proportions. At the time, I rejoiced at the inundation as if it were a cool rush of spring water, but today, as I walk around the library, what wells over me is instead a feeling of glut, surfeit and almost disgust.”

They both looked around. It was true that the library bore all the evidence of an ill-planned growth spurt, with aisle after aisle lined with motley-wooden bookcases stacked on top of each other, climbing towards the ceiling, overflowing with scrolls and papyri, folios and manuscripts. Here was represented every topic imaginable, and for every topic, many dozens of sumptuously illuminated volumes wearing grandiose titles, of –logies and –nomies and –graphies and all manner of suffixes. And finally, on top, a uniform layer of dust, indicating that, for all the exalted thoughts and theories contained inside, this library was their dead end.

“So much knowledge”, he said, “to so little avail. A repository of potential destined to forever remain just that – potential! Some nights I walk around the library by myself, when the moonlight reflects in its gilded details, and I imagine a wind swooping through it, freeing the ideas from their book-bound captivity. I imagine them as glowing globules that would rise from their shelves and congregate towards the atrium, where ideas equivalent to each other would fuse, and as the globules grew in size, they would swirl into the form of a spiral nebula, as if by a centripetal force, and, ultimately, join together in a single great entity.” He smiled shyly. “And from this, every fact in this library could then be derived, first by discipline, then by subject, then by topic, quite like how a tree branches out, multiplicity sprung out of unity.”

“You want me”, the woman asked, “to represent your library’s collection as a physical tree?”

“It would work”, he hurried, “like a map. Presented with the tree, the scholar would no longer feel bogged down in a sea of unstructured knowledge, but immediately get a sense of overview and navigability, of progress and direction, with the ability to telescope back and forth, between the abstract and the granular. Instead of repeated reinvention, every idea would be original, and as surely as leaves align themselves in the canopy to claim their share of the sunlight, every unfinished idea would find itself a human mind to feed on.”

He turned toward her.

“It’s not a small favor that I am asking of you. What I am requesting is no less than your life-long devotion to a project whose tractability cannot be guaranteed. But if you, like me, are willing to delay your gratifications for a shot at something grand and enduring, I trust you will find the reward worth the sacrifice”.

The library manager, as he had predicted, soon walked his last nocturnal walk around the library, and the woman, touched and honored by his last wish, set about with her huge undertaking.

Adapting her previous knowledge about tree shaping was the easy part. Tree branches, she knew, had a natural tendency to graft if their vascular tissues were joined together, so that, by pruning and wiring new growing material to areas under the bark, she was able to control the tree’s growth. Her inquiries also led her deep into the realms of fluid mechanics and mathematics, where the fact that the total cross-sectional area of all child branches equals that of the parent branch meant that she had to calculate what trunk radius was required to give the tree enough capacity to represent the entire library. Sycamores were particularly prone to graft, so by the end of the first year, she had planted a sycamore sapling in the flower bed.

The hard part, and the part that would consume her subsequent seven decades, was that of organizing all the literature and mapping it onto actual tree branches. She got permission to close the library from the inflow of more literature, and began to, for every single work in the library holdings, wring out a hierarchical structure from its linear restrictions. She would represent its gist as a root-node, and let the ideas that supported it be its child branches. Then, she would see if the root figured as an internal node in another work’s tree, and affix it there, embedding each idea in its natural slot. Because the relationships varied in their meaning – sometimes an idea would progress from antecedent ideas, sometimes contradict them – she invented signs and qualifiers to express this, from which a whole grammatical system eventually would emerge.

With the passing of years, her personal note-collection bulged to a size worthy of its own library department, growing denser and denser in inserts, errata and marginalia. The sycamore would fan out in its full verdant glory, with classes of knowledge carved into its bark as bands of gold lettering, to which the scholars would turn repeatedly, the way a wanderer refers to his map. And, with the passing of many more, the woman’s hands would grow veiny and shriveled, her hair white, her back hunched, but her mind remained as unrelenting.

So there she stood on her ladder, wiring a twig into place, when the young man below commanded her attention. She answered him:

“Believe me, sir, I know this library inside and out, and there exists no discipline by that name.”

“Oh really? You see, I have come a long way after a friend told me that this place houses a rather sizeable collection of scientometrics literature, perhaps shelved under the wrong label.”

“The labels, sir, you will find inscribed on this sycamore.”

He looked askance at first, but inspected the tree for a brief moment. Then, he broke out in laughter. Condescendingly he said:

“Surely you wouldn’t expect to find scientometrics included in a pictorial representation of the sciences? This tree – poor soul, whoever wasted their time on such an obsolescent project – is itself a work of scientometrics! Scientometrics – the measuring and mapping of science! Ah, and I think I just found the books my friend was talking about.”

He approached her personal note-collection, and began to thumb perfunctorily through two decades less than a lifetime.

“Well, this was, I must say, terribly underwhelming. Some attempt at a bibliographical system of cataloging literature hierarchically, of which at least 26 variants have already been developed, independently of each other. What a waste of a journey.”

He sighed and unfolded a map from his breast pocket, on which he jotted an annotation.

“Trees are so sad, I think. Each branch valiantly ramifying on its own, unaware of others, never to re-connect, doomed to duplicate what has already been done.”

He then walked towards the exit.

“But Sir!” the woman screamed, her frail frame trembling, “if you are mapping the discipline of science-mapping, what does that make you?”

And for a sympathetic instant he turned around with a look that mirrored hers.

“A scientometric-metrician”, he said. “The only one of whom I am currently aware.”


OTOOP Part I: Book list


The research behind the first room of the Menimagerie – the Entropical Conservatory – was very labor-intensive and very eclectic. The idea was to start with the impressionistic ideas of hierarchy theory and see how it underlies information theory, probability theory, selection effects and statistics. As promised in the beginning, it therefore starts with foregrounding the connective tissue of all these concepts, before going into mathematical detail. I personally think this is the most satisfying way of teaching things. People vary in their need for cognitive closure – some people are happy to study a statistics course by learning the procedures as isolated recipes – but I think I speak on behalf of many others when I say that working with a method that we have a very fragile, shallow understanding of is very uncomfortable, and probably ineffective. I couldn’t find in a single textbook an explanation of why normal distributions are so common (explanation: selection effects) or why the standard deviation takes the square root after division, and I was very lucky to stumble upon books that opened my eyes to Bayesian and likelihood analysis. In this section, I list (incompletely) the books that informed the previous section.

Information Theory

Philosopher Andy Clark explains “umwelt” and “affordance” beautifully in the books Being There (1998) and Mindware (2000).

James Gleick’s The Information (2012) is, at 500+ pages, very informative, but mostly concerns the history behind information theory, with a lot of biographical portrayals of the key figures. A plus is that it takes a very abstract view of it, for example starting with explaining ancient communication systems like African tribal drums.

John R. Pierce’s An Introduction to Information Theory (1980) is the go-to when it comes to conceptual explanation.

Vlatko Vedral’s Decoding Reality (2012) and Seth Lloyd’s Programming the Universe (2007) are both popular science books about quantum computing, but they are also the clearest explanations of information that I have found. Charles Petzold’s Code (2000) is about computation, but also very nice and concept-oriented.

Paul Davies and Niels Henrik Gregersen-edited Information and the Nature of Reality (2014) is a collection of philosophy essays on information that was extremely illuminating.

John D Barrow’s The Artful Universe (1996) explains noise.

Bayes and Probability Theory

For light introductions to probability theory, see Leonard Mlodinow’s The Drunkard’s Walk (2009) and John Haigh’s Probability: A Very Short Introduction (2012).

Bayes’ theorem is explained most clearly in James V Stone’s Bayes’ Rule (2013).

Predictive coding is explained non-technically in Jakob Hohwy’s Predictive Mind (2013). Note: Andy Clark just came out with a book about this called “Surfing Uncertainty” that I have not yet read.

The base-rate fallacy is explained best by the man who discovered it, Daniel Kahneman, in Thinking, Fast and Slow (2012).

Philosophy of Science and Statistics

John D Barrow’s Theories of Everything (2008) explains selection effects in cosmology, while his Pi in the Sky (1993) explains it for the philosophy of mathematics.

Max Tegmark’s Our Mathematical Universe (2015) is probably the most digestible book about quantum theory and the Many-Worlds interpretation. He also happens to be Swedish, and drops references to things from my home country throughout, which is fun.

Tim Lewens’ The Meaning of Science (2015) really extracts the key concepts and presents them without obscure jargon.

David Salsburg’s Lady Tasting Tea (2002) presents the history of statistics using interesting anecdotes. It is helpful to see just how fraught with conflict statistics is, and that it shouldn’t be seen as something objective or absolute, but as a useful artifact.

David P. Feldman’s Chaos and Fractals (2012) explains normal distributions in the best way I have seen.

Jordan Ellenberg’s How Not to be Wrong (2015) is an amazingly interesting book about mathematical thinking, which includes chapters about regression to the mean and Bayesian analysis.

Zoltan Dienes’ Understanding Psychology as a Science (2008) is an impossibly good book to which this blog is highly indebted. It has very clear introductions to Neyman-Pearson, Bayes and Likelihood. Really, every science student should be obliged to read it.

Gerd Gigerenzer’s Rationality for Mortals (2010) and Bounded Rationality (2002) are marvels of insight. Reading Gigerenzer’s books makes you happy.

Alex Reinhart’s Statistics Done Wrong (2015) explains common statistical errors in a non-technical way. Again, everyone should read it.


OTOOP Part I.XII: Different Statistical Approaches

The Bayesian Approach

A Bayesian analysis of the data from a simple experimental design is straightforward. Recall that the theorem implies that the posterior is the likelihood weighted by the prior. Because a hypothesis corresponds to a hypothesized probability distribution, you therefore have to, for each probability in your expected distribution, multiply it by the corresponding probability in your obtained distribution. For example, when trying to estimate a population mean, a “hypothesis” could be “height is normally distributed and the mean height is 170 cm and SD=5”. Under this hypothesis, drawing 5 random samples that are all above 200 cm has a very small probability, so if such data were obtained, the likelihood would shift the mean of the posterior distribution to the right. Let us take it step by step:


  • Suppose you have a between-group design with a treatment group, to which administered a certain dose of a drug, and a control group, that received no potent substance.
  • You assume a priori, based on previous literature, that it will be roughly normally distributed, so that values vary in their probability in a symmetrical manner. You also believe, a priori, that the central value will be 3 and that a value of +-1 would have a 68% chance of occurring (i.e. the standard deviation is 1). This defines your prior distribution. (Note that prior distributions are in practice difficult to construct. Psychological theories, for example, rarely imply a particular P(effect|theory) distribution. )
  • You obtain your sample distribution, with its own mean and standard deviation (the standard error). This is your likelihood distribution. Each value on its horizontal axis is its own hypothesis of the mean, so the evidence clearly favors the hypothesis population mean=sample mean the most.
  • You multiply prior and likelihood for every parameter value (and adjust to make sure that the posterior distribution’s area is equal to 1). This is your posterior distribution. Note that there are simple formulae available to skip the hassle of calculating posteriors for each parameter value.
  • The posterior distribution can be summarized by a “credibility interval” – the range that has a 95% probability of including the true effect, which, if we assume it is centered on the mean, is (M1 – 1.96 × S1) to (M1 + 1.96 × S1), where M1 and S1 are the mean and standard deviation of the posterior.
  • If we are interested in different hypotheses – that is, different possible mean values – we can compare them using a ratio: P(H1|D) / P(H0|D) = P(D|H1) / P(D|H0) × P(H1) / P(H0). The ratio of likelihoods, P(D|H1) / P(D|H0), is the “Bayes factor”, which converts the prior odds into the posterior odds. If it is more than 1, the data support the numerator hypothesis more than the denominator hypothesis, nudging your belief in the former’s direction. It is often informative to report how the Bayes factor depends on different priors, so that other researchers can choose a Bayes factor value based on their own priors.
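
The steps above can be sketched numerically. The following is only a minimal sketch: the prior (mean 3, SD 1) comes from the example, while the sample mean of 4 and standard error of 0.5 are invented for illustration. It multiplies prior and likelihood for every candidate mean on a grid, normalizes the result, and compares it with the "simple formula" for the case where both distributions are normal.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Prior from the text: mean 3, SD 1.
prior_mean, prior_sd = 3.0, 1.0
# Hypothetical data (assumed for illustration): sample mean 4, standard error 0.5.
sample_mean, standard_error = 4.0, 0.5

# Grid approach: multiply prior by likelihood for every candidate mean, then normalize.
step = 0.001
grid = [i * step for i in range(-5000, 15001)]  # candidate means from -5 to 15
unnormalized = [
    normal_pdf(m, prior_mean, prior_sd) * normal_pdf(sample_mean, m, standard_error)
    for m in grid
]
area = sum(unnormalized) * step
posterior = [u / area for u in unnormalized]
grid_mean = sum(m * p for m, p in zip(grid, posterior)) * step

# Closed form for a normal prior and normal likelihood: precisions (1/variance) add,
# and the posterior mean is the precision-weighted average of the two means.
prior_precision = 1 / prior_sd ** 2
data_precision = 1 / standard_error ** 2
post_sd = math.sqrt(1 / (prior_precision + data_precision))
post_mean = (prior_precision * prior_mean + data_precision * sample_mean) / (
    prior_precision + data_precision
)

# 95% credibility interval, centered on the posterior mean.
interval = (post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd)

print(round(grid_mean, 3), round(post_mean, 3), round(post_sd, 3), interval)
```

Because the data here are more precise than the prior, the posterior mean lands closer to the sample mean than to the prior mean.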


Before proceeding, we should highlight a few conceptual points:

  • Lacking any prior means that you have no idea of the mean. In terms of prior distribution, this means that its standard deviation should be infinite. We call this a “flat prior”. There is a mathematical caveat to this, which is that when we apply a non-linear transformation of that variable (for example, its inverse) it will not remain flat. For example, if our variable is the male:female ratio of a country’s population, and we assume it as flat out of complete ignorance, then the female:male ratio will not be uniformly distributed. Hence, complete ignorance should mean an inability to assign any kind of prior (and in science, information corresponding to the prior is typically unavailable). There is an approach called “objective Bayesian” that tries to solve this.
  • If your prior is wide (i.e. diffuse), the posterior will be dominated by the likelihood, and if it were flat, the posterior would be the same as the likelihood.
  • The credibility interval of the posterior will normally be narrower than that of the prior, and will continue to narrow as you collect more data. With more data, you therefore gain in precision, and different priors will converge on the same posterior.
  • A Bayes factor around 1 means that the experiment was not very effective in differentiating between the hypotheses, in picking up a difference (it is insensitive). It may be as a result of vague priors. Intuitively, this penalty makes sense, for in a vague theory, data very different from it can support it, and because the Bayes factor is low, you will not be inclined to accept it.
  • Your posterior only concerns hypotheses you have conceived of. As we learned in the introduction, this does not exhaust the space of possible theories.
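
Two of these points, that a diffuse prior leaves the posterior dominated by the likelihood and that different priors converge as data accumulate, can be illustrated with a small sketch. It assumes a normal prior and normally distributed data with a known standard deviation. The two "researchers", their priors, and the data values are all hypothetical.

```python
import math

def normal_posterior(prior_mean, prior_sd, sample_mean, sample_sd, n):
    """Posterior (mean, sd) for a normal mean, given a normal prior and known sample_sd."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = n / sample_sd ** 2  # the squared standard error is sample_sd**2 / n
    total = prior_precision + data_precision
    mean = (prior_precision * prior_mean + data_precision * sample_mean) / total
    return mean, math.sqrt(1 / total)

# Two hypothetical researchers with different priors see the same data:
# sample mean 5, sample SD 2 (both invented for illustration).
gaps = {}
for n in (1, 10, 100, 1000):
    skeptic = normal_posterior(0.0, 1.0, 5.0, 2.0, n)
    enthusiast = normal_posterior(8.0, 1.0, 5.0, 2.0, n)
    gaps[n] = abs(skeptic[0] - enthusiast[0])
    print(n, round(skeptic[0], 3), round(enthusiast[0], 3), round(skeptic[1], 3))

# A very wide (nearly flat) prior leaves the posterior almost identical to the likelihood.
diffuse = normal_posterior(0.0, 1e6, 5.0, 2.0, 10)
print(diffuse)
```

The gap between the two posterior means shrinks with n, and the posterior standard deviation shrinks with it.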

But the most important take-home point is that, in the Bayesian approach, beliefs are adjusted continuously. You may continue to collect data for as long as you want to, until the credibility interval has the precision you require. The only information you need to update your belief is the likelihood (a controversial idea known as the “likelihood principle”). As a consequence, you can quantify the subjective probability of the hypotheses you have considered, but have no black-and-white criteria to support your decision making. However, you could stipulate, somewhat arbitrarily, that “to confirm a theory, the Bayes factor must be 4” and then collect data until you have 4 (or ¼, in support of the other hypothesis).


The Likelihood Analysis Approach

The Bayes factor assisted you in optimal guesswork: how much you should, subjectively, favor one hypothesis over the other, in light of the available evidence. We also saw that assigning priors is theoretically and, above all, in practice very difficult. An important point here is that, irrespective of priors, we could quantify the obtained data’s impact on how we update probabilities based only on the likelihood ratio. In other words, we can sidestep the prior-problem by simply considering how much the evidence favors one hypothesis over the other, and this measure, the relative evidential weight, is conceptually distinct from that of subjective beliefs.

For example, let us say that we have two different hypotheses concerning the probability distribution of some population. One of them is that the proportion of males is 50%, the other hypothesis states that it is 100%. These are your hypotheses – there are no priors involved now. You take 5 random samples, all of which are male. You proceed to calculate the likelihood ratio P(5 male | 100% male) / P(5 male | 50% male) = 1 / 0.5^5 = 32. This means that the evidence favors the numerator hypothesis 32 times more than it favors the denominator hypothesis, because that hypothesis predicted the data 32 times more strongly. According to Richard Royall, a strong proponent of the likelihood approach, 32 happens to be a reasonable threshold for “strong relative evidence”, with 8 being “fairly strong”.
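
The arithmetic of this example can be checked in a few lines. This is a minimal sketch; the function name is made up for illustration.

```python
def likelihood(p_male, males, sample_size):
    """Probability of one particular sample with `males` males, if the true proportion is p_male."""
    return p_male ** males * (1 - p_male) ** (sample_size - males)

# Five draws, all male: compare "100% male" against "50% male".
ratio = likelihood(1.0, 5, 5) / likelihood(0.5, 5, 5)
print(ratio)  # 32.0
```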


We may plot a hypothesis’ likelihood to see how it changes as a function of sample size (for example, if the above example had a hypothesis of 0.8, this function would be the Bernoulli function 0.8^n × 0.2^(m−n), where the sample has size m and n of them are the outcome believed to have probability 80%), and given sample size, we can plan how many samples need to have a certain outcome in order to get a certain likelihood ratio. For example, given a sample of 20 people, to get a likelihood ratio greater than 8, 15 people need to get the 80% outcome to favor that hypothesis ((0.8^15 × 0.2^5) / (0.5^15 × 0.5^5) > 8). For two hypotheses that differ by more than 0.3, fewer subjects would be required for this strength of evidence.
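
This kind of planning can be sketched as a simple search. The helper names below are hypothetical; the numbers (0.8 versus 0.5, a sample of 20, a threshold of 8) are those of the example.

```python
def likelihood(p, successes, sample_size):
    """Likelihood of a hypothesized proportion p for a given number of successes."""
    return p ** successes * (1 - p) ** (sample_size - successes)

def min_successes_for_ratio(p_favored, p_other, sample_size, threshold):
    """Smallest number of successes at which the likelihood ratio exceeds threshold, or None."""
    for successes in range(sample_size + 1):
        ratio = likelihood(p_favored, successes, sample_size) / likelihood(
            p_other, successes, sample_size
        )
        if ratio > threshold:
            return successes
    return None

print(min_successes_for_ratio(0.8, 0.5, 20, 8))  # 15
```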


We could also hold sample size fixed, and let the parameter-values vary (like the function θ^5 × (1 − θ)^5). Given this latter likelihood function, we could calculate the equivalent of a credibility interval. Going by Royall’s recommendation that a ratio of 32 constitutes strong evidence, then all the θ-hypotheses with a likelihood value of more than 1/32 of the highest likelihood would have roughly the same strength of evidence. The “likelihood interval” therefore indicates the range of strongly favored hypotheses.
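
A likelihood interval of this kind can be found by scanning a grid of θ values. The sketch below assumes the example likelihood with 5 successes and 5 failures and a grid resolution of 0.001, chosen arbitrarily.

```python
def likelihood(theta, successes=5, failures=5):
    """Likelihood of theta for 5 successes and 5 failures (the example function)."""
    return theta ** successes * (1 - theta) ** failures

thetas = [i / 1000 for i in range(1001)]
peak = max(likelihood(t) for t in thetas)

# Keep every theta whose likelihood is more than 1/32 of the maximum.
inside = [t for t in thetas if likelihood(t) > peak / 32]
lower, upper = inside[0], inside[-1]
print(lower, upper)
```

On this grid the interval runs from roughly 0.15 to 0.85, symmetric around the best-supported value of 0.5.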

Again, there are a couple of conceptual points to highlight:

  • The ratio only requires two bits of information: the two likelihoods. It is unaffected by other changes in the population distribution (which is not the case of the Neyman-Pearson approach we will soon describe).
  • Again, you can continue to collect data for as long as you want, since likelihoods only depend on the product of each event’s probability.
  • The evidence, of course, can always be misleading, even though we interpret it in an appropriate way. However, the probability of obtaining a ratio of k or more in favor of the false hypothesis is bounded by 1/k, the inverse of that ratio.
  • If a likelihood interval does not include a hypothesis, this is not simply evidence against it, since it is relative to some other hypothesis, so we cannot reject it on those grounds.
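
The 1/k bound on misleading evidence can be checked exactly for a small case. The sketch below assumes the true proportion is 0.5, the false hypothesis is 0.8, the sample size is 20 and k is 8. These values are borrowed from the earlier example purely for illustration.

```python
from math import comb

def prob_misleading(p_true, p_false, sample_size, k):
    """Probability, when p_true holds, that the likelihood ratio favors p_false by at least k."""
    total = 0.0
    for x in range(sample_size + 1):
        like_false = p_false ** x * (1 - p_false) ** (sample_size - x)
        like_true = p_true ** x * (1 - p_true) ** (sample_size - x)
        if like_false / like_true >= k:
            # Add the probability of observing exactly x successes under the true hypothesis.
            total += comb(sample_size, x) * like_true
    return total

p = prob_misleading(0.5, 0.8, 20, 8)
print(p, 1 / 8)
```

In this case the probability of strongly misleading evidence is about 2%, well below the bound of 1/8.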


The Neyman-Pearson Approach

The basic logic

We have seen the Bayesian schema inform us about how to update our beliefs, and the likelihood analysis schema disregard the priors and instead tell us about relative evidential strength. While it is possible to set up decision criteria thresholds, such as “Bayes factor above 4” or “likelihood ratio above 32”, an argument could be made that these approaches do not support behavioral, binary decisions very well – such as confirming or rejecting a hypothesis – because, conceptually, wise decision-making should not necessarily be based on beliefs or relative evidence. Rather, the top priority could be to control how often, on average, we will make the wrong decision. This is the logical basis of the methodologically more complex yet by far most widespread statistical procedure, known as the Neyman-Pearson approach, or “Null hypothesis significance testing”. It was institutionalized in the 1950s and often described as a backbone of statistics, when, in reality, it is just one instrument of many in the toolbox that is statistics.

Central to this logic is the notion of the testing procedure itself being part of an infinite population, called a “reference class”. For example, the toss of a coin could be considered as a sample from a collective of all possible outcomes. If, in this reference class, “head” occupies 40%, then this will be reflected in a 40% long-run frequency of heads. This long-run frequency is what defines probability, such that a single event cannot meaningfully be said to have one – only the reference class has a probability attached to it. This idea is often described as meaning that probability is “objective”, but how the reference class is conceived is dependent on our subjective information. If we knew the aerodynamics surrounding the coin, the reference class would be narrowed down to the indefinite number of realizable tosses with those particular wind conditions. More importantly, this conceptualization implies that, because a hypothesis has no reference class (it is not part of a well-defined space of possible hypotheses), it does not have an objective probability. It is either true or false. Our test’s reference class meanwhile, has one, so our decision as to whether a hypothesis is true or false will have a long-run error attached to it. In essence, therefore, we calculate “If we were to conduct the exact same test and procedure again, with the same amount of subjective ignorance and information, an infinite number of times, how often would the true/false test give the wrong answer?”.


The true/false test in question is performed not on the hypothesis we are curious about, called the “alternative hypothesis”, but on the “null hypothesis”. This is occasionally defined as the one that is most costly to reject falsely – our default explanation that we don’t abandon unless we have to, because it is simple and approximates well. In practice, it is often, but far from always, the hypothesis of no difference (a nil hypothesis such as “all the samples from different conditions come from the same population”). As such, the test is somewhat equivalent to a mathematical type of proof known as reductio ad absurdum, in which a statement is assumed to be true, and if it is legally manipulated so as to result in self-contradiction, this means that the original statement must have been false. Except, instead of self-contradiction, the rejection-criterion is now based on extreme improbability. In other words, if the outcome is very improbable by happenstance, i.e. to have occurred as a result of chancy fluctuations, then this suggests that the experimental manipulation (or some other, unknown nonrandom influence) must be held accountable for it. We may say that an observed difference is significant, meaning that we reject the null, without directly quantifying the probability of the alternative hypothesis, since – given that only reference classes have probabilities – this would be meaningless.

We thus actively seek disconfirming evidence, and place the burden on the alternative explanation, which costs more in terms of parsimony compared to the default (“it’s all due to chance” is more parsimonious), something that we intuitively are very poor at. Rejecting the default hypothesis on the basis of low probability means that we again are concerned with a conditional probability. However, whereas in Bayesian and likelihood approaches, we calculated the likelihoods P(observed difference| hypothesis is true), holding the observed difference constant and letting the hypothesis vary, we are now concerned with the conditional probability P(getting as extreme or more extreme data |  null is true), in which the hypothesis is fixed as null so that we don’t consider every single parameter value, and the “possible data” is allowed to vary.

Moreover, we are no longer interested in the heights, but in areas, since the probability of the obtained data for a continuous density function is infinitesimally small, so we are interested in a low-probability range. Somewhat arbitrarily, this “rejection region” has by convention been set to 5%. This refers either to the two extreme 2.5% in both directions, in which case the test is two-tailed, or the most extreme 5% in one direction, in which case the test is one-tailed, and any effect in the other direction won’t be picked up. This makes the calculated probability dependent on unobserved things (choice of test), so that identical data can result in different conclusions. In effect, these are two different ways of calculating the conditional probability P(getting as extreme or more extreme data | null is true).
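
How the choice of tails changes the conclusion for identical data can be seen in a short sketch. It assumes the test statistic follows a standard normal distribution under the null, and the value z = 1.8 is invented for illustration.

```python
from statistics import NormalDist

def p_values(z):
    """Upper one-tailed and two-tailed p-values for a z statistic under a standard normal null."""
    standard_normal = NormalDist()
    upper_tail = 1 - standard_normal.cdf(z)
    two_tailed = 2 * (1 - standard_normal.cdf(abs(z)))
    return upper_tail, two_tailed

one_tailed, two_tailed = p_values(1.8)
print(round(one_tailed, 4), round(two_tailed, 4))
```

With this z-value the one-tailed test rejects the null at the 5% level, while the two-tailed test does not.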


The alpha-level

We – given the population distribution – calculate the probability of obtaining our sample or a more extreme one. If this probability, called the p-value, is less than 0.05, we reject the null; if the null is actually true, we will make this mistake in only 5% of future replications of this test. The 5% level, called alpha, is therefore a decision criterion set by ourselves in advance that gives us the rate of false alarms we would get if we were to perform this test again and again while the null were true. It is an objective probability. The long-term behavior, alpha, is everything we know. Note that it is not correct to define alpha as “the probability of false alarm error”, since this could be interpreted to mean the probability of a false alarm after rejecting the null. If you obtain p = 0.011, it is a mistake to say that “My false alarm rate is less than 1.1%” or that “98.9% of all replications would get a significant result” or that “There is a higher probability that a replication will be significant”, since a single experiment cannot have a false alarm rate. Alpha is a property of the procedure, not of one particular experiment.


Confusing the p-value with the probability of a fluke is called the “base rate fallacy”. It ignores the fact that the probability of a significant result being a fluke depends on the prior distribution of real effects. If you perform 100 tests at alpha=0.05, then you would expect about 5 false positives. If, in total, you obtained 15 significant results, the fraction of them that are false (a number known as the “false discovery rate”) would be about 5/15=33%. The lower the base rate (the fewer cases in which the null is false), the more opportunities for false positives. Therefore, in data-heavy research areas like genomics or early drug trials, because the base rate is so low, the vast majority of significant results are flukes.
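
The dependence of the false discovery rate on the base rate can be made concrete with a rough sketch. The function name, the base rates of 10% and 1%, and the power of 0.8 are all assumptions chosen for illustration.

```python
def false_discovery_rate(n_tests, share_real, power, alpha=0.05):
    """Expected fraction of significant results that are false positives."""
    real_effects = n_tests * share_real
    true_nulls = n_tests - real_effects
    false_positives = alpha * true_nulls   # nulls rejected by chance
    true_positives = power * real_effects  # real effects that are detected
    return false_positives / (false_positives + true_positives)

# Hypothetical fields in which 10% or only 1% of tested hypotheses are real effects.
moderate_base_rate = false_discovery_rate(1000, 0.10, power=0.8)
low_base_rate = false_discovery_rate(1000, 0.01, power=0.8)
print(round(moderate_base_rate, 3), round(low_base_rate, 3))
```

Under these assumptions about a third of significant results are false when 10% of effects are real, and well over four fifths are false when only 1% are.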


The beta-level

There is a similar risk of failing to detect real effects, called beta and expressed as P(accepting the null | null is false). Like alpha, beta is an objective probability decided upon in advance of the experiment. Given the beta, the expected effect size, and the expected amount of noise, you can calculate the sample size required to keep the beta at this predetermined level. Because large sample sizes normally are expensive, the relationship between alpha and beta is usually a tradeoff: we can reduce the probability of false alarms by requiring a p-value below 1%, but only by increasing the probability of missing a true effect. Though in practice they often are, alpha and beta are not meant to be picked as a mindless ritual, but carefully chosen based on an experiment-specific cost-benefit analysis. False alarms are typically presumed to be more costly, but this is not always the case: in quality control, failure to detect malfunctioning is often a higher priority.
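
The sample size calculation mentioned above can be sketched with a standard normal approximation for comparing two group means with a two-tailed test. This formula is an assumption of the sketch, not necessarily the calculation the text has in mind; the effect size of 0.5 and the noise (SD) of 1 are invented.

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect, sd, alpha=0.05, power=0.8):
    """Approximate subjects per group for a two-tailed comparison of two means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the chosen alpha
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - beta
    return math.ceil(2 * ((z_alpha + z_beta) * sd / effect) ** 2)

n_default = sample_size_per_group(effect=0.5, sd=1.0)
n_strict = sample_size_per_group(effect=0.5, sd=1.0, alpha=0.01)
print(n_default, n_strict)
```

Tightening alpha from 5% to 1% while holding power fixed raises the required sample size, which is the tradeoff described above seen from the other side.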


The complement of beta (1 – beta) is the probability of picking up an effect of the expected size. It is called “power” and can be thought of as the procedure’s sensitivity. Usually, a power of 0.8 or higher is considered sufficient, but in practice it is seldom achieved or even calculated. We mentioned previously that the hypotheses tested in the most common statistical tests tend to be low in content, because they only predict “there will be a difference” without specifying the effect size. Specifying the effect size is therefore not only needed for power-calculations, but also makes the theory more falsifiable. In neuroscience, the median study has a power of only 0.2, with the hope that meta-studies aggregating the results will compensate for it. A non-significant result in an underpowered study is meaningless, because the study never stood a chance of finding what it was looking for. Underpowered studies also run the risk of “truth inflation”: observed effect sizes vary randomly, and if the power is low, only effects that by chance come out very large will be statistically significant. By reporting such an effect size, you are over-estimating its future magnitude, since it will regress towards the mean.


Details of the reference class

The essence of null-hypothesis testing thus is to control long-term error, and to do this, the test procedure needs to be specified in finest detail, so that the reference class – the infinite number of replications that never happened – is well-defined. This includes the number of tests for different hypotheses we perform as part of the whole testing procedure – the “family” of tests. Each test does not stand on its own. The reference class then comprises an infinite number of replications of this whole family, so if we still wish to have false alarms in only 5% of our replications, then we need to provide for the fact that several tests increase the chance of a misleading significant result being found somewhere in the procedure. This inflation of the “family-wise error rate” can be understood as follows: if, for each test in the family, the probability of not making an error is (1 − alpha), then the probability of making at least 1 error in k independent tests is the complement of (1 − alpha)^k, i.e. 1 − (1 − alpha)^k, which increases with k. Thus, the overall long-term false alarm rate is no longer controlled. Intuitively, by exposing ourselves to more chance events for each trial, there is more opportunity for chance to yield a significant result, and to curb this, we enforce much more conservative criteria. The most common and least sophisticated way to control this error rate is to let each test’s alpha be 0.05/k, something known as Bonferroni correction.
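
The growth of the family-wise error rate, and the effect of the Bonferroni correction, can be tabulated directly from the formula. The sketch assumes the k tests are independent.

```python
def family_wise_error_rate(alpha, k):
    """Probability of at least one false alarm in k independent tests when all nulls are true."""
    return 1 - (1 - alpha) ** k

for k in (1, 5, 20):
    uncorrected = family_wise_error_rate(0.05, k)
    bonferroni = family_wise_error_rate(0.05 / k, k)  # each test run at alpha / k
    print(k, round(uncorrected, 3), round(bonferroni, 3))
```

With 20 uncorrected tests the chance of at least one false alarm is about 64%; with the correction it stays just under 5%.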


The presence of multiple comparisons can be subtle. Consider, for example, how in neuroscientific studies, you test for increased activity in every 3D-pixel of the brain – the required multiple comparison correction is massive. There is, furthermore, an intuitively disturbing aspect to the concept of a test family. Bonferroni correction reduces statistical power dramatically. If the researcher had planned a priori to only perform one comparison (and pre-registered her statistical protocol), then the reference class can be re-defined as “Collect the data, perform the planned comparison at the set alpha and then perform the other tests for curiosity’s sake, Bonferroni-corrected”. Also, if the same comparisons had been part of different experiments, there would be no need for correctives. Again, the reason for this eerie feature is that the reference class needs to be well-defined.


Another reference class specification is that of when to stop testing, “stopping rules”. Suppose we recruit subjects gradually and perform tests as we go along – a procedure known as “sequential analysis”. Testing until you have a significant effect is obviously not a good stopping rule, since you are guaranteed to get a significant effect if you wait long enough. In medical research, however, sequential testing is often ethically required, and certain correctives are used to account for multiple testing. Still, we cannot tell whether we stopped due to luck or due to an actual effect, so the effect is likely to be inflated. Notably, whereas in the Bayesian and likelihood approaches, we could continue to collect data for as long as we wanted, now, if we get a non-significant result for our 40 subjects and decide to test another 10 subjects, the p-value would refer to “Run 40 subjects, test, if not significant, run another 10”. The false alarm rate of this procedure is bound to be above 0.05, since at n=40, there was a 5% chance of false alarm, and at n=50 there was another opportunity for false positives. Meanwhile, if a colleague decides in advance to test 50 patients, there is no need for adjustments, because his test is a member of another reference class. Different stopping rules that coincidentally agree to stop at the same size can thus lead to different conclusions.
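
The “run 40, test, if not significant, run another 10” procedure can be simulated to see what its long-run false alarm rate looks like. This is a rough sketch under simplifying assumptions: the null is true, scores are standard normal with a known SD of 1, and a two-tailed z-test at the 5% level is used at both looks. The number of simulations and the seed are arbitrary.

```python
import math
import random

def significant(values, critical=1.96):
    """Two-tailed z-test of 'mean = 0', assuming a known SD of 1."""
    z = sum(values) / math.sqrt(len(values))
    return abs(z) > critical

def false_alarm_rate(n_simulations=20000, seed=1):
    """Share of simulated experiments that reject a true null under the two-look procedure."""
    rng = random.Random(seed)
    false_alarms = 0
    for _ in range(n_simulations):
        data = [rng.gauss(0, 1) for _ in range(40)]   # the null is true by construction
        if significant(data):
            false_alarms += 1
            continue
        data += [rng.gauss(0, 1) for _ in range(10)]  # not significant: run another 10
        if significant(data):
            false_alarms += 1
    return false_alarms / n_simulations

rate = false_alarm_rate()
print(rate)
```

The simulated rate comes out above the nominal 5%, even though each individual look uses the 5% criterion.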

Result interpretation


The p-value is the probability of obtaining such data, the probability of the t-statistic. Therefore, if very small, this does not indicate a larger effect size. A small difference can be significant while a large difference is insignificant. Nor is the p-value a measure of how important it is. Another frequent mistake is to interpret a p-value as a Bayesian posterior (as a probability of the hypothesis being true), or as an indicator of evidential strength. Both the virtue and weakness of the Neyman-Pearson approach lie in just how mercilessly binary it is. If p=0.051, accept null. If p=0.049, reject null. If p=0.00.., reject null too. There is no such thing as “different degrees of significance”, and you are not justified to switch alpha to 0.001 if you get a p-value below that value, as this would undermine its purpose. For the Neyman-Pearson framework, the p-value contains no information other than whether it is past the critical threshold. Because it depends on things other than strength, such as stopping rules and whether the test is one-tailed or two-tailed, it cannot be used as an index for evidential strength. If you decide to test more after a non-significant result, the p-value would increase, indicating less evidential strength, even though intuitively the evidential strength would increase.

Another important aspect is that two tests, say A vs. C and B vs. C, cannot be interpreted as “A was significantly better than C, while B was not significantly better than C, thus A is better than B” without directly comparing A and B. This is because of the arbitrariness of the alpha-level (one result could be slightly below it, the other slightly above) and because statistical power is limited.


There is an alternative, more informative and straightforward reporting strategy known as “confidence intervals”. By calculating the set of mean values that are non-significantly different from your sample mean at your chosen alpha, mean ± 1.96 S.E., you will obtain an interval that, 95% of the times that you replicate the procedure, will include the true population mean. In essence, it gives you the range of answers consistent with your data. If it includes both the null-predicted mean and the alternative-predicted mean, the result is non-significant, but it also tells you directly that the sensitivity is insufficient. Moreover, the width of the interval indicates your precision. If it is narrow and includes zero (null), the effect is likely to be small, while if it is wide the procedure may be too imprecise to make any inference. Widths, like data, vary randomly, but at the planning stage of the experiment, it is possible to calculate the required sample size so that the interval will be of a desired width 95% of the time (a number known as “assurance”). Confidence intervals are particularly useful for comparing differently sized groups, since the certainty of a large sample size will be reflected in that interval’s precision. For these reasons, confidence intervals are preferred, but even in a journal like Nature, only 10% of the articles report confidence intervals, perhaps because the finding may seem less exciting in light of its width, or because of pressures to conform.
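
The interval described above is simple to compute. The sketch below uses the 1.96 multiplier from the text, which is a normal approximation; for small samples a t-based multiplier would be somewhat wider. The sample values are invented.

```python
import math
from statistics import mean, stdev

def confidence_interval_95(sample):
    """Mean plus or minus 1.96 standard errors (normal approximation)."""
    m = mean(sample)
    standard_error = stdev(sample) / math.sqrt(len(sample))
    return m - 1.96 * standard_error, m + 1.96 * standard_error

# Hypothetical measurements, for illustration only.
sample = [2.1, 3.4, 2.9, 3.8, 2.5, 3.1, 3.6, 2.7]
low, high = confidence_interval_95(sample)
print(round(low, 2), round(high, 2))
```

Since this interval excludes zero, a null hypothesis of “mean = 0” would be rejected at the 5% level, and the width of the interval shows how precisely the mean has been pinned down.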

Comparisons between approaches

In his book “Rationality for Mortals”, psychologist Gerd Gigerenzer once compared the three approaches to the three Freudian selves in unconscious conflict. The Bayesian approach corresponds to the instinctual Id, who longs for an epistemic interpretation and wishes to consider evidence in terms of the hypothesis-probabilities. The likelihood (or “Fisher”) approach corresponds to the pragmatic Ego, which, in order to get papers published, ritualistically applies an internally incoherent hybrid-approach that ignores beta and power-calculations and determines samples by some rule of thumb. Finally, the Neyman-Pearson approach corresponds to the purist Superego, which conscientiously sets alpha and beta in advance, reports p-values as p<0.05 rather than p=0.0078 and is aware that the p-value does not reflect a degree of confidence, but rather serves as a decision-supporting quality control.

The most important conceptual difference between Bayes/likelihood and Neyman-Pearson is that the former espouse “the likelihood principle” – the likelihood contains all the information required to update a belief. It shouldn’t matter how the researcher mentally groups tests together, or what sample size he plans. Stopping rules, multiple testing and timing of explanation don’t matter, and credibility or likelihood intervals do not need to be modulated by them. The likelihood interval will cluster around the true value as data is gathered, regardless of the stopping rule. However, standards of good experimental practice should still apply. Suppose, for example, your Bayesian brain hypothesized “It is dark in the room” and sampled information only when the eyes are closed. Due to a failure to randomize sampling and differentiate between the two hypotheses “Eyelids down” and “Dark outside”, it is likely to arrive at the wrong beliefs.

Importantly, the approaches sometimes lead to very different results. As shown below, the Neyman-Pearson approach could accept the null hypothesis in cases where the evidence clearly supports the alternative hypothesis.


Selection effects in scientific practice

A number of criticisms have been raised against the Neyman-Pearson approach. One is that an internally incoherent hybrid-version is often taught, in which p-values are reported as p=0.0001, even if the alpha a priori was set at 0.05, which is the only thing that should matter. Another is that the hypotheses it tests are typically low in content. A third is that it invites cheating, since the penalties for collecting more data after a non-significant result mean that a lot of expensive data may be wasted. A fourth is that it is often poorly implemented, without prior power-calculations. Combined with the “truth inflation” phenomenon, this means that real effects whose reported sizes were exaggerated may be unfairly dismissed, because a replication will have its power calculated from the initial, inflated effect size, not the true, smaller one, and will consequently be underpowered.


A fifth is that the binary nature of the Neyman-Pearson approach also invites “publication bias”, in which both scientists and journals prefer unexpected and significant results, as a kind of confirmation bias writ large. If an alpha=0.05 experiment on a true null is replicated 20 times, on average 1 of the replications will be significant due to chance. If 20 independent research teams performed the same experiment, one would be expected to get a significant result simply by chance, and that team certainly would not feel merely lucky. If journals choose only to publish significant results, neglecting the 19 non-significant replications, it fosters something like a collective illusion. Moreover, it is not unusual to tweak, shoehorn and wiggle results – or to be selective about what data to collect in the first place – to make it past that 5% threshold, “torturing data until it confesses”.


How can we unravel these illusions quicker? How can we attain maximum falsifiability, maximum evolvability, to accelerate progress?

  • Meta-research: There are statistical tools for detecting data-wiggling, and a lot of initiatives aiming to weed out dubious results.
  • Improving peer-review: Studies in which deliberately error-ridden articles have been sent out for peer-review indicate that it may not be a very reliable way to detect errors. Data is rarely re-analyzed from scratch.
  • Triple-blindedness: Let those who perform the statistical tests be unaware of the hypothesis.
  • Restructuring incentives: Replications are extremely rare because they are so thankless in an industry biased towards pioneering work. Science isn’t nearly as self-corrective as it wishes it were. We must reward scientists for high-quality and successfully replicated research, and not the quantity of published studies.
  • Transparency: To avoid hidden multiple comparisons, encourage code- and data-sharing policies.
  • Patience: Always remain critical of significant results until they have been robustly and extensively validated. Always try not to get too swept away by the hype.
  • Flexibility: Consider other statistical approaches than Neyman-Pearson.

But ultimately, we have to accept that there is no mindless, algorithmic slot-machine path to approximate truth. Science writer Jonah Lehrer put it as follows: “When the experiments are done, we still have to choose what to believe.”




OTOOP Part I.XI: The Normal Distribution

Populations and variables

There would not be much point to science unless it produced knowledge that was otherwise not available to us. The goal of science is to generalize into the past, present and future so as to reliably infer things that we cannot observe directly. Yes, some areas of science are of a localized, fine-grained variety, but the goal is ultimately to use this in order to detect patterns that latch onto system regularities, allowing us to re-engineer Nature in adaptive, self-serving ways.

As part of this pragmatic means-end analysis, and in a truly astounding feat by human cognition, we mentally parse our events into meaningful categories, called populations, upon which we project variables – possibility spaces of potential states. “Mankind” is an example of a population, with a variable such as “gender”, but so are “upper-class Norwegians”, “swordfish”, “all earthquakes” and “all theoretically possible coin tosses”. Moreover, member states can be distinguished categorically – based on how you interact with them in qualitatively different ways – or numerically, if they differ on a quantitative dimension. Statisticians have tried to erect complex taxonomies for the different types of variables, but as products of our brains, there is a degree of arbitrariness in how we frame a variable. For example, the variable “color” can be considered either categorical (including blue, green, red…) or numerical, based on electromagnetic frequency. Indeed, any continuous measurement can be chunked into separate groups, like “tall” and “short”, to simplify analysis, at the cost of less fine-grained results.


In a population, a variable’s possible states may vary in their relative frequencies. Based on an observed subset of a population, known as a sample, we hope to find a way to predict what state a system will assume, or, at the very least, how certain we should be about a particular outcome, to support our decision-making. We are, in other words, looking for probability distributions and how different factors change them.

Probability distributions

The shape of a probability distribution reflects – albeit very indirectly and cryptically – logical properties about the generating mechanism underneath. An economist may infer from a national income distribution whether the economic system is socialist or capitalist. A gambler may infer that a roulette wheel is biased. Some shapes are rather ubiquitous and mathematically elegant, indicating general organizing principles in Nature. Two populations could generate a probability distribution of the same shape, though one could be wider and the other taller, and they may apply to different types of scales. Quantities that define the particulars of a distribution, apart from its general shape, are known as parameters. Based on corresponding quantities of the sample, known as statistics, the hope is to infer the population parameter, which, given the shape, holds the key to the probability distribution we are looking for.

Given a set of samples of a variable, we are curious about whether or not we should re-carve our reality and regard them as separate categories from different populations. As usual, whether this is meaningful or not depends on whether it affords us any predictive power. Because to be a member of different categories means having different parameters, parameters can be thought of as manipulable knobs, whose settings remain fixed as the system changes dynamically. An experiment effectively asks whether an observed change is attributable to different knob-settings.


Indeed, the metaphor of the system as a man-made contraption has a strong appeal: you may imagine yourself as an archeologist who unearths a complex, mechanical device that has no obvious purpose. You search for ways of adjusting it (you distinguish and alter its “parameters”) and observe its effects (the population is its behavior at each instant in time). You may turn a knob and find that the machine emits a sound, suggesting that the alteration implied a change in parameter values. Or, it could be due to some other change: maybe, just as you turned the knob, a confluence of events independent of your action caused the sound. Ideally therefore, you would like to rewind time and see if it would happen in the absence of your turning the knob, but because the laws of spacetime won’t allow such counterfactual exploration, you assume that the times that you do not turn the knob represent this scenario.

Experimental designs

This idea – of creating “fake counter-factuals” – is central to experimental designs. The manipulation made by the researcher is known as the “independent variable”, while the results are measured along a “dependent variable”. To fake parallel universes, either the same subjects undergo all the different treatments at different points in time, making it a “within-subject variable”, or different subjects each undergo only one treatment, which could be done simultaneously, making it a “between-subject variable” where there are no dependencies between data. Which design is preferred depends on the variables – in general, the former leads to less noise (since the subject is constant across the conditions) but the latter has fewer confounds (since there are no practice effects and the like).


While it is practically impossible to account for all conceivable influences, our failure to do so won’t pose a problem if their aggregated influence is balanced across all conditions. This is ensured by randomization, in which an experimenter assigns subjects to conditions using some form of random number generator, so that each subject has an equal probability of ending up in either condition. No category of subjects will be systematically biased to receive one treatment over the other. Uncontrolled, “extraneous” differences in gender, mood, height, etcetera, would thus tend to be cancelled out by chance. Hence, an uncontrolled extraneous variable is not in itself a valid criticism of a randomized experiment. Balancing-by-randomness, however, is not guaranteed, so group allocations are typically examined afterwards to make sure that they are not grossly imbalanced on the most plausible confounds. Sometimes random assignment is impossible. If, for example, gender is manipulated as a between-subject variable, height and other variables will co-vary with this, since men generally are taller, making it “quasi-experimental”. If height is a priori judged to be a potential confound, the groups will therefore have to be matched instead.
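The balancing effect of randomization can be sketched in a few lines of Python. This is only an illustration under assumed numbers: the heights, the group size and the helper name `randomize` are made up here and do not come from any real experiment.

```python
import random
from statistics import mean

def randomize(subjects, rng):
    """Shuffle the subjects and split them into two equally sized conditions."""
    shuffled = subjects[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

rng = random.Random(0)  # fixed seed, only so that the sketch is reproducible

# Hypothetical extraneous variable: the height (in cm) of 1000 subjects.
heights = [rng.gauss(170, 10) for _ in range(1000)]

treatment, control = randomize(heights, rng)

# With random assignment, the groups should differ little on the
# uncontrolled variable, even though nobody measured or matched it.
imbalance = abs(mean(treatment) - mean(control))
print(f"difference in mean height between groups: {imbalance:.2f} cm")
```

With smaller groups the imbalance will on average be larger, which is why allocations are checked afterwards.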


For a better understanding of aggregated randomness, let us now focus on natural populations with the most famous distribution of all – the normal distribution.

The concept of additive effects

Why are normal distributions so normal? The answer is partly selective reporting. In physical systems, an indefinite number of random (as in “unaccounted for”) factors combine to affect an outcome. These combinations are sometimes, but not always, additive. For example, height can be seen as the sum of genetic and dietary contributions. The combined effect is the same as the sum of separate effects, as if you were exposed to both separately. Such systems are called “linear” and they dominate education and our intuitions because they are so simple and learnable. In nature, phenomena typically behave linearly only over restricted ranges: eating may increase height, but only until obesity starts affecting your posture. To the extent that such a quantification makes sense at all, only a small subset of all effects are linear.

Others are non-linear, in which an increase in input will not have a proportionate effect, because internal system boundaries truncate linearities. “Synergistic” and “antagonistic” effects are greater and less than the sum, respectively, as if the diet could meddle with your DNA to enhance or reduce its own influence on your height. Non-linear systems are thus characteristically unpredictable, and often not a tractable object for scientific study. Therefore, if normal distributions seem pervasive out in the wild, it is because the underpinning system’s single, strong phase space attractor makes them salient to us. It gives them a characteristic scale in which extreme outcomes are extremely rare (i.e. the difference between the tallest and shortest scores is quite small, which is not true for e.g. power-law distributions), and, crucially, a symmetrical unimodal shape with a meaningful average to which the system regresses over time.


The central limit theorem


The mathematical idealization of the normal distribution can be understood by considering a Quincunx, a mechanical device developed by Sir Francis Galton in the 1800s. Balls are dropped on a triangular array of pins on a vertical board, so that each ball bounces either left or right with 50/50 probability as it hits a pin. After the last level, each ball falls into a bin, where the balls stack up. The stacks that result will roughly form a binomial distribution. This is because the 2^n possible paths through the system (where n is the number of levels), though of equal length, differ in the number of lefts and rights, and there are more possible paths than there are possible L-R combinations. If we represent L and R by 0 and 1, we may think of combinations as sums. For example, only one path, corresponding to only rights, leads to the rightmost bin, but many different paths have half L, half R. The binomial distribution gives us the expected frequencies of all these possible combinations (sums). Each pin in a path (i.e. level, the number of which is n) is a trial with a particular probability (p). The distribution is defined by these two parameters – n and p – with a notation of B(n,p). As n increases, the proportions of balls in each bin for B(n,0.5) approach what we call the normal distribution. The mathematics of the binomial distribution is explained below.
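The path-counting argument can be checked directly. The sketch below assumes a board with n = 10 levels (an arbitrary choice for illustration) and uses the standard binomial formula, which counts the paths into each bin and weights them by their probability.

```python
from math import comb

n = 10   # number of pin levels (assumed for illustration)
p = 0.5  # probability of bouncing right at each pin

# Number of distinct paths ending in bin k, i.e. the number of ways to
# place k rights among n bounces.
paths = [comb(n, k) for k in range(n + 1)]

# B(n, p): probability that a ball makes exactly k rights in n bounces.
probs = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

print(paths)       # one path to each outermost bin, most paths in the middle
print(sum(paths))  # all 2**n paths are accounted for
```

The outermost bins are each reached by a single path, while the middle bin collects 252 of the 1024 paths, which is why the stacks bulge in the centre.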


The quincunx provides physical evidence for a general statistical fact about systems in which many factors contribute to a quantity additively, known as the “Central Limit Theorem”. It states that, for any random variable with finite variance, as the number of outcomes summed becomes large, the distribution of their sums will approach a normal distribution. To understand this, consider how, for a discrete random variable such as a die, some sums will be more common than others, as a consequence of there being more ways in which they can occur. Thus, for two fair dice, 7 is a more likely sum than 12. This holds, maybe counter-intuitively, regardless of the initial probability distribution. The die may be so biased that, in 95% of all cases, 6 will come up, and 1% each for the rest, making (6,6,6) an abundantly likely outcome. Nevertheless, as the number n of samples that you sum increases, the other outcomes will begin to assert themselves. For example, if you sum 100 samples, 1,2,3,4 and 5 will each appear once on average, resulting in an expected sum of 585, slightly less than 6*100. Sometimes, your n=100 sum will be more or less than this, but values near it will be the most common, and form the middle of the normal distribution that results as you collect more and more n=100 sums, for they are the most frequent sums in the space of all possible n=100 sums, just as 7 is the most common n=2 sum for fair dice.
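The biased die makes for a quick simulation. The sketch below uses the weights from the example (95% for a six, 1% for each other face); the seed and the number of repetitions are arbitrary assumptions.

```python
import random
from statistics import mean

faces = [1, 2, 3, 4, 5, 6]
weights = [0.01, 0.01, 0.01, 0.01, 0.01, 0.95]  # the heavily biased die

# Expected value of a sum of 100 rolls: 100 * (0.95*6 + 0.01*(1+2+3+4+5)).
expected_sum = 100 * sum(f * w for f, w in zip(faces, weights))

rng = random.Random(1)  # fixed seed, only for reproducibility

def sum_of_rolls(n):
    """Roll the biased die n times and add up the faces."""
    return sum(rng.choices(faces, weights=weights, k=n))

# Collect many n=100 sums; their histogram is what approaches a bell shape.
sums = [sum_of_rolls(100) for _ in range(5000)]

print(expected_sum, mean(sums))
```

Plotting a histogram of `sums` should show a roughly bell-shaped hump around 585, even though a single roll is anything but bell-shaped.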


The fact that a phase space of possible outcomes translates, over time, as additive effects accumulate, to a clear, central value is more famous as the “Law of large numbers”. It states that a larger sample is less likely to be affected by random variation, since fluctuations will cancel each other out. It is like superimposing many diffuse images, so that the randomness is averaged away and the signal pierces through. So there are two distinct reasons why statistics as a discipline is so strongly associated with the normal distribution. Confusing the two can cause a lot of headache:

  • It is common, but by no means universal, in Nature “out in the wild”, because the variables that we find salient and are curious about are naturally those with a characteristic scale and typical value, which generally is the result of additive effects by many, random events.
  • Statisticians may, for a distribution that is not normal, collect samples and sum them, and the distribution of these sums will be approximately normal.

The parameters of the normal distribution

What, then, are the parameters of the normal distribution? An idealized, normal distribution is fully determined given two quantities: its midpoint and its width. As already mentioned, normal distributions are special, because their symmetry and short tails mean that they have a “central tendency” that can be used as a model to predict values and summarize a dataset. When the distribution is skewed, it becomes useful to distinguish between different measures of central tendency (mode, median and mean), but for a bell-shaped symmetrical one, these are the same, and coincide with the mid-point. However, because empirically derived distributions are never perfect, the arithmetic mean, which takes the most information into account, is the one used, and therefore the mid-point population parameter is the mean, even though it equals the others.


The mean, effectively, is like the fulcrum of a balanced pair of scales poised at the center of all deviations from itself. It is defined as the point where deviations from it sum to zero. The distribution’s width, roughly the typical deviation from the mean, is called the “standard deviation”. The wider the distribution, the poorer the mean will be as a predictor, and the more noise in the data. The intuitive explanation for its formula leaves it slightly under-determined – it is motivated by the mathematical equation of the normal distribution – but is shown below.
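Both definitions can be verified on a small, made-up dataset. The sketch uses the population form of the standard deviation (the square root of the mean squared deviation); the data values are chosen only so that the arithmetic comes out round.

```python
from math import sqrt

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements

mean = sum(data) / len(data)

# Deviations from the mean balance out: they sum to zero.
deviations = [x - mean for x in data]

# Standard deviation: square the deviations (so they no longer cancel),
# average them, and take the square root to return to the original unit.
sd = sqrt(sum(d ** 2 for d in deviations) / len(data))

print(mean, sum(deviations), sd)
```

Squaring rather than taking absolute values is the step that intuition alone does not dictate; it is the form that appears in the equation of the normal distribution.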


Given a population that is known to be normal, statisticians can use the mean and the standard deviation to calculate a density function that associates a particular value-range with a probability (a single point-value has an infinitesimally small probability, so we can only consider intervals). Because regardless of the parameters, the same location of a range relative to the mean (expressed in terms of standard deviations – a length unit, remember) has the same probability, it makes sense to re-express a certain value in terms of “distance from the mean” for a standard distribution. It is the equivalent of two tradesmen translating the value of their goods to a shared currency, and values re-expressed in this way are called z-scores.
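As a minimal sketch of the “shared currency” idea, the snippet below converts values from two hypothetical normal populations (the means and standard deviations are assumptions for illustration) into z-scores, and uses Python's `statistics.NormalDist` to look up the probability of an interval.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Distance of x from the mean, measured in standard deviations."""
    return (x - mean) / sd

# Two hypothetical populations on entirely different scales.
z_height = z_score(190, mean=170, sd=10)  # a height in cm
z_iq = z_score(130, mean=100, sd=15)      # a test score

# On the standard normal distribution (mean 0, SD 1) an interval has the
# same probability whatever the original units were.
standard = NormalDist()
p_within_one_sd = standard.cdf(1) - standard.cdf(-1)

print(z_height, z_iq, round(p_within_one_sd, 4))
```

Both values land two standard deviations above their respective means, so they are equally unusual within their own populations.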


Now recall that the Law of large numbers implies that large and unbiased samples will resemble the population from which they come. The probability that any sample will deviate greatly from the population is low. Now imagine a weirdly distributed population, of known mean and SD, real or just a theoretical phantom, and, for a certain sample size, consider the space of all possible samples (i.e. subsets) that could be drawn from it. Take the mean of each of those samples, erect a histogram, and call this the “sample distribution of the mean”. You may regard it as a population in its own right. Because of the central limit theorem (some averages are more common than others), this distribution will be more bell-shaped than the original distribution. The bigger the sample size, the smoother it will be, and the better any theoretically based estimate will be.

For reasons that are intuitively clear, the population of the means of all possible samples drawn from a population distribution will have the same mean as the population itself. This means that there is a high likelihood that the samples you draw will have statistics that are similar to the parameters. Therefore, if you have your sample and know the population parameters, you may estimate how likely it is that the sample is indeed drawn from this population. If your sample mean has a score far to the right of the mean, the conditional probability P(score far away| score from population) is low.
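The distribution of sample means can be approximated by brute force. The sketch below assumes a right-skewed population drawn from an exponential distribution, and draws a few thousand random samples rather than enumerating every possible subset; population size, sample sizes and seed are all arbitrary choices.

```python
import random
from statistics import mean, pstdev

rng = random.Random(2)  # fixed seed, only for reproducibility

# A deliberately non-normal (right-skewed) hypothetical population.
population = [rng.expovariate(1.0) for _ in range(20000)]

def sample_means(n, repeats=2000):
    """Draw `repeats` random samples of size n and return their means."""
    return [mean(rng.sample(population, n)) for _ in range(repeats)]

means_small = sample_means(5)
means_large = sample_means(50)

# The means cluster around the population mean, and more tightly so
# for the larger sample size.
print(mean(population), mean(means_large))
print(pstdev(means_small), pstdev(means_large))
```

A histogram of `means_large` should also look far more symmetrical than a histogram of `population`.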

Now suppose instead that the population distribution is unknown. Usually, your sample is the only information you have. Again, our goal is to construct a sample distribution of the mean to gauge where our sample falls on it. To do this, we need to know its width, that is, how much samples are expected to vary, in other words, the standard deviation of the sample means. We know that:

  • For a large sample size, there will be less variability (since noise will be cancelled out, causing means to cluster tightly).
  • For a large population SD, there will be more variability (since how much it is expected to vary depends on how much it actually varies). As said, the population SD is not available, so we have to base it on sample SD instead.

This calls for a mathematical formula with the sample SD as numerator (so that big SD causes a bigger value) and sample size as denominator (so that big N means a smaller value), and it goes by the name “Standard error of the mean” (which has the added quirk that the square root is taken from the denominator, because of the mathematics of the normal distribution).
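Written out, the formula is the sample SD divided by the square root of the sample size. A minimal sketch with made-up measurements (`standard_error` is a helper name introduced here for illustration):

```python
from math import sqrt
from statistics import stdev

def standard_error(sample):
    """Sample SD divided by the square root of the sample size."""
    return stdev(sample) / sqrt(len(sample))

sample = [4.1, 5.3, 4.8, 5.9, 5.0, 4.6, 5.4, 5.1]  # hypothetical data

se = standard_error(sample)

# More data with a similar spread: the standard error shrinks.
se_more_data = standard_error(sample * 4)

# Same amount of data, but twice the spread: the standard error doubles.
se_more_spread = standard_error([2 * x for x in sample])

print(se, se_more_data, se_more_spread)
```

Note that `statistics.stdev` computes the sample SD with n - 1 in the denominator, the usual choice when the SD is used to estimate the population value.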


To find how well parameter values predict a certain sample we need to calculate the equivalent of the z-score. However, this time, since our evidential support is so weak (we, for example, use sample mean to estimate population mean), modelling the sample distribution of the mean as a normal distribution would give us overconfidence in our estimates. We need a higher burden of proof. For this purpose, statisticians have come up with a distribution that has fatter tails (implying a larger z* critical value), which, cleverly, approaches the normal distribution for large sample sizes.

Thus, we have a way for estimating the conditional P(data|hypothesis) in a way that depends on sample size. It is from this point onwards that the different statistical procedures diverge in their prescriptions.



OTOOP Part I.X: Surface Tension – The Philosophy behind Statistics


The Post Hoc and the A Priori

If you ever find yourself forced to summarize the ideas of western philosophy in a single metaphor, then the concept of a “self-modifying filter” may be your safest bet. Observation and measurement, according to this idea, are acts of filtering. A filter has a built-in bias – it lets some elements pass through but not others – and the categories embodied in this bias are the source of features and constancies in a universe that ultimately bathes in undifferentiated and structureless flux.

In the philosophical literature, the filter-metaphor hides under the distinction between “a priori”/“deduction” and “a posteriori”/“induction”. The former involves facts that are definitional in nature, like mathematical proofs or how all bachelors are unmarried, while the latter refers to facts derived from experience. For an appreciation of just how central they are, consider the following intellectual heavyweights:

  • René Descartes, of “cogito ergo sum” fame, is known as a “rationalist” for how he emphasized the importance of pre-formed knowledge in the acquisition of new knowledge.
  • George Berkeley, in the early 1700s, advocated a view in which external reality did not exist unless perceived, and this stimulated different versions of “idealism”, according to which we actively create our world through mind-dependent categories.
  • David Hume, writing a little bit later, pondered in his Treatise of Human Nature how sense impressions are separate events in the mind, with causal relationships not directly perceived but projected upon them. His “empiricist” brand is known for rejecting a priori principles altogether.
  • Immanuel Kant’s prescient “Critique of Pure Reason” from 1781 is about how the mind’s a priori knowledge of space and time organizes our sensations.
  • Noam Chomsky entered the academic consciousness in the 1960s with his idea that babies are born with knowledge about how to transform linguistic expressions, and that their mother tongue is merely a parameter to fine-adjust this innate grammar.

The 20th century brought us more sophisticated tools to study the learning process in detail, and upon doing this, the deduction-induction dichotomy tends to work best if we consider it as a snapshot in time of what is actually a continuous feedback-loop. From a psychological viewpoint, a priori knowledge could be said to refer to the offline mental manipulations we sometimes perform on our cognitive entities to lay bare facts latent in the way our concepts are stored. For example, if we, via environmental interactions, form the category of “Men” with the core property of being mortal, and then proceed to categorize Socrates as a man, then the mortality associated with men will also pertain to Socrates. If Socrates turns out to be immortal, it becomes a matter of either adjusting the category properties, or brushing it over.

The most natural way to think of a filter is as a passive separator, but there is nothing to prevent us from reversing figure and ground and instead conceptualize the filter as an active inquirer. If so, the bias can be thought of as a hypothesis that the filter asks the impinging dynamic to feed back a response to, indicating what category it belongs to. And importantly, a filter is limitless in what categorizations it could perform. Just like how the mesh-size of a fishing net determines the size of the fish caught, and the polarity of a cell membrane determines what particles may pass, we may divide mankind into genders or ethnicities, or maybe “people who like broccoli” and “people who don’t”, and in an infinite number of other ways.

For an inquirer, the kind of hypotheses he can ask is equally limitless. There is an indefinite number of random variables whose outcome frequencies we may keep track of, of data we could collect, of relationships to potentially explore, but we cannot and should not, for that would undermine its very purpose. It would be equivalent to a filter that lets everything pass through and as a result accomplishes nothing – the filtrate would not possess any more structure than the dynamic in its raw form.

Data therefore are selection effects, restricted by the finite number of hypotheses we select for our inquiries. This is a very, very important point, central to most of science, because believing our hypotheses to be exhaustive of the space of all conceivable hypotheses has historically led to fallacious conclusions, on matters that span everything from the atomic to the cosmic and theological. So, before we address the more mundane matters, like statistical procedures used at our own scale of existence, we might as well start off on a grandiose note.

The ill-defined possibility space

Firstly, selection effects have the potential to resolve much of the disquietude we feel regarding quantum nondeterminism. If a quantum experiment is repeated – such as firing a photon at a screen with two slits it could pass through – then the relative frequencies of different outcomes (left slit versus right slit) can be predicted, but each individual instance appears to be irreducibly governed by chance, with any “hidden variables” cleverly ruled out. The idea is that the particle passes through both slits in a state of “superposition”, described by a mathematical entity called the “wavefunction”, which states the probability of it being found anywhere in the universe. The wavefunction unfolds deterministically according to the Schrödinger equation. Then, upon observation (e.g. by turning on a particle detector behind the slits), the Copenhagen interpretation by Bohr and Heisenberg states that the wavefunction “collapses” into a unique state.

However, in 1957 Hugh Everett pointed out that nothing in the mathematics implies a collapse. Schrödinger’s equation could continue to evolve the wavefunction. The universe does not metaphysically “split” into non-interacting branches – the superposition remains as a single wavefunction – but as soon as it transfers information to something else, like an air molecule, it in effect becomes unobservable. By becoming correlated with the environment, it thus “decoheres”, and because the particle superpositions in neurons decohere faster than they fire, we cannot experience parallelism at macro-scales.

To adapt an analogy by physicist Max Tegmark, we could imagine ourselves being unknowingly cloned into ten copies while asleep, with each clone being placed to wake up in a room with a different number on the wall (ranging from 0 to 9). Upon waking up, then subjectively the number 6 on the wall would seem random, but if we had access to the parallel worlds, then finding that each numeral is represented would make it feel deterministic. Similarly in quantum experiments, we only see one out of all the logically conceivable outcomes contained in the wavefunction. While the “Many worlds” interpretation may not yet be empirically testable, it (bizarrely) has the virtue of parsimony, and it has become increasingly respectable.


Telescoping up to cosmic scales, awareness of selection effects forms one of the key methodological principles of cosmology. The cosmological origin story is that the early Universe was maximally simple, symmetric, and – at least in some places – low in entropy, but that expansion and consequent temperature fall caused these symmetries to break. According to the “inflationary hypothesis”, a special form of matter accelerated the expansion, causing some regions to inflate more than others. Therefore, in a different kind of multiverse theory, there could logically be an infinite number of other universes beyond the visible horizon, and ours just happened to be one that inflated enough to allow sentient, carbon-based life to evolve.

Point is: regardless of how intrinsically improbable it is, we would necessarily find ourselves in a Universe that can support us. In what is known as the “Weak Anthropic Principle”, the observed universe should not be considered as coming from some unconstrained space of possible universes, but from the life-supporting subset thereof. If this principle is neglected, erroneous conclusions will be drawn. For example, Paul Dirac wanted to revise the law of gravitation in light of a coincidence between a constant of Nature and the age of the Universe, but without that coincidence, there would have been no Paul Dirac there that could be preoccupied with such fine tunings!


Selection effects also figure heavily in discussions regarding the eerie nature of mathematics, and as physicist Eugene Wigner said, its “unreasonable effectiveness” in physical predictions. Newtonian physics gives the impression that we live in a universe of perfect spheres and parabolas. There are many examples of mathematical curiosities that have been collecting dust for decades and suddenly find themselves elegantly applied to some newly discovered phenomenon, and high-energy physics portrays the fundamental laws to be astoundingly tidy and integer-laden.

However, the Universe is more than deep symmetries – it would not be fully specified without its initial conditions. Something in their interaction appears to have generated a cosmos of perplexingly rich structure, dynamical systems and nested hierarchies, and unlike the deep symmetries themselves, their outcome is far from mathematically elegant. With the advent of computers and big data, there is a growing appreciation for just how disorderly the universe is. In biology and sociology, attempts at mathematical formalisms are taken as cartoonish simplifications.

Philosopher Reuben Hersh argues that the vision of a clockwork-universe is an illusion arising from how we disproportionately focus on phenomena that are amenable to mathematical modelling. The concept of cardinal numbers (“put in 5 balls, then 3 into a container. The prediction is that it will contain 8 balls”) would break down for water drops or gases. The prediction would hold only if humans went on to invent the concept of mass and volume. The tools were selected based on their predictive abilities. As the saying often attributed to Mark Twain goes: “To a man with a hammer, everything looks like a nail.”

The problem is that there is no way for us to quantify all the “phenomena” to calculate what proportion of these are “orderly”. Nor can we conceive of a possibility space of different, logically consistent laws of physics that were not simple integers, to see just how probable our universe is. Instead, we are restricted to the illusion selected for us, by quantum decoherence, post-Big Bang inflation, and our own perceptual affinity for mathematical elegance – forever to wonder what’s on the other side of the filter.

Applying Bayes

We are already familiar with Bayes, and the theorem that, simply put, expresses the probability that a hypothesis is true as fitness with evidence weighted by its prior probability.


Importantly, when we lack a clear idea of possible explanations (of the other side of the filter) and try to apply Bayes’ theorem to our reasoning, we are faced with the fact that there is no such thing as an unbiased prior. There is a more sophisticated mathematical reason for this, but we can see it more intuitively in theological arguments for the existence of God.

I was once told by a creationist: “If you were to shake a handful of sand and throw it all up in the air, and if it is all mindlessly random – as you say – then obviously it would be improbable to the point of impossible for it to fall by pure happenstance into something as well-organized as an organism?” The reasoning behind this watchmaker-argument goes that “Given that there is no god, the structure we see would be improbable. Therefore, because the world is so stunningly assembled, we must conclude that there is a conscious God”. Here, the observation that we do exist is our data. Because we have no informative priors, we will have to invoke equiprobable priors, such that “God exists” and “God does not exist” have 50% each. As for likelihoods, a creationist would say that P(humans|no God) has a vanishingly low probability, like a one-trillionth, and P(humans|God) is significantly higher, maybe a millionth.

The argument may be problematic because when we assigned 50% each to God/no God, we partitioned the vast space of possible hypotheses in an inevitably biased way. We could fragment “no god” into an indefinite number of alternative hypotheses, such as “Hindu gods”, or that we are a quantum computer simulation, which would be compatible with an orderly universe too. To this type of fundamental question, in which we have no good conception of the possibility space, probability theory cannot be meaningfully applied.
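How much the conclusion hinges on the partition can be seen by running the numbers. The sketch below uses the creationist's likelihoods from the argument (a millionth and a trillionth) with equiprobable priors, and then re-carves the hypothesis space into three. The third hypothesis and its likelihood are pure assumptions made up for illustration, as is the helper name `posterior`.

```python
def posterior(priors, likelihoods):
    """Bayes' theorem over mutually exclusive, exhaustive hypotheses."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())  # P(data), the normalizing constant
    return {h: j / total for h, j in joint.items()}

# Two-way partition with 50/50 priors.
likelihoods2 = {"God": 1e-6, "no God": 1e-12}
two_way = posterior({"God": 0.5, "no God": 0.5}, likelihoods2)

# Three-way partition: "no God" is split, and one fragment is ASSUMED to
# explain an orderly world as well as "God" does.
likelihoods3 = {"God": 1e-6, "simulation": 1e-6, "neither": 1e-12}
three_way = posterior({h: 1 / 3 for h in likelihoods3}, likelihoods3)

print(two_way["God"], three_way["God"])
```

The same data and the same “equiprobable” rule yield near-certainty under one carving and a coin flip under the other, which is the sense in which no prior is unbiased.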


In philosophy of science, the fact that the hypothesis space is never exhaustively specified is known as “under-determination of theory by data”. This refers to the idea that we can never know whether another theory would account for the evidence equally well. Hence, science cannot be said to constitute truth. This is consistent with Bayesian reasoning, where only hypotheses deemed worthy a priori are investigated. Yes, a demon tampering with your mind could explain your current sensory experience, but this prior probability is so low it becomes negligible.

Interestingly, a similar argument has been advanced by philosopher Hilary Putnam in defense of science, called the “No Miracles” argument. Science’s successes are unlikely to be due to luck, and therefore science must deal in truth. The argument is a conditional probability, where P(science’s success| science is unrelated to truth) is considered low. Bayes’ theorem tells us that this is meaningless unless we consider the base rate, which is the relative frequency of false theories, but again, this incidence cannot be quantified. However, if an evolutionary view of science is taken, the “No Miracles” argument amounts to “Survival of the fittest”. Science is successful, precisely because hypotheses that do not perform well are eliminated.

Confirmation bias in terms of Bayes

If we let the unconscious cognitive processes constitute a filter, we find not only the filtration of how the brain collects data, but also biases in what hypotheses it focuses evidence acquisition on. That is: even when we are aware of a whole set of hypotheses, we have a tendency of gathering information about only one of them. In the literature, this is known as “confirmation bias”. It has been explained in various different ways, like how we feel good about being right and value consistency, but is perhaps most elegantly accounted for with reference to Bayes.

According to German psychologist Gerd Gigerenzer, scientific theories tend to be metaphors of the tools used to discover and justify them (the “tools-to-theories heuristic”). It changes the Fragestellung. For example, the cognitive revolution was fueled by the advent of computers and statistical techniques, which soon became theories of mind. Thus, the brain is often thought of as a homunculus statistician that collects data and plugs them iteratively into Bayes’ theorem. And as already stated, the brain cannot collect all kinds of data – a completely Bayesian mind is impossible, for the number of computations required for optimization (i.e. finding the best hypothesis) would increase exponentially with the number of variables. The brain must therefore allocate computational resources dynamically, based on how promising a hypothesis seems. By limiting the sample size (the working memory and attentional window), we become more likely to detect correlations, because contingencies are more likely to be exaggerated. Presumably, these benefits outweigh the dangers of confirmation bias.

In Bayesian terms, confirmation bias would mean that we fail to take the likelihood ratio P(D|H1)/P(D|H2) into account, particularly when H1 and H2 are each other’s opposites. The fact that our evidence acquisition is partial means that, for the hypotheses we favor a priori, we over-weigh the prior (our preconceptions), while for the hypothesis we initially disbelieve, we over-weigh the likelihood, being either too conservative or too keen in our belief revisions. Because this tendency persists over time, the hypothesis is positively reinforced, leading to what psychologist Wolfgang Köhler called “fixedness” and “mental set”, the inability to think “outside the box”, which has been implicated in many clinical conditions, from depression to paranoia. Iterated Bayesian inference is like a self-modifying filter in which our belief in a hypothesis is continually revised in light of incoming data. The posterior probabilities correspond to the widths of filter meshes – the more coarse-meshed the filter, the more receptive we are to the hypothesis, and the bigger its impact on future interpretations.
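The role of the likelihood ratio is easiest to see in the odds form of Bayes' theorem, where posterior odds equal prior odds times P(D|H1)/P(D|H2). The probabilities below are invented for illustration, and `update_odds` is a hypothetical helper.

```python
def update_odds(prior_odds, p_d_given_h1, p_d_given_h2):
    """Posterior odds of H1 over H2 = prior odds * likelihood ratio."""
    return prior_odds * (p_d_given_h1 / p_d_given_h2)

# Evidence that differentiates: four times likelier under H1 than under H2.
odds = 1.0  # start undecided
for _ in range(3):
    odds = update_odds(odds, 0.8, 0.2)

# Evidence that merely "fits" H1 but fits H2 just as well: the ratio is 1,
# so a rational updater should not move at all.
unmoved = update_odds(1.0, 0.9, 0.9)

print(odds, unmoved)
```

Treating the second kind of evidence as support for H1, because only P(D|H1) was considered, is the pattern described above.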

Our biased evidence-collection is most evident in the type of experiment pioneered by Peter Wason in 1960. Subjects are given a triplet of numbers, such as “2,4,6”, hypothesize the generating rule, and ask the experimenter about the correctness of other triplets in order to infer it. What subjects tend to do is to state the hypothesis “2 + 2x” and then only look for confirmatory evidence by asking about, for example, “6,8,10”. As a consequence, they will never infer a generating rule for which these triplets are a subset, for example “Any even numbers” or “Numbers in increasing order”. By analogy, if you want to know if a person is still alive, you immediately go for his pulse, rather than, say, if the eyelids are down, because the pulse is a better differentiator. We want evidence that surprises us, which in information-theoretical terms has a high self-information content. It would have been more rational, more efficient, if instead the subject asked about triplets that would distinguish between the working hypothesis and more general candidates, like “4,5,7”, in order to, as Richard Feynman put it, “prove yourself wrong as quickly as possible”. Instead, we are like the drunkard looking for our key under the streetlight, because that is where we can see.
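What makes a test triplet informative can be stated as code. The sketch below assumes the subject's working hypothesis is “each number is two more than the previous one” and the more general candidate is “numbers in increasing order”; the confirmatory triplet 8,10,12 is chosen here for illustration, while 4,5,7 is the differentiating one mentioned above.

```python
def increases_by_two(triplet):
    """The narrow working hypothesis (an assumed reading of it)."""
    a, b, c = triplet
    return b - a == 2 and c - b == 2

def increasing(triplet):
    """A more general candidate rule."""
    a, b, c = triplet
    return a < b < c

def is_diagnostic(triplet):
    """A query is informative only if the two rules disagree about it."""
    return increases_by_two(triplet) != increasing(triplet)

print(is_diagnostic((8, 10, 12)))  # both rules say yes: nothing learned
print(is_diagnostic((4, 5, 7)))    # the rules disagree: the answer decides
```

Whatever the experimenter answers for a non-diagnostic triplet, both hypotheses survive, so the query could never have surprised us.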


Maximizing falsifiability

This insight, that science will progress faster if it focuses on differentiating evidence, is extremely profound, and the most important concept in the philosophy of science. It is known to many as “Falsificationism” and associated with Sir Karl Popper, who argued that good scientific practice is defined by presenting theories alongside criteria by which they can be rejected, so that all effort can be focused on trying to refute them by testing the potentially falsifying predictions derived from them. A strong theory is a theory that points towards its own Achilles’ heel, and because (according to Popper) the theories of Freud and Marx admit no conceivable “ugly facts” that could show them to be incorrect, they are closed to criticism and worthless as predictors.


Science essentially serves to gauge the predictive power of a theory. According to “instrumentalism”, this is all there is to it – theories are nothing but means of making predictions. A relatively good theory therefore is one that:

  • Is precise in its predictions, so as to forbid relatively more outcomes, thus providing more scenarios by which it can be rejected. For example, if you have two dots on a Cartesian plane, the hypothesis “The relationship is linear” is easier to falsify (just test a new dot not aligned with the two others!) than “The relationship is quadratic”, because a parabola can be fitted through virtually any third dot. A vaguer theory is less strengthened (“corroborated”) by confirmatory data than a more specific one, because its Bayesian likelihood P(data|hypothesis) is lower.
  • Makes predictions across a wide range of domains. For example, psychological theories are much less precise than physical ones, and the hypotheses derived from them are typically modest statements like “What condition a participant is under will make a difference”, without specifying the size or direction of this difference. However, psychological theories can often give rise to a wider variety of predictions, like how the theory about confirmation bias may partly predict both depression and paranoia.

The two criteria may appear to be contradictory, but they are really just about grain and extent, the same old filter-dimensions we have spoken of before. A good theory is fine-grained and wide-eyed.
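The two-dots example from the first criterion can be checked directly. The sketch below uses made-up coordinates and hypothetical function names. It shows that a third dot off the line refutes the linear hypothesis, while a quadratic can be fitted through the same three dots regardless.

```python
from fractions import Fraction
from typing import Tuple

Point = Tuple[int, int]

def is_collinear(p: Point, q: Point, r: Point) -> bool:
    """True if the three points lie on one straight line (cross product is zero)."""
    (x1, y1), (x2, y2), (x3, y3) = p, q, r
    return (x2 - x1) * (y3 - y1) == (x3 - x1) * (y2 - y1)

def quadratic_through(p: Point, q: Point, r: Point):
    """Coefficients (a, b, c) of y = a*x^2 + b*x + c through three points.

    Assumes the three x-coordinates are distinct. Uses divided differences,
    with exact fractions to avoid rounding error.
    """
    (x1, y1), (x2, y2), (x3, y3) = [(Fraction(x), Fraction(y)) for x, y in (p, q, r)]
    d1 = (y2 - y1) / (x2 - x1)
    d2 = (y3 - y2) / (x3 - x2)
    a = (d2 - d1) / (x3 - x1)
    b = d1 - a * (x1 + x2)
    c = y1 - a * x1 ** 2 - b * x1
    return a, b, c

first, second = (0, 0), (1, 1)   # the two dots we start with (assumed values)
surprise = (2, 5)                # a third dot that is off the line

print("linear hypothesis survives:", is_collinear(first, second, surprise))
a, b, c = quadratic_through(first, second, surprise)
print("quadratic still fits with a, b, c =", a, b, c)
```

The linear hypothesis staked more on the third dot and lost. The quadratic hypothesis risked almost nothing, which is why surviving the test tells us little about it.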


Moreover, in order to truly gauge how well a theory predicts, as opposed to how accurately it fits the evidence, it matters whether the theory is stated given the evidence (“post hoc”) or in ignorance of the evidence (“a priori” – typically, but not necessarily, before the evidence is gathered). A parallel distinction is often drawn between “data fitting”, such as finding a mathematical function that approximates a given dataset, and “ex ante prediction”, where the function is stated in advance. This matters because with post hoc explanations it is impossible to know how much of the data is influenced by random factors – noise – and if the explanation accommodates the noise as well, it will fare worse as a predictor, since in future circumstances the noise will be different. If a prediction fails, the theory must be modified, according to Popper by suggesting new tests, so as to increase its falsifiability.
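A toy sketch can illustrate why an explanation that absorbs the noise predicts worse. Everything here is an assumption made for illustration: the underlying trend (y = 2x), the deterministic "noise" that simply flips sign between the two datasets, and the two models. One model is a lookup table that memorises the seen data perfectly, the other is a straight line fitted by least squares.

```python
from typing import Callable, List, Tuple

def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_squared_error(predict: Callable[[float], float],
                       xs: List[float], ys: List[float]) -> float:
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = list(range(10))
wiggle = [1 if x % 2 == 0 else -1 for x in xs]      # stand-in for noise
seen_ys = [2 * x + w for x, w in zip(xs, wiggle)]   # data the models are built on
fresh_ys = [2 * x - w for x, w in zip(xs, wiggle)]  # later data: same trend, different noise

lookup = dict(zip(xs, seen_ys))

def memoriser(x: float) -> float:
    """Post hoc 'explanation' that reproduces every seen data point, noise included."""
    return lookup[x]

slope, intercept = fit_line(xs, seen_ys)

def line(x: float) -> float:
    """Simple model that captures the trend and ignores the wiggle."""
    return slope * x + intercept

for name, model in [("memoriser", memoriser), ("line", line)]:
    print(name,
          "error on seen data:", round(mean_squared_error(model, xs, seen_ys), 3),
          "error on fresh data:", round(mean_squared_error(model, xs, fresh_ys), 3))
```

The memoriser fits the seen data perfectly and the fresh data badly, while the line does about equally well on both.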

In evolution, if the ecological structure changes for a species, an adaptation may no longer be beneficial, no matter how useful it has been in the past. There is a similar concept called the “problem of induction”, associated with David Hume, which states that any empirically based generalization (such as “all swans are white”) is vulnerable to refutation by a single counter-example (finding a black swan). With such a disquieting asymmetry, a theory would be hard to build but easy to dismantle. Popper too argued that how well a theory has generalized in the past does not matter for its correctness, and if it ever fails, it is no longer serviceable.

However, falsifying evidence should not be accepted without caution. Philosopher Pierre Duhem pointed out that a theory cannot be conclusively rejected on the basis of a single result, since the apparatus or the background assumptions could have caused it. When in 2011 physicists measuring a neutrino beam sent from CERN detected neutrinos apparently travelling faster than light, they did not immediately declare Einstein’s special relativity theory false, but instead – based on the theory’s previous track record – tenaciously sought after sources of experimental error (which they eventually found). The interpretation of any hypothesis presupposes facts that themselves are conjectural in nature, causing a web-like regress of uncertainty, but, Bayesian reasoning goes, pre-existing knowledge does and should matter in how we update our beliefs.


We may thus think of science as a hierarchy of priors. If we are highly confident in a theory a priori, we won’t let an anomalous result refute it before the anomalies have accumulated to a degree so high they can no longer be ignored. Given the positive feedback of confirmation bias this tends to engender, and the volumes of discrepant results required to outweigh it, the occasional theoretical discontinuities seen in science history – dubbed “paradigm shifts” by Thomas Kuhn – seem intrinsic to the dynamic of an intelligent system. But in the short term, it raises vexing questions about the rational course of action: for how long should we persevere in a once-promising-but-now-challenged belief? When does the in-hindsight-so-romanticized persevering maverick become a fanatic madman?
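How many anomalies it takes to overturn a well-entrenched theory can be sketched with repeated Bayesian updating in odds form. The priors, the likelihood ratio and the function name below are assumptions chosen for illustration. The sketch also assumes the anomalies are independent and equally damaging, which real anomalies rarely are.

```python
def anomalies_needed(prior: float, likelihood_ratio: float, threshold: float = 0.5) -> int:
    """Number of anomalous results before belief in the theory drops below `threshold`.

    likelihood_ratio = P(anomaly | theory) / P(anomaly | rival), so it must lie
    strictly between 0 and 1 for an anomaly to count against the theory.
    """
    if not 0 < prior < 1 or not 0 < threshold < 1:
        raise ValueError("prior and threshold must be probabilities strictly between 0 and 1")
    if not 0 < likelihood_ratio < 1:
        raise ValueError("likelihood_ratio must be strictly between 0 and 1")

    odds = prior / (1 - prior)
    threshold_odds = threshold / (1 - threshold)
    count = 0
    while odds >= threshold_odds:
        odds *= likelihood_ratio   # each anomaly multiplies the odds by the same ratio
        count += 1
    return count

# An entrenched theory versus a tentative one, facing anomalies that are
# four times likelier under the rival account (assumed numbers).
print(anomalies_needed(prior=0.999, likelihood_ratio=0.25))  # entrenched theory
print(anomalies_needed(prior=0.6, likelihood_ratio=0.25))    # tentative theory
```

With these numbers the tentative theory falls after one anomaly, while the entrenched one withstands four and gives way at the fifth.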

Confirmatory and exploratory

We saw with the brain-as-a-Bayesian-statistician that the number of hypotheses we can entertain is limited, and we saw with Falsificationism that the best hypotheses are those that differentiate between candidate theories. According to science writer Howard Bloom, for intelligent behavior to emerge, a system does not only need “conformity enforcers” (to coordinate the system), “inner judges” (to test hypotheses), “resource shifters” (to reward successful hypotheses) and “intergroup tournaments” (to ensure that adaptations benefit the entire system), but also “diversity generators” – we must make sure that new hypotheses are continually generated.

In the brain as in science (scientists have brains), this can be thought of as a random, combinatorial play. Activity spreads stochastically with varying degrees of constraint through neural networks to find ideas to associate, and scientists are similarly, via their social environment, exposed to random ideas they encode in their own neural networks. Institutionally, to safeguard against theoretical blindness and confirmation bias, science therefore encourages researchers to maintain a free market of ideas and always to consider alternative explanations in an article’s end discussion.

Different research methods vary along a continuum in how constrained the observational filter is. In “qualitative” research, such as interviews, the prior probabilities are weak, and hypotheses emerge over time as promising leads are picked up and shadowy hunches in the minds of the researchers are gradually reinforced. This exploratory, data-driven, “bottom-up” kind of research is necessary in the absence of robust theories. But when we do have high-prior hypotheses available, we may test these using quantitative methods, such as experiments, which by comparison are deductive, confirmatory and “top-down”. The process can be described as a “parallel terraced scan”.

[Figure: parallel terraced scan]

If we need to scrap the hypothesis following expensive, focused experimentation, we may have to revert to square one, to the cheap and unfocused information processing of qualitative research. Just as a brain’s attention can be either concentrated or vigilant, science needs both modes. The important thing, following exploration, is not to double-dip in the same data: the hypothesis was selected by virtue of its fit to that dataset, so to gauge its predictive power, the dataset has to be fresh.

Different statistical frameworks

Intuitive as Bayesian conceptualizations seem, a fascinating twist to this story is that the Bayesian nature of science is implicit. Scientists do not literally quantify their priors and plug the relevant numbers into Bayes’ equation – you are extremely unlikely to find this in a research article. The influence of priors is manifest only in the introductory and concluding discussion sections of research articles, in which researchers rationalize why they have performed this particular query (because it had a high prior) and interpret the query result through the lens of the pre-existing belief system.

Instead, science in general, and experimental psychology in particular, is dominated by a different framework, called “null-hypothesis significance testing” or “the Neyman-Pearson approach”. To undergraduate students it is often presented as the only approach – as a single, objective structure – when actually it is heavily criticized within statistics, and only one approach among many. But Bayes and Neyman-Pearson are not necessarily in conflict – each is internally coherent, but they conceptualize probability differently to serve different purposes. Nor should we hasten to conclude that both types of computations cannot occur alongside each other in the brain. Neyman-Pearson too has been used for theories of the mind, like Harold Kelley’s causal attribution theory and signal detection theory. And just like in science, the brain is capable of one-trial learning without sampling, so we should be wary of presenting either as some grand general-purpose mechanism.

There is meanwhile a third approach that is in the minority but gaining popularity, called “likelihood analysis”. We will now consider the nuts and bolts of these procedures, but to do so, we will first need a good understanding of probability distributions.


OTOOP Part I.IX: Science as Information-Processing


Information theory introduced the image of system boundaries as implicitly projecting “input categories” upon reality, parsing it into “variables” with an associated space of possible outcomes. Metaphorically, a system queries Nature, asks it a yes/no-question, the answer to which constitutes a bit. Nowhere is this more evident than in children, who with unceasing curiosity experiment with their surroundings and never accept an answer without following it up with a progressively more patience-testing “But why?”. Eventually, however, our internal schema of reality becomes sufficiently rich for us to live a functional life without craving answers to everything and exploring every corner in sight. Curiosity-driven exploits become a waste of resources. We lose the childhood habit of isolating some aspect of reality by mentally situating it in a space of other conceivable scenarios, of counter-factual could-have-beens, of whys, and we replace it with “whatever”.

This is where science comes in. Science may be seen as an outgrowth of our instinctive curiosity, but it is by no means spontaneous. It took Homo sapiens hundreds of thousands of years of ad hoc religious explanations before Galileo Galilei, in the 16th and 17th centuries, pioneered the use of experiments to discover causes and make better predictions, and the scientific method is still undergoing refinement. But science is more than methodology – through the establishment of research institutions, humans are given resources and economic incentive to sustain their inquisitiveness and keep asking “Why?”.

But posing questions is no trivial task. It takes a brain of Newton’s caliber to watch an apple fall to the ground, wonder why it doesn’t fall upwards or leftwards, and from empirical data construct a mathematical model that makes its downward fall necessary and predictable. Historians of science use the term “Fragestellung” to refer to the fact that our adult capacity to mentally place an observation in a space of alternative scenarios, thereby creating uncertainty which we become motivated to resolve, depends critically on the intellectual climate and ideas to which the scientist is exposed. For example, the classical Greeks were eminent at careful, quantitative observation, yet they never arrived at Linnaeus’s taxonomy scheme, or conducted Mendel’s simple pea experiments which allowed him to infer the existence of genes.

And this is where information theory comes in. Not only have its ideas generated extremely fruitful lines of inquiry, it also brings a fresh perspective on science itself, as well as the strange loopiness that characterizes the ontological endeavor to access raw reality while simultaneously being embedded in it. The information-theoretical account runs somewhere along the following statements, about how new systems of processing information arise organically out of older systems:

  • DNA encodes information about the evolutionary past, in the sense that systematic differences pierce through the noisy cloud of random fluctuations to, via natural selection, have statistical effects on the gene pool, making life more viable.
  • Natural selection hones the nervous system so that systematic features of the environment can be encoded in the organism during its own lifetime (this is called “learning”). Perception is here regarded as the reception of a message.
  • An organism’s learning may involve a writing system, allowing it to encode systematic, stable cognitions on physical media like paper and computer documents. This acts like a collective conscious memory, which on Earth is unique to humans. To conduct an experiment is to query Nature, and the acceptance or rejection of an experimental hypothesis constitutes a bit.
  • This way, low-entropy regularities in the environment will be translated into low-entropy regularities in neural activity and eventually low-entropy regularities in external representations like text and diagrams (which are then fed back, in a loop, as humans read and integrate symbolic information, expanding our collective knowledge).

Information theory also taught us how a message can always be encoded in a way that makes distortions negligible by building redundancy into the transmission code (for example, we may append extra letters and instructions on how to recover the original message from the redundancies). This raises the interesting question to what extent natural selection has polished neural encoding to optimize its fidelity, and maximize its mutual information with Nature. The discovery that humans are prone to cognitive biases and make poor intuitive statisticians, due to peculiarities in System 1 and System 2, suggests that natural selection has not put high priority on making humans sophisticated predictors. So in what way exactly does the scientific method supplement our flawed intuitions? What’s the secret behind its incredible success? The keys to science’s success are in its active probing, falsifiability, and the principle of Occam’s razor. Let us explain these in terms of information.

Scientific experiments are active interventions in causal chains. A scientist does not passively await data to investigate hypotheses – he brings himself to the conditions in which the data are the most precise and the least noise-ridden. In Bayesian predictive coding, this implies that the prediction error signal will be minimized, since it is weighted by precision. Because precise data have greater influence in the updating of hypotheses, agency reduces uncertainty much more efficiently than passive perceptual inference. Note how this makes science an extension of action in general, from the eyes’ micro-saccades, to hand-reaching and everyday explorations. Science is Bayesian, in a way that our brains’ System 1 and 2 aren’t. According to Richard Feynman, it is “a way of trying not to fool yourself”, of making optimal guesses and protecting our pooled knowledge from human fallibilities like confirmation bias and bandwagon-effects.
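The claim that precise data have greater influence on belief updating can be illustrated with the standard Gaussian update, in which the prior belief and the observation are averaged in proportion to their precisions (inverse variances). This is a minimal sketch that assumes both belief and measurement noise are Gaussian. The function name and the numbers are hypothetical.

```python
from typing import Tuple

def precision_weighted_update(prior_mean: float, prior_precision: float,
                              observation: float, obs_precision: float) -> Tuple[float, float]:
    """Combine a Gaussian prior with one Gaussian observation.

    Returns the posterior mean and posterior precision. The prediction error
    (observation - prior_mean) moves the belief by a fraction equal to the
    observation's share of the total precision.
    """
    posterior_precision = prior_precision + obs_precision
    gain = obs_precision / posterior_precision
    posterior_mean = prior_mean + gain * (observation - prior_mean)
    return posterior_mean, posterior_precision

# Same prior and same surprising reading, but different measurement quality.
noisy_mean, _ = precision_weighted_update(0.0, 1.0, 10.0, obs_precision=0.1)
precise_mean, _ = precision_weighted_update(0.0, 1.0, 10.0, obs_precision=10.0)
print("belief after a noisy reading:", round(noisy_mean, 2))
print("belief after a precise reading:", round(precise_mean, 2))
```

The same reading barely moves the belief when it comes from a noisy source and almost replaces it when it comes from a precise one, which is why it pays to seek out conditions where the data are clean.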

The notion of “falsifiability” means that a theory must provide test criteria by which it can be rejected in order to be meaningful. Science, according to this logic, cannot tell you what reality is like, but it can tell you what it is not like. It describes the world in terms of negatives. Recall that the more surprising an event is – that is, the bigger our misprediction, and the bigger our confidence in our theory – the more informative it is, because it makes a greater difference to what we believe. For a well-established theory to pass a test and be experimentally confirmed is a low-information event, but prediction failures carry high information and may cause great paradigmatic upheaval. By regarding science as a system that randomly generates conjectures and then eliminates them by refutation, it becomes analogous to a species undergoing evolution, and can be seen as a form of information processing that results in the editing of our universe-compression program, just like how the environment is compressed in the DNA.
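The asymmetry between confirmation and refutation can be expressed in bits using self-information, which is the negative base-2 logarithm of an outcome's probability. The 99% figure below is an assumed degree of confidence, picked only for illustration.

```python
import math

def surprisal_bits(probability: float) -> float:
    """Self-information of an outcome, in bits: -log2(p)."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)

confidence = 0.99  # assumed prior confidence that the theory passes the test

print("theory passes:", round(surprisal_bits(confidence), 3), "bits")
print("theory fails: ", round(surprisal_bits(1 - confidence), 3), "bits")
```

A pass that we were almost sure of carries a small fraction of a bit, while the failure carries several bits.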

“Occam’s razor” is the philosophical principle of parsimony according to which, all else equal, a scientist faced with competing explanations for a phenomenon should select the one with the fewest ontological assumptions about what entities exist. The reason why a short explanation is intrinsically more plausible can be understood with reference to the concept of “algorithmic probability”. Much like how monkeys typing randomly in a programming environment are more likely to write a functional computer program if it is short, a stable dynamical organization is more likely to emerge randomly if it is simple. Fractals abound in Nature precisely because they are so easy to generate, just a simple mechanism iterated many times.
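The typing-monkey intuition can be stated as a formula. The symbols are not from the text above and are introduced only for this sketch: k is the number of keys, n the length of a program, U a universal prefix-free machine, and |p| the length in bits of a program p. The second line is the usual Solomonoff-style definition of algorithmic probability.

```latex
% Chance that n uniformly random keystrokes over k keys
% spell out one particular n-character program
P(\text{specific program of length } n) = k^{-n}

% Algorithmic probability of an output x: the total weight of all
% programs p that make the universal machine U print x
m(x) = \sum_{p \,:\, U(p) = x} 2^{-|p|}
```

Each extra character divides the chance by k, so the sum is dominated by the shortest programs that produce x. This is the sense in which simple structures are more likely to arise by chance.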

Moreover, short explanations are also more useful. Statisticians must be careful not to use too many variables in their models – something known as “overfitting” – since this increases the risk of incorporating random factors, making the model a worse predictor. The idea that reality is organized into nested systems and stable regularities, which make it predictable, is the same as saying that reality contains information redundancies and is algorithmically compressible. A scientific model can be viewed as information compression, in how a physical equation subsumes a multitude of disparate observations. The more phenomena we can derive from a theory, by either prediction or postdiction, the deeper and more powerful it is, so when choosing between competing theories, we naturally favor the one that compresses the most, for it is the lightest to carry.
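The link between regularity and compressibility can be demonstrated with an ordinary file compressor. This is only a rough proxy. True algorithmic compressibility is uncomputable, and zlib merely finds repeated substrings. The two byte strings below are made up for illustration.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Size in bytes after zlib compression at the highest compression level."""
    return len(zlib.compress(data, 9))

# A "lawful" world: one short pattern repeated over and over.
regular = b"0123456789" * 1000

# A "lawless" world of the same length: pseudo-random bytes (seeded for repeatability).
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(len(regular)))

print("original size:       ", len(regular))
print("regular, compressed: ", compressed_size(regular))
print("noise, compressed:   ", compressed_size(noise))
```

The regular string shrinks to a tiny fraction of its length because a short description ("repeat this pattern") reproduces it, whereas the noise does not shrink at all. A good theory plays the role of the short description.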

Information theory also has interesting things to say about randomness. Because any theory, like a program, is finite, it can produce only a finite set of results. The outcomes that are unaccounted for will in effect be random and incompressible. The unpredicted outcomes will also be our only source of new information. Seen in this light, our understanding is complete only when the universe can no longer surprise us. However, we can merely be confident in a putative “Theory of Everything” – we cannot know that it is correct unless we know that it is irrefutable. And because a theory must be falsifiable in order to be meaningful, this means that we can never know whether a black swan, a high-information scientific discovery, awaits us. Because there will always be potential experiments to perform, randomness is in this way inherent in the universe.

Finally, information theory places limits on what science can in principle accomplish. There is a well-known result due to Alan Turing, concerning the “Halting problem”, which states that no general computer program can decide, for every program and input, whether that program will terminate. In general, a program’s future behavior is intrinsically unpredictable – the answer is said to be “uncomputable”. Therefore, a debugger that reliably checks whether any given program will crash or hang before we run it cannot exist. More generally, if a consistent logical system is expressive enough to support self-reference, it contains statements that it can neither prove nor refute. In mathematics, this is known as Gödel’s incompleteness theorem. Laplace’s demon cannot exist, because to simulate the future it would need as much processing power as the Universe itself. Seen in a different way, we cannot predict our own behavior, because by engaging in prediction we interfere with how we normally would behave. We cannot possibly know where our train of thought will lead us, and we cannot possibly predict the precise course of the universe, because we are integral to its evolution.
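The self-reference at the heart of the halting argument can be sketched in a few lines. This is an informal illustration, not a proof. The names are hypothetical, a "program" is modelled as a Python function with no arguments, and a would-be decider is assumed to be an ordinary deterministic function returning True for "halts". Given any such decider, we build a program that asks the decider about itself and then does the opposite.

```python
from typing import Callable

Program = Callable[[], None]
Decider = Callable[[Program], bool]   # True means "this program halts"

def make_contrarian(claims_to_halt: Decider) -> Program:
    """Build a program that defies whatever the decider says about it."""
    def contrarian() -> None:
        if claims_to_halt(contrarian):
            while True:        # decider said "halts", so loop forever
                pass
        # decider said "never halts", so return immediately
    return contrarian

def decider_is_wrong(claims_to_halt: Decider) -> bool:
    """Compare the decider's verdict on its contrarian with what the contrarian does.

    The contrarian is not executed here (it might never return). Its behaviour
    follows from its definition: it halts exactly when the verdict is False.
    """
    contrarian = make_contrarian(claims_to_halt)
    verdict = claims_to_halt(contrarian)
    really_halts = not verdict
    return verdict != really_halts

# Two naive candidate deciders (illustrative stand-ins for any cleverer attempt).
def optimist(program: Program) -> bool:
    return True

def pessimist(program: Program) -> bool:
    return False

print("optimist refuted: ", decider_is_wrong(optimist))
print("pessimist refuted:", decider_is_wrong(pessimist))
```

Whatever the decider answers, its contrarian does the opposite, so every candidate is wrong about at least one program. The two naive deciders stand in for any more sophisticated attempt, which would be defeated in the same way.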
