I just made a post discussing the ludic fallacy, which can also serve as a non-technical introduction to Bayesian inference.
Taleb explains the fallacy as “basing studies of chance on the narrow world of games and dice”.
In other words, it’s our tendency to underestimate uncertainty by assuming that the real world is a game where the rules are fixed (and known).
The post starts with a basic but tricky question:
I throw a die and I tell you I got a 6. What’s the probability that I get another 6?
Apart from looking forward to reading any feedback (positive or negative) you have about the post, I was also curious about your experience with the ludic fallacy. More specifically, in what contexts have you come across the fallacy (either personal or professional), and what were the consequences of that misrepresentation of uncertainty? Taleb’s favourite example is, of course, the 2008 financial crisis.
The technical terms you want are “known unknowns” and “unknown unknowns”. If it’s really dice, then it’s mostly “known unknowns”, like number of sides and balance. The real problem is the unknown unknowns. In statistics, there are related notions of “epistemic uncertainty” (missing knowledge, which is what Taleb’s getting at) vs. “aleatory uncertainty” (irreducible randomness, which is what the dice give you).
I’d keep the framing of “prior knowledge” and get rid of the “belief”. It tends to confuse people and make them think Bayesian statistics is somehow more subjective than frequentist statistics. Instead, both rely on assumptions.
The “posterior = prior * likelihood” isn’t an equality. It should be a proportionality, as it’s missing the normalization in the denominator (which is called the “evidence” or “data marginal”). It’s also not clear how one multiplies by an “evidence-based update”, since that sounds like an algorithm, not a factor one would multiply by. Maybe your evidence-based update also renormalizes? If you really want to phrase it this way, then the update rule takes the prior and the data and returns the posterior.
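To make the normalization concrete, here’s a minimal sketch of a discrete Bayes update (the hypotheses and numbers are invented for illustration):

```python
# Minimal discrete Bayesian update; hypotheses and numbers are illustrative.
priors = {"fair": 0.95, "loaded": 0.05}          # p(theta)
likelihoods = {"fair": 1 / 6, "loaded": 1 / 2}   # p(rolled a 6 | theta)

# Evidence (data marginal): sum over hypotheses of prior * likelihood.
evidence = sum(priors[h] * likelihoods[h] for h in priors)

# Posterior: prior * likelihood, normalized by the evidence so it sums to 1.
posterior = {h: priors[h] * likelihoods[h] / evidence for h in priors}
print(posterior)  # {'fair': ~0.86, 'loaded': ~0.14}
```

Without the division by the evidence, the two numbers wouldn’t sum to 1, which is why the unnormalized version is only a proportionality.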
The probability of three equal numbers in a row is six times higher than the probability of any specific number three times in a row. Would you have been equally surprised by three ones? If so, the right measure of surprise for three in a row is (1/6)^2, not (1/6)^3.
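Spelled out: P(\text{any three of a kind}) = 6 \cdot (1/6)^3 = (1/6)^2 \approx 0.028, whereas P(\text{three 6s specifically}) = (1/6)^3 \approx 0.0046.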
If you show me a regular die, like one taken out of a new board game, I would think about it the same way as you explain. That means my prior knowledge of dice is going to concentrate very strongly around “roughly uniform”. An observation of three 6s in three tosses isn’t going to budge my prior in the real world.
I’m not sure what you’re trying to do with the visualizations. Usually you see things like running averages from coin flips plotted. Before you start plotting this, you need to say where you start. We have a lot more knowledge of dice and their behavior than is reflected in a uniform prior over the simplex (i.e., Dirichlet(1)).
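For example, here’s a sketch of a conjugate Dirichlet-multinomial update (the concentration parameters are made up for illustration), contrasting a flat Dirichlet(1) prior with one that encodes real-world knowledge of dice:

```python
import numpy as np

# Conjugate Dirichlet-multinomial update for a 6-sided die;
# the concentration parameters are made up for illustration.
flat_prior = np.ones(6)            # Dirichlet(1): uniform over the simplex
strong_prior = np.full(6, 100.0)   # concentrated around "roughly uniform"

counts = np.array([0, 0, 0, 0, 0, 3])  # three 6s in three tosses

for name, alpha in [("flat", flat_prior), ("strong", strong_prior)]:
    posterior = alpha + counts          # conjugate posterior update
    mean = posterior / posterior.sum()  # posterior mean of each face's probability
    print(name, "P(6) =", round(mean[5], 3))
# flat:   P(6) = 4/9 ~ 0.444     (three 6s dominate the flat prior)
# strong: P(6) = 103/603 ~ 0.171 (barely budges from 1/6 ~ 0.167)
```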
We can only assign non-zero probabilities to the elements of a countable set of outcomes, like the integers, so we couldn’t technically assign non-zero probabilities to all numbers.
I didn’t go beyond the Wikipedia article, so I’m not sure I understand Taleb’s point. Specifically, I’m not sure what he means by games of chance or what would count as a non-ludic model of something. All of our mathematical models are crafted out of simpler components that ground out in simple random number generators called distributions.
Thanks for your input, Bob! Especially for taking the time to read it through and post such detailed feedback.
The post is built on an underlying assumption (one I think is true for the uninformed reader, who is the main target of the text): most people would answer the starting question by saying the probability is 1/6, and would only hesitate, perhaps, because the question seems too easy. In that sense, the possibility that the die is not fair or not 6-faced constitutes an unknown unknown.
I understand that for you it was a known unknown, because your mind is more attuned to looking for uncertainty, but my claim is that this is not the general way an untrained mind works. Any unknown unknown can potentially be turned into a known unknown, which is the first step of knowledge discovery.
I’m not too fond of that distinction, or at least of the terminology (in fact, I wanted to write another post about that!). To me the natural opposite of “epistemic” (related to things as we know them) is “ontological” (related to how things are in themselves).
“Aleatoric” comes from the Latin “alea” for dice (or so I’ve read), and I don’t see why the randomness of a die should be intrinsic or irreducible; in fact, throwing a die seems like quite a deterministic process to me. It’s just that we are not able to fully characterise all the mechanics of the process and run the appropriate calculations, which means that the uncertainty is actually “epistemic” (i.e., it’s a function of the subject, not of the object itself).
To me, as far as we know, all the randomness in the universe could be epistemic. We like to call irreducible the uncertainty we have no clue how to reduce, but since we can never know things in themselves, we will never be able to tell whether they are intrinsically undetermined or we simply lack the necessary knowledge (perhaps with quantum physics this is different, but since the argument is more philosophical than physics-based, I tend to think it still holds). Unless we are handed the plans of creation by an omniscient god, that is.
Yes, the evidence-based update is the likelihood divided by the data marginal. I feel like that rearrangement of the equation is more interpretable: leaving the prior alone and grouping the data-related terms into an updating factor. I’m open to alternative names; I wanted it to be as descriptive as possible but at a very high level. I could’ve used a plus symbol, but it felt like that could create confusion if people dug deeper or had seen the formula elsewhere.
I agree, and I’m the kind of person who would’ve easily overlooked that, but reviewing the post I see I was discussing the probability of getting four 6s in a row, so I’m a bit confused by this clarification (I found it a useful reminder anyway).
I’m not sure I understood this: the first plots show the priors, which are the starting point, and the following plots show posteriors under different scenarios. It’s just an example, but for the purposes of demonstration, a reasonable prior to me puts high probability on it being a typical die (i.e., uniform over 1-6) and then some probability on any other natural number (you could ask why not integers or rationals, or even any possible symbol; I just wanted to avoid that complexity, but perhaps it’s worth a footnote). I’m open to considering alternatives.
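For concreteness, here’s a sketch of the kind of prior over outcomes I had in mind (the 0.99 mass on 1-6 and the geometric decay rate are arbitrary choices for illustration):

```python
# Prior over die outcomes: most mass spread uniformly over 1-6 (a typical
# die), the rest spread over larger naturals with a geometric tail.
# The 0.99 mass and the 0.5 decay rate are arbitrary illustrative choices.
def prior(n, typical_mass=0.99, decay=0.5):
    if 1 <= n <= 6:
        return typical_mass / 6
    # Geometric tail over 7, 8, 9, ...; it sums to (1 - typical_mass).
    return (1 - typical_mass) * (1 - decay) * decay ** (n - 7)

print(sum(prior(n) for n in range(1, 1000)))  # -> 1.0 as the range grows
```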
I’m not sure I got this either. The naturals are countable, so we can assign non-zero probabilities to each of them, or can’t we?
Yeah, for all we know reality itself could be crafted like that too. It’s reasonable that we tend to oversimplify, because it’s convenient and sometimes the only possible way to get going. I think Taleb’s issue is with our tendency to believe in our own simplifications, and take our game-like simulations of reality for reality itself. This is a failure to account for unknown unknowns but also for many known unknowns that are uncomfortable to bear in mind and our mind likes to forget.
That’s a good point—this is all relative to someone’s perspective.
That was also Laplace’s take; it’s where Laplace’s demon comes from. Laplace wrote:
We may regard the present state of the universe as the effect of its past and the cause of its future. An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect were also vast enough to submit these data to analysis, it would embrace in a single formula the movements of the greatest bodies of the universe and those of the tiniest atom; for such an intellect nothing would be uncertain and the future just like the past could be present before its eyes.
I believe that in statistics, the notion of aleatory uncertainty is usually understood conditionally. If I build a linear regression, p(y | x, \beta, \sigma) = \textrm{normal}(y \mid x \cdot \beta, \sigma), then there may be only so much we can learn about y from observing x, and \sigma represents the scale of the residual uncertainty. But you’re right that you can consider this further epistemic uncertainty, in that if you learn more about the world and condition on more information or build a better model, you can reduce the residual uncertainty.
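A quick simulation of that point (the coefficients are invented for illustration; conditioning on a second predictor shrinks the residual scale that looked “aleatory” under the smaller model):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x1, x2 = rng.normal(size=(2, n))
y = 2.0 * x1 + 3.0 * x2 + rng.normal(scale=1.0, size=n)  # true sigma = 1

# Model A: regress y on x1 only; x2's effect gets folded into the "noise".
X_a = np.column_stack([np.ones(n), x1])
beta_a, *_ = np.linalg.lstsq(X_a, y, rcond=None)
print((y - X_a @ beta_a).std())  # ~ sqrt(3^2 + 1^2) ~ 3.16: looks irreducible

# Model B: condition on both predictors; residual scale drops to sigma.
X_b = np.column_stack([np.ones(n), x1, x2])
beta_b, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print((y - X_b @ beta_b).std())  # ~ 1.0: most of it was epistemic after all
```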
You can put it on the log scale. I think the main issue with your framing is that it’s not going to line up with what people have seen elsewhere. I could figure out what you must have meant for the equation to be true, but it doesn’t line up with any textbook descriptions I’ve seen.
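Spelled out, the additive version would be \log P(\theta \mid \mathcal{D}) = \log P(\theta) + \log P(\mathcal{D} \mid \theta) - \log P(\mathcal{D}).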
Now I see what’s going on. I was reading it as a time series of updates like one usually shows when discussing updating. I probably would have figured that out if I had read the labels more carefully. I’d use something other than “number” on the x-axis, like “outcome”. I think I was confused because we were talking about 6-sided dice and then the plot went up to 100. I know you talked about that possibility, too, but I missed that in the plots.
Yes, we can. We can assign any countable set a distribution that has positive probability on each outcome. For the naturals, you can use p(n) = 2^{-(n + 1)} and any other countable set can be bijectively mapped to the naturals. But you can’t assign positive probabilities to all real numbers and those are numbers, too. Sorry—this was just a pedantic point about using terms like “number” more technically.
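A quick numerical check of that construction:

```python
# p(n) = 2^{-(n + 1)} over the naturals {0, 1, 2, ...}: every term is
# positive, and the partial sums approach 1.
print(sum(2.0 ** -(n + 1) for n in range(100)))  # 0.999... -> 1
```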
Yes, in that context I can see the value of making that distinction. I guess my issue is just with using terms that seem to imply unnecessary metaphysical claims; I think I’d rather call it exogenous and endogenous uncertainty, or something like that.
Interesting; for me it felt natural to rearrange P(\theta | \mathcal{D}) = \frac{P(\mathcal{D} | \theta) P(\theta)}{P(\mathcal{D})} into P(\theta | \mathcal{D}) = P(\theta) \frac{P(\mathcal{D} | \theta)}{P(\mathcal{D})}. I understand that through the statistics (or, more broadly, mathematical) lens the likelihood and the data marginal are two rather distinct elements, but from the abstract view of knowledge updates, I think it makes sense to group them together.
After all, we don’t revise the probability of a model/hypothesis upwards because it explained the data well (as in, it had a high likelihood), but because it explained the data better than the alternatives (as encapsulated in the data marginal).
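For example (the hypotheses and the numbers are invented for illustration), a hypothesis with a decent likelihood can still be revised downwards when an alternative explains the data even better:

```python
# Observing a rolled 6 under two hypotheses; numbers are illustrative.
priors = {"fair": 0.5, "six_heavy": 0.5}
likelihoods = {"fair": 1 / 6, "six_heavy": 1 / 2}  # p(roll a 6 | theta)

evidence = sum(priors[h] * likelihoods[h] for h in priors)  # p(data) = 1/3

for h in priors:
    # The evidence-based update factor: likelihood / data marginal.
    print(h, "update factor =", round(likelihoods[h] / evidence, 2))
# fair:      0.5 (< 1: revised down despite a non-zero likelihood)
# six_heavy: 1.5 (> 1: revised up, it explains the 6 better than average)
```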
I came to this arrangement while trying to make a probability puzzle’s solution more intuitive for the untrained reader.
Thanks for the suggestion on the plot labelling; certainly “number” is quite an equivocal term.
No worries! I may add a note to clarify that it’s the naturals anyway, and now that you mention it, it’s an interesting edge case considering the possibility that the die could contain any symbol.