I like mathematical puzzles, but I don't like mathematical puzzles. Especially paradoxes. The problem I have is that I tend intuitively to think of the wrong answer - just like pretty much everyone else. This reminds me to be cautious whenever I look at my own or other people's data. As a result, I can't help wondering about every assumption, and whether I'm missing something. This doubt turns me into a pedant and a nitpicker, and makes me occasionally patronising when I ask if people have thought their problem through properly (that's my excuse and I'm sticking to it…).
The other week, a statistical puzzle on The Skeptics' Guide To The Universe (SGU) chimed with an email conversation with a colleague, where we talked about data and studies that seemed to fall into a similar paradoxical trap. At the core of this paradox is how you specify classification and sampling of data up-front: how you design your experiment to answer a specific question, or not. The 'or not' part was the issue in our conversation. This is a genuine problem for 'data-mining' and other large data studies.
[EDIT: The discussion below is only correct if you assume throughout that if the parent has at least one boy, they must tell you this. See comments below, and later post for more discussion]
The SGU puzzle: "I have two children. One is a boy born on Tuesday. What is the probability I have two boys?"
To see why the answer is 13/27, let's start off with a simpler puzzle.
A simpler puzzle
"I have two children. One of them is a boy. What is the probability I have two boys?" [EDIT: As noted above, we assume that the parent must mention that they have a boy, if they have one]
These kinds of teaser problems are typically stated without enough clarity concerning their assumptions. Do you live, perhaps, in a repressive state where girls are 'disfavoured' and the birth rate appears to be biased? The unstated assumptions then are likely to be along the lines of:
- There are only two sexes, represented by boys and girls
- The distribution of boys and girls in the population is 50:50
- Your pair of children are a fair random sampling from the larger population, with no untoward influences (e.g. cloning)
See what I mean about pedantry?
Anyway, the 'obvious' but incorrect answer is that the probability is 1/2, since the probability of any randomly-chosen child - including your second child - being a boy is 1/2. Also, if you already know about one of the children, it's natural to think that the question of the sex of the other child is independent of that. That seems reasonable, but it doesn't take long to see why it's wrong.
Let's say you have two girls. The probability of any (randomly chosen) child from the population being a girl is 1/2, so the probability of you having two girls is 1/2 x 1/2 = 1/4. A similar argument holds if you have two boys. And, at first glance, it seems that the same is true if you have one boy and one girl: 1/2 x 1/2 = 1/4. But something odd happens when you add up the probabilities:
1/4 + 1/4 + 1/4 = 3/4 < 1
And that 3/4 < 1 bit is key. It shows us that we've done something wrong, as the probabilities for all possible outcomes must add up to one (that's just the way the universe is - we can't do anything about that…). There are only three possible outcomes, here (LGBT issues aside…): boy + boy, boy + girl and girl + girl, so their probabilities must sum to unity.
What's gone wrong is that we haven't taken into account that, for any pair of items, there may be more than one equivalent ordering: (boy, girl) and (girl, boy) are, in fact, different outcomes for boy + girl.
Now, that isn't necessarily an intuitive thing to say, especially when we're dealing with individual humans - which is why I think the puzzle works so well: we intuitively think of two humans as not interchangeable even where, statistically-speaking, the labels we assign to them may be. Here, for example, you may easily buy the argument that (boy, girl) and (girl, boy) are different ways of ordering any pair that is a boy and a girl. But - you may argue - why are (boy1, boy2) and (boy2, boy1) - who are, after all, individuals and not absolutely equivalent - not equally valid alternative ways of ordering two boys? The answer is that the labels 'boy' and 'boy' are equivalent and, since there's no way of distinguishing, say, (girl, girl) from (girl, girl), the case where we have two children of the same sex counts as one event, while the case with one child of each sex counts as two events. This is an example of combinatorics (specifically permutations), a very interesting, and occasionally counter-intuitive branch of mathematics.
So, now that we see that (boy, girl) is not the same as (girl, boy), and that both have probability of 1/2 x 1/2 = 1/4, but they are both cases where you have one boy and one girl, it is clear that the probability of having one boy and one girl is 1/4 + 1/4 = 1/2. Now we have:
1/4 + 2 x 1/4 + 1/4 = 1
and all is well with the world. Or better than it was before, anyway.
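As a quick check of the bookkeeping, here is a minimal Python sketch that lists the four ordered outcomes and then collapses them into the three unordered ones. It assumes, as above, that each child is independently a boy or a girl with probability 1/2; the names `ordered` and `unordered` are just illustrative.

```python
from fractions import Fraction
from itertools import product

# Assumption: each child is independently a boy ("B") or a girl ("G"),
# each with probability 1/2.
half = Fraction(1, 2)

# Four ordered outcomes: (B, B), (B, G), (G, B), (G, G), each with probability 1/4.
ordered = {pair: half * half for pair in product("BG", repeat=2)}

# Collapse orderings: (B, G) and (G, B) are both "one boy and one girl".
unordered = {}
for pair, p in ordered.items():
    key = "".join(sorted(pair))  # "BB", "BG" or "GG"
    unordered[key] = unordered.get(key, Fraction(0)) + p

total = sum(unordered.values())
for key, p in unordered.items():
    print(key, p)
print("total:", total)
```

The mixed family picks up two of the four ordered outcomes, which is where the missing 1/4 comes from.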
So, back to the question: given that we know you have one boy, what's the probability that you actually have two boys? Is it 1/4, because that's the total probability of having two boys? Well, no… because we already know that you have at least one boy, it's not possible that you have two girls. This restricts the number of possible outcomes to fewer cases, so we expect the probability of having two boys to be higher than 1/4.
Since we know you don't have two girls, the remaining options are that you either have two boys, or a boy and a girl. The total probability that you have two boys is 1/4, and the total probability that you have a boy and a girl is 1/2; the probability that you have any boys is then 1/4 + 1/2 = 3/4. Intuitively then, the contribution of 'having two boys' to this total probability of 'having any boys' is 1/4 ÷ 3/4 = 1/3. More formally, we consider this as a conditional probability. Here, we think of it as:
P(you have two boys | you have at least one boy)
which is read as: "the probability that you have two boys, given that you have at least one boy". In fact, by the standard definition of conditional probability (and filling in the appropriate values, noting that if you have two boys, you certainly have one boy, and so the probability of having two boys and at least one boy is just the probability of having two boys):
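P(two boys | at least one boy) = P(two boys AND at least one boy) / P(at least one boy)
= P(two boys) / P(at least one boy)
= (1/4) / (3/4)
= 1/3
Conditional probability for the simple puzzle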
So far, so good.
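The same number falls out of brute-force counting. The sketch below is one way to do it, assuming the four ordered families are equally likely; `conditional` is a helper written for this illustration, not a library function.

```python
from fractions import Fraction
from itertools import product


def conditional(event, given, outcomes):
    """P(event | given) over equally likely outcomes, as an exact fraction."""
    kept = [o for o in outcomes if given(o)]
    hits = [o for o in kept if event(o)]
    return Fraction(len(hits), len(kept))


# (first child, second child): four equally likely ordered families.
families = list(product("BG", repeat=2))


def two_boys(family):
    return family == ("B", "B")


def at_least_one_boy(family):
    return "B" in family


p_simple = conditional(two_boys, at_least_one_boy, families)
print(p_simple)  # 1/3
```

Conditioning here just means throwing away the (G, G) family and counting within what is left.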
The less simple puzzle
When we include the days of the week in the question, as in the SGU puzzle, we are changing our labels. No longer are the girls (or boys) all equivalent to each other: now we have seven different classes of each sex - one for each day of the week, corresponding to the day on which they were born. The combinatorics could get a bit hairy at this point, but there is an intuitive way through the thicket.
We're assuming that, in addition to the stipulations above, the distribution of days of birth is uniform: any child is equally likely to have been born on any day of the week (with probability of 1/7). So, the probability of a randomly-chosen child being a boy born on a Tuesday is 1/2 x 1/7 = 1/14. We have, effectively, 14 uniformly-distributed classes. But the complicating factor in the SGU question is that the classes we're talking about are hierarchically nested, so there are two parent classes for the 14 (boy and girl).
It's not as neat to list out all the possible combinations this time, as there are 14 x 14 = 196 of them, but we can use conditional probability to take us through the logic. Using the shorthand B for boy and G for girl, and Mo, Tu, We, Th, Fr, Sa, Su for day of the week, we have B-Tu to represent our birthday boy, and B or G to represent all boys and all girls, respectively. Now,
P(two B | at least one B-Tu) = P(two B AND at least one B-Tu) / P(at least one B-Tu)
Conditional probability for the less simple, SGU problem
We know that the probability of having two boys (in total) is 1/4, but we need to calculate the probability of there being at least one B-Tu in our pair of children: P(at least one B-Tu). This can happen in a number of ways: (G, B-Tu), (B-Tu, G), (B-Tu, B-Mo/We/Th/Fr/Sa/Su), (B-Mo/We/Th/Fr/Sa/Su, B-Tu) and (B-Tu, B-Tu). That is, for any B-Tu the other child can be:
- a girl (in two different orderings)
- a boy not born on a Tuesday (in two different orderings)
- a boy born on a Tuesday (in one ordering)
That this last possibility only has one possible way of being ordered is equivalent to there being only one way to have two girls, above: it's the ways of ordering the labels differently that are important, and there's only one way to order (B-Tu, B-Tu). So, these work out as probabilities:
- 1/2 x 1/14 + 1/14 x 1/2 = 14/196
- 6/14 x 1/14 + 1/14 x 6/14 = 12/196
- 1/14 x 1/14 = 1/196
And summing them we get (14 + 12 + 1)/196 = 27/196.
For the conditional probability we also need to know the probability of having two boys, where at least one is B-Tu: P(two B AND at least one B-Tu). This is just the sum of the second and third probabilities in the list above (the two cases where the other child is also a boy): (12 + 1)/196 = 13/196.
We can then apply the conditional probability formula, to get:
P(two B | at least one B-Tu) = (13/196) / (27/196) = 13/27
Conditional probability and solution for the SGU problem.
The probability is 13/27, which is what we said it was going to be.
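Listing all 196 combinations by hand is tedious, but a few lines of Python will do it. This is a sketch under the same assumptions as the text (sexes 50:50, birthdays uniform over the week, children independent); the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

DAYS = ["Mo", "Tu", "We", "Th", "Fr", "Sa", "Su"]

# 14 equally likely child types, e.g. ("B", "Tu"), and 196 ordered pairs of them.
children = list(product("BG", DAYS))
families = list(product(children, repeat=2))

TARGET = ("B", "Tu")  # a boy born on a Tuesday

# Families with at least one B-Tu, and those among them with two boys.
with_target = [f for f in families if TARGET in f]
both_boys = [f for f in with_target if f[0][0] == "B" and f[1][0] == "B"]

p_target = Fraction(len(with_target), len(families))
p_both_and_target = Fraction(len(both_boys), len(families))
p_answer = p_both_and_target / p_target

print(p_target, p_both_and_target, p_answer)  # 27/196 13/196 13/27
```

Swapping `TARGET` for any other day gives the same answer, since nothing in the counting is special to Tuesday.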
The paradox: sampling
What's striking to me about this is the apparent paradox that, if you only state that you have 'a boy', the probability that you have two boys is 1/3 ≈ 0.33 but, if you say you have a boy born on a Tuesday, the probability rises to 13/27 ≈ 0.48 ≈ 0.5! So, just by specifying the day on which your boy was born, it appears that you immediately have a considerably greater probability of having two boys. And it doesn't even matter what day it is that you say they were born on. That is disturbing, isn't it? How can the probability change just by knowing what day the child was born on? What's going on, here?
Obviously, the probability isn't really changing. You're no more or less likely to have two boys just because you declared a birthday. It's a sampling issue, and - as with so much in statistics - the result you get depends very much on how you ask the question to start with.
The way the question is originally posed, we are absolutely upfront and open about what we're asking: "I have two children. One is a boy born on Tuesday. What is the probability I have two boys?" [EDIT: again, the parent must tell you about the Tuesday-born boy, if they have one]. We have calculated an exact answer to this question, and this is equivalent to having a room filled with parents that have two children and asking them to leave if they do not have at least one boy born on a Tuesday; then asking how many of the parents left have two boys - the probability here being 13/27.
If, however, we had asked the parents to leave if they have no boys, and then asked one parent what day one of their boys (which may be the only boy) was born on, and they said 'Tuesday', we could ask how many of the parents left have two boys, and get an answer that 1/3 of them do.
The difference is this: in the puzzle we're specifying that at least one boy must be born on a Tuesday before we start counting; in the second example we are finding out that at least one boy was born on a Tuesday only in the course of the experiment, but nothing is conditioned on that fact - we're just counting everyone that's left there.
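The two rooms can be simulated directly. The following is a rough Monte Carlo sketch, assuming independent children, 50:50 sexes and uniform birthdays; the seed, room size and function names are arbitrary choices for illustration.

```python
import random

DAYS = ["Mo", "Tu", "We", "Th", "Fr", "Sa", "Su"]


def random_family(rng):
    """Two independent children, each a (sex, weekday) pair."""
    return [(rng.choice("BG"), rng.choice(DAYS)) for _ in range(2)]


def fraction_two_boys(families):
    """Share of the given families in which both children are boys."""
    return sum(all(sex == "B" for sex, _ in f) for f in families) / len(families)


rng = random.Random(2024)  # fixed seed so the run is repeatable
room = [random_family(rng) for _ in range(200_000)]

# Room 1 (the puzzle): only parents with at least one Tuesday-born boy stay.
tuesday_room = [f for f in room if ("B", "Tu") in f]

# Room 2: parents with at least one boy stay; any birthday is learned afterwards
# and nobody is removed because of it.
boy_room = [f for f in room if any(sex == "B" for sex, _ in f)]

p_tuesday = fraction_two_boys(tuesday_room)
p_boy = fraction_two_boys(boy_room)
print(round(p_tuesday, 3), round(p_boy, 3))
```

The first fraction should land near 13/27 and the second near 1/3, up to sampling noise. The only difference between the two rooms is which filter was applied before counting.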
For the biologists out there, this would be analogous to the difference between identifying a subset of sequences with a known property, and seeing which other sequences they shared an 'important' characteristic with; and identifying a set of sequences that share an 'important' characteristic, and noting that one has our known property. You can probably think of your own examples from the literature…
If we're not concentrating (or not reading the paper closely) we might think that these two circumstances are identical, but they're not. It trips up the best of us and it should do, because it's often more subtle, and not always as clear-cut as it is in the puzzle. And that's tricky enough.