Wednesday 13 February 2013

A Nice New Paradox

In which I work through a popular statistical puzzle/paradox (with potential implications for interpretation of large data studies). With example code.

I like mathematical puzzles, but I don't like mathematical puzzles. Especially paradoxes. The problem I have is that I tend intuitively to think of the wrong answer - just like pretty much everyone else. This reminds me to be cautious whenever I look at my own or other people's data. As a result, I can't help wondering about every assumption, and whether I'm missing something. This doubt turns me into a pedant and a nitpicker, and makes me occasionally patronising when I ask if people have thought their problem through properly (that's my excuse and I'm sticking to it…).

The other week, a statistical puzzle on The Skeptics' Guide To The Universe (SGU) chimed with an email conversation with a colleague, where we talked about data and studies that seemed to fall into a similar paradoxical trap. At the core of this paradox is how you specify classification and sampling of data up-front: how you design your experiment to answer a specific question, or not. The 'or not' part was the issue in our conversation. This is a genuine problem for 'data-mining' and other large data studies.

[EDIT: The discussion below is only correct if you assume throughout that if the parent has at least one boy, they must tell you this. See comments below, and later post for more discussion]

The Puzzle

The SGU puzzle
In the SGU's "Who's That Noisy" segment, the puzzle was presented: "I have two children. One is a boy born on Tuesday. What is the probability I have two boys?". I'm going to ruin the puzzle by telling you that the answer is 13/27 (≈0.48), almost 50:50 (and much larger than the probability you have two boys if I don't know the boy was born on Tuesday: 1/3 ≈ 0.33 - which is the paradox: you'd think that being more specific would have no effect, or even reduce the probability). But there isn't necessarily consensus about that answer since, as is common in statistics, it does depend on how you ask the question. There especially isn't consensus about the answer on the SGU forum discussing the problem.

To see why the answer is 13/27, let's start off with a simpler puzzle.

A simpler puzzle

"I have two children. One of them is a boy. What is the probability I have two boys?" [EDIT: As noted above, we assume that the parent must mention that they have a boy, if they have one]

These kinds of teaser problems are typically stated without enough clarity concerning their assumptions. Do you live, perhaps, in a repressive state where girls are 'disfavoured' and the birth rate appears to be biased? The unstated assumptions then are likely to be along the lines of:

  1. There are only two sexes, represented by boys and girls
  2. The distribution of boys and girls in the population is 50:50
  3. Your pair of children are a fair random sampling from the larger population, with no untoward influences (e.g. cloning)
See what I mean about pedantry?

Anyway, the 'obvious' but incorrect answer is that the probability is 1/2, since the probability of any randomly-chosen child - including your second child - being a boy is 1/2. Also, if you already know about one of the children, it's natural to think that the question of the sex of the other child is independent of that. That seems reasonable, but it doesn't take long to see why it's wrong.

Let's say you have two girls. The probability of any (randomly chosen) child from the population being a girl is 1/2, so the probability of you having two girls is 1/2 x 1/2 = 1/4. A similar argument holds if you have two boys. And, at first glance, it seems that the same is true if you have one boy and one girl: 1/2 x 1/2 = 1/4. But something odd happens when you add up the probabilities:

1/4 + 1/4 + 1/4 = 3/4 < 1

And that 3/4 < 1 bit is key. It shows us that we've done something wrong, as the probabilities for all possible outcomes must add up to one (that's just the way the universe is - we can't do anything about that…). There are only three possible outcomes, here (LGBT issues aside…): boy + boy, boy + girl and girl + girl, so their probabilities must sum to unity.

What's gone wrong is that we haven't taken into account that, for any pair of items, there may be more than one equivalent ordering: (boy, girl) and (girl, boy) are, in fact, different outcomes for boy + girl.

Now, that isn't necessarily an intuitive thing to say, especially when we're dealing with individual humans - which is why I think the puzzle works so well: we intuitively think of two humans as not interchangeable even where, statistically-speaking, the labels we assign to them may be. Here, for example, you may easily buy the argument that (boy, girl) and (girl, boy) are different ways of ordering any pair that is a boy and a girl. But - you may argue - why are (boy1, boy2) and (boy2, boy1) - who are, after all, individuals and not absolutely equivalent - not equally valid alternative ways of ordering two boys? The answer is that the labels 'boy' and 'boy' are equivalent and, since there's no way of distinguishing, say, (girl, girl) from (girl, girl), the case where we have two children of the same sex counts as one event, while the case with one child of each sex counts as two events. This is an example of combinatorics (specifically permutations), a very interesting, and occasionally counter-intuitive branch of mathematics.

So, now that we see that (boy, girl) is not the same as (girl, boy), and that both have probability of 1/2 x 1/2 = 1/4, but they are both cases where you have one boy and one girl, it is clear that the probability of having one boy and one girl is 1/4 + 1/4 = 1/2. Now we have:

1/4 + 2 x 1/4 + 1/4 = 1

and all is well with the world. Or better than it was before, anyway.

So, back to the question: given that we know you have one boy, what's the probability that you actually have two boys? Is it 1/4, because that's the total probability of having two boys? Well, no… because we already know that you have at least one boy, it's not possible that you have two girls. This restricts the number of possible outcomes to fewer cases, so we expect the probability of having two boys to be higher than 1/4.

Since we know you don't have two girls, the remaining options are that you either have two boys, or a boy and a girl. The total probability that you have two boys is 1/4, and the total probability that you have a boy and a girl is 1/2; the probability that you have any boys is then 1/4 + 1/2 = 3/4. Intuitively then, the contribution of 'having two boys' to this total probability of 'having any boys' is 1/4 ÷ 3/4 = 1/3. More formally, we consider this as a conditional probability. Here, we think of it as:

P(you have two boys | you have at least one boy)

which is read as: "the probability that you have two boys, given that you have at least one boy". In fact, by the standard definition of conditional probability (and filling in the appropriate values, noting that if you have two boys, you certainly have one boy, and so the probability of having two boys and at least one boy is just the probability of having two boys):

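Conditional probability for the simple puzzle:

P(two boys | at least one boy) = P(two boys AND at least one boy) / P(at least one boy)
= P(two boys) / P(at least one boy)
= (1/4) / (3/4)
= 1/3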

So far, so good.
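
The counting argument for the simple puzzle can be checked by brute force. This is only a sketch under the assumptions listed earlier (two sexes, 50:50, independent children); the names `families` and `probability` are made up for illustration. It lists the four ordered pairs and applies the conditional probability formula with exact fractions:

```python
from fractions import Fraction
from itertools import product

# Assumption: each child is independently a boy ("B") or a girl ("G"),
# each with probability 1/2, so all four ordered pairs are equally likely.
families = list(product("BG", repeat=2))  # (first child, second child)
p_each = Fraction(1, len(families))       # 1/4 per ordered pair

def probability(event):
    """Sum the probability of every ordered pair for which event(pair) is true."""
    return sum(p_each for family in families if event(family))

p_at_least_one_boy = probability(lambda f: "B" in f)
p_two_boys = probability(lambda f: f == ("B", "B"))

# P(two boys | at least one boy) = P(two boys) / P(at least one boy)
p_two_boys_given_boy = p_two_boys / p_at_least_one_boy

print(p_at_least_one_boy)    # 3/4
print(p_two_boys_given_boy)  # 1/3
```

Note that (B, G) and (G, B) appear as separate entries in `families`, which is exactly the ordering point made above.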

The less simple puzzle

When we include the days of the week in the question, as in the SGU puzzle, we are changing our labels. No longer are the girls (or boys) all equivalent to each other: now we have seven different classes of each sex - one for each day of the week, corresponding to the day on which they were born. The combinatorics could get a bit hairy at this point, but there is an intuitive way through the thicket.

We're assuming that, in addition to the stipulations above, the distribution of days of birth is uniform: any child is equally likely to have been born on any day of the week (with probability of 1/7). So, the probability of a randomly-chosen child being a boy born on a Tuesday is 1/2 x 1/7 = 1/14. We have, effectively, 14 uniformly-distributed classes. But the complicating factor in the SGU question is that the classes we're talking about are hierarchically nested, so there are two parent classes for the 14 (boy and girl).

It's not as neat to list out all the possible combinations this time, as there are 14 x 14 = 196 of them, but we can use conditional probability to take us through the logic. Using the shorthand B for boy and G for girl, and Mo, Tu, We, Th, Fr, Sa, Su for day of the week, we have B-Tu to represent our birthday boy, and B or G to represent all boys and all girls, respectively. Now,

Conditional probability for the less simple, SGU problem:

P(two B | at least one B-Tu) = P(two B AND at least one B-Tu) / P(at least one B-Tu)

We know that the probability of having two boys (in total) is 1/4, but we need to calculate the probability of there being at least one B-Tu in our pair of children: P(at least one B-Tu). This can happen in a number of ways: (G, B-Tu), (B-Tu, G), (B-Tu, B-Mo/We/Th/Fr/Sa/Su), (B-Mo/We/Th/Fr/Sa/Su, B-Tu) and (B-Tu, B-Tu). That is, for any B-Tu the other child can be:
  1. a girl (in two different orderings)
  2. a boy not born on a Tuesday (in two different orderings)
  3. a boy born on a Tuesday (in one ordering)
That this last possibility only has one possible way of being ordered is equivalent to there being only one way to have two girls, above: it's the ways of ordering the labels differently that are important, and there's only one way to order (B-Tu, B-Tu). So, these work out as probabilities:
  1. 1/2 x 1/14 + 1/14 x 1/2 = 14/196
  2. 6/14 x 1/14 + 1/14 x 6/14 = 12/196
  3. 1/14 x 1/14 = 1/196
And summing them we get (14 + 12 + 1)/196 = 27/196.

For the conditional probability we also need to know the probability of having two boys, where at least one is B-Tu: P(two B AND at least one B-Tu). This is just the sum of probabilities 2 and 3 in the list: (12 + 1)/196 = 13/196.

We can then apply the conditional probability formula, to get:

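Conditional probability and solution for the SGU problem:

P(two B | at least one B-Tu) = P(two B AND at least one B-Tu) / P(at least one B-Tu)
= (13/196) / (27/196)
= 13/27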

The probability is 13/27, which is what we said it was going to be.
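
Listing all 196 combinations by hand is tedious, but a computer doesn't mind. The sketch below assumes uniform sexes and birth days, as above, and the helper names (`has_tuesday_boy`, `two_boys`, `count`) are invented for this example. It enumerates every ordered pair of (sex, day) children and counts the cases directly:

```python
from fractions import Fraction
from itertools import product

DAYS = ["Mo", "Tu", "We", "Th", "Fr", "Sa", "Su"]

# Assumption: all 14 (sex, day) classes are equally likely for each child,
# so all 14 x 14 = 196 ordered pairs of children are equally likely.
children = list(product("BG", DAYS))
families = list(product(children, repeat=2))

def count(event):
    """Number of ordered pairs for which event(pair) is true."""
    return sum(1 for family in families if event(family))

def has_tuesday_boy(family):
    return ("B", "Tu") in family

def two_boys(family):
    return all(sex == "B" for sex, _day in family)

n_tuesday_boy = count(has_tuesday_boy)
n_both = count(lambda f: has_tuesday_boy(f) and two_boys(f))

# Equal-probability outcomes, so the conditional probability is a ratio of counts.
answer = Fraction(n_both, n_tuesday_boy)
print(n_tuesday_boy, n_both, answer)  # 27 13 13/27
```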

The paradox: sampling

What's striking to me about this is the apparent paradox that, if you only state that you have 'a boy', the probability that you have two boys is 1/3 ≈ 0.33 but, if you say you have a boy born on a Tuesday, the probability rises to 13/27 ≈ 0.48 ≈ 0.5! So, just by specifying the day on which your boy was born, it appears that you immediately have a considerably greater probability of having two boys. And it doesn't even matter what day it is that you say they were born on. That is disturbing, isn't it? How can the probability change just by knowing what day the child was born on? What's going on, here?

Obviously, the probability isn't really changing. You're no more or less likely to have two boys just because you declared a birthday. It's a sampling issue, and - as with so much in statistics - the result you get depends very much on how you ask the question to start with.

The way the question is originally posed, we are absolutely upfront and open about what we're asking: "I have two children. One is a boy born on Tuesday. What is the probability I have two boys?" [EDIT: again, the parent must tell you about the Tuesday-born boy, if they have one]. We have calculated an exact answer to this question, and this is equivalent to having a room filled with parents that have two children and asking them to leave if they do not have at least one boy born on a Tuesday; then asking how many of the parents left have two boys - the probability here being 13/27.

If, however, we had asked the parents to leave if they have no boys, and then asked one parent what day one of their boys (which may be the only boy) was born on, and they said 'Tuesday', we could ask how many of the parents left have two boys, and get an answer that 1/3 of them do.

The difference is this: in the puzzle we're specifying that at least one boy must be born on a Tuesday before we start counting; in the second example we are finding out that at least one boy was born on a Tuesday only in the course of the experiment, but nothing is conditioned on that fact - we're just counting everyone that's left there.

For the biologists out there, this would be analogous to the difference between identifying a subset of sequences with a known property, and seeing which other sequences they shared an 'important' characteristic with; and identifying a set of sequences that share an 'important' characteristic, and noting that one has our known property. You can probably think of your own examples from the literature…

If we're not concentrating (or not reading the paper closely) we might think that these two circumstances are identical, but they're not. It trips up the best of us and it should do, because it's often more subtle, and not always as clear-cut as it is in the puzzle. And that's tricky enough.

Example code:

Here's some Python code that simulates a large set of two-child families with uniform distribution of boys and girls, and a uniform distribution of birth days, just to prove that I'm not making this stuff up:
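
A minimal sketch of such a simulation might look like the following. It is written under the assumptions stated in the post (50:50 sexes, uniform birth days, independent children, and a parent who must report a boy or a Tuesday-born boy if they have one); the function names, the number of families and the random seed are all illustrative choices.

```python
import random

DAYS = ["Mo", "Tu", "We", "Th", "Fr", "Sa", "Su"]

def random_child(rng):
    """One child as (sex, birth day), both drawn uniformly (an assumption)."""
    return (rng.choice("BG"), rng.choice(DAYS))

def simulate(n_families=200_000, seed=1):
    rng = random.Random(seed)
    families = [(random_child(rng), random_child(rng)) for _ in range(n_families)]

    def fraction_two_boys(subset):
        two = sum(1 for f in subset if f[0][0] == "B" and f[1][0] == "B")
        return two / len(subset)

    # "Leave the room" filters: keep only families meeting the stated condition.
    any_boy = [f for f in families if any(sex == "B" for sex, _day in f)]
    tuesday_boy = [f for f in families if ("B", "Tu") in f]
    return fraction_two_boys(any_boy), fraction_two_boys(tuesday_boy)

p_boy, p_tuesday_boy = simulate()
print(f"at least one boy:         {p_boy:.3f}  (exact: 1/3)")
print(f"at least one Tuesday boy: {p_tuesday_boy:.3f}  (exact: 13/27)")
```

With a couple of hundred thousand families the two fractions should land close to 1/3 and 13/27 respectively, though the exact digits will vary with the seed.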


  1. Q1) A parent is chosen at random from a large set of two-child families. What is the probability that both of his/her children have the same gender?

    The answer is simple: 1/2.

    Q2) Another parent is chosen at random from the subset of these families that include at least one boy. What is the probability both children have the same gender (which, by necessity, must be "boy")?

    This answer is just a little less simple, but it is the one you solved above as the "simple" puzzle: 1/3.

    Q3) A third parent is chosen at random from the entire set, and this parent tells you "One of my two children is a boy." What is the probability both children have the same gender (which, by necessity, must be "boy")?

    This isn't as simple as you make it out to be. Let's call it X.

    Q4) A fourth parent is chosen at random from the entire set, and this parent tells you "One of my two children is a girl." What is the probability both children have the same gender (which, by necessity, must be "girl")?

    Since there is no practical difference between this question, and Q3, the answer must also be X. But if X=1/3, as you claim, we have a problem. If the probability of shared genders is 1/3 regardless of what gender the parent mentions, then it has to be the probability of shared genders if the parent says nothing. But that's Q1, where we know the answer is 1/2.

    This paradox, in another form, is well known in the world of probability. It's called Bertrand's Box Paradox. Joseph Bertrand introduced it in 1889, as a warning that "just the information" is insufficient to calculate probabilities. Sometimes, you need to know how the information was determined.

    The error you make in Q3, when you get 1/3, is assuming that a parent of a boy and a girl will always tell you about the boy, and never the girl. The same mistake is made in Q4, but the opposite way. But you can't assume both, since the questions are the same up to that point. If the method such a parent uses to decide what child to tell you about is not known, you have to assume (s)he chooses randomly between the two possibilities or else you get a contradiction.

    P(two boys|mention a boy)=P(two boys AND mention a boy)/P(mention a boy)
    = P(mention a boy|two boys)*P(two boys)
    / [P(mention a boy|two boys)*P(two boys) + P(mention a boy|one boy)*P(one boy)]
    = (1)*(1/4) / [(1)*(1/4) + (1/2)*(1/2)]
    = (1/4) / (1/2)
    = 1/2.

    And the paradox goes away.

    Q5) A fifth parent is chosen at random from the entire set. You ASK this parent if (s)he has a son, and (s)he says "yes." What is the probability both children have the same gender (which, by necessity, must be "boy")?

    This is the question you actually answered. Both conditional probabilities in the denominator of my equation become 1, and the answer changes to 1/3.

    The same logic applies to the more complicated question, but the math is slightly more complicated as well. You can't count ALL of the families that include a boy who was born on a Tuesday unless you know, for a fact, that the parent was asked to provide that specific information. If you don’t, you get a paradox unless you assume the parent chooses randomly between what are likely two similar facts. The answer will be 1/2 if you make that assumption, and 13/27 only if you assume (s)he was asked about Tuesday boys. That answer goes up, from 1/3 to 13/27, because a two-boy family is almost twice as likely to answer "yes" as a one-boy family.

    1. Thanks for the comment, JeffJo.

      You really made me think about what's going on with the wording of that simple puzzle. I agree: you're correct so long as the parent of a BG/GB family has the opportunity to name either the boy or the girl, and ask the appropriate question (even if phrased as 'I do not have two boys').

      In my defence, I was explicit about my interpretation of the question wording. However, I can't see a way to reach that interpretation without making the unstated (and unrealised at the time) assumption that the parent is compelled only ever to talk about a boy in the family. That unseen and unstated option for the BG/GB parent makes the puzzle even more interesting!

      I think I need to write a followup post… ;)



  2. But the whole point is that stating "one is a boy," and being compelled to say whether one is a boy, are quite different things. And it is the very act of compelling that causes the unexpected probability fluctuations as facts are added to what is being compelled.

    1. Thanks JeffJo, but I understood you the first time. Repeating your point won't help: it's correct, and I see that it's correct, as I said in my reply above.