# The birthday problem: relaxed assumptions

# The birthday problem: relaxed assumptions#

There are two assumptions we used in the last few sections while investigating the birthday problem - equally likely birthdates and ignoring February 29 as a possible birth date. While relaxing these can complicate the mathematical calculation, the simulations can be easily modified to account for more complicated scenarios.

We use below a dataset from FiveThirtyEight that contains the number of daily births in US between 2000 and 2014 to estimate the odds of each day of the year to be a birthday:

https://github.com/fivethirtyeight/data/tree/master/births

Note that in the following dataset, the variable for day of week is coded 1 for Monday and 7 for Sunday. Also note that there are four leap years in this dataset - finding the correct probability for being born in a leap year is beyond the scope of this section.

```
birth_data = pd.read_csv("../../data/US_births_2000-2014_SSA.csv")
birth_data
```

year | month | date_of_month | day_of_week | births | |
---|---|---|---|---|---|

0 | 2000 | 1 | 1 | 6 | 9083 |

1 | 2000 | 1 | 2 | 7 | 8006 |

2 | 2000 | 1 | 3 | 1 | 11363 |

3 | 2000 | 1 | 4 | 2 | 13032 |

4 | 2000 | 1 | 5 | 3 | 12558 |

... | ... | ... | ... | ... | ... |

5474 | 2014 | 12 | 27 | 6 | 8656 |

5475 | 2014 | 12 | 28 | 7 | 7724 |

5476 | 2014 | 12 | 29 | 1 | 12811 |

5477 | 2014 | 12 | 30 | 2 | 13634 |

5478 | 2014 | 12 | 31 | 3 | 11990 |

5479 rows × 5 columns

This is an interesting dataset and we encourage you to use it to answer questions like: what is the least frequent day of the week for giving birth?

The `pandas`

library has commands that allow you to group rows by unique values in a column. We introduced it in Chapter 7.

```
counts_df=birth_data.groupby(['month','date_of_month']).sum()[['births']]
counts_df.head(5)
```

births | ||
---|---|---|

month | date_of_month | |

1 | 1 | 116030 |

2 | 144083 | |

3 | 170115 | |

4 | 171663 | |

5 | 166682 |

We see that there were 116,030 births on January 1st, 144,083 (much larger! why?) on January 2nd etc. A histogram of the counts in the above data frame:

```
plt.hist(counts_df.births,bins = np.arange(116000, 195000, 2000))
plt.xticks(ticks=[130000,150000,170000,190000], labels=["130K","150K","170K","190K"])
plt.xlabel("Number of births");
```

Note that some days of the year are outliers in number of births. Can you guess which?

We will use these counts to estimate the probability that a given date is a birthday for a random US subject.

```
bday_probs=counts_df.births/sum(counts_df.births)
```

These probabilities are added to the simulation when using the `random.choice`

function. Look at the function below and compare it to the `birthday_sim`

function introduced in Section 11.2.

```
# adding February 29 - the number of possible birthdays is now 366
birthdays2=np.arange(1,367,1)
def birthday_sim2(n,nrep,pr):
'''Estimate birthday matching probabilities using nrep simulations.
The 366 possible birthdays are weighted by given probabilities'''
outcomes = np.array([])
for i in np.arange(nrep):
outcomes = np.append(outcomes,
Counter(np.random.choice(birthdays2,n,p=pr)).most_common(1)[0][1])
return outcomes
```

We calculate below the probability for the case \(n=23\) using these relaxed assumptions. Before running the cell or reading its output, do you think the probability will be higher or lower?

```
n=23
nrep=100000
sum(birthday_sim2(n,nrep,bday_probs)>1)/nrep
```

```
0.50871
```

**Note**: more accurate simulation experiments do not always lead to different results - but we do not know that before performing them!

We will continue to investigate the issue of analytical and computation approaches throughout this textbook.