A simulation-based solution#

We have introduced the concept of simulation in Section 5.3 (we simulated the output of rolling a six-sided die). Simulations are used to imitate real-world scenarios. They can have many roles, ranging from obtaining insight into a physical system to testing and training of models and algorithms. In this chapter, we will use simulations to estimate probabilities. We will explore other applications later, such as gaining insight into probability distributions.

Here we use the computer to run/execute our simulations and we will develop code that will allow us to do that. There are several steps in designing and executing a simulation:

Conceptualize what to simulate;
Simulate one instance;
Decide on the number of repetitions;
Summarize the results of the simulation.

There are two important issues that we would like to note here:

Simulations will give us only an estimate/approximation of the probability we are interested in; the more repetitions, the better the approximation.
The number of repetitions is important and strategies for selecting them will be discussed in more detail later in the chapter.

Conceptualize and simulate one instance#

In the birthday problem described at the beginning of this chapter (30 people at a party), the only information that is needed for deciding on matching birthdays is the set of birthdates. The simulation we will construct will focus on that - we just need a function that generates 30 random birthdays. It turns out we have already seen a function in numpy that can do that: random.choice.

The cell code below shows the results of simulating a group of 30 birthdates, where birthdates are described by the integers from 1 to 365 and saved in the array birthdays.

birthdays=np.arange(1,366,1)
n=30
one_run=np.random.choice(birthdays,n)
one_run

array([215, 305, 148, 322,  96, 244, 274, 315,  80,  39, 242, 255, 234,
        70, 119, 211, 287, 271,  39, 354, 312,  61, 127, 266, 322, 356,
       154, 336, 192, 190])

Are there duplicates in the one_run array? It is not easy to answer by just looking at the above output, but we can use the numpy sorting function to help with that.

np.sort(one_run)

array([ 39,  39,  61,  70,  80,  96, 119, 127, 148, 154, 190, 192, 211,
       215, 234, 242, 244, 255, 266, 271, 274, 287, 305, 312, 315, 322,
       322, 336, 354, 356])

We need a function that will give us the number of occurrences of the most frequent day. There are many ways to achieve this:

use the Counter function we introduced before;
numpy has a useful function called bincount;
write your own function.

We will use the first option; using the other two are useful exercises for developing your coding skills. Let’s first recall the Counter function output:

from collections import Counter
Counter([15,2,7,2,2,1,7])

Counter({2: 3, 7: 2, 15: 1, 1: 1})

As you can see in the above output, Counter calculates for each element in the list or array the number of occurrences. The most frequent occurrence (in case of ties, the one that occurs first in our list/array) can be obtained as below. The following cells show the code for extracting: (i) the most frequent element and the number of times it appears in our list or array; (ii) the number of appearances in the list of the most frequent element.

Counter([15,2,7,2,2,1,7]).most_common(1)

[(2, 3)]

Counter([15,2,7,2,2,1,7]).most_common(1)[0][1]

Repeated simulations and summary#

We now create a function that will simulate nrep repetitions of the birthday setting for n subjects. The function returns an array that has nrep entries, each showing the count (frequency) of the most frequent birthday in one simulation.

def birthday_sim(n,nrep):
    '''Estimate birthday matching probabilities using nrep simulations'''
    outcomes = np.array([])
    for i in np.arange(nrep):
        outcomes = np.append(outcomes, 
                             Counter(np.random.choice(birthdays,n)).most_common(1)[0][1])
    return outcomes

# results for ten repetitions of a group of 23 people
birthday_sim(23,10)

array([1., 2., 2., 1., 1., 1., 1., 2., 2., 1.])

We are ready now to estimate the probability that at least two people share a birthday in a group of n random subjects. In a similar fashion to the coin toss example where the probability of heads was given by the long run frequency (number of heads / number of tosses), the estimated probability is

\[\frac{\mbox{number of repetitions with shared birthdays}}{\mbox{nrep}}\]

As mentioned above, the number of repetitions affects both accuracy (better accuracy for more repetitions) and computational time. Chapter 12 provides more details on this issue. In the cell code below we use 1000 repetitions.

n=23
nrep=1000
sum(birthday_sim(n,nrep)>=2)/nrep

0.498

Note that the result we obtain is close but not equal to the probability we calculated at the beginning of this chapter (0.5073). Increasing the number of repetitions will lead to an estimate that is closer to the exact probability. Also, rerunning the above cell will lead to a (slightly) different result:

n=23
nrep=1000
sum(birthday_sim(n,nrep)>=2)/nrep

0.531

On the assumptions used in simulations#

It is important to consider whether the assumptions made in the simulations are the same or different than the ones we made in the mathematical derivation. The answer is yes (same assumptions) because: (i) the birthdays array has 365 elements (so we implicitly assume the year has 365 days); and (ii) birthdates are independent and equally likely (because the numpy random.choice function is designed to sample elements this way when only sample size is provided).

We will talk in a later section about ways to relax these assumptions.

Introduction to Data Science I & II

A simulation-based solution

Contents

A simulation-based solution#

Conceptualize and simulate one instance#

Repeated simulations and summary#

On the assumptions used in simulations#