Biases

Recall from the introduction to this chapter that bigger data is not always better. This is often due to the sampling method used to gather that “Big Data”. We have already discussed the need for representative samples so that results from the sample generalize to the population. The bias introduced by oversampling some portions of the population relative to others is known as selection bias. However, this is not the only bias that can be introduced during the data collection process.

Non-response Bias

Imagine sampling participants and emailing each participant a survey to complete. Some participants might not complete that survey. Non-response bias occurs when the people who decline to respond differ in some meaningful way from those who do respond. Perhaps you wish to study parenting, and all single parents were too busy to complete the survey. Your study would be missing an important perspective.
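The effect of non-response can be illustrated with a small simulation. The following is a minimal sketch under invented assumptions: the population size, the share of single parents, the "free time" values, and the response rates are all hypothetical numbers chosen only to make the mechanism visible.

```python
import numpy as np

# All numbers below are hypothetical and chosen only for illustration.
rng = np.random.default_rng(0)

# A population of 1,000 parents, the first 300 of whom are single parents.
is_single = np.arange(1000) < 300

# Assumed outcome of interest: weekly hours of free time.
free_time = np.where(is_single, 5.0, 15.0)

# Assumed response rates: single parents respond far less often.
response_prob = np.where(is_single, 0.1, 0.6)
responded = rng.random(1000) < response_prob

# Compare the full population with those who actually answered.
population_mean = free_time.mean()
respondent_mean = free_time[responded].mean()
print(population_mean, respondent_mean)
```

Because the group with less free time rarely responds, the average among respondents overstates the population average, even though every individual answer is accurate.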

Response Bias

Turning attention from those who did not respond to those who did, their responses can suffer from response bias. Response bias can take several forms. Sometimes, participants have an incentive to respond in ways that might not be truthful, especially if questions are sensitive or embarrassing. For example, in a survey of campus sexual health, students might be embarrassed to report STIs, and therefore trends in these data may be misleading. This can be influenced by the wording or tone of the questions, as well as by whether participants have been assured that their data will be kept private.

Some response bias stems from boredom rather than from a reluctance to tell the truth. For example, especially in long surveys, participants may care more about completing the task than completing the task well. Some participants may choose to select random answers, select the same answer for every question, or answer questions in a pattern. It is important for a researcher to consider the wording, tone and length of a survey carefully, and to check all surveys for possible response bias before analyzing data.
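One of the checks described above, looking for participants who selected the same answer for every question, can be sketched in a few lines. This is only an illustration: the `responses` array is made-up data, and the 1 to 5 answer coding is an assumption.

```python
import numpy as np

# Hypothetical responses: one row per participant, one column per question,
# with answers coded 1-5.
responses = np.array([
    [4, 2, 5, 3, 1, 4],   # varied answers
    [3, 3, 3, 3, 3, 3],   # same answer throughout
    [1, 2, 3, 4, 5, 1],   # varied, though possibly a repeating pattern
    [5, 5, 5, 5, 5, 5],   # same answer throughout
])

# A row is flagged when every answer equals that row's first answer.
same_answer = (responses == responses[:, [0]]).all(axis=1)

# Row indices of the flagged participants.
flagged = np.flatnonzero(same_answer)
print(flagged)
```

A flag like this is a prompt for a closer look rather than proof of carelessness, since a participant may sincerely hold the same view on every question. Patterned or random answers would need other checks.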

Randomized Response

Suppose I want to know how many college students have cheated on an exam at some point in their lives. Students are less likely to respond truthfully about cheating, which could create response bias in my sample. One strategy for combatting this type of response bias is to use a technique known as randomized response. Instead of asking all students to respond truthfully, I ask them to flip two coins without letting me see the results. If the first coin lands on heads, they should give a truthful answer. If it lands on tails, their answer depends on the second coin flip: they answer “yes” if the second coin lands on heads and “no” if it lands on tails (see image below).

[Figure: flowchart of the randomized response procedure (../../_images/FlowChart.png)]

This inserts randomness into the responses, ensuring that the researcher does not know who answered truthfully and therefore does not know who has cheated in the past. This encourages students to give truthful answers when prompted and still allows the researcher to estimate the true proportion of students who have cheated. Let’s explore this in a simulation.

Let the true proportion of students who have cheated be 30%. We can simulate the truthful answers and coin flips of 100 students using np.random.choice as follows.

import numpy as np

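# Fix the seed so the simulation is reproducible
np.random.seed(1890)

# Each student's truthful answer: "Yes" with probability 0.3
truth = np.random.choice(["Yes", "No"], 100, p=[0.3, 0.7])
# Two fair coin flips per student
flip1 = np.random.choice(["Heads", "Tails"], 100)
flip2 = np.random.choice(["Heads", "Tails"], 100)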

The answers reported by students in our randomized response survey would be the following.

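# Start from the truthful answers (kept when the first coin lands on heads)
reported = truth.copy()

# When the first coin lands on tails, the second coin dictates the answer
reported[(flip1 == "Tails") & (flip2 == "Heads")] = "Yes"
reported[(flip1 == "Tails") & (flip2 == "Tails")] = "No"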

sum(reported == "Yes") / 100
0.43

Compare this to the true proportion.

sum(truth == "Yes") / 100
0.27

Based on the coin flips, about half of the participants responded truthfully. About a quarter of the participants were instructed to respond “yes” and about a quarter to respond “no”, regardless of the truth. Therefore, if we call the chance of someone truthfully responding “yes” \(p\), the chance of seeing a response of “yes” (truthful or not) is \(P(\text{yes}) = \frac{1}{2}p + \frac{1}{4}\): a truthful “yes” occurs with probability \(\frac{1}{2}p\), and a coin-dictated “yes” occurs with probability \(\frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}\). Solving for \(p\), the chance of a truthful “yes” is \(p = 2P(\text{yes}) - \frac{1}{2}\).

2 * (sum(reported == "Yes") / 100) - (1 / 2)
0.36

This is a reasonable estimate of the truthful proportion of 0.27, given that only 100 students were surveyed. This method (like many other data science methods) makes use of properties of probabilities, which we will learn more about next!
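To see how the estimate behaves beyond a single sample of 100 students, the simulation can be repeated many times at different sample sizes. The following is a minimal sketch, not part of the analysis above: the helper names `simulate_survey` and `randomized_response_estimate`, the seed, the 200 repetitions, and the sample sizes are all choices made for illustration. It uses NumPy's `default_rng` generator rather than `np.random.seed`, so its draws will not reproduce the numbers shown earlier.

```python
import numpy as np

def simulate_survey(n, true_p, rng):
    """Simulate the reported answers of n students in a two-coin survey."""
    truth = rng.choice(["Yes", "No"], n, p=[true_p, 1 - true_p])
    flip1 = rng.choice(["Heads", "Tails"], n)
    flip2 = rng.choice(["Heads", "Tails"], n)
    # The second coin dictates the answer when the first coin lands on tails.
    dictated = np.where(flip2 == "Heads", "Yes", "No")
    return np.where(flip1 == "Heads", truth, dictated)

def randomized_response_estimate(reported):
    """Apply p = 2 * P(yes) - 1/2 to an array of reported answers."""
    return 2 * np.mean(reported == "Yes") - 0.5

rng = np.random.default_rng(0)  # arbitrary seed (an assumption of this sketch)
average = {}
spread = {}
for n in (100, 10_000):
    estimates = [
        randomized_response_estimate(simulate_survey(n, 0.3, rng))
        for _ in range(200)
    ]
    average[n] = np.mean(estimates)
    spread[n] = np.std(estimates)
    print(n, round(average[n], 3), round(spread[n], 3))
```

The estimates should center near the true proportion of 0.3 at both sample sizes, while their spread shrinks as the sample grows. This suggests that a gap like the one between 0.36 and 0.27 is ordinary sampling noise for a survey of 100 students.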