Empirical and Probability Distributions

10. Empirical and Probability Distributions

Susanna Lange and Amanda R. Kube Jotte

In the past few chapters, we have discussed methods of sampling individuals from a population and how biased samples can affect the representativeness (generalizability) of our data. Remember, sampling is used to make inferences about a population when gathering information about the entire population is difficult or impossible. We make these inferences by calculating statistics on our sample, with the goal of estimating the true population parameter we are interested in.

At the start of this book, we learned how to slice DataFrames or select elements from arrays. When we select samples in this way, we are performing deterministic sampling. Deterministic sampling means the selection of data (i.e., a sample) is fixed or predictable in some way. For example, we may explicitly choose the third row of a DataFrame, or elements 5 through 15 of an array. In other words, no randomness is involved. Because a deterministic sample involves no random choice, we cannot use it to make strong inferences about a population.
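As a minimal sketch of deterministic sampling, the snippet below picks the third element and the elements at positions 5 through 15 of a NumPy array. The array `values` is hypothetical data made up for illustration; the same idea applies to selecting rows of a DataFrame by position.

```python
import numpy as np

# Hypothetical data for illustration: 20 values (0, 10, 20, ..., 190).
values = np.arange(20) * 10

# Deterministic selection: we decide exactly which positions to take.
third = values[2]        # the third element (Python counts from 0)
subset = values[5:16]    # positions 5 through 15 (the end of a slice is exclusive)

# Running the same selection again always gives the same result,
# because no randomness is involved.
same_subset = values[5:16]

print(third)
print(subset)
print(np.array_equal(subset, same_subset))
```

No matter how many times this code runs, the selected values are identical, which is what makes the sample deterministic.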

In this section, we will build on the random.choice() function we learned previously to create probabilistic samples, in which the probability of each unit being chosen is known before sampling is done. Simple random samples (SRS), as we learned in Chapter 9, are samples in which each unit has an equal probability of being chosen. Since we know the probability of each unit being chosen ahead of time, but not which units will be chosen, an SRS is an example of a probabilistic sample.
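The sketch below shows one way to draw a simple random sample, assuming that random.choice() refers to NumPy's `np.random.choice`. The population of 100 numbered units and the sample size of 10 are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical population: 100 units labeled 1 through 100.
population = np.arange(1, 101)
n = 10  # assumed sample size

# Draw a simple random sample without replacement:
# every unit is equally likely to be chosen, and no unit is chosen twice.
sample = np.random.choice(population, size=n, replace=False)

# Before sampling, we already know each unit's chance of being included:
# n out of N units are chosen, so the probability is n / N.
inclusion_prob = n / len(population)

print(sample)          # changes from run to run
print(inclusion_prob)  # known in advance
```

The units in `sample` differ each time the code runs, but `inclusion_prob` is fixed before any sampling happens. That combination is what makes the sample probabilistic.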

In this chapter, we will use probabilistic sampling and the probability basics we learned in the last chapter to explore ways of understanding a population from a sample.