Glossary of Terms

Index#

A | B | C | D | E | F | G | H | I | L | M | N | O | P | Q | R | S | T | U | V | #

B#

Bar Graph#

A chart that uses rectangular bars to compare values across categories. Bars can be vertical or horizontal.

Learn more in Chapter 7 | Back to Top

Bayes’ Theorem#

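A rule for reversing a conditional probability: it expresses the probability of $A$ given $B$ in terms of the probability of $B$ given $A$.

\[\text{P}(A|B) = \frac{\text{P}(B|A) \text{P}(A)}{\text{P}(B)}\]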
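
As a rough illustration, the formula can be evaluated for a made-up screening test. The numbers below (1% prevalence, 95% true-positive rate, 10% false-positive rate) and the function name `bayes_posterior` are assumptions chosen for this sketch, and $\text{P}(B)$ is obtained from the law of total probability.

```python
def bayes_posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) using Bayes' theorem."""
    # P(B) = P(B|A) P(A) + P(B|not A) P(not A)
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / p_b


# Hypothetical numbers: A = "has the condition", B = "tests positive"
posterior = bayes_posterior(prior_a=0.01, p_b_given_a=0.95, p_b_given_not_a=0.10)
print(round(posterior, 4))  # 0.0876
```

Even with a fairly accurate test, the probability of having the condition after a positive result stays below 9% here because the condition is rare.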

Triplets Problem#

What is the probability that at least three people in a group of $n$ people share the same birthday?

Learn more in Chapter 9 | Back to Top

Benchmarking#

A method used to compare observed proportions in a dataset to some external reference (the “benchmark”). In the context of data ethics and policing, benchmarking often compares the proportion of traffic stops for a group (e.g., by race) to that group’s proportion in the local driving population.

If the proportions are similar, this suggests no evidence of disproportionate treatment.
If the proportions are very different, this may indicate potential bias.

Learn more in Chapter 14 | Back to Top

Bernoulli Trial#

A random experiment with two possible outcomes, often described as ‘success’ or ‘failure’.

Learn more in Chapter 10 | Back to Top

Bias#

Bias refers to systematic errors in a study that are caused by skewed representation of certain subgroups within its population of interest and that can distort its findings.

Learn more in Chapter 8 | Back to Top

Biased Sampling#

Biased sampling occurs when the complete frame of the population is not utilized to draw samples.

Learn more in Chapter 8 | Back to Top

Binary#

A concept meaning “two possible values” or “two categories.”
For example, a binary variable can only be True or False, or “yes” or “no.”

Can also refer to a number system that uses only 0 and 1. Computers store and process all data in binary.

Learn more in Chapter 2 | Back to Top

Binomial Distribution#

A discrete probability distribution that describes the number of successes in a fixed number of independent trials (Bernoulli trials), each with two possible outcomes (success or failure). The probability of success, $p$, is the same for each trial.
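
A minimal sketch of the binomial probability of exactly $k$ successes in $n$ trials, $\binom{n}{k} p^k (1-p)^{n-k}$. The function name `binomial_pmf` and the coin example are illustrative choices, not part of the text above.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Example: exactly 2 heads in 4 tosses of a fair coin
prob = binomial_pmf(2, 4, 0.5)
print(prob)  # 0.375
```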

Learn more in Chapter 10 | Back to Top

Birthday Problem#

In a group of $n$ people, what is the probability that at least two of them share the same birthday?
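
One way to answer the question is to compute the probability that all $n$ birthdays are different and subtract it from 1. The sketch below assumes 365 equally likely birthdays and ignores leap years; the function name `birthday_probability` is hypothetical.

```python
def birthday_probability(n, days=365):
    """P(at least two of n people share a birthday), assuming equally likely days."""
    p_all_different = 1.0
    for i in range(n):
        # The (i + 1)-th person must avoid the i birthdays already taken
        p_all_different *= (days - i) / days
    return 1 - p_all_different


print(round(birthday_probability(23), 3))  # 0.507
```

Under these assumptions, a group of only 23 people already has a better than even chance of a shared birthday.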

Learn more in Chapter 9 | Back to Top

Boolean#

A data type with only two possible values: True or False.

Learn more in Chapter 2 | Back to Top

Boxplot#

A graphical summary of a distribution that shows the median, quartiles, and possible outliers. Also called a box-and-whisker plot.

Learn more in Chapter 7 | Back to Top

Broadcasting#

A set of rules in numpy that allows arrays of different shapes to be combined in arithmetic operations by “stretching” one to match the other.

import numpy as np
arr = np.array([1, 2, 3])
print(arr + 10)  # [11 12 13]

Learn more in Chapter 3 | Back to Top

C#

Call (A Function)#

To execute a function by writing its name followed by parentheses.

    print("Hello")

Learn more in Chapter 2 | Back to Top

Categorical Data#

Data that falls into groups or categories. Values are usually labels, not numbers. Examples: eye color, city of residence, type of car.

Learn more in Chapter 7 | Back to Top

Causality#

A quantifiable relationship between two variables where a change in one directly causes a change in the other.

Learn more in Chapter 8 | Back to Top

Center#

A summary statistic that measures the center of a distribution. Common measures of center include the mean, expected value, median, and mode.

Learn more in Chapter 10 | Back to Top

Central Limit Theorem (CLT)#

An important mathematical theorem which states that the distribution of sample means, computed from sufficiently large random samples (drawn with replacement), is approximately normal. The CLT allows us to estimate the mean and standard deviation of the distribution of sample means. If the mean and standard deviation of the population you are sampling from are $\mu$ and $\sigma$ respectively, then the mean and standard deviation of the distribution of sample means are $\mu$ and $\frac{\sigma}{\sqrt{n}}$, respectively, where $n$ is the sample size.
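
The claim about the mean and standard deviation of the sample means can be checked with a small simulation. This is only a sketch: the population (rolls of a fair six-sided die, so $\mu = 3.5$ and $\sigma^2 = 35/12$), the sample size, the number of repetitions, and the seed are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

n = 30              # sample size
repetitions = 5000  # number of samples drawn

# Population: rolls of a fair six-sided die
mu = 3.5
sigma = (35 / 12) ** 0.5

# Draw many samples with replacement and record each sample mean
sample_means = [
    statistics.mean(random.choices(range(1, 7), k=n))
    for _ in range(repetitions)
]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)

print(mean_of_means, mu)              # the two values should be close
print(sd_of_means, sigma / n ** 0.5)  # the two values should be close
```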

Learn more in Chapter 10 | Back to Top

Cherry-Picking#

Selecting only the data, examples, or results that support a desired conclusion, while ignoring evidence that contradicts it. This creates a misleading picture of reality.

Example: reporting only one experiment where a drug worked, while leaving out several where it did not.

Learn more in Chapter 13 | Back to Top

Cluster Sampling#

Much like stratified random sampling, cluster sampling starts from a population that is divided into clusters or subgroups. In this case, the sample is formed by drawing a simple random sample of these clusters.

Learn more in Chapter 8 | Back to Top

Colliding Variable#

When two variables have a causal impact on the same variable, the latter is called a collider or a colliding variable.

Learn more in Chapter 8 | Back to Top

Comment#

Text in code that Python ignores. Used for notes or explanations.

    # This is a comment

Learn more in Chapter 2 | Back to Top

Complement Of An Event#

The complement of an event $A$ is the event that $A$ does not occur.

Learn more in Chapter 9 | Back to Top

Compound Events#

A compound event is an event built from combinations of other events.

Learn more in Chapter 9 | Back to Top

Concatenate#

To link strings together using the + operator.

    "Hello" + " " + "world"  # "Hello world"

Learn more in Chapter 2 | Back to Top

Conditional Probability#

Conditional probability is the probability that an event occurs given that another event has already occurred.

Learn more in Chapter 9 | Back to Top

Conditioning#

Conditioning on a confounding variable requires experimenters to adjust for its influence so that they can correctly measure the causal effect of an independent variable on a dependent variable. This can be done using techniques like randomization: the value of the confounding variable is held constant across subjects, who are then randomly divided into two groups that are identical in all but the independent variable in question.

Learn more in Chapter 8 | Back to Top

Confidence Interval#

An interval which captures a plausible range of values for the true population parameter.

Learn more in Chapter 12 | Back to Top

Confounding Variable#

In the context of measuring the causal effect of one variable on another, a confounding variable or confounder is a third variable (an external factor) that affects both the independent variable (probable cause) and the dependent variable (probable effect). When the confounder is not accounted for correctly, a study can lead to incorrect conclusions about the causal effect of the independent variable on the dependent one.

Learn more in Chapter 8 | Back to Top

Consistency#

Evaluation of the plausibility that data was generated from a particular model. In other words, when evaluating a hypothetical scenario, we may ask “Is this outcome consistent with the proposed model?”

Learn more in Chapter 11 | Back to Top

Continuous Random Variable#

A random variable whose sample space contains uncountably many elements. The sample space of a continuous random variable is often an interval of possible outcomes.

Learn more in Chapter 10 | Back to Top

Continuous Uniform Distribution#

A probability distribution that assigns equal (uniform) probability density to every value of a continuous random variable $X$ on the interval $[a, b]$.

For example, if we randomly sample from a continuous sample space between 1 and 6, the random variable that takes values between 1 and 6 is denoted by $X \sim U(1,6)$.

The PDF for a continuous uniform random variable is:

$f(x) = \frac{1}{b-a}$ when $x$ is between $a$ and $b$ and 0 otherwise.

From this distribution, we can calculate the mean, which is the midpoint of the interval:

$\mu = \frac{b+a}{2}$

and the variance:

$\sigma^2 = \frac{(b-a)^2}{12}$
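
The three formulas above can be evaluated directly for $X \sim U(1,6)$. The helper names below are hypothetical and only mirror the formulas.

```python
def uniform_pdf(x, a, b):
    """Density of U(a, b): 1 / (b - a) inside the interval, 0 outside."""
    return 1 / (b - a) if a <= x <= b else 0.0


def uniform_mean(a, b):
    return (a + b) / 2


def uniform_variance(a, b):
    return (b - a) ** 2 / 12


a, b = 1, 6
print(uniform_pdf(3, a, b))                # 0.2
print(uniform_pdf(7, a, b))                # 0.0
print(uniform_mean(a, b))                  # 3.5
print(round(uniform_variance(a, b), 4))    # 2.0833
```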

Learn more in Chapter 10 | Back to Top

Control Groups#

The group within an experimental study that receives no treatment (for example, a placebo instead) is called the control group.

Learn more in Chapter 8 | Back to Top

CSV#

.csv is a file format (Comma-Separated Values) used to store tabular data in plain text, with each row as a line and columns separated by commas.

Learn more in Chapter 5 | Back to Top

D#

DataFrame#

A two-dimensional, labeled data structure from the pandas library, similar to a table or spreadsheet. Rows and columns can have labels, and each column can hold a different data type.

Learn more in Chapter 3 | Back to Top

Deep Copy#

A fully independent copy of an object and all objects nested within it. Changes to the original will not affect the deep copy.

import copy
original = [[1, 2], [3, 4]]
deep = copy.deepcopy(original)
original[0][0] = 99
print(deep)  # [[1, 2], [3, 4]]

Learn more in Chapter 3 | Back to Top

Default Value#

The value an optional argument takes if no value is provided.

    def greet(name="world"):
        print("Hello", name)

Learn more in Chapter 2 | Back to Top

Deterministic Sampling#

A type of sampling in which the selection of a sample is fixed or predictable, as opposed to involving randomness.

Learn more in Chapter 10 | Back to Top

Dictionary#

A collection of key–value pairs in Python. Keys must be unique and immutable.
Defined with curly braces and colons, e.g. {"a": 1, "b": 2}.

Learn more in Chapter 3 | Back to Top

Dimensions Of A Dataframe#

The number of rows $\times$ the number of columns in a DataFrame.

Learn more in Chapter 5 | Back to Top

Discrete Random Variable#

A random variable containing finite or countably infinite elements in its sample space. The sample space of a discrete random variable is a set of distinct possible outcomes.

Learn more in Chapter 10 | Back to Top

Discrete Uniform Distribution#

A probability distribution that assigns equal (uniform) probability to all outcomes in a discrete sample space. For example, the outcome of rolling a fair six-sided die follows a discrete uniform distribution, as the theoretical probability of rolling each side is $\frac{1}{6}$.

For a sample space containing $n$ elements, the Probability Mass Function (PMF) is defined by:

$P(X=x)=\frac{1}{n}$ for all $x$ in the sample space $S$ (0 otherwise).

Further, if $E$ is an event containing multiple elements from the sample space, then:

$P(E)=\frac{\text{Number of elements in E}}{n}$
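
A short sketch of the counting formula for $P(E)$, using exact fractions. The function name `event_probability` and the die example are assumptions made for illustration.

```python
from fractions import Fraction


def event_probability(event, sample_space):
    """P(E) = (number of elements in E) / n for equally likely outcomes."""
    outcomes = set(sample_space)
    favorable = set(event) & outcomes  # ignore anything outside the sample space
    return Fraction(len(favorable), len(outcomes))


die = range(1, 7)
evens = [2, 4, 6]
print(event_probability(evens, die))  # 1/2
```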

Learn more in Chapter 10 | Back to Top

Disjoint Events#

Events $A$ and $B$ are disjoint (or mutually exclusive) if they have no outcomes in common.

Learn more in Chapter 9 | Back to Top

Distribution#

The way values of a variable are spread out. For example, a distribution might be normal (bell-shaped), skewed, or uniform.

Learn more in Chapter 7 | Back to Top

Distributions#

Possible outcomes for a random event and their probabilities.

Learn more in Chapter 10 | Back to Top

Docstring#

A special string inside a function, class, or module that describes what it does. Typically written inside triple quotes.

Learn more in Chapter 2 | Back to Top

Dot Operator#

The . used in Python to access attributes and methods of objects.

my_list = [1, 2, 3]
my_list.append(4)  # uses the dot operator

Learn more in Chapter 3 | Back to Top

E#

Elementwise (Calculations)#

Operations that are applied independently to each element of a collection (like a numpy array).

import numpy as np
arr = np.array([1, 2, 3])
print(arr + 5)  # [6 7 8]

Learn more in Chapter 3 | Back to Top

Empirical Probability Distributions#

The observed distribution that may come from multiple samples or repeated experiments.

Learn more in Chapter 10 | Back to Top

Escape Sequence#

A special character combination used inside strings to represent things like newlines (\n) or tabs (\t).

    "Line 1\nLine 2"

Learn more in Chapter 2 | Back to Top

Estimation#

A term used to describe learning about a population characteristic from a sample.

Learn more in Chapter 12 | Back to Top

Event#

A set of outcomes of a random phenomenon.

Learn more in Chapter 9 | Back to Top

Expected Value#

A common measure of the center of a random variable, denoted by $\mu(X)$ or $E(X)$. It describes the long-run average value of the random variable. For a discrete distribution, this corresponds to a weighted average, with the given probabilities as weights. For example, given elements $x_1, \dots, x_n$ in a sample space with associated probabilities $p_1, p_2, \dots, p_n$, we find the mean by computing

$\mu = \sum_{i=1}^n x_i p_i$.
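
The weighted average can be computed in a few lines. The sketch below uses a fair six-sided die as an assumed example; the function name `expected_value` is hypothetical.

```python
def expected_value(values, probabilities):
    """Weighted average of the values, using the probabilities as weights."""
    if abs(sum(probabilities) - 1) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in zip(values, probabilities))


faces = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6
mu = expected_value(faces, probs)
print(round(mu, 10))  # 3.5
```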

Learn more in Chapter 10 | Back to Top

Experimental Studies#

A study in which researchers deliberately impose a treatment on a specific population to observe the resulting outcomes and estimate the treatment's causal effect on them is called an experimental study.

Learn more in Chapter 8 | Back to Top

Expression#

Any piece of code that Python can evaluate to a value. Example: 2 + 3, "hi" * 2.

Learn more in Chapter 2 | Back to Top

F#

Feature#

A feature is a column in a DataFrame that represents a variable associated with each datapoint (row).

Learn more in Chapter 5 | Back to Top

Flatten (An Array)#

The process of converting a multi-dimensional array into a one-dimensional array. In numpy, this can be done with the .flatten() method, which returns a copy of the data in a single dimension.

Learn more in Chapter 3 | Back to Top

Float#

A number with a decimal point. Example: 3.14, -0.001.

Learn more in Chapter 2 | Back to Top

Floor Division#

Division where the result is rounded down to the nearest whole number.

    7 // 3  # returns 2 because 7 divided by 3 is about 2.33, which is rounded down to 2

Learn more in Chapter 2 | Back to Top

Function (Built-In, User-Defined)#

A reusable block of code that performs a task.

Built-in function: provided by Python (e.g., len()).
User-defined function: written by the programmer with def.

Learn more in Chapter 2 | Back to Top

Function Body#

The indented code block inside a function that runs when the function is called.

Learn more in Chapter 2 | Back to Top

Function Definition#

The part of the code where a function is created, starting with the def keyword.

Learn more in Chapter 2 | Back to Top

G#

Glossary#

Learn more in Chapter 4 | Back to Top

H#

HARKing#

“Hypothesizing After the Results are Known.”
Presenting a hypothesis as if it was decided before the data analysis, when in fact it was created after seeing the results. This undermines the credibility of research, since hypotheses should guide analysis, not be invented from it.

Example: noticing a surprising relationship in the data and then writing the paper as if that relationship was the original hypothesis.

Learn more in Chapter 13 | Back to Top

Histogram#

A graph that shows the distribution of numerical data by grouping values into bins (intervals) and counting how many fall into each.

Learn more in Chapter 7 | Back to Top

Homogeneous#

Homogeneous means that all elements in a collection share the same data type.

Learn more in Chapter 5 | Back to Top

Hypothesis#

In statistics, a hypothesis is defined as a null or alternative view of how data were generated from a model.

Learn more in Chapter 11 | Back to Top

I#

Immutable Data Types#

Data types whose values cannot be changed after they are created. Examples: integers, floats, strings, tuples.

Learn more in Chapter 2 | Back to Top

Independent Events#

Independent events are two or more events where the occurrence of one does not affect the probability of the others.

Learn more in Chapter 9 | Back to Top

Index#

A numeric or labeled position used to access elements in a sequence (like a list or array) or rows/columns in a DataFrame. Python uses 0-based indexing.

Learn more in Chapter 3 | Back to Top

Integer#

A whole number (positive, negative, or zero) without a decimal point. Example: -5, 0, 42.

Learn more in Chapter 2 | Back to Top

Interpreted Language#

A programming language where the code is executed line by line by an interpreter, rather than being compiled into machine code all at once.

Python is an interpreted language, which means you can run code directly in the console or a notebook and see results immediately. This makes it easier to test and experiment, but it can also be slower than compiled languages like C or Java.

Learn more in Chapter 2 | Back to Top

Interquartile Range#

The range between the first quartile (Q1) and third quartile (Q3). Represents the middle 50% of the data.
$$ IQR = Q3 - Q1 $$
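
A small sketch of the calculation using Python's standard `statistics` module. The data are made up, and quartile conventions differ between tools, so other software may report slightly different values for the same data.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7]  # made-up example data

# With n=4, statistics.quantiles returns the three cut points Q1, Q2, Q3
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
print(q1, q3, iqr)  # 2.0 6.0 4.0
```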

Learn more in Chapter 7 | Back to Top

Intersection Of Events#

Intersection of events $A$ and $B$ is the set of all outcomes that are in $A$ and in $B$.

Learn more in Chapter 9 | Back to Top

L#

Law Of Large Numbers#

As the number of experiments increases, the mean of the empirical distribution gets closer to the mean of the probability distribution (also known as the expected value).
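
A quick simulation can make this concrete. The sketch below assumes a fair six-sided die (expected value 3.5) and a fixed seed; the function name `mean_of_rolls` is hypothetical.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable


def mean_of_rolls(num_rolls):
    """Average of num_rolls simulated rolls of a fair six-sided die."""
    rolls = [random.randint(1, 6) for _ in range(num_rolls)]
    return sum(rolls) / num_rolls


# The empirical mean tends to settle near 3.5 as the number of rolls grows
for num_rolls in (10, 1000, 100000):
    print(num_rolls, mean_of_rolls(num_rolls))
```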

Learn more in Chapter 10 | Back to Top

Lexicographic#

Ordering of strings based on the alphabetical order of characters, using their underlying ASCII or Unicode values.

Learn more in Chapter 2 | Back to Top

Library#

A collection of modules that provide useful tools for programming (e.g., NumPy, pandas).

Learn more in Chapter 2 | Back to Top

Line Graph#

A graph where data points are connected by lines, often used to show change over time.

Learn more in Chapter 7 | Back to Top

List#

An ordered, mutable collection in Python that can hold elements of different types.
Defined with square brackets, e.g. [1, 2, 3].

Learn more in Chapter 3 | Back to Top

M#

Mean#

The arithmetic average.
$$ \text{Mean} = \frac{\text{Sum of all values}}{\text{Number of values}} $$

Learn more in Chapter 7 | Back to Top

Median#

The middle value when data is ordered. If there is an even number of values, the median is the average of the two middle values.

Learn more in Chapter 7 | Back to Top

Method#

A function that is attached to an object and called with dot notation.

    "hello".upper()

Learn more in Chapter 2 | Back to Top

Model#

A set of rules that describe how data are generated.

Learn more in Chapter 11 | Back to Top

Module#

A single Python file that contains functions, classes, and variables related to a specific task (e.g., the math module).

Learn more in Chapter 2 | Back to Top

Modulo / Mod / Modulus#

The modulo operation (% in Python) finds the remainder after division of one number by another.

    7 % 3  # returns 1 because 7 divided by 3 is 2 with a remainder of 1

Learn more in Chapter 2 | Back to Top

Multiplication Rule#

\[\text{P}(A \text{ and } B) = \text{P}(A|B) \text{P}(B)\]

Learn more in Chapter 9 | Back to Top

Multistage Sampling#

Multistage sampling is a method that involves a series of sampling steps to select a random sample from a population by breaking it down into smaller and smaller groups.

Learn more in Chapter 8 | Back to Top

Mutable Data Types#

Data types whose values can be changed after creation. Examples: lists, dictionaries, sets.

Learn more in Chapter 2 | Back to Top

Mutation#

A change to a mutable object (such as a list or dictionary) after it has been created. Mutating an object modifies it in place.

Learn more in Chapter 3 | Back to Top

Mutually Exclusive Events#

Events $A$ and $B$ are mutually exclusive (or disjoint) if they have no outcomes in common.

Learn more in Chapter 9 | Back to Top

N#

Nominal Data#

Categorical data with no inherent order.
Example: red, green, blue; or dog, cat, bird.

Learn more in Chapter 7 | Back to Top

Non Response Bias#

Non response bias appears when some participants in a self-reported survey or study decline to respond, and those non-respondents differ meaningfully from the subset of the population whose responses are available.

Learn more in Chapter 8 | Back to Top

Normal Approximation To The Binomial Distribution#

When the number of trials, n, is sufficiently large, we can approximate a binomial distribution using a normal distribution. The number of trials, n, is deemed sufficiently large when we can satisfy the following conditions:

Is $np \geq 5$?
Is $n(1-p) \geq 5$?

If both conditions are met, you can approximate the binomial distribution with a normal distribution with mean:

$\mu = np$

and standard deviation:

$\sigma = \sqrt{np(1-p)}$
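
The two checks and the resulting parameters can be sketched as a small helper. The function name `normal_approximation` and its `threshold` parameter are assumptions for illustration; the threshold of 5 comes from the conditions above.

```python
from math import sqrt


def normal_approximation(n, p, threshold=5):
    """Return (mu, sigma) if both conditions hold, otherwise None."""
    if n * p >= threshold and n * (1 - p) >= threshold:
        return n * p, sqrt(n * p * (1 - p))
    return None


print(normal_approximation(100, 0.5))  # (50.0, 5.0)
print(normal_approximation(10, 0.1))   # None, because np = 1 is below 5
```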

Learn more in Chapter 10 | Back to Top

Normal Distribution#

A continuous distribution, often nicknamed the “bell-curve”, owing to its symmetric and bell-shaped characteristics. Due to the symmetry of the normal distribution, the three measures of center (mode, median, and mean) are exactly the same. The normal distribution is defined entirely in terms of its mean and standard deviation. Notationally, given a random variable $X$ that is normally distributed, we can say $X \sim N(\mu,\sigma)$, where $\mu$ and $\sigma$ are the mean and standard deviation of the distribution, respectively.

Learn more in Chapter 10 | Back to Top

Null Hypothesis#

The default view (or hypothesis), which is assumed to be true unless the data provide sufficient evidence against it.

Learn more in Chapter 11 | Back to Top

Numerical Data#

Data represented by numbers. Can be discrete (counts) or continuous (measurements). Examples: age, height, test scores.

Learn more in Chapter 7 | Back to Top

O#

Object#

A piece of data in Python that has a type and associated methods. Everything in Python is an object.

Learn more in Chapter 2 | Back to Top

Observational Studies#

In observational studies, researchers work with a dataset of observations that passively reflect the state of a subset of the population with respect to some specific outcome being studied. Observational data and inferential statistics can answer many interesting questions, but they cannot by themselves establish causal relationships between independent variables and the outcome.

Learn more in Chapter 8 | Back to Top

Optional Argument#

A function argument that you don’t have to provide when calling the function.

Learn more in Chapter 2 | Back to Top

Ordinal Data#

A type of categorical data where the categories have a natural order or ranking, but differences between ranks are not necessarily meaningful.
Example: small, medium, large.

Learn more in Chapter 7 | Back to Top

Outlier#

A value much higher or lower than the rest of the data. Outliers can strongly influence statistics like the mean.

Learn more in Chapter 7 | Back to Top

Output#

The result a program or function produces.

Learn more in Chapter 2 | Back to Top

P#

P-Hacking#

Manipulating data analysis until statistically significant results (p < 0.05) appear, even if those results are due to chance. This can involve trying many different tests, stopping data collection at just the right point, or selectively reporting results.

Example: running 20 different tests and only publishing the one that happened to be significant.

Learn more in Chapter 13 | Back to Top

P-Value#

The chance, under the null hypothesis, that the test statistic is equal to the observed value or is further in the direction of the alternative hypothesis.

Learn more in Chapter 11 | Back to Top

Parameter#

A numerical characteristic of a population, denoted by $\theta$.

Learn more in Chapter 12 | Back to Top

Percentile Bootstrap Confidence Interval#

A method for calculating a confidence interval by bootstrapping from the original sample. The statistic of interest (say, the mean) is calculated from each bootstrapped sample. The distribution of sample means is then used to calculate the empirical percentiles of the lower and upper bound of the desired confidence interval.
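
The procedure can be sketched with the standard library alone. Everything specific here is an assumption for illustration: the data are made up, the statistic is the mean, and the function name `percentile_bootstrap_ci`, the 2000 resamples, and the seed are arbitrary choices.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable


def percentile_bootstrap_ci(sample, num_resamples=2000, confidence=0.95):
    """Percentile bootstrap interval for the mean of `sample`."""
    # Resample with replacement and record the mean of each bootstrap sample
    boot_means = sorted(
        statistics.mean(random.choices(sample, k=len(sample)))
        for _ in range(num_resamples)
    )
    # Read the lower and upper bounds off the empirical percentiles
    tail = (1 - confidence) / 2
    lower = boot_means[int(tail * num_resamples)]
    upper = boot_means[int((1 - tail) * num_resamples) - 1]
    return lower, upper


sample = [12, 15, 9, 20, 14, 17, 11, 13, 16, 18]  # made-up data
lower, upper = percentile_bootstrap_ci(sample)
print(lower, upper)
```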

Learn more in Chapter 12 | Back to Top

Permutation Test#

A procedure to shuffle data to approximate the sampling distribution of a test statistic.
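
A minimal sketch of the idea for two groups, assuming the test statistic is the absolute difference in group means. The data, the seed, and the function name `permutation_test` are made up for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, num_shuffles=5000):
    """Approximate two-sided p-value for a difference in means."""
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(num_shuffles):
        random.shuffle(pooled)  # break any link between values and group labels
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / num_shuffles


p_value = permutation_test([1, 2, 3], [10, 11, 12])  # made-up data
print(p_value)  # close to 0.1: only 2 of the 20 possible splits are this extreme
```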

Learn more in Chapter 11 | Back to Top

Pie Graph#

A circular chart divided into slices, where each slice represents a proportion of the whole.

Learn more in Chapter 7 | Back to Top

Population#

A population is the entire group of subjects upon which an experimental or observational study is conducted.

Learn more in Chapter 8 | Back to Top

Probabilistic Sampling#

A type of sampling where the probability of each unit being chosen is known before sampling is done. Simple random samples (SRS) are an example of probabilistic sampling.

Learn more in Chapter 10 | Back to Top

Probability#

Probability is a numerical measure of how likely an event is to occur, ranging from 0 (impossible) to 1 (certain).

Learn more in Chapter 9 | Back to Top

Probability Density Function (PDF)#

A function to compute probabilities for continuous random variables. Unlike the discrete case, where all probabilities in a sample space will sum to 1, in the continuous case, this corresponds to an area of 1 under the curve of the probability density function.

Learn more in Chapter 10 | Back to Top

Probability Distributions#

The theoretical probabilities of the possible outcomes of a random variable. Probability distributions can be studied and understood without collecting any sample or conducting an experiment.

Learn more in Chapter 10 | Back to Top

Probability Mass Function (PMF)#

A function that assigns the probability of each possible outcome of a random variable for discrete random variables.

The PMF is usually denoted $P(X=x)$ where $X$ is a random variable and $x$ is the outcome of an event.

All probabilities, $P(X=x)$, must satisfy the following criteria:

the probability of each element occurring is greater than or equal to 0
the sum of all probabilities of elements in the sample space equals 1

Learn more in Chapter 10 | Back to Top

Q#

Quartile#

One of four equal parts of ordered data. Q1 is the 25th percentile, Q2 is the median (50th percentile), Q3 is the 75th percentile.

Learn more in Chapter 7 | Back to Top

R#

Random Phenomenon#

A phenomenon where individual outcomes are uncertain.

Learn more in Chapter 9 | Back to Top

Random Variable#

A numerical quantity representing an outcome of an event, often denoted by uppercase letters $X$ or $Y$.

Learn more in Chapter 10 | Back to Top

Randomization#

Randomization is the process of assigning experimental subjects randomly to different groups such that the groups are comparable in every aspect other than whether or not they receive the treatment. This is done to estimate the causal effect of the treatment on the specific outcome in question.

Learn more in Chapter 8 | Back to Top

Randomized Response#

Randomized response is a technique to conduct statistical research. This process allows researchers to gather sensitive information in a manner that introduces randomness to the survey response process, thereby preserving respondent privacy.

Learn more in Chapter 8 | Back to Top

RangeIndex#

RangeIndex is the default integer-based index in a pandas DataFrame or Series, automatically assigned when no custom index is provided.

Learn more in Chapter 5 | Back to Top

Relative File Path#

Starts from the folder where your notebook or script is located and specifies the file’s location relative to that folder. For example, ../data.csv.

Learn more in Chapter 5 | Back to Top

Reserved Words#

Special words in Python that cannot be used as variable names because they have specific meanings (e.g., if, for, while).

Learn more in Chapter 2 | Back to Top

Response Bias#

Response bias appears when participants in self-reported surveys or studies respond to questions in an inaccurate manner rather than with the truth.

Learn more in Chapter 8 | Back to Top

S#

Sample Mean#

Measure of center for an empirical distribution. Defined as:

$\bar{x} = \frac{\Sigma x_i}{n}$

Learn more in Chapter 10 | Back to Top

Sample Space#

The set of all possible outcomes of a random phenomenon.

Learn more in Chapter 9 | Back to Top

Sample Standard Deviation#

The square root of the sample variance, denoted by $s$.

Learn more in Chapter 10 | Back to Top

Sample Variance#

Measurement of the spread for an empirical distribution. Defined as:

$s^2 = \frac{\Sigma (x_i - \bar{x})^2}{n-1}$
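
The formula can be checked against Python's standard `statistics.variance`, which also divides by $n-1$. The data and the function name `sample_variance` are made up for illustration.

```python
import statistics


def sample_variance(data):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(data)
    x_bar = sum(data) / n
    return sum((x - x_bar) ** 2 for x in data) / (n - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data with mean 5
s2 = sample_variance(data)
print(s2)                         # 32 / 7, about 4.571
print(statistics.variance(data))  # same value from the standard library
print(s2 ** 0.5)                  # sample standard deviation s
```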

Learn more in Chapter 10 | Back to Top

Sampling#

The process of selecting a representative subset from the population of a study for analyzing specific phenomena.

Learn more in Chapter 8 | Back to Top

Scatter Plot#

A graph of points showing the relationship between two numerical variables. Each point represents one observation.

Learn more in Chapter 7 | Back to Top

Scope#

Where a variable is visible in code.

Variables defined inside a function have local scope.
Variables defined outside functions have global scope.

Learn more in Chapter 2 | Back to Top

Selection Bias#

Selection bias occurs when some groups within a population are disproportionately represented in a sample relative to the rest, leading to misleading generalizations.

Learn more in Chapter 8 | Back to Top

Set#

An unordered collection of unique elements in Python.
Defined with curly braces, e.g. {1, 2, 3}.

Learn more in Chapter 3 | Back to Top

Shallow Copy#

A new object that contains references to the elements of the original object rather than fully duplicating them. Nested objects remain shared.

original = [[1, 2], [3, 4]]
shallow = original.copy()
shallow[0][0] = 99
print(original)  # [[99, 2], [3, 4]]

Learn more in Chapter 3 | Back to Top

Significance Level#

A cut-off value, $\alpha$, that is used as a threshold to decide whether to reject the null hypothesis. A commonly used significance level is $\alpha=0.05$. If the p-value is smaller than $\alpha$, we reject $H_0$; otherwise we fail to reject it.

Learn more in Chapter 11 | Back to Top

Simple Random Sampling#

Simple Random Sampling is a technique to randomly select samples from a population of interest such that each member is equally likely to be included in the sample.

Learn more in Chapter 8 | Back to Top

Slice#

A portion of a sequence selected by specifying a start, stop, and step.

my_list = [0, 1, 2, 3, 4, 5]
print(my_list[1:4:2])  # [1, 3]

Learn more in Chapter 3 | Back to Top

Spread#

A summary statistic of the data that measures how the data is dispersed. Measures of spread include variance and standard deviation.

Learn more in Chapter 10 | Back to Top

Standard Deviation#

The square root of the variance, symbolized by $\sigma(X)$. The standard deviation measures how far the data typically lie from the mean.

Learn more in Chapter 10 | Back to Top

Standard Normal Distribution#

A special case of the normal distribution that occurs when $\mu = 0$ and $\sigma = 1$. Given a random variable with any values for $\mu$ and $\sigma$, that is, $X \sim N(\mu, \sigma)$, we can transform it to a standard normal by normalizing it. That is:

\[Z = \frac{X-\mu}{\sigma}\]

Note this may be useful if you are comparing values from multiple normal distributions.
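
As a rough example of such a comparison, the sketch below standardizes two scores from differently scaled tests. The scores, means, and standard deviations are hypothetical, and `z_score` is an illustrative name for the transformation above.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (or below) the mean."""
    return (x - mu) / sigma


# Hypothetical scores on two tests with different scales
z_a = z_score(85, mu=70, sigma=10)     # 1.5
z_b = z_score(620, mu=500, sigma=100)  # 1.2
print(z_a, z_b)
```

On the standardized scale, the first score is further above its mean than the second, even though the raw numbers are not directly comparable.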

Learn more in Chapter 10 | Back to Top

Stratified Random Sampling#

Stratified Random Sampling involves dividing the population into homogeneous subgroups, or strata, and then forming the sample by drawing a simple random sample from within each stratum.

Learn more in Chapter 8 | Back to Top

String#

A sequence of characters, such as letters, numbers, or symbols, enclosed in quotation marks. Example: "hello".

Learn more in Chapter 2 | Back to Top

Survivorship Bias#

Survivorship bias appears with the overrepresentation of responses from participants who remained (or ‘survived’) within the purview of a study or survey long-term while ignoring those who dropped out.

Learn more in Chapter 8 | Back to Top

Syntax#

The rules that define how Python code must be written so that the computer can understand it. Incorrect syntax usually causes an error.

Learn more in Chapter 2 | Back to Top

T#

Test Statistic#

A summary measure of the data that we use to investigate the consistency of a model. The choice of a test statistic depends on the specific hypotheses we are investigating. For example, the difference in means, medians, or standard deviations could all be used as test statistics.

Learn more in Chapter 11 | Back to Top

Total Variation Distance (TVD)#

Half the sum of the absolute differences between the category proportions of two distributions.

\[{\rm TVD}=\frac{1}{2} \sum |p_i-q_i|\]

In the above formula, the $p_i$ are the proportions of subjects in each category in one sample, while the $q_i$ are the corresponding proportions in the second sample.
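
The formula translates almost directly into code. In this sketch the function name and the two sets of proportions are made up for illustration, and both lists are assumed to cover the same categories in the same order.

```python
def total_variation_distance(p, q):
    """Half the sum of absolute differences between two lists of proportions."""
    return 0.5 * sum(abs(p_i - q_i) for p_i, q_i in zip(p, q))

# Hypothetical category proportions in two samples
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

tvd = total_variation_distance(p, q)
print(round(tvd, 3))  # 0.1
```

A TVD of 0 means the two sets of proportions are identical, and a TVD of 1 means the two samples share no categories.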

Learn more in Chapter 11 | Back to Top

Tuple#

An ordered, immutable collection in Python. Once created, its elements cannot be changed.
Defined with parentheses, e.g. (1, 2, 3).
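
A brief illustration of immutability: elements can be read by index, but attempting to assign to one raises a `TypeError`. The variable name `point` is arbitrary.

```python
point = (1, 2, 3)
print(point[0])  # 1

try:
    point[0] = 99  # tuples do not support item assignment
except TypeError as err:
    print("Cannot modify a tuple:", err)

print(point)  # (1, 2, 3)
```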

Learn more in Chapter 3 | Back to Top

Two Sample Test#

A hypothesis test performed to investigate whether there is a difference between two random samples or groups. This is also called A/B testing, since we can refer to the two groups as Group A and Group B.
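
One common way to carry out such a test is a permutation test on the difference in group means. The sketch below is an illustrative choice, not necessarily the procedure used in Chapter 11; the function name, the data, and the number of shuffles are all assumptions.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_test(group_a, group_b, n_shuffles=5000, seed=0):
    """Estimate a p-value for the absolute difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_shuffles):
        # Under the null hypothesis the group labels are interchangeable
        rng.shuffle(pooled)
        simulated = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if simulated >= observed:
            extreme += 1
    return extreme / n_shuffles

# Made-up measurements for Group A and Group B
group_a = [10, 11, 12, 13, 14]
group_b = [20, 21, 22, 23, 24]
p_value = permutation_test(group_a, group_b)
print(p_value < 0.05)  # True
```

A small estimated p-value means that a difference at least as large as the observed one rarely appears when the labels are shuffled at random.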

Learn more in Chapter 11 | Back to Top

Type 1 Error#

Rejection of the null hypothesis $H_0$ when it is true. The type 1 error rate is equal to the significance level ($\alpha$).

Learn more in Chapter 11 | Back to Top

Type 2 Error#

Failure to reject the null hypothesis $H_0$ when it is false.

Learn more in Chapter 11 | Back to Top

Type-Casting#

Converting a value from one data type to another.

print(float(3))   # 3.0
print(int("7"))   # 7

Learn more in Chapter 3 | Back to Top

U#

Undercoverage Bias#

Undercoverage bias appears when drawing participants from an incomplete sampling frame that leaves out significant sections of the population of interest.

Learn more in Chapter 8 | Back to Top

Uniform Distribution#

A distribution where all outcomes in a given sample space are equally likely to occur. Examples include the distribution of possible outcomes when tossing a fair coin, rolling a fair die, or using a random number generator. Uniform distributions can be defined for both discrete and continuous random variables.
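
A small sketch of both cases, using only the standard library. The fair six-sided die and the interval from 2 to 5 are illustrative assumptions.

```python
import random
from fractions import Fraction

# Discrete case: each face of a fair six-sided die has probability 1/6
die_faces = [1, 2, 3, 4, 5, 6]
pmf = {face: Fraction(1, len(die_faces)) for face in die_faces}
print(pmf[3])             # 1/6
print(sum(pmf.values()))  # 1

# Continuous case: a random value drawn uniformly between 2 and 5
rng = random.Random(0)
x = rng.uniform(2, 5)
print(2 <= x <= 5)        # True
```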

Learn more in Chapter 10 | Back to Top

Union Of Events#

The union of events $A$ and $B$ is the set of all outcomes that are in $A$, in $B$, or in both.
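
Python sets offer a direct way to illustrate this. The events below, defined for a single roll of a six-sided die, are made up for the example.

```python
# Outcomes of one die roll
A = {2, 4, 6}  # event: the roll is even
B = {4, 5, 6}  # event: the roll is greater than 3

union = A | B  # outcomes in A, in B, or in both
print(sorted(union))  # [2, 4, 5, 6]
```

Outcomes that belong to both events (here 4 and 6) appear only once in the union.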

Learn more in Chapter 9 | Back to Top

Unpacking (A Tuple)#

Assigning elements of a tuple (or other iterable) to individual variables in a single statement.

x, y = (1, 2)
print(x)  # 1
print(y)  # 2

Learn more in Chapter 3 | Back to Top

V#

Variable / Variable Name#

A name that refers to a value stored in memory. Example: x = 10.

Learn more in Chapter 2 | Back to Top

Variance#

A measure of the spread of data, symbolized by $\sigma^2(X) = \text{Var}(X)$. It is the average squared distance of the data from the mean, so larger values indicate more widely dispersed data.
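
To make the idea concrete, this sketch compares two made-up data sets with the same mean but different spread. The helper `population_variance` is hypothetical and divides by the number of values; the standard library's `statistics.variance` divides by one less than that to estimate the variance from a sample.

```python
import statistics

def population_variance(data):
    """Average squared deviation from the mean."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / len(data)

spread_out = [1.0, 2.0, 3.0, 4.0, 5.0]
clustered = [2.0, 3.0, 3.0, 3.0, 4.0]

print(population_variance(spread_out))   # 2.0
print(statistics.pvariance(spread_out))  # 2.0
print(statistics.pvariance(clustered))   # 0.4

# Sample variance divides by n - 1 instead of n
print(statistics.variance(spread_out))   # 2.5
```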

Learn more in Chapter 10 | Back to Top

#

0-Indexed#

The convention that counting positions in a sequence starts at 0 rather than 1. For example, in [10, 20, 30], the first element 10 is at index 0.
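
A quick illustration with the list from the definition. The variable name `values` is arbitrary.

```python
values = [10, 20, 30]

print(values[0])                # 10 (first element)
print(values[1])                # 20
print(values[len(values) - 1])  # 30 (last element is at index len - 1)
```

Because counting starts at 0, asking for `values[3]` in this three-element list would raise an `IndexError`.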

Learn more in Chapter 3 | Back to Top
