Glossary of Terms#
Index#
A| B| C| D| E| F| G| H| I| L| M| N| O| P| Q| R| S| T| U| V| #
A#
A/B Test#
A hypothesis test performed to investigate whether there is a difference between two random samples or groups (e.g., Group A and Group B). This is also called a Two Sample Test.
Learn more in Chapter 11 | Back to Top
Absolute File Path#
Starts from the root of your computer and specifies the complete location of a file. For example, /Users/<username>/Documents/data.csv.
Learn more in Chapter 5 | Back to Top
Aliasing#
In Python, aliasing can refer to two related but different ideas:
Import aliasing
Using the as keyword to give a shorter or alternate name to a module or object when importing.
This is often used to make code more concise and to follow conventions.
import numpy as np
arr = np.array([1, 2, 3])
Here, np is an alias for the numpy library.
Variable aliasing
When two variables refer to the same underlying object in memory.
Changing the object through one variable will affect the other.
a = [1, 2, 3]
b = a  # b is an alias for the same list as a
b.append(4)
print(a)  # [1, 2, 3, 4]
In both cases, aliasing means you have more than one “name” that refers to the same thing.
Learn more in Chapter 3 | Back to Top
Alternative Hypothesis#
A view (or hypothesis) that is opposite of the null hypothesis.
Learn more in Chapter 11 | Back to Top
Argument#
The value you pass into a function when you call it.
len("hello") # "hello" is the argument
Learn more in Chapter 2 | Back to Top
Array#
A structured, ordered collection of elements of the same type. In Python, often created with numpy for numerical data and efficient elementwise operations.
Learn more in Chapter 3 | Back to Top
Assignment Statement#
A statement that stores a value in a variable using =.
x = 5
Learn more in Chapter 2 | Back to Top
Association#
An association between two variables implies a pattern that can help us make inferences on one variable given information about the other.
Learn more in Chapter 8 | Back to Top
Attribute#
A value or property stored inside an object that can be accessed using the dot operator.
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3]})
print(df.shape) # (3, 1)
Learn more in Chapter 3 | Back to Top
B#
Bar Graph#
A chart that uses rectangular bars to compare values across categories. Bars can be vertical or horizontal.
Learn more in Chapter 7 | Back to Top
Bayes’ Theorem#
Triplets Problem#
What is the probability that at least three people in a group of \(n\) people share the same birthday?
Learn more in Chapter 9 | Back to Top
Benchmarking#
A method used to compare observed proportions in a dataset to some external reference (the “benchmark”). In the context of data ethics and policing, benchmarking often compares the proportion of traffic stops for a group (e.g., by race) to that group’s proportion in the local driving population.
If the proportions are similar, this suggests no evidence of disproportionate treatment.
If the proportions are very different, this may indicate potential bias.
Learn more in Chapter 14 | Back to Top
Bernoulli Trial#
A random experiment with two possible outcomes, often described as ‘success’ or ‘failure’.
Learn more in Chapter 10 | Back to Top
Bias#
Bias refers to systematic errors in a study that are caused by skewed representation of certain subgroups within its population of interest and that can distort its findings.
Learn more in Chapter 8 | Back to Top
Biased Sampling#
Biased sampling occurs when the complete frame of the population is not utilized to draw samples.
Learn more in Chapter 8 | Back to Top
Binary#
A concept meaning “two possible values” or “two categories.”
For example, a binary variable can only be True or False, or “yes” or “no.”
Can also refer to a number system that uses only 0 and 1. Computers store and process all data in binary.
Learn more in Chapter 2 | Back to Top
Binomial Distribution#
A discrete probability distribution that describes the number of successes in a fixed number of independent trials (Bernoulli trials), each with two possible outcomes (success or failure). The probability of success, \(p\), is the same for each trial.
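As a minimal sketch, the probability of exactly \(k\) successes in \(n\) trials can be computed with the standard binomial formula \(\binom{n}{k} p^k (1-p)^{n-k}\). The function name binomial_pmf and the coin-flip numbers are illustrative assumptions, not taken from the chapter.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial random variable with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 3 heads in 5 flips of a fair coin
print(binomial_pmf(3, 5, 0.5))  # 0.3125
```

Summing binomial_pmf(k, n, p) over k = 0, ..., n should give 1, which is a quick sanity check on any PMF.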
Learn more in Chapter 10 | Back to Top
Birthday Problem#
In a group of \(n\) people, what is the probability that at least two of them share the same birthday?
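One common way to answer this is to compute the probability that all \(n\) birthdays are different and subtract it from 1. The sketch below assumes 365 equally likely birthdays and ignores leap years; both are simplifying assumptions, and the function name is hypothetical.

```python
def birthday_probability(n, days=365):
    """Probability that at least two of n people share a birthday."""
    if n > days:
        return 1.0  # more people than days, so a match is guaranteed
    p_all_different = 1.0
    for i in range(n):
        # the (i + 1)-th person must avoid the i birthdays already taken
        p_all_different *= (days - i) / days
    return 1 - p_all_different

print(round(birthday_probability(23), 3))  # 0.507
```

With only 23 people, the probability of a shared birthday already exceeds one half.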
Learn more in Chapter 9 | Back to Top
Boolean#
A data type with only two possible values: True
or False
.
Learn more in Chapter 2 | Back to Top
Boxplot#
A graphical summary of a distribution that shows the median, quartiles, and possible outliers. Also called a box-and-whisker plot.
Learn more in Chapter 7 | Back to Top
Broadcasting#
A set of rules in numpy that allow arrays of different shapes to be combined in arithmetic operations by “stretching” one to match the other.
import numpy as np
arr = np.array([1, 2, 3])
print(arr + 10) # [11 12 13]
Learn more in Chapter 3 | Back to Top
C#
Call (A Function)#
To execute a function by writing its name followed by parentheses.
print("Hello")
Learn more in Chapter 2 | Back to Top
Categorical Data#
Data that falls into groups or categories. Values are usually labels, not numbers. Examples: eye color, city of residence, type of car.
Learn more in Chapter 7 | Back to Top
Causality#
A quantifiable relationship between two variables where a change in one directly causes a change in the other.
Learn more in Chapter 8 | Back to Top
Center#
A summary statistic of the data that measures the center of a distribution. Measures of center are often calculated using mean, expected value, median, or mode.
Learn more in Chapter 10 | Back to Top
Central Limit Theorem (CLT)#
An important mathematical theorem which states that the distribution of the means of sufficiently large random samples (drawn with replacement) is approximately normal. The CLT allows us to estimate the mean and standard deviation of the distribution of sample means. If the mean and standard deviation of the population you are sampling from are \(\mu\) and \(\sigma\), respectively, then the mean and standard deviation of the distribution of sample means are \(\mu\) and \(\frac{\sigma}{\sqrt{n}}\), respectively, where \(n\) is the sample size.
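The relationship between the population and the distribution of sample means can be checked with a small simulation. This is only a sketch using the standard library: the die population, sample size, number of samples, and seed are all arbitrary choices made for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

population = [1, 2, 3, 4, 5, 6]         # outcomes of a fair six-sided die
mu = statistics.mean(population)        # population mean
sigma = statistics.pstdev(population)   # population standard deviation

n = 30              # size of each sample
num_samples = 5000  # number of samples drawn

# Draw many samples with replacement and record each sample mean
sample_means = [
    statistics.mean(random.choices(population, k=n))
    for _ in range(num_samples)
]

print("mean of sample means:", statistics.mean(sample_means), "vs mu =", mu)
print("sd of sample means:", statistics.stdev(sample_means),
      "vs sigma / sqrt(n) =", sigma / n ** 0.5)
```

The two printed pairs should be close to each other, and a histogram of sample_means would look roughly bell-shaped even though a single die roll is uniform.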
Learn more in Chapter 10 | Back to Top
Cherry-Picking#
Selecting only the data, examples, or results that support a desired conclusion, while ignoring evidence that contradicts it. This creates a misleading picture of reality.
Example: reporting only one experiment where a drug worked, while leaving out several where it did not.
Learn more in Chapter 13 | Back to Top
Cluster Sampling#
Much like stratified random sampling, the population for cluster sampling also consists of clusters or subgroups. In this case, the sample is formed by drawing a simple random sample of these clusters.
Learn more in Chapter 8 | Back to Top
Colliding Variable#
When two variables have a causal impact on the same variable, the latter is called a collider or a colliding variable.
Learn more in Chapter 8 | Back to Top
Comment#
Text in code that Python ignores. Used for notes or explanations.
# This is a comment
Learn more in Chapter 2 | Back to Top
Complement Of An Event#
The complement of an event \(A\) is the event that \(A\) does not occur.
Learn more in Chapter 9 | Back to Top
Compound Events#
A compound event is an event built from combinations of other events.
Learn more in Chapter 9 | Back to Top
Concatenate#
To link strings together using the + operator.
"Hello" + " " + "world" # "Hello world"
Learn more in Chapter 2 | Back to Top
Conditional Probability#
Conditional probability is the probability that an event occurs given that another event has already occurred.
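In symbols, this is usually written \(P(A \mid B) = \frac{P(A \cap B)}{P(B)}\). A minimal sketch with one roll of a fair die follows; the events and the helper name prob are made up for illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # one roll of a fair die
A = {2, 4, 6}                       # event: the roll is even
B = {4, 5, 6}                       # event: the roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

# P(A | B) = P(A and B) / P(B)
p_A_given_B = prob(A & B) / prob(B)
print(p_A_given_B)  # 2/3
```

Knowing that the roll is greater than 3 raises the probability of an even roll from 1/2 to 2/3.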
Learn more in Chapter 9 | Back to Top
Conditioning#
Conditioning for a confounding variable requires experimenters to statistically adjust for its influence in order to correctly measure the causal effect of an independent variable on a dependent variable. Conditioning for a confounding variable can be done using techniques like randomization. This involves keeping the value of the confounding variable constant across subjects and then randomly dividing the subjects into two groups that are identical in all but the independent variable in question.
Learn more in Chapter 8 | Back to Top
Confidence Interval#
An interval which captures a plausible range of values for the true population parameter.
Learn more in Chapter 12 | Back to Top
Confounding Variable#
In the context of measuring the causal effect of one variable on another, a confounding variable or confounder is a third variable (an external factor) that affects both the independent variable (probable cause) and the dependent variable (probable effect). When the confounder is not factored in correctly, a study can lead to an incorrect conclusion about the causal effect of the independent variable on the dependent one.
Learn more in Chapter 8 | Back to Top
Consistency#
Evaluation of the plausibility that data was generated from a particular model. In other words, when evaluating a hypothetical scenario, we may ask “Is this outcome consistent with the proposed model?”
Learn more in Chapter 11 | Back to Top
Continuous Random Variable#
A random variable that can take uncountably many values. The sample space of a continuous random variable is often an interval of possible outcomes.
Learn more in Chapter 10 | Back to Top
Continuous Uniform Distribution#
A probability distribution for a continuous random variable \(X\) that assigns equal (uniform) probability density to every value in the interval \([a, b]\).
For example, if we randomly sample a value from a continuous sample space between 1 and 6, the random variable that takes values between 1 and 6 is denoted by \(X \sim U(1,6)\).
The PDF for a continuous uniform random variable is:
\(f(x) = \frac{1}{b-a}\) when \(x\) is between \(a\) and \(b\) and 0 otherwise.
From this distribution, we calculate the mean, which is the midpoint of the interval:
\(\mu = \frac{b+a}{2}\)
and the variance:
\(\sigma^2 = \frac{(b-a)^2}{12}\)
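The three formulas above can be sketched directly in Python for \(X \sim U(1,6)\). The function names are hypothetical helpers, not part of any library.

```python
def uniform_pdf(x, a, b):
    """Density of a continuous uniform distribution on [a, b]."""
    return 1 / (b - a) if a <= x <= b else 0.0

def uniform_mean(a, b):
    return (a + b) / 2

def uniform_variance(a, b):
    return (b - a) ** 2 / 12

a, b = 1, 6
print(uniform_pdf(3, a, b))     # 0.2
print(uniform_pdf(7, a, b))     # 0.0, because 7 is outside [1, 6]
print(uniform_mean(a, b))       # 3.5
print(uniform_variance(a, b))   # 25/12, about 2.083
```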
Learn more in Chapter 10 | Back to Top
Control Groups#
The group that receives no treatment (or only a placebo) in an experimental study is called the control group.
Learn more in Chapter 8 | Back to Top
CSV#
.csv is a file format (Comma-Separated Values) used to store tabular data in plain text, with each row as a line and columns separated by commas.
Learn more in Chapter 5 | Back to Top
D#
DataFrame#
A two-dimensional, labeled data structure from the pandas library, similar to a table or spreadsheet. Rows and columns can have labels, and each column can hold a different data type.
Learn more in Chapter 3 | Back to Top
Deep Copy#
A fully independent copy of an object and all objects nested within it. Changes to the original will not affect the deep copy.
import copy
original = [[1, 2], [3, 4]]
deep = copy.deepcopy(original)
original[0][0] = 99
print(deep) # [[1, 2], [3, 4]]
Learn more in Chapter 3 | Back to Top
Default Value#
The value an optional argument takes if no value is provided.
def greet(name="world"):
    print("Hello", name)
Learn more in Chapter 2 | Back to Top
Deterministic Sampling#
A type of sampling in which the selection of a sample is fixed or predictable, as opposed to involving randomness.
Learn more in Chapter 10 | Back to Top
Dictionary#
A collection of key–value pairs in Python. Keys must be unique and immutable.
Defined with curly braces and colons, e.g. {"a": 1, "b": 2}.
Learn more in Chapter 3 | Back to Top
Dimensions Of A Dataframe#
The number of rows \(\times\) the number of columns
Learn more in Chapter 5 | Back to Top
Discrete Random Variable#
A random variable containing finite or countably infinite elements in its sample space. The sample space of a discrete random variable is a set of distinct possible outcomes.
Learn more in Chapter 10 | Back to Top
Discrete Uniform Distribution#
A probability distribution that assigns equal (uniform) probability to all outcomes in a discrete sample space. For example, rolling a fair six-sided die is a discrete uniform distribution, as the theoretical probability of rolling each side is \(\frac{1}{6}\).
For a sample space containing \(n\) elements, the Probability Mass Function (PMF) is defined by:
\(P(X=x)=\frac{1}{n}\) for all \(x\) in the sample space \(S\) (0 otherwise).
Further, if \(E\) is an event containing multiple elements from the sample space, then:
\(P(E)=\frac{\text{Number of elements in E}}{n}\)
Learn more in Chapter 10 | Back to Top
Disjoint Events#
Events \(A\) and \(B\) are disjoint (or mutually exclusive) if they have no outcomes in common.
Learn more in Chapter 9 | Back to Top
Distribution#
The way values of a variable are spread out. For example, a distribution might be normal (bell-shaped), skewed, or uniform.
Learn more in Chapter 7 | Back to Top
Distributions#
Possible outcomes for a random event and their probabilities.
Learn more in Chapter 10 | Back to Top
Docstring#
A special string inside a function, class, or module that describes what it does. Typically written inside triple quotes.
Learn more in Chapter 2 | Back to Top
Dot Operator#
The . used in Python to access attributes and methods of objects.
my_list = [1, 2, 3]
my_list.append(4) # uses the dot operator
Learn more in Chapter 3 | Back to Top
E#
Elementwise (Calculations)#
Operations that are applied independently to each element of a collection (like a numpy array).
import numpy as np
arr = np.array([1, 2, 3])
print(arr + 5) # [6 7 8]
Learn more in Chapter 3 | Back to Top
Empirical Probability Distributions#
The observed distribution that may come from multiple samples or repeated experiments.
Learn more in Chapter 10 | Back to Top
Escape Sequence#
A special character combination used inside strings to represent things like newlines (\n) or tabs (\t).
"Line 1\nLine 2"
Learn more in Chapter 2 | Back to Top
Estimation#
A term used to describe learning about a population characteristic from a sample.
Learn more in Chapter 12 | Back to Top
Event#
A set of outcomes of a random phenomenon.
Learn more in Chapter 9 | Back to Top
Expected Value#
A common measure of the center of a random variable, denoted by \(\mu(X)\) or \(E(X)\). This describes the average value of the random variable over its sample space. For a discrete distribution, this corresponds to a weighted average, with the given probabilities as weights. For example, given elements \(x_1, x_2, ..., x_n\) in a sample space with associated probabilities \(p_1, p_2, ..., p_n\), we find the mean by computing
\(\mu = \sum_{i=1}^n x_i p_i\).
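As a small illustration of the weighted average, the sketch below computes the expected value of one roll of a fair six-sided die. The die is an assumed example; Fraction is used only to keep the arithmetic exact.

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]
probabilities = [Fraction(1, 6)] * 6   # each face is equally likely

# Weighted average: each outcome times its probability, summed
expected_value = sum(x * p for x, p in zip(outcomes, probabilities))
print(expected_value)         # 7/2
print(float(expected_value))  # 3.5
```

Note that 3.5 is not itself a possible roll; the expected value describes the long-run average, not a typical single outcome.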
Learn more in Chapter 10 | Back to Top
Experimental Studies#
A study in which researchers deliberately impose a treatment on a specific population in order to observe the resulting outcomes and estimate the treatment's causal effect on them is called an experimental study.
Learn more in Chapter 8 | Back to Top
Expression#
Any piece of code that Python can evaluate to a value. Example: 2 + 3, "hi" * 2.
Learn more in Chapter 2 | Back to Top
F#
Feature#
A feature is a column in a DataFrame that represents a variable associated with each datapoint (row).
Learn more in Chapter 5 | Back to Top
Flatten (An Array)#
The process of converting a multi-dimensional array into a one-dimensional array. In numpy, this can be done with the .flatten() method, which returns a copy of the data in a single dimension.
Learn more in Chapter 3 | Back to Top
Float#
A number with a decimal point. Example: 3.14, -0.001.
Learn more in Chapter 2 | Back to Top
Floor Division#
Division where the result is rounded down to the nearest whole number.
7 // 3 # returns 2 because 7 divided by 3 is about 2.33, which is rounded down to 2
Learn more in Chapter 2 | Back to Top
Function (Built-In, User-Defined)#
A reusable block of code that performs a task.
Built-in function: provided by Python (e.g., len()).
User-defined function: written by the programmer with def.
Learn more in Chapter 2 | Back to Top
Function Body#
The indented code block inside a function that runs when the function is called.
Learn more in Chapter 2 | Back to Top
Function Definition#
The part of the code where a function is created, starting with the def keyword.
Learn more in Chapter 2 | Back to Top
G#
Glossary#
Learn more in Chapter 4 | Back to Top
H#
HARKing#
“Hypothesizing After the Results are Known.”
Presenting a hypothesis as if it was decided before the data analysis, when in fact it was created after seeing the results. This undermines the credibility of research, since hypotheses should guide analysis, not be invented from it.
Example: noticing a surprising relationship in the data and then writing the paper as if that relationship was the original hypothesis.
Learn more in Chapter 13 | Back to Top
Histogram#
A graph that shows the distribution of numerical data by grouping values into bins (intervals) and counting how many fall into each.
Learn more in Chapter 7 | Back to Top
Homogeneous#
Homogeneous means that all elements in a collection share the same data type.
Learn more in Chapter 5 | Back to Top
Hypothesis#
In statistics, a hypothesis is defined as a null or alternative view of how data were generated from a model.
Learn more in Chapter 11 | Back to Top
I#
Immutable Data Types#
Data types whose values cannot be changed after they are created. Examples: integers, floats, strings, tuples.
Learn more in Chapter 2 | Back to Top
Independent Events#
Independent events are two or more events where the occurrence of one does not affect the probability of the others.
Learn more in Chapter 9 | Back to Top
Index#
A numeric or labeled position used to access elements in a sequence (like a list or array) or rows/columns in a DataFrame. Python uses 0-based indexing.
Learn more in Chapter 3 | Back to Top
Integer#
A whole number (positive, negative, or zero) without a decimal point. Example: -5, 0, 42.
Learn more in Chapter 2 | Back to Top
Interpreted Language#
A programming language where the code is executed line by line by an interpreter, rather than being compiled into machine code all at once.
Python is an interpreted language, which means you can run code directly in the console or a notebook and see results immediately. This makes it easier to test and experiment, but it can also be slower than compiled languages like C or Java.
Learn more in Chapter 2 | Back to Top
Interquartile Range#
The range between the first quartile (Q1) and third quartile (Q3). Represents the middle 50% of the data.
\(IQR = Q3 - Q1\)
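A minimal sketch of the calculation using the standard library is shown below. The data are made up, and the choice of method="inclusive" is an assumption: different quartile conventions can give slightly different values of Q1 and Q3.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (the median), and Q3
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
print(q1, q3, iqr)  # 3.0 7.0 4.0
```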
Learn more in Chapter 7 | Back to Top
Intersection Of Events#
Intersection of events \(A\) and \(B\) is the set of all outcomes that are in \(A\) and in \(B\).
Learn more in Chapter 9 | Back to Top
L#
Law Of Large Numbers#
As the number of experiments increases, the mean of the empirical distribution gets closer to the mean of the probability distribution (also known as the expected value).
Learn more in Chapter 10 | Back to Top
Lexicographic#
Ordering of strings based on the alphabetical order of characters, using their underlying ASCII or Unicode values.
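A few comparisons illustrate how this ordering works in Python. The example strings are arbitrary; the last comparison shows that uppercase letters sort before lowercase ones because their code points are smaller.

```python
print("apple" < "banana")    # True: "a" comes before "b"
print("apple" < "apricot")   # True: the first difference is "p" versus "r"
print("Zebra" < "apple")     # True: "Z" has a smaller code point than "a"
print(ord("Z"), ord("a"))    # 90 97
print(sorted(["pear", "Apple", "fig"]))  # ['Apple', 'fig', 'pear']
```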
Learn more in Chapter 2 | Back to Top
Library#
A collection of modules that provide useful tools for programming (e.g., NumPy, pandas).
Learn more in Chapter 2 | Back to Top
Line Graph#
A graph where data points are connected by lines, often used to show change over time.
Learn more in Chapter 7 | Back to Top
List#
An ordered, mutable collection in Python that can hold elements of different types.
Defined with square brackets, e.g. [1, 2, 3].
Learn more in Chapter 3 | Back to Top
M#
Mean#
The arithmetic average.
\(\text{Mean} = \frac{\text{Sum of all values}}{\text{Number of values}}\)
Learn more in Chapter 7 | Back to Top
Median#
The middle value when data is ordered. If there is an even number of values, the median is the average of the two middle values.
Learn more in Chapter 7 | Back to Top
Method#
A function that is attached to an object and called with dot notation.
"hello".upper()
Learn more in Chapter 2 | Back to Top
Model#
A set of rules that describe how data are generated.
Learn more in Chapter 11 | Back to Top
Module#
A single Python file that contains functions, classes, and variables related to a specific task (e.g., the math module).
Learn more in Chapter 2 | Back to Top
Modulo / Mod / Modulus#
The modulo operation (% in Python) finds the remainder after division of one number by another.
7 % 3 # returns 1 because 7 divided by 3 is 2 with a remainder of 1
Learn more in Chapter 2 | Back to Top
Multiplication Rule#
Learn more in Chapter 9 | Back to Top
Multistage Sampling#
Multistage sampling is a method that involves a series of sampling steps to select a random sample from a population by breaking it down into smaller and smaller groups.
Learn more in Chapter 8 | Back to Top
Mutable Data Types#
Data types whose values can be changed after creation. Examples: lists, dictionaries, sets.
Learn more in Chapter 2 | Back to Top
Mutation#
A change to a mutable object (such as a list or dictionary) after it has been created. Mutating an object modifies it in place.
Learn more in Chapter 3 | Back to Top
Mutually Exclusive Events#
Events \(A\) and \(B\) are mutually exclusive (or disjoint) if they have no outcomes in common.
Learn more in Chapter 9 | Back to Top
N#
Nominal Data#
Categorical data with no inherent order.
Example: red, green, blue; or dog, cat, bird.
Learn more in Chapter 7 | Back to Top
Non Response Bias#
Non response bias appears when some participants in a self-reported survey or study decline to respond to questions and they are also meaningfully different from the subset of the population whose responses are available.
Learn more in Chapter 8 | Back to Top
Normal Approximation To The Binomial Distribution#
When the number of trials, n, is sufficiently large, we can approximate a binomial distribution using a normal distribution. The number of trials, n, is deemed sufficiently large when we can satisfy the following conditions:
Is \(np \geq 5\)?
Is \(n(1-p) \geq 5\)?
If both conditions are met, you can approximate the binomial distribution with a normal distribution with mean:
\(\mu = np\)
and standard deviation:
\(\sigma = \sqrt{np(1-p)}\)
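The two checks and the resulting parameters can be sketched as a small helper. The function name normal_approximation and the example values of \(n\) and \(p\) are assumptions chosen for illustration.

```python
from math import sqrt

def normal_approximation(n, p):
    """Return (mu, sigma) if both conditions hold, otherwise None."""
    if n * p >= 5 and n * (1 - p) >= 5:
        return n * p, sqrt(n * p * (1 - p))
    return None

print(normal_approximation(100, 0.3))  # mean 30, standard deviation about 4.58
print(normal_approximation(10, 0.1))   # None, because np = 1 is less than 5
```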
Learn more in Chapter 10 | Back to Top
Normal Distribution#
A continuous distribution, often nicknamed the “bell-curve”, owing to its symmetric and bell-shaped characteristics. Due to the symmetry of the normal distribution, the three measures of center (mode, median, and mean) are exactly the same. The normal distribution is defined entirely in terms of its mean and standard deviation. Notationally, given a random variable \(X\) that is normally distributed, we can say \(X \sim N(\mu,\sigma)\), where \(\mu\) and \(\sigma\) are the mean and standard deviation of the distribution, respectively.
Learn more in Chapter 10 | Back to Top
Null Hypothesis#
The default view (or hypothesis) that is generally believed to be true.
Learn more in Chapter 11 | Back to Top
Numerical Data#
Data represented by numbers. Can be discrete (counts) or continuous (measurements). Examples: age, height, test scores.
Learn more in Chapter 7 | Back to Top
O#
Object#
A piece of data in Python that has a type and associated methods. Everything in Python is an object.
Learn more in Chapter 2 | Back to Top
Observational Studies#
In observational studies, researchers work with a dataset of observations that passively reflect the state of a subset of the population with respect to the specific outcome being studied. Observational data and inferential statistics can answer many interesting questions, but they cannot by themselves establish causal relationships between independent variables and the outcome.
Learn more in Chapter 8 | Back to Top
Optional Argument#
A function argument that you don’t have to provide when calling the function.
Learn more in Chapter 2 | Back to Top
Ordinal Data#
A type of categorical data where the categories have a natural order or ranking, but differences between ranks are not necessarily meaningful.
Example: small, medium, large.
Learn more in Chapter 7 | Back to Top
Outlier#
A value much higher or lower than the rest of the data. Outliers can strongly influence statistics like the mean.
Learn more in Chapter 7 | Back to Top
Output#
The result a program or function produces.
Learn more in Chapter 2 | Back to Top
P#
P-Hacking#
Manipulating data analysis until statistically significant results (p < 0.05) appear, even if those results are due to chance. This can involve trying many different tests, stopping data collection at just the right point, or selectively reporting results.
Example: running 20 different tests and only publishing the one that happened to be significant.
Learn more in Chapter 13 | Back to Top
P-Value#
The chance, under the null hypothesis, that the test statistic is equal to the observed value or is further in the direction of the alternative hypothesis.
Learn more in Chapter 11 | Back to Top
Parameter#
A numerical characteristic of a population, denoted by \(\theta\).
Learn more in Chapter 12 | Back to Top
Percentile Bootstrap Confidence Interval#
A method for calculating a confidence interval by bootstrapping from the original sample. The statistic of interest (say, the mean) is calculated from each bootstrapped sample. The distribution of sample means is then used to calculate the empirical percentiles of the lower and upper bound of the desired confidence interval.
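The procedure can be sketched with the standard library alone. Everything specific here is an assumption made for illustration: the data are made up, the function name is hypothetical, and the way the percentile positions are turned into list indices is one simple convention among several.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

sample = [12, 15, 9, 20, 14, 17, 11, 16, 13, 18]  # made-up data

def percentile_bootstrap_ci(data, stat=statistics.mean,
                            num_resamples=5000, confidence=0.95):
    """Percentile bootstrap confidence interval for stat(data)."""
    estimates = sorted(
        stat(random.choices(data, k=len(data)))  # resample with replacement
        for _ in range(num_resamples)
    )
    alpha = 1 - confidence
    lower_index = int((alpha / 2) * num_resamples)
    upper_index = int((1 - alpha / 2) * num_resamples) - 1
    return estimates[lower_index], estimates[upper_index]

lower, upper = percentile_bootstrap_ci(sample)
print(lower, upper)
```

The interval should contain the sample mean of 14.5; its exact endpoints depend on the seed and the number of resamples.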
Learn more in Chapter 12 | Back to Top
Permutation Test#
A procedure to shuffle data to approximate the sampling distribution of a test statistic.
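As a rough sketch, a permutation test for a difference in group means might look like the following. The data, the two-sided test statistic, and the number of shuffles are all illustrative assumptions rather than the chapter's own example.

```python
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable

group_a = [23, 25, 28, 30, 27, 26]  # made-up data
group_b = [20, 22, 24, 21, 25, 23]

def permutation_test(a, b, num_shuffles=5000):
    """Approximate two-sided p-value for the difference in means."""
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = a + b  # new list, so the original groups are not modified
    count = 0
    for _ in range(num_shuffles):
        random.shuffle(pooled)  # break any link between group label and value
        new_a, new_b = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.mean(new_a) - statistics.mean(new_b)) >= observed:
            count += 1
    return count / num_shuffles

p_value = permutation_test(group_a, group_b)
print(p_value)
```

The returned proportion estimates how often a shuffled difference is at least as large as the observed one, which is the p-value under the assumption that group labels do not matter.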
Learn more in Chapter 11 | Back to Top
Pie Graph#
A circular chart divided into slices, where each slice represents a proportion of the whole.
Learn more in Chapter 7 | Back to Top
Population#
A population is the entire group of subjects upon which an experimental or observational study is conducted.
Learn more in Chapter 8 | Back to Top
Probabilistic Sampling#
A type of sampling where the probability of each unit being chosen is known before sampling is done. Simple random samples (SRS) are an example of probabilistic sampling.
Learn more in Chapter 10 | Back to Top
Probability#
Probability is a numerical measure of how likely an event is to occur, ranging from 0 (impossible) to 1 (certain).
Learn more in Chapter 9 | Back to Top
Probability Density Function (PDF)#
A function to compute probabilities for continuous random variables. Unlike the discrete case, where all probabilities in a sample space will sum to 1, in the continuous case, this corresponds to an area of 1 under the curve of the probability density function.
Learn more in Chapter 10 | Back to Top
Probability Distributions#
The theoretical likelihood of each possible outcome of a random event. Probability distributions can be studied and understood without collecting any sample or conducting an experiment.
Learn more in Chapter 10 | Back to Top
Probability Mass Function (PMF)#
A function that assigns the probability of each possible outcome of a random variable for discrete random variables.
The PMF is usually denoted \(P(X=x)\) where \(X\) is a random variable and \(x\) is the outcome of an event.
All probabilities, \(P(X=x)\), must satisfy the following criteria:
the probability of each element occurring is greater than or equal to 0
the sum of all probabilities of elements in the sample space equals 1
Learn more in Chapter 10 | Back to Top
Q#
Quartile#
One of four equal parts of ordered data. Q1 is the 25th percentile, Q2 is the median (50th percentile), Q3 is the 75th percentile.
Learn more in Chapter 7 | Back to Top
R#
Random Phenomenon#
A phenomenon where individual outcomes are uncertain.
Learn more in Chapter 9 | Back to Top
Random Variable#
A numerical quantity representing an outcome of an event, often denoted by uppercase letters \(X\) or \(Y\).
Learn more in Chapter 10 | Back to Top
Randomization#
Randomization is the process of assigning experimental subjects randomly to different groups such that the groups are comparable in every aspect other than whether or not they receive the treatment. This is done to estimate the causal effect of the treatment on the specific outcome in question.
Learn more in Chapter 8 | Back to Top
Randomized Response#
Randomized response is a technique to conduct statistical research. This process allows researchers to gather sensitive information in a manner that introduces randomness to the survey response process, thereby preserving respondent privacy.
Learn more in Chapter 8 | Back to Top
RangeIndex#
RangeIndex is the default integer-based index in a pandas DataFrame or Series, automatically assigned when no custom index is provided.
Learn more in Chapter 5 | Back to Top
Relative File Path#
Starts from the folder where your notebook or script is located and specifies the file’s location relative to that folder. For example, ../data.csv.
Learn more in Chapter 5 | Back to Top
Reserved Words#
Special words in Python that should never be used as variable names because they have specific meanings (e.g., if, for, while).
Learn more in Chapter 2 | Back to Top
Response Bias#
Response bias appears when participants in self-reported surveys or studies respond to questions in an inaccurate manner rather than with the truth.
Learn more in Chapter 8 | Back to Top
S#
Sample Mean#
Measure of center for an empirical distribution. Defined as:
\(\bar{x} = \frac{\Sigma x_i}{n}\)
Learn more in Chapter 10 | Back to Top
Sample Space#
The set of all possible outcomes of a random phenomenon.
Learn more in Chapter 9 | Back to Top
Sample Standard Deviation#
The square root of the sample variance, denoted by \(s\)
Learn more in Chapter 10 | Back to Top
Sample Variance#
Measurement of the spread for an empirical distribution. Defined as:
\(s^2 = \frac{\Sigma (x_i - \bar{x})^2}{n-1}\)
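The formula can be checked by hand against the standard library. The data below are made up; statistics.variance uses the same \(n-1\) denominator.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
x_bar = sum(data) / n  # sample mean

# Sample variance divides by n - 1 rather than n
s_squared = sum((x - x_bar) ** 2 for x in data) / (n - 1)
s = s_squared ** 0.5   # sample standard deviation

print(s_squared)                  # 32/7, about 4.571
print(statistics.variance(data))  # the standard library agrees up to rounding
```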
Learn more in Chapter 10 | Back to Top
Sampling#
The process of selecting a representative subset from the population of a study for analyzing specific phenomena.
Learn more in Chapter 8 | Back to Top
Scatter Plot#
A graph of points showing the relationship between two numerical variables. Each point represents one observation.
Learn more in Chapter 7 | Back to Top
Scope#
Where a variable is visible in code.
Variables defined inside a function have local scope.
Variables defined outside functions have global scope.
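A short sketch of the difference is shown below. The names message and show are arbitrary; the point is that assigning to a name inside a function creates a separate local variable.

```python
message = "global"  # defined outside any function: global scope

def show():
    message = "local"  # a new variable that exists only inside show()
    return message

print(show())    # local
print(message)   # global, because the outer variable was never changed
```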
Learn more in Chapter 2 | Back to Top
Selection Bias#
Selection bias occurs when some groups within a population are disproportionately represented in a sample over the rest thereby leading to misleading generalizations.
Learn more in Chapter 8 | Back to Top
Set#
An unordered collection of unique elements in Python.
Defined with curly braces, e.g. {1, 2, 3}.
Learn more in Chapter 3 | Back to Top
Shallow Copy#
A new object that contains references to the elements of the original object rather than fully duplicating them. Nested objects remain shared.
original = [[1, 2], [3, 4]]
shallow = original.copy()
shallow[0][0] = 99
print(original) # [[99, 2], [3, 4]]
Learn more in Chapter 3 | Back to Top
Significance Level#
A cut-off value, \(\alpha\), that is used as a threshold for deciding whether to reject the null hypothesis \(H_0\). A commonly used significance level is \(\alpha=0.05\). If the p-value is smaller than \(\alpha\), we reject \(H_0\); otherwise, we fail to reject it.
Learn more in Chapter 11 | Back to Top
Simple Random Sampling#
Simple Random Sampling is a technique to randomly select samples from a population of interest such that each member is equally likely to be included in the sample.
Learn more in Chapter 8 | Back to Top
Slice#
A portion of a sequence selected by specifying a start, stop, and step.
my_list = [0, 1, 2, 3, 4, 5]
print(my_list[1:4:2]) # [1, 3]
Learn more in Chapter 3 | Back to Top
Spread#
A summary statistic of the data that measures how the data is dispersed. Measures of spread include variance and standard deviation.
Learn more in Chapter 10 | Back to Top
Standard Deviation#
The square root of the variance symbolized by \(\sigma(X)\). The standard deviation is a measure of how far each element is from the mean. In other words, the standard deviation is the average measure of how far all our data is from the mean.
Learn more in Chapter 10 | Back to Top
Standard Normal Distribution#
A special case of the normal distribution that occurs when \(\mu = 0\) and \(\sigma = 1\). Given a random variable \(X \sim N(\mu, \sigma)\) with any values of \(\mu\) and \(\sigma\), we can transform it to a standard normal random variable by standardizing it:
\[Z = \frac{X - \mu}{\sigma}\]
This is useful when comparing values from multiple normal distributions.
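Standardizing a value means subtracting the mean and dividing by the standard deviation. The sketch below uses a hypothetical helper, `standardize`, and made-up exam scores to compare values from two different normal distributions.

```python
def standardize(x, mu, sigma):
    """Return how many standard deviations x lies from the mean mu."""
    return (x - mu) / sigma

# Hypothetical scores from two exams graded on different scales
z_exam_a = standardize(85, mu=70, sigma=10)
z_exam_b = standardize(60, mu=50, sigma=5)

print(z_exam_a, z_exam_b)   # 1.5 2.0
```

On the standardized scale, the second score is further above its mean than the first, even though its raw value is lower.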
Learn more in Chapter 10 | Back to Top
Stratified Random Sampling#
Stratified Random Sampling involves dividing the population into homogeneous subgroups, or strata, and then forming the sample by drawing a simple random sample from within each stratum.
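A minimal sketch, assuming a hypothetical population already grouped into strata by class year and a 10% sample drawn from each stratum.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population split into strata by class year
strata = {
    "freshman": [f"F{i}" for i in range(40)],
    "sophomore": [f"S{i}" for i in range(30)],
    "junior": [f"J{i}" for i in range(30)],
}

# Draw a simple random sample of 10% from within each stratum
sample = []
for name, members in strata.items():
    k = len(members) // 10
    sample.extend(random.sample(members, k))

print(len(sample))   # 10
```

Sampling the same fraction from each stratum keeps every subgroup represented in proportion to its size.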
Learn more in Chapter 8 | Back to Top
String#
A sequence of characters, such as letters, numbers, or symbols, enclosed in quotation marks. Example: "hello".
Learn more in Chapter 2 | Back to Top
Survivorship Bias#
Survivorship bias appears with the overrepresentation of responses from participants who remained (or ‘survived’) within the purview of a study or survey long-term while ignoring those who dropped out.
Learn more in Chapter 8 | Back to Top
Syntax#
The rules that define how Python code must be written so that the computer can understand it. Incorrect syntax usually causes an error.
Learn more in Chapter 2 | Back to Top
T#
Test Statistic#
A summary measure of the data that we use to investigate the consistency of a model. The choice of a test statistic depends on the specific hypotheses we are investigating. For example, the difference in means, medians, or standard deviations could all be used as test statistics.
Learn more in Chapter 11 | Back to Top
Total Variation Distance (TVD)#
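Half the sum of absolute differences in proportions between two categorical distributions:
\[TVD = \frac{1}{2} \sum_i |p_i - q_i|\]
In the above formula, the \(p_i\)’s are the proportions of subjects in the various categories in one sample, while the \(q_i\)’s are the proportions in the second sample.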
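A minimal sketch of the computation, assuming the common convention of halving the sum so that the distance lies between 0 and 1. The function name `total_variation_distance` and the proportions are hypothetical.

```python
def total_variation_distance(p, q):
    """Half the sum of absolute differences between two lists of proportions."""
    return sum(abs(p_i - q_i) for p_i, q_i in zip(p, q)) / 2

p = [0.5, 0.3, 0.2]   # made-up category proportions in the first sample
q = [0.4, 0.4, 0.2]   # made-up category proportions in the second sample

tvd = total_variation_distance(p, q)
print(round(tvd, 2))  # 0.1
```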
Learn more in Chapter 11 | Back to Top
Tuple#
An ordered, immutable collection in Python. Once created, its elements cannot be changed.
Defined with parentheses, e.g. (1, 2, 3).
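The sketch below shows what "immutable" means in practice: reading an element works, but assigning to one raises a `TypeError`. The variable name `point` is only illustrative.

```python
point = (1, 2, 3)
print(point[0])   # 1

try:
    point[0] = 99   # tuples do not support item assignment
except TypeError:
    print("tuples cannot be modified in place")

print(point)      # (1, 2, 3)
```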
Learn more in Chapter 3 | Back to Top
Two Sample Test#
A hypothesis test performed to investigate whether there is a difference between two random samples or groups. This is also called A/B testing as we can refer to our two groups as Group A and Group B.
Learn more in Chapter 11 | Back to Top
Type 1 Error#
Rejection of the null hypothesis \(H_0\) when it is true. The type 1 error rate is equal to the significance level (\(\alpha\)).
Learn more in Chapter 11 | Back to Top
Type 2 Error#
Failure to reject the null hypothesis \(H_0\) when it is false.
Learn more in Chapter 11 | Back to Top
Type-Casting#
Converting a value from one data type to another.
print(float(3)) # 3.0
print(int("7")) # 7
Learn more in Chapter 3 | Back to Top
U#
Undercoverage Bias#
Undercoverage bias appears when drawing participants from an incomplete sampling frame that leaves out significant sections of the population of interest.
Learn more in Chapter 8 | Back to Top
Uniform Distribution#
A distribution where all events in a given sample space are equally likely to occur. Examples include the distribution of possible outcomes when tossing a fair coin, rolling a fair die, or using a random number generator. Uniform distributions can describe both discrete and continuous random variables.
Learn more in Chapter 10 | Back to Top
Union Of Events#
The union of events \(A\) and \(B\), written \(A \cup B\), is the set of all outcomes in \(A\), or in \(B\), or in both.
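Python sets give a direct way to illustrate a union of events. The two events below, defined on the roll of one six-sided die, are made up for this example.

```python
# Outcomes of rolling one six-sided die
A = {2, 4, 6}      # event: the roll is even
B = {4, 5, 6}      # event: the roll is greater than 3

union = A | B      # outcomes in A, or in B, or in both
print(sorted(union))   # [2, 4, 5, 6]
```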
Learn more in Chapter 9 | Back to Top
Unpacking (A Tuple)#
Assigning elements of a tuple (or other iterable) to individual variables in a single statement.
x, y = (1, 2)
print(x) # 1
print(y) # 2
Learn more in Chapter 3 | Back to Top
V#
Variable / Variable Name#
A name that refers to a value stored in memory. Example: x = 10.
Learn more in Chapter 2 | Back to Top
Variance#
A measure of the spread of data, symbolized by \(\sigma^2(X)=Var(X)\). It is the average squared distance of the data from the mean and describes how the data is dispersed.
Learn more in Chapter 10 | Back to Top
#
0-Indexed#
The convention that counting positions in a sequence starts at 0 rather than 1. For example, in [10, 20, 30], the first element 10 is at index 0.
Learn more in Chapter 3 | Back to Top