Introduction to Data Science I & II
Prediction with a large number of features
Forthcoming…