5. DataFrames

Jesse London and Kriti Sehgal

The DataFrame is a data structure provided by pandas, a Python library for data analysis and manipulation, and it is widely used in data science.

DataFrames are two-dimensional collections of data: one dimension runs along the rows and the other along the columns. You can think of them as tables, similar to Excel spreadsheets. The shape of a DataFrame is its number of rows and columns. Each row represents a data point (an observation or record), and each column holds the values of one feature or variable across those data points. Each column has a name (the feature name), and each row has an index label, which can be a number or a custom label. In the example DataFrame below, the rows have index 0, 1, 2, … and the column names are Name, Age, Job, and Department.

   Name      Age  Job             Department
0  Amina     40   Professor       History
1  Hiroshi   35   Researcher      Physics
2  Gabriela  32   Lecturer        Biology
3  Elena     28   Lab Technician  Chemistry
4  Arjun     32   Postdoc         CS
5  Maria     38   Administrator   Admin
6  Wei       29   Data Analyst    Statistics
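
As a minimal sketch, the example table could be built from a Python dictionary that maps each column name to a list of values. The variable name `staff` is an illustrative choice, not something defined in this chapter, and this is only one of several ways to create a DataFrame.

```python
import pandas as pd

# Each dictionary key becomes a column name; each list holds that
# column's values, one entry per row.
staff = pd.DataFrame({
    "Name": ["Amina", "Hiroshi", "Gabriela", "Elena", "Arjun", "Maria", "Wei"],
    "Age": [40, 35, 32, 28, 32, 38, 29],
    "Job": ["Professor", "Researcher", "Lecturer", "Lab Technician",
            "Postdoc", "Administrator", "Data Analyst"],
    "Department": ["History", "Physics", "Biology", "Chemistry",
                   "CS", "Admin", "Statistics"],
})

print(staff.shape)          # (7, 4): 7 rows and 4 columns
print(list(staff.columns))  # ['Name', 'Age', 'Job', 'Department']
print(list(staff.index))    # [0, 1, 2, 3, 4, 5, 6]
```

Because no index was given, pandas assigns the default integer labels 0 through 6 to the rows.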

Note

An interesting aspect of pandas DataFrames is that different columns can store different types of data. For example, in the DataFrame above, the column Age contains numeric values, while the column Job contains strings. This is a key distinction between numpy arrays and pandas DataFrames. Recall from section 4.3 that numpy arrays are homogeneous, meaning all elements must have the same data type.
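
The contrast can be checked with a short sketch. The names `staff` and `mixed` are illustrative, and the sketch uses the helper `pandas.api.types.is_numeric_dtype` to avoid depending on how a particular pandas version labels string columns.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# A small DataFrame with one numeric column and one string column.
staff = pd.DataFrame({
    "Age": [40, 35, 32],
    "Job": ["Professor", "Researcher", "Lecturer"],
})

# Each column keeps its own data type.
print(is_numeric_dtype(staff["Age"]))  # True
print(is_numeric_dtype(staff["Job"]))  # False

# A numpy array must be homogeneous, so mixing a number with a string
# converts every element to a string.
mixed = np.array([40, "Professor"])
print(mixed.dtype.kind)  # 'U', i.e. a Unicode string type
print(mixed[0] == "40")  # True: the integer 40 was stored as the text '40'
```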

Another important difference is in indexing. In numpy arrays, all axes or dimensions are indexed with integers, whereas in pandas DataFrames, both row and column indexes can be custom labels, including integers, strings, or other types, as we will see in the next section.
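
A brief sketch of the difference, assuming we choose the names as row labels (the variable names `ages` and `staff` are illustrative):

```python
import numpy as np
import pandas as pd

# numpy: every axis is addressed by integer position.
ages = np.array([[40, 35], [32, 28]])
print(ages[1, 0])  # 32 (second row, first column)

# pandas: rows and columns can both carry custom labels.
staff = pd.DataFrame(
    {"Age": [40, 35, 32], "Department": ["History", "Physics", "Biology"]},
    index=["Amina", "Hiroshi", "Gabriela"],
)
print(staff.loc["Hiroshi", "Age"])  # 35 (row "Hiroshi", column "Age")
```

Here `.loc` selects by label; the indexing tools themselves are covered in the next section.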

numpy arrays and pandas DataFrames are optimized for different tasks. numpy arrays are ideal for numerical computations, linear algebra, and fast element-wise operations, while pandas DataFrames are specifically designed for analyzing, manipulating, and working efficiently with tabular data.

The following sections explore the basics of using the pandas DataFrame.