5.1. Creating a DataFrame#
The library#
Similar to how arrays are not built into Python and are provided through the numpy library, DataFrames are offered by the NumPy-based pandas library.
And so, first, we need to make sure that the pandas library is installed.
Only then can we tell Python to make the pandas library available to our code, using the import statement. For example:
import pandas
Once this is done, the DataFrame type becomes available as pandas.DataFrame. In other words, we can access it “under” the name pandas, with a dot between the two names.
It is a convention to import pandas as pd to save the keystrokes in writing, i.e.
import pandas as pd
This way, we will be able to refer to the elements of the pandas library, such as DataFrame, as pd.DataFrame (just like np.array).
There are many ways to construct a DataFrame. In the following, we discuss a few of them.
1. Creating a DataFrame with Manual Input#
1.1 From list of lists#
One of the simplest ways to create a DataFrame is by using Python lists. Consider the following data in the form of a list with each element is a list containing name of a planet and its distance from the sun in million km.
planets_data = [
['Mercury', 57.9],
['Venus', 108.2],
['Earth', 149.6],
['Mars', 227.9],
['Jupiter', 778.6],
['Saturn', 1433.5],
['Uranus', 2872.5],
['Neptune', 4495.1]
]
planets_data
[['Mercury', 57.9],
['Venus', 108.2],
['Earth', 149.6],
['Mars', 227.9],
['Jupiter', 778.6],
['Saturn', 1433.5],
['Uranus', 2872.5],
['Neptune', 4495.1]]
We can construct a DataFrame for the above data and give names to each column by:
planets_df = pd.DataFrame(planets_data, columns=['Planet', 'solar_distance_km_6'])
planets_df
| Planet | solar_distance_km_6 | |
|---|---|---|
| 0 | Mercury | 57.9 |
| 1 | Venus | 108.2 |
| 2 | Earth | 149.6 |
| 3 | Mars | 227.9 |
| 4 | Jupiter | 778.6 |
| 5 | Saturn | 1433.5 |
| 6 | Uranus | 2872.5 |
| 7 | Neptune | 4495.1 |
This presentation of our data already looks more like a spreadsheet!
Now let us consider that we have more information about each planet i.e. we know about its absolute mass (in \(10^{24}\) kg), density (in kg/\(\text{m}^3\)), and gravity (in m/\(\text{s}^2\)) i.e.
planets_features = [
'name', # familiar name
'solar_distance_km_6', # distance from sun: 10**6 km
'mass_kg_24', # absolute mass: 10**24 kg
'density_kg_m3', # density: kg/m**3
'gravity_m_s2', # gravity: m/s**2
]
planets_data = [
['Mercury', 57.9, 0.33, 5427.0, 3.7],
['Venus', 108.2, 4.87, 5243.0, 8.9],
['Earth', 149.6, 5.97, 5514.0, 9.8],
['Mars', 227.9, 0.642, 3933.0, 3.7],
['Jupiter', 778.6, 1898.0, 1326.0, 23.1],
['Saturn', 1433.5, 568.0, 687.0, 9.0],
['Uranus', 2872.5, 86.8, 1271.0, 8.7],
['Neptune', 4495.1, 102.0, 1638.0, 11.0]
]
planets_data
[['Mercury', 57.9, 0.33, 5427.0, 3.7],
['Venus', 108.2, 4.87, 5243.0, 8.9],
['Earth', 149.6, 5.97, 5514.0, 9.8],
['Mars', 227.9, 0.642, 3933.0, 3.7],
['Jupiter', 778.6, 1898.0, 1326.0, 23.1],
['Saturn', 1433.5, 568.0, 687.0, 9.0],
['Uranus', 2872.5, 86.8, 1271.0, 8.7],
['Neptune', 4495.1, 102.0, 1638.0, 11.0]]
We can similarly convert it into pandas DataFrame by:
planets_df = pd.DataFrame(planets_data, columns = planets_features)
planets_df
| name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
|---|---|---|---|---|---|
| 0 | Mercury | 57.9 | 0.330 | 5427.0 | 3.7 |
| 1 | Venus | 108.2 | 4.870 | 5243.0 | 8.9 |
| 2 | Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
| 3 | Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
| 4 | Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
| 5 | Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
| 6 | Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 |
| 7 | Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 |
1.2 From Dictionary of lists#
Consider specifying our data as a dictionary, where each key corresponds to a column name and its associated value is a list containing the data for that column.
planets_dict = {
'name': ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
'solar_distance_km_6': [57.9, 108.2, 149.6, 227.9, 778.6, 1433.5, 2872.5, 4495.1],
'mass_kg_24': [0.33, 4.87, 5.97, 0.642, 1898.0, 568.0, 86.8, 102.0],
'density_kg_m3': [5427.0, 5243.0, 5514.0, 3933.0, 1326.0, 687.0, 1271.0, 1638.0],
'gravity_m_s2': [3.7, 8.9, 9.8, 3.7, 23.1, 9.0, 8.7, 11.0],
}
planets_df = pd.DataFrame(planets_dict)
planets_df
| name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
|---|---|---|---|---|---|
| 0 | Mercury | 57.9 | 0.330 | 5427.0 | 3.7 |
| 1 | Venus | 108.2 | 4.870 | 5243.0 | 8.9 |
| 2 | Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
| 3 | Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
| 4 | Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
| 5 | Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
| 6 | Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 |
| 7 | Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 |
Note
Choosing between a list of lists and a dictionary of lists depends on your data and workflow.
A list of lists is useful when your data is naturally row-oriented, with each inner list representing a complete row. It is a quick way to create a DataFrame from raw data. However, adding a new column using this approach can be cumbersome: you need to enter each value for the new column into every row list individually, which is tedious and prone to errors.
On the other hand, a dictionary of lists is more convenient when you want labeled columns from the start. Adding a new column is straightforward, as you can simply assign a new key-value pair in the dictionary.
In practice, if you anticipate modifying your DataFrame or adding new columns, using a dictionary of lists is generally easier and safer.
2. Creating a DataFrame with Series#
In pandas, a Series is a one-dimensional labeled array that can hold any data type. Think of it as a single column of a DataFrame with an index. Since a DataFrame is essentially a collection of Series that share the same index, you can construct a DataFrame by combining multiple Series.
This approach is especially useful when you already have individual Series objects representing different columns, or when you want to preserve meaningful row labels (indices) for your data instead of using the default integer indices.
In the following examples, we will see how to create a DataFrame by combining multiple Series into a single structured table.
# Creating a `pandas` Series object
planet_names = pd.Series(['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune'])
planet_names
0 Mercury
1 Venus
2 Earth
3 Mars
4 Jupiter
5 Saturn
6 Uranus
7 Neptune
dtype: object
type(planet_names)
pandas.core.series.Series
Let us say we have the following data as pandas Series objects:
planets_solar_distance_km_6 = pd.Series([57.9, 108.2, 149.6, 227.9, 778.6, 1433.5, 2872.5, 4495.1])
planets_mass_kg_24 = pd.Series([0.33, 4.87, 5.97, 0.642, 1898.0, 568.0, 86.8, 102.0])
planets_density_kg_m3 = pd.Series([5427.0, 5243.0, 5514.0, 3933.0, 1326.0, 687.0, 1271.0, 1638.0])
planets_gravity_m_s2 = pd.Series([3.7, 8.9, 9.8, 3.7, 23.1, 9.0, 8.7, 11.0])
We can combine these Series objects to create a DataFrame in the way similar to how we create a DataFrame using dictionaries.
planets_df = pd.DataFrame({
'name': planet_names,
'solar_distance_km_6': planets_solar_distance_km_6,
'mass_kg_24': planets_mass_kg_24,
'density_kg_m3': planets_density_kg_m3,
'gravity_m_s2': planets_gravity_m_s2
})
planets_df
| name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
|---|---|---|---|---|---|
| 0 | Mercury | 57.9 | 0.330 | 5427.0 | 3.7 |
| 1 | Venus | 108.2 | 4.870 | 5243.0 | 8.9 |
| 2 | Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
| 3 | Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
| 4 | Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
| 5 | Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
| 6 | Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 |
| 7 | Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 |
3. Creating a DataFrame from External Files#
So far, we have manually entered small amounts of data to construct DataFrames. While this is useful for learning, real-world datasets are typically much larger and stored in external files. A common scenario in data science is reading data that has already been collected and saved in formats such as .csv, .xlsx, or .txt. Instead of typing the data by hand, we can load it directly into a DataFrame, making it easy to explore, analyze, and manipulate large datasets efficiently. pandas supports many common data encoding formats, and makes it easy to construct DataFrames from them.
Let us understand how to construct a DataFrame using a CSV (comma-separated values) file. Other file formats can be worked with in a similar way.
We use the function pd.read_csv(file_path) to read data from a CSV file into a pandas DataFrame. Here, file_path is a string that specifies the location of the file. It can be an absolute path (the full location on your computer), a relative path (relative to your current working directory), or even a URL pointing to an online dataset.
a) Absolute file path: Starts from the root of your computer and gives the full location of the file.
For example, if there is a CSV file called data.csv in the Documents folder of a computer, its absolute path would be:
Windows:
C:\\Users\\<username>\\Documents\\data.csvMac/Linux:
/Users/<username>/Documents/data.csv
Here,<username>is the placeholder for the actual account name on your computer.
Note
Windows traditionally uses backslashes \ in file paths, while Mac/Linux use forward slashes /. A single slash (/ or \) in a file path means “go into this folder”.
In Python strings, a single backslash \ is treated as an escape character (e.g., \n for newline), so you need to use double backslashes \\ or just use forward slashes / (Python is able to handle it correctly for Windows path as well).
b) Relative file path: Starts from the folder where your notebook (or code) is located.
For example:
Suppose you are working in a notebook called
hw.ipynbwhich is in a folder calledHW. You also have a file calleddata.csvlocated in the sameHWfolder. Its relative path would be:Windows:
data.csvMac/Linux:
data.csv
Notice that we did not need to tell the computer about the common folderHWthat contains both the notebook and the data file, since Python automatically starts from the current working directory (i.e. folder in which your notebook is located).
Now suppose you are working in
hw.ipynbinside theHWfolder, but your filedata.csvis in a subfolder calledDatainsideHW. In this case, you need to tell Python to look in theDatafolder within your current working directory. The relative path would be:Windows:
Data\\data.csvMac/Linux:
Data/data.csv
Again, notice that we did not need to tell the computer about theHWfolder as Python starts from the current folder where the notebook lives.
Finally, suppose your notebook
hw.ipynbis inside theHWfolder, but your filedata.csvis located one level up, in a parent folder calledData118that contains theHWfolder. In this case, you use..to go up one level intoData118and then access the file. The relative path would be:Windows:
..\\data.csvMac/Linux:
../data.csv
Here,..means “go up one directory from the current working directory.” So starting inHW, Python moves up intoData118and findsdata.csvthere. You can also use../consecutively (e.g.,../../) to go up multiple folder levels if your file is located further away in the directory hierarchy.
Suppose your notebook
hw.ipynbis inside theHWfolder, which is in a parent folder calledData118. The filedata.csvis in another subfolder calledDatainsideData118. In this case, starting from the notebook inHW, you need to go up one level toData118and then into theDatafolder to accessdata.csv. The relative path would be:Windows:
..\\Data\\data.csvMac/Linux:
../Data/data.csv
Python will first move fromHWup toData118, then intoDatato finddata.csv.
c). URL: You can also read directly from an online dataset by using pd.read_csv("url").
Data formats#
Of course, it is very common to store data in a file format, such as CSV. pandas supports a great many common data encoding formats, and makes it easy to construct DataFrames from them.
For example, if we had a CSV file in our “Documents” folder, we might construct a DataFrame from it using the pandas read_csv function, like so:
data = pd.read_csv('/Users/MySelf/Documents/my-data.csv')
Above, we simply gave pandas the file system path to our CSV data. The read_csv function also supports file objects, such as those returned by Python’s open function.
Our planetary data, encoded as CSV, takes the following form:
name,solar_distance_km_6,mass_kg_24,density_kg_m3,gravity_m_s2
Mercury,57.9,0.33,5427.0,3.7
Venus,108.2,4.87,5243.0,8.9
Earth,149.6,5.97,5514.0,9.8
Mars,227.9,0.642,3933.0,3.7
Jupiter,778.6,1898.0,1326.0,23.1
Saturn,1433.5,568.0,687.0,9.0
Uranus,2872.5,86.8,1271.0,8.7
Neptune,4495.1,102.0,1638.0,11.0
Note that we’ve included our feature names as the first row of our data. (This is optional – but useful!)
And below we’ll reload our planets DataFrame, similarly to the above – (but from a file buffer of that data, planets_csv, the details of which are hidden below).
Show code cell content
#
# Hello!
#
# This code allows you to download and execute this notebook as-is –
# without a separate CSV file.
#
# We can just *pretend* that `planets_csv` is a path to a file –
# or, more apt, a file object opened with the Python `open` function.
#
# (Really it's an in-memory file object … but that's not important here!)
#
import io
planets_encoded = '''\
name,solar_distance_km_6,mass_kg_24,density_kg_m3,gravity_m_s2
Mercury,57.9,0.33,5427.0,3.7
Venus,108.2,4.87,5243.0,8.9
Earth,149.6,5.97,5514.0,9.8
Mars,227.9,0.642,3933.0,3.7
Jupiter,778.6,1898.0,1326.0,23.1
Saturn,1433.5,568.0,687.0,9.0
Uranus,2872.5,86.8,1271.0,8.7
Neptune,4495.1,102.0,1638.0,11.0
'''
planets_csv = io.StringIO(planets_encoded)
planets = pd.read_csv(planets_csv)
planets
| name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
|---|---|---|---|---|---|
| 0 | Mercury | 57.9 | 0.330 | 5427.0 | 3.7 |
| 1 | Venus | 108.2 | 4.870 | 5243.0 | 8.9 |
| 2 | Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
| 3 | Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
| 4 | Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
| 5 | Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
| 6 | Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 |
| 7 | Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 |
Note that pandas automatically inferred that the first row of our CSV data specified the feature names.
The index#
pandas’s default index – the familiar range of integers starting with 0 – is most often sensible for computational data.
This is represented by the RangeIndex type.
planets.index
RangeIndex(start=0, stop=8, step=1)
Of course, that’s not how we think about the planets!
We can tell pandas to use a more familiar index instead.
pd.RangeIndex(1, 9, name='number')
RangeIndex(start=1, stop=9, step=1, name='number')
pd.DataFrame(planets_data,
columns=planets_features,
index=pd.RangeIndex(1, 9, name='number'))
| name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
|---|---|---|---|---|---|
| number | |||||
| 1 | Mercury | 57.9 | 0.330 | 5427.0 | 3.7 |
| 2 | Venus | 108.2 | 4.870 | 5243.0 | 8.9 |
| 3 | Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
| 4 | Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
| 5 | Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
| 6 | Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
| 7 | Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 |
| 8 | Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 |
We don’t even have to use ranges … or numbers!
ordinals = ['first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh', 'eighth']
planet_ordinals = pd.DataFrame(planets_data,
columns=planets_features,
index=pd.Index(ordinals, name='ordinal'))
planet_ordinals
| name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
|---|---|---|---|---|---|
| ordinal | |||||
| first | Mercury | 57.9 | 0.330 | 5427.0 | 3.7 |
| second | Venus | 108.2 | 4.870 | 5243.0 | 8.9 |
| third | Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
| fourth | Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
| fifth | Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
| sixth | Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
| seventh | Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 |
| eighth | Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 |
But, in the end, perhaps we’d prefer not to count the planets at all.
Whenever a data feature makes sense to use as the data index – that is, it’s sufficient to always uniquely identify individuals, we can just tell pandas to use that column as the index, instead.
We’ll learn more about manipulating DataFrames in subsequent sections. But, for now, here’s how we would set the name feature as our index, (at least when constructing a DataFrame from lists or dicts).
planets.set_index('name')
| solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
|---|---|---|---|---|
| name | ||||
| Mercury | 57.9 | 0.330 | 5427.0 | 3.7 |
| Venus | 108.2 | 4.870 | 5243.0 | 8.9 |
| Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
| Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
| Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
| Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
| Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 |
| Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 |
Show code cell content
#
# Hello!
#
# This cell has been hidden – it's just an implementation concern.
#
# Generally, when working with files, you won't need to worry about this.
#
planets_csv.seek(0)
0
The read_csv function, on the other hand, supports this case specifically.
pd.read_csv(planets_csv, index_col='name')
| solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
|---|---|---|---|---|
| name | ||||
| Mercury | 57.9 | 0.330 | 5427.0 | 3.7 |
| Venus | 108.2 | 4.870 | 5243.0 | 8.9 |
| Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
| Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
| Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
| Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
| Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 |
| Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 |
Now these DataFrames are looking great! Let’s see what we can do with them.
Operations#
As we’ve seen with the list, (and the string), the DataFrame can be manipulated by functions and built-in operators. Moreover, these offer special-purpose functions which have been bound to their types – that is, methods – which are invoked with expressions of the form below:
name_of_dataframe.name_of_method(argument0, argument1, ..., keyword0=value0, ...)
For example, above we used the set_index method to construct a new DataFrame with the name column set as the data index. Here it is again:
planets.set_index('name')
| solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
|---|---|---|---|---|
| name | ||||
| Mercury | 57.9 | 0.330 | 5427.0 | 3.7 |
| Venus | 108.2 | 4.870 | 5243.0 | 8.9 |
| Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
| Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
| Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
| Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
| Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 |
| Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 |
And, similar to methods, there are attributes and properties. These are values which are similarly bound to the DataFrame, but which need not be called:
name_of_dataframe.name_of_property
We made use of the index property above as well, to inspect our DataFrame’s currently-assigned index:
planets.index
RangeIndex(start=0, stop=8, step=1)
pandas offers us many functions, methods and properties to explore!
And now we’re ready to explore the dimensions our data.