Creating a DataFrame#

The library#

Like the NumPy array – and unlike the Python dict and list – the pandas DataFrame is not built into Python.

And so, first, we might have to ensure that the pandas library is installed.

Only then can we tell Python to make the pandas module available to our code, using the import statement. For example:

import pandas

Having done so, the DataFrame type would be available as: pandas.DataFrame.

That is, unlike with the built-in list, we would refer to it as “under” the name pandas, with a dot between the two names.

Or, we could import just DataFrame, such that it’s available as just DataFrame, without the rigmarole:

from pandas import DataFrame

However, we’ll be using pandas a lot! And not just DataFrame. Following a common convention, we’ll tell Python to assign the library module the name pd. This way, we’ll be able to refer to the elements of the pandas interface, such as DataFrame, as e.g.: pd.DataFrame.

import pandas as pd

The data#

We began to consider tabular or two-dimensional data in Lists, with the distances of planets from our sun. Let’s expand on this example with the below data, adding the planets’ masses, densities and gravities.

planets_features = [
    'name',                # familiar name
    'solar_distance_km_6', # distance from sun: 10**6 km
    'mass_kg_24',          # absolute mass: 10**24 kg
    'density_kg_m3',       # density: kg/m**3
    'gravity_m_s2',        # gravity: m/s**2
]

planets_data = [
    ['Mercury', 57.9, 0.33, 5427.0, 3.7],
    ['Venus', 108.2, 4.87, 5243.0, 8.9],
    ['Earth', 149.6, 5.97, 5514.0, 9.8],
    ['Mars', 227.9, 0.642, 3933.0, 3.7],
    ['Jupiter', 778.6, 1898.0, 1326.0, 23.1],
    ['Saturn', 1433.5, 568.0, 687.0, 9.0],
    ['Uranus', 2872.5, 86.8, 1271.0, 8.7],
    ['Neptune', 4495.1, 102.0, 1638.0, 11.0]
]

planets_data
[['Mercury', 57.9, 0.33, 5427.0, 3.7],
 ['Venus', 108.2, 4.87, 5243.0, 8.9],
 ['Earth', 149.6, 5.97, 5514.0, 9.8],
 ['Mars', 227.9, 0.642, 3933.0, 3.7],
 ['Jupiter', 778.6, 1898.0, 1326.0, 23.1],
 ['Saturn', 1433.5, 568.0, 687.0, 9.0],
 ['Uranus', 2872.5, 86.8, 1271.0, 8.7],
 ['Neptune', 4495.1, 102.0, 1638.0, 11.0]]

Now let’s construct a DataFrame for these data.

planets = pd.DataFrame(planets_data)

planets
0 1 2 3 4
0 Mercury 57.9 0.330 5427.0 3.7
1 Venus 108.2 4.870 5243.0 8.9
2 Earth 149.6 5.970 5514.0 9.8
3 Mars 227.9 0.642 3933.0 3.7
4 Jupiter 778.6 1898.000 1326.0 23.1
5 Saturn 1433.5 568.000 687.0 9.0
6 Uranus 2872.5 86.800 1271.0 8.7
7 Neptune 4495.1 102.000 1638.0 11.0

This presentation of our data already looks more like a spreadsheet!

However, there’s something odd about the above. We’re accustomed now to numbering elements of a sequence by their index (or “offset”) – 0, 1, 2, 3, … – and this works in this case for numbering our rows. But this isn’t as useful a scheme for labeling our columns.

We’ll make manipulation of this data easier, and avoid confusion about what these values represent, by defining useful column labels.

Luckily, we already defined our features above.

planets_features
['name', 'solar_distance_km_6', 'mass_kg_24', 'density_kg_m3', 'gravity_m_s2']
planets = pd.DataFrame(planets_data, columns=planets_features)

planets
name solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
0 Mercury 57.9 0.330 5427.0 3.7
1 Venus 108.2 4.870 5243.0 8.9
2 Earth 149.6 5.970 5514.0 9.8
3 Mars 227.9 0.642 3933.0 3.7
4 Jupiter 778.6 1898.000 1326.0 23.1
5 Saturn 1433.5 568.000 687.0 9.0
6 Uranus 2872.5 86.800 1271.0 8.7
7 Neptune 4495.1 102.000 1638.0 11.0

That’s better!

Indeed, there are many ways to construct a DataFrame.

For another example, we might have specified our data as a single dictionary of features.

planets_dict = {
    'name': ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
    'solar_distance_km_6': [57.9, 108.2, 149.6, 227.9, 778.6, 1433.5, 2872.5, 4495.1],
    'mass_kg_24': [0.33, 4.87, 5.97, 0.642, 1898.0, 568.0, 86.8, 102.0],
    'density_kg_m3': [5427.0, 5243.0, 5514.0, 3933.0, 1326.0, 687.0, 1271.0, 1638.0],
    'gravity_m_s2': [3.7, 8.9, 9.8, 3.7, 23.1, 9.0, 8.7, 11.0],
}

pd.DataFrame(planets_dict)
name solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
0 Mercury 57.9 0.330 5427.0 3.7
1 Venus 108.2 4.870 5243.0 8.9
2 Earth 149.6 5.970 5514.0 9.8
3 Mars 227.9 0.642 3933.0 3.7
4 Jupiter 778.6 1898.000 1326.0 23.1
5 Saturn 1433.5 568.000 687.0 9.0
6 Uranus 2872.5 86.800 1271.0 8.7
7 Neptune 4495.1 102.000 1638.0 11.0

Data formats#

Of course, it is very common to store data in a file format, such as CSV. pandas supports a great many common data encoding formats, and makes it easy to construct DataFrames from them.

For example, if we had a CSV file in our “Documents” folder, we might construct a DataFrame from it using the pandas read_csv function, like so:

data = pd.read_csv('/Users/MySelf/Documents/my-data.csv')

Above, we simply gave pandas the file system path to our CSV data. The read_csv function also supports file objects, such as those returned by Python’s open function.

Our planetary data, encoded as CSV, takes the following form:

name,solar_distance_km_6,mass_kg_24,density_kg_m3,gravity_m_s2
Mercury,57.9,0.33,5427.0,3.7
Venus,108.2,4.87,5243.0,8.9
Earth,149.6,5.97,5514.0,9.8
Mars,227.9,0.642,3933.0,3.7
Jupiter,778.6,1898.0,1326.0,23.1
Saturn,1433.5,568.0,687.0,9.0
Uranus,2872.5,86.8,1271.0,8.7
Neptune,4495.1,102.0,1638.0,11.0

Note that we’ve included our feature names as the first row of our data. (This is optional – but useful!)

And below we’ll reload our planets DataFrame, similarly to the above – (but from a file buffer of that data, planets_csv, the details of which are hidden below).

#
# Hello!
#
# This code allows you to download and execute this notebook as-is –
# without a separate CSV file.
#
# We can just *pretend* that `planets_csv` is a path to a file –
# or, more apt, a file object opened with the Python `open` function.
#
# (Really it's an in-memory file object … but that's not important here!)
#
import io


planets_encoded = '''\
name,solar_distance_km_6,mass_kg_24,density_kg_m3,gravity_m_s2
Mercury,57.9,0.33,5427.0,3.7
Venus,108.2,4.87,5243.0,8.9
Earth,149.6,5.97,5514.0,9.8
Mars,227.9,0.642,3933.0,3.7
Jupiter,778.6,1898.0,1326.0,23.1
Saturn,1433.5,568.0,687.0,9.0
Uranus,2872.5,86.8,1271.0,8.7
Neptune,4495.1,102.0,1638.0,11.0
'''

planets_csv = io.StringIO(planets_encoded)
planets = pd.read_csv(planets_csv)

planets
name solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
0 Mercury 57.9 0.330 5427.0 3.7
1 Venus 108.2 4.870 5243.0 8.9
2 Earth 149.6 5.970 5514.0 9.8
3 Mars 227.9 0.642 3933.0 3.7
4 Jupiter 778.6 1898.000 1326.0 23.1
5 Saturn 1433.5 568.000 687.0 9.0
6 Uranus 2872.5 86.800 1271.0 8.7
7 Neptune 4495.1 102.000 1638.0 11.0

Note that pandas automatically inferred that the first row of our CSV data specified the feature names.

The index#

pandas’s default index – the familiar range of integers starting with 0 – is most often sensible for computational data.

This is represented by the RangeIndex type.

planets.index
RangeIndex(start=0, stop=8, step=1)

Of course, that’s not how we think about the planets!

We can tell pandas to use a more familiar index instead.

pd.RangeIndex(1, 9, name='number')
RangeIndex(start=1, stop=9, step=1, name='number')
pd.DataFrame(planets_data,
             columns=planets_features,
             index=pd.RangeIndex(1, 9, name='number'))
name solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
number
1 Mercury 57.9 0.330 5427.0 3.7
2 Venus 108.2 4.870 5243.0 8.9
3 Earth 149.6 5.970 5514.0 9.8
4 Mars 227.9 0.642 3933.0 3.7
5 Jupiter 778.6 1898.000 1326.0 23.1
6 Saturn 1433.5 568.000 687.0 9.0
7 Uranus 2872.5 86.800 1271.0 8.7
8 Neptune 4495.1 102.000 1638.0 11.0

We don’t even have to use ranges … or numbers!

ordinals = ['first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh', 'eighth']

planet_ordinals = pd.DataFrame(planets_data,
                               columns=planets_features,
                               index=pd.Index(ordinals, name='ordinal'))

planet_ordinals
name solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
ordinal
first Mercury 57.9 0.330 5427.0 3.7
second Venus 108.2 4.870 5243.0 8.9
third Earth 149.6 5.970 5514.0 9.8
fourth Mars 227.9 0.642 3933.0 3.7
fifth Jupiter 778.6 1898.000 1326.0 23.1
sixth Saturn 1433.5 568.000 687.0 9.0
seventh Uranus 2872.5 86.800 1271.0 8.7
eighth Neptune 4495.1 102.000 1638.0 11.0

But, in the end, perhaps we’d prefer not to count the planets at all.

Whenever a data feature makes sense to use as the data index – that is, it’s sufficient to always uniquely identify individuals, we can just tell pandas to use that column as the index, instead.

We’ll learn more about manipulating DataFrames in subsequent sections. But, for now, here’s how we would set the name feature as our index, (at least when constructing a DataFrame from lists or dicts).

planets.set_index('name')
solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
name
Mercury 57.9 0.330 5427.0 3.7
Venus 108.2 4.870 5243.0 8.9
Earth 149.6 5.970 5514.0 9.8
Mars 227.9 0.642 3933.0 3.7
Jupiter 778.6 1898.000 1326.0 23.1
Saturn 1433.5 568.000 687.0 9.0
Uranus 2872.5 86.800 1271.0 8.7
Neptune 4495.1 102.000 1638.0 11.0
#
# Hello!
#
# This cell has been hidden – it's just an implementation concern.
#
# Generally, when working with files, you won't need to worry about this.
#
planets_csv.seek(0)
0

The read_csv function, on the other hand, supports this case specifically.

pd.read_csv(planets_csv, index_col='name')
solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
name
Mercury 57.9 0.330 5427.0 3.7
Venus 108.2 4.870 5243.0 8.9
Earth 149.6 5.970 5514.0 9.8
Mars 227.9 0.642 3933.0 3.7
Jupiter 778.6 1898.000 1326.0 23.1
Saturn 1433.5 568.000 687.0 9.0
Uranus 2872.5 86.800 1271.0 8.7
Neptune 4495.1 102.000 1638.0 11.0

Now these DataFrames are looking great! Let’s see what we can do with them.

Operations#

As we’ve seen with the list, (and the string), the DataFrame can be manipulated by functions and built-in operators. Moreover, these offer special-purpose functions which have been bound to their types – that is, methods – which are invoked with expressions of the form below:

name_of_dataframe.name_of_method(argument0, argument1, ..., keyword0=value0, ...)

For example, above we used the set_index method to construct a new DataFrame with the name column set as the data index. Here it is again:

planets.set_index('name')
solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
name
Mercury 57.9 0.330 5427.0 3.7
Venus 108.2 4.870 5243.0 8.9
Earth 149.6 5.970 5514.0 9.8
Mars 227.9 0.642 3933.0 3.7
Jupiter 778.6 1898.000 1326.0 23.1
Saturn 1433.5 568.000 687.0 9.0
Uranus 2872.5 86.800 1271.0 8.7
Neptune 4495.1 102.000 1638.0 11.0

And, similar to methods, there are attributes and properties. These are values which are similarly bound to the DataFrame, but which need not be called:

name_of_dataframe.name_of_property

We made use of the index property above as well, to inspect our DataFrame’s currently-assigned index:

planets.index
RangeIndex(start=0, stop=8, step=1)

pandas offers us many functions, methods and properties to explore!

And now we’re ready to explore the dimensions our data.