Creating a DataFrame
Contents
Creating a DataFrame#
The library#
Like the NumPy array
– and unlike the Python dict
and list
– the pandas DataFrame
is not built into Python.
And so, first, we might have to ensure that the pandas library is installed.
Only then can we tell Python to make the pandas
module available to our code, using the import
statement. For example:
import pandas
Having done so, the DataFrame
type would be available as: pandas.DataFrame
.
That is, unlike with the built-in list
, we would refer to it as “under” the name pandas
, with a dot between the two names.
Or, we could import just DataFrame
, such that it’s available as just DataFrame
, without the rigmarole:
from pandas import DataFrame
However, we’ll be using pandas
a lot! And not just DataFrame
. Following a common convention, we’ll tell Python to assign the library module the name pd
. This way, we’ll be able to refer to the elements of the pandas interface, such as DataFrame
, as e.g.: pd.DataFrame
.
import pandas as pd
The data#
We began to consider tabular or two-dimensional data in Lists, with the distances of planets from our sun. Let’s expand on this example with the below data, adding the planets’ masses, densities and gravities.
planets_features = [
'name', # familiar name
'solar_distance_km_6', # distance from sun: 10**6 km
'mass_kg_24', # absolute mass: 10**24 kg
'density_kg_m3', # density: kg/m**3
'gravity_m_s2', # gravity: m/s**2
]
planets_data = [
['Mercury', 57.9, 0.33, 5427.0, 3.7],
['Venus', 108.2, 4.87, 5243.0, 8.9],
['Earth', 149.6, 5.97, 5514.0, 9.8],
['Mars', 227.9, 0.642, 3933.0, 3.7],
['Jupiter', 778.6, 1898.0, 1326.0, 23.1],
['Saturn', 1433.5, 568.0, 687.0, 9.0],
['Uranus', 2872.5, 86.8, 1271.0, 8.7],
['Neptune', 4495.1, 102.0, 1638.0, 11.0]
]
planets_data
[['Mercury', 57.9, 0.33, 5427.0, 3.7],
['Venus', 108.2, 4.87, 5243.0, 8.9],
['Earth', 149.6, 5.97, 5514.0, 9.8],
['Mars', 227.9, 0.642, 3933.0, 3.7],
['Jupiter', 778.6, 1898.0, 1326.0, 23.1],
['Saturn', 1433.5, 568.0, 687.0, 9.0],
['Uranus', 2872.5, 86.8, 1271.0, 8.7],
['Neptune', 4495.1, 102.0, 1638.0, 11.0]]
Now let’s construct a DataFrame
for these data.
planets = pd.DataFrame(planets_data)
planets
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | Mercury | 57.9 | 0.330 | 5427.0 | 3.7 |
1 | Venus | 108.2 | 4.870 | 5243.0 | 8.9 |
2 | Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
3 | Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
4 | Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
5 | Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
6 | Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 |
7 | Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 |
This presentation of our data already looks more like a spreadsheet!
However, there’s something odd about the above. We’re accustomed now to numbering elements of a sequence by their index (or “offset”) – 0, 1, 2, 3, … – and this works in this case for numbering our rows. But this isn’t as useful a scheme for labeling our columns.
We’ll make manipulation of this data easier, and avoid confusion about what these values represent, by defining useful column labels.
Luckily, we already defined our features above.
planets_features
['name', 'solar_distance_km_6', 'mass_kg_24', 'density_kg_m3', 'gravity_m_s2']
planets = pd.DataFrame(planets_data, columns=planets_features)
planets
name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|---|
0 | Mercury | 57.9 | 0.330 | 5427.0 | 3.7 |
1 | Venus | 108.2 | 4.870 | 5243.0 | 8.9 |
2 | Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
3 | Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
4 | Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
5 | Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
6 | Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 |
7 | Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 |
That’s better!
Indeed, there are many ways to construct a DataFrame
.
For another example, we might have specified our data as a single dictionary of features.
planets_dict = {
'name': ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
'solar_distance_km_6': [57.9, 108.2, 149.6, 227.9, 778.6, 1433.5, 2872.5, 4495.1],
'mass_kg_24': [0.33, 4.87, 5.97, 0.642, 1898.0, 568.0, 86.8, 102.0],
'density_kg_m3': [5427.0, 5243.0, 5514.0, 3933.0, 1326.0, 687.0, 1271.0, 1638.0],
'gravity_m_s2': [3.7, 8.9, 9.8, 3.7, 23.1, 9.0, 8.7, 11.0],
}
pd.DataFrame(planets_dict)
name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|---|
0 | Mercury | 57.9 | 0.330 | 5427.0 | 3.7 |
1 | Venus | 108.2 | 4.870 | 5243.0 | 8.9 |
2 | Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
3 | Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
4 | Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
5 | Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
6 | Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 |
7 | Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 |
Data formats#
Of course, it is very common to store data in a file format, such as CSV. pandas supports a great many common data encoding formats, and makes it easy to construct DataFrames
from them.
For example, if we had a CSV file in our “Documents” folder, we might construct a DataFrame
from it using the pandas read_csv
function, like so:
data = pd.read_csv('/Users/MySelf/Documents/my-data.csv')
Above, we simply gave pandas the file system path to our CSV data. The read_csv
function also supports file objects, such as those returned by Python’s open
function.
Our planetary data, encoded as CSV, takes the following form:
name,solar_distance_km_6,mass_kg_24,density_kg_m3,gravity_m_s2
Mercury,57.9,0.33,5427.0,3.7
Venus,108.2,4.87,5243.0,8.9
Earth,149.6,5.97,5514.0,9.8
Mars,227.9,0.642,3933.0,3.7
Jupiter,778.6,1898.0,1326.0,23.1
Saturn,1433.5,568.0,687.0,9.0
Uranus,2872.5,86.8,1271.0,8.7
Neptune,4495.1,102.0,1638.0,11.0
Note that we’ve included our feature names as the first row of our data. (This is optional – but useful!)
And below we’ll reload our planets DataFrame
, similarly to the above – (but from a file buffer of that data, planets_csv
, the details of which are hidden below).
#
# Hello!
#
# This code allows you to download and execute this notebook as-is –
# without a separate CSV file.
#
# We can just *pretend* that `planets_csv` is a path to a file –
# or, more apt, a file object opened with the Python `open` function.
#
# (Really it's an in-memory file object … but that's not important here!)
#
import io
planets_encoded = '''\
name,solar_distance_km_6,mass_kg_24,density_kg_m3,gravity_m_s2
Mercury,57.9,0.33,5427.0,3.7
Venus,108.2,4.87,5243.0,8.9
Earth,149.6,5.97,5514.0,9.8
Mars,227.9,0.642,3933.0,3.7
Jupiter,778.6,1898.0,1326.0,23.1
Saturn,1433.5,568.0,687.0,9.0
Uranus,2872.5,86.8,1271.0,8.7
Neptune,4495.1,102.0,1638.0,11.0
'''
planets_csv = io.StringIO(planets_encoded)
planets = pd.read_csv(planets_csv)
planets
name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|---|
0 | Mercury | 57.9 | 0.330 | 5427.0 | 3.7 |
1 | Venus | 108.2 | 4.870 | 5243.0 | 8.9 |
2 | Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
3 | Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
4 | Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
5 | Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
6 | Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 |
7 | Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 |
Note that pandas automatically inferred that the first row of our CSV data specified the feature names.
The index#
pandas’s default index – the familiar range of integers starting with 0
– is most often sensible for computational data.
This is represented by the RangeIndex
type.
planets.index
RangeIndex(start=0, stop=8, step=1)
Of course, that’s not how we think about the planets!
We can tell pandas to use a more familiar index instead.
pd.RangeIndex(1, 9, name='number')
RangeIndex(start=1, stop=9, step=1, name='number')
pd.DataFrame(planets_data,
columns=planets_features,
index=pd.RangeIndex(1, 9, name='number'))
name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|---|
number | |||||
1 | Mercury | 57.9 | 0.330 | 5427.0 | 3.7 |
2 | Venus | 108.2 | 4.870 | 5243.0 | 8.9 |
3 | Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
4 | Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
5 | Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
6 | Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
7 | Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 |
8 | Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 |
We don’t even have to use ranges … or numbers!
ordinals = ['first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh', 'eighth']
planet_ordinals = pd.DataFrame(planets_data,
columns=planets_features,
index=pd.Index(ordinals, name='ordinal'))
planet_ordinals
name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|---|
ordinal | |||||
first | Mercury | 57.9 | 0.330 | 5427.0 | 3.7 |
second | Venus | 108.2 | 4.870 | 5243.0 | 8.9 |
third | Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
fourth | Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
fifth | Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
sixth | Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
seventh | Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 |
eighth | Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 |
But, in the end, perhaps we’d prefer not to count the planets at all.
Whenever a data feature makes sense to use as the data index – that is, it’s sufficient to always uniquely identify individuals, we can just tell pandas to use that column as the index, instead.
We’ll learn more about manipulating DataFrames
in subsequent sections. But, for now, here’s how we would set the name
feature as our index, (at least when constructing a DataFrame
from lists
or dicts
).
planets.set_index('name')
solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|
name | ||||
Mercury | 57.9 | 0.330 | 5427.0 | 3.7 |
Venus | 108.2 | 4.870 | 5243.0 | 8.9 |
Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 |
Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 |
#
# Hello!
#
# This cell has been hidden – it's just an implementation concern.
#
# Generally, when working with files, you won't need to worry about this.
#
planets_csv.seek(0)
0
The read_csv
function, on the other hand, supports this case specifically.
pd.read_csv(planets_csv, index_col='name')
solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|
name | ||||
Mercury | 57.9 | 0.330 | 5427.0 | 3.7 |
Venus | 108.2 | 4.870 | 5243.0 | 8.9 |
Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 |
Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 |
Now these DataFrames
are looking great! Let’s see what we can do with them.
Operations#
As we’ve seen with the list
, (and the string), the DataFrame
can be manipulated by functions and built-in operators. Moreover, these offer special-purpose functions which have been bound to their types – that is, methods – which are invoked with expressions of the form below:
name_of_dataframe.name_of_method(argument0, argument1, ..., keyword0=value0, ...)
For example, above we used the set_index
method to construct a new DataFrame
with the name
column set as the data index. Here it is again:
planets.set_index('name')
solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|
name | ||||
Mercury | 57.9 | 0.330 | 5427.0 | 3.7 |
Venus | 108.2 | 4.870 | 5243.0 | 8.9 |
Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 |
Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 |
And, similar to methods, there are attributes and properties. These are values which are similarly bound to the DataFrame
, but which need not be called:
name_of_dataframe.name_of_property
We made use of the index
property above as well, to inspect our DataFrame
’s currently-assigned index:
planets.index
RangeIndex(start=0, stop=8, step=1)
pandas offers us many functions, methods and properties to explore!
And now we’re ready to explore the dimensions our data.