Creating a DataFrame

Creating a DataFrame#

The library#

Like the NumPy array – and unlike the Python dict and list – the pandas DataFrame is not built into Python.

And so, first, we might have to ensure that the pandas library is installed.

Only then can we tell Python to make the pandas module available to our code, using the import statement. For example:

import pandas

Having done so, the DataFrame type would be available as: pandas.DataFrame.

That is, unlike with the built-in list, we would refer to it as “under” the name pandas, with a dot between the two names.

Or, we could import just DataFrame, such that it’s available as just DataFrame, without the rigmarole:

from pandas import DataFrame

However, we’ll be using pandas a lot! And not just DataFrame. Following a common convention, we’ll tell Python to assign the library module the name pd. This way, we’ll be able to refer to the elements of the pandas interface, such as DataFrame, as e.g.: pd.DataFrame.

import pandas as pd

The data#

We began to consider tabular or two-dimensional data in Lists, with the distances of planets from our sun. Let’s expand on this example with the below data, adding the planets’ masses, densities and gravities.

planets_features = [
    'name',                # familiar name
    'solar_distance_km_6', # distance from sun: 10**6 km
    'mass_kg_24',          # absolute mass: 10**24 kg
    'density_kg_m3',       # density: kg/m**3
    'gravity_m_s2',        # gravity: m/s**2
]

planets_data = [
    ['Mercury', 57.9, 0.33, 5427.0, 3.7],
    ['Venus', 108.2, 4.87, 5243.0, 8.9],
    ['Earth', 149.6, 5.97, 5514.0, 9.8],
    ['Mars', 227.9, 0.642, 3933.0, 3.7],
    ['Jupiter', 778.6, 1898.0, 1326.0, 23.1],
    ['Saturn', 1433.5, 568.0, 687.0, 9.0],
    ['Uranus', 2872.5, 86.8, 1271.0, 8.7],
    ['Neptune', 4495.1, 102.0, 1638.0, 11.0]
]

planets_data

[['Mercury', 57.9, 0.33, 5427.0, 3.7],
 ['Venus', 108.2, 4.87, 5243.0, 8.9],
 ['Earth', 149.6, 5.97, 5514.0, 9.8],
 ['Mars', 227.9, 0.642, 3933.0, 3.7],
 ['Jupiter', 778.6, 1898.0, 1326.0, 23.1],
 ['Saturn', 1433.5, 568.0, 687.0, 9.0],
 ['Uranus', 2872.5, 86.8, 1271.0, 8.7],
 ['Neptune', 4495.1, 102.0, 1638.0, 11.0]]

Now let’s construct a DataFrame for these data.

planets = pd.DataFrame(planets_data)

planets

	0	1	2	3	4
0	Mercury	57.9	0.330	5427.0	3.7
1	Venus	108.2	4.870	5243.0	8.9
2	Earth	149.6	5.970	5514.0	9.8
3	Mars	227.9	0.642	3933.0	3.7
4	Jupiter	778.6	1898.000	1326.0	23.1
5	Saturn	1433.5	568.000	687.0	9.0
6	Uranus	2872.5	86.800	1271.0	8.7
7	Neptune	4495.1	102.000	1638.0	11.0

This presentation of our data already looks more like a spreadsheet!

However, there’s something odd about the above. We’re accustomed now to numbering elements of a sequence by their index (or “offset”) – 0, 1, 2, 3, … – and this works in this case for numbering our rows. But this isn’t as useful a scheme for labeling our columns.

We’ll make manipulation of this data easier, and avoid confusion about what these values represent, by defining useful column labels.

Luckily, we already defined our features above.

planets_features

['name', 'solar_distance_km_6', 'mass_kg_24', 'density_kg_m3', 'gravity_m_s2']

planets = pd.DataFrame(planets_data, columns=planets_features)

planets

	name	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
0	Mercury	57.9	0.330	5427.0	3.7
1	Venus	108.2	4.870	5243.0	8.9
2	Earth	149.6	5.970	5514.0	9.8
3	Mars	227.9	0.642	3933.0	3.7
4	Jupiter	778.6	1898.000	1326.0	23.1
5	Saturn	1433.5	568.000	687.0	9.0
6	Uranus	2872.5	86.800	1271.0	8.7
7	Neptune	4495.1	102.000	1638.0	11.0

That’s better!

Indeed, there are many ways to construct a DataFrame.

For another example, we might have specified our data as a single dictionary of features.

planets_dict = {
    'name': ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
    'solar_distance_km_6': [57.9, 108.2, 149.6, 227.9, 778.6, 1433.5, 2872.5, 4495.1],
    'mass_kg_24': [0.33, 4.87, 5.97, 0.642, 1898.0, 568.0, 86.8, 102.0],
    'density_kg_m3': [5427.0, 5243.0, 5514.0, 3933.0, 1326.0, 687.0, 1271.0, 1638.0],
    'gravity_m_s2': [3.7, 8.9, 9.8, 3.7, 23.1, 9.0, 8.7, 11.0],
}

pd.DataFrame(planets_dict)

	name	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
0	Mercury	57.9	0.330	5427.0	3.7
1	Venus	108.2	4.870	5243.0	8.9
2	Earth	149.6	5.970	5514.0	9.8
3	Mars	227.9	0.642	3933.0	3.7
4	Jupiter	778.6	1898.000	1326.0	23.1
5	Saturn	1433.5	568.000	687.0	9.0
6	Uranus	2872.5	86.800	1271.0	8.7
7	Neptune	4495.1	102.000	1638.0	11.0

Data formats#

Of course, it is very common to store data in a file format, such as CSV. pandas supports a great many common data encoding formats, and makes it easy to construct DataFrames from them.

For example, if we had a CSV file in our “Documents” folder, we might construct a DataFrame from it using the pandas read_csv function, like so:

data = pd.read_csv('/Users/MySelf/Documents/my-data.csv')

Above, we simply gave pandas the file system path to our CSV data. The read_csv function also supports file objects, such as those returned by Python’s open function.

Our planetary data, encoded as CSV, takes the following form:

name,solar_distance_km_6,mass_kg_24,density_kg_m3,gravity_m_s2
Mercury,57.9,0.33,5427.0,3.7
Venus,108.2,4.87,5243.0,8.9
Earth,149.6,5.97,5514.0,9.8
Mars,227.9,0.642,3933.0,3.7
Jupiter,778.6,1898.0,1326.0,23.1
Saturn,1433.5,568.0,687.0,9.0
Uranus,2872.5,86.8,1271.0,8.7
Neptune,4495.1,102.0,1638.0,11.0

Note that we’ve included our feature names as the first row of our data. (This is optional – but useful!)

And below we’ll reload our planets DataFrame, similarly to the above – (but from a file buffer of that data, planets_csv, the details of which are hidden below).

planets = pd.read_csv(planets_csv)

planets

	name	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
0	Mercury	57.9	0.330	5427.0	3.7
1	Venus	108.2	4.870	5243.0	8.9
2	Earth	149.6	5.970	5514.0	9.8
3	Mars	227.9	0.642	3933.0	3.7
4	Jupiter	778.6	1898.000	1326.0	23.1
5	Saturn	1433.5	568.000	687.0	9.0
6	Uranus	2872.5	86.800	1271.0	8.7
7	Neptune	4495.1	102.000	1638.0	11.0

Note that pandas automatically inferred that the first row of our CSV data specified the feature names.

The index#

pandas’s default index – the familiar range of integers starting with 0 – is most often sensible for computational data.

This is represented by the RangeIndex type.

planets.index

RangeIndex(start=0, stop=8, step=1)

Of course, that’s not how we think about the planets!

We can tell pandas to use a more familiar index instead.

pd.RangeIndex(1, 9, name='number')

RangeIndex(start=1, stop=9, step=1, name='number')

pd.DataFrame(planets_data,
             columns=planets_features,
             index=pd.RangeIndex(1, 9, name='number'))

	name	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
number
1	Mercury	57.9	0.330	5427.0	3.7
2	Venus	108.2	4.870	5243.0	8.9
3	Earth	149.6	5.970	5514.0	9.8
4	Mars	227.9	0.642	3933.0	3.7
5	Jupiter	778.6	1898.000	1326.0	23.1
6	Saturn	1433.5	568.000	687.0	9.0
7	Uranus	2872.5	86.800	1271.0	8.7
8	Neptune	4495.1	102.000	1638.0	11.0

We don’t even have to use ranges … or numbers!

ordinals = ['first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh', 'eighth']

planet_ordinals = pd.DataFrame(planets_data,
                               columns=planets_features,
                               index=pd.Index(ordinals, name='ordinal'))

planet_ordinals

	name	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
ordinal
first	Mercury	57.9	0.330	5427.0	3.7
second	Venus	108.2	4.870	5243.0	8.9
third	Earth	149.6	5.970	5514.0	9.8
fourth	Mars	227.9	0.642	3933.0	3.7
fifth	Jupiter	778.6	1898.000	1326.0	23.1
sixth	Saturn	1433.5	568.000	687.0	9.0
seventh	Uranus	2872.5	86.800	1271.0	8.7
eighth	Neptune	4495.1	102.000	1638.0	11.0

But, in the end, perhaps we’d prefer not to count the planets at all.

Whenever a data feature makes sense to use as the data index – that is, it’s sufficient to always uniquely identify individuals, we can just tell pandas to use that column as the index, instead.

We’ll learn more about manipulating DataFrames in subsequent sections. But, for now, here’s how we would set the name feature as our index, (at least when constructing a DataFrame from lists or dicts).

planets.set_index('name')

	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
name
Mercury	57.9	0.330	5427.0	3.7
Venus	108.2	4.870	5243.0	8.9
Earth	149.6	5.970	5514.0	9.8
Mars	227.9	0.642	3933.0	3.7
Jupiter	778.6	1898.000	1326.0	23.1
Saturn	1433.5	568.000	687.0	9.0
Uranus	2872.5	86.800	1271.0	8.7
Neptune	4495.1	102.000	1638.0	11.0

The read_csv function, on the other hand, supports this case specifically.

pd.read_csv(planets_csv, index_col='name')

	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
name
Mercury	57.9	0.330	5427.0	3.7
Venus	108.2	4.870	5243.0	8.9
Earth	149.6	5.970	5514.0	9.8
Mars	227.9	0.642	3933.0	3.7
Jupiter	778.6	1898.000	1326.0	23.1
Saturn	1433.5	568.000	687.0	9.0
Uranus	2872.5	86.800	1271.0	8.7
Neptune	4495.1	102.000	1638.0	11.0

Now these DataFrames are looking great! Let’s see what we can do with them.

Operations#

As we’ve seen with the list, (and the string), the DataFrame can be manipulated by functions and built-in operators. Moreover, these offer special-purpose functions which have been bound to their types – that is, methods – which are invoked with expressions of the form below:

name_of_dataframe.name_of_method(argument0, argument1, ..., keyword0=value0, ...)

For example, above we used the set_index method to construct a new DataFrame with the name column set as the data index. Here it is again:

planets.set_index('name')

	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
name
Mercury	57.9	0.330	5427.0	3.7
Venus	108.2	4.870	5243.0	8.9
Earth	149.6	5.970	5514.0	9.8
Mars	227.9	0.642	3933.0	3.7
Jupiter	778.6	1898.000	1326.0	23.1
Saturn	1433.5	568.000	687.0	9.0
Uranus	2872.5	86.800	1271.0	8.7
Neptune	4495.1	102.000	1638.0	11.0

And, similar to methods, there are attributes and properties. These are values which are similarly bound to the DataFrame, but which need not be called:

name_of_dataframe.name_of_property

We made use of the index property above as well, to inspect our DataFrame’s currently-assigned index:

planets.index

RangeIndex(start=0, stop=8, step=1)

pandas offers us many functions, methods and properties to explore!

And now we’re ready to explore the dimensions our data.