Accessing Columns

Listing columns

One DataFrame property is columns, which shows the column names we assigned.

planets.columns
Index(['name', 'solar_distance_km_6', 'mass_kg_24', 'density_kg_m3',
       'gravity_m_s2'],
      dtype='object')

pandas has constructed an Index object for our columns. But we can get back our column list with the Index method tolist.

planets.columns.tolist()
['name', 'solar_distance_km_6', 'mass_kg_24', 'density_kg_m3', 'gravity_m_s2']
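As a minimal, self-contained sketch of the same idea, the snippet below builds a small stand-in for planets and lists its columns. The name planets_subset is hypothetical, and only two columns and four rows are included; the values are the ones printed elsewhere in this section.

```python
import pandas as pd

# Hypothetical two-column stand-in for the planets DataFrame,
# using values that appear in this section's outputs.
planets_subset = pd.DataFrame({
    "name": ["Mercury", "Venus", "Earth", "Mars"],
    "solar_distance_km_6": [57.9, 108.2, 149.6, 227.9],
})

columns_index = planets_subset.columns   # a pandas Index object
column_list = columns_index.tolist()     # a plain Python list of strings

print(column_list)
```

The Index is convenient inside pandas, while the plain list is handy when the names need to be passed to ordinary Python code.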

Extracting features

Let’s take a look at the first dimension of our data, the features, such as name.

planets.name
0    Mercury
1      Venus
2      Earth
3       Mars
4    Jupiter
5     Saturn
6     Uranus
7    Neptune
Name: name, dtype: object

Above, we’ve extracted from our data a one-dimensional sequence of the names of the planets in our solar system – even though there was no such list in our initial input data planets_data.

Of course, we did specify a sequence like the above when constructing the DataFrame from planets_dict. And, regardless of how the DataFrame is constructed, we can treat it like a dictionary, as well.

Above, we extracted the name feature of our data using its namesake property. We can alternatively do the same via dictionary subscription, specifying the feature to extract as a string.

planets['name']
0    Mercury
1      Venus
2      Earth
3       Mars
4    Jupiter
5     Saturn
6     Uranus
7    Neptune
Name: name, dtype: object
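To check that the two access styles really do return the same data, a short sketch can compare them with the Series method equals. The DataFrame here is a hypothetical four-row stand-in named planets_subset, not the full planets table.

```python
import pandas as pd

# Hypothetical stand-in for planets, limited to the name column.
planets_subset = pd.DataFrame({
    "name": ["Mercury", "Venus", "Earth", "Mars"],
})

by_attribute = planets_subset.name      # property-style access
by_key = planets_subset["name"]         # dictionary-style subscription

# Series.equals compares the contents of the two Series.
same_data = by_attribute.equals(by_key)
print(same_data)
```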

Note

Accessing columns via their associated DataFrame properties, as in the first example, can be convenient. But the dictionary subscription syntax can be more explicit, and it becomes necessary when column names preclude their use as properties.

For example, a feature named name-old, accessed as planets.name-old, would be interpreted by Python as planets.name - old: that is, the name feature minus some entity named old. Unless a variable named old happens to exist, this fails with a NameError.
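The sketch below illustrates two cases where subscription is the safer choice. Both column names are hypothetical and chosen only for illustration: name-old is not a valid Python identifier, and mean collides with an existing DataFrame method.

```python
import pandas as pd

# Hypothetical frame whose column names cannot be used as properties.
renamed = pd.DataFrame({
    "name-old": ["Mercury", "Venus"],   # not a valid Python identifier
    "mean": [1.0, 2.0],                 # same name as the DataFrame.mean method
})

old_names = renamed["name-old"]   # subscription works for any column label
mean_column = renamed["mean"]     # the column, as a Series
mean_attribute = renamed.mean     # the bound method, not the column

print(old_names.tolist())
print(callable(mean_attribute))
```

In the second case no error is raised at all: attribute access silently returns the method, which is one reason the subscription syntax is often preferred in reusable code.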

The Series

The sequence of our extracted feature is another data type provided by pandas: the Series.

type(planets.name)
pandas.core.series.Series

The pandas Series bears similarities to the DataFrame, but the Series handles data one-dimensionally – like Python’s list.

And like the list, the DataFrame, and the Index, the Series provides methods of its own.

We can also extract the next feature, representing the distances of these planets from the sun, in 10^6 km (millions of kilometers).

planets.solar_distance_km_6
0      57.9
1     108.2
2     149.6
3     227.9
4     778.6
5    1433.5
6    2872.5
7    4495.1
Name: solar_distance_km_6, dtype: float64

And we can compute aggregates of this data, such as the average or mean, thanks to the Series method, mean.

planets.solar_distance_km_6.mean()
1265.4125000000001
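For a standalone illustration, the distances printed above can be placed directly in a Series and aggregated. The variable names below are hypothetical; mean, min, and max are standard Series methods.

```python
import pandas as pd

# Solar distances in 10^6 km, copied from the output above.
distances = pd.Series(
    [57.9, 108.2, 149.6, 227.9, 778.6, 1433.5, 2872.5, 4495.1],
    name="solar_distance_km_6",
)

average = distances.mean()    # arithmetic mean of the eight values
nearest = distances.min()     # smallest distance
farthest = distances.max()    # largest distance

print(round(average, 4), nearest, farthest)
```

Rounding the mean hides the trailing floating-point digits seen in the raw result.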

Feature selection

We can also extract multiple features from our DataFrame, to produce another two-dimensional DataFrame, consisting of only the features specified.

This is again done via dictionary subscription, this time specifying a list of the features to include in the resulting DataFrame.

planets[['name', 'solar_distance_km_6']]
      name  solar_distance_km_6
0  Mercury                 57.9
1    Venus                108.2
2    Earth                149.6
3     Mars                227.9
4  Jupiter                778.6
5   Saturn               1433.5
6   Uranus               2872.5
7  Neptune               4495.1

Attention

Above we doubled our square brackets – the outer set indicating the subscription operation and the inner set the list of features to include in our slice.

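Omitting one set of brackets, as in planets['name', 'solar_distance_km_6'], raises a KeyError, because pandas then looks for a single column labelled by that pair of names.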
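The difference between the bracket forms can be seen in a short sketch. It uses a hypothetical four-row stand-in named planets_subset and records the type of result each form produces, including the KeyError raised when the inner list is left out.

```python
import pandas as pd

# Hypothetical stand-in for planets, using values shown in this section.
planets_subset = pd.DataFrame({
    "name": ["Mercury", "Venus", "Earth", "Mars"],
    "solar_distance_km_6": [57.9, 108.2, 149.6, 227.9],
})

two_columns = planets_subset[["name", "solar_distance_km_6"]]  # DataFrame
one_column_frame = planets_subset[["name"]]    # list of one name: still a DataFrame
one_column_series = planets_subset["name"]     # bare string: a Series

# Without the inner list, the two names form a single tuple key,
# which does not match any column label.
try:
    planets_subset["name", "solar_distance_km_6"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(type(two_columns).__name__, type(one_column_series).__name__, raised_key_error)
```

Passing a one-element list is a common way to keep a single column as a two-dimensional DataFrame rather than a Series.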

In Selection by Label, we’ll learn more methods of slicing a DataFrame, such as loc.