Selection by Label#

With iloc, we were able to select rows from our data according to their numeric position in the index.

With loc, we can also select rowsand columns – by their index value or “label”, selecting subsets of individuals and features, (and more!).

Individuals#

Let’s begin with the natural numeric index for our DataFrame – an integer range starting with 1 – and review some of what we’ve seen so far.

planets_natural = planets.set_index(pd.RangeIndex(1, 9, name='number'))

planets_natural
name solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
number
1 Mercury 57.9 0.330 5427.0 3.7
2 Venus 108.2 4.870 5243.0 8.9
3 Earth 149.6 5.970 5514.0 9.8
4 Mars 227.9 0.642 3933.0 3.7
5 Jupiter 778.6 1898.000 1326.0 23.1
6 Saturn 1433.5 568.000 687.0 9.0
7 Uranus 2872.5 86.800 1271.0 8.7
8 Neptune 4495.1 102.000 1638.0 11.0

With iloc, we retrieve the third individual from this DataFrame by specifying the offset 2.

planets_natural.iloc[2]
name                    Earth
solar_distance_km_6     149.6
mass_kg_24               5.97
density_kg_m3          5514.0
gravity_m_s2              9.8
Name: 3, dtype: object

But with loc, we can retrieve this individual – Earth – by its index value or “row label” – now 3.

planets_natural.loc[3]
name                    Earth
solar_distance_km_6     149.6
mass_kg_24               5.97
density_kg_m3          5514.0
gravity_m_s2              9.8
Name: 3, dtype: object

While numeric indexes are often more practical, we can further distinguish these individuals’ labels in the row index by using familiar strings instead.

ordinals = ['first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh', 'eighth']

planets_ordinal = planets.set_index(pd.Index(ordinals, name='ordinal'))

planets_ordinal
name solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
ordinal
first Mercury 57.9 0.330 5427.0 3.7
second Venus 108.2 4.870 5243.0 8.9
third Earth 149.6 5.970 5514.0 9.8
fourth Mars 227.9 0.642 3933.0 3.7
fifth Jupiter 778.6 1898.000 1326.0 23.1
sixth Saturn 1433.5 568.000 687.0 9.0
seventh Uranus 2872.5 86.800 1271.0 8.7
eighth Neptune 4495.1 102.000 1638.0 11.0

Even better, let’s use the planets DataFrame column name.

planets_named = planets.set_index('name')

planets_named
solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
name
Mercury 57.9 0.330 5427.0 3.7
Venus 108.2 4.870 5243.0 8.9
Earth 149.6 5.970 5514.0 9.8
Mars 227.9 0.642 3933.0 3.7
Jupiter 778.6 1898.000 1326.0 23.1
Saturn 1433.5 568.000 687.0 9.0
Uranus 2872.5 86.800 1271.0 8.7
Neptune 4495.1 102.000 1638.0 11.0

loc supports the same selection features as iloc, such as slicing, even as it considers labels rather than position in the index.

middle_planets = planets_named.loc['Earth':'Saturn']

middle_planets
solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
name
Earth 149.6 5.970 5514.0 9.8
Mars 227.9 0.642 3933.0 3.7
Jupiter 778.6 1898.000 1326.0 23.1
Saturn 1433.5 568.000 687.0 9.0

Reminder

In the above example, loc interpreted our slice differently than ilocboth the elements at the slice start and stop were included in the results.

This is convenient, because labels aren’t necessarily incremental; we may not have in mind the label following the last one we want. Nonetheless, this is an inconsistency.

That said, loc resolves the inconsistency of the reference used in selecting out individuals from slices of the original DataFrame.

Using iloc we must refer to Earth as the “zeroeth” (zero-offset or “first”) in the above data subset named middle_planets

middle_planets.iloc[0]
solar_distance_km_6     149.60
mass_kg_24                5.97
density_kg_m3          5514.00
gravity_m_s2              9.80
Name: Earth, dtype: float64

…but using loc, we may continue to refer to Earth by its proper label.

middle_planets.loc['Earth']
solar_distance_km_6     149.60
mass_kg_24                5.97
density_kg_m3          5514.00
gravity_m_s2              9.80
Name: Earth, dtype: float64

Features#

In addition to selecting out individuals by row label, loc also allows us to select features by column label.

We first saw this in operating directly on the DataFrame.

planets_named[['solar_distance_km_6', 'mass_kg_24']]
solar_distance_km_6 mass_kg_24
name
Mercury 57.9 0.330
Venus 108.2 4.870
Earth 149.6 5.970
Mars 227.9 0.642
Jupiter 778.6 1898.000
Saturn 1433.5 568.000
Uranus 2872.5 86.800
Neptune 4495.1 102.000

Above, we’ve constructed a new DataFrame of our data, consisting of only its index and the features solar_distance_km_6 and mass_kg_24.

In fact, this is column selection through loc, but with the default that all rows are also returned.

Previously we’ve passed only one argument to loc – the label or slice of labels of rows to select. Of course, we could have passed the empty slice, to return all rows.

planets_named.loc[:]
solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
name
Mercury 57.9 0.330 5427.0 3.7
Venus 108.2 4.870 5243.0 8.9
Earth 149.6 5.970 5514.0 9.8
Mars 227.9 0.642 3933.0 3.7
Jupiter 778.6 1898.000 1326.0 23.1
Saturn 1433.5 568.000 687.0 9.0
Uranus 2872.5 86.800 1271.0 8.7
Neptune 4495.1 102.000 1638.0 11.0

Why would we ever pass the empty slice? For one, this permits us to pass a second argument to loc, and reproduce the above feature selection.

planets_named.loc[:, ['solar_distance_km_6', 'mass_kg_24']]
solar_distance_km_6 mass_kg_24
name
Mercury 57.9 0.330
Venus 108.2 4.870
Earth 149.6 5.970
Mars 227.9 0.642
Jupiter 778.6 1898.000
Saturn 1433.5 568.000
Uranus 2872.5 86.800
Neptune 4495.1 102.000

Unlike what we’ve seen so far, element retrieval from the loc property can accept multiple arguments.

  1. The first argument indicates row(s) to select – as above, by label, rather than offset.

  2. The second argument does the same, but for column(s).

Because these arguments are positional, we can’t provide the second argument without also providing the first. Luckily, we needn’t supply arguments to the slice itself, so indicating that a new sequence should be constructed from the same elements.

And of course now we can go further and specify a slice of individuals as well.

planets_named.loc['Earth':'Saturn', ['solar_distance_km_6', 'mass_kg_24']]
solar_distance_km_6 mass_kg_24
name
Earth 149.6 5.970
Mars 227.9 0.642
Jupiter 778.6 1898.000
Saturn 1433.5 568.000

Or a selection of both individuals and features.

planets_named.loc[['Earth', 'Saturn'], ['solar_distance_km_6', 'mass_kg_24']]
solar_distance_km_6 mass_kg_24
name
Earth 149.6 5.97
Saturn 1433.5 568.00

Or even a slice of both individuals and features.

planets_named.loc['Earth':'Saturn', 'solar_distance_km_6':'density_kg_m3']
solar_distance_km_6 mass_kg_24 density_kg_m3
name
Earth 149.6 5.970 5514.0
Mars 227.9 0.642 3933.0
Jupiter 778.6 1898.000 1326.0
Saturn 1433.5 568.000 687.0

We can even extract singular rows and columns with this syntax.

planets_named.loc[:, 'solar_distance_km_6']
name
Mercury      57.9
Venus       108.2
Earth       149.6
Mars        227.9
Jupiter     778.6
Saturn     1433.5
Uranus     2872.5
Neptune    4495.1
Name: solar_distance_km_6, dtype: float64

…And even singular scalar values.

planets_named.loc['Earth', 'solar_distance_km_6']
149.6

All together#

Say we wanted to contextualize distance_pct_change, reproducing our DataFrame, but this time with only the most relevant features – name and solar_distance_km_6and for only the “middle” planets.

Bear in mind that there are many ways about this.

For one, column assignment enables us to add a column to a new or existing DataFrame. And, as above, we can create a new DataFrame consisting of a subset of features and individuals via loc.

Our distance_pct_change feature computation took the following general form.

distance_pct_change = (
    100 * planets.solar_distance_km_6.pct_change()
).rename('percent distance')

distance_pct_change
0           NaN
1     86.873921
2     38.262477
3     52.339572
4    241.641071
5     84.112510
6    100.383676
7     56.487380
Name: percent distance, dtype: float64

We can construct our new DataFrame, with only the most relevant features, with loc….

planets_distances = planets.loc[2:5, ['name', 'solar_distance_km_6']]

planets_distances
name solar_distance_km_6
2 Earth 149.6
3 Mars 227.9
4 Jupiter 778.6
5 Saturn 1433.5

…And assign our feature to this DataFrame.

planets_distances['distance_pct_change'] = distance_pct_change

planets_distances
name solar_distance_km_6 distance_pct_change
2 Earth 149.6 38.262477
3 Mars 227.9 52.339572
4 Jupiter 778.6 241.641071
5 Saturn 1433.5 84.112510

Of course, we computed our feature from planets, which uses the default row index, and then assigned it to a DataFrame constructed from the same data and the same index.

What if we wanted to use another index?

We could of course change the index of the result.

planets_distances.set_index('name')
solar_distance_km_6 distance_pct_change
name
Earth 149.6 38.262477
Mars 227.9 52.339572
Jupiter 778.6 241.641071
Saturn 1433.5 84.112510

Or, we could have performed all of our operations from that starting point, using planets_named in lieu of planets.

What would happen if we attempted to mix and match?

planets_distances = planets_natural.loc[3:6, ['name', 'solar_distance_km_6']]

planets_distances['distance_pct_change'] = distance_pct_change

planets_distances
name solar_distance_km_6 distance_pct_change
number
3 Earth 149.6 52.339572
4 Mars 227.9 241.641071
5 Jupiter 778.6 84.112510
6 Saturn 1433.5 100.383676

The above should look almost correct – but not quite.

According to the above result, Earth’s distance from Venus is 52% of Venus’s distance from the sun.

But, previously, this value was 38%.

The problem is that we computed distance_pct_change from a DataFrame with an index starting at 0, and then assigned it to a DataFrame with an index starting at 1.

This is more obvious when we look at all the rows of our data.

(
    planets_natural[['name', 'solar_distance_km_6']]
    .assign(distance_pct_change=distance_pct_change)
)
name solar_distance_km_6 distance_pct_change
number
1 Mercury 57.9 86.873921
2 Venus 108.2 38.262477
3 Earth 149.6 52.339572
4 Mars 227.9 241.641071
5 Jupiter 778.6 84.112510
6 Saturn 1433.5 100.383676
7 Uranus 2872.5 56.487380
8 Neptune 4495.1 NaN

The above result would have us believe that Mercury’s distance from … nothing … is 86.87% of … nothing’s … distance from the sun; and, that Neptune’s distance from Uranus is NaN of Uranus’s distance from the sun.

Physically, of course, this is all wrong.

Luckily, pandas provides yet another sneaky way to change an index and fix this: adding 1!

distance_pct_change.index += 1

distance_pct_change
1           NaN
2     86.873921
3     38.262477
4     52.339572
5    241.641071
6     84.112510
7    100.383676
8     56.487380
Name: percent distance, dtype: float64

Now that its index uses natural numbering, let’s reassign this Series as a feature of our natural numbered planets.

(
    planets_natural[['name', 'solar_distance_km_6']]
    .assign(distance_pct_change=distance_pct_change)
)
name solar_distance_km_6 distance_pct_change
number
1 Mercury 57.9 NaN
2 Venus 108.2 86.873921
3 Earth 149.6 38.262477
4 Mars 227.9 52.339572
5 Jupiter 778.6 241.641071
6 Saturn 1433.5 84.112510
7 Uranus 2872.5 100.383676
8 Neptune 4495.1 56.487380

Looking better so far….

planets_distances['distance_pct_change'] = distance_pct_change

planets_distances
name solar_distance_km_6 distance_pct_change
number
3 Earth 149.6 38.262477
4 Mars 227.9 52.339572
5 Jupiter 778.6 241.641071
6 Saturn 1433.5 84.112510

Much better!

And what about planets_named?

planets_distances = planets_named.loc['Earth':'Saturn', ['solar_distance_km_6']]

planets_distances['distance_pct_change'] = distance_pct_change

planets_distances
solar_distance_km_6 distance_pct_change
name
Earth 149.6 NaN
Mars 227.9 NaN
Jupiter 778.6 NaN
Saturn 1433.5 NaN

Yikes! But we have ways of fixing that too.

Rather than manipulating the index of distance_pct_change, we can simply assign the index we want it to use, (provided it has the values we need!).

distance_pct_change.index = planets_named.index

distance_pct_change
name
Mercury           NaN
Venus       86.873921
Earth       38.262477
Mars        52.339572
Jupiter    241.641071
Saturn      84.112510
Uranus     100.383676
Neptune     56.487380
Name: percent distance, dtype: float64
planets_distances['distance_pct_change'] = distance_pct_change

planets_distances
solar_distance_km_6 distance_pct_change
name
Earth 149.6 38.262477
Mars 227.9 52.339572
Jupiter 778.6 241.641071
Saturn 1433.5 84.112510

And now we’ve seen how to select individuals, features and scalar values by label, perform simple statistics on these, and combine these into new sets of data!