Selection by Label
Contents
Selection by Label#
With iloc
, we were able to select rows from our data according to their numeric position in the index.
With loc
, we can also select rows – and columns – by their index value or “label”, selecting subsets of individuals and features, (and more!).
Individuals#
Let’s begin with the natural numeric index for our DataFrame
– an integer range starting with 1
– and review some of what we’ve seen so far.
planets_natural = planets.set_index(pd.RangeIndex(1, 9, name='number'))
planets_natural
name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|---|
number | |||||
1 | Mercury | 57.9 | 0.330 | 5427.0 | 3.7 |
2 | Venus | 108.2 | 4.870 | 5243.0 | 8.9 |
3 | Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
4 | Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
5 | Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
6 | Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
7 | Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 |
8 | Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 |
With iloc
, we retrieve the third individual from this DataFrame
by specifying the offset 2
.
planets_natural.iloc[2]
name Earth
solar_distance_km_6 149.6
mass_kg_24 5.97
density_kg_m3 5514.0
gravity_m_s2 9.8
Name: 3, dtype: object
But with loc
, we can retrieve this individual – Earth – by its index value or “row label” – now 3
.
planets_natural.loc[3]
name Earth
solar_distance_km_6 149.6
mass_kg_24 5.97
density_kg_m3 5514.0
gravity_m_s2 9.8
Name: 3, dtype: object
While numeric indexes are often more practical, we can further distinguish these individuals’ labels in the row index by using familiar strings instead.
ordinals = ['first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh', 'eighth']
planets_ordinal = planets.set_index(pd.Index(ordinals, name='ordinal'))
planets_ordinal
name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|---|
ordinal | |||||
first | Mercury | 57.9 | 0.330 | 5427.0 | 3.7 |
second | Venus | 108.2 | 4.870 | 5243.0 | 8.9 |
third | Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
fourth | Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
fifth | Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
sixth | Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
seventh | Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 |
eighth | Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 |
Even better, let’s use the planets DataFrame
column name
.
planets_named = planets.set_index('name')
planets_named
solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|
name | ||||
Mercury | 57.9 | 0.330 | 5427.0 | 3.7 |
Venus | 108.2 | 4.870 | 5243.0 | 8.9 |
Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 |
Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 |
loc
supports the same selection features as iloc
, such as slicing, even as it considers labels rather than position in the index.
middle_planets = planets_named.loc['Earth':'Saturn']
middle_planets
solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|
name | ||||
Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
Reminder
In the above example, loc
interpreted our slice differently than iloc
– both the elements at the slice start
and stop
were included in the results.
This is convenient, because labels aren’t necessarily incremental; we may not have in mind the label following the last one we want. Nonetheless, this is an inconsistency.
That said, loc
resolves the inconsistency of the reference used in selecting out individuals from slices of the original DataFrame
.
Using iloc
we must refer to Earth as the “zeroeth” (zero-offset or “first”) in the above data subset named middle_planets
…
middle_planets.iloc[0]
solar_distance_km_6 149.60
mass_kg_24 5.97
density_kg_m3 5514.00
gravity_m_s2 9.80
Name: Earth, dtype: float64
…but using loc
, we may continue to refer to Earth by its proper label.
middle_planets.loc['Earth']
solar_distance_km_6 149.60
mass_kg_24 5.97
density_kg_m3 5514.00
gravity_m_s2 9.80
Name: Earth, dtype: float64
Features#
In addition to selecting out individuals by row label, loc
also allows us to select features by column label.
We first saw this in operating directly on the DataFrame
.
planets_named[['solar_distance_km_6', 'mass_kg_24']]
solar_distance_km_6 | mass_kg_24 | |
---|---|---|
name | ||
Mercury | 57.9 | 0.330 |
Venus | 108.2 | 4.870 |
Earth | 149.6 | 5.970 |
Mars | 227.9 | 0.642 |
Jupiter | 778.6 | 1898.000 |
Saturn | 1433.5 | 568.000 |
Uranus | 2872.5 | 86.800 |
Neptune | 4495.1 | 102.000 |
Above, we’ve constructed a new DataFrame
of our data, consisting of only its index and the features solar_distance_km_6
and mass_kg_24
.
In fact, this is column selection through loc
, but with the default that all rows are also returned.
Previously we’ve passed only one argument to loc
– the label or slice of labels of rows to select. Of course, we could have passed the empty slice, to return all rows.
planets_named.loc[:]
solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|
name | ||||
Mercury | 57.9 | 0.330 | 5427.0 | 3.7 |
Venus | 108.2 | 4.870 | 5243.0 | 8.9 |
Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 |
Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 |
Why would we ever pass the empty slice? For one, this permits us to pass a second argument to loc
, and reproduce the above feature selection.
planets_named.loc[:, ['solar_distance_km_6', 'mass_kg_24']]
solar_distance_km_6 | mass_kg_24 | |
---|---|---|
name | ||
Mercury | 57.9 | 0.330 |
Venus | 108.2 | 4.870 |
Earth | 149.6 | 5.970 |
Mars | 227.9 | 0.642 |
Jupiter | 778.6 | 1898.000 |
Saturn | 1433.5 | 568.000 |
Uranus | 2872.5 | 86.800 |
Neptune | 4495.1 | 102.000 |
Unlike what we’ve seen so far, element retrieval from the loc
property can accept multiple arguments.
The first argument indicates row(s) to select – as above, by label, rather than offset.
The second argument does the same, but for column(s).
Because these arguments are positional, we can’t provide the second argument without also providing the first. Luckily, we needn’t supply arguments to the slice itself, so indicating that a new sequence should be constructed from the same elements.
And of course now we can go further and specify a slice of individuals as well.
planets_named.loc['Earth':'Saturn', ['solar_distance_km_6', 'mass_kg_24']]
solar_distance_km_6 | mass_kg_24 | |
---|---|---|
name | ||
Earth | 149.6 | 5.970 |
Mars | 227.9 | 0.642 |
Jupiter | 778.6 | 1898.000 |
Saturn | 1433.5 | 568.000 |
Or a selection of both individuals and features.
planets_named.loc[['Earth', 'Saturn'], ['solar_distance_km_6', 'mass_kg_24']]
solar_distance_km_6 | mass_kg_24 | |
---|---|---|
name | ||
Earth | 149.6 | 5.97 |
Saturn | 1433.5 | 568.00 |
Or even a slice of both individuals and features.
planets_named.loc['Earth':'Saturn', 'solar_distance_km_6':'density_kg_m3']
solar_distance_km_6 | mass_kg_24 | density_kg_m3 | |
---|---|---|---|
name | |||
Earth | 149.6 | 5.970 | 5514.0 |
Mars | 227.9 | 0.642 | 3933.0 |
Jupiter | 778.6 | 1898.000 | 1326.0 |
Saturn | 1433.5 | 568.000 | 687.0 |
We can even extract singular rows and columns with this syntax.
planets_named.loc[:, 'solar_distance_km_6']
name
Mercury 57.9
Venus 108.2
Earth 149.6
Mars 227.9
Jupiter 778.6
Saturn 1433.5
Uranus 2872.5
Neptune 4495.1
Name: solar_distance_km_6, dtype: float64
…And even singular scalar values.
planets_named.loc['Earth', 'solar_distance_km_6']
149.6
All together#
Say we wanted to contextualize distance_pct_change
, reproducing our DataFrame
, but this time with only the most relevant features – name
and solar_distance_km_6
– and for only the “middle” planets.
Bear in mind that there are many ways about this.
For one, column assignment enables us to add a column to a new or existing DataFrame
. And, as above, we can create a new DataFrame
consisting of a subset of features and individuals via loc
.
Our distance_pct_change
feature computation took the following general form.
distance_pct_change = (
100 * planets.solar_distance_km_6.pct_change()
).rename('percent distance')
distance_pct_change
0 NaN
1 86.873921
2 38.262477
3 52.339572
4 241.641071
5 84.112510
6 100.383676
7 56.487380
Name: percent distance, dtype: float64
We can construct our new DataFrame
, with only the most relevant features, with loc
….
planets_distances = planets.loc[2:5, ['name', 'solar_distance_km_6']]
planets_distances
name | solar_distance_km_6 | |
---|---|---|
2 | Earth | 149.6 |
3 | Mars | 227.9 |
4 | Jupiter | 778.6 |
5 | Saturn | 1433.5 |
…And assign our feature to this DataFrame
.
planets_distances['distance_pct_change'] = distance_pct_change
planets_distances
name | solar_distance_km_6 | distance_pct_change | |
---|---|---|---|
2 | Earth | 149.6 | 38.262477 |
3 | Mars | 227.9 | 52.339572 |
4 | Jupiter | 778.6 | 241.641071 |
5 | Saturn | 1433.5 | 84.112510 |
Of course, we computed our feature from planets
, which uses the default row index, and then assigned it to a DataFrame
constructed from the same data and the same index.
What if we wanted to use another index?
We could of course change the index of the result.
planets_distances.set_index('name')
solar_distance_km_6 | distance_pct_change | |
---|---|---|
name | ||
Earth | 149.6 | 38.262477 |
Mars | 227.9 | 52.339572 |
Jupiter | 778.6 | 241.641071 |
Saturn | 1433.5 | 84.112510 |
Or, we could have performed all of our operations from that starting point, using planets_named
in lieu of planets
.
What would happen if we attempted to mix and match?
planets_distances = planets_natural.loc[3:6, ['name', 'solar_distance_km_6']]
planets_distances['distance_pct_change'] = distance_pct_change
planets_distances
name | solar_distance_km_6 | distance_pct_change | |
---|---|---|---|
number | |||
3 | Earth | 149.6 | 52.339572 |
4 | Mars | 227.9 | 241.641071 |
5 | Jupiter | 778.6 | 84.112510 |
6 | Saturn | 1433.5 | 100.383676 |
The above should look almost correct – but not quite.
According to the above result, Earth’s distance from Venus is 52% of Venus’s distance from the sun.
But, previously, this value was 38%.
The problem is that we computed distance_pct_change
from a DataFrame
with an index starting at 0
, and then assigned it to a DataFrame
with an index starting at 1
.
This is more obvious when we look at all the rows of our data.
(
planets_natural[['name', 'solar_distance_km_6']]
.assign(distance_pct_change=distance_pct_change)
)
name | solar_distance_km_6 | distance_pct_change | |
---|---|---|---|
number | |||
1 | Mercury | 57.9 | 86.873921 |
2 | Venus | 108.2 | 38.262477 |
3 | Earth | 149.6 | 52.339572 |
4 | Mars | 227.9 | 241.641071 |
5 | Jupiter | 778.6 | 84.112510 |
6 | Saturn | 1433.5 | 100.383676 |
7 | Uranus | 2872.5 | 56.487380 |
8 | Neptune | 4495.1 | NaN |
The above result would have us believe that Mercury’s distance from … nothing … is 86.87% of … nothing’s … distance from the sun; and, that Neptune’s distance from Uranus is NaN
of Uranus’s distance from the sun.
Physically, of course, this is all wrong.
Luckily, pandas provides yet another sneaky way to change an index and fix this: adding 1
!
distance_pct_change.index += 1
distance_pct_change
1 NaN
2 86.873921
3 38.262477
4 52.339572
5 241.641071
6 84.112510
7 100.383676
8 56.487380
Name: percent distance, dtype: float64
Now that its index uses natural numbering, let’s reassign this Series
as a feature of our natural numbered planets.
(
planets_natural[['name', 'solar_distance_km_6']]
.assign(distance_pct_change=distance_pct_change)
)
name | solar_distance_km_6 | distance_pct_change | |
---|---|---|---|
number | |||
1 | Mercury | 57.9 | NaN |
2 | Venus | 108.2 | 86.873921 |
3 | Earth | 149.6 | 38.262477 |
4 | Mars | 227.9 | 52.339572 |
5 | Jupiter | 778.6 | 241.641071 |
6 | Saturn | 1433.5 | 84.112510 |
7 | Uranus | 2872.5 | 100.383676 |
8 | Neptune | 4495.1 | 56.487380 |
Looking better so far….
planets_distances['distance_pct_change'] = distance_pct_change
planets_distances
name | solar_distance_km_6 | distance_pct_change | |
---|---|---|---|
number | |||
3 | Earth | 149.6 | 38.262477 |
4 | Mars | 227.9 | 52.339572 |
5 | Jupiter | 778.6 | 241.641071 |
6 | Saturn | 1433.5 | 84.112510 |
Much better!
And what about planets_named
?
planets_distances = planets_named.loc['Earth':'Saturn', ['solar_distance_km_6']]
planets_distances['distance_pct_change'] = distance_pct_change
planets_distances
solar_distance_km_6 | distance_pct_change | |
---|---|---|
name | ||
Earth | 149.6 | NaN |
Mars | 227.9 | NaN |
Jupiter | 778.6 | NaN |
Saturn | 1433.5 | NaN |
Yikes! But we have ways of fixing that too.
Rather than manipulating the index of distance_pct_change
, we can simply assign the index we want it to use, (provided it has the values we need!).
distance_pct_change.index = planets_named.index
distance_pct_change
name
Mercury NaN
Venus 86.873921
Earth 38.262477
Mars 52.339572
Jupiter 241.641071
Saturn 84.112510
Uranus 100.383676
Neptune 56.487380
Name: percent distance, dtype: float64
planets_distances['distance_pct_change'] = distance_pct_change
planets_distances
solar_distance_km_6 | distance_pct_change | |
---|---|---|
name | ||
Earth | 149.6 | 38.262477 |
Mars | 227.9 | 52.339572 |
Jupiter | 778.6 | 241.641071 |
Saturn | 1433.5 | 84.112510 |
And now we’ve seen how to select individuals, features and scalar values by label, perform simple statistics on these, and combine these into new sets of data!