Selection by Label#

With iloc, we were able to select rows from our data according to their numeric position in the index.

With loc, we can also select rows – and columns – by their index value or “label”, selecting subsets of individuals and features, (and more!).

Individuals#

Let’s begin with the natural numeric index for our DataFrame – an integer range starting with 1 – and review some of what we’ve seen so far.

planets_natural = planets.set_index(pd.RangeIndex(1, 9, name='number'))

planets_natural

	name	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
number
1	Mercury	57.9	0.330	5427.0	3.7
2	Venus	108.2	4.870	5243.0	8.9
3	Earth	149.6	5.970	5514.0	9.8
4	Mars	227.9	0.642	3933.0	3.7
5	Jupiter	778.6	1898.000	1326.0	23.1
6	Saturn	1433.5	568.000	687.0	9.0
7	Uranus	2872.5	86.800	1271.0	8.7
8	Neptune	4495.1	102.000	1638.0	11.0

With iloc, we retrieve the third individual from this DataFrame by specifying the offset 2.

planets_natural.iloc[2]

name                    Earth
solar_distance_km_6     149.6
mass_kg_24               5.97
density_kg_m3          5514.0
gravity_m_s2              9.8
Name: 3, dtype: object

But with loc, we can retrieve this individual – Earth – by its index value or “row label” – now 3.

planets_natural.loc[3]

name                    Earth
solar_distance_km_6     149.6
mass_kg_24               5.97
density_kg_m3          5514.0
gravity_m_s2              9.8
Name: 3, dtype: object

While numeric indexes are often more practical, we can further distinguish these individuals’ labels in the row index by using familiar strings instead.

ordinals = ['first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh', 'eighth']

planets_ordinal = planets.set_index(pd.Index(ordinals, name='ordinal'))

planets_ordinal

	name	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
ordinal
first	Mercury	57.9	0.330	5427.0	3.7
second	Venus	108.2	4.870	5243.0	8.9
third	Earth	149.6	5.970	5514.0	9.8
fourth	Mars	227.9	0.642	3933.0	3.7
fifth	Jupiter	778.6	1898.000	1326.0	23.1
sixth	Saturn	1433.5	568.000	687.0	9.0
seventh	Uranus	2872.5	86.800	1271.0	8.7
eighth	Neptune	4495.1	102.000	1638.0	11.0

Even better, let’s use the planets DataFrame column name.

planets_named = planets.set_index('name')

planets_named

	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
name
Mercury	57.9	0.330	5427.0	3.7
Venus	108.2	4.870	5243.0	8.9
Earth	149.6	5.970	5514.0	9.8
Mars	227.9	0.642	3933.0	3.7
Jupiter	778.6	1898.000	1326.0	23.1
Saturn	1433.5	568.000	687.0	9.0
Uranus	2872.5	86.800	1271.0	8.7
Neptune	4495.1	102.000	1638.0	11.0

loc supports the same selection features as iloc, such as slicing, even as it considers labels rather than position in the index.

middle_planets = planets_named.loc['Earth':'Saturn']

middle_planets

	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
name
Earth	149.6	5.970	5514.0	9.8
Mars	227.9	0.642	3933.0	3.7
Jupiter	778.6	1898.000	1326.0	23.1
Saturn	1433.5	568.000	687.0	9.0

Reminder

In the above example, loc interpreted our slice differently than iloc – both the elements at the slice start and stop were included in the results.

This is convenient, because labels aren’t necessarily incremental; we may not have in mind the label following the last one we want. Nonetheless, this is an inconsistency.

That said, loc resolves the inconsistency of the reference used in selecting out individuals from slices of the original DataFrame.

Using iloc we must refer to Earth as the “zeroeth” (zero-offset or “first”) in the above data subset named middle_planets…

middle_planets.iloc[0]

solar_distance_km_6     149.60
mass_kg_24                5.97
density_kg_m3          5514.00
gravity_m_s2              9.80
Name: Earth, dtype: float64

…but using loc, we may continue to refer to Earth by its proper label.

middle_planets.loc['Earth']

solar_distance_km_6     149.60
mass_kg_24                5.97
density_kg_m3          5514.00
gravity_m_s2              9.80
Name: Earth, dtype: float64

Features#

In addition to selecting out individuals by row label, loc also allows us to select features by column label.

We first saw this in operating directly on the DataFrame.

planets_named[['solar_distance_km_6', 'mass_kg_24']]

	solar_distance_km_6	mass_kg_24
name
Mercury	57.9	0.330
Venus	108.2	4.870
Earth	149.6	5.970
Mars	227.9	0.642
Jupiter	778.6	1898.000
Saturn	1433.5	568.000
Uranus	2872.5	86.800
Neptune	4495.1	102.000

Above, we’ve constructed a new DataFrame of our data, consisting of only its index and the features solar_distance_km_6 and mass_kg_24.

In fact, this is column selection through loc, but with the default that all rows are also returned.

Previously we’ve passed only one argument to loc – the label or slice of labels of rows to select. Of course, we could have passed the empty slice, to return all rows.

planets_named.loc[:]

	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
name
Mercury	57.9	0.330	5427.0	3.7
Venus	108.2	4.870	5243.0	8.9
Earth	149.6	5.970	5514.0	9.8
Mars	227.9	0.642	3933.0	3.7
Jupiter	778.6	1898.000	1326.0	23.1
Saturn	1433.5	568.000	687.0	9.0
Uranus	2872.5	86.800	1271.0	8.7
Neptune	4495.1	102.000	1638.0	11.0

Why would we ever pass the empty slice? For one, this permits us to pass a second argument to loc, and reproduce the above feature selection.

planets_named.loc[:, ['solar_distance_km_6', 'mass_kg_24']]

	solar_distance_km_6	mass_kg_24
name
Mercury	57.9	0.330
Venus	108.2	4.870
Earth	149.6	5.970
Mars	227.9	0.642
Jupiter	778.6	1898.000
Saturn	1433.5	568.000
Uranus	2872.5	86.800
Neptune	4495.1	102.000

Unlike what we’ve seen so far, element retrieval from the loc property can accept multiple arguments.

The first argument indicates row(s) to select – as above, by label, rather than offset.
The second argument does the same, but for column(s).

Because these arguments are positional, we can’t provide the second argument without also providing the first. Luckily, we needn’t supply arguments to the slice itself, so indicating that a new sequence should be constructed from the same elements.

And of course now we can go further and specify a slice of individuals as well.

planets_named.loc['Earth':'Saturn', ['solar_distance_km_6', 'mass_kg_24']]

	solar_distance_km_6	mass_kg_24
name
Earth	149.6	5.970
Mars	227.9	0.642
Jupiter	778.6	1898.000
Saturn	1433.5	568.000

Or a selection of both individuals and features.

planets_named.loc[['Earth', 'Saturn'], ['solar_distance_km_6', 'mass_kg_24']]

	solar_distance_km_6	mass_kg_24
name
Earth	149.6	5.97
Saturn	1433.5	568.00

Or even a slice of both individuals and features.

planets_named.loc['Earth':'Saturn', 'solar_distance_km_6':'density_kg_m3']

	solar_distance_km_6	mass_kg_24	density_kg_m3
name
Earth	149.6	5.970	5514.0
Mars	227.9	0.642	3933.0
Jupiter	778.6	1898.000	1326.0
Saturn	1433.5	568.000	687.0

We can even extract singular rows and columns with this syntax.

planets_named.loc[:, 'solar_distance_km_6']

name
Mercury      57.9
Venus       108.2
Earth       149.6
Mars        227.9
Jupiter     778.6
Saturn     1433.5
Uranus     2872.5
Neptune    4495.1
Name: solar_distance_km_6, dtype: float64

…And even singular scalar values.

planets_named.loc['Earth', 'solar_distance_km_6']

149.6

All together#

Say we wanted to contextualize distance_pct_change, reproducing our DataFrame, but this time with only the most relevant features – name and solar_distance_km_6 – and for only the “middle” planets.

Bear in mind that there are many ways about this.

For one, column assignment enables us to add a column to a new or existing DataFrame. And, as above, we can create a new DataFrame consisting of a subset of features and individuals via loc.

Our distance_pct_change feature computation took the following general form.

distance_pct_change = (
    100 * planets.solar_distance_km_6.pct_change()
).rename('percent distance')

distance_pct_change

         NaN
   86.873921
   38.262477
   52.339572
  241.641071
   84.112510
  100.383676
   56.487380
Name: percent distance, dtype: float64

We can construct our new DataFrame, with only the most relevant features, with loc….

planets_distances = planets.loc[2:5, ['name', 'solar_distance_km_6']]

planets_distances

	name	solar_distance_km_6
2	Earth	149.6
3	Mars	227.9
4	Jupiter	778.6
5	Saturn	1433.5

…And assign our feature to this DataFrame.

planets_distances['distance_pct_change'] = distance_pct_change

planets_distances

	name	solar_distance_km_6	distance_pct_change
2	Earth	149.6	38.262477
3	Mars	227.9	52.339572
4	Jupiter	778.6	241.641071
5	Saturn	1433.5	84.112510

Of course, we computed our feature from planets, which uses the default row index, and then assigned it to a DataFrame constructed from the same data and the same index.

What if we wanted to use another index?

We could of course change the index of the result.

planets_distances.set_index('name')

	solar_distance_km_6	distance_pct_change
name
Earth	149.6	38.262477
Mars	227.9	52.339572
Jupiter	778.6	241.641071
Saturn	1433.5	84.112510

Or, we could have performed all of our operations from that starting point, using planets_named in lieu of planets.

What would happen if we attempted to mix and match?

planets_distances = planets_natural.loc[3:6, ['name', 'solar_distance_km_6']]

planets_distances['distance_pct_change'] = distance_pct_change

planets_distances

	name	solar_distance_km_6	distance_pct_change
number
3	Earth	149.6	52.339572
4	Mars	227.9	241.641071
5	Jupiter	778.6	84.112510
6	Saturn	1433.5	100.383676

The above should look almost correct – but not quite.

According to the above result, Earth’s distance from Venus is 52% of Venus’s distance from the sun.

But, previously, this value was 38%.

The problem is that we computed distance_pct_change from a DataFrame with an index starting at 0, and then assigned it to a DataFrame with an index starting at 1.

This is more obvious when we look at all the rows of our data.

(
    planets_natural[['name', 'solar_distance_km_6']]
    .assign(distance_pct_change=distance_pct_change)
)

	name	solar_distance_km_6	distance_pct_change
number
1	Mercury	57.9	86.873921
2	Venus	108.2	38.262477
3	Earth	149.6	52.339572
4	Mars	227.9	241.641071
5	Jupiter	778.6	84.112510
6	Saturn	1433.5	100.383676
7	Uranus	2872.5	56.487380
8	Neptune	4495.1	NaN

The above result would have us believe that Mercury’s distance from … nothing … is 86.87% of … nothing’s … distance from the sun; and, that Neptune’s distance from Uranus is NaN of Uranus’s distance from the sun.

Physically, of course, this is all wrong.

Luckily, pandas provides yet another sneaky way to change an index and fix this: adding 1!

distance_pct_change.index += 1

distance_pct_change

         NaN
   86.873921
   38.262477
   52.339572
  241.641071
   84.112510
  100.383676
   56.487380
Name: percent distance, dtype: float64

Now that its index uses natural numbering, let’s reassign this Series as a feature of our natural numbered planets.

(
    planets_natural[['name', 'solar_distance_km_6']]
    .assign(distance_pct_change=distance_pct_change)
)

	name	solar_distance_km_6	distance_pct_change
number
1	Mercury	57.9	NaN
2	Venus	108.2	86.873921
3	Earth	149.6	38.262477
4	Mars	227.9	52.339572
5	Jupiter	778.6	241.641071
6	Saturn	1433.5	84.112510
7	Uranus	2872.5	100.383676
8	Neptune	4495.1	56.487380

Looking better so far….

planets_distances['distance_pct_change'] = distance_pct_change

planets_distances

	name	solar_distance_km_6	distance_pct_change
number
3	Earth	149.6	38.262477
4	Mars	227.9	52.339572
5	Jupiter	778.6	241.641071
6	Saturn	1433.5	84.112510

Much better!

And what about planets_named?

planets_distances = planets_named.loc['Earth':'Saturn', ['solar_distance_km_6']]

planets_distances['distance_pct_change'] = distance_pct_change

planets_distances

	solar_distance_km_6	distance_pct_change
name
Earth	149.6	NaN
Mars	227.9	NaN
Jupiter	778.6	NaN
Saturn	1433.5	NaN

Yikes! But we have ways of fixing that too.

Rather than manipulating the index of distance_pct_change, we can simply assign the index we want it to use, (provided it has the values we need!).

distance_pct_change.index = planets_named.index

distance_pct_change

name
Mercury           NaN
Venus       86.873921
Earth       38.262477
Mars        52.339572
Jupiter    241.641071
Saturn      84.112510
Uranus     100.383676
Neptune     56.487380
Name: percent distance, dtype: float64

planets_distances['distance_pct_change'] = distance_pct_change

planets_distances

	solar_distance_km_6	distance_pct_change
name
Earth	149.6	38.262477
Mars	227.9	52.339572
Jupiter	778.6	241.641071
Saturn	1433.5	84.112510

And now we’ve seen how to select individuals, features and scalar values by label, perform simple statistics on these, and combine these into new sets of data!

Introduction to Data Science I & II

Selection by Label

Contents

Selection by Label#

Individuals#

Features#

All together#