Selection by Condition#

So far we’ve selected individuals based on their position in the row index and based on their row index value or label. But, particularly in larger data sets, this alone may not be practical.

Sorting individuals#

To begin, we might sort our data by the values of a feature or a set of features, using the sort_values method. Say we were having trouble finding Earth – we could produce a new DataFrame sorted by the name feature.

planets.sort_values('name')
name solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
2 Earth 149.6 5.970 5514.0 9.8
4 Jupiter 778.6 1898.000 1326.0 23.1
3 Mars 227.9 0.642 3933.0 3.7
0 Mercury 57.9 0.330 5427.0 3.7
7 Neptune 4495.1 102.000 1638.0 11.0
5 Saturn 1433.5 568.000 687.0 9.0
6 Uranus 2872.5 86.800 1271.0 8.7
1 Venus 108.2 4.870 5243.0 8.9

By default, this sorts individuals in “ascending” order – from alphabetical “first” to “last” and numerical least to greatest.

If instead we wanted to see the most massive planets, we would sort by the mass_kg_24 feature, but in “descending” order.

planets_massive = planets.sort_values('mass_kg_24', ascending=False)

planets_massive
name solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
4 Jupiter 778.6 1898.000 1326.0 23.1
5 Saturn 1433.5 568.000 687.0 9.0
7 Neptune 4495.1 102.000 1638.0 11.0
6 Uranus 2872.5 86.800 1271.0 8.7
2 Earth 149.6 5.970 5514.0 9.8
1 Venus 108.2 4.870 5243.0 8.9
3 Mars 227.9 0.642 3933.0 3.7
0 Mercury 57.9 0.330 5427.0 3.7

In both cases, we are interested in the first-listed rows. For example, we can extract the most massive planet, Jupiter, from the sorted DataFrame.

In this case, we’re not interested in its label in the row index, 4 – this is arbitrary from the standpoint of our query – rather, we want the first individual, i.e. the individual with an index offset of 0. We can access this individual with iloc.

planets_massive.iloc[0]
name                   Jupiter
solar_distance_km_6      778.6
mass_kg_24              1898.0
density_kg_m3           1326.0
gravity_m_s2              23.1
Name: 4, dtype: object

Sorting in this way can get you pretty far. But, more powerfully, you can specify a condition or a set of conditions which individuals must pass in order to be selected for a resulting DataFrame.

Conditions#

pandas will compute more complex queries in the form of Boolean conditions.

Again, the DataFrame property loc will handle the selection. This property supports the output of a DataFrame with the same features as or a subset of features of the original data, but containing only the individuals whose features satisfy the specified condition.

Rather than using an offset, label or slice, this is specified to loc using a Boolean sequence – a mask – which itself indicates the rows satisfying our condition, such as:

[True, False, True]

But don’t worry! We needn’t produce this list ourselves. We can generate it from a simple conditional expression in Python, applied to the Series of data underlying the feature itself.

Let’s review our planetary mass feature.

planets.mass_kg_24
0       0.330
1       4.870
2       5.970
3       0.642
4    1898.000
5     568.000
6      86.800
7     102.000
Name: mass_kg_24, dtype: float64

Our solar system’s inner planets never get any more massive than Earth – less than 6 x 1024 kilograms in mass. We can exclude such lightweights with the mask produced by the following comparison expression.

planets.mass_kg_24 > 6
0    False
1    False
2    False
3    False
4     True
5     True
6     True
7     True
Name: mass_kg_24, dtype: bool

As you can see, we’ve produced a new Series, populated by Boolean values which reflect that the “statement” of our comparison expression – that the planets’ masses are “greater than 6” thousand yottagrams – is False for the first four planets, but True for the remainder.

We can specify to loc this mask – if we like, of course, the expression itself – and produce our new DataFrame of individuals satisfying our condition.

planets.loc[planets.mass_kg_24 > 6]
name solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
4 Jupiter 778.6 1898.0 1326.0 23.1
5 Saturn 1433.5 568.0 687.0 9.0
6 Uranus 2872.5 86.8 1271.0 8.7
7 Neptune 4495.1 102.0 1638.0 11.0

Even simpler, we can construct the same sort of look-up to find Earth.

planets.loc[planets.name == 'Earth']
name solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
2 Earth 149.6 5.97 5514.0 9.8

Above we built our conditions from known values: 6 and “Earth”. But we can build a value look-up into our comparison as well.

Let’s use loc again as a shortcut to a scalar reference value.

planets.loc[2, 'mass_kg_24']
5.97

Above we see that our scalar look-up expression evaluated to the reference value we’d like to specify for our condition.

Now let’s insert this expression into our conditional look-up.

planets.loc[
    planets.mass_kg_24 > planets.loc[2, 'mass_kg_24']
]
name solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
4 Jupiter 778.6 1898.0 1326.0 23.1
5 Saturn 1433.5 568.0 687.0 9.0
6 Uranus 2872.5 86.8 1271.0 8.7
7 Neptune 4495.1 102.0 1638.0 11.0

We can also select individuals that satisfy multiple conditions.

Let’s compare the planets of our solar system to the Earth. We can begin by selecting only those planets whose gravity is within approximately 50% of Earth’s – less than \(14.8\frac{m}{s^2}\) and more than \(4.8\frac{m}{s^2}\).

not_too_much_gravity = planets.gravity_m_s2 < planets.loc[2, 'gravity_m_s2'] + 5

not_too_much_gravity
0     True
1     True
2     True
3     True
4    False
5     True
6     True
7     True
Name: gravity_m_s2, dtype: bool

The above condition will exclude only Jupiter – its gravity, \(23.1\frac{m}{s^2}\), is well outside of bounds.

not_too_little_gravity = planets.gravity_m_s2 > planets.loc[2, 'gravity_m_s2'] - 5

not_too_little_gravity
0    False
1     True
2     True
3    False
4     True
5     True
6     True
7     True
Name: gravity_m_s2, dtype: bool

And this condition will exclude both Mercury and Mars, for their common gravity of \(3.7\frac{m}{s^2}\).

We could now simply invoke loc twice, once for each condition.

(
    planets
    .loc[not_too_much_gravity]
    .loc[not_too_little_gravity]
)
name solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
1 Venus 108.2 4.87 5243.0 8.9
2 Earth 149.6 5.97 5514.0 9.8
5 Saturn 1433.5 568.00 687.0 9.0
6 Uranus 2872.5 86.80 1271.0 8.7
7 Neptune 4495.1 102.00 1638.0 11.0

Or, more powerfully, we can combine our conditions into a single conditional mask.

In this case, we want both conditions to be satisfied, and so we’ll combine them using the bitwise AND operator: &.

not_too_much_gravity & not_too_little_gravity
0    False
1     True
2     True
3    False
4    False
5     True
6     True
7     True
Name: gravity_m_s2, dtype: bool

Above we see that Mercury, Mars and Jupiter will all be excluded.

planets.loc[not_too_much_gravity & not_too_little_gravity]
name solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
1 Venus 108.2 4.87 5243.0 8.9
2 Earth 149.6 5.97 5514.0 9.8
5 Saturn 1433.5 568.00 687.0 9.0
6 Uranus 2872.5 86.80 1271.0 8.7
7 Neptune 4495.1 102.00 1638.0 11.0

We’re left with the majority of the planets. As anticipated, only Mercury, Mars and Jupiter have been excluded – Mercury and Mars for having too little gravity, and Jupiter for having too much.

In fact, we know that Saturn is a gas giant, consisting almost entirely of hydrogen gas – nothing at all like Earth! This fact is indirectly evident from its density: less than 13% of the density of the Earth.

planets.density_kg_m3 / planets.loc[2, 'density_kg_m3']
0    0.984222
1    0.950852
2    1.000000
3    0.713275
4    0.240479
5    0.124592
6    0.230504
7    0.297062
Name: density_kg_m3, dtype: float64

We can exclude these giants as well by combining our gravity-based conditions with a density-based condition: that the planets’ densities are at least 50% of Earth’s.

dense_enough = planets.density_kg_m3 >= planets.loc[2, 'density_kg_m3'] * 0.5

dense_enough
0     True
1     True
2     True
3     True
4    False
5    False
6    False
7    False
Name: density_kg_m3, dtype: bool

All the giant planets – with densities less than 50% of Earth’s – will be excluded by the above mask.

planets.loc[
    not_too_much_gravity
    & not_too_little_gravity
    & dense_enough
]
name solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
1 Venus 108.2 4.87 5243.0 8.9
2 Earth 149.6 5.97 5514.0 9.8

Now we’re left with only Venus and Earth.

And, we’re given an excellent example of the danger of drawing conclusions from too-small a data set!

The two planets appear similar, judging by the above. Venus is considerably closer to the sun – 41.4 million kilometers closer than the Earth; but, as we saw, this is the smallest distance between any of the planets, and it’s difficult to gauge the significance of this feature with regard to its similarity to Earth … at least, this feature on its own.

That said, if our data included a feature like average_temperature_celsius – for which Venus would be 462° – then we’d know why Mars is considered the most likely habitable planet in our solar system, outside of the Earth … never mind its average temperature of -63° Celsius and its much lower gravity!

Wrapping up#

pandas and its DataFrame offer a great many useful additions to the data structures, functions and methods of Python in support of processing and analyzing data.

This introduction only covers some of the basic ways in which pandas builds upon, and differs from, what we’ve seen so far. But don’t be overwhelmed! The best way to learn is to dive in.

Apply what you’ve learned here, consult the pandas documentation – (and the Internet) – and read on!