Selection by Condition
Contents
Selection by Condition#
So far we’ve selected individuals based on their position in the row index and based on their row index value or label. But, particularly in larger data sets, this alone may not be practical.
Sorting individuals#
To begin, we might sort our data by the values of a feature or a set of features, using the sort_values
method. Say we were having trouble finding Earth – we could produce a new DataFrame
sorted by the name
feature.
planets.sort_values('name')
name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|---|
2 | Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
4 | Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
3 | Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
0 | Mercury | 57.9 | 0.330 | 5427.0 | 3.7 |
7 | Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 |
5 | Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
6 | Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 |
1 | Venus | 108.2 | 4.870 | 5243.0 | 8.9 |
By default, this sorts individuals in “ascending” order – from alphabetical “first” to “last” and numerical least to greatest.
If instead we wanted to see the most massive planets, we would sort by the mass_kg_24
feature, but in “descending” order.
planets_massive = planets.sort_values('mass_kg_24', ascending=False)
planets_massive
name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|---|
4 | Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
5 | Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
7 | Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 |
6 | Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 |
2 | Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
1 | Venus | 108.2 | 4.870 | 5243.0 | 8.9 |
3 | Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
0 | Mercury | 57.9 | 0.330 | 5427.0 | 3.7 |
In both cases, we are interested in the first-listed rows. For example, we can extract the most massive planet, Jupiter, from the sorted DataFrame
.
In this case, we’re not interested in its label in the row index, 4
– this is arbitrary from the standpoint of our query – rather, we want the first individual, i.e. the individual with an index offset of 0
. We can access this individual with iloc
.
planets_massive.iloc[0]
name Jupiter
solar_distance_km_6 778.6
mass_kg_24 1898.0
density_kg_m3 1326.0
gravity_m_s2 23.1
Name: 4, dtype: object
Sorting in this way can get you pretty far. But, more powerfully, you can specify a condition or a set of conditions which individuals must pass in order to be selected for a resulting DataFrame
.
Conditions#
pandas will compute more complex queries in the form of Boolean conditions.
Again, the DataFrame
property loc
will handle the selection. This property supports the output of a DataFrame
with the same features as or a subset of features of the original data, but containing only the individuals whose features satisfy the specified condition.
Rather than using an offset, label or slice, this is specified to loc
using a Boolean sequence – a mask – which itself indicates the rows satisfying our condition, such as:
[True, False, True]
But don’t worry! We needn’t produce this list
ourselves. We can generate it from a simple conditional expression in Python, applied to the Series
of data underlying the feature itself.
Let’s review our planetary mass feature.
planets.mass_kg_24
0 0.330
1 4.870
2 5.970
3 0.642
4 1898.000
5 568.000
6 86.800
7 102.000
Name: mass_kg_24, dtype: float64
Our solar system’s inner planets never get any more massive than Earth – less than 6 x 1024 kilograms in mass. We can exclude such lightweights with the mask produced by the following comparison expression.
planets.mass_kg_24 > 6
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 True
Name: mass_kg_24, dtype: bool
As you can see, we’ve produced a new Series
, populated by Boolean values which reflect that the “statement” of our comparison expression – that the planets’ masses are “greater than 6” thousand yottagrams – is False
for the first four planets, but True
for the remainder.
We can specify to loc
this mask – if we like, of course, the expression itself – and produce our new DataFrame
of individuals satisfying our condition.
planets.loc[planets.mass_kg_24 > 6]
name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|---|
4 | Jupiter | 778.6 | 1898.0 | 1326.0 | 23.1 |
5 | Saturn | 1433.5 | 568.0 | 687.0 | 9.0 |
6 | Uranus | 2872.5 | 86.8 | 1271.0 | 8.7 |
7 | Neptune | 4495.1 | 102.0 | 1638.0 | 11.0 |
Even simpler, we can construct the same sort of look-up to find Earth.
planets.loc[planets.name == 'Earth']
name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|---|
2 | Earth | 149.6 | 5.97 | 5514.0 | 9.8 |
Above we built our conditions from known values: 6
and “Earth”. But we can build a value look-up into our comparison as well.
Let’s use loc
again as a shortcut to a scalar reference value.
planets.loc[2, 'mass_kg_24']
5.97
Above we see that our scalar look-up expression evaluated to the reference value we’d like to specify for our condition.
Now let’s insert this expression into our conditional look-up.
planets.loc[
planets.mass_kg_24 > planets.loc[2, 'mass_kg_24']
]
name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|---|
4 | Jupiter | 778.6 | 1898.0 | 1326.0 | 23.1 |
5 | Saturn | 1433.5 | 568.0 | 687.0 | 9.0 |
6 | Uranus | 2872.5 | 86.8 | 1271.0 | 8.7 |
7 | Neptune | 4495.1 | 102.0 | 1638.0 | 11.0 |
We can also select individuals that satisfy multiple conditions.
Let’s compare the planets of our solar system to the Earth. We can begin by selecting only those planets whose gravity is within approximately 50% of Earth’s – less than \(14.8\frac{m}{s^2}\) and more than \(4.8\frac{m}{s^2}\).
not_too_much_gravity = planets.gravity_m_s2 < planets.loc[2, 'gravity_m_s2'] + 5
not_too_much_gravity
0 True
1 True
2 True
3 True
4 False
5 True
6 True
7 True
Name: gravity_m_s2, dtype: bool
The above condition will exclude only Jupiter – its gravity, \(23.1\frac{m}{s^2}\), is well outside of bounds.
not_too_little_gravity = planets.gravity_m_s2 > planets.loc[2, 'gravity_m_s2'] - 5
not_too_little_gravity
0 False
1 True
2 True
3 False
4 True
5 True
6 True
7 True
Name: gravity_m_s2, dtype: bool
And this condition will exclude both Mercury and Mars, for their common gravity of \(3.7\frac{m}{s^2}\).
We could now simply invoke loc
twice, once for each condition.
(
planets
.loc[not_too_much_gravity]
.loc[not_too_little_gravity]
)
name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|---|
1 | Venus | 108.2 | 4.87 | 5243.0 | 8.9 |
2 | Earth | 149.6 | 5.97 | 5514.0 | 9.8 |
5 | Saturn | 1433.5 | 568.00 | 687.0 | 9.0 |
6 | Uranus | 2872.5 | 86.80 | 1271.0 | 8.7 |
7 | Neptune | 4495.1 | 102.00 | 1638.0 | 11.0 |
Or, more powerfully, we can combine our conditions into a single conditional mask.
In this case, we want both conditions to be satisfied, and so we’ll combine them using the bitwise AND operator: &
.
not_too_much_gravity & not_too_little_gravity
0 False
1 True
2 True
3 False
4 False
5 True
6 True
7 True
Name: gravity_m_s2, dtype: bool
Above we see that Mercury, Mars and Jupiter will all be excluded.
planets.loc[not_too_much_gravity & not_too_little_gravity]
name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|---|
1 | Venus | 108.2 | 4.87 | 5243.0 | 8.9 |
2 | Earth | 149.6 | 5.97 | 5514.0 | 9.8 |
5 | Saturn | 1433.5 | 568.00 | 687.0 | 9.0 |
6 | Uranus | 2872.5 | 86.80 | 1271.0 | 8.7 |
7 | Neptune | 4495.1 | 102.00 | 1638.0 | 11.0 |
We’re left with the majority of the planets. As anticipated, only Mercury, Mars and Jupiter have been excluded – Mercury and Mars for having too little gravity, and Jupiter for having too much.
In fact, we know that Saturn is a gas giant, consisting almost entirely of hydrogen gas – nothing at all like Earth! This fact is indirectly evident from its density: less than 13% of the density of the Earth.
planets.density_kg_m3 / planets.loc[2, 'density_kg_m3']
0 0.984222
1 0.950852
2 1.000000
3 0.713275
4 0.240479
5 0.124592
6 0.230504
7 0.297062
Name: density_kg_m3, dtype: float64
We can exclude these giants as well by combining our gravity-based conditions with a density-based condition: that the planets’ densities are at least 50% of Earth’s.
dense_enough = planets.density_kg_m3 >= planets.loc[2, 'density_kg_m3'] * 0.5
dense_enough
0 True
1 True
2 True
3 True
4 False
5 False
6 False
7 False
Name: density_kg_m3, dtype: bool
All the giant planets – with densities less than 50% of Earth’s – will be excluded by the above mask.
planets.loc[
not_too_much_gravity
& not_too_little_gravity
& dense_enough
]
name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|---|
1 | Venus | 108.2 | 4.87 | 5243.0 | 8.9 |
2 | Earth | 149.6 | 5.97 | 5514.0 | 9.8 |
Now we’re left with only Venus and Earth.
And, we’re given an excellent example of the danger of drawing conclusions from too-small a data set!
The two planets appear similar, judging by the above. Venus is considerably closer to the sun – 41.4 million kilometers closer than the Earth; but, as we saw, this is the smallest distance between any of the planets, and it’s difficult to gauge the significance of this feature with regard to its similarity to Earth … at least, this feature on its own.
That said, if our data included a feature like average_temperature_celsius
– for which Venus would be 462° – then we’d know why Mars is considered the most likely habitable planet in our solar system, outside of the Earth … never mind its average temperature of -63° Celsius and its much lower gravity!
Wrapping up#
pandas and its DataFrame
offer a great many useful additions to the data structures, functions and methods of Python in support of processing and analyzing data.
This introduction only covers some of the basic ways in which pandas builds upon, and differs from, what we’ve seen so far. But don’t be overwhelmed! The best way to learn is to dive in.
Apply what you’ve learned here, consult the pandas documentation – (and the Internet) – and read on!