Selection by Condition#

So far we’ve selected individuals based on their position in the row index and based on their row index value or label. But, particularly in larger data sets, this alone may not be practical.

Sorting individuals#

To begin, we might sort our data by the values of a feature or a set of features, using the sort_values method. Say we were having trouble finding Earth – we could produce a new DataFrame sorted by the name feature.

planets.sort_values('name')

	name	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
2	Earth	149.6	5.970	5514.0	9.8
4	Jupiter	778.6	1898.000	1326.0	23.1
3	Mars	227.9	0.642	3933.0	3.7
0	Mercury	57.9	0.330	5427.0	3.7
7	Neptune	4495.1	102.000	1638.0	11.0
5	Saturn	1433.5	568.000	687.0	9.0
6	Uranus	2872.5	86.800	1271.0	8.7
1	Venus	108.2	4.870	5243.0	8.9

By default, this sorts individuals in “ascending” order – from alphabetical “first” to “last” and numerical least to greatest.

If instead we wanted to see the most massive planets, we would sort by the mass_kg_24 feature, but in “descending” order.

planets_massive = planets.sort_values('mass_kg_24', ascending=False)

planets_massive

	name	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
4	Jupiter	778.6	1898.000	1326.0	23.1
5	Saturn	1433.5	568.000	687.0	9.0
7	Neptune	4495.1	102.000	1638.0	11.0
6	Uranus	2872.5	86.800	1271.0	8.7
2	Earth	149.6	5.970	5514.0	9.8
1	Venus	108.2	4.870	5243.0	8.9
3	Mars	227.9	0.642	3933.0	3.7
0	Mercury	57.9	0.330	5427.0	3.7

In both cases, we are interested in the first-listed rows. For example, we can extract the most massive planet, Jupiter, from the sorted DataFrame.

In this case, we’re not interested in its label in the row index, 4 – this is arbitrary from the standpoint of our query – rather, we want the first individual, i.e. the individual with an index offset of 0. We can access this individual with iloc.

planets_massive.iloc[0]

name                   Jupiter
solar_distance_km_6      778.6
mass_kg_24              1898.0
density_kg_m3           1326.0
gravity_m_s2              23.1
Name: 4, dtype: object

Sorting in this way can get you pretty far. But, more powerfully, you can specify a condition or a set of conditions which individuals must pass in order to be selected for a resulting DataFrame.

Conditions#

pandas will compute more complex queries in the form of Boolean conditions.

Again, the DataFrame property loc will handle the selection. This property supports the output of a DataFrame with the same features as or a subset of features of the original data, but containing only the individuals whose features satisfy the specified condition.

Rather than using an offset, label or slice, this is specified to loc using a Boolean sequence – a mask – which itself indicates the rows satisfying our condition, such as:

[True, False, True]

But don’t worry! We needn’t produce this list ourselves. We can generate it from a simple conditional expression in Python, applied to the Series of data underlying the feature itself.

Let’s review our planetary mass feature.

planets.mass_kg_24

     0.330
     4.870
     5.970
     0.642
  1898.000
   568.000
    86.800
   102.000
Name: mass_kg_24, dtype: float64

Our solar system’s inner planets never get any more massive than Earth – less than 6 x 10²⁴ kilograms in mass. We can exclude such lightweights with the mask produced by the following comparison expression.

planets.mass_kg_24 > 6

  False
  False
  False
  False
   True
   True
   True
   True
Name: mass_kg_24, dtype: bool

As you can see, we’ve produced a new Series, populated by Boolean values which reflect that the “statement” of our comparison expression – that the planets’ masses are “greater than 6” thousand yottagrams – is False for the first four planets, but True for the remainder.

We can specify to loc this mask – if we like, of course, the expression itself – and produce our new DataFrame of individuals satisfying our condition.

planets.loc[planets.mass_kg_24 > 6]

	name	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
4	Jupiter	778.6	1898.0	1326.0	23.1
5	Saturn	1433.5	568.0	687.0	9.0
6	Uranus	2872.5	86.8	1271.0	8.7
7	Neptune	4495.1	102.0	1638.0	11.0

Even simpler, we can construct the same sort of look-up to find Earth.

planets.loc[planets.name == 'Earth']

	name	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
2	Earth	149.6	5.97	5514.0	9.8

Above we built our conditions from known values: 6 and “Earth”. But we can build a value look-up into our comparison as well.

Let’s use loc again as a shortcut to a scalar reference value.

planets.loc[2, 'mass_kg_24']

5.97

Above we see that our scalar look-up expression evaluated to the reference value we’d like to specify for our condition.

Now let’s insert this expression into our conditional look-up.

planets.loc[
    planets.mass_kg_24 > planets.loc[2, 'mass_kg_24']
]

	name	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
4	Jupiter	778.6	1898.0	1326.0	23.1
5	Saturn	1433.5	568.0	687.0	9.0
6	Uranus	2872.5	86.8	1271.0	8.7
7	Neptune	4495.1	102.0	1638.0	11.0

We can also select individuals that satisfy multiple conditions.

Let’s compare the planets of our solar system to the Earth. We can begin by selecting only those planets whose gravity is within approximately 50% of Earth’s – less than \(14.8\frac{m}{s^2}\) and more than \(4.8\frac{m}{s^2}\).

not_too_much_gravity = planets.gravity_m_s2 < planets.loc[2, 'gravity_m_s2'] + 5

not_too_much_gravity

   True
   True
   True
   True
  False
   True
   True
   True
Name: gravity_m_s2, dtype: bool

The above condition will exclude only Jupiter – its gravity, \(23.1\frac{m}{s^2}\), is well outside of bounds.

not_too_little_gravity = planets.gravity_m_s2 > planets.loc[2, 'gravity_m_s2'] - 5

not_too_little_gravity

  False
   True
   True
  False
   True
   True
   True
   True
Name: gravity_m_s2, dtype: bool

And this condition will exclude both Mercury and Mars, for their common gravity of \(3.7\frac{m}{s^2}\).

We could now simply invoke loc twice, once for each condition.

(
    planets
    .loc[not_too_much_gravity]
    .loc[not_too_little_gravity]
)

	name	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
1	Venus	108.2	4.87	5243.0	8.9
2	Earth	149.6	5.97	5514.0	9.8
5	Saturn	1433.5	568.00	687.0	9.0
6	Uranus	2872.5	86.80	1271.0	8.7
7	Neptune	4495.1	102.00	1638.0	11.0

Or, more powerfully, we can combine our conditions into a single conditional mask.

In this case, we want both conditions to be satisfied, and so we’ll combine them using the bitwise AND operator: &.

not_too_much_gravity & not_too_little_gravity

  False
   True
   True
  False
  False
   True
   True
   True
Name: gravity_m_s2, dtype: bool

Above we see that Mercury, Mars and Jupiter will all be excluded.

planets.loc[not_too_much_gravity & not_too_little_gravity]

	name	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
1	Venus	108.2	4.87	5243.0	8.9
2	Earth	149.6	5.97	5514.0	9.8
5	Saturn	1433.5	568.00	687.0	9.0
6	Uranus	2872.5	86.80	1271.0	8.7
7	Neptune	4495.1	102.00	1638.0	11.0

We’re left with the majority of the planets. As anticipated, only Mercury, Mars and Jupiter have been excluded – Mercury and Mars for having too little gravity, and Jupiter for having too much.

In fact, we know that Saturn is a gas giant, consisting almost entirely of hydrogen gas – nothing at all like Earth! This fact is indirectly evident from its density: less than 13% of the density of the Earth.

planets.density_kg_m3 / planets.loc[2, 'density_kg_m3']

  0.984222
  0.950852
  1.000000
  0.713275
  0.240479
  0.124592
  0.230504
  0.297062
Name: density_kg_m3, dtype: float64

We can exclude these giants as well by combining our gravity-based conditions with a density-based condition: that the planets’ densities are at least 50% of Earth’s.

dense_enough = planets.density_kg_m3 >= planets.loc[2, 'density_kg_m3'] * 0.5

dense_enough

   True
   True
   True
   True
  False
  False
  False
  False
Name: density_kg_m3, dtype: bool

All the giant planets – with densities less than 50% of Earth’s – will be excluded by the above mask.

planets.loc[
    not_too_much_gravity
    & not_too_little_gravity
    & dense_enough
]

	name	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
1	Venus	108.2	4.87	5243.0	8.9
2	Earth	149.6	5.97	5514.0	9.8

Now we’re left with only Venus and Earth.

And, we’re given an excellent example of the danger of drawing conclusions from too-small a data set!

The two planets appear similar, judging by the above. Venus is considerably closer to the sun – 41.4 million kilometers closer than the Earth; but, as we saw, this is the smallest distance between any of the planets, and it’s difficult to gauge the significance of this feature with regard to its similarity to Earth … at least, this feature on its own.

That said, if our data included a feature like average_temperature_celsius – for which Venus would be 462° – then we’d know why Mars is considered the most likely habitable planet in our solar system, outside of the Earth … never mind its average temperature of -63° Celsius and its much lower gravity!

Wrapping up#

pandas and its DataFrame offer a great many useful additions to the data structures, functions and methods of Python in support of processing and analyzing data.

This introduction only covers some of the basic ways in which pandas builds upon, and differs from, what we’ve seen so far. But don’t be overwhelmed! The best way to learn is to dive in.

Apply what you’ve learned here, consult the pandas documentation – (and the Internet) – and read on!

Introduction to Data Science I & II

Selection by Condition

Contents

Selection by Condition#

Sorting individuals#

Conditions#

Wrapping up#