Column Operations#

Methods on Series#

We saw in the previous section how data features are represented as columns in the two-dimensional DataFrame, and these columns are implemented by the pandas Series. And, Series objects expose methods, such as mean.

planets.solar_distance_km_6.mean()
1265.4125000000001

pandas supports a number of statistical methods, such as median, standard deviation std, mean absolute deviation mad, and more.

Generally, we might note that the median distance of planets from the sun is less than half their mean.

planets.solar_distance_km_6.median() / planets.solar_distance_km_6.mean()
0.39769640334673473

We know that the planets are spread out. But how so?

planets.solar_distance_km_6
0      57.9
1     108.2
2     149.6
3     227.9
4     778.6
5    1433.5
6    2872.5
7    4495.1
Name: solar_distance_km_6, dtype: float64

Let’s take a look at how their distances increase, each relative to the one before it, with the method diff.

In physical terms, this is simply the distance between each planet.

planets.solar_distance_km_6.diff()
0       NaN
1      50.3
2      41.4
3      78.3
4     550.7
5     654.9
6    1439.0
7    1622.6
Name: solar_distance_km_6, dtype: float64

The above Series tells us that Venus is 50.3 million kilometers farther from the sun than Mercury, Earth is 41.4 million kilometers farther than Venus, etc. (The first value, for Mercury, is NaN – “not a number” – because there is no planet closer than it to the sun, with which to compare its distance.)

We can make this a little clearer by setting the planet names as our index, as we learned in Creating a DataFrame. pandas will carry over this index into computed Series and DataFrame objects.

planets_named = planets.set_index('name')

planets_named.solar_distance_km_6.diff()
name
Mercury       NaN
Venus        50.3
Earth        41.4
Mars         78.3
Jupiter     550.7
Saturn      654.9
Uranus     1439.0
Neptune    1622.6
Name: solar_distance_km_6, dtype: float64

Even better, we can invoke the rename method of the resulting Series such that this label reflects what it actually is.

(
    planets_named
    .solar_distance_km_6
    .diff()
    .rename('relative distance (km^6)')
)
name
Mercury       NaN
Venus        50.3
Earth        41.4
Mars         78.3
Jupiter     550.7
Saturn      654.9
Uranus     1439.0
Neptune    1622.6
Name: relative distance (km^6), dtype: float64

And, indeed, the distances between the planets increase dramatically.

We can see roughly that the first big jump is in the distance between Mars and Jupiter. The distances between the outer planets then continue to be greater than those between the inner planets, and continue to increase. But there’s another big intermediate jump, between Saturn and Uranus.

We can express the above quantitatively, as well, with the pct_change method.

distance_rel_change = (
    planets_named
    .solar_distance_km_6
    .pct_change()
    .rename('fractional distance')
)

distance_rel_change
name
Mercury         NaN
Venus      0.868739
Earth      0.382625
Mars       0.523396
Jupiter    2.416411
Saturn     0.841125
Uranus     1.003837
Neptune    0.564874
Name: fractional distance, dtype: float64

The pct_change method computes the fractional change (despite its name!) between successive elements.

That is, pct_change maps each difference to a ratio of the absolute value before it. Ultimately, this compares the distance of each planet from the sun, \(distance_1\), to the distance of the planet preceding it, \(distance_0\), according to:

\[ \frac{distance_1 - distance_0}{distance_0} \]

We see that Venus is farther from the sun than Mercury – but its distance from Mercury (50.3 km) is a little less than Mercury’s distance from the sun (57.9 km) – only 86.8739% as great. Earth’s relative distance from Venus is even smaller.

Operators on Series#

Series objects also support Python’s built-in operators, such as * for multiplication and / for division.

This is similar to such operations with Python’s list, but note! The Series responds to such operations differently.

planets.solar_distance_km_6
0      57.9
1     108.2
2     149.6
3     227.9
4     778.6
5    1433.5
6    2872.5
7    4495.1
Name: solar_distance_km_6, dtype: float64
2 * planets.solar_distance_km_6
0     115.8
1     216.4
2     299.2
3     455.8
4    1557.2
5    2867.0
6    5745.0
7    8990.2
Name: solar_distance_km_6, dtype: float64

Above, pandas applied our multiplication operation to each value in the Series, and returned a new Series of the results.

The number of rows – or individuals – really, the number of planets referenced by the data, didn’t change. That’s good!

This is useful for data analysis. But it’s quite different from the list:

2 * planets.solar_distance_km_6.tolist()
[57.9,
 108.2,
 149.6,
 227.9,
 778.6,
 1433.5,
 2872.5,
 4495.1,
 57.9,
 108.2,
 149.6,
 227.9,
 778.6,
 1433.5,
 2872.5,
 4495.1]

That is, the list applied the operation to itself, not to its individual elements – having the effect of simply listing the data twice.

Now we can represent our fractional distance changes as conventional percentages – by simply multiplying them by 100.

distance_rel_change * 100
name
Mercury           NaN
Venus       86.873921
Earth       38.262477
Mars        52.339572
Jupiter    241.641071
Saturn      84.112510
Uranus     100.383676
Neptune     56.487380
Name: fractional distance, dtype: float64

…And, named appropriately:

distance_pct_change = (distance_rel_change * 100).rename('percent distance')

distance_pct_change
name
Mercury           NaN
Venus       86.873921
Earth       38.262477
Mars        52.339572
Jupiter    241.641071
Saturn      84.112510
Uranus     100.383676
Neptune     56.487380
Name: percent distance, dtype: float64

Adding columns#

It might be more meaningful to see the above values in context, along with our other planetary data. Indeed, we might consider our computed Series a new, computed, feature of our data.

Assign method#

We can use the assign method to construct a new DataFrame, one which begins with the data from planets or planets_named, and which adds to this our new column.

planets_named.assign(distance_pct_change=distance_pct_change)
solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2 distance_pct_change
name
Mercury 57.9 0.330 5427.0 3.7 NaN
Venus 108.2 4.870 5243.0 8.9 86.873921
Earth 149.6 5.970 5514.0 9.8 38.262477
Mars 227.9 0.642 3933.0 3.7 52.339572
Jupiter 778.6 1898.000 1326.0 23.1 241.641071
Saturn 1433.5 568.000 687.0 9.0 84.112510
Uranus 2872.5 86.800 1271.0 8.7 100.383676
Neptune 4495.1 102.000 1638.0 11.0 56.487380

As you can see, the assign method accepts only named keyword arguments. Notably, these can be any syntactically-valid name – pandas adds the sequence of values you specify to the existing data as a new column, and assigns to the new column this name.

Above we assigned to this column the data contained in a Series. pandas will collect values from any sequence – also including list and array.

Note also that the column may already exist – in which case it is simply overwritten.

Assignment#

It’s far more common, however, to add columns by treating the DataFrame like a dictionary.

The syntax is identical to that when assigning items to the Python dict.

As with the assign method, any sequence of values may be read into the new (or overwritten) column.

planets_named1 = planets.set_index('name')

planets_named1['temperature_mean_c'] = [167, 464, 15, -65, -110, -140, -195, -200]

planets_named1
solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2 temperature_mean_c
name
Mercury 57.9 0.330 5427.0 3.7 167
Venus 108.2 4.870 5243.0 8.9 464
Earth 149.6 5.970 5514.0 9.8 15
Mars 227.9 0.642 3933.0 3.7 -65
Jupiter 778.6 1898.000 1326.0 23.1 -110
Saturn 1433.5 568.000 687.0 9.0 -140
Uranus 2872.5 86.800 1271.0 8.7 -195
Neptune 4495.1 102.000 1638.0 11.0 -200

The syntax of dictionary item assignment also has the advantage that the key – or column name – need not conform to Python’s syntax for names and keywords. Rather, any string is allowed.

planets_named1['how cold is it in degrees centigrade, anyway? ☃'] = [167, 464, 15, -65, -110, -140, -195, -200]

Attention

Assigning columns via this syntax also mirrors the operation with dict in that it modifies the DataFrame “in place” – the expression above did not evaluate to a new DataFrame, because no new DataFrame was constructed.

This is often the desired result.

However, it can lead to mistakes – as the name we’ve assigned to the DataFrame, such as planets or planets_named, then refers to an object with different values than it did before. For this reason, “in place” operations are supported, but not generally the default for pandas methods, such as assign.

Let’s take a look at our now-changed planets_named1.

planets_named1
solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2 temperature_mean_c how cold is it in degrees centigrade, anyway? ☃
name
Mercury 57.9 0.330 5427.0 3.7 167 167
Venus 108.2 4.870 5243.0 8.9 464 464
Earth 149.6 5.970 5514.0 9.8 15 15
Mars 227.9 0.642 3933.0 3.7 -65 -65
Jupiter 778.6 1898.000 1326.0 23.1 -110 -110
Saturn 1433.5 568.000 687.0 9.0 -140 -140
Uranus 2872.5 86.800 1271.0 8.7 -195 -195
Neptune 4495.1 102.000 1638.0 11.0 -200 -200

We re-constructed our “named planets” DataFrame, calling it planets_named1, to avoid modifying planets_named.

(Of course, if this weren’t a simple example, we’d want to use a more descriptive name than that, to reflect our intent with the new data structure.)

Our expression to re-construct the DataFrame wasn’t complex; but, this can be cumbersome.

More generally, when we’d like to maintain multiple separate representations of our data, we can simply make a copy of our initial DataFrame prior to modifying it, using the copy method.

planets_for_holiday = planets_named.copy()

planets_for_holiday and planets_named are now separate objects – they just refer to the same data.

And now we can modify one without affecting the other.

planets_for_holiday['temperature_mean_c'] = [167, 464, 15, -65, -110, -140, -195, -200]

planets_for_holiday
solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2 temperature_mean_c
name
Mercury 57.9 0.330 5427.0 3.7 167
Venus 108.2 4.870 5243.0 8.9 464
Earth 149.6 5.970 5514.0 9.8 15
Mars 227.9 0.642 3933.0 3.7 -65
Jupiter 778.6 1898.000 1326.0 23.1 -110
Saturn 1433.5 568.000 687.0 9.0 -140
Uranus 2872.5 86.800 1271.0 8.7 -195
Neptune 4495.1 102.000 1638.0 11.0 -200
planets_named
solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
name
Mercury 57.9 0.330 5427.0 3.7
Venus 108.2 4.870 5243.0 8.9
Earth 149.6 5.970 5514.0 9.8
Mars 227.9 0.642 3933.0 3.7
Jupiter 778.6 1898.000 1326.0 23.1
Saturn 1433.5 568.000 687.0 9.0
Uranus 2872.5 86.800 1271.0 8.7
Neptune 4495.1 102.000 1638.0 11.0

Our column has been added to planets_for_holiday, but not to planets_named.

Styled presentation#

Arguably, of course, our multiplication by 100 was only an aesthetic change. We might instead preserve the output of our pct_change calculation, and merely adjust the presentation of our DataFrame.

pandas offers the DataFrame property style, whose format method accepts either functions or strings, with which it determines how to present its data. String arguments to this format method follow Python’s standard form for indicating how a value should be presented as text.

Python’s strings offer their own format method. And, for example, we might construct a string presenting the float value 0.868739 as a conventional percentage, as follows.

'{:.2%}'.format(0.868739)
'86.87%'

In a similar manner we can apply this formatting to our column, without altering the underlying values.

First, starting from our basic DataFrameplanets – let’s use what we’ve covered above to prepare just what we’d like to see.

planets_friendly = planets.set_index('name')

planets_friendly['distance_rel_change'] = distance_rel_change

planets_friendly
solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2 distance_rel_change
name
Mercury 57.9 0.330 5427.0 3.7 NaN
Venus 108.2 4.870 5243.0 8.9 0.868739
Earth 149.6 5.970 5514.0 9.8 0.382625
Mars 227.9 0.642 3933.0 3.7 0.523396
Jupiter 778.6 1898.000 1326.0 23.1 2.416411
Saturn 1433.5 568.000 687.0 9.0 0.841125
Uranus 2872.5 86.800 1271.0 8.7 1.003837
Neptune 4495.1 102.000 1638.0 11.0 0.564874

We can now format the presentation of our relative distance column in a “friendly” manner.

planets_friendly.style.format({
    'distance_rel_change': '{:.2%}',
})
  solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2 distance_rel_change
name          
Mercury 57.900000 0.330000 5427.000000 3.700000 nan%
Venus 108.200000 4.870000 5243.000000 8.900000 86.87%
Earth 149.600000 5.970000 5514.000000 9.800000 38.26%
Mars 227.900000 0.642000 3933.000000 3.700000 52.34%
Jupiter 778.600000 1898.000000 1326.000000 23.100000 241.64%
Saturn 1433.500000 568.000000 687.000000 9.000000 84.11%
Uranus 2872.500000 86.800000 1271.000000 8.700000 100.38%
Neptune 4495.100000 102.000000 1638.000000 11.000000 56.49%

Looks good!

The above presentation of our data looks like a DataFrame, but formatted as we wanted. But – to avoid confusion – it’s not actually a DataFrame. And so conventional DataFrame methods won’t apply.

Renaming columns#

We can also rename columns, in this case also for clarity, and again making use of a dict.

planets_friendly = planets_friendly.rename(
    columns={
        'distance_rel_change': 'distance relative change',
    }
)

planets_friendly.style.format({
    'distance relative change': '{:.2%}',
})
  solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2 distance relative change
name          
Mercury 57.900000 0.330000 5427.000000 3.700000 nan%
Venus 108.200000 4.870000 5243.000000 8.900000 86.87%
Earth 149.600000 5.970000 5514.000000 9.800000 38.26%
Mars 227.900000 0.642000 3933.000000 3.700000 52.34%
Jupiter 778.600000 1898.000000 1326.000000 23.100000 241.64%
Saturn 1433.500000 568.000000 687.000000 9.000000 84.11%
Uranus 2872.500000 86.800000 1271.000000 8.700000 100.38%
Neptune 4495.100000 102.000000 1638.000000 11.000000 56.49%

Because we’ve added spaces to our column name, it’s no longer valid in the syntax of Python, and so we can no longer refer to it as we have the other columns, such as solar_distance_km_6:

planets_friendly.solar_distance_km_6
name
Mercury      57.9
Venus       108.2
Earth       149.6
Mars        227.9
Jupiter     778.6
Saturn     1433.5
Uranus     2872.5
Neptune    4495.1
Name: solar_distance_km_6, dtype: float64

But, we can still refer to it using the alternate syntax mentioned above:

planets_friendly['distance relative change']
name
Mercury         NaN
Venus      0.868739
Earth      0.382625
Mars       0.523396
Jupiter    2.416411
Saturn     0.841125
Uranus     1.003837
Neptune    0.564874
Name: distance relative change, dtype: float64

The data underlying our computed feature remains as it was, despite the styled presentation and renaming the column.

Dropping columns#

Finally, we can remove columns from a DataFrame with the drop method.

Let’s pretend that we added that last column in error.

We can remove it – and any others we like – by specifying the column names in a list to this method. By default, drop constructs a new DataFrame with the specified columns removed.

planets_friendly.drop(columns=['distance relative change'])
solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
name
Mercury 57.9 0.330 5427.0 3.7
Venus 108.2 4.870 5243.0 8.9
Earth 149.6 5.970 5514.0 9.8
Mars 227.9 0.642 3933.0 3.7
Jupiter 778.6 1898.000 1326.0 23.1
Saturn 1433.5 568.000 687.0 9.0
Uranus 2872.5 86.800 1271.0 8.7
Neptune 4495.1 102.000 1638.0 11.0