Column Operations
Contents
Column Operations#
Methods on Series#
We saw in the previous section how data features are represented as columns in the two-dimensional DataFrame
, and these columns are implemented by the pandas Series
. And, Series
objects expose methods, such as mean
.
planets.solar_distance_km_6.mean()
1265.4125000000001
pandas supports a number of statistical methods, such as median
, standard deviation std
, mean absolute deviation mad
, and more.
Generally, we might note that the median distance of planets from the sun is less than half their mean.
planets.solar_distance_km_6.median() / planets.solar_distance_km_6.mean()
0.39769640334673473
We know that the planets are spread out. But how so?
planets.solar_distance_km_6
0 57.9
1 108.2
2 149.6
3 227.9
4 778.6
5 1433.5
6 2872.5
7 4495.1
Name: solar_distance_km_6, dtype: float64
Let’s take a look at how their distances increase, each relative to the one before it, with the method diff
.
In physical terms, this is simply the distance between each planet.
planets.solar_distance_km_6.diff()
0 NaN
1 50.3
2 41.4
3 78.3
4 550.7
5 654.9
6 1439.0
7 1622.6
Name: solar_distance_km_6, dtype: float64
The above Series
tells us that Venus is 50.3 million kilometers farther from the sun than Mercury, Earth is 41.4 million kilometers farther than Venus, etc. (The first value, for Mercury, is NaN
– “not a number” – because there is no planet closer than it to the sun, with which to compare its distance.)
We can make this a little clearer by setting the planet names as our index, as we learned in Creating a DataFrame. pandas will carry over this index into computed Series
and DataFrame
objects.
planets_named = planets.set_index('name')
planets_named.solar_distance_km_6.diff()
name
Mercury NaN
Venus 50.3
Earth 41.4
Mars 78.3
Jupiter 550.7
Saturn 654.9
Uranus 1439.0
Neptune 1622.6
Name: solar_distance_km_6, dtype: float64
Even better, we can invoke the rename
method of the resulting Series
such that this label reflects what it actually is.
(
planets_named
.solar_distance_km_6
.diff()
.rename('relative distance (km^6)')
)
name
Mercury NaN
Venus 50.3
Earth 41.4
Mars 78.3
Jupiter 550.7
Saturn 654.9
Uranus 1439.0
Neptune 1622.6
Name: relative distance (km^6), dtype: float64
And, indeed, the distances between the planets increase dramatically.
We can see roughly that the first big jump is in the distance between Mars and Jupiter. The distances between the outer planets then continue to be greater than those between the inner planets, and continue to increase. But there’s another big intermediate jump, between Saturn and Uranus.
We can express the above quantitatively, as well, with the pct_change
method.
distance_rel_change = (
planets_named
.solar_distance_km_6
.pct_change()
.rename('fractional distance')
)
distance_rel_change
name
Mercury NaN
Venus 0.868739
Earth 0.382625
Mars 0.523396
Jupiter 2.416411
Saturn 0.841125
Uranus 1.003837
Neptune 0.564874
Name: fractional distance, dtype: float64
The pct_change
method computes the fractional change (despite its name!) between successive elements.
That is, pct_change
maps each difference to a ratio of the absolute value before it. Ultimately, this compares the distance of each planet from the sun, \(distance_1\), to the distance of the planet preceding it, \(distance_0\), according to:
We see that Venus is farther from the sun than Mercury – but its distance from Mercury (50.3 km) is a little less than Mercury’s distance from the sun (57.9 km) – only 86.8739% as great. Earth’s relative distance from Venus is even smaller.
Operators on Series#
Series
objects also support Python’s built-in operators, such as *
for multiplication and /
for division.
This is similar to such operations with Python’s list
, but note! The Series
responds to such operations differently.
planets.solar_distance_km_6
0 57.9
1 108.2
2 149.6
3 227.9
4 778.6
5 1433.5
6 2872.5
7 4495.1
Name: solar_distance_km_6, dtype: float64
2 * planets.solar_distance_km_6
0 115.8
1 216.4
2 299.2
3 455.8
4 1557.2
5 2867.0
6 5745.0
7 8990.2
Name: solar_distance_km_6, dtype: float64
Above, pandas applied our multiplication operation to each value in the Series
, and returned a new Series
of the results.
The number of rows – or individuals – really, the number of planets referenced by the data, didn’t change. That’s good!
This is useful for data analysis. But it’s quite different from the list
:
2 * planets.solar_distance_km_6.tolist()
[57.9,
108.2,
149.6,
227.9,
778.6,
1433.5,
2872.5,
4495.1,
57.9,
108.2,
149.6,
227.9,
778.6,
1433.5,
2872.5,
4495.1]
That is, the list
applied the operation to itself, not to its individual elements – having the effect of simply listing the data twice.
Now we can represent our fractional distance changes as conventional percentages – by simply multiplying them by 100
.
distance_rel_change * 100
name
Mercury NaN
Venus 86.873921
Earth 38.262477
Mars 52.339572
Jupiter 241.641071
Saturn 84.112510
Uranus 100.383676
Neptune 56.487380
Name: fractional distance, dtype: float64
…And, named appropriately:
distance_pct_change = (distance_rel_change * 100).rename('percent distance')
distance_pct_change
name
Mercury NaN
Venus 86.873921
Earth 38.262477
Mars 52.339572
Jupiter 241.641071
Saturn 84.112510
Uranus 100.383676
Neptune 56.487380
Name: percent distance, dtype: float64
Adding columns#
It might be more meaningful to see the above values in context, along with our other planetary data. Indeed, we might consider our computed Series
a new, computed, feature of our data.
Assign method#
We can use the assign
method to construct a new DataFrame
, one which begins with the data from planets
or planets_named
, and which adds to this our new column.
planets_named.assign(distance_pct_change=distance_pct_change)
solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | distance_pct_change | |
---|---|---|---|---|---|
name | |||||
Mercury | 57.9 | 0.330 | 5427.0 | 3.7 | NaN |
Venus | 108.2 | 4.870 | 5243.0 | 8.9 | 86.873921 |
Earth | 149.6 | 5.970 | 5514.0 | 9.8 | 38.262477 |
Mars | 227.9 | 0.642 | 3933.0 | 3.7 | 52.339572 |
Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 | 241.641071 |
Saturn | 1433.5 | 568.000 | 687.0 | 9.0 | 84.112510 |
Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 | 100.383676 |
Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 | 56.487380 |
As you can see, the assign
method accepts only named keyword arguments. Notably, these can be any syntactically-valid name – pandas adds the sequence of values you specify to the existing data as a new column, and assigns to the new column this name.
Above we assigned to this column the data contained in a Series
. pandas will collect values from any sequence – also including list
and array
.
Note also that the column may already exist – in which case it is simply overwritten.
Assignment#
It’s far more common, however, to add columns by treating the DataFrame
like a dictionary.
The syntax is identical to that when assigning items to the Python dict
.
As with the assign
method, any sequence of values may be read into the new (or overwritten) column.
planets_named1 = planets.set_index('name')
planets_named1['temperature_mean_c'] = [167, 464, 15, -65, -110, -140, -195, -200]
planets_named1
solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | temperature_mean_c | |
---|---|---|---|---|---|
name | |||||
Mercury | 57.9 | 0.330 | 5427.0 | 3.7 | 167 |
Venus | 108.2 | 4.870 | 5243.0 | 8.9 | 464 |
Earth | 149.6 | 5.970 | 5514.0 | 9.8 | 15 |
Mars | 227.9 | 0.642 | 3933.0 | 3.7 | -65 |
Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 | -110 |
Saturn | 1433.5 | 568.000 | 687.0 | 9.0 | -140 |
Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 | -195 |
Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 | -200 |
The syntax of dictionary item assignment also has the advantage that the key – or column name – need not conform to Python’s syntax for names and keywords. Rather, any string is allowed.
planets_named1['how cold is it in degrees centigrade, anyway? ☃'] = [167, 464, 15, -65, -110, -140, -195, -200]
Attention
Assigning columns via this syntax also mirrors the operation with dict
in that it modifies the DataFrame
“in place” – the expression above did not evaluate to a new DataFrame
, because no new DataFrame
was constructed.
This is often the desired result.
However, it can lead to mistakes – as the name we’ve assigned to the DataFrame
, such as planets
or planets_named
, then refers to an object with different values than it did before. For this reason, “in place” operations are supported, but not generally the default for pandas methods, such as assign
.
Let’s take a look at our now-changed planets_named1
.
planets_named1
solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | temperature_mean_c | how cold is it in degrees centigrade, anyway? ☃ | |
---|---|---|---|---|---|---|
name | ||||||
Mercury | 57.9 | 0.330 | 5427.0 | 3.7 | 167 | 167 |
Venus | 108.2 | 4.870 | 5243.0 | 8.9 | 464 | 464 |
Earth | 149.6 | 5.970 | 5514.0 | 9.8 | 15 | 15 |
Mars | 227.9 | 0.642 | 3933.0 | 3.7 | -65 | -65 |
Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 | -110 | -110 |
Saturn | 1433.5 | 568.000 | 687.0 | 9.0 | -140 | -140 |
Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 | -195 | -195 |
Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 | -200 | -200 |
We re-constructed our “named planets” DataFrame
, calling it planets_named1
, to avoid modifying planets_named
.
(Of course, if this weren’t a simple example, we’d want to use a more descriptive name than that, to reflect our intent with the new data structure.)
Our expression to re-construct the DataFrame
wasn’t complex; but, this can be cumbersome.
More generally, when we’d like to maintain multiple separate representations of our data, we can simply make a copy of our initial DataFrame
prior to modifying it, using the copy
method.
planets_for_holiday = planets_named.copy()
planets_for_holiday
and planets_named
are now separate objects – they just refer to the same data.
And now we can modify one without affecting the other.
planets_for_holiday['temperature_mean_c'] = [167, 464, 15, -65, -110, -140, -195, -200]
planets_for_holiday
solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | temperature_mean_c | |
---|---|---|---|---|---|
name | |||||
Mercury | 57.9 | 0.330 | 5427.0 | 3.7 | 167 |
Venus | 108.2 | 4.870 | 5243.0 | 8.9 | 464 |
Earth | 149.6 | 5.970 | 5514.0 | 9.8 | 15 |
Mars | 227.9 | 0.642 | 3933.0 | 3.7 | -65 |
Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 | -110 |
Saturn | 1433.5 | 568.000 | 687.0 | 9.0 | -140 |
Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 | -195 |
Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 | -200 |
planets_named
solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|
name | ||||
Mercury | 57.9 | 0.330 | 5427.0 | 3.7 |
Venus | 108.2 | 4.870 | 5243.0 | 8.9 |
Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 |
Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 |
Our column has been added to planets_for_holiday
, but not to planets_named
.
Styled presentation#
Arguably, of course, our multiplication by 100 was only an aesthetic change. We might instead preserve the output of our pct_change
calculation, and merely adjust the presentation of our DataFrame
.
pandas offers the DataFrame
property style
, whose format
method accepts either functions or strings, with which it determines how to present its data. String arguments to this format
method follow Python’s standard form for indicating how a value should be presented as text.
Python’s strings offer their own format
method. And, for example, we might construct a string presenting the float
value 0.868739
as a conventional percentage, as follows.
'{:.2%}'.format(0.868739)
'86.87%'
In a similar manner we can apply this formatting to our column, without altering the underlying values.
First, starting from our basic DataFrame
– planets
– let’s use what we’ve covered above to prepare just what we’d like to see.
planets_friendly = planets.set_index('name')
planets_friendly['distance_rel_change'] = distance_rel_change
planets_friendly
solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | distance_rel_change | |
---|---|---|---|---|---|
name | |||||
Mercury | 57.9 | 0.330 | 5427.0 | 3.7 | NaN |
Venus | 108.2 | 4.870 | 5243.0 | 8.9 | 0.868739 |
Earth | 149.6 | 5.970 | 5514.0 | 9.8 | 0.382625 |
Mars | 227.9 | 0.642 | 3933.0 | 3.7 | 0.523396 |
Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 | 2.416411 |
Saturn | 1433.5 | 568.000 | 687.0 | 9.0 | 0.841125 |
Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 | 1.003837 |
Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 | 0.564874 |
We can now format the presentation of our relative distance column in a “friendly” manner.
planets_friendly.style.format({
'distance_rel_change': '{:.2%}',
})
solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | distance_rel_change | |
---|---|---|---|---|---|
name | |||||
Mercury | 57.900000 | 0.330000 | 5427.000000 | 3.700000 | nan% |
Venus | 108.200000 | 4.870000 | 5243.000000 | 8.900000 | 86.87% |
Earth | 149.600000 | 5.970000 | 5514.000000 | 9.800000 | 38.26% |
Mars | 227.900000 | 0.642000 | 3933.000000 | 3.700000 | 52.34% |
Jupiter | 778.600000 | 1898.000000 | 1326.000000 | 23.100000 | 241.64% |
Saturn | 1433.500000 | 568.000000 | 687.000000 | 9.000000 | 84.11% |
Uranus | 2872.500000 | 86.800000 | 1271.000000 | 8.700000 | 100.38% |
Neptune | 4495.100000 | 102.000000 | 1638.000000 | 11.000000 | 56.49% |
Looks good!
The above presentation of our data looks like a DataFrame
, but formatted as we wanted. But – to avoid confusion – it’s not actually a DataFrame
. And so conventional DataFrame
methods won’t apply.
Renaming columns#
We can also rename columns, in this case also for clarity, and again making use of a dict
.
planets_friendly = planets_friendly.rename(
columns={
'distance_rel_change': 'distance relative change',
}
)
planets_friendly.style.format({
'distance relative change': '{:.2%}',
})
solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | distance relative change | |
---|---|---|---|---|---|
name | |||||
Mercury | 57.900000 | 0.330000 | 5427.000000 | 3.700000 | nan% |
Venus | 108.200000 | 4.870000 | 5243.000000 | 8.900000 | 86.87% |
Earth | 149.600000 | 5.970000 | 5514.000000 | 9.800000 | 38.26% |
Mars | 227.900000 | 0.642000 | 3933.000000 | 3.700000 | 52.34% |
Jupiter | 778.600000 | 1898.000000 | 1326.000000 | 23.100000 | 241.64% |
Saturn | 1433.500000 | 568.000000 | 687.000000 | 9.000000 | 84.11% |
Uranus | 2872.500000 | 86.800000 | 1271.000000 | 8.700000 | 100.38% |
Neptune | 4495.100000 | 102.000000 | 1638.000000 | 11.000000 | 56.49% |
Because we’ve added spaces to our column name, it’s no longer valid in the syntax of Python, and so we can no longer refer to it as we have the other columns, such as solar_distance_km_6
:
planets_friendly.solar_distance_km_6
name
Mercury 57.9
Venus 108.2
Earth 149.6
Mars 227.9
Jupiter 778.6
Saturn 1433.5
Uranus 2872.5
Neptune 4495.1
Name: solar_distance_km_6, dtype: float64
But, we can still refer to it using the alternate syntax mentioned above:
planets_friendly['distance relative change']
name
Mercury NaN
Venus 0.868739
Earth 0.382625
Mars 0.523396
Jupiter 2.416411
Saturn 0.841125
Uranus 1.003837
Neptune 0.564874
Name: distance relative change, dtype: float64
The data underlying our computed feature remains as it was, despite the styled presentation and renaming the column.
Dropping columns#
Finally, we can remove columns from a DataFrame
with the drop
method.
Let’s pretend that we added that last column in error.
We can remove it – and any others we like – by specifying the column names in a list
to this method. By default, drop
constructs a new DataFrame
with the specified columns removed.
planets_friendly.drop(columns=['distance relative change'])
solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|
name | ||||
Mercury | 57.9 | 0.330 | 5427.0 | 3.7 |
Venus | 108.2 | 4.870 | 5243.0 | 8.9 |
Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 |
Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 |