Column Operations#

Methods on Series#

We saw in the previous section how data features are represented as columns in the two-dimensional DataFrame, and these columns are implemented by the pandas Series. And, Series objects expose methods, such as mean.

planets.solar_distance_km_6.mean()

1265.4125000000001

pandas supports a number of statistical methods, such as median, standard deviation std, mean absolute deviation mad, and more.

Generally, we might note that the median distance of planets from the sun is less than half their mean.

planets.solar_distance_km_6.median() / planets.solar_distance_km_6.mean()

0.39769640334673473

We know that the planets are spread out. But how so?

planets.solar_distance_km_6

    57.9
   108.2
   149.6
   227.9
   778.6
  1433.5
  2872.5
  4495.1
Name: solar_distance_km_6, dtype: float64

Let’s take a look at how their distances increase, each relative to the one before it, with the method diff.

In physical terms, this is simply the distance between each planet.

planets.solar_distance_km_6.diff()

     NaN
    50.3
    41.4
    78.3
   550.7
   654.9
  1439.0
  1622.6
Name: solar_distance_km_6, dtype: float64

The above Series tells us that Venus is 50.3 million kilometers farther from the sun than Mercury, Earth is 41.4 million kilometers farther than Venus, etc. (The first value, for Mercury, is NaN – “not a number” – because there is no planet closer than it to the sun, with which to compare its distance.)

We can make this a little clearer by setting the planet names as our index, as we learned in Creating a DataFrame. pandas will carry over this index into computed Series and DataFrame objects.

planets_named = planets.set_index('name')

planets_named.solar_distance_km_6.diff()

name
Mercury       NaN
Venus        50.3
Earth        41.4
Mars         78.3
Jupiter     550.7
Saturn      654.9
Uranus     1439.0
Neptune    1622.6
Name: solar_distance_km_6, dtype: float64

Even better, we can invoke the rename method of the resulting Series such that this label reflects what it actually is.

(
    planets_named
    .solar_distance_km_6
    .diff()
    .rename('relative distance (km^6)')
)

name
Mercury       NaN
Venus        50.3
Earth        41.4
Mars         78.3
Jupiter     550.7
Saturn      654.9
Uranus     1439.0
Neptune    1622.6
Name: relative distance (km^6), dtype: float64

And, indeed, the distances between the planets increase dramatically.

We can see roughly that the first big jump is in the distance between Mars and Jupiter. The distances between the outer planets then continue to be greater than those between the inner planets, and continue to increase. But there’s another big intermediate jump, between Saturn and Uranus.

We can express the above quantitatively, as well, with the pct_change method.

distance_rel_change = (
    planets_named
    .solar_distance_km_6
    .pct_change()
    .rename('fractional distance')
)

distance_rel_change

name
Mercury         NaN
Venus      0.868739
Earth      0.382625
Mars       0.523396
Jupiter    2.416411
Saturn     0.841125
Uranus     1.003837
Neptune    0.564874
Name: fractional distance, dtype: float64

The pct_change method computes the fractional change (despite its name!) between successive elements.

That is, pct_change maps each difference to a ratio of the absolute value before it. Ultimately, this compares the distance of each planet from the sun, \(distance_1\), to the distance of the planet preceding it, \(distance_0\), according to:

\[ \frac{distance_1 - distance_0}{distance_0} \]

We see that Venus is farther from the sun than Mercury – but its distance from Mercury (50.3 km) is a little less than Mercury’s distance from the sun (57.9 km) – only 86.8739% as great. Earth’s relative distance from Venus is even smaller.

Operators on Series#

Series objects also support Python’s built-in operators, such as * for multiplication and / for division.

This is similar to such operations with Python’s list, but note! The Series responds to such operations differently.

planets.solar_distance_km_6

    57.9
   108.2
   149.6
   227.9
   778.6
  1433.5
  2872.5
  4495.1
Name: solar_distance_km_6, dtype: float64

2 * planets.solar_distance_km_6

   115.8
   216.4
   299.2
   455.8
  1557.2
  2867.0
  5745.0
  8990.2
Name: solar_distance_km_6, dtype: float64

Above, pandas applied our multiplication operation to each value in the Series, and returned a new Series of the results.

The number of rows – or individuals – really, the number of planets referenced by the data, didn’t change. That’s good!

This is useful for data analysis. But it’s quite different from the list:

2 * planets.solar_distance_km_6.tolist()

[57.9,
2,
6,
9,
6,
5,
5,
1,
9,
2,
6,
9,
6,
5,
5,
1]

That is, the list applied the operation to itself, not to its individual elements – having the effect of simply listing the data twice.

Now we can represent our fractional distance changes as conventional percentages – by simply multiplying them by 100.

distance_rel_change * 100

name
Mercury           NaN
Venus       86.873921
Earth       38.262477
Mars        52.339572
Jupiter    241.641071
Saturn      84.112510
Uranus     100.383676
Neptune     56.487380
Name: fractional distance, dtype: float64

…And, named appropriately:

distance_pct_change = (distance_rel_change * 100).rename('percent distance')

distance_pct_change

name
Mercury           NaN
Venus       86.873921
Earth       38.262477
Mars        52.339572
Jupiter    241.641071
Saturn      84.112510
Uranus     100.383676
Neptune     56.487380
Name: percent distance, dtype: float64

Adding columns#

It might be more meaningful to see the above values in context, along with our other planetary data. Indeed, we might consider our computed Series a new, computed, feature of our data.

Assign method#

We can use the assign method to construct a new DataFrame, one which begins with the data from planets or planets_named, and which adds to this our new column.

planets_named.assign(distance_pct_change=distance_pct_change)

	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2	distance_pct_change
name
Mercury	57.9	0.330	5427.0	3.7	NaN
Venus	108.2	4.870	5243.0	8.9	86.873921
Earth	149.6	5.970	5514.0	9.8	38.262477
Mars	227.9	0.642	3933.0	3.7	52.339572
Jupiter	778.6	1898.000	1326.0	23.1	241.641071
Saturn	1433.5	568.000	687.0	9.0	84.112510
Uranus	2872.5	86.800	1271.0	8.7	100.383676
Neptune	4495.1	102.000	1638.0	11.0	56.487380

As you can see, the assign method accepts only named keyword arguments. Notably, these can be any syntactically-valid name – pandas adds the sequence of values you specify to the existing data as a new column, and assigns to the new column this name.

Above we assigned to this column the data contained in a Series. pandas will collect values from any sequence – also including list and array.

Note also that the column may already exist – in which case it is simply overwritten.

Assignment#

It’s far more common, however, to add columns by treating the DataFrame like a dictionary.

The syntax is identical to that when assigning items to the Python dict.

As with the assign method, any sequence of values may be read into the new (or overwritten) column.

planets_named1 = planets.set_index('name')

planets_named1['temperature_mean_c'] = [167, 464, 15, -65, -110, -140, -195, -200]

planets_named1

	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2	temperature_mean_c
name
Mercury	57.9	0.330	5427.0	3.7	167
Venus	108.2	4.870	5243.0	8.9	464
Earth	149.6	5.970	5514.0	9.8	15
Mars	227.9	0.642	3933.0	3.7	-65
Jupiter	778.6	1898.000	1326.0	23.1	-110
Saturn	1433.5	568.000	687.0	9.0	-140
Uranus	2872.5	86.800	1271.0	8.7	-195
Neptune	4495.1	102.000	1638.0	11.0	-200

The syntax of dictionary item assignment also has the advantage that the key – or column name – need not conform to Python’s syntax for names and keywords. Rather, any string is allowed.

planets_named1['how cold is it in degrees centigrade, anyway? ☃'] = [167, 464, 15, -65, -110, -140, -195, -200]

Attention

Assigning columns via this syntax also mirrors the operation with dict in that it modifies the DataFrame “in place” – the expression above did not evaluate to a new DataFrame, because no new DataFrame was constructed.

This is often the desired result.

However, it can lead to mistakes – as the name we’ve assigned to the DataFrame, such as planets or planets_named, then refers to an object with different values than it did before. For this reason, “in place” operations are supported, but not generally the default for pandas methods, such as assign.

Let’s take a look at our now-changed planets_named1.

planets_named1

	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2	temperature_mean_c	how cold is it in degrees centigrade, anyway? ☃
name
Mercury	57.9	0.330	5427.0	3.7	167	167
Venus	108.2	4.870	5243.0	8.9	464	464
Earth	149.6	5.970	5514.0	9.8	15	15
Mars	227.9	0.642	3933.0	3.7	-65	-65
Jupiter	778.6	1898.000	1326.0	23.1	-110	-110
Saturn	1433.5	568.000	687.0	9.0	-140	-140
Uranus	2872.5	86.800	1271.0	8.7	-195	-195
Neptune	4495.1	102.000	1638.0	11.0	-200	-200

We re-constructed our “named planets” DataFrame, calling it planets_named1, to avoid modifying planets_named.

(Of course, if this weren’t a simple example, we’d want to use a more descriptive name than that, to reflect our intent with the new data structure.)

Our expression to re-construct the DataFrame wasn’t complex; but, this can be cumbersome.

More generally, when we’d like to maintain multiple separate representations of our data, we can simply make a copy of our initial DataFrame prior to modifying it, using the copy method.

planets_for_holiday = planets_named.copy()

planets_for_holiday and planets_named are now separate objects – they just refer to the same data.

And now we can modify one without affecting the other.

planets_for_holiday['temperature_mean_c'] = [167, 464, 15, -65, -110, -140, -195, -200]

planets_for_holiday

	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2	temperature_mean_c
name
Mercury	57.9	0.330	5427.0	3.7	167
Venus	108.2	4.870	5243.0	8.9	464
Earth	149.6	5.970	5514.0	9.8	15
Mars	227.9	0.642	3933.0	3.7	-65
Jupiter	778.6	1898.000	1326.0	23.1	-110
Saturn	1433.5	568.000	687.0	9.0	-140
Uranus	2872.5	86.800	1271.0	8.7	-195
Neptune	4495.1	102.000	1638.0	11.0	-200

planets_named

	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
name
Mercury	57.9	0.330	5427.0	3.7
Venus	108.2	4.870	5243.0	8.9
Earth	149.6	5.970	5514.0	9.8
Mars	227.9	0.642	3933.0	3.7
Jupiter	778.6	1898.000	1326.0	23.1
Saturn	1433.5	568.000	687.0	9.0
Uranus	2872.5	86.800	1271.0	8.7
Neptune	4495.1	102.000	1638.0	11.0

Our column has been added to planets_for_holiday, but not to planets_named.

Styled presentation#

Arguably, of course, our multiplication by 100 was only an aesthetic change. We might instead preserve the output of our pct_change calculation, and merely adjust the presentation of our DataFrame.

pandas offers the DataFrame property style, whose format method accepts either functions or strings, with which it determines how to present its data. String arguments to this format method follow Python’s standard form for indicating how a value should be presented as text.

Python’s strings offer their own format method. And, for example, we might construct a string presenting the float value 0.868739 as a conventional percentage, as follows.

'{:.2%}'.format(0.868739)

'86.87%'

In a similar manner we can apply this formatting to our column, without altering the underlying values.

First, starting from our basic DataFrame – planets – let’s use what we’ve covered above to prepare just what we’d like to see.

planets_friendly = planets.set_index('name')

planets_friendly['distance_rel_change'] = distance_rel_change

planets_friendly

	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2	distance_rel_change
name
Mercury	57.9	0.330	5427.0	3.7	NaN
Venus	108.2	4.870	5243.0	8.9	0.868739
Earth	149.6	5.970	5514.0	9.8	0.382625
Mars	227.9	0.642	3933.0	3.7	0.523396
Jupiter	778.6	1898.000	1326.0	23.1	2.416411
Saturn	1433.5	568.000	687.0	9.0	0.841125
Uranus	2872.5	86.800	1271.0	8.7	1.003837
Neptune	4495.1	102.000	1638.0	11.0	0.564874

We can now format the presentation of our relative distance column in a “friendly” manner.

planets_friendly.style.format({
    'distance_rel_change': '{:.2%}',
})

	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2	distance_rel_change
name
Mercury	57.900000	0.330000	5427.000000	3.700000	nan%
Venus	108.200000	4.870000	5243.000000	8.900000	86.87%
Earth	149.600000	5.970000	5514.000000	9.800000	38.26%
Mars	227.900000	0.642000	3933.000000	3.700000	52.34%
Jupiter	778.600000	1898.000000	1326.000000	23.100000	241.64%
Saturn	1433.500000	568.000000	687.000000	9.000000	84.11%
Uranus	2872.500000	86.800000	1271.000000	8.700000	100.38%
Neptune	4495.100000	102.000000	1638.000000	11.000000	56.49%

Looks good!

The above presentation of our data looks like a DataFrame, but formatted as we wanted. But – to avoid confusion – it’s not actually a DataFrame. And so conventional DataFrame methods won’t apply.

Renaming columns#

We can also rename columns, in this case also for clarity, and again making use of a dict.

planets_friendly = planets_friendly.rename(
    columns={
        'distance_rel_change': 'distance relative change',
    }
)

planets_friendly.style.format({
    'distance relative change': '{:.2%}',
})

	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2	distance relative change
name
Mercury	57.900000	0.330000	5427.000000	3.700000	nan%
Venus	108.200000	4.870000	5243.000000	8.900000	86.87%
Earth	149.600000	5.970000	5514.000000	9.800000	38.26%
Mars	227.900000	0.642000	3933.000000	3.700000	52.34%
Jupiter	778.600000	1898.000000	1326.000000	23.100000	241.64%
Saturn	1433.500000	568.000000	687.000000	9.000000	84.11%
Uranus	2872.500000	86.800000	1271.000000	8.700000	100.38%
Neptune	4495.100000	102.000000	1638.000000	11.000000	56.49%

Because we’ve added spaces to our column name, it’s no longer valid in the syntax of Python, and so we can no longer refer to it as we have the other columns, such as solar_distance_km_6:

planets_friendly.solar_distance_km_6

name
Mercury      57.9
Venus       108.2
Earth       149.6
Mars        227.9
Jupiter     778.6
Saturn     1433.5
Uranus     2872.5
Neptune    4495.1
Name: solar_distance_km_6, dtype: float64

But, we can still refer to it using the alternate syntax mentioned above:

planets_friendly['distance relative change']

name
Mercury         NaN
Venus      0.868739
Earth      0.382625
Mars       0.523396
Jupiter    2.416411
Saturn     0.841125
Uranus     1.003837
Neptune    0.564874
Name: distance relative change, dtype: float64

The data underlying our computed feature remains as it was, despite the styled presentation and renaming the column.

Dropping columns#

Finally, we can remove columns from a DataFrame with the drop method.

Let’s pretend that we added that last column in error.

We can remove it – and any others we like – by specifying the column names in a list to this method. By default, drop constructs a new DataFrame with the specified columns removed.

planets_friendly.drop(columns=['distance relative change'])

	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
name
Mercury	57.9	0.330	5427.0	3.7
Venus	108.2	4.870	5243.0	8.9
Earth	149.6	5.970	5514.0	9.8
Mars	227.9	0.642	3933.0	3.7
Jupiter	778.6	1898.000	1326.0	23.1
Saturn	1433.5	568.000	687.0	9.0
Uranus	2872.5	86.800	1271.0	8.7
Neptune	4495.1	102.000	1638.0	11.0

Introduction to Data Science I & II

Column Operations

Contents