Accessing Rows#

The second dimension of our data consists of its rows or individuals.

The index#

As you’ll recall from Creating a DataFrame – and just as it did for our columns – pandas has constructed an index for our rows.

By default, the values of this index are the familiar 0, 1, 2, , and represented by the RangeIndex type.

planets.index
RangeIndex(start=0, stop=8, step=1)

But we can always set alternative indices.

planets_natural = planets.set_index(pd.RangeIndex(1, 9, name='number'))

planets_natural.index
RangeIndex(start=1, stop=9, step=1, name='number')
planets_named = planets.set_index('name')

planets_named.index
Index(['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus',
       'Neptune'],
      dtype='object', name='name')

Counting#

As with the list, we can use the built-in function len to see that there are eight planets.

len(planets)
8

Slicing#

We can also slice the DataFrame, for example to extract only its first three rows.

planets[:3]
name solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
0 Mercury 57.9 0.33 5427.0 3.7
1 Venus 108.2 4.87 5243.0 8.9
2 Earth 149.6 5.97 5514.0 9.8

Above, our slice has constructed a new DataFrame, consisting of only the data for the first three planets.

This works the same with our alternative indices – generic row slices step through the index regardless of the values within the index.

planets_natural[:3]
name solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
number
1 Mercury 57.9 0.33 5427.0 3.7
2 Venus 108.2 4.87 5243.0 8.9
3 Earth 149.6 5.97 5514.0 9.8
planets_named[:3]
solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
name
Mercury 57.9 0.33 5427.0 3.7
Venus 108.2 4.87 5243.0 8.9
Earth 149.6 5.97 5514.0 9.8

Of course, slices may be more sophisticated than the above.

planets_named[2:5]
solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
name
Earth 149.6 5.970 5514.0 9.8
Mars 227.9 0.642 3933.0 3.7
Jupiter 778.6 1898.000 1326.0 23.1
planets_named[::2]
solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
name
Mercury 57.9 0.33 5427.0 3.7
Earth 149.6 5.97 5514.0 9.8
Jupiter 778.6 1898.00 1326.0 23.1
Uranus 2872.5 86.80 1271.0 8.7

Head & Tail#

But as a shortcut to inspecting just the first few – or last few – rows of a DataFrame, there are the methods head and tail.

planets_named.head()
solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
name
Mercury 57.9 0.330 5427.0 3.7
Venus 108.2 4.870 5243.0 8.9
Earth 149.6 5.970 5514.0 9.8
Mars 227.9 0.642 3933.0 3.7
Jupiter 778.6 1898.000 1326.0 23.1
planets_named.tail()
solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
name
Mars 227.9 0.642 3933.0 3.7
Jupiter 778.6 1898.000 1326.0 23.1
Saturn 1433.5 568.000 687.0 9.0
Uranus 2872.5 86.800 1271.0 8.7
Neptune 4495.1 102.000 1638.0 11.0

By default, these methods display the first or last five rows. This number may also be specified.

planets_named.head(2)
solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
name
Mercury 57.9 0.33 5427.0 3.7
Venus 108.2 4.87 5243.0 8.9
planets_named.tail(2)
solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
name
Uranus 2872.5 86.8 1271.0 8.7
Neptune 4495.1 102.0 1638.0 11.0

As with slices, even negative integers are supported – to display all but the first or last n rows.

planets_named.head(-2)
solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
name
Mercury 57.9 0.330 5427.0 3.7
Venus 108.2 4.870 5243.0 8.9
Earth 149.6 5.970 5514.0 9.8
Mars 227.9 0.642 3933.0 3.7
Jupiter 778.6 1898.000 1326.0 23.1
Saturn 1433.5 568.000 687.0 9.0
planets_named.tail(-2)
solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
name
Earth 149.6 5.970 5514.0 9.8
Mars 227.9 0.642 3933.0 3.7
Jupiter 778.6 1898.000 1326.0 23.1
Saturn 1433.5 568.000 687.0 9.0
Uranus 2872.5 86.800 1271.0 8.7
Neptune 4495.1 102.000 1638.0 11.0

Retrieving#

Attention

Above, we treated our DataFrame like Python’s list to count and slice its rows.

However, you can’t access individual rows in the same manner as with a list.

The below expression raises an exception!

(Click below to view the scary traceback.)

planets[2]  # WRONG 😦
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/conda/lib/python3.11/site-packages/pandas/core/indexes/base.py:3790, in Index.get_loc(self, key)
   3789 try:
-> 3790     return self._engine.get_loc(casted_key)
   3791 except KeyError as err:

File index.pyx:152, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:181, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 2

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[17], line 1
----> 1 planets[2]  # WRONG 😦

File /opt/conda/lib/python3.11/site-packages/pandas/core/frame.py:3896, in DataFrame.__getitem__(self, key)
   3894 if self.columns.nlevels > 1:
   3895     return self._getitem_multilevel(key)
-> 3896 indexer = self.columns.get_loc(key)
   3897 if is_integer(indexer):
   3898     indexer = [indexer]

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexes/base.py:3797, in Index.get_loc(self, key)
   3792     if isinstance(casted_key, slice) or (
   3793         isinstance(casted_key, abc.Iterable)
   3794         and any(isinstance(x, slice) for x in casted_key)
   3795     ):
   3796         raise InvalidIndexError(key)
-> 3797     raise KeyError(key) from err
   3798 except TypeError:
   3799     # If we have a listlike key, _check_indexing_error will raise
   3800     #  InvalidIndexError. Otherwise we fall through and re-raise
   3801     #  the TypeError.
   3802     self._check_indexing_error(key)

KeyError: 2

After all, the DataFrame is a more complex structure than the list – the above reference to the index value 2 was treated as a reference to a column!

Instead, DataFrame offers the properties iloc and loc, which may themselves be queried with a syntax based on the syntax for retrieving elements from a list.

By position#

Individuals#

iloc is intended for integer-location based look-up of elements by their position in the index.

And, to start, we can now extract the third individual in our DataFrame, at the index offset or position 2, with iloc.

planets.iloc[2]
name                    Earth
solar_distance_km_6     149.6
mass_kg_24               5.97
density_kg_m3          5514.0
gravity_m_s2              9.8
Name: 2, dtype: object

Much better.

And, hey – it’s Earth: the third planet from the sun.

Since iloc deals in integer offsets – rather than values within the index, we can do the same with our alternative indices.

planets_natural.iloc[2]
name                    Earth
solar_distance_km_6     149.6
mass_kg_24               5.97
density_kg_m3          5514.0
gravity_m_s2              9.8
Name: 3, dtype: object
earth = planets_named.iloc[2]

earth
solar_distance_km_6     149.60
mass_kg_24                5.97
density_kg_m3          5514.00
gravity_m_s2              9.80
Name: Earth, dtype: float64

The above presentation of data might look familiar – because it is, again, a Series.

type(earth)
pandas.core.series.Series

And we can still do all the things we did before with the Series.

But careful! What does it mean to take the median … of Earth?

earth.median()
79.7

Answer: nothing!

pandas is happy to apply formulas to series of data. But we’ll have to be a bit more clever than that to come up with a meaningful statistic based on these diverse features.

We can meaningfully access features from the Series of our individual’s data.

earth.solar_distance_km_6
149.6

Above, we’ve extracted just the Earth’s distance from the Sun.

And we can see again how spread out the planets are.

The Earth is the third of eight planets, yet its distance from the sun is less than 12% their average.

earth.solar_distance_km_6 / planets.solar_distance_km_6.mean()
0.11822231880908397

Slices & selection#

We can also reproduce our slice, more explicitly now, using iloc.

planets.iloc[:3]
name solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
0 Mercury 57.9 0.33 5427.0 3.7
1 Venus 108.2 4.87 5243.0 8.9
2 Earth 149.6 5.97 5514.0 9.8

We can also do something new – construct a new DataFrame consisting of only the individuals at the specified offsets.

planets.iloc[[0, 7]]
name solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
0 Mercury 57.9 0.33 5427.0 3.7
7 Neptune 4495.1 102.00 1638.0 11.0

Above, we’ve specified to our iloc-based look-up a list[0, 7] – indicating that we are interested in selecting out the rows at those offsets.

Again, we can perform this operation on our alternative indices, regardless of the different values within them.

planets_natural.iloc[[0, 7]]
name solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
number
1 Mercury 57.9 0.33 5427.0 3.7
8 Neptune 4495.1 102.00 1638.0 11.0
planets_named.iloc[[0, 7]]
solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
name
Mercury 57.9 0.33 5427.0 3.7
Neptune 4495.1 102.00 1638.0 11.0

Note that in the new DataFrame, the planets’ row index values have been preserved. This is highly useful – indeed, Neptune is still the same planet as it was before. But iloc is strictly intended for offsets, like in a list.

If we were to repeat our selection of [0, 7] on the above, this would fail. Rather, the offset references for these two planets in the new DataFrame are now given by [0, 1]. According to iloc in this new DataFrame, Neptune will now be available at offset 1.

bookends = planets.iloc[[0, 7]]

bookends.iloc[1]
name                   Neptune
solar_distance_km_6     4495.1
mass_kg_24               102.0
density_kg_m3           1638.0
gravity_m_s2              11.0
Name: 7, dtype: object

Resetting the index#

If we wanted to correct the inconsistency produced above between our rows’ positions in the index and the values of our index – that is, to create a DataFrame reflecting a solar system consisting of only Mercury and Neptune – of course we could.

The reset_index method will recreate our DataFrame with an index regenerated for the elements it contains.

By default, reset_index preserves the data of our old index – just in case – as an extra column.

bookends.reset_index()
index name solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
0 0 Mercury 57.9 0.33 5427.0 3.7
1 7 Neptune 4495.1 102.00 1638.0 11.0

But we can tell pandas not to worry about that old index.

bookends.reset_index(drop=True)
name solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
0 Mercury 57.9 0.33 5427.0 3.7
1 Neptune 4495.1 102.00 1638.0 11.0

By label#

The loc property, in the other hand, allows us to look up rows (and more!) according to their index value or “label.”

We may repeat the above operations we performed with iloc, but the arguments we supply to loc must reflect the index of the DataFrame.

We can still find Earth at the index value 2 in our simple DataFrame, because its index’s values are integer positions.

planets.loc[2]
name                    Earth
solar_distance_km_6     149.6
mass_kg_24               5.97
density_kg_m3          5514.0
gravity_m_s2              9.8
Name: 2, dtype: object

In our “natural” numeric index, however, Earth is listed under the value 3.

planets_natural.loc[3]
name                    Earth
solar_distance_km_6     149.6
mass_kg_24               5.97
density_kg_m3          5514.0
gravity_m_s2              9.8
Name: 3, dtype: object

And in our name-based index, we can’t supply integer values to loc at all!

planets_named.loc[3]  # WRONG 😦
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/conda/lib/python3.11/site-packages/pandas/core/indexes/base.py:3790, in Index.get_loc(self, key)
   3789 try:
-> 3790     return self._engine.get_loc(casted_key)
   3791 except KeyError as err:

File index.pyx:152, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:181, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 3

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[34], line 1
----> 1 planets_named.loc[3]  # WRONG 😦

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexing.py:1153, in _LocationIndexer.__getitem__(self, key)
   1150 axis = self.axis or 0
   1152 maybe_callable = com.apply_if_callable(key, self.obj)
-> 1153 return self._getitem_axis(maybe_callable, axis=axis)

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexing.py:1393, in _LocIndexer._getitem_axis(self, key, axis)
   1391 # fall thru to straight lookup
   1392 self._validate_key(key, axis)
-> 1393 return self._get_label(key, axis=axis)

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexing.py:1343, in _LocIndexer._get_label(self, label, axis)
   1341 def _get_label(self, label, axis: AxisInt):
   1342     # GH#5567 this will fail if the label is not present in the axis.
-> 1343     return self.obj.xs(label, axis=axis)

File /opt/conda/lib/python3.11/site-packages/pandas/core/generic.py:4236, in NDFrame.xs(self, key, axis, level, drop_level)
   4234             new_index = index[loc]
   4235 else:
-> 4236     loc = index.get_loc(key)
   4238     if isinstance(loc, np.ndarray):
   4239         if loc.dtype == np.bool_:

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexes/base.py:3797, in Index.get_loc(self, key)
   3792     if isinstance(casted_key, slice) or (
   3793         isinstance(casted_key, abc.Iterable)
   3794         and any(isinstance(x, slice) for x in casted_key)
   3795     ):
   3796         raise InvalidIndexError(key)
-> 3797     raise KeyError(key) from err
   3798 except TypeError:
   3799     # If we have a listlike key, _check_indexing_error will raise
   3800     #  InvalidIndexError. Otherwise we fall through and re-raise
   3801     #  the TypeError.
   3802     self._check_indexing_error(key)

KeyError: 3

Rather, we may supply loc the string value under which the individual is stored in that index.

planets_named.loc['Earth']
solar_distance_km_6     149.60
mass_kg_24                5.97
density_kg_m3          5514.00
gravity_m_s2              9.80
Name: Earth, dtype: float64

This even goes for slices as well!

planets_named.loc['Venus':'Mars']
solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2
name
Venus 108.2 4.870 5243.0 8.9
Earth 149.6 5.970 5514.0 9.8
Mars 227.9 0.642 3933.0 3.7

Attention

And, note: loc handled the above slice differently than iloc!

The upper bound – as well as the lower bound – of the range were included in the result.

In the following sections, we’ll continue to explore slicing and selecting elements by index position and value, via loc and iloc.