Accessing Rows#

The second dimension of our data consists of its rows or individuals.

The index#

As you’ll recall from Creating a DataFrame – and just as it did for our columns – pandas has constructed an index for our rows.

By default, the values of this index are the familiar 0, 1, 2, …, and represented by the RangeIndex type.

planets.index

RangeIndex(start=0, stop=8, step=1)

But we can always set alternative indices.

planets_natural = planets.set_index(pd.RangeIndex(1, 9, name='number'))

planets_natural.index

RangeIndex(start=1, stop=9, step=1, name='number')

planets_named = planets.set_index('name')

planets_named.index

Index(['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus',
       'Neptune'],
      dtype='object', name='name')

Counting#

As with the list, we can use the built-in function len to see that there are eight planets.

len(planets)

Slicing#

We can also slice the DataFrame, for example to extract only its first three rows.

planets[:3]

	name	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
0	Mercury	57.9	0.33	5427.0	3.7
1	Venus	108.2	4.87	5243.0	8.9
2	Earth	149.6	5.97	5514.0	9.8

Above, our slice has constructed a new DataFrame, consisting of only the data for the first three planets.

This works the same with our alternative indices – generic row slices step through the index regardless of the values within the index.

planets_natural[:3]

	name	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
number
1	Mercury	57.9	0.33	5427.0	3.7
2	Venus	108.2	4.87	5243.0	8.9
3	Earth	149.6	5.97	5514.0	9.8

planets_named[:3]

	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
name
Mercury	57.9	0.33	5427.0	3.7
Venus	108.2	4.87	5243.0	8.9
Earth	149.6	5.97	5514.0	9.8

Of course, slices may be more sophisticated than the above.

planets_named[2:5]

	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
name
Earth	149.6	5.970	5514.0	9.8
Mars	227.9	0.642	3933.0	3.7
Jupiter	778.6	1898.000	1326.0	23.1

planets_named[::2]

	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
name
Mercury	57.9	0.33	5427.0	3.7
Earth	149.6	5.97	5514.0	9.8
Jupiter	778.6	1898.00	1326.0	23.1
Uranus	2872.5	86.80	1271.0	8.7

Head & Tail#

But as a shortcut to inspecting just the first few – or last few – rows of a DataFrame, there are the methods head and tail.

planets_named.head()

	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
name
Mercury	57.9	0.330	5427.0	3.7
Venus	108.2	4.870	5243.0	8.9
Earth	149.6	5.970	5514.0	9.8
Mars	227.9	0.642	3933.0	3.7
Jupiter	778.6	1898.000	1326.0	23.1

planets_named.tail()

	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
name
Mars	227.9	0.642	3933.0	3.7
Jupiter	778.6	1898.000	1326.0	23.1
Saturn	1433.5	568.000	687.0	9.0
Uranus	2872.5	86.800	1271.0	8.7
Neptune	4495.1	102.000	1638.0	11.0

By default, these methods display the first or last five rows. This number may also be specified.

planets_named.head(2)

	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
name
Mercury	57.9	0.33	5427.0	3.7
Venus	108.2	4.87	5243.0	8.9

planets_named.tail(2)

	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
name
Uranus	2872.5	86.8	1271.0	8.7
Neptune	4495.1	102.0	1638.0	11.0

As with slices, even negative integers are supported – to display all but the first or last n rows.

planets_named.head(-2)

	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
name
Mercury	57.9	0.330	5427.0	3.7
Venus	108.2	4.870	5243.0	8.9
Earth	149.6	5.970	5514.0	9.8
Mars	227.9	0.642	3933.0	3.7
Jupiter	778.6	1898.000	1326.0	23.1
Saturn	1433.5	568.000	687.0	9.0

planets_named.tail(-2)

	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
name
Earth	149.6	5.970	5514.0	9.8
Mars	227.9	0.642	3933.0	3.7
Jupiter	778.6	1898.000	1326.0	23.1
Saturn	1433.5	568.000	687.0	9.0
Uranus	2872.5	86.800	1271.0	8.7
Neptune	4495.1	102.000	1638.0	11.0

Retrieving#

Attention

Above, we treated our DataFrame like Python’s list to count and slice its rows.

However, you can’t access individual rows in the same manner as with a list.

The below expression raises an exception!

(Click below to view the scary traceback.)

planets[2]  # WRONG 😦

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/conda/lib/python3.11/site-packages/pandas/core/indexes/base.py:3790, in Index.get_loc(self, key)
   3789 try:
-> 3790     return self._engine.get_loc(casted_key)
   3791 except KeyError as err:

File index.pyx:152, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:181, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 2

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[17], line 1
----> 1 planets[2]  # WRONG 😦

File /opt/conda/lib/python3.11/site-packages/pandas/core/frame.py:3896, in DataFrame.__getitem__(self, key)
   3894 if self.columns.nlevels > 1:
   3895     return self._getitem_multilevel(key)
-> 3896 indexer = self.columns.get_loc(key)
   3897 if is_integer(indexer):
   3898     indexer = [indexer]

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexes/base.py:3797, in Index.get_loc(self, key)
   3792     if isinstance(casted_key, slice) or (
   3793         isinstance(casted_key, abc.Iterable)
   3794         and any(isinstance(x, slice) for x in casted_key)
   3795     ):
   3796         raise InvalidIndexError(key)
-> 3797     raise KeyError(key) from err
   3798 except TypeError:
   3799     # If we have a listlike key, _check_indexing_error will raise
   3800     #  InvalidIndexError. Otherwise we fall through and re-raise
   3801     #  the TypeError.
   3802     self._check_indexing_error(key)

KeyError: 2

After all, the DataFrame is a more complex structure than the list – the above reference to the index value 2 was treated as a reference to a column!

Instead, DataFrame offers the properties iloc and loc, which may themselves be queried with a syntax based on the syntax for retrieving elements from a list.

By position#

Individuals#

iloc is intended for integer-location based look-up of elements by their position in the index.

And, to start, we can now extract the third individual in our DataFrame, at the index offset or position 2, with iloc.

planets.iloc[2]

name                    Earth
solar_distance_km_6     149.6
mass_kg_24               5.97
density_kg_m3          5514.0
gravity_m_s2              9.8
Name: 2, dtype: object

Much better.

And, hey – it’s Earth: the third planet from the sun.

Since iloc deals in integer offsets – rather than values within the index, we can do the same with our alternative indices.

planets_natural.iloc[2]

name                    Earth
solar_distance_km_6     149.6
mass_kg_24               5.97
density_kg_m3          5514.0
gravity_m_s2              9.8
Name: 3, dtype: object

earth = planets_named.iloc[2]

earth

solar_distance_km_6     149.60
mass_kg_24                5.97
density_kg_m3          5514.00
gravity_m_s2              9.80
Name: Earth, dtype: float64

The above presentation of data might look familiar – because it is, again, a Series.

type(earth)

pandas.core.series.Series

And we can still do all the things we did before with the Series.

But careful! What does it mean to take the median … of Earth?

earth.median()

79.7

Answer: nothing!

pandas is happy to apply formulas to series of data. But we’ll have to be a bit more clever than that to come up with a meaningful statistic based on these diverse features.

We can meaningfully access features from the Series of our individual’s data.

earth.solar_distance_km_6

149.6

Above, we’ve extracted just the Earth’s distance from the Sun.

And we can see again how spread out the planets are.

The Earth is the third of eight planets, yet its distance from the sun is less than 12% their average.

earth.solar_distance_km_6 / planets.solar_distance_km_6.mean()

0.11822231880908397

Slices & selection#

We can also reproduce our slice, more explicitly now, using iloc.

planets.iloc[:3]

	name	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
0	Mercury	57.9	0.33	5427.0	3.7
1	Venus	108.2	4.87	5243.0	8.9
2	Earth	149.6	5.97	5514.0	9.8

We can also do something new – construct a new DataFrame consisting of only the individuals at the specified offsets.

planets.iloc[[0, 7]]

	name	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
0	Mercury	57.9	0.33	5427.0	3.7
7	Neptune	4495.1	102.00	1638.0	11.0

Above, we’ve specified to our iloc-based look-up a list – [0, 7] – indicating that we are interested in selecting out the rows at those offsets.

Again, we can perform this operation on our alternative indices, regardless of the different values within them.

planets_natural.iloc[[0, 7]]

	name	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
number
1	Mercury	57.9	0.33	5427.0	3.7
8	Neptune	4495.1	102.00	1638.0	11.0

planets_named.iloc[[0, 7]]

	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
name
Mercury	57.9	0.33	5427.0	3.7
Neptune	4495.1	102.00	1638.0	11.0

Note that in the new DataFrame, the planets’ row index values have been preserved. This is highly useful – indeed, Neptune is still the same planet as it was before. But iloc is strictly intended for offsets, like in a list.

If we were to repeat our selection of [0, 7] on the above, this would fail. Rather, the offset references for these two planets in the new DataFrame are now given by [0, 1]. According to iloc in this new DataFrame, Neptune will now be available at offset 1.

bookends = planets.iloc[[0, 7]]

bookends.iloc[1]

name                   Neptune
solar_distance_km_6     4495.1
mass_kg_24               102.0
density_kg_m3           1638.0
gravity_m_s2              11.0
Name: 7, dtype: object

Resetting the index#

If we wanted to correct the inconsistency produced above between our rows’ positions in the index and the values of our index – that is, to create a DataFrame reflecting a solar system consisting of only Mercury and Neptune – of course we could.

The reset_index method will recreate our DataFrame with an index regenerated for the elements it contains.

By default, reset_index preserves the data of our old index – just in case – as an extra column.

bookends.reset_index()

	index	name	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
0	0	Mercury	57.9	0.33	5427.0	3.7
1	7	Neptune	4495.1	102.00	1638.0	11.0

But we can tell pandas not to worry about that old index.

bookends.reset_index(drop=True)

	name	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
0	Mercury	57.9	0.33	5427.0	3.7
1	Neptune	4495.1	102.00	1638.0	11.0

By label#

The loc property, in the other hand, allows us to look up rows (and more!) according to their index value or “label.”

We may repeat the above operations we performed with iloc, but the arguments we supply to loc must reflect the index of the DataFrame.

We can still find Earth at the index value 2 in our simple DataFrame, because its index’s values are integer positions.

planets.loc[2]

name                    Earth
solar_distance_km_6     149.6
mass_kg_24               5.97
density_kg_m3          5514.0
gravity_m_s2              9.8
Name: 2, dtype: object

In our “natural” numeric index, however, Earth is listed under the value 3.

planets_natural.loc[3]

name                    Earth
solar_distance_km_6     149.6
mass_kg_24               5.97
density_kg_m3          5514.0
gravity_m_s2              9.8
Name: 3, dtype: object

And in our name-based index, we can’t supply integer values to loc at all!

planets_named.loc[3]  # WRONG 😦

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/conda/lib/python3.11/site-packages/pandas/core/indexes/base.py:3790, in Index.get_loc(self, key)
   3789 try:
-> 3790     return self._engine.get_loc(casted_key)
   3791 except KeyError as err:

File index.pyx:152, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:181, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 3

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[34], line 1
----> 1 planets_named.loc[3]  # WRONG 😦

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexing.py:1153, in _LocationIndexer.__getitem__(self, key)
   1150 axis = self.axis or 0
   1152 maybe_callable = com.apply_if_callable(key, self.obj)
-> 1153 return self._getitem_axis(maybe_callable, axis=axis)

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexing.py:1393, in _LocIndexer._getitem_axis(self, key, axis)
   1391 # fall thru to straight lookup
   1392 self._validate_key(key, axis)
-> 1393 return self._get_label(key, axis=axis)

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexing.py:1343, in _LocIndexer._get_label(self, label, axis)
   1341 def _get_label(self, label, axis: AxisInt):
   1342     # GH#5567 this will fail if the label is not present in the axis.
-> 1343     return self.obj.xs(label, axis=axis)

File /opt/conda/lib/python3.11/site-packages/pandas/core/generic.py:4236, in NDFrame.xs(self, key, axis, level, drop_level)
   4234             new_index = index[loc]
   4235 else:
-> 4236     loc = index.get_loc(key)
   4238     if isinstance(loc, np.ndarray):
   4239         if loc.dtype == np.bool_:

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexes/base.py:3797, in Index.get_loc(self, key)
   3792     if isinstance(casted_key, slice) or (
   3793         isinstance(casted_key, abc.Iterable)
   3794         and any(isinstance(x, slice) for x in casted_key)
   3795     ):
   3796         raise InvalidIndexError(key)
-> 3797     raise KeyError(key) from err
   3798 except TypeError:
   3799     # If we have a listlike key, _check_indexing_error will raise
   3800     #  InvalidIndexError. Otherwise we fall through and re-raise
   3801     #  the TypeError.
   3802     self._check_indexing_error(key)

KeyError: 3

Rather, we may supply loc the string value under which the individual is stored in that index.

planets_named.loc['Earth']

solar_distance_km_6     149.60
mass_kg_24                5.97
density_kg_m3          5514.00
gravity_m_s2              9.80
Name: Earth, dtype: float64

This even goes for slices as well!

planets_named.loc['Venus':'Mars']

	solar_distance_km_6	mass_kg_24	density_kg_m3	gravity_m_s2
name
Venus	108.2	4.870	5243.0	8.9
Earth	149.6	5.970	5514.0	9.8
Mars	227.9	0.642	3933.0	3.7

Attention

And, note: loc handled the above slice differently than iloc!

The upper bound – as well as the lower bound – of the range were included in the result.

In the following sections, we’ll continue to explore slicing and selecting elements by index position and value, via loc and iloc.

Introduction to Data Science I & II

Accessing Rows

Contents

Accessing Rows#

The index#

Counting#

Slicing#

Head & Tail#

Retrieving#

By position#

Individuals#

Slices & selection#

Resetting the index#

By label#