Accessing Rows
Contents
Accessing Rows#
The second dimension of our data consists of its rows or individuals.
The index#
As you’ll recall from Creating a DataFrame – and just as it did for our columns – pandas has constructed an index for our rows.
By default, the values of this index are the familiar 0, 1, 2, …
, and represented by the RangeIndex
type.
planets.index
RangeIndex(start=0, stop=8, step=1)
But we can always set alternative indices.
planets_natural = planets.set_index(pd.RangeIndex(1, 9, name='number'))
planets_natural.index
RangeIndex(start=1, stop=9, step=1, name='number')
planets_named = planets.set_index('name')
planets_named.index
Index(['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus',
'Neptune'],
dtype='object', name='name')
Counting#
As with the list
, we can use the built-in function len
to see that there are eight planets.
len(planets)
8
Slicing#
We can also slice the DataFrame
, for example to extract only its first three rows.
planets[:3]
name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|---|
0 | Mercury | 57.9 | 0.33 | 5427.0 | 3.7 |
1 | Venus | 108.2 | 4.87 | 5243.0 | 8.9 |
2 | Earth | 149.6 | 5.97 | 5514.0 | 9.8 |
Above, our slice has constructed a new DataFrame
, consisting of only the data for the first three planets.
This works the same with our alternative indices – generic row slices step through the index regardless of the values within the index.
planets_natural[:3]
name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|---|
number | |||||
1 | Mercury | 57.9 | 0.33 | 5427.0 | 3.7 |
2 | Venus | 108.2 | 4.87 | 5243.0 | 8.9 |
3 | Earth | 149.6 | 5.97 | 5514.0 | 9.8 |
planets_named[:3]
solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|
name | ||||
Mercury | 57.9 | 0.33 | 5427.0 | 3.7 |
Venus | 108.2 | 4.87 | 5243.0 | 8.9 |
Earth | 149.6 | 5.97 | 5514.0 | 9.8 |
Of course, slices may be more sophisticated than the above.
planets_named[2:5]
solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|
name | ||||
Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
planets_named[::2]
solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|
name | ||||
Mercury | 57.9 | 0.33 | 5427.0 | 3.7 |
Earth | 149.6 | 5.97 | 5514.0 | 9.8 |
Jupiter | 778.6 | 1898.00 | 1326.0 | 23.1 |
Uranus | 2872.5 | 86.80 | 1271.0 | 8.7 |
Head & Tail#
But as a shortcut to inspecting just the first few – or last few – rows of a DataFrame
, there are the methods head
and tail
.
planets_named.head()
solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|
name | ||||
Mercury | 57.9 | 0.330 | 5427.0 | 3.7 |
Venus | 108.2 | 4.870 | 5243.0 | 8.9 |
Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
planets_named.tail()
solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|
name | ||||
Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 |
Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 |
By default, these methods display the first or last five rows. This number may also be specified.
planets_named.head(2)
solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|
name | ||||
Mercury | 57.9 | 0.33 | 5427.0 | 3.7 |
Venus | 108.2 | 4.87 | 5243.0 | 8.9 |
planets_named.tail(2)
solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|
name | ||||
Uranus | 2872.5 | 86.8 | 1271.0 | 8.7 |
Neptune | 4495.1 | 102.0 | 1638.0 | 11.0 |
As with slices, even negative integers are supported – to display all but the first or last n
rows.
planets_named.head(-2)
solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|
name | ||||
Mercury | 57.9 | 0.330 | 5427.0 | 3.7 |
Venus | 108.2 | 4.870 | 5243.0 | 8.9 |
Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
planets_named.tail(-2)
solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|
name | ||||
Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
Jupiter | 778.6 | 1898.000 | 1326.0 | 23.1 |
Saturn | 1433.5 | 568.000 | 687.0 | 9.0 |
Uranus | 2872.5 | 86.800 | 1271.0 | 8.7 |
Neptune | 4495.1 | 102.000 | 1638.0 | 11.0 |
Retrieving#
Attention
Above, we treated our DataFrame
like Python’s list
to count and slice its rows.
However, you can’t access individual rows in the same manner as with a list
.
The below expression raises an exception!
(Click below to view the scary traceback.)
planets[2] # WRONG 😦
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File /opt/conda/lib/python3.11/site-packages/pandas/core/indexes/base.py:3790, in Index.get_loc(self, key)
3789 try:
-> 3790 return self._engine.get_loc(casted_key)
3791 except KeyError as err:
File index.pyx:152, in pandas._libs.index.IndexEngine.get_loc()
File index.pyx:181, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 2
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[17], line 1
----> 1 planets[2] # WRONG 😦
File /opt/conda/lib/python3.11/site-packages/pandas/core/frame.py:3896, in DataFrame.__getitem__(self, key)
3894 if self.columns.nlevels > 1:
3895 return self._getitem_multilevel(key)
-> 3896 indexer = self.columns.get_loc(key)
3897 if is_integer(indexer):
3898 indexer = [indexer]
File /opt/conda/lib/python3.11/site-packages/pandas/core/indexes/base.py:3797, in Index.get_loc(self, key)
3792 if isinstance(casted_key, slice) or (
3793 isinstance(casted_key, abc.Iterable)
3794 and any(isinstance(x, slice) for x in casted_key)
3795 ):
3796 raise InvalidIndexError(key)
-> 3797 raise KeyError(key) from err
3798 except TypeError:
3799 # If we have a listlike key, _check_indexing_error will raise
3800 # InvalidIndexError. Otherwise we fall through and re-raise
3801 # the TypeError.
3802 self._check_indexing_error(key)
KeyError: 2
After all, the DataFrame
is a more complex structure than the list
– the above reference to the index value 2
was treated as a reference to a column!
Instead, DataFrame
offers the properties iloc
and loc
, which may themselves be queried with a syntax based on the syntax for retrieving elements from a list
.
By position#
Individuals#
iloc
is intended for integer-location based look-up of elements by their position in the index.
And, to start, we can now extract the third individual in our DataFrame
, at the index offset or position 2
, with iloc
.
planets.iloc[2]
name Earth
solar_distance_km_6 149.6
mass_kg_24 5.97
density_kg_m3 5514.0
gravity_m_s2 9.8
Name: 2, dtype: object
Much better.
And, hey – it’s Earth: the third planet from the sun.
Since iloc
deals in integer offsets – rather than values within the index, we can do the same with our alternative indices.
planets_natural.iloc[2]
name Earth
solar_distance_km_6 149.6
mass_kg_24 5.97
density_kg_m3 5514.0
gravity_m_s2 9.8
Name: 3, dtype: object
earth = planets_named.iloc[2]
earth
solar_distance_km_6 149.60
mass_kg_24 5.97
density_kg_m3 5514.00
gravity_m_s2 9.80
Name: Earth, dtype: float64
The above presentation of data might look familiar – because it is, again, a Series
.
type(earth)
pandas.core.series.Series
And we can still do all the things we did before with the Series
.
But careful! What does it mean to take the median … of Earth?
earth.median()
79.7
Answer: nothing!
pandas is happy to apply formulas to series of data. But we’ll have to be a bit more clever than that to come up with a meaningful statistic based on these diverse features.
We can meaningfully access features from the Series
of our individual’s data.
earth.solar_distance_km_6
149.6
Above, we’ve extracted just the Earth’s distance from the Sun.
And we can see again how spread out the planets are.
The Earth is the third of eight planets, yet its distance from the sun is less than 12% their average.
earth.solar_distance_km_6 / planets.solar_distance_km_6.mean()
0.11822231880908397
Slices & selection#
We can also reproduce our slice, more explicitly now, using iloc
.
planets.iloc[:3]
name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|---|
0 | Mercury | 57.9 | 0.33 | 5427.0 | 3.7 |
1 | Venus | 108.2 | 4.87 | 5243.0 | 8.9 |
2 | Earth | 149.6 | 5.97 | 5514.0 | 9.8 |
We can also do something new – construct a new DataFrame
consisting of only the individuals at the specified offsets.
planets.iloc[[0, 7]]
name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|---|
0 | Mercury | 57.9 | 0.33 | 5427.0 | 3.7 |
7 | Neptune | 4495.1 | 102.00 | 1638.0 | 11.0 |
Above, we’ve specified to our iloc
-based look-up a list
– [0, 7]
– indicating that we are interested in selecting out the rows at those offsets.
Again, we can perform this operation on our alternative indices, regardless of the different values within them.
planets_natural.iloc[[0, 7]]
name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|---|
number | |||||
1 | Mercury | 57.9 | 0.33 | 5427.0 | 3.7 |
8 | Neptune | 4495.1 | 102.00 | 1638.0 | 11.0 |
planets_named.iloc[[0, 7]]
solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|
name | ||||
Mercury | 57.9 | 0.33 | 5427.0 | 3.7 |
Neptune | 4495.1 | 102.00 | 1638.0 | 11.0 |
Note that in the new DataFrame
, the planets’ row index values have been preserved. This is highly useful – indeed, Neptune is still the same planet as it was before. But iloc
is strictly intended for offsets, like in a list
.
If we were to repeat our selection of [0, 7]
on the above, this would fail. Rather, the offset references for these two planets in the new DataFrame
are now given by [0, 1]
. According to iloc
in this new DataFrame
, Neptune will now be available at offset 1
.
bookends = planets.iloc[[0, 7]]
bookends.iloc[1]
name Neptune
solar_distance_km_6 4495.1
mass_kg_24 102.0
density_kg_m3 1638.0
gravity_m_s2 11.0
Name: 7, dtype: object
Resetting the index#
If we wanted to correct the inconsistency produced above between our rows’ positions in the index and the values of our index – that is, to create a DataFrame
reflecting a solar system consisting of only Mercury and Neptune – of course we could.
The reset_index
method will recreate our DataFrame
with an index regenerated for the elements it contains.
By default, reset_index
preserves the data of our old index – just in case – as an extra column.
bookends.reset_index()
index | name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|---|---|
0 | 0 | Mercury | 57.9 | 0.33 | 5427.0 | 3.7 |
1 | 7 | Neptune | 4495.1 | 102.00 | 1638.0 | 11.0 |
But we can tell pandas not to worry about that old index.
bookends.reset_index(drop=True)
name | solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|---|
0 | Mercury | 57.9 | 0.33 | 5427.0 | 3.7 |
1 | Neptune | 4495.1 | 102.00 | 1638.0 | 11.0 |
By label#
The loc
property, in the other hand, allows us to look up rows (and more!) according to their index value or “label.”
We may repeat the above operations we performed with iloc
, but the arguments we supply to loc
must reflect the index of the DataFrame
.
We can still find Earth at the index value 2
in our simple DataFrame
, because its index’s values are integer positions.
planets.loc[2]
name Earth
solar_distance_km_6 149.6
mass_kg_24 5.97
density_kg_m3 5514.0
gravity_m_s2 9.8
Name: 2, dtype: object
In our “natural” numeric index, however, Earth is listed under the value 3
.
planets_natural.loc[3]
name Earth
solar_distance_km_6 149.6
mass_kg_24 5.97
density_kg_m3 5514.0
gravity_m_s2 9.8
Name: 3, dtype: object
And in our name-based index, we can’t supply integer values to loc
at all!
planets_named.loc[3] # WRONG 😦
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File /opt/conda/lib/python3.11/site-packages/pandas/core/indexes/base.py:3790, in Index.get_loc(self, key)
3789 try:
-> 3790 return self._engine.get_loc(casted_key)
3791 except KeyError as err:
File index.pyx:152, in pandas._libs.index.IndexEngine.get_loc()
File index.pyx:181, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 3
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[34], line 1
----> 1 planets_named.loc[3] # WRONG 😦
File /opt/conda/lib/python3.11/site-packages/pandas/core/indexing.py:1153, in _LocationIndexer.__getitem__(self, key)
1150 axis = self.axis or 0
1152 maybe_callable = com.apply_if_callable(key, self.obj)
-> 1153 return self._getitem_axis(maybe_callable, axis=axis)
File /opt/conda/lib/python3.11/site-packages/pandas/core/indexing.py:1393, in _LocIndexer._getitem_axis(self, key, axis)
1391 # fall thru to straight lookup
1392 self._validate_key(key, axis)
-> 1393 return self._get_label(key, axis=axis)
File /opt/conda/lib/python3.11/site-packages/pandas/core/indexing.py:1343, in _LocIndexer._get_label(self, label, axis)
1341 def _get_label(self, label, axis: AxisInt):
1342 # GH#5567 this will fail if the label is not present in the axis.
-> 1343 return self.obj.xs(label, axis=axis)
File /opt/conda/lib/python3.11/site-packages/pandas/core/generic.py:4236, in NDFrame.xs(self, key, axis, level, drop_level)
4234 new_index = index[loc]
4235 else:
-> 4236 loc = index.get_loc(key)
4238 if isinstance(loc, np.ndarray):
4239 if loc.dtype == np.bool_:
File /opt/conda/lib/python3.11/site-packages/pandas/core/indexes/base.py:3797, in Index.get_loc(self, key)
3792 if isinstance(casted_key, slice) or (
3793 isinstance(casted_key, abc.Iterable)
3794 and any(isinstance(x, slice) for x in casted_key)
3795 ):
3796 raise InvalidIndexError(key)
-> 3797 raise KeyError(key) from err
3798 except TypeError:
3799 # If we have a listlike key, _check_indexing_error will raise
3800 # InvalidIndexError. Otherwise we fall through and re-raise
3801 # the TypeError.
3802 self._check_indexing_error(key)
KeyError: 3
Rather, we may supply loc
the string value under which the individual is stored in that index.
planets_named.loc['Earth']
solar_distance_km_6 149.60
mass_kg_24 5.97
density_kg_m3 5514.00
gravity_m_s2 9.80
Name: Earth, dtype: float64
This even goes for slices as well!
planets_named.loc['Venus':'Mars']
solar_distance_km_6 | mass_kg_24 | density_kg_m3 | gravity_m_s2 | |
---|---|---|---|---|
name | ||||
Venus | 108.2 | 4.870 | 5243.0 | 8.9 |
Earth | 149.6 | 5.970 | 5514.0 | 9.8 |
Mars | 227.9 | 0.642 | 3933.0 | 3.7 |
Attention
And, note: loc
handled the above slice differently than iloc
!
The upper bound – as well as the lower bound – of the range were included in the result.
In the following sections, we’ll continue to explore slicing and selecting elements by index position and value, via loc
and iloc
.