{ "cells": [ { "cell_type": "markdown", "id": "0259174d-803e-4a49-b1c3-3afb66b8342b", "metadata": {}, "source": [ "# Creating a DataFrame" ] }, { "cell_type": "markdown", "id": "ef08f74f-d6f0-4359-817c-ba1f032c3d5c", "metadata": {}, "source": [ "## The library\n", "\n", "Like the NumPy `array` – and _unlike_ the Python `dict` and `list` – the pandas `DataFrame` is _not_ built into Python.\n", "\n", "And so, first, we might have to ensure that the pandas library is installed.\n", "\n", "Only then can we tell Python to make the `pandas` module available to our code, using the `import` statement. For example:\n", "\n", " import pandas\n", "\n", "Having done so, the `DataFrame` type would be available as: `pandas.DataFrame`.\n", "\n", "That is, unlike with the built-in `list`, we would refer to it as \"under\" the name `pandas`, with a dot between the two names.\n", "\n", "Or, we could import just `DataFrame`, such that it's available as just `DataFrame`, without the rigmarole:\n", "\n", " from pandas import DataFrame\n", "\n", "However, we'll be using `pandas` a lot! And not *just* `DataFrame`. Following a common convention, we'll tell Python to assign the library module the name `pd`. This way, we'll be able to refer to the elements of the pandas interface, such as `DataFrame`, as _e.g._: `pd.DataFrame`." ] }, { "cell_type": "code", "execution_count": 1, "id": "5efcf03f-2bf4-461f-9d8e-01b553fabdc0", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "id": "cde2c831-fbf3-4488-aad1-1c71d161c705", "metadata": {}, "source": [ "## The data" ] }, { "cell_type": "markdown", "id": "b21c0622-a4be-450c-8874-d0fe3fbfa1d9", "metadata": {}, "source": [ "We began to consider tabular or two-dimensional data in [Lists](../../04/1/Lists.html#other-lists), with the distances of planets from our sun. Let's expand on this example with the below data, adding the planets' masses, densities and gravities." 
] }, { "cell_type": "code", "execution_count": 2, "id": "e5f7d27f-af8d-44ea-a2bf-7c646a8811c1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['Mercury', 57.9, 0.33, 5427.0, 3.7],\n", " ['Venus', 108.2, 4.87, 5243.0, 8.9],\n", " ['Earth', 149.6, 5.97, 5514.0, 9.8],\n", " ['Mars', 227.9, 0.642, 3933.0, 3.7],\n", " ['Jupiter', 778.6, 1898.0, 1326.0, 23.1],\n", " ['Saturn', 1433.5, 568.0, 687.0, 9.0],\n", " ['Uranus', 2872.5, 86.8, 1271.0, 8.7],\n", " ['Neptune', 4495.1, 102.0, 1638.0, 11.0]]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planets_features = [\n", " 'name', # familiar name\n", " 'solar_distance_km_6', # distance from sun: 10**6 km\n", " 'mass_kg_24', # absolute mass: 10**24 kg\n", " 'density_kg_m3', # density: kg/m**3\n", " 'gravity_m_s2', # gravity: m/s**2\n", "]\n", "\n", "planets_data = [\n", " ['Mercury', 57.9, 0.33, 5427.0, 3.7],\n", " ['Venus', 108.2, 4.87, 5243.0, 8.9],\n", " ['Earth', 149.6, 5.97, 5514.0, 9.8],\n", " ['Mars', 227.9, 0.642, 3933.0, 3.7],\n", " ['Jupiter', 778.6, 1898.0, 1326.0, 23.1],\n", " ['Saturn', 1433.5, 568.0, 687.0, 9.0],\n", " ['Uranus', 2872.5, 86.8, 1271.0, 8.7],\n", " ['Neptune', 4495.1, 102.0, 1638.0, 11.0]\n", "]\n", "\n", "planets_data" ] }, { "cell_type": "markdown", "id": "c565f02c-beb8-419b-88ce-15dd2b6dca80", "metadata": {}, "source": [ "Now let's *construct* a `DataFrame` for these data." ] }, { "cell_type": "code", "execution_count": 3, "id": "01bbbeaf-8766-43ef-869e-3ea3846cc9e4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01234
0Mercury57.90.3305427.03.7
1Venus108.24.8705243.08.9
2Earth149.65.9705514.09.8
3Mars227.90.6423933.03.7
4Jupiter778.61898.0001326.023.1
5Saturn1433.5568.000687.09.0
6Uranus2872.586.8001271.08.7
7Neptune4495.1102.0001638.011.0
\n", "
" ], "text/plain": [ " 0 1 2 3 4\n", "0 Mercury 57.9 0.330 5427.0 3.7\n", "1 Venus 108.2 4.870 5243.0 8.9\n", "2 Earth 149.6 5.970 5514.0 9.8\n", "3 Mars 227.9 0.642 3933.0 3.7\n", "4 Jupiter 778.6 1898.000 1326.0 23.1\n", "5 Saturn 1433.5 568.000 687.0 9.0\n", "6 Uranus 2872.5 86.800 1271.0 8.7\n", "7 Neptune 4495.1 102.000 1638.0 11.0" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planets = pd.DataFrame(planets_data)\n", "\n", "planets" ] }, { "cell_type": "markdown", "id": "ccdcf7e2-b474-4c3a-a083-3de292ff0585", "metadata": {}, "source": [ "This presentation of our data already looks more like a spreadsheet!\n", "\n", "However, there's something odd about the above. We're accustomed now to numbering elements of a sequence by their *index* (or \"offset\") – 0, 1, 2, 3, … – and this works in this case for numbering our rows. But this isn't as useful a scheme for labeling our columns.\n", "\n", "We'll make manipulation of this data easier, and avoid confusion about what these values represent, by defining useful column labels.\n", "\n", "Luckily, we already defined our features above." ] }, { "cell_type": "code", "execution_count": 4, "id": "ebbc04e6-884b-4622-93de-2e546da4b241", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['name', 'solar_distance_km_6', 'mass_kg_24', 'density_kg_m3', 'gravity_m_s2']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planets_features" ] }, { "cell_type": "code", "execution_count": 5, "id": "5140ea86-2e1a-4a72-b79a-8540bd2bbe40", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namesolar_distance_km_6mass_kg_24density_kg_m3gravity_m_s2
0Mercury57.90.3305427.03.7
1Venus108.24.8705243.08.9
2Earth149.65.9705514.09.8
3Mars227.90.6423933.03.7
4Jupiter778.61898.0001326.023.1
5Saturn1433.5568.000687.09.0
6Uranus2872.586.8001271.08.7
7Neptune4495.1102.0001638.011.0
\n", "
" ], "text/plain": [ " name solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2\n", "0 Mercury 57.9 0.330 5427.0 3.7\n", "1 Venus 108.2 4.870 5243.0 8.9\n", "2 Earth 149.6 5.970 5514.0 9.8\n", "3 Mars 227.9 0.642 3933.0 3.7\n", "4 Jupiter 778.6 1898.000 1326.0 23.1\n", "5 Saturn 1433.5 568.000 687.0 9.0\n", "6 Uranus 2872.5 86.800 1271.0 8.7\n", "7 Neptune 4495.1 102.000 1638.0 11.0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planets = pd.DataFrame(planets_data, columns=planets_features)\n", "\n", "planets" ] }, { "cell_type": "markdown", "id": "91b630ef-3c5d-4b4c-90d3-e3cb794ca690", "metadata": {}, "source": [ "That's better!" ] }, { "cell_type": "markdown", "id": "234a2fda-adaf-4461-94b7-e91331403dd4", "metadata": {}, "source": [ "Indeed, there are _many_ ways to construct a `DataFrame`.\n", "\n", "For another example, we might have specified our data as a single dictionary of features." ] }, { "cell_type": "code", "execution_count": 6, "id": "cd808ba2-803c-4a6d-b998-da66305481f0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namesolar_distance_km_6mass_kg_24density_kg_m3gravity_m_s2
0Mercury57.90.3305427.03.7
1Venus108.24.8705243.08.9
2Earth149.65.9705514.09.8
3Mars227.90.6423933.03.7
4Jupiter778.61898.0001326.023.1
5Saturn1433.5568.000687.09.0
6Uranus2872.586.8001271.08.7
7Neptune4495.1102.0001638.011.0
\n", "
" ], "text/plain": [ " name solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2\n", "0 Mercury 57.9 0.330 5427.0 3.7\n", "1 Venus 108.2 4.870 5243.0 8.9\n", "2 Earth 149.6 5.970 5514.0 9.8\n", "3 Mars 227.9 0.642 3933.0 3.7\n", "4 Jupiter 778.6 1898.000 1326.0 23.1\n", "5 Saturn 1433.5 568.000 687.0 9.0\n", "6 Uranus 2872.5 86.800 1271.0 8.7\n", "7 Neptune 4495.1 102.000 1638.0 11.0" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planets_dict = {\n", " 'name': ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune'],\n", " 'solar_distance_km_6': [57.9, 108.2, 149.6, 227.9, 778.6, 1433.5, 2872.5, 4495.1],\n", " 'mass_kg_24': [0.33, 4.87, 5.97, 0.642, 1898.0, 568.0, 86.8, 102.0],\n", " 'density_kg_m3': [5427.0, 5243.0, 5514.0, 3933.0, 1326.0, 687.0, 1271.0, 1638.0],\n", " 'gravity_m_s2': [3.7, 8.9, 9.8, 3.7, 23.1, 9.0, 8.7, 11.0],\n", "}\n", "\n", "pd.DataFrame(planets_dict)" ] }, { "cell_type": "markdown", "id": "26c736b2-debb-430c-8269-c96b70d2dc34", "metadata": {}, "source": [ "### Data formats\n", "\n", "Of course, it is _very_ common to store data in a file format, such as CSV. pandas supports a great many common data encoding formats, and makes it easy to construct `DataFrames` from them.\n", "\n", "For example, if we had a CSV file in our \"Documents\" folder, we might construct a `DataFrame` from it using the pandas `read_csv` function, like so:\n", "\n", "```py\n", "data = pd.read_csv('/Users/MySelf/Documents/my-data.csv')\n", "```\n", "\n", "Above, we simply gave pandas the file system path to our CSV data. 
The `read_csv` function also supports file objects, such as those returned by Python's `open` function.\n", "\n", "Our planetary data, encoded as CSV, takes the following form:\n", "\n", "```\n", "name,solar_distance_km_6,mass_kg_24,density_kg_m3,gravity_m_s2\n", "Mercury,57.9,0.33,5427.0,3.7\n", "Venus,108.2,4.87,5243.0,8.9\n", "Earth,149.6,5.97,5514.0,9.8\n", "Mars,227.9,0.642,3933.0,3.7\n", "Jupiter,778.6,1898.0,1326.0,23.1\n", "Saturn,1433.5,568.0,687.0,9.0\n", "Uranus,2872.5,86.8,1271.0,8.7\n", "Neptune,4495.1,102.0,1638.0,11.0\n", "```\n", "\n", "Note that we've included our feature names as the first row of our data. (This is optional – but useful!)\n", "\n", "And below we'll reload our planets `DataFrame`, similarly to the above – (but from a file buffer of that data, `planets_csv`, the details of which are hidden below)." ] }, { "cell_type": "code", "execution_count": 7, "id": "ca877460-cec6-42bb-b9c4-7410902ee76d", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [], "source": [ "#\n", "# Hello!\n", "#\n", "# This code allows you to download and execute this notebook as-is –\n", "# without a separate CSV file.\n", "#\n", "# We can just *pretend* that `planets_csv` is a path to a file –\n", "# or, more apt, a file object opened with the Python `open` function.\n", "#\n", "# (Really it's an in-memory file object … but that's not important here!)\n", "#\n", "import io\n", "\n", "\n", "planets_encoded = '''\\\n", "name,solar_distance_km_6,mass_kg_24,density_kg_m3,gravity_m_s2\n", "Mercury,57.9,0.33,5427.0,3.7\n", "Venus,108.2,4.87,5243.0,8.9\n", "Earth,149.6,5.97,5514.0,9.8\n", "Mars,227.9,0.642,3933.0,3.7\n", "Jupiter,778.6,1898.0,1326.0,23.1\n", "Saturn,1433.5,568.0,687.0,9.0\n", "Uranus,2872.5,86.8,1271.0,8.7\n", "Neptune,4495.1,102.0,1638.0,11.0\n", "'''\n", "\n", "planets_csv = io.StringIO(planets_encoded)" ] }, { "cell_type": "code", "execution_count": 8, "id": "be1c8683-1dcc-4e38-b957-3290ab10e8b9", "metadata": {}, "outputs": [ { "data": { 
"text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namesolar_distance_km_6mass_kg_24density_kg_m3gravity_m_s2
0Mercury57.90.3305427.03.7
1Venus108.24.8705243.08.9
2Earth149.65.9705514.09.8
3Mars227.90.6423933.03.7
4Jupiter778.61898.0001326.023.1
5Saturn1433.5568.000687.09.0
6Uranus2872.586.8001271.08.7
7Neptune4495.1102.0001638.011.0
\n", "
" ], "text/plain": [ " name solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2\n", "0 Mercury 57.9 0.330 5427.0 3.7\n", "1 Venus 108.2 4.870 5243.0 8.9\n", "2 Earth 149.6 5.970 5514.0 9.8\n", "3 Mars 227.9 0.642 3933.0 3.7\n", "4 Jupiter 778.6 1898.000 1326.0 23.1\n", "5 Saturn 1433.5 568.000 687.0 9.0\n", "6 Uranus 2872.5 86.800 1271.0 8.7\n", "7 Neptune 4495.1 102.000 1638.0 11.0" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planets = pd.read_csv(planets_csv)\n", "\n", "planets" ] }, { "cell_type": "markdown", "id": "056e60fe-ca84-446b-a50d-67fff7427b72", "metadata": {}, "source": [ "Note that pandas automatically inferred that the first row of our CSV data specified the feature names." ] }, { "cell_type": "markdown", "id": "bbdc7693-b407-423d-924a-290efd8abee2", "metadata": {}, "source": [ "## The index\n", "\n", "pandas's default index – the familiar range of integers starting with `0` – is most often sensible for computational data.\n", "\n", "This is represented by the `RangeIndex` type." ] }, { "cell_type": "code", "execution_count": 9, "id": "abe10a4b-a1a9-43ca-a073-319170c26020", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=8, step=1)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planets.index" ] }, { "cell_type": "markdown", "id": "88a6131a-b083-4e9f-b20f-e0ceaee80c41", "metadata": {}, "source": [ "Of course, that's _not_ how we think about the planets!\n", "\n", "We can tell pandas to use a more familiar index instead." 
] }, { "cell_type": "code", "execution_count": 10, "id": "a502cfa7-4b43-47eb-bcdf-bce3e4f7bf67", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=1, stop=9, step=1, name='number')" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.RangeIndex(1, 9, name='number')" ] }, { "cell_type": "code", "execution_count": 11, "id": "b3de389f-da2e-481c-b2a0-1db73b95aab2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namesolar_distance_km_6mass_kg_24density_kg_m3gravity_m_s2
number
1Mercury57.90.3305427.03.7
2Venus108.24.8705243.08.9
3Earth149.65.9705514.09.8
4Mars227.90.6423933.03.7
5Jupiter778.61898.0001326.023.1
6Saturn1433.5568.000687.09.0
7Uranus2872.586.8001271.08.7
8Neptune4495.1102.0001638.011.0
\n", "
" ], "text/plain": [ " name solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2\n", "number \n", "1 Mercury 57.9 0.330 5427.0 3.7\n", "2 Venus 108.2 4.870 5243.0 8.9\n", "3 Earth 149.6 5.970 5514.0 9.8\n", "4 Mars 227.9 0.642 3933.0 3.7\n", "5 Jupiter 778.6 1898.000 1326.0 23.1\n", "6 Saturn 1433.5 568.000 687.0 9.0\n", "7 Uranus 2872.5 86.800 1271.0 8.7\n", "8 Neptune 4495.1 102.000 1638.0 11.0" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(planets_data,\n", " columns=planets_features,\n", " index=pd.RangeIndex(1, 9, name='number'))" ] }, { "cell_type": "markdown", "id": "6dfa5acf-660c-4bcc-ba63-2746eaa38193", "metadata": {}, "source": [ "We don't even have to use ranges … or numbers!" ] }, { "cell_type": "code", "execution_count": 12, "id": "6ada95ab-61fe-4344-9a45-f5373a203e47", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namesolar_distance_km_6mass_kg_24density_kg_m3gravity_m_s2
ordinal
firstMercury57.90.3305427.03.7
secondVenus108.24.8705243.08.9
thirdEarth149.65.9705514.09.8
fourthMars227.90.6423933.03.7
fifthJupiter778.61898.0001326.023.1
sixthSaturn1433.5568.000687.09.0
seventhUranus2872.586.8001271.08.7
eighthNeptune4495.1102.0001638.011.0
\n", "
" ], "text/plain": [ " name solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2\n", "ordinal \n", "first Mercury 57.9 0.330 5427.0 3.7\n", "second Venus 108.2 4.870 5243.0 8.9\n", "third Earth 149.6 5.970 5514.0 9.8\n", "fourth Mars 227.9 0.642 3933.0 3.7\n", "fifth Jupiter 778.6 1898.000 1326.0 23.1\n", "sixth Saturn 1433.5 568.000 687.0 9.0\n", "seventh Uranus 2872.5 86.800 1271.0 8.7\n", "eighth Neptune 4495.1 102.000 1638.0 11.0" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ordinals = ['first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh', 'eighth']\n", "\n", "planet_ordinals = pd.DataFrame(planets_data,\n", " columns=planets_features,\n", " index=pd.Index(ordinals, name='ordinal'))\n", "\n", "planet_ordinals" ] }, { "cell_type": "markdown", "id": "9576ebce-205b-45f3-9888-743cceb202fc", "metadata": {}, "source": [ "But, in the end, perhaps we'd prefer not to count the planets at all.\n", "\n", "Whenever a data feature makes sense to use as the data index – that is, whenever it's sufficient to _always_ **uniquely identify** individuals – we can tell pandas to use that column as the index instead." ] }, { "cell_type": "markdown", "id": "3e5b25fc-6435-4cf3-b0a1-dc5fbb928d04", "metadata": {}, "source": [ "We'll learn more about manipulating `DataFrames` in subsequent sections. But, for now, here's how we would set the `name` feature as our index (at least when constructing a `DataFrame` from `lists` or `dicts`)." ] }, { "cell_type": "code", "execution_count": 13, "id": "d9dff830-cbc3-4b40-b04d-9fe4e4b42dc9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
solar_distance_km_6mass_kg_24density_kg_m3gravity_m_s2
name
Mercury57.90.3305427.03.7
Venus108.24.8705243.08.9
Earth149.65.9705514.09.8
Mars227.90.6423933.03.7
Jupiter778.61898.0001326.023.1
Saturn1433.5568.000687.09.0
Uranus2872.586.8001271.08.7
Neptune4495.1102.0001638.011.0
\n", "
" ], "text/plain": [ " solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2\n", "name \n", "Mercury 57.9 0.330 5427.0 3.7\n", "Venus 108.2 4.870 5243.0 8.9\n", "Earth 149.6 5.970 5514.0 9.8\n", "Mars 227.9 0.642 3933.0 3.7\n", "Jupiter 778.6 1898.000 1326.0 23.1\n", "Saturn 1433.5 568.000 687.0 9.0\n", "Uranus 2872.5 86.800 1271.0 8.7\n", "Neptune 4495.1 102.000 1638.0 11.0" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planets.set_index('name')" ] }, { "cell_type": "code", "execution_count": 14, "id": "33709da9-56b8-4163-8666-68b03614908b", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#\n", "# Hello!\n", "#\n", "# This cell has been hidden – it's just an implementation concern.\n", "#\n", "# Generally, when working with files, you won't need to worry about this.\n", "#\n", "planets_csv.seek(0)" ] }, { "cell_type": "markdown", "id": "ddcfe281-6fbd-4d7c-a71b-79520be3d967", "metadata": {}, "source": [ "The `read_csv` function, on the other hand, supports this case specifically." ] }, { "cell_type": "code", "execution_count": 15, "id": "073652e2-bb44-4121-a03a-758d7c9aaf0c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
solar_distance_km_6mass_kg_24density_kg_m3gravity_m_s2
name
Mercury57.90.3305427.03.7
Venus108.24.8705243.08.9
Earth149.65.9705514.09.8
Mars227.90.6423933.03.7
Jupiter778.61898.0001326.023.1
Saturn1433.5568.000687.09.0
Uranus2872.586.8001271.08.7
Neptune4495.1102.0001638.011.0
\n", "
" ], "text/plain": [ " solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2\n", "name \n", "Mercury 57.9 0.330 5427.0 3.7\n", "Venus 108.2 4.870 5243.0 8.9\n", "Earth 149.6 5.970 5514.0 9.8\n", "Mars 227.9 0.642 3933.0 3.7\n", "Jupiter 778.6 1898.000 1326.0 23.1\n", "Saturn 1433.5 568.000 687.0 9.0\n", "Uranus 2872.5 86.800 1271.0 8.7\n", "Neptune 4495.1 102.000 1638.0 11.0" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv(planets_csv, index_col='name')" ] }, { "cell_type": "markdown", "id": "cc759e02-cf38-4a97-8eab-f71103146ae8", "metadata": {}, "source": [ "Now these `DataFrames` are looking great! Let's see what we can do with them." ] }, { "cell_type": "markdown", "id": "b68a1d00-2b47-4c95-839d-b63acc602fc4", "metadata": {}, "source": [ "## Operations\n", "\n", "As we've seen with the `list` (and the string), the `DataFrame` can be manipulated by functions and built-in operators. Moreover, these types offer special-purpose functions which have been *bound* to them – that is, *methods* – which are invoked with expressions of the form below:\n", "\n", " name_of_dataframe.name_of_method(argument0, argument1, ..., keyword0=value0, ...)\n", " \n", "For example, above we used the `set_index` method to construct a new `DataFrame` with the `name` column set as the data index. Here it is again:" ] }, { "cell_type": "code", "execution_count": 16, "id": "0a0f096a-7d48-4a52-9895-0dbfc89c2342", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
solar_distance_km_6mass_kg_24density_kg_m3gravity_m_s2
name
Mercury57.90.3305427.03.7
Venus108.24.8705243.08.9
Earth149.65.9705514.09.8
Mars227.90.6423933.03.7
Jupiter778.61898.0001326.023.1
Saturn1433.5568.000687.09.0
Uranus2872.586.8001271.08.7
Neptune4495.1102.0001638.011.0
\n", "
" ], "text/plain": [ " solar_distance_km_6 mass_kg_24 density_kg_m3 gravity_m_s2\n", "name \n", "Mercury 57.9 0.330 5427.0 3.7\n", "Venus 108.2 4.870 5243.0 8.9\n", "Earth 149.6 5.970 5514.0 9.8\n", "Mars 227.9 0.642 3933.0 3.7\n", "Jupiter 778.6 1898.000 1326.0 23.1\n", "Saturn 1433.5 568.000 687.0 9.0\n", "Uranus 2872.5 86.800 1271.0 8.7\n", "Neptune 4495.1 102.000 1638.0 11.0" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planets.set_index('name')" ] }, { "cell_type": "markdown", "id": "276c89e8-7dd4-4838-8f50-b62237ab1794", "metadata": {}, "source": [ "And, similar to methods, there are *attributes* and *properties*. These are values which are similarly bound to the `DataFrame`, but which need not be called:\n", "\n", " name_of_dataframe.name_of_property\n", " \n", "We made use of the `index` property above as well, to inspect our `DataFrame`'s currently assigned index:" ] }, { "cell_type": "code", "execution_count": 17, "id": "abb6b2d0-48eb-4dac-aead-5d4761a2b2e1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=8, step=1)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planets.index" ] }, { "cell_type": "markdown", "id": "d0db98b6-ca01-4073-8614-debc59f46d1d", "metadata": {}, "source": [ "pandas offers us many functions, methods and properties to explore!\n", "\n", "And now we're ready to explore the dimensions of our data." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" }, "vscode": { "interpreter": { "hash": "1a1af0ee75eeea9e2e1ee996c87e7a2b11a0bebd85af04bb136d915cefc0abce" } } }, "nbformat": 4, "nbformat_minor": 5 }