Grouping Data
In addition to merging multiple DataFrames, we can perform operations within a single DataFrame.
The `groupby` function allows us to split the DataFrame into groups. We can extract the grouped data directly into a new DataFrame or apply a specified function before combining the results into a DataFrame.
The most important input to this function is what we are grouping by, usually a column name. The general format of the `groupby` operation is:
df.groupby(by='group_name')
where 'group_name' is the name of the column whose unique entries define the groups.
We'll redefine the right-merged menu below to use in our investigation of grouping.
import pandas as pd

nutritional_info = pd.DataFrame({
    'Item': ['Big Mac', 'Medium French Fries', 'Cheeseburger', 'McChicken', 'Hot Fudge Sundae', 'Sausage Burrito', 'Baked Apple Pie'],
    'Calories': [530, 340, 290, 360, 330, 300, 250],
    'Protein': [24, 4, 15, 14, 8, 12, 2],
})
menu = pd.DataFrame({
    'Item': ['Big Mac', 'Cheeseburger', 'McChicken', 'Hot Fudge Sundae', 'Egg McMuffin', 'Medium French Fries', 'Sausage Burrito'],
    'Price': [5.47, 1.81, 2.01, 2.68, 4.43, 2.77, 2.10],
    'Category': ['Lunch/Dinner', 'Lunch/Dinner', 'Lunch/Dinner', 'Dessert', 'Breakfast', 'Lunch/Dinner', 'Breakfast'],
})
full_menu = pd.merge(nutritional_info, menu, how='right')
full_menu
|   | Item | Calories | Protein | Price | Category |
|---|---|---|---|---|---|
| 0 | Big Mac | 530.0 | 24.0 | 5.47 | Lunch/Dinner |
| 1 | Cheeseburger | 290.0 | 15.0 | 1.81 | Lunch/Dinner |
| 2 | McChicken | 360.0 | 14.0 | 2.01 | Lunch/Dinner |
| 3 | Hot Fudge Sundae | 330.0 | 8.0 | 2.68 | Dessert |
| 4 | Egg McMuffin | NaN | NaN | 4.43 | Breakfast |
| 5 | Medium French Fries | 340.0 | 4.0 | 2.77 | Lunch/Dinner |
| 6 | Sausage Burrito | 300.0 | 12.0 | 2.10 | Breakfast |
We can use `full_menu` above and group the items by "Category". The `groupby` function by itself splits the data. If we want to form a new DataFrame or add a computation, this must be specified.
Here we create a new DataFrame using the `get_group` method. This creates a new DataFrame with entries from the specified group; in this case only items from the category "Breakfast" are included in the DataFrame.
category_groups = full_menu.groupby('Category')
category_groups.get_group('Breakfast')
|   | Item | Calories | Protein | Price | Category |
|---|---|---|---|---|---|
| 4 | Egg McMuffin | NaN | NaN | 4.43 | Breakfast |
| 6 | Sausage Burrito | 300.0 | 12.0 | 2.10 | Breakfast |
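Before computing anything, we can also inspect how the split turned out. Below is a minimal sketch, reusing the `category_groups` object defined above, that lists the group labels and loops over each group's sub-DataFrame; the `groups` attribute and iteration are standard features of a pandas GroupBy object.

# List the labels of the groups created by the split.
print(list(category_groups.groups.keys()))  # ['Breakfast', 'Dessert', 'Lunch/Dinner']

# Iterate over (label, sub-DataFrame) pairs, one per group.
for name, group in category_groups:
    print(name, '->', len(group), 'items')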
Applying Functions within Groupby
In addition to viewing or sorting the data, the `groupby` operation is useful for applying functions to groups of a DataFrame.
Perhaps we want to know the average price, grouped by the category of each item. We use the line below, which first groups the data by “Category” and then takes the mean across the “Price” column within each group.
(The double brackets around `'Price'` are needed to ensure we get a DataFrame as output. See: DataFrames: Selection by label.)
full_menu.groupby('Category')[['Price']].mean()
| Category | Price |
|---|---|
| Breakfast | 3.265 |
| Dessert | 2.680 |
| Lunch/Dinner | 3.015 |
The `groupby` operation allows us to split the DataFrame into groups, apply a specified function, and combine the results into a DataFrame. The behind-the-scenes process of this code is outlined below.
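To make this split-apply-combine process concrete, here is a minimal sketch that reproduces `full_menu.groupby('Category')[['Price']].mean()` by hand, using only plain indexing (`groupby` performs these steps far more efficiently):

# 1. Split: one sub-DataFrame per unique 'Category' value.
pieces = {cat: full_menu[full_menu['Category'] == cat]
          for cat in full_menu['Category'].unique()}

# 2. Apply: compute the mean price within each piece.
mean_price = {cat: piece['Price'].mean() for cat, piece in pieces.items()}

# 3. Combine: gather the per-group results into one DataFrame.
pd.DataFrame({'Price': mean_price}).sort_index()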
There are many different functions we can apply to our groups. Common functions include: `count`, `min`, `max`, `sum`, `mean`, `std`, `var`. (For more information, consult the pandas documentation of `groupby`.)
We demonstrate a few of these functions below.
For example, it might be useful to count how many items on the `full_menu` fall into each "Category": "Breakfast", "Lunch/Dinner", and "Dessert". We can do so by first grouping the DataFrame by "Category" and then counting how many "Item" entries are in each group.
full_menu.groupby('Category')[['Item']].count()
| Category | Item |
|---|---|
| Breakfast | 2 |
| Dessert | 1 |
| Lunch/Dinner | 4 |
Or, perhaps we want to `sum` the menu prices by "Category". We first group by "Category" and then add up all prices within each group:
full_menu.groupby('Category')[['Price']].sum()
| Category | Price |
|---|---|
| Breakfast | 6.53 |
| Dessert | 2.68 |
| Lunch/Dinner | 12.06 |
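Rather than calling these functions one at a time, several can be applied in a single pass with the `agg` method, a standard part of the GroupBy API. A small sketch:

# Apply several aggregations to 'Price' at once.
full_menu.groupby('Category')['Price'].agg(['count', 'min', 'max', 'mean'])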
Grouping on a Larger Dataset
To see the full potential of grouping, we'll consider a larger DataFrame: the Data Science Jobs and Salaries dataset, which contains information on:
- Data Science job titles
- experience level: Entry-level (EN), Mid-level (MI), Senior-level (SE), Executive-level (EX)
- remote work ratio (the percentage of work done remotely: 0, 50, or 100)
- 2020 salary or 2021 expected salary (2021e)
- employment type: PT (part time), FT (full time), CT (contract), FL (freelance)
(For a more complete view of the data, note that they were taken from Saurabh Shahane's dataset on Kaggle, which gathered information from ai-jobs.net.)
salary_data = pd.read_csv("../../data/Data_Science_Jobs_Salaries.csv")
salary_data.head(5)
|   | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021e | EN | FT | Data Science Consultant | 54000 | EUR | 64369 | DE | 50 | DE | L |
| 1 | 2020 | SE | FT | Data Scientist | 60000 | EUR | 68428 | GR | 100 | US | L |
| 2 | 2021e | EX | FT | Head of Data Science | 85000 | USD | 85000 | RU | 0 | RU | M |
| 3 | 2021e | EX | FT | Head of Data | 230000 | USD | 230000 | RU | 50 | RU | L |
| 4 | 2021e | EN | FT | Machine Learning Engineer | 125000 | USD | 125000 | US | 100 | US | S |
These are only the first five rows; this DataFrame contains a lot of information! We can extract or sort the information we want by grouping the data in a convenient way. The `groupby` method in pandas will help us do exactly that!
For example, we can group the `salary_data` by the experience level of the employees. Recall we may do this by first splitting the data into `experience_level` groups and then extracting the group of employees with `experience_level` equal to `EN`, or entry-level experience. The resulting DataFrame contains only entries from the specified group: entry-level employees.
experience_level_groups = salary_data.groupby('experience_level')
experience_level_groups.get_group('EN').head(5)
|    | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021e | EN | FT | Data Science Consultant | 54000 | EUR | 64369 | DE | 50 | DE | L |
| 4 | 2021e | EN | FT | Machine Learning Engineer | 125000 | USD | 125000 | US | 100 | US | S |
| 11 | 2021e | EN | FT | Data Scientist | 13400 | USD | 13400 | UA | 100 | UA | L |
| 17 | 2021e | EN | FT | Data Analyst | 90000 | USD | 90000 | US | 100 | US | S |
| 18 | 2021e | EN | FT | Data Analyst | 60000 | USD | 60000 | US | 100 | US | S |
We can group by more than one category as well. Below we group first by `employment_type` and, within each `employment_type` group, by `experience_level`. The `first` method then returns the first row, if it exists, of each group.
two_groups = salary_data.groupby(['employment_type','experience_level'])
two_groups.first()
| employment_type | experience_level | work_year | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|
| CT | EN | 2020 | Business Data Analyst | 100000 | USD | 100000 | US | 100 | US | L |
| | EX | 2021e | Principal Data Scientist | 416000 | USD | 416000 | US | 100 | US | S |
| | MI | 2021e | ML Engineer | 270000 | USD | 270000 | US | 100 | US | L |
| | SE | 2021e | Staff Data Scientist | 105000 | USD | 105000 | US | 100 | US | M |
| FL | MI | 2021e | Data Engineer | 20000 | USD | 20000 | IT | 0 | US | L |
| | SE | 2020 | Computer Vision Engineer | 60000 | USD | 60000 | RU | 100 | US | S |
| FT | EN | 2021e | Data Science Consultant | 54000 | EUR | 64369 | DE | 50 | DE | L |
| | EX | 2021e | Head of Data Science | 85000 | USD | 85000 | RU | 0 | RU | M |
| | MI | 2020 | Research Scientist | 450000 | USD | 450000 | US | 0 | US | M |
| | SE | 2020 | Data Scientist | 60000 | EUR | 68428 | GR | 100 | US | L |
| PT | EN | 2021e | AI Scientist | 12000 | USD | 12000 | PK | 100 | US | M |
| | MI | 2021e | 3D Computer Vision Researcher | 400000 | INR | 5423 | IN | 50 | IN | M |
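With multiple grouping columns, `get_group` works just as before, except the group key is a tuple containing one value per grouping column. A small sketch, using the `two_groups` object defined above:

# Extract the full-time, entry-level group; the key is an
# (employment_type, experience_level) tuple.
two_groups.get_group(('FT', 'EN')).head(5)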
Having grouped the data into single or multiple categories, we’ll now apply functions to these categories!
Perhaps we want to know the average salary grouped by experience level. We use the line below, which groups the data by `experience_level` and takes the `mean` across the `salary_in_usd` column within each group.
salary_data.groupby('experience_level').salary_in_usd.mean()
experience_level
EN 59753.462963
EX 226288.000000
MI 85738.135922
SE 128841.298701
Name: salary_in_usd, dtype: float64
We can also count how many people work (or receive salaries) at companies of each size: small, medium, and large.
First we group by company size and count all employees who receive a salary.
(Recall that the double brackets on `'salary'` ensure our result is a DataFrame and not a Series. The output above, for average salary by experience level, was a simple Series – see if you can write this as a DataFrame!)
salary_data.groupby('company_size').count()[['salary']]
| company_size | salary |
|---|---|
| L | 132 |
| M | 55 |
| S | 58 |
Additionally, the `groupby` method can be used to split the data into multiple groups, apply a function, and then reform it into a DataFrame.
Similar to the above example, we first group by multiple columns, subdividing into `experience_level` and then further by `company_size`:
experience_and_compsize = salary_data.groupby(['experience_level', 'company_size'])
experience_and_compsize.first()
| experience_level | company_size | work_year | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location |
|---|---|---|---|---|---|---|---|---|---|---|
| EN | L | 2021e | FT | Data Science Consultant | 54000 | EUR | 64369 | DE | 50 | DE |
| | M | 2021e | PT | AI Scientist | 12000 | USD | 12000 | PK | 100 | US |
| | S | 2021e | FT | Machine Learning Engineer | 125000 | USD | 125000 | US | 100 | US |
| EX | L | 2021e | FT | Head of Data | 230000 | USD | 230000 | RU | 50 | RU |
| | M | 2021e | FT | Head of Data Science | 85000 | USD | 85000 | RU | 0 | RU |
| | S | 2021e | FT | Data Science Consultant | 59000 | EUR | 70329 | FR | 100 | ES |
| MI | L | 2020 | FT | Data Analyst | 41000 | EUR | 46759 | FR | 50 | FR |
| | M | 2020 | FT | Research Scientist | 450000 | USD | 450000 | US | 0 | US |
| | S | 2021e | FT | Machine Learning Engineer | 40000 | EUR | 47681 | ES | 100 | ES |
| SE | L | 2020 | FT | Data Scientist | 60000 | EUR | 68428 | GR | 100 | US |
| | M | 2021e | FT | Data Analytics Manager | 120000 | USD | 120000 | US | 100 | US |
| | S | 2021e | FT | Data Analyst | 80000 | USD | 80000 | BG | 100 | US |
If we want to compute the average salary within each experience level, broken down by company size, we can use our multi-grouped DataFrame and aggregate the average salary:
experience_and_compsize['salary_in_usd'].mean()
experience_level company_size
EN L 75148.000000
M 41063.923077
S 57502.000000
EX L 239729.875000
M 85000.000000
S 243164.500000
MI L 96285.451613
M 83982.800000
S 47610.000000
SE L 134465.604651
M 122572.125000
S 120978.055556
Name: salary_in_usd, dtype: float64
From this we can see that the average salary for an entry level employee at a large company is $75,148.
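Since the result is a Series with a MultiIndex, a single value can be looked up with a tuple of group keys, one per grouping level. A small sketch:

# Average salary for entry-level employees at large companies.
avg_salary = experience_and_compsize['salary_in_usd'].mean()
avg_salary.loc[('EN', 'L')]  # 75148.0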
To turn this Series into a DataFrame, we use the `unstack` method, which pivots the inner-most group, `company_size`, and rearranges it as columns.
experience_and_compsize['salary_in_usd'].mean().unstack()
| experience_level \ company_size | L | M | S |
|---|---|---|---|
| EN | 75148.000000 | 41063.923077 | 57502.000000 |
| EX | 239729.875000 | 85000.000000 | 243164.500000 |
| MI | 96285.451613 | 83982.800000 | 47610.000000 |
| SE | 134465.604651 | 122572.125000 | 120978.055556 |
This idea of splitting a DataFrame into multiple groups and computing multi-dimensional aggregations of values is such an important feature that it has its own function in pandas! It is called a pivot table. In the next section, we introduce how to manipulate DataFrames using the `pivot_table` method.
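As a preview, here is a sketch of how the same table of average salaries could be produced in one call (the details are covered in the next section):

# Equivalent pivot_table call: rows indexed by experience_level,
# columns by company_size, averaging salary_in_usd in each cell.
salary_data.pivot_table(values='salary_in_usd',
                        index='experience_level',
                        columns='company_size',
                        aggfunc='mean')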