Pivot Tables
Pivot Tables#
In this section we’ll continue with the Data Science Jobs and Salaries dataset to introduce the pivot_table operation.
salary_data = pd.read_csv("../../data/Data_Science_Jobs_Salaries.csv")
salary_data.head(5)
work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2021e | EN | FT | Data Science Consultant | 54000 | EUR | 64369 | DE | 50 | DE | L |
1 | 2020 | SE | FT | Data Scientist | 60000 | EUR | 68428 | GR | 100 | US | L |
2 | 2021e | EX | FT | Head of Data Science | 85000 | USD | 85000 | RU | 0 | RU | M |
3 | 2021e | EX | FT | Head of Data | 230000 | USD | 230000 | RU | 50 | RU | L |
4 | 2021e | EN | FT | Machine Learning Engineer | 125000 | USD | 125000 | US | 100 | US | S |
A pivot_table allows for cross-classification of groups in a DataFrame.
The general format for a pivot_table is –
df.pivot_table(data, index='group_1', columns='group_2', aggfunc='function')
or alternately, dependent on preference, the below format from directly from a pandas method can be used.
pd.pivot_table(df, values='data', index='group_1', columns = 'group_2', aggfunc='function').
In both formats, we have
df is the the given dataframe
each unique value in
index
gets its own roweach unique value in
columns
gets its own columndata
specifies the value in the DataFrame to which we want to applyaggfunc
The default option for aggfunc
is mean
.
In the last section, we used the groupby
method to cross-classify experience level with company size using the code:
salary_data.groupby(['experience_level', 'company_size'])['salary_in_usd'].mean().unstack()
We can reimplement this with a pivot table by specifying experience_level
for the rows, company_size
for the columns, and the data of salary_in_usd
. (Again, the default for pivot_table
is to aggregate the mean.)
salary_data.pivot_table('salary_in_usd', index='experience_level', columns='company_size')
company_size | L | M | S |
---|---|---|---|
experience_level | |||
EN | 75148.000000 | 41063.923077 | 57502.000000 |
EX | 239729.875000 | 85000.000000 | 243164.500000 |
MI | 96285.451613 | 83982.800000 | 47610.000000 |
SE | 134465.604651 | 122572.125000 | 120978.055556 |
While taking the average is the default, we can also calculate the sum, maximum, minimum, or variance to name a few.
We find the maximum salary with respect to experience level and company size below:
salary_data.pivot_table('salary_in_usd', index='experience_level', columns='company_size', aggfunc='max')
company_size | L | M | S |
---|---|---|---|
experience_level | |||
EN | 250000 | 100000 | 138000 |
EX | 600000 | 85000 | 416000 |
MI | 450000 | 450000 | 110000 |
SE | 412000 | 195000 | 260000 |
We also have the option to calculate multiple statistics or aggregate functions in one line of code.
We enter a list of functions we want to compute in the aggfunc argument and the output is one large DataFrame with each statistic computed.
Below we find the minimum, maximum, and mean salary dependent on experience level and company size:
salary_data.pivot_table('salary_in_usd', index='experience_level', columns='company_size', aggfunc=['min', 'max', 'mean'])
min | max | mean | |||||||
---|---|---|---|---|---|---|---|---|---|
company_size | L | M | S | L | M | S | L | M | S |
experience_level | |||||||||
EN | 5898 | 4000 | 6072 | 250000 | 100000 | 138000 | 75148.000000 | 41063.923077 | 57502.000000 |
EX | 79833 | 85000 | 70329 | 600000 | 85000 | 416000 | 239729.875000 | 85000.000000 | 243164.500000 |
MI | 6072 | 4000 | 2876 | 450000 | 450000 | 110000 | 96285.451613 | 83982.800000 | 47610.000000 |
SE | 21843 | 19052 | 33511 | 412000 | 195000 | 260000 | 134465.604651 | 122572.125000 | 120978.055556 |
In general, the ability to merge, group, or pivot DataFrames provides easy access to gather statistics about a DataFrame. These operations also provide a way to better visualize or sort grouped data.