Pivot Tables#

In this section we’ll continue with the Data Science Jobs and Salaries dataset to introduce the pivot_table operation.

salary_data = pd.read_csv("../../data/Data_Science_Jobs_Salaries.csv")

salary_data.head(5)
work_year experience_level employment_type job_title salary salary_currency salary_in_usd employee_residence remote_ratio company_location company_size
0 2021e EN FT Data Science Consultant 54000 EUR 64369 DE 50 DE L
1 2020 SE FT Data Scientist 60000 EUR 68428 GR 100 US L
2 2021e EX FT Head of Data Science 85000 USD 85000 RU 0 RU M
3 2021e EX FT Head of Data 230000 USD 230000 RU 50 RU L
4 2021e EN FT Machine Learning Engineer 125000 USD 125000 US 100 US S

A pivot_table allows for cross-classification of groups in a DataFrame.

The general format for a pivot_table is –

df.pivot_table(data, index='group_1', columns='group_2', aggfunc='function')

or alternately, dependent on preference, the below format from directly from a pandas method can be used.

pd.pivot_table(df, values='data', index='group_1', columns = 'group_2', aggfunc='function').

In both formats, we have

  • df is the the given dataframe

  • each unique value in index gets its own row

  • each unique value in columns gets its own column

  • data specifies the value in the DataFrame to which we want to apply aggfunc

The default option for aggfunc is mean.

In the last section, we used the groupby method to cross-classify experience level with company size using the code:

salary_data.groupby(['experience_level', 'company_size'])['salary_in_usd'].mean().unstack()

We can reimplement this with a pivot table by specifying experience_level for the rows, company_size for the columns, and the data of salary_in_usd. (Again, the default for pivot_table is to aggregate the mean.)

salary_data.pivot_table('salary_in_usd', index='experience_level', columns='company_size')
company_size L M S
experience_level
EN 75148.000000 41063.923077 57502.000000
EX 239729.875000 85000.000000 243164.500000
MI 96285.451613 83982.800000 47610.000000
SE 134465.604651 122572.125000 120978.055556

While taking the average is the default, we can also calculate the sum, maximum, minimum, or variance to name a few.

We find the maximum salary with respect to experience level and company size below:

salary_data.pivot_table('salary_in_usd', index='experience_level', columns='company_size', aggfunc='max')
company_size L M S
experience_level
EN 250000 100000 138000
EX 600000 85000 416000
MI 450000 450000 110000
SE 412000 195000 260000

We also have the option to calculate multiple statistics or aggregate functions in one line of code.

We enter a list of functions we want to compute in the aggfunc argument and the output is one large DataFrame with each statistic computed.

Below we find the minimum, maximum, and mean salary dependent on experience level and company size:

salary_data.pivot_table('salary_in_usd', index='experience_level', columns='company_size', aggfunc=['min', 'max', 'mean'])
min max mean
company_size L M S L M S L M S
experience_level
EN 5898 4000 6072 250000 100000 138000 75148.000000 41063.923077 57502.000000
EX 79833 85000 70329 600000 85000 416000 239729.875000 85000.000000 243164.500000
MI 6072 4000 2876 450000 450000 110000 96285.451613 83982.800000 47610.000000
SE 21843 19052 33511 412000 195000 260000 134465.604651 122572.125000 120978.055556

In general, the ability to merge, group, or pivot DataFrames provides easy access to gather statistics about a DataFrame. These operations also provide a way to better visualize or sort grouped data.