Pivot Tables#

In this section we’ll continue with the Data Science Jobs and Salaries dataset to introduce the pivot_table operation.

salary_data = pd.read_csv("../../data/Data_Science_Jobs_Salaries.csv")

salary_data.head(5)

	work_year	experience_level	employment_type	job_title	salary	salary_currency	salary_in_usd	employee_residence	remote_ratio	company_location	company_size
0	2021e	EN	FT	Data Science Consultant	54000	EUR	64369	DE	50	DE	L
1	2020	SE	FT	Data Scientist	60000	EUR	68428	GR	100	US	L
2	2021e	EX	FT	Head of Data Science	85000	USD	85000	RU	0	RU	M
3	2021e	EX	FT	Head of Data	230000	USD	230000	RU	50	RU	L
4	2021e	EN	FT	Machine Learning Engineer	125000	USD	125000	US	100	US	S

A pivot_table allows for cross-classification of groups in a DataFrame.

The general format for a pivot_table is –

df.pivot_table(data, index='group_1', columns='group_2', aggfunc='function')

or alternately, dependent on preference, the below format from directly from a pandas method can be used.

pd.pivot_table(df, values='data', index='group_1', columns = 'group_2', aggfunc='function').

In both formats, we have

df is the the given dataframe
each unique value in index gets its own row
each unique value in columns gets its own column
data specifies the value in the DataFrame to which we want to apply aggfunc

The default option for aggfunc is mean.

In the last section, we used the groupby method to cross-classify experience level with company size using the code:

salary_data.groupby(['experience_level', 'company_size'])['salary_in_usd'].mean().unstack()

We can reimplement this with a pivot table by specifying experience_level for the rows, company_size for the columns, and the data of salary_in_usd. (Again, the default for pivot_table is to aggregate the mean.)

salary_data.pivot_table('salary_in_usd', index='experience_level', columns='company_size')

company_size	L	M	S
experience_level
EN	75148.000000	41063.923077	57502.000000
EX	239729.875000	85000.000000	243164.500000
MI	96285.451613	83982.800000	47610.000000
SE	134465.604651	122572.125000	120978.055556

While taking the average is the default, we can also calculate the sum, maximum, minimum, or variance to name a few.

We find the maximum salary with respect to experience level and company size below:

salary_data.pivot_table('salary_in_usd', index='experience_level', columns='company_size', aggfunc='max')

company_size	L	M	S
experience_level
EN	250000	100000	138000
EX	600000	85000	416000
MI	450000	450000	110000
SE	412000	195000	260000

We also have the option to calculate multiple statistics or aggregate functions in one line of code.

We enter a list of functions we want to compute in the aggfunc argument and the output is one large DataFrame with each statistic computed.

Below we find the minimum, maximum, and mean salary dependent on experience level and company size:

salary_data.pivot_table('salary_in_usd', index='experience_level', columns='company_size', aggfunc=['min', 'max', 'mean'])

	min			max			mean
company_size	L	M	S	L	M	S	L	M	S
experience_level
EN	5898	4000	6072	250000	100000	138000	75148.000000	41063.923077	57502.000000
EX	79833	85000	70329	600000	85000	416000	239729.875000	85000.000000	243164.500000
MI	6072	4000	2876	450000	450000	110000	96285.451613	83982.800000	47610.000000
SE	21843	19052	33511	412000	195000	260000	134465.604651	122572.125000	120978.055556

In general, the ability to merge, group, or pivot DataFrames provides easy access to gather statistics about a DataFrame. These operations also provide a way to better visualize or sort grouped data.

Introduction to Data Science I & II

Pivot Tables

Pivot Tables#