Other Visualization Techniques#

In this section, we will introduce other data visualizations that can be used to represent categorical or numerical data. We will discuss another visualization library called seaborn. While the matplotlib library can be used to create most data visualizations in Python, there are some restrictions when it comes to customization. The seaborn library provides many flexible options when creating visualizations. In the upcoming exercises, we will use a combination of seaborn and matplotlib to make visualizations, including box and whisker plots, heatmaps, and area plots.

Along with the previous data and libraries we have been using, we will import seaborn as a common convention: sns.

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

from matplotlib import pyplot as plt
import seaborn as sns

plt.style.use('fast')

Box and Whisker Plots#

Box and whisker plots are a useful data visualization method because they intrinsically display multiple summary statistics simultaneously. The central line of each box within a box and whisker plot is the median. The median (also known as the second quartile, \(Q_2\)) is a value within a dataset that lies within the middle, separating the higher half and the lower half of the dataset. The median is a valuable measure of center for a dataset because it is not greatly affected by outliers, as opposed to the mean.

If A represents a dataset with the values listed below, the median of A can be determined by sorting the numbers from low to high and determining the value that falls into the middle (which in this case is 109):

In the case of a dataset with an even number of values, the median can be calculated as the mean of the middle two values, as shown in the example dataset B below:

Box and whisker plots also show the lower quartile, the upper quartile, the interquartile range, outliers, the minimum, and the maximum. The lower quartile \(\left( Q_1\right)\) is the value where the lowest 25% of the data points within a dataset lie. It is represented by the lower end of the box.

On the other side, the upper quartile \(\left( Q_3\right)\) is the value in which the highest 25% of the dataset resides. It is represented by the higher end of a box.

The interquartile range (IQR), is the difference between the upper quartile and the lower quartile. The IQR is used to make the length of a box and is represented by the equation:

\(IQR=Q_3-Q_1\)

Outliers are data points that are less than \(Q_1-1.5×IQR\) or greater than \(Q_3+1.5×IQR\). These data points are shown beyond the extremity of the whiskers.

Lastly, the minimum and maximum values are represented by the lowest and highest values, respectively, that are within the range \(Q_1-1.5×IQR\) and \(Q_3+1.5×IQR\). Essentially, they are the lowest and highest values that do not qualify as outliers. The lower whisker represents the minimum value, while the upper whisker represents the maximum value.

Below is a pictorial summary of the major components of a box and whisker plot with an accompanying set of numbers, A:

With the military data, we will use a box and whisker plot to examine the percentage GDP spending on the military for each country in the ’60s as a way to examine multiple statistics for each country in this time period.

First, we extract the data of interest:

military = pd.read_csv("../../data/NorthAmerica_Military_USD-PercentGDP_Combined.csv", index_col='Year')

the60s = military.loc[1960:1969, ['CAN-PercentGDP', 'MEX-PercentGDP', 'USA-PercentGDP']]

the60s

	CAN-PercentGDP	MEX-PercentGDP	USA-PercentGDP
Year
1960	4.185257	0.673509	8.993125
1961	4.128312	0.651780	9.156031
1962	3.999216	0.689655	9.331673
1963	3.620650	0.718686	8.831891
1964	3.402063	0.677507	8.051281
1965	2.930261	0.591270	7.587247
1966	2.683282	0.576379	8.435300
1967	2.747927	0.545217	9.417796
1968	2.543642	0.548511	9.268454
1969	2.273785	0.600160	8.633264

It is possible to make boxplots using using pyplot or the DataFrame method plot.box(), as shown below:

# Creates a boxplot using pyplot from matplotlib
plt.boxplot(the60s)
plt.show()

# Creates a boxplot using the DataFrame method plot.box()
the60s.plot.box()
plt.show()

Plotting using these approaches, the graphs show data within the columns of interest, but depending on the approach used, we see that the column title may or may not be used as categorical indicators on the x-axis. Furthermore, while these plots do the job of displaying the median and interquartile range, adding individual data points will allow for viewers to more easily see the spread of the data. The addition of axis labels, a title, and some color would also enhance this plot and make it more aesthetically pleasing.

We can accomplish this using a combination of functions from matplotlib and seaborn. Using the sns.boxplot() and sns.swarmplot() functions will allow us to create a boxplot with data points overlayed on top:

ax = sns.boxplot(data=the60s, palette="Set2", linewidth=1)
ax = sns.swarmplot(data=the60s, palette="Set2", linewidth=0.5, edgecolor = "black")
plt.xticks(ticks = [0,1,2], labels = ['Canada', 'Mexico', 'United States'])
plt.ylabel("Percent of GDP")
plt.title("% GDP spent on the military in North America from 1960-1969")


plt.show()

Now that we have proper labeling, we can see the median, upper quartile, and lower quartile of the percentage of the each country’s GDP spent on the military from 1960 to 1969. A noticeable observation this plot shows is that Mexico not only spent a small percentage of their GDP on the military (less than 2%), but the percentage of spending during this decade had very little variability. This makes it hard to see what the median, upper quartile, and lower quartile are for Mexico. The issue of being able to visually resolve displays of data is a common one that data scientists encounter.

Heatmaps#

A heatmap is a matrix of data points depicted through a color gradient. Heatmaps are a great way to visualize data when you want to look at a multidimensional comparison of many variables. Heatmaps can be made from matplotlib, but this process may not be as straightforward to some. On the other hand, seaborn has a function dedicated to generations of heatmaps called sns.heatmap(). For your reference, both the matplotlib and seaborn approaches for constructing heatmaps are listed below.

With seaborn, we will use a heatmap to visualize variables that take on a large range of values. Using data from the Division of Vital Statistics at the Center for Disease Control (CDC), we will examine maternal mortality rates (MMR) in the United States from 2018 to 2019. This data displays the MMR per 100,000 live births based on race and Hispanic origin in various age groups. More information on the dataset can be found here.

Let’s load the data:

mmr = pd.read_csv("../../data/maternal-mortality-rate_2018-2019.csv", index_col='Race/Ethnicity: Age Group')

mmr

	MMR (2018)	MMR (2019)
Race/Ethnicity: Age Group
Total: All Ages	17.4	20.1
Total: Under 25	10.6	12.6
Total: 25-39	16.6	19.9
Total: 40 and over	81.9	75.5
NHW: All Ages	14.9	17.9
NHW: Under 25	10.5	13.1
NHW: 25–39	13.8	16.8
NHW: 40 and over	72.0	75.2
NHB: All Ages	37.3	44.0
NHB: Under 25	15.3	18.8
NHB: 25–39	38.2	49.7
NHB: 40 and over	239.9	166.5
His: All Ages	11.8	12.6
His: Under 25	7.6	8.5
His: 25–39	12.4	12.2
His: 40 and over	NaN	NaN

The sns.heatmap() function minimally requires an argument for the data parameter. We can specify this argument to be the DataFrame mmr:

sns.heatmap(data = mmr)

plt.show()

Above we generated our heatmap based on our data, but there is room for improvement. The current colormap makes it difficult to see each individual race/ethnicity category. We can outline each individual data point using the linewidth and linecolor parameters. We’ll also change the colormap from the default to something more visually distinguishable using the cmap parameter.

Furthermore, additional labeling would help in communicating what the data is about. A label for the color map bar can be added using the cbar_kws paramter and passing a dictionary as an argument. We will also add a title using plt.title()

sns.heatmap(data = mmr, cmap='summer_r', linewidth=2, linecolor="black", # colormap, line width, and color specified
           cbar_kws={'label': 'Maternal Mortality Rate (MMR) per 100,000 Live Births'}) # color bar labeled
plt.title('U.S. MMR, by race and Hispanic origin and age: 2018 - 2019')           # title added

plt.show()

This is better, but it still can be improved! To make it easier to distinguish each row, we can change our figure size to be plt.figure(). We could also increase the space between the title and the heatmap using the pad parameter in plt.title().

Additionally, labeling the values of each data point would aid a viewer in understanding the distribution of the dataset. By passing True into the annot parameter, we can label each data point with its respective value. We dictate the format of this labeling using the fmt parameter, following formatting for string literals. More information on that is referenced below.

Lastly, we can specify the minimum and maximum values of the color bar using vmin and vmax, respectively.

plt.figure(figsize=(4,10)) 

sns.heatmap(data = mmr, cmap='summer_r', linewidth=2, linecolor="black", # colormap, line width, and color specified
           cbar_kws={'label': 'Maternal Mortality Rate (MMR) per 100,000 Live Births'}, # color bar labeled
            vmin=0, vmax=250, annot=True, fmt='g')                  # min and max values set, annotate data points
plt.title('U.S. MMR, by race and Hispanic origin and age: 2018 - 2019', pad=20)     # title added, padding set

plt.show()

Above, we see the matrix of values of the MMR (per 100,000 live births) in the U.S. based on race and ethnicity and age group. Notice that in the Hispanic, 40 and over group (His: 40 and over), these data points appear white for 2018 and 2019 and are not labeled. These groups have NaN values in the mmr DataFrame because the MMR does not meet National Center for Health Statistics standards of reliability, and thus, did not have a recorded value in the original dataset.

Area Plots#

An area plot is a specialized line graph that can be used to show trends of multiple variables in a dataset over a period of time. In an area plot, data points over time are connected to create a trend line and the region formed under the line is filled with a solid color. A useful adaptation of an area plot is that it can be constructed in a way that shows a proportional relationship of each variable to all variables over time, which can be a great alternative to using multiple pie charts to examine temporal trends.

We will use an area plot to examine the proportion of suicides amongst youth ages 10-24 in east north central states from 2000-2018 using the enc dataset:

enc = pd.read_csv("../../data/east-north-central_suicides.csv", index_col='Year')
enc

	Illinois	Indiana	Michigan	Ohio	Wisconsin
Year
2000	155	112	138	151	109
2001	172	98	143	165	100
2002	168	104	150	168	112
2003	131	80	132	139	103
2004	160	99	148	199	111
2005	139	114	140	187	104
2006	135	98	124	183	71
2007	173	104	138	178	110
2008	144	94	146	202	80
2009	151	115	143	157	104
2010	151	107	182	193	113
2011	182	95	169	211	124
2012	160	127	199	197	108
2013	180	127	196	180	107
2014	187	128	211	197	127
2015	211	137	209	224	122
2016	184	157	217	215	143
2017	235	165	212	268	126
2018	201	179	251	271	114

Now that we have this data, we can begin to calculate the proportion of suicides in each state for each year.

Using the .sum() and .div() methods, we can determine the sum for each year, then divide each data point by this sum. In the .sum() method, we will use axis=1 to sum over each row, which will allow us to determine the total for each year:

enc = enc.div(enc.sum(axis=1), axis=0)
enc

	Illinois	Indiana	Michigan	Ohio	Wisconsin
Year
2000	0.233083	0.168421	0.207519	0.227068	0.163910
2001	0.253687	0.144543	0.210914	0.243363	0.147493
2002	0.239316	0.148148	0.213675	0.239316	0.159544
2003	0.223932	0.136752	0.225641	0.237607	0.176068
2004	0.223152	0.138075	0.206416	0.277545	0.154812
2005	0.203216	0.166667	0.204678	0.273392	0.152047
2006	0.220949	0.160393	0.202946	0.299509	0.116203
2007	0.246088	0.147937	0.196302	0.253201	0.156472
2008	0.216216	0.141141	0.219219	0.303303	0.120120
2009	0.225373	0.171642	0.213433	0.234328	0.155224
2010	0.202413	0.143432	0.243968	0.258713	0.151475
2011	0.233035	0.121639	0.216389	0.270166	0.158771
2012	0.202276	0.160556	0.251580	0.249052	0.136536
2013	0.227848	0.160759	0.248101	0.227848	0.135443
2014	0.220000	0.150588	0.248235	0.231765	0.149412
2015	0.233666	0.151717	0.231451	0.248062	0.135105
2016	0.200873	0.171397	0.236900	0.234716	0.156114
2017	0.233598	0.164016	0.210736	0.266402	0.125249
2018	0.197835	0.176181	0.247047	0.266732	0.112205

Now we can begin to make the area plot with the calculated proportions. To do these, we will use the plt.stackplot() function. This function first takes an array-like object for the x-values, followed by arrays of the y-values that are to be stacked.

plt.stackplot(enc.index, enc['Illinois'], enc['Indiana'], enc['Michigan'], enc['Ohio'], enc['Wisconsin'])
plt.show()

Currently, we don’t know what each color corresponds to, but this can be solved by specifying the labels parameter, which dictates how each y-value is to be labeled. By specifying this, the labels can then be visualized in the legend using plt.legend(). The location of the legend will be placed using the bbox_to_anchor parameter.

As an aesthetic feature, we can also make this graph so that it takes up the entire plotting area by using plt.margins(). This function accepts an x and y value, respectively, to indicate where the margins begin on each axis. To get rid of the margins, we will use 0 for each value.

Like with previous visualizations, we can add a title and fix the x-axis ticks using plt.title() and plt.xticks(), respectively.

plt.stackplot(enc.index, enc['Illinois'], enc['Indiana'], enc['Michigan'], enc['Ohio'], enc['Wisconsin'], 
              labels = ['Illinois', 'Indiana', 'Michigan', 'Ohio', 'Wisconsin'])
plt.legend(bbox_to_anchor = (1.3,1))
plt.margins(0,0)
plt.title('Proportion of Suicides Amongst Youth Ages 10-24 in East North Central States (2000-2018)', pad=15)

years = np.arange(2000, 2020, 2)
plt.xticks(years)

plt.show()

We can now see the changes in the youth suicides in north east central states from 2000-2018. We see how these trends flucuated in each state over time in each state.

Conclusions#

In this section, we were introduced to a new data visualization library: seaborn. The seaborn library can make many of the same visualizations available in matplotlib and can be used as an alternative in cases where more flexibility is needed.

We learned how to make box and whisker plots and the multiple statistics that these plots innately show. Box and whisker plots can be made in both matplotlib and seaborn, but using seaborn to construct these plots allows for an easy way to overlay data points upon the box and whisker plot.

We also learned about heatmaps and their ability to show multidimensional data.

Lastly, we learned how to construct area plots as another way to show proportional trends overtime, combining benefits of both a line graph and pie chart.

Documentation to functions introduced in this section can be found below:

Introduction to Data Science I & II

Other Visualization Techniques

Contents

Other Visualization Techniques#

Box and Whisker Plots#

Heatmaps#

Area Plots#

Conclusions#