Other Visualization Techniques#

In this section, we will introduce other data visualizations that can be used to represent categorical or numerical data. We will discuss another visualization library called seaborn. While the matplotlib library can be used to create most data visualizations in Python, there are some restrictions when it comes to customization. The seaborn library provides many flexible options when creating visualizations. In the upcoming exercises, we will use a combination of seaborn and matplotlib to make visualizations, including box and whisker plots, heatmaps, and area plots.

Along with the data and libraries we have been using, we will import seaborn under its conventional alias, sns.

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

from matplotlib import pyplot as plt
import seaborn as sns

plt.style.use('fast')

Box and Whisker Plots#

Box and whisker plots are a useful data visualization method because they intrinsically display multiple summary statistics simultaneously. The central line of each box within a box and whisker plot is the median. The median (also known as the second quartile, \(Q_2\)) is the value that lies in the middle of a sorted dataset, separating the higher half from the lower half. The median is a valuable measure of center for a dataset because, unlike the mean, it is not greatly affected by outliers.

If A represents a dataset with the values listed below, the median of A can be determined by sorting the numbers from low to high and determining the value that falls into the middle (which in this case is 109):

../../_images/median(odd).png

In the case of a dataset with an even number of values, the median can be calculated as the mean of the middle two values, as shown in the example dataset B below:

../../_images/median(even).png
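
The two cases above can be checked numerically. The sketch below uses two small made-up lists (they are illustrative only and are not the datasets A and B pictured above) and compares a by-hand median, written here as a hypothetical helper named median_by_hand, against np.median():

```python
import numpy as np

def median_by_hand(values):
    """Sort the values and return the middle one (or the mean of the middle two)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: there is a single middle value
        return ordered[mid]
    # Even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical example values, chosen only for illustration
odd_example = [7, 1, 5, 3, 9]
even_example = [7, 1, 5, 3, 9, 11]

print(median_by_hand(odd_example), np.median(odd_example))
print(median_by_hand(even_example), np.median(even_example))
```

Both approaches should agree; in practice np.median() (or the pandas .median() method) is the more convenient choice.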

Box and whisker plots also show the lower quartile, the upper quartile, the interquartile range, outliers, the minimum, and the maximum. The lower quartile \(\left( Q_1\right)\) is the value below which the lowest 25% of the data points in a dataset lie. It is represented by the lower end of the box.

On the other side, the upper quartile \(\left( Q_3\right)\) is the value above which the highest 25% of the data points lie. It is represented by the upper end of the box.

The interquartile range (IQR) is the difference between the upper quartile and the lower quartile. The IQR sets the length of the box and is given by the equation:

\(IQR=Q_3-Q_1\)

Outliers are data points that are less than \(Q_1-1.5\times IQR\) or greater than \(Q_3+1.5\times IQR\). These data points are shown beyond the extremity of the whiskers.

Lastly, the minimum and maximum values are represented by the lowest and highest values, respectively, that lie between \(Q_1-1.5\times IQR\) and \(Q_3+1.5\times IQR\). Essentially, they are the lowest and highest values that do not qualify as outliers. The lower whisker ends at the minimum value, while the upper whisker ends at the maximum value.

Below is a pictorial summary of the major components of a box and whisker plot with an accompanying set of numbers, A:

../../_images/BoxandWhisker.png
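
As a rough numerical companion to the definitions above, the sketch below computes each component for a small set of made-up values (not the set A shown in the figure). It assumes quartiles are taken with np.percentile() and its default linear interpolation; other quartile conventions can give slightly different numbers.

```python
import numpy as np

# Hypothetical values with one obvious outlier (30)
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# Quartiles (default linear interpolation) and interquartile range
q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Limits beyond which a point counts as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme values that are not outliers
inside = data[(data >= lower_fence) & (data <= upper_fence)]
outliers = data[(data < lower_fence) | (data > upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

print("Q1, median, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```

Here the upper whisker stops at 9 rather than 30, because 30 lies above \(Q_3+1.5\times IQR\) and is drawn as a separate outlier point.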

With the military data, we will use a box and whisker plot to examine the percentage GDP spending on the military for each country in the ’60s as a way to examine multiple statistics for each country in this time period.

First, we extract the data of interest:

military = pd.read_csv("../../data/NorthAmerica_Military_USD-PercentGDP_Combined.csv", index_col='Year')

the60s = military.loc[1960:1969, ['CAN-PercentGDP', 'MEX-PercentGDP', 'USA-PercentGDP']]

the60s
CAN-PercentGDP MEX-PercentGDP USA-PercentGDP
Year
1960 4.185257 0.673509 8.993125
1961 4.128312 0.651780 9.156031
1962 3.999216 0.689655 9.331673
1963 3.620650 0.718686 8.831891
1964 3.402063 0.677507 8.051281
1965 2.930261 0.591270 7.587247
1966 2.683282 0.576379 8.435300
1967 2.747927 0.545217 9.417796
1968 2.543642 0.548511 9.268454
1969 2.273785 0.600160 8.633264

It is possible to make boxplots using pyplot or the DataFrame method plot.box(), as shown below:

# Creates a boxplot using pyplot from matplotlib
plt.boxplot(the60s)
plt.show()
../../_images/other-viz_7_0.png
# Creates a boxplot using the DataFrame method plot.box()
the60s.plot.box()
plt.show()
../../_images/other-viz_8_0.png

Both approaches plot the data in the columns of interest, but only the DataFrame method uses the column titles as categorical labels on the x-axis. Furthermore, while these plots do the job of displaying the median and interquartile range, adding individual data points will let viewers see the spread of the data more easily. Axis labels, a title, and some color would also enhance this plot and make it more aesthetically pleasing.

We can accomplish this using a combination of functions from matplotlib and seaborn. Using the sns.boxplot() and sns.swarmplot() functions will allow us to create a boxplot with data points overlaid on top:

ax = sns.boxplot(data=the60s, palette="Set2", linewidth=1)
ax = sns.swarmplot(data=the60s, palette="Set2", linewidth=0.5, edgecolor = "black")
plt.xticks(ticks = [0,1,2], labels = ['Canada', 'Mexico', 'United States'])
plt.ylabel("Percent of GDP")
plt.title("% GDP spent on the military in North America from 1960-1969")


plt.show()
../../_images/other-viz_10_0.png

Now that we have proper labeling, we can see the median, upper quartile, and lower quartile of the percentage of each country’s GDP spent on the military from 1960 to 1969. One noticeable feature of this plot is that Mexico not only spent a small percentage of its GDP on the military (less than 1%), but that this percentage varied very little during the decade. This makes it hard to see what the median, upper quartile, and lower quartile are for Mexico. Being able to visually resolve the data in a display is a common issue that data scientists encounter.
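
One possible way to work around this, sketched below, is to give each country its own panel with an independent y-axis using plt.subplots() and sharey=False. This is only an illustration and not part of the original exercise; to stay self-contained it re-types the first three rows of the60s rather than loading the CSV file.

```python
import pandas as pd
from matplotlib import pyplot as plt

# First three rows of the60s, re-typed from the table above (illustration only)
sample = pd.DataFrame(
    {"CAN-PercentGDP": [4.185257, 4.128312, 3.999216],
     "MEX-PercentGDP": [0.673509, 0.651780, 0.689655],
     "USA-PercentGDP": [8.993125, 9.156031, 9.331673]},
    index=pd.Index([1960, 1961, 1962], name="Year"),
)
names = ["Canada", "Mexico", "United States"]

# One panel per country; sharey=False lets each panel pick its own y-range
fig, axes = plt.subplots(nrows=1, ncols=3, sharey=False, figsize=(9, 3))
for ax, column, name in zip(axes, sample.columns, names):
    ax.boxplot(sample[column].to_numpy())
    ax.set_title(name)
    ax.set_xticks([])  # a single box per panel needs no x tick

axes[0].set_ylabel("Percent of GDP")
fig.tight_layout()
```

Calling plt.show() afterwards displays the figure. The trade-off is that independent axes make each box readable but make it harder to compare countries at a glance, since the same height no longer means the same value.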

Heatmaps#

A heatmap is a matrix of data points depicted through a color gradient. Heatmaps are a great way to visualize data when you want to look at a multidimensional comparison of many variables. Heatmaps can be made with matplotlib, but the process takes more manual steps. On the other hand, seaborn has a function dedicated to generating heatmaps called sns.heatmap(). For your reference, both the matplotlib and seaborn approaches for constructing heatmaps are listed below.

With seaborn, we will use a heatmap to visualize variables that take on a large range of values. Using data from the Division of Vital Statistics at the Centers for Disease Control and Prevention (CDC), we will examine maternal mortality rates (MMR) in the United States from 2018 to 2019. This data displays the MMR per 100,000 live births based on race and Hispanic origin in various age groups. More information on the dataset can be found here.
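
To give a sense of what the matplotlib route involves, here is a minimal sketch using imshow() on a small made-up grid. The values and labels are hypothetical; the point is only that the color bar and tick labels have to be added by hand.

```python
import numpy as np
from matplotlib import pyplot as plt

# Hypothetical 3x2 grid of values, used only to illustrate the calls
values = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])
row_labels = ["row A", "row B", "row C"]
col_labels = ["col 1", "col 2"]

fig, ax = plt.subplots()
image = ax.imshow(values, cmap="summer_r", aspect="auto")  # draw the color grid
fig.colorbar(image, ax=ax, label="value")                  # color bar is added manually

# Tick positions and labels are also set manually
ax.set_xticks(np.arange(len(col_labels)))
ax.set_xticklabels(col_labels)
ax.set_yticks(np.arange(len(row_labels)))
ax.set_yticklabels(row_labels)
```

Calling plt.show() afterwards displays the figure.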

Let’s load the data:

mmr = pd.read_csv("../../data/maternal-mortality-rate_2018-2019.csv", index_col='Race/Ethnicity: Age Group')

mmr
MMR (2018) MMR (2019)
Race/Ethnicity: Age Group
Total: All Ages 17.4 20.1
Total: Under 25 10.6 12.6
Total: 25-39 16.6 19.9
Total: 40 and over 81.9 75.5
NHW: All Ages 14.9 17.9
NHW: Under 25 10.5 13.1
NHW: 25–39 13.8 16.8
NHW: 40 and over 72.0 75.2
NHB: All Ages 37.3 44.0
NHB: Under 25 15.3 18.8
NHB: 25–39 38.2 49.7
NHB: 40 and over 239.9 166.5
His: All Ages 11.8 12.6
His: Under 25 7.6 8.5
His: 25–39 12.4 12.2
His: 40 and over NaN NaN

The sns.heatmap() function minimally requires an argument for the data parameter. We can specify this argument to be the DataFrame mmr:

sns.heatmap(data = mmr)

plt.show()
../../_images/other-viz_15_0.png

Above we generated our heatmap based on our data, but there is room for improvement. The current colormap makes it difficult to see each individual race/ethnicity category. We can outline each individual data point using the linewidth and linecolor parameters. We’ll also change the colormap from the default to something more visually distinguishable using the cmap parameter.

Furthermore, additional labeling would help in communicating what the data is about. A label for the color bar can be added using the cbar_kws parameter and passing a dictionary as an argument. We will also add a title using plt.title().

sns.heatmap(data = mmr, cmap='summer_r', linewidth=2, linecolor="black", # colormap, line width, and color specified
           cbar_kws={'label': 'Maternal Mortality Rate (MMR) per 100,000 Live Births'}) # color bar labeled
plt.title('U.S. MMR, by race and Hispanic origin and age: 2018 - 2019')           # title added

plt.show()
../../_images/other-viz_17_0.png

This is better, but it can still be improved! To make it easier to distinguish each row, we can change the figure size by passing the figsize parameter to plt.figure(). We can also increase the space between the title and the heatmap using the pad parameter in plt.title().

Additionally, labeling the values of each data point would aid a viewer in understanding the distribution of the dataset. By passing True into the annot parameter, we can label each data point with its respective value. We dictate the format of this labeling using the fmt parameter, following formatting for string literals. More information on that is referenced below.
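
Because the fmt string follows Python's format specification, its effect can be previewed with the built-in format() function before it is passed to sns.heatmap(). The short sketch below compares a few common specifiers on one value taken from the mmr table:

```python
value = 239.9  # NHB: 40 and over, MMR (2018), from the table above

# Map each format specifier to the label it would produce
labels = {spec: format(value, spec) for spec in ["g", ".0f", ".2f"]}

for spec, text in labels.items():
    print(f"fmt={spec!r:8} -> {text}")

# 'g' drops trailing zeros, so whole numbers print without a decimal point
print(format(12.0, "g"))
```

The 'g' specifier keeps the values as they appear in the data, while '.0f' would round them to whole numbers.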

Lastly, we can specify the minimum and maximum values of the color bar using vmin and vmax, respectively.

plt.figure(figsize=(4,10)) 

sns.heatmap(data = mmr, cmap='summer_r', linewidth=2, linecolor="black", # colormap, line width, and color specified
           cbar_kws={'label': 'Maternal Mortality Rate (MMR) per 100,000 Live Births'}, # color bar labeled
            vmin=0, vmax=250, annot=True, fmt='g')                  # min and max values set, annotate data points
plt.title('U.S. MMR, by race and Hispanic origin and age: 2018 - 2019', pad=20)     # title added, padding set

plt.show()
../../_images/other-viz_19_0.png

Above, we see the matrix of values of the MMR (per 100,000 live births) in the U.S. based on race and ethnicity and age group. Notice that in the Hispanic, 40 and over group (His: 40 and over), these data points appear white for 2018 and 2019 and are not labeled. These groups have NaN values in the mmr DataFrame because the MMR does not meet National Center for Health Statistics standards of reliability, and thus, did not have a recorded value in the original dataset.
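
Before plotting, it can help to check where values are missing. The sketch below re-types two rows of the mmr table (so that it runs without the CSV file) and uses the pandas .isna() method to locate the gaps:

```python
import numpy as np
import pandas as pd

# Two rows re-typed from the mmr table above (illustration only)
sample = pd.DataFrame(
    {"MMR (2018)": [12.4, np.nan], "MMR (2019)": [12.2, np.nan]},
    index=pd.Index(["His: 25–39", "His: 40 and over"],
                   name="Race/Ethnicity: Age Group"),
)

missing = sample.isna()  # True wherever a value is missing
rows_with_gaps = sample.index[missing.any(axis=1)].tolist()

print(missing.sum())     # number of missing values in each column
print(rows_with_gaps)    # row labels that contain at least one NaN
```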

Area Plots#

An area plot is a specialized line graph that can be used to show trends of multiple variables in a dataset over a period of time. In an area plot, data points over time are connected to create a trend line and the region formed under the line is filled with a solid color. A useful adaptation of an area plot is that it can be constructed in a way that shows a proportional relationship of each variable to all variables over time, which can be a great alternative to using multiple pie charts to examine temporal trends.

We will use an area plot to examine the proportion of suicides amongst youth ages 10-24 in East North Central states from 2000-2018 using the enc dataset:

enc = pd.read_csv("../../data/east-north-central_suicides.csv", index_col='Year')
enc
Illinois Indiana Michigan Ohio Wisconsin
Year
2000 155 112 138 151 109
2001 172 98 143 165 100
2002 168 104 150 168 112
2003 131 80 132 139 103
2004 160 99 148 199 111
2005 139 114 140 187 104
2006 135 98 124 183 71
2007 173 104 138 178 110
2008 144 94 146 202 80
2009 151 115 143 157 104
2010 151 107 182 193 113
2011 182 95 169 211 124
2012 160 127 199 197 108
2013 180 127 196 180 107
2014 187 128 211 197 127
2015 211 137 209 224 122
2016 184 157 217 215 143
2017 235 165 212 268 126
2018 201 179 251 271 114

Now that we have this data, we can begin to calculate the proportion of suicides in each state for each year.

Using the .sum() and .div() methods, we can determine the sum for each year, then divide each data point by this sum. In the .sum() method, we will use axis=1 to sum over each row, which will allow us to determine the total for each year:

enc = enc.div(enc.sum(axis=1), axis=0)
enc
Illinois Indiana Michigan Ohio Wisconsin
Year
2000 0.233083 0.168421 0.207519 0.227068 0.163910
2001 0.253687 0.144543 0.210914 0.243363 0.147493
2002 0.239316 0.148148 0.213675 0.239316 0.159544
2003 0.223932 0.136752 0.225641 0.237607 0.176068
2004 0.223152 0.138075 0.206416 0.277545 0.154812
2005 0.203216 0.166667 0.204678 0.273392 0.152047
2006 0.220949 0.160393 0.202946 0.299509 0.116203
2007 0.246088 0.147937 0.196302 0.253201 0.156472
2008 0.216216 0.141141 0.219219 0.303303 0.120120
2009 0.225373 0.171642 0.213433 0.234328 0.155224
2010 0.202413 0.143432 0.243968 0.258713 0.151475
2011 0.233035 0.121639 0.216389 0.270166 0.158771
2012 0.202276 0.160556 0.251580 0.249052 0.136536
2013 0.227848 0.160759 0.248101 0.227848 0.135443
2014 0.220000 0.150588 0.248235 0.231765 0.149412
2015 0.233666 0.151717 0.231451 0.248062 0.135105
2016 0.200873 0.171397 0.236900 0.234716 0.156114
2017 0.233598 0.164016 0.210736 0.266402 0.125249
2018 0.197835 0.176181 0.247047 0.266732 0.112205
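
To see what each step does, the sketch below repeats the calculation on the first two rows of the table (re-typed here so it runs on its own) and confirms that each row of proportions adds up to 1. The names counts, row_totals, and shares are used only for this illustration.

```python
import pandas as pd

# First two rows of the enc counts, re-typed from the table above
counts = pd.DataFrame(
    {"Illinois": [155, 172], "Indiana": [112, 98], "Michigan": [138, 143],
     "Ohio": [151, 165], "Wisconsin": [109, 100]},
    index=pd.Index([2000, 2001], name="Year"),
)

row_totals = counts.sum(axis=1)          # axis=1: add across columns, one total per year
shares = counts.div(row_totals, axis=0)  # axis=0: match each total to its row by Year

print(row_totals)
print(shares.round(6))
print(shares.sum(axis=1))                # each year's proportions should sum to 1
```

The axis=0 argument in .div() matters: it tells pandas to align the yearly totals with the row index rather than with the column names.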

Now we can begin to make the area plot with the calculated proportions. To do this, we will use the plt.stackplot() function. This function first takes an array-like object for the x-values, followed by arrays of the y-values that are to be stacked.

plt.stackplot(enc.index, enc['Illinois'], enc['Indiana'], enc['Michigan'], enc['Ohio'], enc['Wisconsin'])
plt.show()
../../_images/other-viz_26_0.png

Currently, we don’t know what each color corresponds to, but this can be solved by specifying the labels parameter, which dictates how each set of y-values is to be labeled. Once specified, the labels can be displayed in a legend using plt.legend(). The legend will be positioned using the bbox_to_anchor parameter.

As an aesthetic feature, we can also make this graph take up the entire plotting area by using plt.margins(). This function accepts an x and a y value, respectively, that set how much padding is added around the data on each axis, expressed as a fraction of the data range. To get rid of the margins, we will use 0 for each value.

As with previous visualizations, we can add a title and set the x-axis ticks using plt.title() and plt.xticks(), respectively.

plt.stackplot(enc.index, enc['Illinois'], enc['Indiana'], enc['Michigan'], enc['Ohio'], enc['Wisconsin'], 
              labels = ['Illinois', 'Indiana', 'Michigan', 'Ohio', 'Wisconsin'])
plt.legend(bbox_to_anchor = (1.3,1))
plt.margins(0,0)
plt.title('Proportion of Suicides Amongst Youth Ages 10-24 in East North Central States (2000-2018)', pad=15)

years = np.arange(2000, 2020, 2)
plt.xticks(years)

plt.show()
../../_images/other-viz_28_0.png
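
Listing every column by hand becomes tedious as the number of variables grows. As an alternative sketch, stackplot() also accepts a single 2D array in which each row is one layer, so a DataFrame can be transposed and passed in directly. The example below uses made-up shares for three hypothetical regions rather than the enc data:

```python
import pandas as pd
from matplotlib import pyplot as plt

# Hypothetical shares for three made-up regions; each row sums to 1
shares = pd.DataFrame(
    {"Region A": [0.5, 0.4, 0.3],
     "Region B": [0.3, 0.3, 0.3],
     "Region C": [0.2, 0.3, 0.4]},
    index=pd.Index([2000, 2001, 2002], name="Year"),
)

fig, ax = plt.subplots()
# Transpose so that each row of the array is one layer of the stack
layers = ax.stackplot(shares.index.to_numpy(), shares.to_numpy().T,
                      labels=list(shares.columns))
ax.legend(bbox_to_anchor=(1.3, 1))
ax.margins(0, 0)
ax.set_xticks(shares.index.to_numpy())
```

Calling plt.show() afterwards displays the figure. Taking the labels from shares.columns keeps the legend in step with the data if columns are added or reordered.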

We can now see how each state's share of youth suicides in the East North Central states changed from 2000-2018, and how these shares fluctuated over time.

Conclusions#

In this section, we were introduced to a new data visualization library: seaborn. The seaborn library can make many of the same visualizations available in matplotlib and can be used as an alternative in cases where more flexibility is needed.

We learned how to make box and whisker plots and about the multiple statistics that these plots innately show. Box and whisker plots can be made in both matplotlib and seaborn, but seaborn offers an easy way to overlay data points on the box and whisker plot.

We also learned about heatmaps and their ability to show multidimensional data.

Lastly, we learned how to construct area plots as another way to show proportional trends over time, combining benefits of both a line graph and a pie chart.

Documentation for the functions introduced in this section can be found below: