Categorical Data#

Categorical data considers numerical quantities in the context of categorical variables. Surveys, like the ones we see on the television show Family Feud or the frequency of people with various eye colors, are examples of categorical data.

In this chapter, there are two types of categorical data that we consider: ordinal data and nominal data.

Ordinal data consists of data that can be described as having a meaningful order, ranking, or relationship between categories. An inventory that quantifies the number of small, medium, and large shirts in stock is an example of ordinal data because there is a ranked relationship between shirt sizes.

Nominal data can be described as named categories that have no meaningful relationship to one another. Counting the number of people with black, brunette, red, and blonde hair colors in a room is an example of nominal data because hair color has no inherit meaning amongst each other - one hair color is not greater than or less than the others.

Categorical data can be visualized using bar graphs and pie charts, and in this section, we will practice making such visualizations.

Again, we’ll load the necessary libraries and data.

import numpy as np
import pandas as pd
import seaborn as sns

from matplotlib import pyplot as plt

plt.style.use('fast')

military = pd.read_csv("../../data/NorthAmerica_Military_USD-PercentGDP_Combined.csv", index_col='Year')

military

	CAN-PercentGDP	MEX-PercentGDP	USA-PercentGDP	CAN-USD	MEX-USD	USA-USD
Year
1960	4.185257	0.673509	8.993125	1.702443	0.084000	47.346553
1961	4.128312	0.651780	9.156031	1.677821	0.086400	49.879771
1962	3.999216	0.689655	9.331673	1.671314	0.099200	54.650943
1963	3.620650	0.718686	8.831891	1.610092	0.112000	54.561216
1964	3.402063	0.677507	8.051281	1.657457	0.120000	53.432327
...	...	...	...	...	...	...
2016	1.164162	0.495064	3.418942	17.782776	5.336876	639.856443
2017	1.351602	0.436510	3.313381	22.269696	5.062077	646.752927
2018	1.324681	0.477517	3.316249	22.729328	5.839521	682.491400
2019	1.278941	0.523482	3.427080	22.204408	6.650808	734.344100
2020	1.415056	0.573652	3.741160	22.754847	6.116377	778.232200

61 rows × 6 columns

Bar graphs#

Bar graphs are a popular method to visualize categorical data. They’re simple, concise, and can condense large and complex datasets into a simple visual summary. Most bar graphs depict a categorical element as an independent variable on the x-axis while the height of the bar corresponds to a numerical value on the y-axis.

We will practice making bar graphs using our military dataset in the context of an ordinal variable.

First, let’s create a graph to examine the percent of the GDP spent on the military in Canada. We will look at the years 2018, 2019 and 2020.

To do this, we must extract the data for the years of interest from the column containing the data pertaining to GDP percentage of military spending in Canada. We will call this can_gdp.

can_gdp = military.loc[[2018, 2019, 2020], ['CAN-PercentGDP']]

can_gdp

	CAN-PercentGDP
Year
2018	1.324681
2019	1.278941
2020	1.415056

To make a bar graph from a DataFrame, the plot.bar() method can be used. We will use this method on can_gdp:

can_gdp.plot.bar()
plt.show()

We can also use plt.bar() to create a bar graph using the pyplot library. The plt.bar() function needs two arguments. The first argument, x, is an array of values to be plotted on the x-axis.

The second argument, height, determines the height of the bars (the y-values).

We will create a list of our years of interest and call it year_labels to input as the first argument and use the CAN-PercentGDP column of can_gdp as our second argument.

year_labels = ['2018', '2019', '2020']
plt.bar(year_labels, can_gdp["CAN-PercentGDP"])
plt.show()

The above code produced a plot, but this plot needs more descriptive labeling to help others understand the data.

We need to add axis labels and a title to communicate what is being measured. Aesthetically, we can also reduce the width of each bar to give more room on the graph and more rest for our eyes.

plt.bar(year_labels, can_gdp["CAN-PercentGDP"], width=0.25)

plt.title('Military Spending in Canada')

plt.ylabel('Percentage of GDP')
plt.xlabel('Year')

plt.show()

This plot looks better and is a lot more descriptive.

Let’s add the data from Mexico and the United States.

To do this, we will once again use the plt.subplots() function. This time, we will specify our figure with a single axis called ax. Because we want to group our data by year, we can call ax.bar() to set precise positions on the x-axis for each country.

We will also use plt.tight_layout() to automatically adjust the subplot dimensions to give appropriate spacing between the bars and the axes boundaries.

can_gdp = military.loc[[2018, 2019, 2020], ['CAN-PercentGDP']]
mex_gdp = military.loc[[2018, 2019, 2020], ['MEX-PercentGDP']]
usa_gdp = military.loc[[2018, 2019, 2020], ['USA-PercentGDP']]

index = np.arange(len(year_labels))

(fig, ax) = plt.subplots()

# Offsets the bars for Canada by -0.25in
ax.bar(index - 0.25, can_gdp["CAN-PercentGDP"], width=0.25) 

# Plots the bars for Mexico in the middle
ax.bar(index, mex_gdp["MEX-PercentGDP"], width=0.25)

# Offsets the bars for the U.S. by +0.25in
ax.bar(index + 0.25, usa_gdp["USA-PercentGDP"], width=0.25)

plt.tight_layout()

plt.show()

We were able to create a bar plot with all three data sets together. Now, let’s add the appropriate titles, axis labels, and other details using previously described functions.

If we assign each country’s bar to a variable, we can also label each individual bar with the associated numerical value by calling the bar_label() method on ax. In order to label the bars, the label must be specified when creating each each bar.

(fig, ax) = plt.subplots()


# Creates variable for each country and offsets as appropriate
# Annotates bar with the associated value, rounded to 2 places after the decimal
can_bar = ax.bar(index - 0.25, can_gdp["CAN-PercentGDP"].round(decimals=2), width=0.25, label='Canada') 
mex_bar = ax.bar(index, mex_gdp["MEX-PercentGDP"].round(decimals=2), width=0.25, label='Mexico')
usa_bar = ax.bar(index + 0.25, usa_gdp["USA-PercentGDP"].round(decimals=2), width=0.25, label='USA')


# Add labels and titles for entire figure
plt.title("Military Spending in North America", pad=10)
plt.ylabel('Percentage of GDP')
plt.xlabel('Year')
plt.xticks(index, year_labels)
plt.ylim(0, 5)


# Creates legend for the entire figure
plt.legend(loc=4, bbox_to_anchor=(1.3, 0.5))


# Add labels for individual bars; gives spacing (padding) between the value and the bar
ax.bar_label(can_bar, label_type="edge", padding=4)
ax.bar_label(mex_bar, label_type="edge", padding=4)
ax.bar_label(usa_bar, label_type="edge", padding=4)


plt.tight_layout()

plt.show()

Great! Now we have a well annotated, visually appealing graph that depicts an important message about the data: the percentage of the GDP spent on the military for each country for the years 2018-2020.

From this graph, we can easily see that during this time period, Canada and Mexico contribute a smaller proportion of their GDP to military spending than the United States. This may not have been easily discernible by just looking at our large data table.

Horizontal Bar Graphs#

There may be times when you want to present data as a horizontal bar graph. Using the dollar amounts spent in 2020 for each North American country, we can create a horizontal bar graph to represent nominal data. First, we need to extract the data for the year of 2020:

year2020 = military.loc[[2020]][['CAN-USD', 'MEX-USD', 'USA-USD']]

From this, we obtain a DataFrame with a single row of data with values for three variables. To format this data for plotting, we can tranpose the DataFrame using the transpose() method:

year2020 = year2020.transpose()
year2020

Year	2020
CAN-USD	22.754847
MEX-USD	6.116377
USA-USD	778.232200

The DataFrame in this format can directly be used for plotting. To create a horiztonal bar graph, we can use the plt.barh() function. This function requires arguments for the y and width parameters. The y parameter is the categorical variables to be plotted, which are usually displayed on the x-axis of a regular bar graph. The width parameter corresponds to the numerical values that are associated with each categorical variable.

We will use the index of the year2020 DataFrame as the an argument for the y parameter and the values in the 2020 column as an argument for the width parameter:

plt.barh(y = year2020.index, width = year2020[2020])
plt.show()

In the above graph, the index is used to label the categories on the y-axis. If desired, this can be changed by using a list of the same size as an argument for the tick_label parameter. Additional titling and labeling can be added to this graph as well:

plt.barh(y = year2020.index, width = year2020[2020], tick_label = ['Canada', 'Mexico', 'USA'])
plt.title('Military Spending in North American in 2020 (USD)')
plt.xlabel('USD')
plt.ylabel('Country')
plt.show()

For further customization, the bars can be colored too. To do this, the bar graph will need to be defined as a variable and then each bar color can be set by calling for the index of the bar and using the set_color() method like so:

barh = plt.barh(y = year2020.index, width = year2020[2020], tick_label = ['Canada', 'Mexico', 'USA'])

barh[0].set_color('red')
barh[1].set_color('green')
barh[2].set_color('blue')

plt.title('Military Spending in North American in 2020 (USD)')
plt.xlabel('USD')
plt.ylabel('Country')
plt.show()

The same can be done for regular bar graphs made using plt.bar(). Depending on your visualization preferences, plt.bar() and plt.barh() provide multiple options for constructing a bar graph.

Pie charts#

Pie charts are a commonly used visualization method to represent proportions in datasets. Pie charts use wedges to represent the numerical value of a proportion corresponding to a categorial variable.

While pie charts are very common and can be easily interpreted by a layperson audience, they may not be the best way to represent data in certain cases. Firstly, because pie charts use the area of a circle to represent the proportion of a categorical variable, it can be difficult to gauge the numerical value that a wedge represents if the area doesn’t appear as an easily discernible fraction (e.g. ½, ⅓, ¼). This can be aided with the help of labels and legends that explicitly show the numerical values associated with the wedges of the pie chart. Secondly, if you want to visualize many categorical variables or variables that make up a significantly small proportion of the dataset, it may be difficult to see the variable on a pie chart. Overall, pie charts can be a simple and effective way to communicate proportional categorical data, but before using them, consider what attributes of the data need to be highlighted to help decide if a pie chart is the most appropriate visualization method.

We will shift our focus to a different dataset. This is a dataset of the number of suicides amongst youth ages 10-24 from the years 2000 to 2018 within east north central states (Illinois, Indiana, Michigan, Ohio, and Wisconsin). This data comes from a larger dataset gathered by the Center for Disease Control (CDC). More information on this dataset, as well as data from other states, can be found here.

We will load the dataset below as enc:

enc = pd.read_csv("../../data/east-north-central_suicides.csv", index_col='Year')
enc

	Illinois	Indiana	Michigan	Ohio	Wisconsin
Year
2000	155	112	138	151	109
2001	172	98	143	165	100
2002	168	104	150	168	112
2003	131	80	132	139	103
2004	160	99	148	199	111
2005	139	114	140	187	104
2006	135	98	124	183	71
2007	173	104	138	178	110
2008	144	94	146	202	80
2009	151	115	143	157	104
2010	151	107	182	193	113
2011	182	95	169	211	124
2012	160	127	199	197	108
2013	180	127	196	180	107
2014	187	128	211	197	127
2015	211	137	209	224	122
2016	184	157	217	215	143
2017	235	165	212	268	126
2018	201	179	251	271	114

Next, we will make a pie chart using the plt.pie() function. We will use data from the year 2018 to visualize the proportions of suicide from each state in the region. We’ll also set the figure size, in inches, using plt.figure().

plt.figure(figsize=(10, 6))
plt.pie(enc.loc[2018])
plt.show()

Now that we have a pie chart, let’s add some more detail to it to make it more descriptive.

We can label the wedges of the chart so that we know which state corresponds to which color. Likewise, we can label the percentage of each sector to know the definitive proportion of each state’s contribution to the total number of suicides in 2018.

To do this, we will use the columns of enc and define it as states. We then will use states as an argument for the labels parameter within plt.pie() and specify formatting for the autopct parameter, which labels the wedges using the Python’s formatting style. More information on that format can be found here.

states = enc.columns
plt.figure(figsize=(10, 6))
plt.pie(enc.loc[2018], labels=states, autopct='%.1f')
plt.show()

This plot is okay, but it can be better.

The percentages may be difficult to see as the labeling competes with the color of the wedge. Instead, let’s add the percentages into a legend along with the labels of each sector. Let’s also add a title so others can know what they are looking at when they view this chart.

plt.figure(figsize=(10, 6))

patches, text = plt.pie(x = enc.loc[2018])
labels = ['Illinois (19.8 %)', 'Indiana (17.6 %)', 'Michigan (24.7 %)', 'Ohio (26.7 %)', 'Wisconsin (11.2 %)']

plt.legend(patches, labels, loc=4, bbox_to_anchor=(1.6, 0.3), fontsize=15)
plt.title("Proportion of Suicides Amongst Youth Ages 10-24 in East North Central States (2018)",  loc = 'left',
         fontsize = 13)

plt.show()

Above, we used plt.pie() in a way that we had not used it before.

Under the hood, the plt.pie() function returns two default outputs, which we named: patches and text. The size of each wedge is dictated by the patches object. The text object consists of a list of labels for our data. Here, we needed to specifically assign patches and text objects so we could use patches as an argument for the plt.legend() function.

The plt.legend() function has two required arguments. The first argument dictates what is being labeled. In our case, the wedges of the pie chart (i.e. the patches object) are being labeled. The second argument dictates how things are labeled. Here, we simply created a variable called labels, which consists of the fiv strings for the five wedges:

labels

['Illinois (19.8 %)',
 'Indiana (17.6 %)',
 'Michigan (24.7 %)',
 'Ohio (26.7 %)',
 'Wisconsin (11.2 %)']

The other arguments, bbox_to_anchor and fontsize, are optional when using the plt.legend() function.

The argument bbox_to_anchor designates the position in the plotting area where the legend will be, while the fontsize argument dictates the font size, in points, of the legend text.

As mentioned previously, labeling can greatly enhance the efficiency of of a pie chart’s ability to communicate information. Because some of the wedges are very similar in size, it can be hard to discern the numeric value associated with each state. Labeling each wedge with the percentage and associated category or making a legend that depicts this information leaves less room for ambiguity when it comes to the data.

Conclusions#

In this section, we were introduced to the plt.bar() and plt.pie() functions to construct bar plots and pie charts, respectively.

The plt.bar() function requires x and height arguments, which can be an array of number values, but other parameters can be included.

The plt.pie() function only requires an x argument as an array of values and has other parameters that can be utilized well.

Both of these types of visualizations are used for depicting categorical data.

As a reminder, when deciding on whether to use a pie chart, consider certain attributes of the data, such as the number of categorical variables or the size of the proportions to be plotted. Below is a list of functions with linked documentation for your reference and further reading:

Introduction to Data Science I & II

Categorical Data

Contents

Categorical Data#

Bar graphs#

Horizontal Bar Graphs#

Pie charts#

Conclusions#