Stat 112 - Recitation 11

Ozancan Ozdemir - ozancan@metu.edu.tr

Data Visualization with Seaborn

Seaborn is an open-source Python library for statistical visualization. It is built on top of matplotlib and is meant to be used in conjunction with it rather than as a replacement. It is designed to work closely with pandas data structures as well as NumPy arrays, allowing users to quickly create plots from large datasets.

It makes plotting pandas Series and DataFrames easier than with matplotlib alone. It is also an efficient tool for producing attractive plots through a high-level interface: it comes with a range of predefined styles and color palettes, which makes it easy to create visually appealing charts with just a few lines of code. This makes it a useful tool for Exploratory Data Analysis (EDA).

Before starting your analysis, you have to import the seaborn package together with matplotlib.pyplot.

In [57]:
import matplotlib.pyplot as plt
import seaborn as sns

After that, we also import the other necessary packages for the analysis.

In [58]:
import numpy as np
import pandas as pd

Before proceeding, I'd like to show you the general structure of the seaborn functions.

sns.function_name(data = df, x = "x", y = "y", ...)   # ... stands for function-specific arguments
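
As a minimal sketch of this pattern, the call below uses a small made-up data frame (`toy`, with hypothetical columns `x` and `y`) rather than the lab data. The data frame is passed through `data`, `x` and `y` name columns in it, and any further keyword arguments (here `color`) depend on the chosen function.

```python
import pandas as pd
import seaborn as sns

# A tiny, made-up data frame used only to illustrate the call pattern
toy = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2.0, 4.1, 5.9, 8.2]})

# data frame first, then the column names for x and y, then function-specific options
ax = sns.scatterplot(data=toy, x="x", y="y", color="darkred")
```

By default the axis labels are taken from the column names, which is why descriptive column names pay off.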

In this lab, we will explore the features of seaborn using a famous dataset, mpg.

This dataset contains a subset of the fuel economy data that the EPA makes available on https://fueleconomy.gov/. It contains only models which had a new release every year between 1999 and 2008 - this was used as a proxy for the popularity of the car.

This data is already available in Python, but we import the data since we also would like to practice our pandas skills :)

The variables are described below.

  • manufacturer: manufacturer name

  • model: model name

  • displ: engine displacement, in litres

  • year: year of manufacture

  • cyl: number of cylinders

  • trans: type of transmission

  • drv: the type of drive train, where f = front-wheel drive, r = rear-wheel drive, 4 = 4wd

  • cty: city miles per gallon

  • hwy: highway miles per gallon

  • fl: fuel type

  • class: "type" of car

Import the mpg.xlsx dataset into your Colab environment.

In [59]:
mpg = pd.read_excel('mpg.xlsx')
mpg.head()
Out[59]:
manufacturer model displ year cyl trans drv cty hwy fl class
0 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
1 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
2 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
3 audi a4 2.0 2008 4 auto(av) f 21 30 p compact
4 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
In [60]:
mpg.shape
Out[60]:
(234, 11)

The dataset contains 234 observations and 11 variables.

In [61]:
mpg.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 234 entries, 0 to 233
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   manufacturer  234 non-null    object 
 1   model         234 non-null    object 
 2   displ         234 non-null    float64
 3   year          234 non-null    int64  
 4   cyl           234 non-null    int64  
 5   trans         234 non-null    object 
 6   drv           234 non-null    object 
 7   cty           234 non-null    int64  
 8   hwy           234 non-null    int64  
 9   fl            234 non-null    object 
 10  class         234 non-null    object 
dtypes: float64(1), int64(4), object(6)
memory usage: 20.2+ KB
In [62]:
mpg.isnull().sum().sum() #no missing
Out[62]:
0

The data has no missing values.

We will explore the package by asking some research questions. Before that, the following figure briefly summarizes the main functions and what they are used for.

[Figure: overview of the seaborn plotting functions. Source: Seaborn]

According to the figure above, seaborn has three general functions (relplot, displot and catplot), and the plot type is chosen with the kind argument. Alternatively, we can use the function designed for a specific plot. For example, a bar plot can be drawn either with catplot(kind="bar") or with barplot().
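
To make the difference between the two kinds of functions concrete, here is a minimal sketch on a made-up data frame (`toy` and its `group` column are assumptions for illustration only). The general function creates its own figure and returns a FacetGrid, while the plot-specific function draws into a matplotlib Axes.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up categorical data, only for illustration
toy = pd.DataFrame({"group": ["a", "b", "a", "c", "a", "b"]})

# General (figure-level) function: creates its own figure and returns a FacetGrid
g = sns.catplot(data=toy, x="group", kind="count")

# Plot-specific (axes-level) function: draws into an existing matplotlib Axes
fig, ax = plt.subplots()
ax = sns.countplot(data=toy, x="group", ax=ax)

print(type(g).__name__)
```

This distinction matters later when we set the figure size, because the two kinds of functions are sized in different ways.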

Univariate Analysis

In this part, we will explore the variable properties individually.

RQ1: What is the shape of the distribution of highway miles per gallon (hwy) of the car models?

The most common approach to visualizing a distribution is the histogram. This is the default approach in displot(), which uses the same underlying code as histplot().

In [159]:
sns.displot(mpg, x = "hwy")
Out[159]:
<seaborn.axisgrid.FacetGrid at 0x7f34ba02ca60>

Since the basic histogram requires only one variable, we specify only x.

Seaborn provides several themes that help us improve the appearance of the plot. We can set the default theme by calling seaborn's set() function.

In [64]:
sns.set()
In [65]:
sns.displot(mpg, x = "hwy")
Out[65]:
<seaborn.axisgrid.FacetGrid at 0x7f34c107ea30>

We can also set the background through the function set_style. Some possible settings are:

In [66]:
sns.set_style("dark")
sns.displot(mpg, x = "hwy")
Out[66]:
<seaborn.axisgrid.FacetGrid at 0x7f34c0fc91f0>
In [67]:
sns.set_style("whitegrid")
sns.displot(mpg, x = "hwy")
Out[67]:
<seaborn.axisgrid.FacetGrid at 0x7f34c0fd46a0>
In [68]:
sns.set_style("white")
sns.displot(mpg, x = "hwy")
Out[68]:
<seaborn.axisgrid.FacetGrid at 0x7f34c35951f0>
In [69]:
sns.set() # restore the default seaborn theme

You can improve your plot by adding a title, subtitle or caption and by setting the axis labels. To do so, you can use either seaborn methods (set, for example) or matplotlib commands.

In [70]:
sns.displot(mpg, x = "hwy").set(title = 'The Distribution of Highway Miles per Gallon').set_axis_labels('Highway Miles per Gallon', 'Count')

# .set(title = "...") adds the title
# .set_axis_labels("x label", "y label") sets the axis labels
Out[70]:
<seaborn.axisgrid.FacetGrid at 0x7f34c0f1c2b0>
In [71]:
sns.displot(mpg, x = "hwy")
plt.title('The Distribution of Highway Miles per Gallon')
plt.xlabel('Highway Miles per Gallon')
plt.ylabel('Count')
Out[71]:
Text(12.334999999999994, 0.5, 'Count')

The size of the figure is controlled by setting the height argument.

In [140]:
sns.set()
sns.displot(mpg, x = "hwy",height = 10)
plt.title('The Distribution of Highway Miles per Gallon',fontweight="bold",fontsize=14,loc="left") # loc: location
plt.suptitle('Car models between 1999-2008',fontsize=10,x=0.08,y=0.98,ha="left") #ha: horizontal alignment, x and y arranges the position
plt.xlabel('Highway Miles per Gallon' )
plt.ylabel('Count')
plt.ylim(0,75)
plt.show()

As in matplotlib, you can also set the figure size with the figsize argument of plt.figure. However, this does not work with displot, which creates its own figure; use histplot instead.

In [73]:
sns.set()
plt.figure(figsize = (12,6))
sns.histplot(mpg, x = "hwy")
plt.title('The Distribution of Highway Miles per Gallon',fontweight="bold",fontsize=14,loc="left") # loc: location
plt.xlabel('Highway Miles per Gallon' )
plt.ylabel('Count')
plt.show()

RQ2: What is the distribution of the manufacturers of the car models released between 1999 and 2008?

Since manufacturer is a categorical variable, we can display its distribution with a bar plot. There are two ways of drawing a bar plot in seaborn: catplot (categorical plot) and barplot (bar plot).

In [74]:
sns.catplot(data=mpg, x="manufacturer", kind="count")
Out[74]:
<seaborn.axisgrid.FacetGrid at 0x7f34c0dc0790>

Let's make it better.

First order the bars by frequency, then change the color of the bars and flip the axes so that the bars are horizontal.

In [150]:
sns.catplot(data=mpg, y="manufacturer", 
            kind="count",
            order=mpg.manufacturer.value_counts().index, 
            color='blue') 

# assign the variable of interest to y, rather than x, to get horizontal bars
# order determines the order of the axis ticks
# color sets the fill color of the bars
Out[150]:
<seaborn.axisgrid.FacetGrid at 0x7f34baa8b9d0>

You can use a color palette instead of a fixed color. The available color palettes are listed in the seaborn documentation.

As with the previous example, the size of the figure is controlled by setting the height.

In [144]:
sns.catplot(data=mpg, y="manufacturer", 
            kind="count",
            order=mpg.manufacturer.value_counts().index,palette="icefire",
            height = 12) 
plt.title('The Frequency Distribution of Manufacturer', fontweight="bold",fontsize=14)
plt.show()

You can draw the same plot using the barplot function. To do so, you first need to build a frequency table from the raw data.

In [168]:
freq_table = pd.DataFrame(mpg['manufacturer'].value_counts())
freq_table
Out[168]:
manufacturer
dodge 37
toyota 34
volkswagen 27
ford 25
chevrolet 19
audi 18
hyundai 14
subaru 14
nissan 13
honda 9
jeep 8
pontiac 5
land rover 4
mercury 4
lincoln 3
In [170]:
sns.barplot(data = freq_table, x = freq_table.index, y= 'manufacturer', color  = 'darkred' )
plt.title('The Frequency Distribution of Manufacturer', fontweight="bold",fontsize=14)
plt.show()
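
Note that the column naming produced by value_counts() differs across pandas versions (from pandas 2.0 on, the counts column is named count), so the y = 'manufacturer' reference above may not match on your installation. A version-independent sketch, shown here on a small made-up series instead of mpg['manufacturer'], names both columns explicitly:

```python
import pandas as pd

# Made-up stand-in for mpg['manufacturer']
makers = pd.Series(
    ["audi", "dodge", "dodge", "ford", "dodge", "audi"], name="manufacturer"
)

# Name the index and the counts column explicitly so that the result
# looks the same regardless of the pandas version
freq_table = (
    makers.value_counts()
    .rename_axis("manufacturer")
    .reset_index(name="count")
)
print(freq_table)
```

The resulting table can then be plotted with sns.barplot(data = freq_table, x = 'manufacturer', y = 'count').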

Multivariate Analysis

In this part, we will draw plots that include more than one variable at the same time.

RQ3: What is the relationship between displacement and highway miles per gallon of the cars released between 1999 and 2008?

Now we have two numerical variables. The classical approach here is the scatter plot.

In [90]:
plt.figure(figsize = (10,10)) # set the figure size 
sns.scatterplot(data=mpg, x="displ", y="hwy")
plt.title('The Relationship between Displacement and Highway',fontsize = 15, fontweight = "bold")
plt.xlabel('Displacement')
plt.ylabel('Highway Miles Per Gallon')
plt.show()

If you would like to draw a scatter plot with a trend line, you can use the lmplot function.

In [185]:
sns.lmplot(data=mpg, x="displ", y="hwy", height = 10)
plt.title('The Relationship between Displacement and Highway',fontsize = 15, fontweight = "bold")
plt.xlabel('Displacement')
plt.ylabel('Highway Miles Per Gallon')
plt.show()
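
lmplot draws the fitted line but does not return its coefficients. If the slope and intercept are needed, one option (an addition here, not part of the original lab) is to fit the same kind of straight line with np.polyfit. The sketch below uses a handful of made-up points in place of displ and hwy.

```python
import numpy as np

# Made-up points standing in for mpg['displ'] and mpg['hwy']
displ = np.array([1.8, 2.0, 2.8, 3.1, 4.2, 5.3])
hwy = np.array([29, 31, 26, 27, 23, 20])

# A degree-1 polynomial fit is the least-squares line; it returns [slope, intercept]
slope, intercept = np.polyfit(displ, hwy, deg=1)
print(f"hwy ~ {intercept:.2f} + {slope:.2f} * displ")
```

A negative slope here corresponds to the downward trend line in the plot.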

RQ4: What is the relationship between displacement and highway miles per gallon of the cars released by year?

Now we add a third variable to the relationship we are examining. In seaborn, we use the hue argument to include a third variable in the plot.

In [93]:
plt.figure(figsize = (10,10)) # set the figure size 
sns.scatterplot(data=mpg, x="displ", y="hwy",hue = 'year', palette = ['darkred','orange']) #palette arranges the hue color 
plt.title('The Relationship between Displacement and Highway',fontsize = 15, fontweight = "bold")
plt.xlabel('Displacement')
plt.ylabel('Highway Miles Per Gallon')
plt.show()

RQ5: How does city miles per gallon vary by transmission type?

In this question, we have one categorical and one numerical variable. Thus, we can draw either a box plot or a violin plot.

Let's draw both!

We use the boxplot() function in seaborn to draw the box plot.

In [96]:
plt.figure(figsize = (13,13)) # set the figure size 
sns.boxplot(data=mpg, x="trans", y="cty") #x: the variable on the x axis, y: the variable on the y axis 
plt.title('The Relationship between Transmission and City miles per gallon',fontsize = 15, fontweight = "bold")
plt.xlabel('Transmission')
plt.ylabel('City miles per gallon')
plt.show()
In [100]:
plt.figure(figsize = (13,13)) # set the figure size 
sns.boxplot(data=mpg, x="trans", y="cty",
            palette = 'Paired') # x: the variable on the x axis, y: the variable on the y axis, palette: assigns a color palette
plt.title('The Relationship between Transmission and City miles per gallon',fontsize = 15, fontweight = "bold")
plt.xlabel('Transmission')
plt.ylabel('City miles per gallon')
plt.show()

If you intend to draw a violin plot, you can use the violinplot function with the same syntax.

In [101]:
plt.figure(figsize = (13,13)) # set the figure size 
sns.violinplot(data=mpg, x="trans", y="cty",
            palette = 'Paired') # x: the variable on the x axis, y: the variable on the y axis, palette: assigns a color palette
plt.title('The Relationship between Transmission and City miles per gallon',fontsize = 15, fontweight = "bold")
plt.xlabel('Transmission')
plt.ylabel('City miles per gallon')
plt.show()
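
The box in a box plot spans the first to the third quartile, with a line at the median. As a rough cross-check of what the plot shows, the same summaries can be computed with pandas. The sketch below uses a small made-up data frame with the same column names as mpg.

```python
import pandas as pd

# Made-up data with the same column names as mpg
toy = pd.DataFrame({
    "trans": ["auto(l5)"] * 4 + ["manual(m5)"] * 4,
    "cty":   [16, 18, 18, 21, 19, 21, 22, 26],
})

# Quartiles of cty within each transmission type (the box edges and the median line)
quartiles = toy.groupby("trans")["cty"].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)
```

The violin plot shows the same groups but replaces the box with a density estimate, so it also reveals features such as multiple peaks.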

RQ6: How does highway miles per gallon vary by the number of cylinders in the two different years?

In this research question, we have three numerical variables. However, two of them, cylinder and year, can be treated as categorical.

Thus, we can again construct either a box plot or a violin plot. In contrast to the previous example, we have a third variable, year, which can be included in the plot using the hue argument.

In [104]:
plt.figure(figsize = (13,13)) # set the figure size 
sns.boxplot(data=mpg, x="cyl",
            y="hwy",
            hue = 'year',
            palette = ['darkred','orange']) # x: the variable on the x axis, y: the variable on the y axis, hue: the grouping variable, palette: the colors of the hue levels
plt.title('The Relationship between Number of Cylinders and Highway Miles per Gallon by Year',fontsize = 15, fontweight = "bold")
plt.xlabel('Number of Cylinders')
plt.ylabel('Highway Miles per Gallon')
plt.show()

RQ7: Display the correlation among the numerical variables in the data

Instead of asking a research question, we will show you how to visualize the correlations among the numerical variables in the data. There are several alternatives, and the heatmap is one of them. You can use the heatmap() function for this purpose.

To do so, we first create a correlation matrix.

In [108]:
corr_data = mpg.corr(numeric_only=True) # calculate the correlation table using only the numerical columns
corr_data
Out[108]:
displ year cyl cty hwy
displ 1.000000 0.147843 0.930227 -0.798524 -0.766020
year 0.147843 1.000000 0.122245 -0.037232 0.002158
cyl 0.930227 0.122245 1.000000 -0.805771 -0.761912
cty -0.798524 -0.037232 -0.805771 1.000000 0.955916
hwy -0.766020 0.002158 -0.761912 0.955916 1.000000
In [109]:
plt.figure(figsize = (10,10)) # set the figure size 
sns.heatmap(corr_data)
Out[109]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f34c0321a60>
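
The heatmap is easier to read when the coefficients are printed in the cells and the color scale is fixed to the range of a correlation. The sketch below shows one way to do this with a few optional heatmap arguments that the lab does not use (annot, fmt, vmin, vmax, cmap); the data are simulated and only borrow column names from mpg.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Simulated data; the column names are borrowed from mpg for readability
rng = np.random.default_rng(1)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["displ", "cty", "hwy"])
corr = toy.corr()

fig, ax = plt.subplots(figsize=(6, 5))
# annot=True prints each coefficient; vmin/vmax pin the color scale to [-1, 1]
sns.heatmap(corr, annot=True, fmt=".2f", vmin=-1, vmax=1, cmap="coolwarm", ax=ax)
```

With a diverging palette such as coolwarm, negative and positive correlations get visually distinct colors.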

RQ8: What is the relationship between highway miles and displacement in the levels of drive train (drv)?

Recall that we investigated the relationship between hwy and displ in RQ3. We also looked at the relationship between these two variables by including a third variable, year.

Instead of coloring the plot by the third variable, we can draw a separate plot for each level of the third variable, because patterns may be harder to detect when all levels are colored in a single plot.

In this case, we will use FacetGrid, which creates a multi-plot grid for plotting conditional relationships.

In [112]:
g = sns.FacetGrid(mpg, col="drv")
g.map(sns.scatterplot, "displ", "hwy")
Out[112]:
<seaborn.axisgrid.FacetGrid at 0x7f34bc675e80>
<Figure size 1152x360 with 0 Axes>

The size and shape of the plot are specified at the level of each subplot using the height and aspect parameters:

In [126]:
g = sns.FacetGrid(mpg, col="drv", height=6.5, aspect=.65)
g.map(sns.scatterplot, "displ", "hwy")
Out[126]:
<seaborn.axisgrid.FacetGrid at 0x7f34bbf42640>

If you intend to add a title to your grid plot, use fig.suptitle together with fig.subplots_adjust, which leaves room for the title at the top.

In [127]:
g = sns.FacetGrid(mpg, col="drv", height=6.5, aspect=.65)
g.map(sns.scatterplot, "displ", "hwy")
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle('The Relationship Between Hwy and Displ by Drive')
Out[127]:
Text(0.5, 0.98, 'The Relationship Between Hwy and Displ by Drive')
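
The same kind of grid can also be produced in a single call with relplot and its col argument. This is offered as an alternative sketch rather than the approach used in this lab; the data frame below is made up and only mimics the mpg column names.

```python
import pandas as pd
import seaborn as sns

# Made-up data with the same column names as mpg
toy = pd.DataFrame({
    "displ": [1.8, 2.0, 2.8, 3.1, 4.2, 5.3],
    "hwy":   [29, 31, 26, 27, 23, 20],
    "drv":   ["f", "f", "4", "4", "r", "r"],
})

# col="drv" draws one scatter plot per drive-train level
g = sns.relplot(data=toy, x="displ", y="hwy", col="drv", kind="scatter", height=3)
g.fig.suptitle("hwy vs. displ by drive train", y=1.03)
```

Using FacetGrid directly is more flexible when a custom plotting function has to be mapped onto each panel.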

RQ9: What are the associations between numerical variables in the data? What are the distributions of them?

As you can see, we answer two research questions at the same time. Note that the scatter_matrix function in pandas.plotting returns a scatter plot matrix with histograms on the diagonal.

In seaborn, we can draw a similar plot using pairplot.

In [128]:
sns.pairplot(mpg)
Out[128]:
<seaborn.axisgrid.PairGrid at 0x7f34bc37af10>

As with other figure-level functions, the size of the figure is controlled by setting the height of each individual subplot:

In [129]:
sns.pairplot(mpg, height=1.5)
Out[129]:
<seaborn.axisgrid.PairGrid at 0x7f34bb5a6b80>

RQ10: What are the associations between the numerical variables in the data in the two different years? What are their distributions?

In addition to the previous example, we will include a third variable, year. Remember that we use the hue argument to add the third variable.

In [146]:
sns.pairplot(mpg, height=1.5, hue = 'year' , palette = ['darkred','orange'])
Out[146]:
<seaborn.axisgrid.PairGrid at 0x7f34bac23eb0>

If you want to draw a single panel of this matrix on its own, you can use the jointplot function.

In [148]:
sns.jointplot(data=mpg,x="displ",y="hwy",hue="year", palette = ['darkred','orange'])
Out[148]:
<seaborn.axisgrid.JointGrid at 0x7f34ba2bb5e0>


Exercise 1

Consider the flights.xlsx data. The data description is given below.

Description: On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) on 1-1-2013.

year, month, day : Date of departure.

dep_time, arr_time : Actual departure and arrival times (format HHMM or HMM), local tz.

sched_dep_time, sched_arr_time : Scheduled departure and arrival times (format HHMM or HMM), local tz.

dep_delay, arr_delay: Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.

carrier: Two letter carrier abbreviation. See airlines to get name.

flight: Flight number.

tailnum: Plane tail number. See planes for additional metadata.

origin, dest: Origin and destination. See airports for additional metadata.

air_time: Amount of time spent in the air, in minutes.

distance: Distance between airports, in miles.

hour, minute : Time of scheduled departure broken into hour and minutes.

time_hour : Scheduled date and hour of the flight as a POSIXct date. Along with origin, can be used to join flights data to weather data.

temp, dewp : Temperature and dewpoint in F.

humid : Relative humidity.

wind_dir, wind_speed, wind_gust : Wind direction (in degrees), speed and gust speed (in mph).

precip : Precipitation, in inches.

pressure : Sea level pressure in millibars.

visib : Visibility in miles.

time_hour : Date and hour of the recording as a POSIXct date.
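
Because dep_time and arr_time are stored as integers in HHMM or HMM format, the gap between 559 and 600 is one minute, not 41. If a continuous time axis is needed, one possible conversion (a sketch on made-up values, not a required step of the exercise) is to compute minutes since midnight:

```python
import pandas as pd

# Made-up departure times in HHMM / HMM format
dep_time = pd.Series([517, 533, 1455, 2359])

hours = dep_time // 100      # integer division keeps the hour part, e.g. 517 -> 5
minutes = dep_time % 100     # the remainder is the minute part, e.g. 517 -> 17
minutes_since_midnight = hours * 60 + minutes

print(minutes_since_midnight.tolist())  # [317, 333, 895, 1439]
```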

Import the data.

In [152]:
flight = pd.read_excel('flights.xlsx')
flight.head()
Out[152]:
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier ... time_hour temp dewp humid wind_dir wind_speed wind_gust precip pressure visib
0 2013 1 1 517 515 2 830 819 11 UA ... 2013-01-01 05:00:00 39.02 28.04 64.43 260 12.65858 NaN 0 1011.9 10
1 2013 1 1 517 515 2 830 819 11 UA ... 2013-01-01 05:00:00 39.02 26.96 61.63 260 14.96014 NaN 0 1012.1 10
2 2013 1 1 517 515 2 830 819 11 UA ... 2013-01-01 05:00:00 39.92 24.98 54.81 250 14.96014 21.86482 0 1011.4 10
3 2013 1 1 533 529 4 850 830 20 UA ... 2013-01-01 05:00:00 39.02 28.04 64.43 260 12.65858 NaN 0 1011.9 10
4 2013 1 1 533 529 4 850 830 20 UA ... 2013-01-01 05:00:00 39.02 26.96 61.63 260 14.96014 NaN 0 1012.1 10

5 rows × 28 columns

Please use the seaborn functions to answer the following questions.

1. What is the distribution of the departure time (dep_time)?

In [ ]:

2. What is the frequency distribution of the origin?

In [ ]:

3. What is the association between departure time and arrival time?

In [ ]:

4. What are the top 10 destinations and the corresponding number of flights to each?

In [ ]:

5. What is the relationship between departure time and arrival time by origin?

In [ ]:

6. How does the departure time change by hour?

In [ ]:

7. How does the temperature change over time?

In [ ]:

8. Which pairs of the variables are the most correlated ones in the data?

In [ ]:

9. What is the distribution of the departure time in each origin by hour?

In [ ]:

10. What is the relationship between distance and air_time in each origin?

In [ ]:

11. Draw the scatter plot matrix of the numerical variables in the data

In [ ]:

Thank you for this great semester, good luck with your finals and projects!
