Ozancan Ozdemir - ozancan@metu.edu.tr
Seaborn is an open-source Python library for statistical visualization. It is built on top of matplotlib and is meant to be used together with it rather than as a replacement. It works closely with pandas data structures and NumPy arrays, which lets users quickly create plots from large datasets.
It makes plotting from pandas Series and DataFrame objects more convenient than plain matplotlib. It is also an efficient tool for producing attractive plots through high-level interfaces, since it comes with a range of predefined styles and color palettes that make it easy to create visually appealing charts with just a few lines of code. This makes it a useful tool for Exploratory Data Analysis (EDA).
Before starting your analysis, you have to import the seaborn package along with matplotlib.pyplot.
import matplotlib.pyplot as plt
import seaborn as sns
After that, we also import the other necessary packages for the analysis.
import numpy as np
import pandas as pd
Before proceeding, I'd like to show you the general structure of the seaborn functions.
sns.function_name( df, x = "x", y = "y", arg)
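As a concrete instance of this pattern, the sketch below calls one seaborn function on a small made-up data frame. The frame `toy` and its columns `size` and `price` are hypothetical and only illustrate the call; they are not part of the lab data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data, used only to illustrate the call pattern
toy = pd.DataFrame({"size": [1.0, 2.0, 3.0, 4.0],
                    "price": [10, 18, 33, 41]})

# sns.function_name(data, x=..., y=..., further arguments)
ax = sns.scatterplot(data=toy, x="size", y="price", color="darkred")
```

The column names passed to x and y become the axis labels by default. Passing the data frame by keyword (data=...) should behave the same across seaborn versions.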
In this lab, we will explore the features of seaborn using a famous dataset, mpg.
This dataset contains a subset of the fuel economy data that the EPA makes available on https://fueleconomy.gov/. It contains only models which had a new release every year between 1999 and 2008 - this was used as a proxy for the popularity of the car.
This data is already available in Python, but we import it from a file since we would also like to practice our pandas skills :)
The descriptions of the variables are given below.
manufacturer: manufacturer name
model: model name
displ: engine displacement, in litres
year: year of manufacture
cyl: number of cylinders
trans: type of transmission
drv: the type of drive train, where f = front-wheel drive, r = rear wheel drive, 4 = 4wd
cty: city miles per gallon
hwy: highway miles per gallon
fl: fuel type
class: "type" of car
Import the mpg.xlsx dataset into your Colab environment.
mpg = pd.read_excel('mpg.xlsx')
mpg.head()
mpg.shape
The dataset contains 234 observations and 11 variables.
mpg.info()
mpg.isnull().sum().sum() #no missing
The data has no missing values.
We explore the package by asking some research questions. First, the following figure briefly summarizes the seaborn functions and what each of them is used for.
[Figure: overview of the seaborn plotting functions. Source: Seaborn]
According to the figure above, seaborn has three general (figure-level) functions, and we choose the plot type with the kind argument. Alternatively, we can use the function designed for a specific plot. For example, we can draw a bar plot with either catplot() or barplot().
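The difference between the general functions and the plot-specific ones can be seen in what they return. The following is a minimal sketch on a hypothetical toy data frame (`toy` is made up for illustration): the general function catplot builds its own figure, while the plot-specific countplot draws on an Axes that you provide.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data with one categorical column
toy = pd.DataFrame({"drv": ["f", "f", "r", "4", "4", "4"]})

# General function: the plot type is chosen with kind, and a FacetGrid is returned
grid = sns.catplot(data=toy, x="drv", kind="count")

# Plot-specific function: draws on an existing Axes and returns that Axes
fig, ax = plt.subplots()
returned = sns.countplot(data=toy, x="drv", ax=ax)
```

This is why the plot-specific functions combine naturally with plt.figure and plt.subplots, whereas the general functions are sized through their own arguments.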
In this part, we will explore the variable properties individually.
RQ1: What is the shape of the distribution of the highway miles per gallon (hwy) of the car models?
The most common approach to visualizing a distribution is the histogram. This is the default approach in displot(), which uses the same underlying code as histplot().
sns.displot(mpg, x = "hwy")
Since the basic histogram requires only one variable, we pass only x.
Seaborn provides several themes which help us to improve the appearance of the plot. We can set the style by calling seaborn's set() method.
sns.set()
sns.displot(mpg, x = "hwy")
We can also set the background through the function set_style. Some possible settings are:
sns.set_style("dark")
sns.displot(mpg, x = "hwy")
sns.set_style("whitegrid")
sns.displot(mpg, x = "hwy")
sns.set_style("white")
sns.displot(mpg, x = "hwy")
sns.set() # go back to the default theme
You can improve your plot by adding a title, caption, and subtitle and by setting the axis titles. To do so, you can use either seaborn functions (set, for example) or matplotlib commands.
sns.displot(mpg, x = "hwy").set(title = 'The Distribution of Highway Miles per Gallon').set_axis_labels('Highway Miles per Gallon', 'Count')
# .set(title = "") add title name
# .set_axis_labels ("x label", "y label")
sns.displot(mpg, x = "hwy")
plt.title('The Distribution of Highway Miles per Gallon')
plt.xlabel('Highway Miles per Gallon')
plt.ylabel('Count')
The size of the figure is controlled by setting the height argument.
sns.set()
sns.displot(mpg, x = "hwy",height = 10)
plt.title('The Distribution of Highway Miles per Gallon',fontweight="bold",fontsize=14,loc="left") # loc: location
plt.suptitle('Car models between 1999-2008',fontsize=10,x=0.08,y=0.98,ha="left") #ha: horizontal alignment, x and y arranges the position
plt.xlabel('Highway Miles per Gallon' )
plt.ylabel('Count')
plt.ylim(0,75)
plt.show()
As in matplotlib, you can also change the figure size with the figsize argument of plt.figure. However, this does not work with displot, which creates its own figure; use histplot instead.
sns.set()
plt.figure(figsize = (12,6))
sns.histplot(mpg, x = "hwy")
plt.title('The Distribution of Highway Miles per Gallon',fontweight="bold",fontsize=14,loc="left") # loc: location
plt.xlabel('Highway Miles per Gallon' )
plt.ylabel('Count')
plt.show()
RQ2: What is the distribution of manufacturer of the car models released between 1999-2008?
Since manufacturer is a categorical variable, we can display its distribution with a bar plot. There are two ways of drawing a bar plot in seaborn: catplot (categorical plot) and barplot (bar plot).
sns.catplot(data=mpg, x="manufacturer", kind="count")
Let's make it better.
First order the bars by frequency, then change the color of the plot and flip the axes.
sns.catplot(data=mpg, y="manufacturer",
kind="count",
order=mpg.manufacturer.value_counts().index,
color='blue')
# assign the variable of interest to y, rather than x
# order determines the order of the axis ticks
# color fills the bars
You can use a color palette instead of a fixed color. The available palettes are listed in the seaborn documentation.
As with the previous example, the size of the figure is controlled by setting the height.
sns.catplot(data=mpg, y="manufacturer",
kind="count",
order=mpg.manufacturer.value_counts().index,palette="icefire",
height = 12)
plt.title('The Frequency Distribution of Manufacturer', fontweight="bold",fontsize=14)
plt.show()
You can draw the same plot using the barplot command. To do so, you need to extract a frequency table from the raw data.
freq_table = mpg['manufacturer'].value_counts().rename_axis('manufacturer').reset_index(name='count') # columns: manufacturer, count
freq_table
sns.barplot(data = freq_table, x = 'manufacturer', y = 'count', color = 'darkred')
plt.title('The Frequency Distribution of Manufacturer', fontweight="bold",fontsize=14)
plt.show()
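If you prefer to skip the frequency table, countplot does the counting itself and, unlike catplot, draws on a regular matplotlib Axes. The following is a minimal sketch on hypothetical toy data (the `toy` frame is made up for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data standing in for the manufacturer column
toy = pd.DataFrame({"manufacturer": ["audi", "ford", "ford",
                                     "toyota", "toyota", "toyota"]})

# value_counts sorts by frequency, so its index gives the bar order
counts = toy["manufacturer"].value_counts()

fig, ax = plt.subplots(figsize=(6, 4))
sns.countplot(data=toy, x="manufacturer", order=counts.index,
              color="darkred", ax=ax)

# Read the bar heights back from the plot to check them against the table
bar_heights = [int(p.get_height()) for p in ax.patches if p.get_height() > 0]
```

The bar heights are the same numbers as in the frequency table, so both routes lead to the same plot.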
In this part, we will draw plots that include more than one variable at the same time.
RQ3: What is the relationship between displacement and highway miles per gallon of the cars released between 1999 and 2008?
Now, we have two numerical variables. The classical approach here is the scatter plot.
plt.figure(figsize = (10,10)) # set the figure size
sns.scatterplot(data=mpg, x="displ", y="hwy")
plt.title('The Relationship between Displacement and Highway',fontsize = 15, fontweight = "bold")
plt.xlabel('Displacement')
plt.ylabel('Highway Miles Per Gallon')
plt.show()
If you would like to draw a scatter plot with a trend line, you should use the lmplot command.
sns.lmplot(data=mpg, x="displ", y="hwy", height = 10)
plt.title('The Relationship between Displacement and Highway',fontsize = 15, fontweight = "bold")
plt.xlabel('Displacement')
plt.ylabel('Highway Miles Per Gallon')
plt.show()
RQ4: What is the relationship between displacement and highway miles per gallon of the cars released by year?
Now, we will add a third variable to the relationship that we are looking at. In seaborn, we use the hue argument of the plot to include the third variable.
plt.figure(figsize = (10,10)) # set the figure size
sns.scatterplot(data=mpg, x="displ", y="hwy",hue = 'year', palette = ['darkred','orange']) #palette arranges the hue color
plt.title('The Relationship between Displacement and Highway',fontsize = 15, fontweight = "bold")
plt.xlabel('Displacement')
plt.ylabel('Highway Miles Per Gallon')
plt.show()
RQ5: How does city miles per gallon vary by transmission type?
In this question, we have one categorical and one numerical variable. Thus, we can draw either box plot or violin plot.
Let's draw both!
We use the boxplot() function in seaborn to draw the box plot.
plt.figure(figsize = (13,13)) # set the figure size
sns.boxplot(data=mpg, x="trans", y="cty") #x: the variable on the x axis, y: the variable on the y axis
plt.title('The Relationship between Transmission and City miles per gallon',fontsize = 15, fontweight = "bold")
plt.xlabel('Transmission')
plt.ylabel('City miles per gallon')
plt.show()
plt.figure(figsize = (13,13)) # set the figure size
sns.boxplot(data=mpg, x="trans", y="cty",
palette = 'Paired') # x: the variable on the x axis, y: the variable on the y axis, palette: assigns a color palette
plt.title('The Relationship between Transmission and City miles per gallon',fontsize = 15, fontweight = "bold")
plt.xlabel('Transmission')
plt.ylabel('City miles per gallon')
plt.show()
If you intend to draw a violin plot, you can use the violinplot function with the same syntax.
plt.figure(figsize = (13,13)) # set the figure size
sns.violinplot(data=mpg, x="trans", y="cty",
palette = 'Paired') # x: the variable on the x axis, y: the variable on the y axis, palette: assigns a color palette
plt.title('The Relationship between Transmission and City miles per gallon',fontsize = 15, fontweight = "bold")
plt.xlabel('Transmission')
plt.ylabel('City miles per gallon')
plt.show()
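The box in a box plot spans the first to the third quartile of each group, with a line at the median. To read the numbers behind the picture, you can compute them with pandas. The following is a minimal sketch on hypothetical toy data (the values are made up and do not come from mpg):

```python
import pandas as pd

# Hypothetical toy data: one categorical and one numerical variable
toy = pd.DataFrame({
    "trans": ["auto"] * 5 + ["manual"] * 5,
    "cty":   [14, 16, 18, 20, 22, 17, 19, 21, 23, 25],
})

# Quartiles of cty within each transmission type
summary = toy.groupby("trans")["cty"].quantile([0.25, 0.5, 0.75]).unstack()
summary.columns = ["q1", "median", "q3"]
print(summary)
```

For this toy data the quartiles are 16, 18, 20 for auto and 19, 21, 23 for manual, which is where the box edges and the median line would be drawn.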
RQ6: How does the highway miles per gallon vary by the number of cylinders in two different years?
In this research question, we have three numerical variables. However, two of them, cylinder and year, take only a few distinct values and can be treated as categorical.
Thus, we can construct either a box plot or a violin plot again. In contrast to the previous example, we have a third variable, year, and it can be included in the plot using the hue argument.
plt.figure(figsize = (13,13)) # set the figure size
sns.boxplot(data=mpg, x="cyl",
y="hwy",
hue = 'year',
palette = ['darkred','orange']) # x: the variable on the x axis, y: the variable on the y axis, palette: assigns a color palette
plt.title('The Relationship between Number of Cylinders and Highway miles per gallon by year',fontsize = 15, fontweight = "bold")
plt.xlabel('Cylinder')
plt.ylabel('Highway miles per gallon')
plt.show()
RQ7: Display the correlation among the numerical variables in the data
Instead of asking a research question, we will show you how to visualize the correlation among the numerical variables in the data. There are several alternatives, and the heatmap is one of them. You can use the heatmap() function for this purpose.
To do so, we first create a correlation matrix.
corr_data = mpg.select_dtypes(include='number').corr() # calculate the correlation table of the numerical columns only
corr_data
plt.figure(figsize = (10,10)) # set the figure size
sns.heatmap(corr_data)
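A heatmap is often easier to read when the cells show the correlation values and the color scale is fixed to the range from -1 to 1. The following is a minimal sketch on hypothetical toy data; the argument choices (annot, fmt, vmin, vmax, cmap) are one reasonable option, not the only one.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data with two numerical columns and one text column
toy = pd.DataFrame({
    "displ": [1.8, 2.0, 2.8, 3.1, 4.2, 5.3],
    "cty":   [21, 20, 16, 18, 14, 11],
    "model": ["a", "b", "c", "d", "e", "f"],  # non-numeric, excluded below
})

# Keep only the numerical columns before computing the correlations
corr = toy.select_dtypes(include="number").corr()

fig, ax = plt.subplots(figsize=(4, 4))
# annot writes the value into each cell; vmin/vmax fix the color scale
sns.heatmap(corr, annot=True, fmt=".2f", vmin=-1, vmax=1,
            cmap="coolwarm", ax=ax)
```

With a diverging palette such as coolwarm and a symmetric scale, positive and negative correlations get clearly different colors.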
RQ8: What is the relationship between highway miles and displacement in the levels of drive train (drv)?
Recall that we investigated the relationship between hwy and displ in RQ3. We also looked at the relationship between these two variables by including a third variable, year, in RQ4.
Instead of coloring the plot by the third variable, we can draw a separate plot for each level of the third variable, because a pattern may be harder to detect when everything is colored within a single plot.
In this case, we will use FacetGrid, which creates a multi-plot grid for plotting conditional relationships.
g = sns.FacetGrid(mpg, col="drv")
g.map(sns.scatterplot, "displ", "hwy")
The size and shape of the plot are specified at the level of each subplot using the height and aspect parameters:
g = sns.FacetGrid(mpg, col="drv", height=6.5, aspect=.65)
g.map(sns.scatterplot, "displ", "hwy")
If you intend to add a title to your grid plot, use fig.subplots_adjust to leave room for it at the top:
g = sns.FacetGrid(mpg, col="drv", height=6.5, aspect=.65)
g.map(sns.scatterplot, "displ", "hwy")
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle('The Relationship Between Hwy and Displ by Drive')
RQ9: What are the associations between numerical variables in the data? What are the distributions of them?
As you see, we answer two research questions simultaneously. Note that the scatter_matrix function in pandas.plotting returns a scatter plot matrix with histograms on the diagonal.
In seaborn, we can get a similar plot using pairplot:
sns.pairplot(mpg)
As with other figure-level functions, the size of the figure is controlled by setting the height of each individual subplot:
sns.pairplot(mpg, height=1.5)
RQ10: What are the associations between numerical variables in the data in two different years? What are their distributions?
In addition to the previous example, we will include a third variable, year. Remember that we use the hue argument to add the third variable.
sns.pairplot(mpg, height=1.5, hue = 'year' , palette = ['darkred','orange'])
If you want to draw a single panel from this matrix, you can use the jointplot command.
sns.jointplot(data=mpg,x="displ",y="hwy",hue="year", palette = ['darkred','orange'])
Consider the flights.xlsx data. The data description is given below.
Description: On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) on 1-1-2013.
year, month, day : Date of departure.
dep_time, arr_time : Actual departure and arrival times (format HHMM or HMM), local tz.
sched_dep_time, sched_arr_time : Scheduled departure and arrival times (format HHMM or HMM), local tz.
dep_delay, arr_delay: Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
carrier: Two letter carrier abbreviation. See airlines to get name.
flight: Flight number.
tailnum: Plane tail number. See planes for additional metadata.
origin, dest: Origin and destination. See airports for additional metadata.
air_time: Amount of time spent in the air, in minutes.
distance: Distance between airports, in miles.
hour, minute : Time of scheduled departure broken into hour and minutes.
time_hour : Scheduled date and hour of the flight as a POSIXct date. Along with origin, can be used to join flights data to weather data.
temp, dewp : Temperature and dewpoint in F.
humid : Relative humidity.
wind_dir, wind_speed, wind_gust : Wind direction (in degrees), speed and gust speed (in mph).
precip : Precipitation, in inches.
pressure : Sea level pressure in millibars.
visib : Visibility in miles.
time_hour : Date and hour of the recording as a POSIXct date.
Import the data.
flight = pd.read_excel('flights.xlsx')
flight.head()
Please use the seaborn functions to answer the following questions.
1. What is the distribution of the departure time (dep_time)?
2. What is the frequency distribution of the origin?
3. What is the association between departure time and arrival time?
4. What are the top 10 destinations and the corresponding number of flights?
5. What is the relationship between departure time and arrival time by origin?
6. How does the departure time change by hour?
7. How does the temperature change over time?
8. Which pairs of the variables are the most correlated ones in the data?
9. What is the distribution of the departure time in each origin by hour?
10. What is the relationship between distance and air_time in each origin?
11. Draw the scatter plot matrix of the numerical variables in the data
Suggested Sources for Seaborn