Ozancan Ozdemir - ozancan@metu.edu.tr
Pandas offers various tools, built on top of matplotlib, for creating clear and insightful visualizations to aid in data analysis. These tools allow for easy comparison of different parts of a dataset, exploration of the entire dataset, and identification of patterns and correlations in the data.
The primary tool in pandas for creating visualizations is the plot() method, which is available for both Series and DataFrames. This method includes a parameter called kind that determines the type of plot to be generated. The available options for kind are listed below.
The plot ID is the value of the keyword argument kind. That is, df.plot(kind="scatter") creates a scatter plot. The default kind is "line".
In pandas, there are two different ways of using the plot() method. The first one is
df.plot(x, y, kind='<kind of the desired plot, e.g. bar, area, etc.>')
The second one is
df.plot.<kind of the desired plot, e.g. bar, area, etc.>()
where df indicates the working data frame.
You can use either of them, as long as it works correctly.
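To make the equivalence of the two call styles concrete, here is a minimal sketch. The data frame df and its columns fruit and count are hypothetical toy data, not part of the avocado dataset.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({"fruit": ["apple", "pear", "plum"], "count": [3, 5, 2]})

ax1 = df.plot(kind="bar", x="fruit", y="count")  # keyword style
ax2 = df.plot.bar(x="fruit", y="count")          # accessor style

# Both calls draw the same bars, each on its own figure
heights1 = [p.get_height() for p in ax1.patches]
heights2 = [p.get_height() for p in ax2.patches]
```

The accessor style has the small advantage that tab completion lists the available plot kinds.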
Import the necessary packages.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# set the plots to display in the Jupyter notebook
%matplotlib inline
The plt interface is what we will use most often.
Consider the avocado.csv dataset.
The variable explanations are given below.
avocado = pd.read_csv('avocado.csv')
avocado = avocado.drop('Unnamed: 0',axis = 1) #drop the unnecessary unnamed:0 column
avocado.head(5)
avocado.info()
❓How does the average price change over time?
I will explore the behaviour of the average price over time. Since my intention is to observe a pattern over time, a line plot is the best choice. Notice that the default kind of the plot function is line.
avocado.plot('Date','AveragePrice')
Let's customize this plot using additional arguments in plot.
avocado.plot('Date','AveragePrice',figsize = (15,8)) #figsize specifies the size of the figure object.
avocado.plot('Date','AveragePrice',figsize = (15,8),
xlabel = 'Date', ylabel = "Average Price") #xlabel & ylabel set axis title for x and y axis.
avocado.plot('Date','AveragePrice',figsize = (15,8),
xlabel = 'Date', ylabel = "Average Price",
title = 'The Change of Average Price of Avocado by Time') #title set the title of the plot
avocado.plot('Date','AveragePrice',figsize = (15,8),
xlabel = 'Date', ylabel = "Average Price",
title = 'The Change of Average Price of Avocado by Time',
color = 'Red') #color changes the color of your plot
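One caveat worth sketching: if Date is read in as plain text, or if the file holds several rows per date, the line can jump back and forth. A possible remedy, shown here on a small hypothetical stand-in for avocado.csv (the values are made up), is to parse the dates with pd.to_datetime and average the price per date before plotting.

```python
import pandas as pd

# Hypothetical stand-in for avocado.csv with the same column names
toy = pd.DataFrame({
    "Date": ["2015-01-11", "2015-01-04", "2015-01-04", "2015-01-11"],
    "AveragePrice": [1.20, 1.00, 1.40, 1.60],
})
toy["Date"] = pd.to_datetime(toy["Date"])                 # text -> datetime64
daily_mean = toy.groupby("Date")["AveragePrice"].mean()   # one value per date, sorted

ax = daily_mean.plot(figsize=(15, 8), xlabel="Date", ylabel="Average Price",
                     title="Mean Average Price by Date", color="red")
```

Because groupby sorts its keys, the resulting line runs from the earliest to the latest date.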
❓How do the Hass types change over time?
fig, axs = plt.subplots(nrows=2, ncols=2,figsize = (20,12)) #plt.subplots is a matplotlib function that creates a figure and a grid of subplots.
axs[0,0].plot(avocado['Date'],avocado['Small Hass'])
axs[0,1].plot(avocado['Date'],avocado['Large Hass'])
axs[1,0].plot(avocado['Date'], avocado['Extra Large Hass'])
It is certainly an ugly plot, but you should focus on how to draw multiple plots in one window.
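A slightly tidier version of the same idea can be sketched as follows. The three series are made-up numbers that only borrow the column names; the loop gives each panel a title, and the unused fourth panel is hidden.

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.arange(10)  # hypothetical time index
series = {"Small Hass": t, "Large Hass": t ** 2, "Extra Large Hass": np.sqrt(t)}

fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(12, 8))
for ax, (name, values) in zip(axs.flat, series.items()):
    ax.plot(t, values)
    ax.set_title(name)          # label each panel
axs[1, 1].set_visible(False)    # hide the unused fourth panel
fig.tight_layout()              # reduce overlap between panels
```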
❓What is the distribution of avocado types?
Note that type has an object data type, implying that it is a categorical variable. Remember that we use a bar plot to display the frequency distribution of a categorical variable.
Drawing a bar plot using pandas may require a simple data manipulation/calculation.
avocado['type'].value_counts().plot(kind = 'bar') # basic bar plot
avocado['type'].value_counts().plot(kind = 'barh') #horizontal bar plot
Which type of bar plot can be used in this example? Explain your reasoning.
You can use the customization arguments specified above to customize your bar plot.
## Customize your plot
avocado['type'].value_counts().plot(kind = 'barh', figsize = (10,6), color = ['Red','Orange'],
fontsize = 12) #fontsize: Font size for xticks and yticks
#plt.xlabel('xlabel', fontsize=14)
plt.ylabel('Avocado Type', fontsize=14)
plt.title('The Avocado Type', fontsize = 18) #set the title and font size
Let's add the counts on top of the bars... 😢
This takes a bit more work.
freq_table = pd.DataFrame(avocado['type'].value_counts())
freq_table
# select the counts by position: the column is named 'type' in pandas < 2.0 and 'count' in pandas >= 2.0
ax = freq_table.iloc[:, 0].plot(kind = 'bar', figsize = (10,6), color = ['Red','Orange'],
fontsize = 12)
plt.xlabel('Avocado Type', fontsize=14) #the types are on the x axis of a vertical bar plot
plt.title('The Avocado Type', fontsize = 18) #set the title and font size
#plt.xticks([]) #remove x axis text
plt.yticks([]) #remove y axis text.
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.005, p.get_height() * 1.005))
plt.show()
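As a shorter alternative to the annotate loop, Matplotlib 3.4 and later provide the Axes.bar_label method. The sketch below assumes such a version is installed and uses a made-up type column rather than the real data.

```python
import pandas as pd

# Hypothetical data: three conventional and two organic observations
types = pd.Series(["conventional", "organic", "conventional",
                   "organic", "conventional"])
counts = types.value_counts()

ax = counts.plot(kind="bar", color=["red", "orange"], figsize=(10, 6))
labels = ax.bar_label(ax.containers[0])  # writes one count above each bar
```

Here ax.containers[0] is the group of bars that the bar plot created, and bar_label returns the text objects it added.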
❓What is the distribution of Average Price?
Since average price is a numerical variable, the shape of its distribution can be assessed with a histogram.
avocado['AveragePrice'].plot(kind = 'hist')
## Customize
avocado['AveragePrice'].plot(kind = 'hist', color ="orange", figsize = (10,6))
plt.title("Histogram of the Average Price", fontsize = 14,fontweight='bold') #fontweight: bold makes your title bold.
What is the default number of bins in a pandas histogram?
❓How do the distributions of Large Hass and Extra Large Hass differ from each other?
You can compare the shapes of the distributions of two numerical variables on the same plot. Here alpha, which controls the transparency of the bars, is the key parameter.
avocado[['Large Hass', 'Extra Large Hass']].plot(kind = 'hist', alpha = 0.5, figsize = (10,6))
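The effect of alpha is easier to judge when both variables share the same bins. A minimal sketch on synthetic data (the numbers are random draws, not the avocado values) that also sets the bins argument explicitly:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed; synthetic data for illustration
toy = pd.DataFrame({"Large Hass": rng.normal(5, 1, 500),
                    "Extra Large Hass": rng.normal(7, 2, 500)})

# alpha < 1 makes the bars semi-transparent, so the overlap stays visible
ax = toy.plot(kind="hist", bins=20, alpha=0.5, figsize=(10, 6))
```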
You can customize your plot using the functions and arguments listed above.
❓How does the distribution of average price change by avocado type?
Since the variables of interest are one numerical and one categorical variable, a box plot is an appropriate choice.
avocado[['AveragePrice','type']].plot(kind = 'box',figsize=(10, 8))
#column: numerical variable & by: categorical variable
Oops... We have a problem because the data is not in the appropriate form.
subdata = avocado[['AveragePrice','type']]
subdata.head()
Solution: Reshape the data
subdata_wider = subdata.pivot(columns="type", values="AveragePrice")
subdata_wider
subdata_wider.plot(kind = 'box', figsize = (10,6))
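As an alternative to reshaping, the DataFrame.boxplot method accepts the long format directly through its column and by arguments. A minimal sketch on hypothetical data that reuses the two column names:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical long-format data: one numerical and one categorical column
toy = pd.DataFrame({
    "AveragePrice": [1.0, 1.2, 1.1, 1.6, 1.8, 1.7],
    "type": ["conventional"] * 3 + ["organic"] * 3,
})

fig, ax = plt.subplots(figsize=(10, 6))
# column: numerical variable, by: categorical variable
toy.boxplot(column="AveragePrice", by="type", ax=ax)
```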
You can customize your plot using the functions and arguments listed above.
❓What is the relationship between Small Hass and Small Bags?
Since we aim to investigate the relationship between two numerical variables, we can use a scatter plot.
avocado.plot(kind ='scatter', x = 'Small Hass', y ='Small Bags', color = 'darkorange', figsize = (10,8))
You can customize your plot using the functions and arguments listed above.
Import the beerhall.xlsx data and answer the following questions using the corresponding graphics.
#compile this chunk
bh = pd.read_excel('beerhall.xlsx')
bh.head()
1.1 How does the average school attendance (At_School) change by Region_Code? (Hint: use aggregate and a line plot)
1.2 Can you show the frequencies of Regions with an appropriate visual tool? Which region has the highest frequency?
1.3 Do school attendance (At_School) and public attendance (At_Public) show a difference in terms of the distribution?
1.4 How does the distribution of the criminals change by region?
1.5 What is the relationship between criminal and public attendance?
In addition to the plot() method, pandas also has the pandas.plotting module, which provides more advanced plots than the plot() method.
Source: Pandas
Among these functions, scatter_matrix() will be our choice since it is an efficient tool for understanding the data at the beginning of an analysis.
from pandas.plotting import scatter_matrix
scatter_matrix(avocado, figsize = (18,12))
The function draws the histogram of each numerical variable on the diagonal, as well as a scatter plot for each pair of numerical variables.
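A small sketch of what scatter_matrix gives back, using a synthetic data frame (the column names a, b, c and label are hypothetical): the return value is a square array of axes, one row and one column per numerical variable, and non-numeric columns are left out.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(1)  # fixed seed; synthetic data
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
toy["label"] = "x"              # non-numeric column, ignored by scatter_matrix

# diagonal="hist" draws histograms on the diagonal; "kde" is the other option
axes = scatter_matrix(toy, figsize=(8, 8), diagonal="hist")
```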
Try it yourself...
Import the pandas.plotting module and construct a scatter plot matrix for the beerhall.xlsx data.
Matplotlib is a multi-platform data visualization library built on NumPy arrays, and designed to work with the broader SciPy stack. It enables interactive MATLAB-style plotting.
Although better tools like ggplot2 in R or seaborn in Python have been developed in recent years, matplotlib is still widely used.
Before talking about the usage of matplotlib in detail, there are a few things that you should know about the package, some of which you have already used above.
#import the package
import matplotlib as mpl
import matplotlib.pyplot as plt
The plt interface is what we will use most often, as we shall see throughout this chapter.
Setting Style
We will use the plt.style directive to choose appropriate aesthetic styles for our figures.
Here we will set the classic style, which ensures that the plots we create use the classic Matplotlib style:
plt.style.use('classic')
You can see more options for customization.
Showing the plot
If you are using Matplotlib from within a script, the function plt.show() is your friend.
plt.show() starts an event loop, looks for all currently active figure objects, and opens one or more interactive windows that display your figure or figures.
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
fig = plt.figure() #figure: Create a new figure, or activate an existing figure
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x))
plt.show()
Saving Figures to File
One nice feature of Matplotlib is the ability to save figures in a wide variety of formats.
Saving a figure can be done using the savefig() command.
For example, to save the previous figure as a PNG file, you can run this:
fig.savefig('my_figure.png')
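A self-contained sketch of the same step, with two optional savefig arguments (dpi and bbox_inches). The temporary output directory is an assumption made so that the example does not clutter the working directory.

```python
import os
import tempfile

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x))

# Write into a temporary directory (an assumption for this example)
out_path = os.path.join(tempfile.mkdtemp(), "my_figure.png")
fig.savefig(out_path, dpi=150, bbox_inches="tight")  # higher resolution, trimmed margins
saved_ok = os.path.exists(out_path)
```

The file format is inferred from the extension of the file name.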
Plot Frame
Matplotlib can draw multiple plots in one window.
plt.figure() # create a plot figure
# create the first of two panels and set current axis
plt.subplot(2, 1, 1) # (rows, columns, panel number)
plt.plot(x, np.sin(x))
# create the second panel and set current axis
plt.subplot(2, 1, 2)
plt.plot(x, np.cos(x));
Now, let's start by drawing a basic line chart.
For all Matplotlib plots, we start by creating a figure and an axes. In their simplest form, a figure and axes can be created as follows:
fig = plt.figure()
ax = plt.axes()
In Matplotlib, the figure (an instance of the class plt.Figure) can be thought of as a single container that contains all the objects representing axes, graphics, text, and labels. The axes (an instance of the class plt.Axes) is what we see above: a bounding box with ticks and labels, which will eventually contain the plot elements that make up our visualization. Throughout this class, we'll commonly use the variable name fig to refer to a figure instance, and ax to refer to an axes instance or group of axes instances.
Once we have created an axes, we can use the ax.plot function to plot some data. Let's start with a simple sinusoid:
fig = plt.figure()
ax = plt.axes()
x = np.linspace(0, 10, 1000)
ax.plot(x, np.sin(x));
Alternatively, we can use the pylab interface and let the figure and axes be created for us in the background (see Two Interfaces for the Price of One for a discussion of these two interfaces):
plt.plot(x, np.sin(x));
If we want to create a single figure with multiple lines, we can simply call the plot function multiple times:
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x));
Adjusting the Plot: Line Colors and Styles
The first adjustment you might wish to make to a plot is to control the line colors and styles. The plt.plot() function takes additional arguments that can be used to specify these. To adjust the color, you can use the color keyword, which accepts a string argument representing virtually any imaginable color. The color can be specified in a variety of ways:
plt.plot(x, np.sin(x - 0), color='blue') # specify color by name
plt.plot(x, np.sin(x - 1), color='g') # short color code (rgbcmyk)
plt.plot(x, np.sin(x - 2), color='0.75') # Grayscale between 0 and 1
plt.plot(x, np.sin(x - 3), color='#FFDD44') # Hex code (RRGGBB from 00 to FF)
plt.plot(x, np.sin(x - 4), color=(1.0,0.2,0.3)) # RGB tuple, values 0 to 1
plt.plot(x, np.sin(x - 5), color='chartreuse'); # all HTML color names supported
If no color is specified, Matplotlib will automatically cycle through a set of default colors for multiple lines.
Similarly, the line style can be adjusted using the linestyle keyword:
plt.plot(x, x + 0, linestyle='solid')
plt.plot(x, x + 1, linestyle='dashed')
plt.plot(x, x + 2, linestyle='dashdot')
plt.plot(x, x + 3, linestyle='dotted');
# For short, you can use the following codes:
plt.plot(x, x + 4, linestyle='-') # solid
plt.plot(x, x + 5, linestyle='--') # dashed
plt.plot(x, x + 6, linestyle='-.') # dashdot
plt.plot(x, x + 7, linestyle=':'); # dotted
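The linestyle and color codes can also be combined into a single format string passed as the third positional argument. A minimal sketch that checks what Matplotlib stored for each line:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots()

(line1,) = ax.plot(x, x + 0, "-g")   # solid green
(line2,) = ax.plot(x, x + 1, "--c")  # dashed cyan
(line3,) = ax.plot(x, x + 2, ":r")   # dotted red

styles = [(ln.get_linestyle(), ln.get_color()) for ln in (line1, line2, line3)]
```

Each call to ax.plot returns a list of line objects, which is why the left-hand side unpacks a one-element tuple.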
Adjusting the Plot: Axes Limits
Matplotlib does a decent job of choosing default axes limits for your plot, but sometimes it's nice to have finer control. The most basic way to adjust axis limits is to use the plt.xlim() and plt.ylim() methods:
plt.plot(x, np.sin(x))
plt.xlim(-1, 11)
plt.ylim(-1.5, 1.5);
If for some reason you'd like either axis to be displayed in reverse, you can simply reverse the order of the arguments:
plt.plot(x, np.sin(x))
plt.xlim(10, 0)
plt.ylim(1.2, -1.2);
A useful related method is plt.axis() (note here the potential confusion between axes with an e, and axis with an i).
The plt.axis() method allows you to set the x and y limits with a single call, by passing a list which specifies [xmin, xmax, ymin, ymax]:
plt.plot(x, np.sin(x))
plt.axis([-1, 11, -1.5, 1.5]);
The plt.axis() method goes even beyond this, allowing you to do things like automatically tighten the bounds around the current plot:
plt.plot(x, np.sin(x))
plt.axis('tight');
It allows even higher-level specifications, such as ensuring an equal aspect ratio so that on your screen, one unit in x is equal to one unit in y:
plt.plot(x, np.sin(x))
plt.axis('equal');
For more information on axis limits and the other capabilities of the plt.axis method, refer to the plt.axis docstring.
Labeling Plots
As the last piece of this section, we'll briefly look at the labeling of plots: titles, axis labels, and simple legends.
Titles and axis labels are the simplest such labels—there are methods that can be used to quickly set them:
plt.plot(x, np.sin(x))
plt.title("A Sine Curve")
plt.xlabel("x")
plt.ylabel("sin(x)");
The position, size, and style of these labels can be adjusted using optional arguments to the function. For more information, see the Matplotlib documentation and the docstrings of each of these functions.
When multiple lines are being shown within a single axes, it can be useful to create a plot legend that labels each line type.
Again, Matplotlib has a built-in way of quickly creating such a legend.
It is done via the (you guessed it) plt.legend() method.
Though there are several valid ways of using this, I find it easiest to specify the label of each line using the label keyword of the plot function:
plt.plot(x, np.sin(x), '-g', label='sin(x)')
plt.plot(x, np.cos(x), ':b', label='cos(x)')
plt.axis('equal')
plt.legend();
As you can see, the plt.legend() function keeps track of the line style and color, and matches these with the correct label.
More information on specifying and formatting plot legends can be found in the plt.legend docstring; additionally, we will cover some more advanced legend options in Customizing Plot Legends.
Aside: Matplotlib Gotchas
While most plt functions translate directly to ax methods (such as plt.plot() → ax.plot(), plt.legend() → ax.legend(), etc.), this is not the case for all commands.
In particular, functions to set limits, labels, and titles are slightly modified.
For transitioning between MATLAB-style functions and object-oriented methods, make the following changes:
plt.xlabel() → ax.set_xlabel()
plt.ylabel() → ax.set_ylabel()
plt.xlim() → ax.set_xlim()
plt.ylim() → ax.set_ylim()
plt.title() → ax.set_title()
In the object-oriented interface to plotting, rather than calling these functions individually, it is often more convenient to use the ax.set() method to set all these properties at once:
ax = plt.axes()
ax.plot(x, np.sin(x))
ax.set(xlim=(0, 10), ylim=(-2, 2),
xlabel='x', ylabel='sin(x)',
title='A Simple Plot');
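For comparison, the same figure can be sketched with the individual setter methods from the list above, one call per property. The getters at the end read the values back.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")

ax.set_xlabel("x")             # plt.xlabel()
ax.set_ylabel("sin(x)")        # plt.ylabel()
ax.set_xlim(0, 10)             # plt.xlim()
ax.set_ylim(-2, 2)             # plt.ylim()
ax.set_title("A Simple Plot")  # plt.title()
ax.legend()

limits = (ax.get_xlim(), ax.get_ylim())
```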
After this overview about the arguments in matplotlib, we will look at how we can draw the other plots.
Bar Plot
import matplotlib.pyplot as plt
# Look at index 4 and 6, which demonstrate overlapping cases.
x1 = [1, 3, 4, 5, 6, 7, 9]
y1 = [4, 7, 2, 4, 7, 8, 3]
x2 = [2, 4, 6, 8, 10]
y2 = [5, 6, 2, 6, 2]
# Colors: https://matplotlib.org/api/colors_api.html
plt.bar(x1, y1, label="Blue Bar", color='b')
plt.bar(x2, y2, label="Green Bar", color='g')
plt.plot()
plt.xlabel("bar number")
plt.ylabel("bar height")
plt.title("Bar Chart Example")
plt.legend()
plt.show()
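In the example above, the green bars at positions 4 and 6 are drawn over the blue ones. When two series share the same positions, one common remedy is to shift each series by half a bar width so the bars sit side by side. A minimal sketch with hypothetical categories and values:

```python
import numpy as np
import matplotlib.pyplot as plt

labels = ["A", "B", "C"]  # hypothetical categories
blue = [4, 7, 2]
green = [5, 6, 3]
pos = np.arange(len(labels))
width = 0.4

fig, ax = plt.subplots()
ax.bar(pos - width / 2, blue, width=width, color="b", label="Blue Bar")
ax.bar(pos + width / 2, green, width=width, color="g", label="Green Bar")
ax.set_xticks(pos)          # one tick per category, centred between the two bars
ax.set_xticklabels(labels)
ax.legend()
```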
Histogram
import matplotlib.pyplot as plt
import numpy as np
# Use numpy to generate a bunch of random data in a bell curve around 5.
n = 5 + np.random.randn(1000)
m = list(range(len(n)))
plt.bar(m, n)
plt.title("Raw Data")
plt.show()
plt.hist(n, bins=20)
plt.title("Histogram")
plt.show()
plt.hist(n, cumulative=True, bins=20)
plt.title("Cumulative Histogram")
plt.show()
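The hist function also returns the numbers it draws, which is handy for checking a plot. A short sketch with seeded random data (the seed and variable names are choices made for this example):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable
data = 5 + rng.standard_normal(1000)

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(data, bins=20)  # per-bin counts and bin edges
cum_counts, _, _ = ax.hist(data, bins=20, cumulative=True, histtype="step")
```

With 20 bins there are 21 edges, the counts add up to the sample size, and the last cumulative value equals the sample size as well.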
Scatter Plot
import matplotlib.pyplot as plt
x1 = [2, 3, 4]
y1 = [5, 5, 5]
x2 = [1, 2, 3, 4, 5]
y2 = [2, 3, 2, 3, 4]
y3 = [6, 8, 7, 8, 7]
# Markers: https://matplotlib.org/api/markers_api.html
plt.scatter(x1, y1)
plt.scatter(x2, y2, marker='v', color='r')
plt.scatter(x2, y3, marker='^', color='m')
plt.title('Scatter Plot Example')
plt.show()
Box Plot
x1 = np.random.rand(50)
plt.boxplot(x1)
x1 = np.random.rand(50)
x2 = np.random.rand(50)
# stack the two samples as columns so that each box corresponds to one sample
x3_reshape = np.column_stack([x1, x2]) # shape (50, 2)
x3_reshape
plt.boxplot(x3_reshape)
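Passing a list of arrays works as well and does not require the samples to have the same length. The sketch below uses seeded random data and labels the boxes afterwards; the names x1 and x2 are simply reused as tick labels.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # fixed seed; synthetic data
x1 = rng.random(50)
x2 = rng.random(30)             # a different length is fine in a list

fig, ax = plt.subplots()
result = ax.boxplot([x1, x2])   # one box per array in the list
ax.set_xticklabels(["x1", "x2"])
```

The returned dictionary holds the drawn artists, grouped under keys such as boxes, medians and whiskers.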
Please run the following code to create a dataset.
df = pd.DataFrame(data={'a':np.random.randint(0, 100, 30),
'b':np.random.randint(0, 100, 30),
'c':np.random.randint(0, 100, 30)})
df.head()
Answer the following questions using the appropriate tools from matplotlib.
3.1 What is the distribution of a?
plt.hist(df['a'].values)
3.2 What is the association between a and b?
3.3 How do b and c differ from each other?
subdata.boxplot
Matplotlib Cheat Sheet
References