Stat 112 - Recitation 10

Ozancan Ozdemir - ozancan@metu.edu.tr

Data Visualization with Pandas

Pandas offers various tools, built on top of matplotlib, for creating clear and insightful visualizations to aid in data analysis. These tools allow for easy comparison of different parts of a dataset, exploration of the entire dataset, and identification of patterns and correlations in the data.

The primary tool in pandas for creating visualizations is the plot() method, which is available for both Series and DataFrames. This method includes a parameter called kind that determines the type of plot to be generated. The available options for kind are listed below.

image.png

The plot ID is the value of the keyword argument kind. That is, df.plot(kind="scatter") creates a scatter plot. The default kind is "line".

In pandas, there are two ways of using the plot() method. The first one is

df.plot(kind='<kind of the desired plot, e.g. bar, area, etc.>', x, y)

The second one is

df.plot.<kind of the desired plot, e.g. bar, area, etc.>()

where df indicates the working data frame.

Both forms produce the same plot, so you can use whichever you prefer.
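
As a minimal sketch of the two calling styles, the cell below draws the same bar plot twice. The data frame toy and its columns fruit and count are invented for this illustration and are not part of the recitation data.

```python
import pandas as pd

# Toy data invented for this illustration (not from avocado.csv)
toy = pd.DataFrame({"fruit": ["apple", "banana", "cherry"],
                    "count": [3, 5, 2]})

# Style 1: pass the plot type through the kind argument
ax1 = toy.plot(kind="bar", x="fruit", y="count")

# Style 2: call the accessor method named after the plot type
ax2 = toy.plot.bar(x="fruit", y="count")

# Both calls return a matplotlib Axes holding one bar per row
print(len(ax1.patches), len(ax2.patches))
```

Either style accepts the same customization arguments, such as figsize or title.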

Import the necessary packages.

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# set the plots to display in the Jupyter notebook
%matplotlib inline

The plt interface is what we will use most often.

Consider the avocado.csv dataset.

The variable explanations are given below.

  • Date - the date of the observation
  • AveragePrice - the average price of a single avocado
  • type - conventional or organic
  • year - the year
  • region - the city or region of the observation
  • Total Volume - total number of avocados sold
  • Small Hass - total number of Small Hass avocados sold
  • Large Hass - total number of Large Hass avocados sold
  • Extra Large Hass - total number of Extra Large Hass avocados sold
In [ ]:
avocado = pd.read_csv('avocado.csv')
avocado = avocado.drop('Unnamed: 0',axis  = 1) #drop the unnecessary unnamed:0 column
avocado.head(5)
Out[ ]:
Date AveragePrice Total Volume Small Hass Large Hass Extra Large Hass Total Bags Small Bags Large Bags XLarge Bags type year region
0 2015-12-27 1.33 64236.62 1036.74 54454.85 48.16 8696.87 8603.62 93.25 0.0 conventional 2015 Albany
1 2015-12-20 1.35 54876.98 674.28 44638.81 58.33 9505.56 9408.07 97.49 0.0 conventional 2015 Albany
2 2015-12-13 0.93 118220.22 794.70 109149.67 130.50 8145.35 8042.21 103.14 0.0 conventional 2015 Albany
3 2015-12-06 1.08 78992.15 1132.00 71976.41 72.58 5811.16 5677.40 133.76 0.0 conventional 2015 Albany
4 2015-11-29 1.28 51039.60 941.48 43838.39 75.78 6183.95 5986.26 197.69 0.0 conventional 2015 Albany
In [ ]:
avocado.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18249 entries, 0 to 18248
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Date              18249 non-null  object 
 1   AveragePrice      18249 non-null  float64
 2   Total Volume      18249 non-null  float64
 3   Small Hass        18249 non-null  float64
 4   Large Hass        18249 non-null  float64
 5   Extra Large Hass  18249 non-null  float64
 6   Total Bags        18249 non-null  float64
 7   Small Bags        18249 non-null  float64
 8   Large Bags        18249 non-null  float64
 9   XLarge Bags       18249 non-null  float64
 10  type              18249 non-null  object 
 11  year              18249 non-null  int64  
 12  region            18249 non-null  object 
dtypes: float64(9), int64(1), object(3)
memory usage: 1.8+ MB

How does the average price change over time?

I will explore the behaviour of the average price over time. Since my intention is to observe a pattern over time, a line plot is the best choice. Notice that the default kind of the plot function is line.

In [ ]:
avocado.plot('Date','AveragePrice')
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c3cce4af0>

Let's customize this plot using additional arguments in plot.

In [ ]:
avocado.plot('Date','AveragePrice',figsize = (15,8)) #figsize specifies the size of the figure object.
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c3cb76be0>
In [ ]:
avocado.plot('Date','AveragePrice',figsize = (15,8),
             xlabel = 'Date', ylabel = "Average Price") #xlabel & ylabel set axis title for x and y axis.
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c3c660fa0>
In [ ]:
avocado.plot('Date','AveragePrice',figsize = (15,8),
             xlabel = 'Date', ylabel = "Average Price",
             title = 'The Change of Average Price of Avocado by Time') #title set the title of the plot
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c3adc8f10>
In [ ]:
avocado.plot('Date','AveragePrice',figsize = (15,8),
             xlabel = 'Date', ylabel = "Average Price",
             title = 'The Change of Average Price of Avocado by Time',
             color = 'Red') #color changes the color of your plot
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c3ad70640>
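
The info() output above lists Date as an object column, that is, the dates are stored as text. The data also appear to hold several rows for each date (one per region and type), so the line jumps between many values at the same date. One possible refinement, sketched here on a small invented sample rather than on avocado.csv, is to convert the dates with pd.to_datetime and plot the mean price per date. The name mean_price is chosen only for this illustration.

```python
import pandas as pd

# Small made-up sample in the same layout as avocado.csv (values are invented)
toy = pd.DataFrame({
    "Date": ["2015-01-04", "2015-01-04", "2015-01-11", "2015-01-11"],
    "AveragePrice": [1.00, 1.40, 1.10, 1.50],
})

# Turn the text dates into real timestamps so the x axis is a time axis
toy["Date"] = pd.to_datetime(toy["Date"])

# One value per date: the mean price over all rows sharing that date
mean_price = toy.groupby("Date")["AveragePrice"].mean()

ax = mean_price.plot(figsize=(15, 8), xlabel="Date", ylabel="Average Price",
                     title="Mean Average Price by Date")
```

Plotting a Series uses its index as the x axis, so no x and y arguments are needed here.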

How do the Hass types change over time?

In [ ]:
fig, axs = plt.subplots(nrows=2, ncols=2,figsize = (20,12)) #subplots is a matplotlib function that creates a figure and a grid of subplots.
axs[0,0].plot(avocado['Date'],avocado['Small Hass'])
axs[0,1].plot(avocado['Date'],avocado['Large Hass'])
axs[1,0].plot(avocado['Date'], avocado['Extra Large Hass'])
Out[ ]:
[<matplotlib.lines.Line2D at 0x7f5c3acc5460>]

The plot is certainly ugly, but the point here is how to draw multiple plots in one window.
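
A slightly tidier version of the same idea is sketched below, assuming random numbers in place of the real Hass columns. It loops over the panels, gives each one a title, hides the unused fourth panel, and calls fig.tight_layout() to reduce overlap.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up volumes standing in for the three Hass columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Small Hass": rng.random(20),
    "Large Hass": rng.random(20),
    "Extra Large Hass": rng.random(20),
})

fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(12, 8))

# axs.flat walks through the 2 x 2 grid panel by panel
for ax, column in zip(axs.flat, toy.columns):
    ax.plot(toy.index, toy[column])
    ax.set_title(column)

axs[1, 1].set_visible(False)  # hide the unused fourth panel
fig.tight_layout()            # keep titles and tick labels from overlapping
```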

What is the distribution of type of avocados?

Note that type has an object data type, implying that it is a categorical variable. Remember that we use a bar plot to display the frequency distribution of a categorical variable.

Drawing a bar plot with pandas may require a simple data manipulation first: the categories have to be counted before they can be plotted.

In [ ]:
avocado['type'].value_counts().plot(kind = 'bar') # basic bar plot
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c3a81eac0>
In [ ]:
avocado['type'].value_counts().plot(kind = 'barh') #horizontal bar plot
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c3a807160>

Knowledge Check

Which type of bar plot can be used in this example? Explain your reasoning.

You can use the customization arguments introduced above to customize your bar plot.

In [ ]:
## Customize your plot 

avocado['type'].value_counts().plot(kind = 'barh', figsize = (10,6), color = ['Red','Orange'],
                                    fontsize = 12) #fontsize: Font size for xticks and yticks 
#plt.xlabel('xlabel', fontsize=14)
plt.ylabel('Avocado Type', fontsize=14)
plt.title('The Avocado Type', fontsize = 18) #set the title and font size
Out[ ]:
Text(0.5, 1.0, 'The Avocado Type')

Let's add the counts on top of the bars... 😢

This takes a little more work.

In [ ]:
freq_table = pd.DataFrame(avocado['type'].value_counts())
freq_table
Out[ ]:
type
conventional 9126
organic 9123
In [ ]:
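ax = freq_table.iloc[:, 0].plot(kind = 'bar', figsize = (10,6), color = ['Red','Orange'],
                                    fontsize = 12) #iloc[:, 0] picks the count column whatever its name is (the name differs across pandas versions)

plt.xlabel('Avocado Type', fontsize=14) #in a vertical bar plot the avocado types lie on the x axis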
plt.title('The Avocado Type', fontsize = 18) #set the title and font size 
#plt.xticks([]) #remove x axis text 
plt.yticks([]) #remove y axis text. 

for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.005, p.get_height() * 1.005))
plt.show()
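
A shorter route to the same labels is sketched below. It assumes Matplotlib 3.4 or newer, where Axes.bar_label is available, and it uses a handful of invented category values instead of the avocado data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented categories, only to have something to count
toy_type = pd.Series(["conventional", "organic", "conventional",
                      "organic", "conventional"])

counts = toy_type.value_counts()
ax = counts.plot(kind="bar", color=["red", "orange"], figsize=(10, 6))

# The bars drawn by one plotting call are stored together in ax.containers[0];
# bar_label writes each bar's height above it
labels = ax.bar_label(ax.containers[0])
```

With bar_label there is no need to compute the text positions by hand as in the loop over ax.patches.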

What is the distribution of Average Price?

Since the average price is a numerical variable, the shape of its distribution can be assessed with a histogram.

In [ ]:
avocado['AveragePrice'].plot(kind = 'hist')
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c3a71c640>
In [ ]:
## Customize 

avocado['AveragePrice'].plot(kind = 'hist', color ="orange", figsize = (10,6))
plt.title("Histogram of the Average Price", fontsize = 14,fontweight='bold') #fontweight: bold makes your title bold.
Out[ ]:
Text(0.5, 1.0, 'Histogram of the Average Price')
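
The number of bars in a histogram is controlled by the bins argument. The sketch below uses simulated prices, not the avocado data, and the values 30 and "black" are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Simulated prices (not the avocado data)
rng = np.random.default_rng(1)
prices = pd.Series(rng.normal(loc=1.4, scale=0.4, size=500), name="AveragePrice")

# bins sets how many equal-width intervals the range is cut into
ax = prices.plot(kind="hist", bins=30, color="orange",
                 edgecolor="black", figsize=(10, 6))
ax.set_xlabel("Average Price")
```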

Knowledge Check

What is the default number of bins in a pandas histogram?

How do the distributions of Large Hass and Extra Large Hass differ from each other?

You can compare the shapes of the distributions of two numerical variables on the same plot. Here alpha, which sets the transparency of the bars, is the key parameter, because it keeps the overlapping histograms visible.

In [ ]:
avocado[['Large Hass', 'Extra Large Hass']].plot(kind = 'hist', alpha = 0.5, figsize = (10,6))
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c3a5dae50>

You can customize your plot using the functions and arguments listed above.

How does the distribution of average price change by avocado type?

Since the variables of interest are one numerical and one categorical variable, a box plot is an appropriate choice.

In [ ]:
avocado[['AveragePrice','type']].plot(kind = 'box',figsize=(10, 8))
#column: numerical variable & by: categorical variable
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c3a562e50>

Oops. We have a problem because the data are not in the appropriate form.

In [ ]:
subdata = avocado[['AveragePrice','type']]
subdata.head()
Out[ ]:
AveragePrice type
0 1.33 conventional
1 1.35 conventional
2 0.93 conventional
3 1.08 conventional
4 1.28 conventional

Solution: Reshape the data

In [ ]:
subdata_wider = subdata.pivot(columns="type", values="AveragePrice")
subdata_wider
Out[ ]:
type conventional organic
0 1.33 NaN
1 1.35 NaN
2 0.93 NaN
3 1.08 NaN
4 1.28 NaN
... ... ...
18244 NaN 1.63
18245 NaN 1.71
18246 NaN 1.87
18247 NaN 1.93
18248 NaN 1.62

18249 rows × 2 columns

In [ ]:
subdata_wider.plot(kind = 'box', figsize = (10,6))
/usr/local/lib/python3.8/dist-packages/matplotlib/cbook/__init__.py:1376: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  X = np.atleast_1d(X.T if isinstance(X, np.ndarray) else np.asarray(X))
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c3a83f250>

You can customize your plot using the functions and arguments listed above.
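
As an alternative to reshaping, pandas also has a separate DataFrame.boxplot method that takes the column and by arguments mentioned in the comment above. A minimal sketch on an invented long-format sample:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented long-format sample with the same two columns as subdata
toy = pd.DataFrame({
    "AveragePrice": [1.0, 1.2, 1.1, 1.6, 1.8, 1.7],
    "type": ["conventional"] * 3 + ["organic"] * 3,
})

fig, ax = plt.subplots(figsize=(10, 6))

# column: numerical variable, by: categorical variable that defines the groups
toy.boxplot(column="AveragePrice", by="type", ax=ax)
ax.set_ylabel("Average Price")
```

This draws one box per category without creating the wide table full of NaN values.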

What is the relationship between Small Hass and Small Bags?

Since we aim to investigate the relationship between two numerical variables, we can use scatter plot.

In [ ]:
avocado.plot(kind ='scatter', x = 'Small Hass', y ='Small Bags', color = 'darkorange', figsize = (10,8))
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c3a87a460>

You can customize your plot using the functions and arguments listed above.

Exercise 1

Import beerhall.xlsx data, and answer the following questions using the corresponding graphics.

In [ ]:
#run this cell
bh = pd.read_excel('beerhall.xlsx')
bh.head()
Out[ ]:
County Region_Name Region_Code Criminal Beer At_School At_Public
0 Middlesex South_Eastern 1 200 541 560 434
1 Surrey South_Eastern 1 160 504 630 482
2 Kent South_Eastern 1 160 552 790 680
3 Sussex South Eastern 1 147 295 820 678
4 Hants South_Eastern 1 178 409 990 798

1.1 How does the average school attendance (At_School) change by Region_Code? (Hint: use aggregate and line plot)

In [ ]:

1.2 Can you show the frequencies of Regions with an appropriate visual tool? Which region has the highest frequency?

In [ ]:

1.3 Do school attendance (At_School) and public attendance (At_Public) show a difference in terms of the distribution?

In [ ]:

1.4 How does the distribution of the criminals change by region?

In [ ]:

1.5 What is the relationship between criminal and public attendance?

In [ ]:

In addition to the plot() method, pandas also has the pandas.plotting module, which provides more advanced plots than the plot() method.

image.png

Source: Pandas

Among these functions, scatter_matrix() will be our choice since it is an efficient tool to understand the data at the beginning.

In [ ]:
from pandas.plotting import scatter_matrix
scatter_matrix(avocado, figsize = (18,12))
Out[ ]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f5c3a40dac0>,
        ...,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f5c392e43a0>],
       ...,
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f5c3840bbe0>,
        ...,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f5c38238f40>]],
      dtype=object)

The function draws a histogram of each numerical variable on the diagonal and, off the diagonal, a scatter plot for each pair of numerical variables. It returns the array of axes that hold these panels.
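
The sketch below, on random numbers rather than the avocado data, stores the returned array in a variable (axes is just a name chosen here) so that the notebook does not print the long list of axes objects. The alpha and diagonal arguments are optional.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Random numbers standing in for three numerical columns
rng = np.random.default_rng(2)
toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])

# One row and one column of panels per numerical variable
axes = scatter_matrix(toy, figsize=(9, 9), alpha=0.5, diagonal="hist")

print(axes.shape)
```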

Exercise 2

Try it yourself...

Import the pandas.plotting module and construct a scatter plot matrix for beerhall.xlsx data.

In [ ]:

Matplotlib

Matplotlib is a multi-platform data visualization library built on NumPy arrays, and designed to work with the broader SciPy stack. It enables interactive MATLAB-style plotting.

Although higher-level tools such as ggplot2 in R or seaborn in Python have been developed in recent years, matplotlib is still widely used.

Before discussing matplotlib in detail, there are a few things you should know about the package, some of which you have already used above.

In [ ]:
#import the package
import matplotlib as mpl
import matplotlib.pyplot as plt

The plt interface is what we will use most often, as we shall see throughout this chapter.

Setting Style

We will use the plt.style directive to choose appropriate aesthetic styles for our figures. Here we will set the classic style, which ensures that the plots we create use the classic Matplotlib style:

plt.style.use('classic')

Other built-in styles are available as well; their names are stored in plt.style.available.
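
A minimal sketch of both ideas: printing the available style names and applying the classic style to a single figure with the plt.style.context context manager, so that the style does not leak into later plots.

```python
import numpy as np
import matplotlib.pyplot as plt

# Names of the style sheets shipped with the installed Matplotlib
print(plt.style.available)

x = np.linspace(0, 10, 100)

# The context manager applies a style only inside the with block,
# so later figures keep the current settings
with plt.style.context("classic"):
    fig, ax = plt.subplots()
    ax.plot(x, np.sin(x))
```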

Showing the plot

If you are using Matplotlib from within a script, the function plt.show() is your friend. plt.show() starts an event loop, looks for all currently active figure objects, and opens one or more interactive windows that display your figure or figures.

In [ ]:
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

fig = plt.figure() #figure: Create a new figure, or activate an existing figure
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x))

plt.show()

Saving Figures to File

One nice feature of Matplotlib is the ability to save figures in a wide variety of formats. Saving a figure can be done using the savefig() command. For example, to save the previous figure as a PNG file, you can run this:

In [ ]:
fig.savefig('my_figure.png')
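
savefig() also accepts optional arguments. The sketch below sets the resolution with dpi, trims the margins with bbox_inches, and lists the file formats that the current backend can write. The temporary folder and the dpi value are arbitrary choices made for this illustration.

```python
import os
import tempfile

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x))

# Write into a temporary folder (an arbitrary location for this sketch)
out_path = os.path.join(tempfile.mkdtemp(), "my_figure.png")

# dpi controls the resolution; bbox_inches="tight" trims empty margins
fig.savefig(out_path, dpi=150, bbox_inches="tight")

# File formats that the active backend can write
print(fig.canvas.get_supported_filetypes())
```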

Plot Frame

Matplotlib can draw multiple plots in one window.

In [ ]:
plt.figure()  # create a plot figure

# create the first of two panels and set current axis
plt.subplot(2, 1, 1) # (rows, columns, panel number)
plt.plot(x, np.sin(x))

# create the second panel and set current axis
plt.subplot(2, 1, 2)
plt.plot(x, np.cos(x));

Now, let's start by drawing a basic line chart.

For all Matplotlib plots, we start by creating a figure and an axes. In their simplest form, a figure and axes can be created as follows:

In [ ]:
fig = plt.figure()
ax = plt.axes()

In Matplotlib, the figure (an instance of the class plt.Figure) can be thought of as a single container that contains all the objects representing axes, graphics, text, and labels. The axes (an instance of the class plt.Axes) is what we see above: a bounding box with ticks and labels, which will eventually contain the plot elements that make up our visualization. Throughout this class, we'll commonly use the variable name fig to refer to a figure instance, and ax to refer to an axes instance or group of axes instances.

Once we have created an axes, we can use the ax.plot function to plot some data. Let's start with a simple sinusoid:

In [ ]:
fig = plt.figure()
ax = plt.axes()

x = np.linspace(0, 10, 1000)
ax.plot(x, np.sin(x));
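
The figure and the axes can also be created in one step with plt.subplots(), which was used earlier for the grid of panels. A minimal sketch:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 1000)

# plt.subplots() with no arguments returns one figure and one axes
fig, ax = plt.subplots()
line, = ax.plot(x, np.sin(x))
```

ax.plot returns a list of line objects, one per plotted line, which is why the result is unpacked with a trailing comma.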

Alternatively, we can use the pyplot (MATLAB-style) interface and let the figure and axes be created for us in the background:

In [ ]:
plt.plot(x, np.sin(x));

If we want to create a single figure with multiple lines, we can simply call the plot function multiple times:

In [ ]:
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x));

Adjusting the Plot: Line Colors and Styles

The first adjustment you might wish to make to a plot is to control the line colors and styles. The plt.plot() function takes additional arguments that can be used to specify these. To adjust the color, you can use the color keyword, which accepts a string argument representing virtually any imaginable color. The color can be specified in a variety of ways:

In [ ]:
plt.plot(x, np.sin(x - 0), color='blue')        # specify color by name
plt.plot(x, np.sin(x - 1), color='g')           # short color code (rgbcmyk)
plt.plot(x, np.sin(x - 2), color='0.75')        # Grayscale between 0 and 1
plt.plot(x, np.sin(x - 3), color='#FFDD44')     # Hex code (RRGGBB from 00 to FF)
plt.plot(x, np.sin(x - 4), color=(1.0,0.2,0.3)) # RGB tuple, values 0 to 1
plt.plot(x, np.sin(x - 5), color='chartreuse'); # all HTML color names supported

If no color is specified, Matplotlib will automatically cycle through a set of default colors for multiple lines.

Similarly, the line style can be adjusted using the linestyle keyword:

In [ ]:
plt.plot(x, x + 0, linestyle='solid')
plt.plot(x, x + 1, linestyle='dashed')
plt.plot(x, x + 2, linestyle='dashdot')
plt.plot(x, x + 3, linestyle='dotted');

# For short, you can use the following codes:
plt.plot(x, x + 4, linestyle='-')  # solid
plt.plot(x, x + 5, linestyle='--') # dashed
plt.plot(x, x + 6, linestyle='-.') # dashdot
plt.plot(x, x + 7, linestyle=':');  # dotted
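
The line style code and the short color code can also be combined into a single format string passed as the third positional argument. A minimal sketch, with variable names chosen only for this illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots()

# The third positional argument is a format string: line style plus color code
dashed_red, = ax.plot(x, x + 0, "--r")   # dashed red
dotted_black, = ax.plot(x, x + 1, ":k")  # dotted black

# linewidth is a separate keyword that sets the thickness in points
thick_green, = ax.plot(x, x + 2, "-g", linewidth=3)
```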

Adjusting the Plot: Axes Limits

Matplotlib does a decent job of choosing default axes limits for your plot, but sometimes it's nice to have finer control. The most basic way to adjust axis limits is to use the plt.xlim() and plt.ylim() methods:

In [ ]:
plt.plot(x, np.sin(x))

plt.xlim(-1, 11)
plt.ylim(-1.5, 1.5);

If for some reason you'd like either axis to be displayed in reverse, you can simply reverse the order of the arguments:

In [ ]:
plt.plot(x, np.sin(x))

plt.xlim(10, 0)
plt.ylim(1.2, -1.2);

A useful related method is plt.axis() (note here the potential confusion between axes with an e, and axis with an i). The plt.axis() method allows you to set the x and y limits with a single call, by passing a list which specifies [xmin, xmax, ymin, ymax]:

In [ ]:
plt.plot(x, np.sin(x))
plt.axis([-1, 11, -1.5, 1.5]);

The plt.axis() method goes even beyond this, allowing you to do things like automatically tighten the bounds around the current plot:

In [ ]:
plt.plot(x, np.sin(x))
plt.axis('tight');

It allows even higher-level specifications, such as ensuring an equal aspect ratio so that on your screen, one unit in x is equal to one unit in y:

In [ ]:
plt.plot(x, np.sin(x))
plt.axis('equal');

For more information on axis limits and the other capabilities of the plt.axis method, refer to the plt.axis docstring.

Labeling Plots

As the last piece of this section, we'll briefly look at the labeling of plots: titles, axis labels, and simple legends.

Titles and axis labels are the simplest such labels—there are methods that can be used to quickly set them:

In [ ]:
plt.plot(x, np.sin(x))
plt.title("A Sine Curve")
plt.xlabel("x")
plt.ylabel("sin(x)");

The position, size, and style of these labels can be adjusted using optional arguments to the function. For more information, see the Matplotlib documentation and the docstrings of each of these functions.

When multiple lines are being shown within a single axes, it can be useful to create a plot legend that labels each line type. Again, Matplotlib has a built-in way of quickly creating such a legend. It is done via the (you guessed it) plt.legend() method. Though there are several valid ways of using this, I find it easiest to specify the label of each line using the label keyword of the plot function:

In [ ]:
plt.plot(x, np.sin(x), '-g', label='sin(x)')
plt.plot(x, np.cos(x), ':b', label='cos(x)')
plt.axis('equal')

plt.legend();

As you can see, the plt.legend() function keeps track of the line style and color, and matches these with the correct label. More information on specifying and formatting plot legends can be found in the plt.legend docstring.

Aside: Matplotlib Gotchas

While most plt functions translate directly to ax methods (such as plt.plot() → ax.plot(), plt.legend() → ax.legend(), etc.), this is not the case for all commands. In particular, functions to set limits, labels, and titles are slightly modified. For transitioning between MATLAB-style functions and object-oriented methods, make the following changes:

  • plt.xlabel() → ax.set_xlabel()
  • plt.ylabel() → ax.set_ylabel()
  • plt.xlim() → ax.set_xlim()
  • plt.ylim() → ax.set_ylim()
  • plt.title() → ax.set_title()

In the object-oriented interface to plotting, rather than calling these functions individually, it is often more convenient to use the ax.set() method to set all these properties at once:

In [ ]:
ax = plt.axes()
ax.plot(x, np.sin(x))
ax.set(xlim=(0, 10), ylim=(-2, 2),
       xlabel='x', ylabel='sin(x)',
       title='A Simple Plot');

After this overview about the arguments in matplotlib, we will look at how we can draw the other plots.

Bar Plot

In [ ]:
import matplotlib.pyplot as plt

# Look at index 4 and 6, which demonstrate overlapping cases.
x1 = [1, 3, 4, 5, 6, 7, 9]
y1 = [4, 7, 2, 4, 7, 8, 3]

x2 = [2, 4, 6, 8, 10]
y2 = [5, 6, 2, 6, 2]

# Colors: https://matplotlib.org/api/colors_api.html

plt.bar(x1, y1, label="Blue Bar", color='b')
plt.bar(x2, y2, label="Green Bar", color='g')
plt.plot()

plt.xlabel("bar number")
plt.ylabel("bar height")
plt.title("Bar Chart Example")
plt.legend()
plt.show()
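
In the example above, x = 4 and x = 6 appear in both series, so one bar is drawn over the other at those positions. One common way to avoid this, sketched here with invented heights, is to give the bars an explicit width and shift each group by half of it.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up heights for two groups measured at the same four positions
positions = np.arange(4)
heights_a = [4, 7, 2, 5]
heights_b = [5, 6, 3, 6]
width = 0.4  # each bar takes 0.4 of the unit spacing

fig, ax = plt.subplots()
# Shift each group by half a bar width so the pairs sit next to each other
ax.bar(positions - width / 2, heights_a, width=width, label="Group A", color="b")
ax.bar(positions + width / 2, heights_b, width=width, label="Group B", color="g")

ax.set_xticks(positions)
ax.set_xlabel("bar number")
ax.set_ylabel("bar height")
ax.legend()
```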

Histogram

In [ ]:
import matplotlib.pyplot as plt
import numpy as np

# Use numpy to generate a bunch of random data in a bell curve around 5.
n = 5 + np.random.randn(1000)

m = list(range(len(n)))  # bar positions 0, 1, ..., len(n) - 1
plt.bar(m, n)
plt.title("Raw Data")
plt.show()

plt.hist(n, bins=20)
plt.title("Histogram")
plt.show()

plt.hist(n, cumulative=True, bins=20)
plt.title("Cumulative Histogram")
plt.show()

Scatter Plot

In [ ]:
import matplotlib.pyplot as plt

x1 = [2, 3, 4]
y1 = [5, 5, 5]

x2 = [1, 2, 3, 4, 5]
y2 = [2, 3, 2, 3, 4]
y3 = [6, 8, 7, 8, 7]

# Markers: https://matplotlib.org/api/markers_api.html

plt.scatter(x1, y1)
plt.scatter(x2, y2, marker='v', color='r')
plt.scatter(x2, y3, marker='^', color='m')
plt.title('Scatter Plot Example')
plt.show()

Box Plot

In [ ]:
x1 = np.random.rand(50)
plt.boxplot(x1)
Out[ ]:
{'whiskers': [<matplotlib.lines.Line2D at 0x7f5c37b23f10>,
  <matplotlib.lines.Line2D at 0x7f5c37b23340>],
 'caps': [<matplotlib.lines.Line2D at 0x7f5c37b23040>,
  <matplotlib.lines.Line2D at 0x7f5c37b235e0>],
 'boxes': [<matplotlib.lines.Line2D at 0x7f5c37b36d90>],
 'medians': [<matplotlib.lines.Line2D at 0x7f5c37b435b0>],
 'fliers': [<matplotlib.lines.Line2D at 0x7f5c37b43d90>],
 'means': []}
In [ ]:
x1 = np.random.rand(50)
x2 = np.random.rand(50)
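x3 = np.concatenate([x1,x2])
x3_reshape = x3.reshape(50,2) #reshape fills row by row, so each column mixes values from x1 and x2; np.column_stack([x1,x2]) would keep x1 and x2 in separate columns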
x3_reshape
Out[ ]:
array([[0.37485003, 0.77893345],
       [0.36409428, 0.05609475],
       [0.08252197, 0.34098896],
       [0.2943063 , 0.39656947],
       [0.26390944, 0.58239511],
       [0.35594333, 0.28492717],
       [0.89719534, 0.55892311],
       [0.77071473, 0.30624419],
       [0.35620354, 0.1913261 ],
       [0.61040188, 0.46154939],
       [0.33346341, 0.62550058],
       [0.20475847, 0.53530646],
       [0.85117428, 0.61077819],
       [0.54529397, 0.08593079],
       [0.50031228, 0.82524808],
       [0.99260623, 0.76202116],
       [0.26576717, 0.81966557],
       [0.14183538, 0.42440211],
       [0.10992322, 0.41759987],
       [0.7133524 , 0.04633627],
       [0.3977847 , 0.2484725 ],
       [0.51825317, 0.71760286],
       [0.42025894, 0.58153223],
       [0.06575681, 0.51405647],
       [0.56005002, 0.45456881],
       [0.41072718, 0.23052248],
       [0.71188251, 0.51815988],
       [0.84845254, 0.62700756],
       [0.19209136, 0.71737884],
       [0.9599086 , 0.7983489 ],
       [0.39438165, 0.54197614],
       [0.29515117, 0.99765865],
       [0.86260054, 0.00251253],
       [0.06653977, 0.78990063],
       [0.31046066, 0.74895621],
       [0.65818705, 0.28955124],
       [0.85711566, 0.86648347],
       [0.14764391, 0.38662099],
       [0.68321046, 0.19463156],
       [0.41809038, 0.72489841],
       [0.25181415, 0.3726507 ],
       [0.52927835, 0.8026993 ],
       [0.59530913, 0.98302614],
       [0.23325503, 0.60245023],
       [0.13189185, 0.37497201],
       [0.69698911, 0.31752449],
       [0.73071818, 0.37029798],
       [0.75689245, 0.93438718],
       [0.69533411, 0.21669761],
       [0.02104218, 0.36402817]])
In [ ]:
plt.boxplot(x3_reshape)
Out[ ]:
{'whiskers': [<matplotlib.lines.Line2D at 0x7f5c37c02eb0>,
  <matplotlib.lines.Line2D at 0x7f5c37c02340>,
  <matplotlib.lines.Line2D at 0x7f5c37c69430>,
  <matplotlib.lines.Line2D at 0x7f5c37c69a00>],
 'caps': [<matplotlib.lines.Line2D at 0x7f5c37c939d0>,
  <matplotlib.lines.Line2D at 0x7f5c37c93c70>,
  <matplotlib.lines.Line2D at 0x7f5c37c501f0>,
  <matplotlib.lines.Line2D at 0x7f5c37c50a30>],
 'boxes': [<matplotlib.lines.Line2D at 0x7f5c37c02640>,
  <matplotlib.lines.Line2D at 0x7f5c37c69df0>],
 'medians': [<matplotlib.lines.Line2D at 0x7f5c37c69b50>,
  <matplotlib.lines.Line2D at 0x7f5c37c171f0>],
 'fliers': [<matplotlib.lines.Line2D at 0x7f5c37c69eb0>,
  <matplotlib.lines.Line2D at 0x7f5c37c17640>],
 'means': []}
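
plt.boxplot also accepts a list of arrays, which avoids the reshaping step and keeps each sample in its own box. The sketch below uses freshly generated random numbers and adds tick labels. Storing the returned dictionary in a variable (result is a name chosen here) keeps the notebook output short.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x1 = rng.random(50)
x2 = rng.random(50)

fig, ax = plt.subplots()

# A list of arrays gives one box per array, in the order listed
result = ax.boxplot([x1, x2])
ax.set_xticklabels(["x1", "x2"])
ax.set_ylabel("value")
```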

Exercise 3

Please run the following code to create the data.

In [ ]:
df = pd.DataFrame(data={'a':np.random.randint(0, 100, 30),
                        'b':np.random.randint(0, 100, 30),
                        'c':np.random.randint(0, 100, 30)})
df.head()
Out[ ]:
a b c
0 4 31 37
1 70 80 86
2 78 75 4
3 97 84 13
4 0 9 53

Answer the following questions using the appropriate tools from matplotlib.

3.1 What is the distribution of a?

In [ ]:
plt.hist(df['a'].values)
Out[ ]:
(array([4., 3., 2., 0., 2., 3., 7., 1., 3., 5.]),
 array([ 0. ,  9.7, 19.4, 29.1, 38.8, 48.5, 58.2, 67.9, 77.6, 87.3, 97. ]),
 <a list of 10 Patch objects>)

3.2 What is the association between a and b?

In [ ]:

3.3 How do b and c differ from each other?

In [ ]:

Matplotlib Cheat Sheet

image.png