Stat 112 - Recitation 8

Ozancan Ozdemir - ozancan@metu.edu.tr


Introduction to Pandas

This introduction was generated by ChatGPT.

Pandas is a software library written for the Python programming language for data manipulation and analysis. It is often used for working with data in tabular form, such as data stored in a spreadsheet or a database table.

Pandas provides a powerful set of tools for working with and analyzing data, including data frames, which are similar to tables in a relational database, and tools for working with time series data.

To install pandas with Anaconda, run the following command in your Anaconda terminal.

conda install pandas

However, this is not an issue in Colaboratory, where Pandas comes preinstalled. :)

Once Pandas is installed, you can import it into your Python script using the following code:

In [1]:
import pandas as pd #import the package

The pd alias will be used to reference Pandas throughout your code.

In Pandas, the main data structures for storing and working with data are the Series and the DataFrame; older versions also had a Panel.

A Series is a one-dimensional array-like object that can hold any data type. It has a labeled index, which is used to access the elements of the Series.

A DataFrame is a 2-dimensional tabular data structure that can hold any data type. It is similar to a spreadsheet or a SQL table. A DataFrame has a labeled index and labeled columns, which allow you to access the data in the table using the labels rather than numeric indices.

A Panel was a 3-dimensional data container, but it has been removed from recent versions of Pandas. In any case, we mostly use the DataFrame, at least in this course.

You can think of a DataFrame as a collection of Series objects that share the same index. This makes it easy to perform operations on data in the table, such as selecting specific rows or columns, or applying calculations to entire columns of data.
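As a quick sketch of this idea (the names pop, country, and df_sketch are invented for illustration), two Series sharing an index line up into the columns of a single DataFrame:

import pandas as pd

# two Series that share the same index...
pop = pd.Series([37274000, 32065760], index=['Tokyo', 'Delhi'])
country = pd.Series(['Japan', 'India'], index=['Tokyo', 'Delhi'])

# ...combine into one DataFrame, one Series per column
df_sketch = pd.DataFrame({'Population': pop, 'Country': country})
print(type(df_sketch['Population']))  # <class 'pandas.core.series.Series'>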

[Figure: Series and DataFrame structure. Source: Data Science Made Simple]

You can create a Series or DataFrame in Pandas by passing a list or array of data to the pandas.Series() or pandas.DataFrame() constructor, respectively. You can also import data from a file or other data source into a DataFrame using the pandas.read_*() functions, where the * is replaced with the file format (e.g. csv or excel).

In [2]:
# create a Series
s = pd.Series([1,9,5,6])
print(s)
0    1
1    9
2    5
3    6
dtype: int64
In [3]:
# create a DataFrame
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(df)
   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9

Although the created objects above include only integers, there are several data types that can be used in a Pandas DataFrame, including:

  • object: a string or mixed data type
  • int64: a 64-bit integer
  • float64: a 64-bit floating-point number
  • bool: a boolean value (True or False)
  • datetime64: a date and time, stored in a 64-bit format
  • timedelta[ns]: a duration of time, stored in a 64-bit format

These are just some of the data types that can be used in Pandas. You can use the DataFrame.dtypes property to view the data type for each column in a DataFrame.

In [4]:
df.dtypes
Out[4]:
0    int64
1    int64
2    int64
dtype: object
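The toy frame above contains only int64 columns. As a minimal sketch (column names made up here), a frame with mixed contents gets one dtype per column:

mixed = pd.DataFrame({'name':  ['a', 'b'],       # object
                      'count': [1, 2],           # int64
                      'score': [0.5, 1.5],       # float64
                      'flag':  [True, False]})   # bool
print(mixed.dtypes)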

Creating Data Frame from Scratch

In pandas, there are two main ways to create a data frame from scratch: from a list or from a dictionary. Creating a data frame from a list has already been done above. Now, we will look at how to use dictionaries to create a data frame in pandas.

Note that dictionaries provide an advantage when it comes to assigning column names to your data frame.

When you use a list to create a data frame, you have to define the column names afterwards. With a dictionary, however, you define the column names and the values at the same time, as the sketch below shows.
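Here is a minimal sketch of the difference, with toy values and invented names:

# with a list, the column names are supplied separately via the columns argument
df_list = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])

# with a dictionary, the keys double as the column names
df_dict = pd.DataFrame({'a': [1, 3], 'b': [2, 4]})

print(df_list.equals(df_dict))  # True: the two frames are identical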

Reminder

A typical dictionary object in Python has the following structure.

dict_name = {'key' : value}

or

dict_name = { 'key' : [value1,value2] }

Keys and values can be integers, strings, or many other types.

In [5]:
# Dictionary with integer keys
my_dict = {1: 'A', 2: 'B'}
print(my_dict)

# Dictionary with string keys
my_dict = {'name': 'ozan', 'age': 28}
print(my_dict)

# Dictionary with mixed keys
my_dict = {'name': 'ozan', 1: ['A', 'B']}
print(my_dict)
{1: 'A', 2: 'B'}
{'name': 'ozan', 'age': 28}
{'name': 'ozan', 1: ['A', 'B']}

Now, we will create the following table using dictionaries. In doing so, we use the pd.DataFrame() constructor.

[Figure: target table — the world's ten most populous cities, with country, 2022 and 2021 populations, and growth rate]

In [6]:
table = pd.DataFrame({'City' : ['Tokyo','Delhi','Shangai','Dhaka','Sao Paulo','Mexico City','Cairo','Beijing','Mumbai','Osaka'],
                      'Country': ['Japan','India','China','Bangladesh','Brazil','Mexico','Egypt','China','India','Japan'],
                      '2022 Population':[37274000,32065760,28516904,22478116,22429800,22085140,21750020,21333332,20961472,19059856],
                      '2021 Population': [37339804,31181376,27795702,21741090,22237472,21918936,21322750,20896820,20667656,19110616],
                      'Growth': [-0.18,2.84,2.59,3.39,0.86,0.76,2,2.09,1.42,-0.27]})

print(table)
          City     Country  2022 Population  2021 Population  Growth
0        Tokyo       Japan         37274000         37339804   -0.18
1        Delhi       India         32065760         31181376    2.84
2      Shangai       China         28516904         27795702    2.59
3        Dhaka  Bangladesh         22478116         21741090    3.39
4    Sao Paulo      Brazil         22429800         22237472    0.86
5  Mexico City      Mexico         22085140         21918936    0.76
6        Cairo       Egypt         21750020         21322750    2.00
7      Beijing       China         21333332         20896820    2.09
8       Mumbai       India         20961472         20667656    1.42
9        Osaka       Japan         19059856         19110616   -0.27

Exercise 1

Please create the following DataFrame and name it ex1.

[Figure: target DataFrame for Exercise 1 — monthly starving line and poverty line values]

In [ ]:
 

After creating the data frame, we will introduce some Pandas attributes of DataFrame objects.

In [7]:
table.dtypes #list the data types of the series
Out[7]:
City                object
Country             object
2022 Population      int64
2021 Population      int64
Growth             float64
dtype: object
In [8]:
table.shape #shows the number of rows and columns of your data frame
Out[8]:
(10, 5)

Displaying general information about the data frame: You can print the information above all at once using the df.info() method, where df is the name of the data frame.

In [9]:
table.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   City             10 non-null     object 
 1   Country          10 non-null     object 
 2   2022 Population  10 non-null     int64  
 3   2021 Population  10 non-null     int64  
 4   Growth           10 non-null     float64
dtypes: float64(1), int64(2), object(2)
memory usage: 528.0+ bytes

When you work with a longer data frame, it is not necessary to view all observations. The df.head() function shows the first 5 observations in your data, while df.tail() shows the last 5.

In [10]:
table.head() #shows first 5 observations 
Out[10]:
City Country 2022 Population 2021 Population Growth
0 Tokyo Japan 37274000 37339804 -0.18
1 Delhi India 32065760 31181376 2.84
2 Shangai China 28516904 27795702 2.59
3 Dhaka Bangladesh 22478116 21741090 3.39
4 Sao Paulo Brazil 22429800 22237472 0.86
In [11]:
table.head(3) #shows first three observations
Out[11]:
City Country 2022 Population 2021 Population Growth
0 Tokyo Japan 37274000 37339804 -0.18
1 Delhi India 32065760 31181376 2.84
2 Shangai China 28516904 27795702 2.59
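df.tail() is not demonstrated above, so here is a quick sketch; it mirrors head():

table.tail(2)  # shows the last two rows of the data (here, Mumbai and Osaka)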

Adding a new column: You can easily add new information to your data frame. Let's add the continent of each country. But first, create the continent object.

In [12]:
import numpy as np 

continent =  np.repeat(np.array(['Asia','S.America','Africa','Asia']),[4,2,1,3],axis = 0)
continent
Out[12]:
array(['Asia', 'Asia', 'Asia', 'Asia', 'S.America', 'S.America', 'Africa',
       'Asia', 'Asia', 'Asia'], dtype='<U9')
In [13]:
table['continent']  = continent 
#df_name['new column name'] = new series
table.head()
Out[13]:
City Country 2022 Population 2021 Population Growth continent
0 Tokyo Japan 37274000 37339804 -0.18 Asia
1 Delhi India 32065760 31181376 2.84 Asia
2 Shangai China 28516904 27795702 2.59 Asia
3 Dhaka Bangladesh 22478116 21741090 3.39 Asia
4 Sao Paulo Brazil 22429800 22237472 0.86 S.America
In [14]:
table.columns # list the columns in the data
Out[14]:
Index(['City', 'Country', '2022 Population', '2021 Population', 'Growth',
       'continent'],
      dtype='object')

Subsetting a Data Frame: Pandas provides several ways to access a subset of a DataFrame. In other words, you can pull out specific rows, columns, slices or cells of the DataFrame object using different approaches.

In [15]:
table['City']
#df_name['Column name']
Out[15]:
0          Tokyo
1          Delhi
2        Shangai
3          Dhaka
4      Sao Paulo
5    Mexico City
6          Cairo
7        Beijing
8         Mumbai
9          Osaka
Name: City, dtype: object
In [16]:
table[['City','Country']] 
#for multiple extraction, create a list object. 
Out[16]:
City Country
0 Tokyo Japan
1 Delhi India
2 Shangai China
3 Dhaka Bangladesh
4 Sao Paulo Brazil
5 Mexico City Mexico
6 Cairo Egypt
7 Beijing China
8 Mumbai India
9 Osaka Japan

You can access your data using loc[] and iloc[] in Pandas.

loc stands for "location" and is used to select data by the labels of the rows and columns. For example, to select the row labeled 0 in a DataFrame, you would use df.loc[0].

iloc stands for "integer location" and is used to select data by the position of the rows and columns. So, to select the first row of a DataFrame using iloc, you would use df.iloc[0].

One key difference between the two is that loc uses the labels of the rows and columns to identify which data to select, while iloc uses the integer positions of the rows and columns.

This means that if you change the row labels or column names, loc works with the new labels, while iloc always uses the current integer positions of the rows and columns, as the sketch below illustrates.

Indexing starts from 0 in Pandas, as in Numpy.
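Here is a small sketch of the label/position distinction using the table above (the name labeled is ours). Once the city names become the row labels, loc answers by label and iloc by position:

labeled = table.set_index('City')  # row labels are now the city names

labeled.loc['Tokyo']   # label-based: the row whose label is 'Tokyo'
labeled.iloc[0]        # position-based: the first row, whatever its label

# sorting changes positions but not labels
labeled.sort_index().iloc[0]       # now Beijing, alphabetically first
labeled.sort_index().loc['Tokyo']  # still Tokyo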

In [17]:
table.loc[0] #first row of the data
Out[17]:
City                  Tokyo
Country               Japan
2022 Population    37274000
2021 Population    37339804
Growth                -0.18
continent              Asia
Name: 0, dtype: object

Since this data has no custom row labels, the default integer labels are what we pass to loc.

In [18]:
table.iloc[[0,3,8]] #1st, 4th, 9th rows of the data
Out[18]:
City Country 2022 Population 2021 Population Growth continent
0 Tokyo Japan 37274000 37339804 -0.18 Asia
3 Dhaka Bangladesh 22478116 21741090 3.39 Asia
8 Mumbai India 20961472 20667656 1.42 Asia
In [19]:
table['City'].iloc[2] #the name of the third city
Out[19]:
'Shangai'

You can use boolean indexing for slicing a DataFrame in Pandas. Remember that boolean indexing is useful when you access your data based on a condition.

For example, suppose we would like to list the information for the cities whose current population exceeds 24000000.

In [20]:
table[table['2022 Population']>24000000]
Out[20]:
City Country 2022 Population 2021 Population Growth continent
0 Tokyo Japan 37274000 37339804 -0.18 Asia
1 Delhi India 32065760 31181376 2.84 Asia
2 Shangai China 28516904 27795702 2.59 Asia
In [21]:
#If we only want to list the city names.

table['City'].loc[table['2022 Population']>24000000]
Out[21]:
0      Tokyo
1      Delhi
2    Shangai
Name: City, dtype: object

If you try this with iloc, you will get an error.

table['City'].iloc[table['2022 Population']>24000000]

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-39-b13d5c365ef6> in <module>
----> 1 table['City'].iloc[table['2022 Population']>24000000]

2 frames
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py in _validate_key(self, key, axis)
   1393             if hasattr(key, "index") and isinstance(key.index, Index):
   1394                 if key.index.inferred_type == "integer":
-> 1395                     raise NotImplementedError(
   1396                         "iLocation based boolean "
   1397                         "indexing on an integer type "

NotImplementedError: iLocation based boolean indexing on an integer type is not available
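If you do want iloc here, one workaround is converting the boolean Series to a plain array first (a sketch; .loc is the idiomatic choice):

mask = (table['2022 Population'] > 24000000).values  # plain boolean array, no index
table['City'].iloc[mask]  # iloc accepts boolean arrays, just not boolean Series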

Renaming a column

You can change either a specific column name or all of the column names in a DataFrame object.

Check rename and columns

df.rename({'old name': 'new name'},axis = "columns")

You can change the row labels with rename if axis = "rows".

To change all the column names simultaneously, assign a list to the columns attribute:

df.columns = ['c1','c2','c3']

In [22]:
table.rename({'continent': 'Continent'},axis = "columns")
Out[22]:
City Country 2022 Population 2021 Population Growth Continent
0 Tokyo Japan 37274000 37339804 -0.18 Asia
1 Delhi India 32065760 31181376 2.84 Asia
2 Shangai China 28516904 27795702 2.59 Asia
3 Dhaka Bangladesh 22478116 21741090 3.39 Asia
4 Sao Paulo Brazil 22429800 22237472 0.86 S.America
5 Mexico City Mexico 22085140 21918936 0.76 S.America
6 Cairo Egypt 21750020 21322750 2.00 Africa
7 Beijing China 21333332 20896820 2.09 Asia
8 Mumbai India 20961472 20667656 1.42 Asia
9 Osaka Japan 19059856 19110616 -0.27 Asia
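Note that rename returned a new DataFrame above; table itself still has the lowercase continent column. To keep the change, assign the result back (table2 is just an illustrative name):

table2 = table.rename({'continent': 'Continent'}, axis = "columns")
# or modify in place: table.rename({'continent': 'Continent'}, axis = "columns", inplace = True)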

Exercise 2

Consider the ex1 DataFrame that you created in Exercise 1.

Answer the following questions using this data frame.

a) Show all of the starving and poverty information for the first 6 months of the year.

In [ ]:
 

b) Display the starving line column only.

In [ ]:
 

c) Display the rows where the poverty line is below 20000.

In [ ]:
 

d) List the names of the months where the starving line exceeds 7000.

In [ ]:
 

Creating Data Frame by Importing Data

You can import data from several sources into your environment using pandas.

You can use the pandas.read_*() functions, where the * is replaced with the file format (e.g. csv or excel).

Before proceeding, note the easiest way to bring your data into Colaboratory: just drag and drop.

[Figure: drag-and-drop upload into the Colab file pane]

Note that the data you drag and drop is deleted when you close the notebook in Colab.

Now, consider teams.csv, teams.xlsx and https://metustat112.github.io/teams.csv.

Way 1

In [23]:
teams = pd.read_csv("teams.csv")
#teams = pd.read_csv("Teams.csv",sep = ";")
#dataname = pd.read_csv("filename.csv",sep = ";")
#sep argument specifies a custom delimiter for the CSV input, where the default is comma.
teams.head()
Out[23]:
Unnamed: 0 team ranking continent games wins draws losses goalsFor goalsAgainst yellowCards redCards
0 0 Brazil 1 South America 5 3 1 1 9 4 7 2
1 1 Spain 2 Europe 6 5 0 1 7 2 3 0
2 2 Portugal 3 Europe 4 1 2 1 7 1 8 1
3 3 Netherlands 4 Europe 6 6 0 0 12 5 15 0
4 4 Italy 5 Europe 3 0 2 1 4 5 5 0

Way 2

In [24]:
teams_excel = pd.read_excel('teams.xlsx')
teams_excel.head()
Out[24]:
team ranking continent games wins draws losses goalsFor goalsAgainst yellowCards redCards
0 Brazil 1 South America 5 3 1 1 9 4 7 2
1 Spain 2 Europe 6 5 0 1 7 2 3 0
2 Portugal 3 Europe 4 1 2 1 7 1 8 1
3 Netherlands 4 Europe 6 6 0 0 12 5 15 0
4 Italy 5 Europe 3 0 2 1 4 5 5 0

Way 3

In [25]:
teams_url = pd.read_csv('https://metustat112.github.io/teams.csv')
teams_url.head()
Out[25]:
Unnamed: 0 team ranking continent games wins draws losses goalsFor goalsAgainst yellowCards redCards
0 0 Brazil 1 South America 5 3 1 1 9 4 7 2
1 1 Spain 2 Europe 6 5 0 1 7 2 3 0
2 2 Portugal 3 Europe 4 1 2 1 7 1 8 1
3 3 Netherlands 4 Europe 6 6 0 0 12 5 15 0
4 4 Italy 5 Europe 3 0 2 1 4 5 5 0

Continue with the teams data.

In [26]:
teams.head() #shows the first five observations
#dataname.function()
Out[26]:
Unnamed: 0 team ranking continent games wins draws losses goalsFor goalsAgainst yellowCards redCards
0 0 Brazil 1 South America 5 3 1 1 9 4 7 2
1 1 Spain 2 Europe 6 5 0 1 7 2 3 0
2 2 Portugal 3 Europe 4 1 2 1 7 1 8 1
3 3 Netherlands 4 Europe 6 6 0 0 12 5 15 0
4 4 Italy 5 Europe 3 0 2 1 4 5 5 0

On the left-hand side of the output, the sequence starting from 0 is the index of the data frame, created automatically by pandas. In addition, the first column of the data is an Unnamed: 0 column: this is the old row index that was written into the CSV file.

We should remove it to work efficiently and save memory. There are various ways to get rid of this column. Here is one solution, the drop() function, which is used to drop specified labels from rows or columns.

In [27]:
teams_new = teams.drop("Unnamed: 0", axis=1) #axis = 1 drops columns; axis = 0 drops rows
teams_new.head()
Out[27]:
team ranking continent games wins draws losses goalsFor goalsAgainst yellowCards redCards
0 Brazil 1 South America 5 3 1 1 9 4 7 2
1 Spain 2 Europe 6 5 0 1 7 2 3 0
2 Portugal 3 Europe 4 1 2 1 7 1 8 1
3 Netherlands 4 Europe 6 6 0 0 12 5 15 0
4 Italy 5 Europe 3 0 2 1 4 5 5 0

After dropping the unnecessary column, export your data set.

You can use the df.to_*('filename.*') functions, where the * is replaced with the file format (e.g. csv or excel), filename is the name of the exported file, and df is the name of the data frame.

In [28]:
teams_new.to_csv('teamsnew.csv')
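By default, to_csv also writes the row index as an extra first column, which is exactly where the Unnamed: 0 column above came from. If you plan to re-import the file, passing index=False avoids this:

teams_new.to_csv('teamsnew.csv', index=False)  # omit the row index from the file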

Pandas Properties

Here, we present some pandas functions that are useful in data manipulation.

  • groupby

This function groups the data with respect to the levels of a variable. It is generally used to apply aggregate functions like min, median, mean, max, etc.

Now, group the teams by their continent.

In [29]:
teams_by_continent = teams_new.groupby('continent')
teams_by_continent.first()
#Compute the first non-null entry of each column.
Out[29]:
team ranking games wins draws losses goalsFor goalsAgainst yellowCards redCards
continent
Africa Cameroon 19 3 0 0 3 2 5 5 0
Asia Japan 45 4 2 1 1 4 2 7 0
Australia Australia 20 3 1 1 1 3 6 7 2
Europe Spain 2 6 5 0 1 7 2 3 0
North America USA 14 4 1 2 1 5 5 9 0
South America Brazil 1 5 3 1 1 9 4 7 2

The first method displays the first row of each group. To access the elements of a subgroup, use get_group('group name').

In [30]:
teams_by_continent.get_group(('Africa')) #the Africa countries in WC 2010
Out[30]:
team ranking continent games wins draws losses goalsFor goalsAgainst yellowCards redCards
15 Cameroon 19 Africa 3 0 0 3 2 5 5 0
17 Nigeria 21 Africa 3 0 1 2 3 5 5 1
20 Ivory Coast 27 Africa 3 1 1 1 4 3 5 0
21 Algeria 30 Africa 3 0 1 2 0 2 4 2
23 Ghana 32 Africa 5 2 2 1 5 4 11 0
30 South Africa 83 Africa 3 1 1 1 3 5 4 1

You can group your data by more than one variable by passing a list to the groupby function.

In [31]:
teams_by_continent_and_win = teams_new.groupby(['continent','wins'])
teams_new.groupby(['continent','wins']).get_group(('Africa',0)) #the most unsuccessful African teams in WC 2010
Out[31]:
team ranking continent games wins draws losses goalsFor goalsAgainst yellowCards redCards
15 Cameroon 19 Africa 3 0 0 3 2 5 5 0
17 Nigeria 21 Africa 3 0 1 2 3 5 5 1
21 Algeria 30 Africa 3 0 1 2 0 2 4 2
  • aggregate

The aggregate method is used to apply a specific aggregation function to each column of a DataFrame. Well-known aggregate functions include min, max, sum, mean, describe, and more.

[Figure: common aggregate functions. Source: Spark by Examples]

In [32]:
teams_new.describe() #shows several descriptive statistics for the numerical variables
Out[32]:
ranking games wins draws losses goalsFor goalsAgainst yellowCards redCards
count 32.000000 32.000000 32.000000 32.000000 32.000000 32.00000 32.000000 32.000000 32.000000
mean 26.031250 3.875000 1.437500 1.000000 1.437500 4.34375 4.343750 7.156250 0.500000
std 24.233387 1.070122 1.522678 0.879883 0.715609 3.28839 2.208625 2.760427 0.718421
min 1.000000 3.000000 0.000000 0.000000 0.000000 0.00000 1.000000 2.000000 0.000000
25% 8.750000 3.000000 0.750000 0.000000 1.000000 2.00000 3.000000 5.000000 0.000000
50% 19.500000 3.500000 1.000000 1.000000 1.000000 3.00000 5.000000 7.000000 0.000000
75% 32.500000 4.250000 2.000000 1.250000 2.000000 5.25000 5.000000 8.250000 1.000000
max 105.000000 6.000000 6.000000 3.000000 3.000000 13.00000 12.000000 15.000000 2.000000
In [33]:
teams_new.describe().transpose() #by transposing, you can have a better view
Out[33]:
count mean std min 25% 50% 75% max
ranking 32.0 26.03125 24.233387 1.0 8.75 19.5 32.50 105.0
games 32.0 3.87500 1.070122 3.0 3.00 3.5 4.25 6.0
wins 32.0 1.43750 1.522678 0.0 0.75 1.0 2.00 6.0
draws 32.0 1.00000 0.879883 0.0 0.00 1.0 1.25 3.0
losses 32.0 1.43750 0.715609 0.0 1.00 1.0 2.00 3.0
goalsFor 32.0 4.34375 3.288390 0.0 2.00 3.0 5.25 13.0
goalsAgainst 32.0 4.34375 2.208625 1.0 3.00 5.0 5.00 12.0
yellowCards 32.0 7.15625 2.760427 2.0 5.00 7.0 8.25 15.0
redCards 32.0 0.50000 0.718421 0.0 0.00 0.0 1.00 2.0

What is missing in the output above?

Categorical variables, i.e. object-type variables.

In [34]:
teams_new.describe(include=['O'])
Out[34]:
team continent
count 32 32
unique 32 6
top Brazil Europe
freq 1 13

If you would like to apply a function to a specific column, select it first. The examples below use the .values attribute (the underlying array) and the value_counts() method. For example,

In [35]:
teams_new['wins'].values.mean() #the average wins 
Out[35]:
1.4375
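In fact, .values is not required here; calling the aggregation directly on the selected column gives the same answer:

teams_new['wins'].mean()  # 1.4375, the same result without .values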
In [36]:
teams_new['continent'].value_counts() #frequency distribution of continent
Out[36]:
Europe           13
South America     7
Africa            6
Asia              3
Australia         2
North America     1
Name: continent, dtype: int64
In [37]:
teams_by_continent.aggregate('mean') #calculate the average of each variable by continent. 
Out[37]:
ranking games wins draws losses goalsFor goalsAgainst yellowCards redCards
continent
Africa 35.333333 3.333333 0.666667 1.000000 1.666667 2.833333 4.000000 5.666667 0.666667
Asia 65.666667 3.666667 1.000000 0.666667 2.000000 3.666667 7.333333 5.000000 0.000000
Australia 49.000000 3.000000 0.500000 2.000000 0.500000 2.500000 4.000000 6.500000 1.000000
Europe 14.153846 3.923077 1.769231 0.769231 1.384615 4.846154 3.846154 7.538462 0.384615
North America 14.000000 4.000000 1.000000 2.000000 1.000000 5.000000 5.000000 9.000000 0.000000
South America 18.285714 4.571429 2.000000 1.142857 1.428571 5.428571 4.285714 8.571429 0.714286
In [38]:
teams_by_continent.aggregate('count') #shows the frequency of each column by continent. 
Out[38]:
team ranking games wins draws losses goalsFor goalsAgainst yellowCards redCards
continent
Africa 6 6 6 6 6 6 6 6 6 6
Asia 3 3 3 3 3 3 3 3 3 3
Australia 2 2 2 2 2 2 2 2 2 2
Europe 13 13 13 13 13 13 13 13 13 13
North America 1 1 1 1 1 1 1 1 1 1
South America 7 7 7 7 7 7 7 7 7 7

You can also specify a different aggregation function for each column by passing a dictionary to the aggregate method, where the keys are the column names and the values are the aggregation functions. For example:

In [39]:
teams_by_continent.aggregate({
    "wins": "mean",
    "goalsFor": "max",
    "goalsAgainst": "min"
})
Out[39]:
wins goalsFor goalsAgainst
continent
Africa 0.666667 5 2
Asia 1.000000 6 2
Australia 0.500000 3 2
Europe 1.769231 13 1
North America 1.000000 5 5
South America 2.000000 10 2

Example 3

Please read the Players.csv data and answer the following questions.

In [40]:
players  = pd.read_csv('Players.csv')
players.head()
Out[40]:
surname team position minutes shots passes tackles saves
0 Abdoun Algeria midfielder 16 0 6 0 0
1 Belhadj Algeria defender 270 1 146 8 0
2 Boudebouz Algeria midfielder 74 3 28 1 0
3 Bougherra Algeria defender 270 1 89 11 0
4 Chaouchi Algeria goalkeeper 90 0 17 0 2

a) Show the first 5 observations of the data.

In [ ]:
 

b) How many players are there in the data?

In [ ]:
 

c) What is the distribution of positions of the players? (i.e display the frequency of each position)

In [ ]:
 

d) What is the average number of passes by the players in WC 2010?

In [ ]:
 

e) What is the average number of minutes played by position in WC 2010?

In [ ]:
 

f) State the surname of the player who has the longest game time.

In [ ]:
 

g) State the surname of the goalkeeper with the most saves.

In [ ]:
 

Missing Data

Missing data occurs when no information is provided. Pandas provides tools that make dealing with missing data easier.

In pandas, the missing observations have two representations.

  • None: None is a Python singleton object that is often used for missing data in Python code.

  • NaN: NaN (an acronym for Not a Number) is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.

We can use the following functions for missing-value analysis, i.e. we can detect and replace the missing values with these functions:

df.isnull()

df.notnull()

df.dropna()

df.fillna()

df.replace()

Consider the following example:

In [41]:
import pandas as pd
In [42]:
missing_data = pd.read_csv('https://metustat112.github.io/stolen_cars.csv',sep = ";")
print(missing_data)
                 Car   Number
0  Chevrolet Pick-Up  48206.0
1       Ford Pick-Up  47999.0
2        Honda Civic  31673.0
3       Honda Accord  30274.0
4       Toyota Camry  17270.0
5        GMC Pick-Up      NaN
6      Nissan Altima  14108.0
7         Honda CR-V      NaN

isnull and notnull serve the same purpose but return opposite outputs. While isnull returns True for missing values, notnull returns True for available values.

In [43]:
missing_data.isnull()
Out[43]:
Car Number
0 False False
1 False False
2 False False
3 False False
4 False False
5 False True
6 False False
7 False True
In [44]:
missing_data.notnull()
Out[44]:
Car Number
0 True True
1 True True
2 True True
3 True True
4 True True
5 True False
6 True True
7 True False

Instead of a matrix of boolean values, we can directly compute the total number of missing values in each column of the data, as given below.

In [45]:
missing_data.isnull().sum()
Out[45]:
Car       0
Number    2
dtype: int64

As seen above, we have 2 missing values in the Number column. Beyond the per-column counts, if we would like to compute the total number of missing observations across the whole data frame,

In [46]:
missing_data.isnull().sum().sum() #total number of missing observations 
Out[46]:
2

Dealing with Missing Data

The missing data problem is one of the hardest problems in data science. There are several approaches, ranging from simple to complex.

Before remedying the missing values in your data, you should figure out the source and the pattern of the missingness. After that, you can decide on your strategy.

The simplest but least preferable solution is dropping the missing observations.

You can remove the NA values with dropna.

In [47]:
missing_data.dropna(axis = 0) #axis = 0 removes the rows with na 
#axis = 1 drops the columns with na
Out[47]:
Car Number
0 Chevrolet Pick-Up 48206.0
1 Ford Pick-Up 47999.0
2 Honda Civic 31673.0
3 Honda Accord 30274.0
4 Toyota Camry 17270.0
6 Nissan Altima 14108.0

Imputing the missing values is certainly a better approach. You can fill in your missing observations in different ways; however, we will cover only the simplest ones within the context of this course.

fillna() fills all NA values with the same value.

In [48]:
missing_data.fillna(0)
Out[48]:
Car Number
0 Chevrolet Pick-Up 48206.0
1 Ford Pick-Up 47999.0
2 Honda Civic 31673.0
3 Honda Accord 30274.0
4 Toyota Camry 17270.0
5 GMC Pick-Up 0.0
6 Nissan Altima 14108.0
7 Honda CR-V 0.0

You can also use fillna to fill the missing values with descriptive statistics.

In [49]:
mean = missing_data['Number'].mean()
median = missing_data['Number'].median()
mode = missing_data['Number'].mode()
print('Mean:',mean,'Median:',median,'Mode:',mode)
Mean: 31588.333333333332 Median: 30973.5 Mode: 0    14108.0
1    17270.0
2    30274.0
3    31673.0
4    47999.0
5    48206.0
Name: Number, dtype: float64
In [50]:
missing_data['Number'].fillna(mean) #mean imputation 
Out[50]:
0    48206.000000
1    47999.000000
2    31673.000000
3    30274.000000
4    17270.000000
5    31588.333333
6    14108.000000
7    31588.333333
Name: Number, dtype: float64
In [51]:
missing_data['Number'].fillna(median) #median imputation
Out[51]:
0    48206.0
1    47999.0
2    31673.0
3    30274.0
4    17270.0
5    30973.5
6    14108.0
7    30973.5
Name: Number, dtype: float64
In [52]:
missing_data['Number'].fillna(mode) #mode imputation
Out[52]:
0    48206.0
1    47999.0
2    31673.0
3    30274.0
4    17270.0
5    48206.0
6    14108.0
7        NaN
Name: Number, dtype: float64
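The odd output above happens because mode() returns a Series (here containing every observed value, since none repeats), and fillna aligns a Series argument by index label rather than broadcasting a single value. To impute with one modal value, take the first element explicitly:

missing_data['Number'].fillna(mode[0])  # mode imputation: both NaNs become 14108.0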

replace substitutes a new value wherever the specified value occurs.

In [53]:
missing_data.replace(to_replace=np.nan,value=0) #replaces all missing values
Out[53]:
Car Number
0 Chevrolet Pick-Up 48206.0
1 Ford Pick-Up 47999.0
2 Honda Civic 31673.0
3 Honda Accord 30274.0
4 Toyota Camry 17270.0
5 GMC Pick-Up 0.0
6 Nissan Altima 14108.0
7 Honda CR-V 0.0

Example 4

Import the tao.csv data.

In [54]:
tao = pd.read_csv('tao.csv')
tao.head()
tao_new = tao.drop(['Unnamed: 0'],axis = 1)
tao_new.head()
Out[54]:
Year Latitude Longitude Sea.Surface.Temp Air.Temp Humidity UWind VWind
0 1997 0 -110 27.590000 27.15 79.599998 -6.4 5.4
1 1997 0 -110 27.549999 27.02 75.800003 -5.3 5.3
2 1997 0 -110 27.570000 27.00 76.500000 -5.1 4.5
3 1997 0 -110 27.620001 26.93 76.199997 -4.9 2.5
4 1997 0 -110 27.650000 26.84 76.400002 -3.5 4.1

a) Calculate the total number of missing observations in the data.

In [ ]:
 

b) Which column has the highest number of missing observations in the data?

In [ ]:
 

c) Fill the missing values with the mean of the corresponding columns.

In [ ]:
 

Suggested Sources

  • http://pandas.pydata.org/pandas-docs/stable/

  • Data Manipulation with Pandas

  • Pandas for Everyone

  • Data Analysis with Pandas

