Ozancan Ozdemir - ozancan@metu.edu.tr
This introduction was created by ChatGPT.
Pandas is a software library written for the Python programming language for data manipulation and analysis. It is often used for working with data in tabular form, such as data stored in a spreadsheet or a database table.
Pandas provides a powerful set of tools for working with and analyzing data, including data frames, which are similar to tables in a relational database, and tools for working with time series data.
To install pandas on Anaconda, you should run the following command on your Anaconda terminal.
conda install pandas
However, Pandas comes pre-installed in Colaboratory, so no installation step is needed there. :)
Once Pandas is installed, you can import it into your Python script using the following code:
import pandas as pd #import the package
The pd alias will be used to reference Pandas throughout your code.
In Pandas, the three main data structures for storing and working with data are the Series, the DataFrame and the Panel.
A Series is a one-dimensional array-like object that can hold any data type. It has a labeled index, which is used to access the elements of the Series.
A DataFrame is a 2-dimensional tabular data structure that can hold any data type. It is similar to a spreadsheet or a SQL table. A DataFrame has a labeled index and labeled columns, which allow you to access the data in the table using the labels rather than numeric indices.
A Panel was a 3-dimensional data container, but it has been removed from recent versions of Pandas. In any case, we mostly use Pandas with the DataFrame, at least in this course.
You can think of a DataFrame as a collection of Series objects that share the same index. This makes it easy to perform operations on data in the table, such as selecting specific rows or columns, or applying calculations to entire columns of data.
Source: Data Science Made Simple
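The idea above can be seen in a minimal sketch (the toy values here are made up for illustration): a Series carries a labeled index, and each column of a DataFrame is itself a Series that shares the frame's index.

```python
import pandas as pd

# A Series with a custom labeled index
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s['b'])  # access an element by its label

# Each column of a DataFrame is itself a Series sharing the frame's index
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
col = df['x']
print(type(col))  # a pandas Series
```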
You can create a Series or DataFrame in Pandas by passing a list or array of data to the pandas.Series() or pandas.DataFrame() constructor, respectively. You can also import data from a file or other data source into a DataFrame using the pandas.read_*() functions, where the * is replaced with the file format (e.g. csv or excel).
# create a Series
s = pd.Series([1,9,5,6])
print(s)
0    1
1    9
2    5
3    6
dtype: int64
# create a DataFrame
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(df)
   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9
Although the created objects above include only integers, there are several data types that can be used in a Pandas DataFrame, including int64 (integers), float64 (floating-point numbers), object (text or mixed values), bool (True/False values), datetime64 (dates and times), and category (a fixed set of possible values).
These are just some of the data types that can be used in Pandas. You can use the DataFrame.dtypes property to view the data type for each column in a DataFrame.
df.dtypes
0    int64
1    int64
2    int64
dtype: object
In Pandas, there are two main ways to create a data frame from scratch: from a list or from a dictionary. Creating a data frame from a list has already been shown above. Now, we will look at how to use dictionaries to create a data frame in Pandas.
Note that dictionaries provide an advantage when it comes to naming the columns of your data frame.
When you use a list to create a data frame, you have to define the column names afterwards. With a dictionary, however, you can define the column names and the values at the same time.
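The difference can be sketched with a toy example (the column names and values here are made up): with a list of rows, the column names are supplied separately via the columns argument, while a dictionary carries them as its keys.

```python
import pandas as pd

# From a list of rows: column names must be supplied separately
df_list = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])

# From a dictionary: the keys become the column names directly
df_dict = pd.DataFrame({'a': [1, 3], 'b': [2, 4]})

print(df_list.equals(df_dict))  # both ways produce the same frame
```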
Reminder
A typical dictionary object in Python has the following structure.
dict_name = {'key' : value}
or
dict_name = { 'key' : [value1,value2] }
Keys and values can be either integers or characters.
# Dictionary with integer keys
my_dict = {1: 'A', 2: 'B'}
print(my_dict)
# Dictionary with string keys
my_dict = {'name': 'ozan', 'age': 28}
print(my_dict)
# Dictionary with mixed keys
my_dict = {'name': 'ozan', 1: ['A', 'B']}
print(my_dict)
{1: 'A', 2: 'B'}
{'name': 'ozan', 'age': 28}
{'name': 'ozan', 1: ['A', 'B']}
Now, we will create the following table using dictionaries. In doing so, we use the pd.DataFrame() constructor.
table = pd.DataFrame({'City' : ['Tokyo','Delhi','Shangai','Dhaka','Sao Paulo','Mexico City','Cairo','Beijing','Mumbai','Osaka'],
'Country': ['Japan','India','China','Bangladesh','Brazil','Mexico','Egypt','China','India','Japan'],
'2022 Population':[37274000,32065760,28516904,22478116,22429800,22085140,21750020,21333332,20961472,19059856],
'2021 Population': [37339804,31181376,27795702,21741090,22237472,21918936,21322750,20896820,20667656,19110616],
'Growth': [-0.18,2.84,2.59,3.39,0.86,0.76,2,2.09,1.42,-0.27]})
print(table)
          City     Country  2022 Population  2021 Population  Growth
0        Tokyo       Japan         37274000         37339804   -0.18
1        Delhi       India         32065760         31181376    2.84
2      Shangai       China         28516904         27795702    2.59
3        Dhaka  Bangladesh         22478116         21741090    3.39
4    Sao Paulo      Brazil         22429800         22237472    0.86
5  Mexico City      Mexico         22085140         21918936    0.76
6        Cairo       Egypt         21750020         21322750    2.00
7      Beijing       China         21333332         20896820    2.09
8       Mumbai       India         20961472         20667656    1.42
9        Osaka       Japan         19059856         19110616   -0.27
Please create the following DataFrame and name it ex1.
After creating the data frame, we will introduce some Pandas attributes on data frame.
table.dtypes #lists the data type of each column
City                object
Country             object
2022 Population      int64
2021 Population      int64
Growth             float64
dtype: object
table.shape #shows the number of rows and columns of your data frame
(10, 5)
Displaying the general information about the data frame: You can print the information above together using the df.info() method, where df is the name of the data frame.
table.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   City             10 non-null     object
 1   Country          10 non-null     object
 2   2022 Population  10 non-null     int64
 3   2021 Population  10 non-null     int64
 4   Growth           10 non-null     float64
dtypes: float64(1), int64(2), object(2)
memory usage: 528.0+ bytes
When you work with a longer data frame, it is not necessary to view all observations. The df.head() function shows the first 5 observations in your data, while df.tail() shows the last 5.
table.head() #shows first 5 observations
City | Country | 2022 Population | 2021 Population | Growth | |
---|---|---|---|---|---|
0 | Tokyo | Japan | 37274000 | 37339804 | -0.18 |
1 | Delhi | India | 32065760 | 31181376 | 2.84 |
2 | Shangai | China | 28516904 | 27795702 | 2.59 |
3 | Dhaka | Bangladesh | 22478116 | 21741090 | 3.39 |
4 | Sao Paulo | Brazil | 22429800 | 22237472 | 0.86 |
table.head(3) #shows first three observations
City | Country | 2022 Population | 2021 Population | Growth | |
---|---|---|---|---|---|
0 | Tokyo | Japan | 37274000 | 37339804 | -0.18 |
1 | Delhi | India | 32065760 | 31181376 | 2.84 |
2 | Shangai | China | 28516904 | 27795702 | 2.59 |
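tail() works the same way from the other end; here is a small sketch on a toy frame (the course's table behaves identically):

```python
import pandas as pd

# Toy frame with ten rows, just to illustrate tail()
df = pd.DataFrame({'n': range(10)})
print(df.tail())   # last 5 observations
print(df.tail(2))  # last 2 observations
```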
Adding a new column: You can add a new column to your data frame easily. Let's add the continent of the countries. But first, create your continent object.
import numpy as np
continent = np.repeat(np.array(['Asia','S.America','Africa','Asia']),[4,2,1,3],axis = 0)
continent
array(['Asia', 'Asia', 'Asia', 'Asia', 'S.America', 'S.America', 'Africa', 'Asia', 'Asia', 'Asia'], dtype='<U9')
table['continent'] = continent
#df_name['new column name'] = new series
table.head()
City | Country | 2022 Population | 2021 Population | Growth | continent | |
---|---|---|---|---|---|---|
0 | Tokyo | Japan | 37274000 | 37339804 | -0.18 | Asia |
1 | Delhi | India | 32065760 | 31181376 | 2.84 | Asia |
2 | Shangai | China | 28516904 | 27795702 | 2.59 | Asia |
3 | Dhaka | Bangladesh | 22478116 | 21741090 | 3.39 | Asia |
4 | Sao Paulo | Brazil | 22429800 | 22237472 | 0.86 | S.America |
table.columns # list the columns in the data
Index(['City', 'Country', '2022 Population', '2021 Population', 'Growth', 'continent'], dtype='object')
Subsetting a Data Frame: Pandas provides several ways to access a subset of the DataFrame. In other words, you can pull out specific rows, columns, slices or cells of the DataFrame object in different ways.
table['City']
#df_name['Column name']
0          Tokyo
1          Delhi
2        Shangai
3          Dhaka
4      Sao Paulo
5    Mexico City
6          Cairo
7        Beijing
8         Mumbai
9          Osaka
Name: City, dtype: object
table[['City','Country']]
#for multiple extraction, create a list object.
City | Country | |
---|---|---|
0 | Tokyo | Japan |
1 | Delhi | India |
2 | Shangai | China |
3 | Dhaka | Bangladesh |
4 | Sao Paulo | Brazil |
5 | Mexico City | Mexico |
6 | Cairo | Egypt |
7 | Beijing | China |
8 | Mumbai | India |
9 | Osaka | Japan |
You can access your data using loc[] and iloc[] in Pandas.
loc stands for "location" and is used to select data by the labels of the rows and columns. For example, to select the row labeled 0 in a DataFrame, you would use df.loc[0].
iloc stands for "integer location" and is used to select data by the position of the rows and columns. So, to select the first row of a DataFrame using iloc, you would use df.iloc[0].
One key difference between the two is that loc uses the labels of the rows and columns to identify which data to select, while iloc uses the integer positions of the rows and columns.
This means that if you change the row labels or column names, loc will use the new labels, while iloc will keep using the integer positions of the rows and columns regardless of their labels.
Indexing starts from 0 in Pandas, as in Numpy.
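The loc/iloc difference is easiest to see with a non-integer index; the city names and values below are a made-up toy example:

```python
import pandas as pd

# Toy frame with string row labels instead of the default 0, 1, 2, ...
df = pd.DataFrame({'pop': [37, 32, 28]},
                  index=['Tokyo', 'Delhi', 'Shangai'])

print(df.loc['Tokyo'])  # select by label
print(df.iloc[0])       # select by position (the same row here)
```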
table.loc[0] #first row of the data
City                  Tokyo
Country               Japan
2022 Population    37274000
2021 Population    37339804
Growth                -0.18
continent              Asia
Name: 0, dtype: object
Since this data has no custom row labels, the default integer labels are used in loc.
table.iloc[[0,3,8]] #1st, 4th, 9th rows of the data
City | Country | 2022 Population | 2021 Population | Growth | continent | |
---|---|---|---|---|---|---|
0 | Tokyo | Japan | 37274000 | 37339804 | -0.18 | Asia |
3 | Dhaka | Bangladesh | 22478116 | 21741090 | 3.39 | Asia |
8 | Mumbai | India | 20961472 | 20667656 | 1.42 | Asia |
table['City'].iloc[2] #the name of the third city
'Shangai'
You can use boolean indexing for slicing a DataFrame in Pandas. Remember that boolean indexing is useful when you access your data based on a condition.
For example, we would like to list the information of the cities whose current population exceeds 24000000.
table[table['2022 Population']>24000000]
City | Country | 2022 Population | 2021 Population | Growth | continent | |
---|---|---|---|---|---|---|
0 | Tokyo | Japan | 37274000 | 37339804 | -0.18 | Asia |
1 | Delhi | India | 32065760 | 31181376 | 2.84 | Asia |
2 | Shangai | China | 28516904 | 27795702 | 2.59 | Asia |
#To list only the city names.
table['City'].loc[table['2022 Population']>24000000]
0      Tokyo
1      Delhi
2    Shangai
Name: City, dtype: object
If you try this with iloc, you will get an error.
table['City'].iloc[table['2022 Population']>24000000]
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
<ipython-input-39-b13d5c365ef6> in <module>
----> 1 table['City'].iloc[table['2022 Population']>24000000]
2 frames
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py in _validate_key(self, key, axis)
1393 if hasattr(key, "index") and isinstance(key.index, Index):
1394 if key.index.inferred_type == "integer":
-> 1395 raise NotImplementedError(
1396 "iLocation based boolean "
1397 "indexing on an integer type "
NotImplementedError: iLocation based boolean indexing on an integer type is not available
Renaming a column
You can change either specific column name or all column names in a DataFrame object.
df.rename({'old name': 'new name'},axis = "columns")
You can change the row labels with rename if axis = "rows".
To change all the column names simultaneously, assign a list to the columns attribute:
df.columns = ['c1','c2','c3']
table.rename({'continent': 'Continent'},axis = "columns") #returns a renamed copy; table itself is unchanged
City | Country | 2022 Population | 2021 Population | Growth | Continent | |
---|---|---|---|---|---|---|
0 | Tokyo | Japan | 37274000 | 37339804 | -0.18 | Asia |
1 | Delhi | India | 32065760 | 31181376 | 2.84 | Asia |
2 | Shangai | China | 28516904 | 27795702 | 2.59 | Asia |
3 | Dhaka | Bangladesh | 22478116 | 21741090 | 3.39 | Asia |
4 | Sao Paulo | Brazil | 22429800 | 22237472 | 0.86 | S.America |
5 | Mexico City | Mexico | 22085140 | 21918936 | 0.76 | S.America |
6 | Cairo | Egypt | 21750020 | 21322750 | 2.00 | Africa |
7 | Beijing | China | 21333332 | 20896820 | 2.09 | Asia |
8 | Mumbai | India | 20961472 | 20667656 | 1.42 | Asia |
9 | Osaka | Japan | 19059856 | 19110616 | -0.27 | Asia |
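Note that rename() returns a new DataFrame rather than modifying the original in place; a minimal sketch with a made-up one-row frame:

```python
import pandas as pd

# Toy frame; the notebook's `table` behaves the same way
df = pd.DataFrame({'continent': ['Asia'], 'city': ['Tokyo']})

# rename() returns a new DataFrame and leaves the original unchanged
renamed = df.rename({'continent': 'Continent'}, axis='columns')
print('continent' in df.columns)       # the original is untouched
print('Continent' in renamed.columns)  # the copy has the new name

# To keep the change, assign the result back (or pass inplace=True)
df = df.rename({'continent': 'Continent'}, axis='columns')
```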
Consider ex1 DataFrame that you created in Exercise 1.
Answer the following questions based on this file.
a) Show all the starving-poverty information for the first 6 months of the year.
b) Display the starving line only.
c) Display the data where the poverty line is below 20000.
d) List the names of the months where the starving line exceeds 7000.
You can import data from several sources into your environment by using Pandas.
You can use the pandas.read_*() functions, where the * is replaced with the file format (e.g. csv or excel).
Before proceeding, note the easiest way of importing your data into Colaboratory: just drag and drop.
Note that the data that you dragged and dropped will be deleted when you close the notebook in Colab.
Now, consider teams.csv
, teams.xlsx
and https://metustat112.github.io/teams.csv.
Way 1
teams = pd.read_csv("teams.csv")
#teams = pd.read_csv("Teams.csv",sep = ";")
#dataname = pd.read_csv("filename.csv",sep = ";")
#sep argument specifies a custom delimiter for the CSV input, where the default is comma.
teams.head()
Unnamed: 0 | team | ranking | continent | games | wins | draws | losses | goalsFor | goalsAgainst | yellowCards | redCards | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Brazil | 1 | South America | 5 | 3 | 1 | 1 | 9 | 4 | 7 | 2 |
1 | 1 | Spain | 2 | Europe | 6 | 5 | 0 | 1 | 7 | 2 | 3 | 0 |
2 | 2 | Portugal | 3 | Europe | 4 | 1 | 2 | 1 | 7 | 1 | 8 | 1 |
3 | 3 | Netherlands | 4 | Europe | 6 | 6 | 0 | 0 | 12 | 5 | 15 | 0 |
4 | 4 | Italy | 5 | Europe | 3 | 0 | 2 | 1 | 4 | 5 | 5 | 0 |
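The effect of the sep argument can be sketched without a file on disk, using an in-memory text buffer (the column names and values below are made up):

```python
import pandas as pd
from io import StringIO

# Toy semicolon-separated text standing in for a CSV file
text = "a;b\n1;2\n3;4"
df_sep = pd.read_csv(StringIO(text), sep=';')
print(df_sep)  # two columns a and b, parsed on the semicolons
```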
Way 2
teams_excel = pd.read_excel('teams.xlsx')
teams_excel.head()
team | ranking | continent | games | wins | draws | losses | goalsFor | goalsAgainst | yellowCards | redCards | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Brazil | 1 | South America | 5 | 3 | 1 | 1 | 9 | 4 | 7 | 2 |
1 | Spain | 2 | Europe | 6 | 5 | 0 | 1 | 7 | 2 | 3 | 0 |
2 | Portugal | 3 | Europe | 4 | 1 | 2 | 1 | 7 | 1 | 8 | 1 |
3 | Netherlands | 4 | Europe | 6 | 6 | 0 | 0 | 12 | 5 | 15 | 0 |
4 | Italy | 5 | Europe | 3 | 0 | 2 | 1 | 4 | 5 | 5 | 0 |
Way 3
teams_url = pd.read_csv('https://metustat112.github.io/teams.csv')
teams_url.head()
Unnamed: 0 | team | ranking | continent | games | wins | draws | losses | goalsFor | goalsAgainst | yellowCards | redCards | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Brazil | 1 | South America | 5 | 3 | 1 | 1 | 9 | 4 | 7 | 2 |
1 | 1 | Spain | 2 | Europe | 6 | 5 | 0 | 1 | 7 | 2 | 3 | 0 |
2 | 2 | Portugal | 3 | Europe | 4 | 1 | 2 | 1 | 7 | 1 | 8 | 1 |
3 | 3 | Netherlands | 4 | Europe | 6 | 6 | 0 | 0 | 12 | 5 | 15 | 0 |
4 | 4 | Italy | 5 | Europe | 3 | 0 | 2 | 1 | 4 | 5 | 5 | 0 |
Continue with teams data.
teams.head() #shows the first five observations
#dataname.function()
Unnamed: 0 | team | ranking | continent | games | wins | draws | losses | goalsFor | goalsAgainst | yellowCards | redCards | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Brazil | 1 | South America | 5 | 3 | 1 | 1 | 9 | 4 | 7 | 2 |
1 | 1 | Spain | 2 | Europe | 6 | 5 | 0 | 1 | 7 | 2 | 3 | 0 |
2 | 2 | Portugal | 3 | Europe | 4 | 1 | 2 | 1 | 7 | 1 | 8 | 1 |
3 | 3 | Netherlands | 4 | Europe | 6 | 6 | 0 | 0 | 12 | 5 | 15 | 0 |
4 | 4 | Italy | 5 | Europe | 3 | 0 | 2 | 1 | 4 | 5 | 5 | 0 |
On the left hand side of the output, the sequence starting from 0 is the index of the data frame, created by Pandas automatically. In addition to that, the first column of the data is an Unnamed: 0 column produced when the file was exported together with its index.
We should remove it to work efficiently and save memory. There are various ways to get rid of this column. Here is one solution, the drop() function. It is used to drop specified labels from rows or columns.
teams_new = teams.drop("Unnamed: 0", axis=1) #axis = 1 is for column. axis = 0 is for row removing.
teams_new.head()
team | ranking | continent | games | wins | draws | losses | goalsFor | goalsAgainst | yellowCards | redCards | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Brazil | 1 | South America | 5 | 3 | 1 | 1 | 9 | 4 | 7 | 2 |
1 | Spain | 2 | Europe | 6 | 5 | 0 | 1 | 7 | 2 | 3 | 0 |
2 | Portugal | 3 | Europe | 4 | 1 | 2 | 1 | 7 | 1 | 8 | 1 |
3 | Netherlands | 4 | Europe | 6 | 6 | 0 | 0 | 12 | 5 | 15 | 0 |
4 | Italy | 5 | Europe | 3 | 0 | 2 | 1 | 4 | 5 | 5 | 0 |
After dropping the unnecessary column, export your data set.
You can use the df.to_*('filename.*') functions, where the * is replaced with the file format (e.g. csv or excel), filename is the name of the exported file, and df corresponds to the data name.
teams_new.to_csv('teamsnew.csv')
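A side note worth sketching: by default to_csv() also writes the index, which is exactly what produces an "Unnamed: 0" column on re-import; passing index=False avoids this. With no path argument, to_csv() returns the CSV text instead of writing a file, which makes the behavior easy to inspect (toy data below):

```python
import pandas as pd

df = pd.DataFrame({'team': ['Brazil', 'Spain'], 'ranking': [1, 2]})

# No path: to_csv() returns the CSV text; index=False drops the index
csv_text = df.to_csv(index=False)
print(csv_text)
```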
Pandas Properties
Here, we will present some Pandas functions that are useful in data manipulation.
groupby
This function groups the data with respect to the levels of a variable. It is generally used together with aggregate functions like min, median, mean, max, etc.
Now, group the teams by their continent.
teams_by_continent = teams_new.groupby('continent')
teams_by_continent.first()
#Compute the first non-null entry of each column.
team | ranking | games | wins | draws | losses | goalsFor | goalsAgainst | yellowCards | redCards | |
---|---|---|---|---|---|---|---|---|---|---|
continent | ||||||||||
Africa | Cameroon | 19 | 3 | 0 | 0 | 3 | 2 | 5 | 5 | 0 |
Asia | Japan | 45 | 4 | 2 | 1 | 1 | 4 | 2 | 7 | 0 |
Australia | Australia | 20 | 3 | 1 | 1 | 1 | 3 | 6 | 7 | 2 |
Europe | Spain | 2 | 6 | 5 | 0 | 1 | 7 | 2 | 3 | 0 |
North America | USA | 14 | 4 | 1 | 2 | 1 | 5 | 5 | 9 | 0 |
South America | Brazil | 1 | 5 | 3 | 1 | 1 | 9 | 4 | 7 | 2 |
The first() function displays the first non-null entry in each group. To access the elements of a subgroup, use get_group('group name').
teams_by_continent.get_group('Africa') #the African countries in WC 2010
team | ranking | continent | games | wins | draws | losses | goalsFor | goalsAgainst | yellowCards | redCards | |
---|---|---|---|---|---|---|---|---|---|---|---|
15 | Cameroon | 19 | Africa | 3 | 0 | 0 | 3 | 2 | 5 | 5 | 0 |
17 | Nigeria | 21 | Africa | 3 | 0 | 1 | 2 | 3 | 5 | 5 | 1 |
20 | Ivory Coast | 27 | Africa | 3 | 1 | 1 | 1 | 4 | 3 | 5 | 0 |
21 | Algeria | 30 | Africa | 3 | 0 | 1 | 2 | 0 | 2 | 4 | 2 |
23 | Ghana | 32 | Africa | 5 | 2 | 2 | 1 | 5 | 4 | 11 | 0 |
30 | South Africa | 83 | Africa | 3 | 1 | 1 | 1 | 3 | 5 | 4 | 1 |
You can group your data by more than one variable using a list object in the groupby function.
teams_by_continent_and_win = teams_new.groupby(['continent','wins'])
teams_new.groupby(['continent','wins']).get_group(('Africa',0)) #the most unsuccessful African teams in WC 2010
team | ranking | continent | games | wins | draws | losses | goalsFor | goalsAgainst | yellowCards | redCards | |
---|---|---|---|---|---|---|---|---|---|---|---|
15 | Cameroon | 19 | Africa | 3 | 0 | 0 | 3 | 2 | 5 | 5 | 0 |
17 | Nigeria | 21 | Africa | 3 | 0 | 1 | 2 | 3 | 5 | 5 | 1 |
21 | Algeria | 30 | Africa | 3 | 0 | 1 | 2 | 0 | 2 | 4 | 2 |
The aggregate method is used to apply a specific aggregation function to each column of a DataFrame. The common aggregate functions are min, max, sum, mean, describe and more.
teams_new.describe() #shows the several descriptive statistics of the numerical variables.
ranking | games | wins | draws | losses | goalsFor | goalsAgainst | yellowCards | redCards | |
---|---|---|---|---|---|---|---|---|---|
count | 32.000000 | 32.000000 | 32.000000 | 32.000000 | 32.000000 | 32.00000 | 32.000000 | 32.000000 | 32.000000 |
mean | 26.031250 | 3.875000 | 1.437500 | 1.000000 | 1.437500 | 4.34375 | 4.343750 | 7.156250 | 0.500000 |
std | 24.233387 | 1.070122 | 1.522678 | 0.879883 | 0.715609 | 3.28839 | 2.208625 | 2.760427 | 0.718421 |
min | 1.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 1.000000 | 2.000000 | 0.000000 |
25% | 8.750000 | 3.000000 | 0.750000 | 0.000000 | 1.000000 | 2.00000 | 3.000000 | 5.000000 | 0.000000 |
50% | 19.500000 | 3.500000 | 1.000000 | 1.000000 | 1.000000 | 3.00000 | 5.000000 | 7.000000 | 0.000000 |
75% | 32.500000 | 4.250000 | 2.000000 | 1.250000 | 2.000000 | 5.25000 | 5.000000 | 8.250000 | 1.000000 |
max | 105.000000 | 6.000000 | 6.000000 | 3.000000 | 3.000000 | 13.00000 | 12.000000 | 15.000000 | 2.000000 |
teams_new.describe().transpose() #by transposing, you can have a better view
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
ranking | 32.0 | 26.03125 | 24.233387 | 1.0 | 8.75 | 19.5 | 32.50 | 105.0 |
games | 32.0 | 3.87500 | 1.070122 | 3.0 | 3.00 | 3.5 | 4.25 | 6.0 |
wins | 32.0 | 1.43750 | 1.522678 | 0.0 | 0.75 | 1.0 | 2.00 | 6.0 |
draws | 32.0 | 1.00000 | 0.879883 | 0.0 | 0.00 | 1.0 | 1.25 | 3.0 |
losses | 32.0 | 1.43750 | 0.715609 | 0.0 | 1.00 | 1.0 | 2.00 | 3.0 |
goalsFor | 32.0 | 4.34375 | 3.288390 | 0.0 | 2.00 | 3.0 | 5.25 | 13.0 |
goalsAgainst | 32.0 | 4.34375 | 2.208625 | 1.0 | 3.00 | 5.0 | 5.00 | 12.0 |
yellowCards | 32.0 | 7.15625 | 2.760427 | 2.0 | 5.00 | 7.0 | 8.25 | 15.0 |
redCards | 32.0 | 0.50000 | 0.718421 | 0.0 | 0.00 | 0.0 | 1.00 | 2.0 |
What is missing in the output above?
Categorical variables, i.e. object-type variables.
teams_new.describe(include=['O'])
team | continent | |
---|---|---|
count | 32 | 32 |
unique | 32 | 6 |
top | Brazil | Europe |
freq | 1 | 13 |
If you would like to apply a function to a specific column, you can use the .values attribute or the value_counts() method. For example,
teams_new['wins'].values.mean() #the average wins
1.4375
teams_new['continent'].value_counts() #frequency distribution of continent
Europe           13
South America     7
Africa            6
Asia              3
Australia         2
North America     1
Name: continent, dtype: int64
teams_by_continent.aggregate('mean') #calculate the average of each variable by continent.
ranking | games | wins | draws | losses | goalsFor | goalsAgainst | yellowCards | redCards | |
---|---|---|---|---|---|---|---|---|---|
continent | |||||||||
Africa | 35.333333 | 3.333333 | 0.666667 | 1.000000 | 1.666667 | 2.833333 | 4.000000 | 5.666667 | 0.666667 |
Asia | 65.666667 | 3.666667 | 1.000000 | 0.666667 | 2.000000 | 3.666667 | 7.333333 | 5.000000 | 0.000000 |
Australia | 49.000000 | 3.000000 | 0.500000 | 2.000000 | 0.500000 | 2.500000 | 4.000000 | 6.500000 | 1.000000 |
Europe | 14.153846 | 3.923077 | 1.769231 | 0.769231 | 1.384615 | 4.846154 | 3.846154 | 7.538462 | 0.384615 |
North America | 14.000000 | 4.000000 | 1.000000 | 2.000000 | 1.000000 | 5.000000 | 5.000000 | 9.000000 | 0.000000 |
South America | 18.285714 | 4.571429 | 2.000000 | 1.142857 | 1.428571 | 5.428571 | 4.285714 | 8.571429 | 0.714286 |
teams_by_continent.aggregate('count') #shows the frequency of each column by continent.
team | ranking | games | wins | draws | losses | goalsFor | goalsAgainst | yellowCards | redCards | |
---|---|---|---|---|---|---|---|---|---|---|
continent | ||||||||||
Africa | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
Asia | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
Australia | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
Europe | 13 | 13 | 13 | 13 | 13 | 13 | 13 | 13 | 13 | 13 |
North America | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
South America | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 |
You can also specify a different aggregation function for each column by passing a dictionary to the aggregate method, where the keys are the column names and the values are the aggregation functions. For example:
teams_by_continent.aggregate({
"wins": "mean",
"goalsFor": "max",
"goalsAgainst": "min"
})
wins | goalsFor | goalsAgainst | |
---|---|---|---|
continent | |||
Africa | 0.666667 | 5 | 2 |
Asia | 1.000000 | 6 | 2 |
Australia | 0.500000 | 3 | 2 |
Europe | 1.769231 | 13 | 1 |
North America | 1.000000 | 5 | 5 |
South America | 2.000000 | 10 | 2 |
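Besides a dictionary, you can also pass a list of function names to apply several aggregations to one column at once; a small sketch on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'continent': ['Africa', 'Africa', 'Asia'],
                   'wins': [0, 2, 1]})

# A list of functions produces one result column per function
summary = df.groupby('continent')['wins'].agg(['min', 'mean', 'max'])
print(summary)
```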
Please read the Players.csv data, and answer the following questions.
players = pd.read_csv('Players.csv')
players.head()
surname | team | position | minutes | shots | passes | tackles | saves | |
---|---|---|---|---|---|---|---|---|
0 | Abdoun | Algeria | midfielder | 16 | 0 | 6 | 0 | 0 |
1 | Belhadj | Algeria | defender | 270 | 1 | 146 | 8 | 0 |
2 | Boudebouz | Algeria | midfielder | 74 | 3 | 28 | 1 | 0 |
3 | Bougherra | Algeria | defender | 270 | 1 | 89 | 11 | 0 |
4 | Chaouchi | Algeria | goalkeeper | 90 | 0 | 17 | 0 | 2 |
a) Show the first 5 observations of the data.
b) How many players are there in the data?
c) What is the distribution of the players' positions? (i.e. display the frequency of each position)
d) What is the average number of passes per player in WC 2010?
e) What is the average number of minutes played by position in WC 2010?
f) State the surname of the player who has the longest game time.
g) State the surname of the goalkeeper with the highest number of saves.
Missing data occurs when no information is provided. Pandas provides some tools that make dealing with missing data easier.
In pandas, the missing observations have two representations.
None: None is a Python singleton object that is often used for missing data in Python code.
NaN : NaN (an acronym for Not a Number), is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation
We can use the following functions for missing-value analysis, i.e. we can detect and replace the missing values with these functions:
df.isnull()
df.notnull()
df.dropna()
df.fillna()
df.replace()
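A small sketch of the None/NaN distinction (toy values): in a numeric Series, None is converted to NaN on the way in, and isnull() detects both forms.

```python
import pandas as pd
import numpy as np

# None becomes NaN when stored in a float Series
s = pd.Series([1.0, None, np.nan, 4.0])
print(s.isnull().tolist())  # both missing entries are detected
print(s.isnull().sum())     # total number of missing values
```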
Consider the example,
import pandas as pd
missing_data = pd.read_csv('https://metustat112.github.io/stolen_cars.csv',sep = ";")
print(missing_data)
                 Car   Number
0  Chevrolet Pick-Up  48206.0
1       Ford Pick-Up  47999.0
2        Honda Civic  31673.0
3       Honda Accord  30274.0
4       Toyota Camry  17270.0
5        GMC Pick-Up      NaN
6      Nissan Altima  14108.0
7         Honda CR-V      NaN
isnull and notnull work for the same purpose but return reverse outputs. While isnull returns True for missing values, notnull returns True for available values.
missing_data.isnull()
Car | Number | |
---|---|---|
0 | False | False |
1 | False | False |
2 | False | False |
3 | False | False |
4 | False | False |
5 | False | True |
6 | False | False |
7 | False | True |
missing_data.notnull()
Car | Number | |
---|---|---|
0 | True | True |
1 | True | True |
2 | True | True |
3 | True | True |
4 | True | True |
5 | True | False |
6 | True | True |
7 | True | False |
Instead of having a matrix including boolean values, we can directly compute the total number of missing values for each column in the data as given below.
missing_data.isnull().sum()
Car       0
Number    2
dtype: int64
As seen above, we have 2 missing values in the Number column. Apart from the per-column counts, if we would like to compute the total number of missing observations across the data,
missing_data.isnull().sum().sum() #total number of missing observations
2
Dealing with Missing Data
The missing data problem is one of the hardest problems in data science. There are several approaches, ranging from simple to complex.
Before remedying the missing values in your data, you should figure out the source and the pattern of the missingness. After that, you should decide on your strategy.
The simplest but least preferable solution is dropping the missing observations.
You can remove the NA values with dropna
missing_data.dropna(axis = 0) #axis = 0 removes the rows with na
#axis = 1 drops the columns with na
Car | Number | |
---|---|---|
0 | Chevrolet Pick-Up | 48206.0 |
1 | Ford Pick-Up | 47999.0 |
2 | Honda Civic | 31673.0 |
3 | Honda Accord | 30274.0 |
4 | Toyota Camry | 17270.0 |
6 | Nissan Altima | 14108.0 |
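dropna also takes a subset argument to restrict the check to particular columns; a sketch on a made-up frame modeled on the cars data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Car': ['GMC Pick-Up', 'Honda CR-V', 'Ford'],
                   'Number': [np.nan, np.nan, 47999.0]})

# Only rows with a missing value in the Number column are dropped
print(df.dropna(subset=['Number']))
```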
Imputing the missing values is a better way, for sure. You can fill your missing observations in different ways; however, we will cover only the simplest ones in the context of this course.
fillna() fills all the NA values with the same value.
missing_data.fillna(0)
Car | Number | |
---|---|---|
0 | Chevrolet Pick-Up | 48206.0 |
1 | Ford Pick-Up | 47999.0 |
2 | Honda Civic | 31673.0 |
3 | Honda Accord | 30274.0 |
4 | Toyota Camry | 17270.0 |
5 | GMC Pick-Up | 0.0 |
6 | Nissan Altima | 14108.0 |
7 | Honda CR-V | 0.0 |
You can use the function to fill the missing values with descriptive statistics.
mean = missing_data['Number'].mean()
median = missing_data['Number'].median()
mode = missing_data['Number'].mode()
print('Mean:',mean,'Median:',median,'Mode:',mode)
Mean: 31588.333333333332 Median: 30973.5 Mode:
0    14108.0
1    17270.0
2    30274.0
3    31673.0
4    47999.0
5    48206.0
Name: Number, dtype: float64
missing_data['Number'].fillna(mean) #mean imputation
0    48206.000000
1    47999.000000
2    31673.000000
3    30274.000000
4    17270.000000
5    31588.333333
6    14108.000000
7    31588.333333
Name: Number, dtype: float64
missing_data['Number'].fillna(median) #median imputation
0    48206.0
1    47999.0
2    31673.0
3    30274.0
4    17270.0
5    30973.5
6    14108.0
7    30973.5
Name: Number, dtype: float64
missing_data['Number'].fillna(mode) #mode imputation; note that mode is a Series, so fillna aligns it by index
0    48206.0
1    47999.0
2    31673.0
3    30274.0
4    17270.0
5    48206.0
6    14108.0
7        NaN
Name: Number, dtype: float64
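The NaN that survives above appears because mode() returns a Series (there can be ties), and fillna aligns a Series argument by index label rather than treating it as one value. Selecting a single mode value fixes this; a sketch with made-up numbers:

```python
import pandas as pd
import numpy as np

s = pd.Series([48206.0, np.nan, 14108.0, np.nan])

# mode() returns a Series; take its first entry to fill with one scalar
filled = s.fillna(s.mode()[0])
print(filled)  # no NaN remains
```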
replace substitutes a given value wherever it appears in the data frame.
missing_data.replace(to_replace=np.nan,value=0) #replaces the all missing values
Car | Number | |
---|---|---|
0 | Chevrolet Pick-Up | 48206.0 |
1 | Ford Pick-Up | 47999.0 |
2 | Honda Civic | 31673.0 |
3 | Honda Accord | 30274.0 |
4 | Toyota Camry | 17270.0 |
5 | GMC Pick-Up | 0.0 |
6 | Nissan Altima | 14108.0 |
7 | Honda CR-V | 0.0 |
Import the tao.csv data.
tao = pd.read_csv('tao.csv')
tao.head()
tao_new = tao.drop(['Unnamed: 0'],axis = 1)
tao_new.head()
Year | Latitude | Longitude | Sea.Surface.Temp | Air.Temp | Humidity | UWind | VWind | |
---|---|---|---|---|---|---|---|---|
0 | 1997 | 0 | -110 | 27.590000 | 27.15 | 79.599998 | -6.4 | 5.4 |
1 | 1997 | 0 | -110 | 27.549999 | 27.02 | 75.800003 | -5.3 | 5.3 |
2 | 1997 | 0 | -110 | 27.570000 | 27.00 | 76.500000 | -5.1 | 4.5 |
3 | 1997 | 0 | -110 | 27.620001 | 26.93 | 76.199997 | -4.9 | 2.5 |
4 | 1997 | 0 | -110 | 27.650000 | 26.84 | 76.400002 | -3.5 | 4.1 |
a) Calculate the total number of missing observations in the data.
b) Which column has the highest number of missing observations in the data?
c) Fill the missing values with the mean of the corresponding columns.