Stat 112 - Recitation 9

Ozancan Ozdemir - ozancan@metu.edu.tr

Data Cleaning with Pandas and Numpy

The quality of the data is one of the most important conditions to produce a good results in data science, data analytics or data visualization. It does not matter either you have well-working algorithms or great tools, the quality of your output depends on your data input. That’s why it is important to be sure that the data under the study is properly refined.

Data cleaning is a foundational process in the data science lifecycle and its role cannot be overemphasized when trying to uncover insights and generate reliable answers. More often than not, data will always be dirty in the real world, and data cleaning cannot be completely avoided such that it is estimated that it takes 80% of the data analysis process.

image.png

This can involve finding and removing duplicates and incomplete records, and modifying data to rectify inaccurate records. Unclean or dirty data has always been a problem, yet we have seen an exponential rise in data generation over the last decade.

There is no absolute way to dictate the exact steps of the data cleansing process, as the process is different for each dataset. However, it’s important to create a template for your data cleansing process and make sure you’re doing it right every time.

How do you clean data?

Step 1: Remove duplicate or irrelevant observations

Step 2: Fix structural errors

Step 3: Filter unwanted outliers

Step 4: Handle missing data

Components of quality data

  • Validity. The degree to which your data conforms to defined business rules or constraints.

  • Accuracy. Ensure your data is close to the true values.

  • Completeness. The degree to which all required data is known.

  • Consistency. Ensure your data is consistent within the same dataset and/or across multiple data sets.

  • Uniformity. The degree to which the data is specified using the same unit of measure.

Before data cleaning can properly be done, it is important to understand how it got dirty in the first place. Then, you determine your actions for cleaning.

Pandas and numpy includes functions being useful for data cleaning.

Prior to cleaning procedure, let's have a look at some concepts necessary for data cleaning; Missing data and string operators.

In [3]:
#import the packages 
import pandas as pd
import numpy as np

Missing Data

Missing data occurs when no information is provided. Pandas provides some tools that makes dealing with missing easier.

In pandas, the missing observations have two representations.

  • None: None is a Python singleton object that is often used for missing data in Python code.

  • NaN : NaN (an acronym for Not a Number), is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation

We can use following functions for missing analysis. i.e We can detect and replace the missing values with these functions

df.isnull()

df.notnull()

df.dropna()

df.fillna()

df.replace()

Consider the example,

In [2]:
missing_data = pd.read_csv('https://metustat112.github.io/stolen_cars.csv',sep = ";")
print(missing_data)
                 Car   Number
0  Chevrolet Pick-Up  48206.0
1       Ford Pick-Up  47999.0
2        Honda Civic  31673.0
3       Honda Accord  30274.0
4       Toyota Camry  17270.0
5        GMC Pick-Up      NaN
6      Nissan Altima  14108.0
7         Honda CR-V      NaN

isnull and notnull works for the same purpose, but returns reverse outputs. While isnull returns True for missing values, notnull returns True for available values.

In [ ]:
missing_data.isnull()
Out[ ]:
Car Number
0 False False
1 False False
2 False False
3 False False
4 False False
5 False True
6 False False
7 False True
In [ ]:
missing_data.notnull()
Out[ ]:
Car Number
0 True True
1 True True
2 True True
3 True True
4 True True
5 True False
6 True True
7 True False

Instead of having a matrix including boolean values, we can directly compute the total number of missing values for each column in the data as given below.

In [ ]:
missing_data.isnull().sum()
Out[ ]:
Car       0
Number    2
dtype: int64
In [ ]:
##  Extra; Missing Percentage 
new_df = pd.DataFrame(missing_data.isnull().sum())
new_df.columns = ['Missing Number']
new_df['Missing Percentage'] = (new_df['Missing Number']*100)/missing_data.shape[0]
new_df
Out[ ]:
Missing Number Missing Percentage
Car 0 0.0
Number 2 25.0

As seen above, we have two missing values in Number column. In other words, 25% of the observations in Number column is missing.

Apart from the column names, if we would like to compute the total number of the missing observations across the data,

In [ ]:
missing_data.isnull().sum().sum() #total number of missing observations
Out[ ]:
2

Dealing with Missing Data

The missing data problem is one of the hardest problem in data science. There are several approaches to handle with missingness, including both simple and complex cases.

Before remedying your missing problem in the data, you should figure out the source of the missing and pattern of the missing. After that, you should decide your strategies.

In this part, we will have a look at some of basic solutions. Note that better alternatives are available in practice.

The simplest but most unpreferable solution for missing problem is to drop the rows or columns .

You can remove the NA values with dropna

In [ ]:
missing_data.dropna(axis = 0) #axis = 0 removes the rows with na 
#axis = 1 drops the columns with na
Out[ ]:
Car Number
0 Chevrolet Pick-Up 48206.0
1 Ford Pick-Up 47999.0
2 Honda Civic 31673.0
3 Honda Accord 30274.0
4 Toyota Camry 17270.0
6 Nissan Altima 14108.0

The imputing the missing values is better way, sure. You can fill your missing observations via different way. However, we will talk about the simplest way in the context of the course.

fillna() fills the all NA values with a user defined constant value.

In [ ]:
missing_data.fillna(0)
Out[ ]:
Car Number
0 Chevrolet Pick-Up 48206.0
1 Ford Pick-Up 47999.0
2 Honda Civic 31673.0
3 Honda Accord 30274.0
4 Toyota Camry 17270.0
5 GMC Pick-Up 0.0
6 Nissan Altima 14108.0
7 Honda CR-V 0.0

You can use the function to fill the missing values with descriptive statistics.

In [ ]:
mean = missing_data['Number'].mean()
median = missing_data['Number'].median()
mode = missing_data['Number'].mode()
print('Mean:',mean,'Median:',median,'Mode:',mode)
Mean: 31588.333333333332 Median: 30973.5 Mode: 0    14108.0
1    17270.0
2    30274.0
3    31673.0
4    47999.0
5    48206.0
dtype: float64
In [ ]:
missing_data['Number'].fillna(mean) #mean imputation
Out[ ]:
0    48206.000000
1    47999.000000
2    31673.000000
3    30274.000000
4    17270.000000
5    31588.333333
6    14108.000000
7    31588.333333
Name: Number, dtype: float64
In [ ]:
missing_data['Number'].fillna(median) #median imputation
Out[ ]:
0    48206.0
1    47999.0
2    31673.0
3    30274.0
4    17270.0
5    30973.5
6    14108.0
7    30973.5
Name: Number, dtype: float64
In [ ]:
missing_data['Number'].fillna(mode) #mode imputation
Out[ ]:
0    48206.0
1    47999.0
2    31673.0
3    30274.0
4    17270.0
5    48206.0
6    14108.0
7        NaN
Name: Number, dtype: float64

replace replaces the value in the specified unit.

In [ ]:
missing_data.replace(to_replace=np.nan,value=0) #replaces the all missing values with predefine value.
Out[ ]:
Car Number
0 Chevrolet Pick-Up 48206.0
1 Ford Pick-Up 47999.0
2 Honda Civic 31673.0
3 Honda Accord 30274.0
4 Toyota Camry 17270.0
5 GMC Pick-Up 0.0
6 Nissan Altima 14108.0
7 Honda CR-V 0.0

Which one is better?

Althogh it depends on the task, we can still make a generalization about these three imputation methods. If the numerical variables does not have any outlier, mean imputation can be used. If it has outlier observations, we can impute the median value instead of mean. On the other hand, if the missing value occurs in the object type variable. Then, we can use mode imputation. Mode imputation can be an option for a numerical variable when it has an approximately symmetric distribution.

Example 1

Import the tao.csv data.

In [4]:
tao = pd.read_csv('tao.csv')

a) Calculate the total number of missing observations in the data.

In [5]:

Out[5]:
177

b) Which column has the highest number of missing observations in the data?

In [6]:

Out[6]:
Unnamed: 0           0
Year                 0
Latitude             0
Longitude            0
Sea.Surface.Temp     3
Air.Temp            81
Humidity            93
UWind                0
VWind                0
dtype: int64

c) Fill the missing values with the mean of the corresponding columns.

In [ ]:

String Objects in Pandas

String is a data type used to represent characters or texts rather than numbers. Pandas provides an easy way of applying string methods to whole columns which are just pandas series objects. .str accessor is used to implement the string arguments in Pandas.

Let's consider the given sample data frame.

In [ ]:
galatasaray_squad = pd.DataFrame({'Name':["Fernando,Muslera  ", "Léo,Dubois" ,"Patrick van,Aanholt","Mathias,Ross","Lucas,Torreira","Fredrik,Midtsjø",   
"Kerem,Aktürkoğlu","Haris,Seferović ","Dries,Mertens",      
"Yunus,Akgün","Emre,Taşdemir","Bafétimbi,Gomis",    
"Berkan,Kutlu","Victor,nelsson","Milot,Rashica ",     
"Sérgio,Oliveira","Yusuf,Demir","Okan,Kocuk",         
"Emin,bayram","Abdülkerim,bardakcı","Barış Alper,Yılmaz", 
"Baran,Aksaka","Juan,Mata","Hamza,Akman",        
"Kazımcan,Karataş","Metehan,baltacı","Sacha,Boey",         
"Mauro,Icardi  "]})
In [ ]:
galatasaray_squad['Name']
Out[ ]:
0      Fernando,Muslera  
1              Léo,Dubois
2     Patrick van,Aanholt
3            Mathias,Ross
4          Lucas,Torreira
5         Fredrik,Midtsjø
6        Kerem,Aktürkoğlu
7        Haris,Seferović 
8           Dries,Mertens
9             Yunus,Akgün
10          Emre,TaÅŸdemir
11        Bafétimbi,Gomis
12           Berkan,Kutlu
13         Victor,nelsson
14         Milot,Rashica 
15        Sérgio,Oliveira
16            Yusuf,Demir
17             Okan,Kocuk
18            Emin,bayram
19    Abdülkerim,bardakcı
20     Barış Alper,Yılmaz
21           Baran,Aksaka
22              Juan,Mata
23            Hamza,Akman
24       Kazımcan,Karataş
25        Metehan,baltacı
26             Sacha,Boey
27         Mauro,Icardi  
Name: Name, dtype: object

Removing unnecesary white spaces

In [ ]:
galatasaray_squad['Name'] = galatasaray_squad['Name'].str.strip() # .str_strip() removes unccessary white spaces
galatasaray_squad['Name']
Out[ ]:
0        Fernando,Muslera
1              Léo,Dubois
2     Patrick van,Aanholt
3            Mathias,Ross
4          Lucas,Torreira
5         Fredrik,Midtsjø
6        Kerem,Aktürkoğlu
7         Haris,Seferović
8           Dries,Mertens
9             Yunus,Akgün
10          Emre,TaÅŸdemir
11        Bafétimbi,Gomis
12           Berkan,Kutlu
13         Victor,nelsson
14          Milot,Rashica
15        Sérgio,Oliveira
16            Yusuf,Demir
17             Okan,Kocuk
18            Emin,bayram
19    Abdülkerim,bardakcı
20     Barış Alper,Yılmaz
21           Baran,Aksaka
22              Juan,Mata
23            Hamza,Akman
24       Kazımcan,Karataş
25        Metehan,baltacı
26             Sacha,Boey
27           Mauro,Icardi
Name: Name, dtype: object

Changing the character formats

In [ ]:
galatasaray_squad['Name'].str.upper().head(5) #Converts all characters to uppercase.
Out[ ]:
0       FERNANDO,MUSLERA
1             LÉO,DUBOIS
2    PATRICK VAN,AANHOLT
3           MATHIAS,ROSS
4         LUCAS,TORREIRA
Name: Name, dtype: object
In [ ]:
galatasaray_squad['Name'].str.lower().head(5) #Converts all characters to lowercase.
Out[ ]:
0       fernando,muslera
1             léo,dubois
2    patrick van,aanholt
3           mathias,ross
4         lucas,torreira
Name: Name, dtype: object
In [ ]:
galatasaray_squad['Name'].str.title().head(5) #Converts first character of each word to uppercase and remaining to lowercase.
Out[ ]:
0       Fernando,Muslera
1             Léo,Dubois
2    Patrick Van,Aanholt
3           Mathias,Ross
4         Lucas,Torreira
Name: Name, dtype: object
In [ ]:
galatasaray_squad['Name'].str.capitalize().head(5) #Converts first character to uppercase and remaining to lowercase.
Out[ ]:
0       Fernando,muslera
1             Léo,dubois
2    Patrick van,aanholt
3           Mathias,ross
4         Lucas,torreira
Name: Name, dtype: object
In [ ]:
galatasaray_squad['Name'] = galatasaray_squad['Name'].str.title() #fix the case problem.

Splitting your string based on a pattern

In [ ]:
galatasaray_squad['Name'].str.split(',', expand = True)
In [ ]:
galatasaray_squad_list = galatasaray_squad['Name'].str.split(',', expand = True)
galatasaray_squad_list.columns = ['Name','Surname']
galatasaray_squad_list

For more function, you can see here

Exercise 2

Please create the df1 by compling the following code. Please apply the following steps respectively.

  • make the all strings in title format.
  • seperate name and surname and add to the data as new columns
In [7]:
df1 = pd.DataFrame({
    'name': ['alice smith','bob jones','charlie joneson','daisy white'],
    'age': [25,20,30,35]
})

df1
Out[7]:
name age
0 alice smith 25
1 bob jones 20
2 charlie joneson 30
3 daisy white 35
In [9]:

Out[9]:
name age
0 Alice Smith 25
1 Bob Jones 20
2 Charlie Joneson 30
3 Daisy White 35
In [11]:

Out[11]:
name age surname
0 Alice 25 Smith
1 Bob 20 Jones
2 Charlie 30 Joneson
3 Daisy 35 White

Let's get clean the data

image.png

We should note that each data set will require different data cleaning procedures as stated above. The steps we are about to apply cover the most common problems likely to occur. In general, you use a few of them when you clean your data.

Before beginning the cleaning process, it's important to consider what your goals are and what you hope to learn or accomplish by analyzing the data. This will help you determine which data is relevant to your objectives, and which is not. By having a clear understanding of your objectives, you can ensure that the data you are cleaning and analyzing is useful and relevant.

With that in mind, let’s get started.

Step 1: Please read dirty_data.xlsx. Remember that we used a similar data in Recitation 3.

In [12]:
dirtydata = pd.read_excel('dirtydata.xlsx')

Step 2: After reading your data, represent the couple of observations from both top and bottom to be sure that you import the data correctly, and examine the variables. Then, check the variable type.

In [ ]:
dirtydata.head()#first five observations
Out[ ]:
Id GenDer AgE Software_ TIME Unit BuY SPEND EARN Download Dates
0 1 Male 35 - 50 IOS 2.0 hour Yes 13.0 1 1.0 2019-02-26 00:00:00
1 2 Male 25 - 35 IOS 1.0 hour No 16.0 3 NaN 26 Feb 2019
2 3 Male 25 - 35 IOS 4.0 hour YES 15.0 2 NaN 2019-02-26 00:00:00
3 4 Male 18 - 25 IOS 2.0 hour YES 20.0 6 NaN 2019-02-26 00:00:00
4 5 MALE 18 - 25 Ios 1.5 hour No 18.0 7 NaN 2019-02-26 00:00:00
In [ ]:
dirtydata.tail() #last five observations
Out[ ]:
Id GenDer AgE Software_ TIME Unit BuY SPEND EARN Download Dates
96 97 MALE_ 35 - 50 Apple IOS 2.0 hour Yes 18.0 5 NaN 2019-02-26 00:00:00
97 98 F 25 - 35 Apple IOS 2.0 hour No 15.0 8 NaN 2019-02-26 00:00:00
98 99 Female 25 - 35 Apple IOS 1.0 hour Yes 12.0 6 NaN 2019-02-26 00:00:00
99 100 Female 18 - 25 Apple IOS 1.0 hour No 16.0 10 NaN 2019-02-26 00:00:00
100 101 Female 18 - 25 Apple IOS 0.0 hour No 16.0 9999999999999 1.0 26 Feb 2019
In [13]:
dirtydata.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101 entries, 0 to 100
Data columns (total 11 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Id         101 non-null    int64  
 1   GenDer     101 non-null    object 
 2   AgE        101 non-null    object 
 3   Software_  101 non-null    object 
 4   TIME       101 non-null    float64
 5   Unit       101 non-null    object 
 6   BuY        101 non-null    object 
 7   SPEND      96 non-null     float64
 8   EARN       101 non-null    int64  
 9   Download   4 non-null      float64
 10  Dates      101 non-null    object 
dtypes: float64(3), int64(2), object(6)
memory usage: 8.8+ KB

The data has imported correctly. It is seen that we have some problems for column names. The values under the columns are also problematic in addition to the missingness.

Step 3: Fix the column names to provide an ease for the further steps.

In [ ]:
dirtydata.columns = dirtydata.columns.str.title()
dirtydata.columns
Out[ ]:
Index(['Id', 'Gender', 'Age', 'Software_', 'Time', 'Unit', 'Buy', 'Spend',
       'Earn', 'Download', 'Dates'],
      dtype='object')

After fixing the case problem, remove the quotation mark in Software.

In [ ]:
dirtydata = dirtydata.rename({'Software_': 'Software'},axis = "columns")
dirtydata.columns
Out[ ]:
Index(['Id', 'Gender', 'Age', 'Software', 'Time', 'Unit', 'Buy', 'Spend',
       'Earn', 'Download', 'Dates'],
      dtype='object')

Step 4: Check the duplicates and remove them if it is not necessary.

In [ ]:
dirtydata.duplicated().sum() #.duplicated detects the rows including the same observations
Out[ ]:
1

We have one duplicated observation. Since it is a survey data, each row should be unique. Thus, drop the duplicated row.

In [ ]:
dirtydata = dirtydata.drop_duplicates() #remove the duplicated row.
In [ ]:
dirtydata.duplicated().sum() #no duplicates
Out[ ]:
0

Step 5: Get rid of the irrelevant strings in the values.

In this data, we are working with string objects. The best way to examine the values of strings is to create their frequency tables.

As in many problems, there are several ways to create a frequency table using Pandas. One of them is illustrated below.

In [ ]:
for i in dirtydata.select_dtypes(include = 'object').columns:
  print(dirtydata[i].value_counts())
Male        59
M           14
MALE        12
Female       5
F            4
FEMALE       3
FEMALE..     2
MALE_        2
Name: Gender, dtype: int64
18 - 25    53
25 - 35    33
35 - 50    12
1825        2
18-  25     1
Name: Age, dtype: int64
Android      52
ANDROID      16
Apple        15
Apple IOS     6
 IOS          5
  IOS         2
 Ios          1
iphone        1
ıOS           1
Huawei        1
Sony          1
Name: Software, dtype: int64
hour      91
minute     8
second     2
Name: Unit, dtype: int64
No       54
Yes      39
HAYIR     3
YES       2
 No       2
  No      1
Name: Buy, dtype: int64
2019-02-26 00:00:00    92
26 Feb 2019             9
Name: Dates, dtype: int64

The frequency table shows that Gender, Age, Software and Buy columns have the irrelevant punctuations or strings to be removed or replaced.

Note that you should have at least introductory level regex experience for punctuations.

Gender

In [ ]:
dirtydata['Gender'] = dirtydata['Gender'].str.replace("\..","",regex=True) #remove .. 
dirtydata['Gender'] = dirtydata['Gender'].str.replace("\_","",regex=True)#remove _
dirtydata['Gender'].value_counts()
Out[ ]:
Male      59
MALE      14
M         14
FEMALE     5
Female     5
F          4
Name: Gender, dtype: int64
In [ ]:
dirtydata['Gender'].loc[dirtydata['Gender'] == 'M'] = 'Male' #replace the M's with Male
dirtydata['Gender'].loc[dirtydata['Gender'] == 'F'] = 'Female' #replace the F's with Females
dirtydata['Gender'] = dirtydata['Gender'].str.title() #Make them title 
dirtydata['Gender'].value_counts()
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)
Out[ ]:
Male      87
Female    14
Name: Gender, dtype: int64

Age

In [ ]:
dirtydata['Age'].loc[dirtydata['Age']=="1825"] = 18-25
dirtydata['Age'].value_counts()
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)
Out[ ]:
18 - 25    53
25 - 35    33
35 - 50    12
1825        2
18-  25     1
Name: Age, dtype: int64
In [ ]:
dirtydata['Age'] = dirtydata['Age'].str.replace("1825","18-25")
dirtydata['Age'].value_counts()
Out[ ]:
18 - 25    53
25 - 35    33
35 - 50    12
18-  25     1
Name: Age, dtype: int64

Software

In [ ]:
dirtydata['Software'] = dirtydata['Software'].str.replace("Apple IOS","Ios")
dirtydata['Software'] = dirtydata['Software'].str.replace("Apple","Ios")
dirtydata['Software'] = dirtydata['Software'].str.replace("iphone","Ios")
dirtydata['Software'].value_counts()
Out[ ]:
Android    52
Ios        22
ANDROID    16
 IOS        5
  IOS       2
 Ios        1
ıOS         1
Huawei      1
Sony        1
Name: Software, dtype: int64

Buy

In [ ]:
dirtydata['Buy'].loc[dirtydata['Buy']=="HAYIR"]  = 'No'
dirtydata['Buy'].value_counts()
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)
Out[ ]:
No     60
Yes    39
YES     2
Name: Buy, dtype: int64

Step 6: Remove the white spaces in the string values.

The frequency table shows that Age, Software and Buy have unnecessary white spaces in the values.

In [ ]:
dirtydata['Age'] = dirtydata['Age'].str.strip()
dirtydata['Software'] = dirtydata['Software'].str.strip()
dirtydata['Buy'] = dirtydata['Buy'].str.strip()
print(dirtydata['Age'].value_counts())
print(dirtydata['Software'].value_counts())
print(dirtydata['Buy'].value_counts())
18 - 25    53
25 - 35    33
35 - 50    12
18-  25     1
Name: Age, dtype: int64
Android    52
Ios        23
ANDROID    16
IOS         7
ıOS         1
Huawei      1
Sony        1
Name: Software, dtype: int64
No     60
Yes    39
YES     2
Name: Buy, dtype: int64
In [ ]:
dirtydata['Age'] = dirtydata['Age'].str.replace("18-  25","18 - 25")

Step 7: Make sure that all strings are in the same format (e.g all in title)

The consistency in the format of the string values under gender column is satisfied. The problem occurs in only software and buy columns.

Software

In [ ]:
dirtydata['Software'] = dirtydata['Software'].str.title()
dirtydata['Software'].value_counts()
Out[ ]:
Android    68
Ios        31
Huawei      1
Sony        1
Name: Software, dtype: int64

Buy

In [ ]:
dirtydata['Buy'] = dirtydata['Buy'].str.title()
dirtydata['Buy'].value_counts()
Out[ ]:
No     60
Yes    41
Name: Buy, dtype: int64

Step 8: Examine the descriptive statistics of the numerical variables. Search for any unusual behavior. Search also outliers, if there are outliers, replace them with the median.

In [ ]:
dirtydata.describe().transpose()
Out[ ]:
count mean std min 25% 50% 75% max
Id 101.0 5.099010e+01 2.929590e+01 1.0 26.0 51.0 76.0 1.010000e+02
Time 101.0 1.166089e+02 7.969932e+02 -2.0 1.0 2.0 3.0 7.200000e+03
Spend 96.0 1.493750e+01 3.131546e+00 10.0 12.0 15.0 18.0 2.000000e+01
Earn 101.0 9.901980e+10 9.950362e+11 1.0 3.0 5.0 8.0 1.000000e+13
Download 4.0 1.000000e+00 0.000000e+00 1.0 1.0 1.0 1.0 1.000000e+00

The descriptive statistics show that we have unlogical value in time (-2), and in earn (9999999999999). Let's make them correct.

In [ ]:
dirtydata['Time'].loc[dirtydata['Time']== -2] = 2
dirtydata['Earn'].loc[dirtydata['Earn']==dirtydata['Earn'].values.max()] = dirtydata['Earn'].values.median()
dirtydata['Earn'].loc[dirtydata['Earn']==dirtydata['Earn'].values.max()] = dirtydata['Earn'].values.median()
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)
In [ ]:
dirtydata.describe().transpose()
Out[ ]:
count mean std min 25% 50% 75% max
Id 101.0 50.990099 29.295902 1.0 26.0 51.0 76.0 101.0
Time 101.0 116.727723 796.975654 0.0 1.0 2.0 3.0 7200.0
Spend 96.0 14.937500 3.131546 10.0 12.0 15.0 18.0 20.0
Earn 101.0 4.435644 2.826361 0.0 2.0 4.0 7.0 9.0
Download 4.0 1.000000 0.000000 1.0 1.0 1.0 1.0 1.0

Step 9: Search for uniformity. That is the measured variables must be in the same unit.

In this data, we have time and unit variable. Note that we have minute and second in unit colums where the majority were dominated by hour. So, let's convert them into hour and change the unit name.

In [ ]:
dirtydata['Time'].loc[dirtydata['Unit'] =="minute"] = dirtydata['Time'].loc[dirtydata['Unit'] =="minute"]/60
dirtydata['Time'].loc[dirtydata['Unit'] =="second"] = dirtydata['Time'].loc[dirtydata['Unit'] =="second"]/3600
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)
In [ ]:
dirtydata['Unit'] = dirtydata['Unit'].str.replace('minute','Hour')
dirtydata['Unit'] = dirtydata['Unit'].str.replace('second','Hour')
dirtydata['Unit'] = dirtydata['Unit'].str.title()
dirtydata['Unit'].value_counts()
Out[ ]:
Hour    101
Name: Unit, dtype: int64

Step 10: Search for the missing values. Either impute or drop the observations based on the missing percentage.

Let's create a missing table for the data.

In [ ]:
new_df = pd.DataFrame(dirtydata.isnull().sum())
new_df.columns = ['Missing Number']
new_df['Missing Percentage'] = (new_df['Missing Number']*100)/dirtydata.shape[0]
new_df
Out[ ]:
Missing Number Missing Percentage
Id 0 0.000000
Gender 0 0.000000
Age 2 1.980198
Software 0 0.000000
Time 0 0.000000
Unit 0 0.000000
Buy 0 0.000000
Spend 5 4.950495
Earn 0 0.000000
Download 97 96.039604
Dates 0 0.000000

It is seen that we have 2 missings in one object variable; age, 5 missing variables in spend. On the other hand, 96% of the observations in Download are missing.

  • Mode imputation for age since it is a object variable.

  • Mean imputation for spend since the data does not have outlier.

  • Drop the download column

Age

In [ ]:
mode_for_age = dirtydata['Age'].value_counts().idxmax()#gives you the level with highest frequency 
dirtydata['Age'] = dirtydata['Age'].fillna(mode_for_age)
Out[ ]:
18 - 25    55
25 - 35    33
35 - 50    12
18-25       1
Name: Age, dtype: int64

Spend

In [ ]:
mean_for_spend = dirtydata['Spend'].values.mean()
dirtydata['Spend'] = dirtydata['Spend'].fillna(mean_for_spend)

Download

In [ ]:
dirtydata = dirtydata.drop('Download',axis = 1)
dirtydata.head()
Out[ ]:
Id Gender Age Software Time Unit Buy Spend Earn Dates
0 1 Male 35 - 50 Ios 2.0 Hour Yes 13.0 1 2019-02-26 00:00:00
1 2 Male 25 - 35 Ios 1.0 Hour No 16.0 3 26 Feb 2019
2 3 Male 25 - 35 Ios 4.0 Hour Yes 15.0 2 2019-02-26 00:00:00
3 4 Male 18 - 25 Ios 2.0 Hour Yes 20.0 6 2019-02-26 00:00:00
4 5 Male 18 - 25 Ios 1.5 Hour No 18.0 7 2019-02-26 00:00:00

Step 11: Examine the date type object and be sure that all observations in the same format.

Although we have a date column, it is not a date type data. It is an object data.

In [ ]:
dirtydata['Dates'].loc[dirtydata['Dates'] != dirtydata['Dates'].value_counts().idxmax()] = dirtydata['Dates'].value_counts().idxmax()
dirtydata['Dates'].value_counts()
Out[ ]:
2019-02-26    101
Name: Dates, dtype: int64
In [ ]:
dirtydata['Dates'] = pd.to_datetime(dirtydata['Dates']) #change the object time
In [ ]:
dirtydata.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101 entries, 0 to 100
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Id        101 non-null    int64         
 1   Gender    101 non-null    object        
 2   Age       99 non-null     object        
 3   Software  101 non-null    object        
 4   Time      101 non-null    float64       
 5   Unit      101 non-null    object        
 6   Buy       101 non-null    object        
 7   Spend     96 non-null     float64       
 8   Earn      101 non-null    int64         
 9   Dates     101 non-null    datetime64[ns]
dtypes: datetime64[ns](1), float64(2), int64(2), object(5)
memory usage: 8.0+ KB
In [ ]:
dirtydata.head()
Out[ ]:
Id Gender Age Software Time Unit Buy Spend Earn Dates
0 1 Male 35 - 50 Ios 2.0 Hour Yes 13.0 1 2019-02-26
1 2 Male 25 - 35 Ios 1.0 Hour No 16.0 3 2019-02-26
2 3 Male 25 - 35 Ios 4.0 Hour Yes 15.0 2 2019-02-26
3 4 Male 18 - 25 Ios 2.0 Hour Yes 20.0 6 2019-02-26
4 5 Male 18 - 25 Ios 1.5 Hour No 18.0 7 2019-02-26

We cleaned the data!!!

image.png

Now, we will export the data for the future use.

In [ ]:
dirtydata.to_csv("cleandata.csv") #export the clean data

Example 3

Import the ex3_data.xlsx data. Detect the problems and clean the data using pandas functions.

In [14]:
ex3 = pd.read_excel('ex3_data.xlsx')
ex3.head()
Out[14]:
Time TRICK gender AGE unit country Sta*te Grand Income SpEND Status
0 10.24.2016 No male 22 year Canada Ontario JOY 1502.0 1019.0 Rich
1 10.24.2016 No Male 45 year usa il MEH 1948.0 567.0 Rich
2 10.24.2016 No Female 48 year US Colorado JOY 2405.0 813.0 NaN
3 10.24.2016 No Male 57 year usa il JOY 2372.0 1053.0 NaN
4 10.24.2016 Yes Male 42 year USA ontario MEH 2518.0 235.0 NaN