Ozancan Ozdemir - ozancan@metu.edu.tr
The quality of the data is one of the most important conditions for producing good results in data science, data analytics or data visualization. No matter how well your algorithms work or how great your tools are, the quality of your output depends on your data input. That’s why it is important to be sure that the data under study is properly refined.
Data cleaning is a foundational process in the data science lifecycle and its role cannot be overemphasized when trying to uncover insights and generate reliable answers. More often than not, real-world data is dirty, and data cleaning cannot be avoided; it is often estimated to take up to 80% of the data analysis process.
This can involve finding and removing duplicate and incomplete records, and modifying data to rectify inaccurate records. Unclean or dirty data has always been a problem, and the exponential rise in data generation over the last decade has only made it more pressing.
There is no absolute way to dictate the exact steps of the data cleansing process, as the process is different for each dataset. However, it’s important to create a template for your data cleansing process and make sure you’re doing it right every time.
How do you clean data?
Step 1: Remove duplicate or irrelevant observations
Step 2: Fix structural errors
Step 3: Filter unwanted outliers
Step 4: Handle missing data
Components of quality data
Validity. The degree to which your data conforms to defined business rules or constraints.
Accuracy. Ensure your data is close to the true values.
Completeness. The degree to which all required data is known.
Consistency. Ensure your data is consistent within the same dataset and/or across multiple data sets.
Uniformity. The degree to which the data is specified using the same unit of measure.
Before data cleaning can properly be done, it is important to understand how the data got dirty in the first place. Then, you can determine your actions for cleaning.
Pandas and NumPy include functions that are useful for data cleaning.
Prior to the cleaning procedure, let's have a look at some concepts necessary for data cleaning: missing data and string operations.
#import the packages
import pandas as pd
import numpy as np
Missing data occurs when no information is provided. Pandas provides some tools that make dealing with missing values easier.
In pandas, the missing observations have two representations.
None: None is a Python singleton object that is often used for missing data in Python code.
NaN : NaN (an acronym for Not a Number), is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation
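As a quick illustration of the two representations, here is a minimal sketch using a small Series; both entries are counted as missing.
s = pd.Series([1.0, None, np.nan, 4.0])  # None and np.nan are both stored as NaN in a float Series
print(s.isnull())                        # True for the second and third entries
print(s.isnull().sum())                  # 2 missing values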
We can use the following functions for missing-value analysis, i.e. we can detect and replace missing values with these functions:
df.isnull()
df.notnull()
df.dropna()
df.fillna()
df.replace()
Consider the example,
missing_data = pd.read_csv('https://metustat112.github.io/stolen_cars.csv',sep = ";")
print(missing_data)
isnull and notnull work for the same purpose, but return reverse outputs. While isnull returns True for missing values, notnull returns True for available values.
missing_data.isnull()
missing_data.notnull()
Instead of having a matrix of boolean values, we can directly compute the total number of missing values for each column in the data as given below.
missing_data.isnull().sum()
## Extra; Missing Percentage
new_df = pd.DataFrame(missing_data.isnull().sum())
new_df.columns = ['Missing Number']
new_df['Missing Percentage'] = (new_df['Missing Number']*100)/missing_data.shape[0]
new_df
As seen above, we have two missing values in the Number column. In other words, 25% of the observations in the Number column are missing.
Apart from the per-column counts, if we would like to compute the total number of missing observations across the whole data frame:
missing_data.isnull().sum().sum() #total number of missing observations
Dealing with Missing Data
The missing data problem is one of the hardest problems in data science. There are several approaches to handling missingness, covering both simple and complex cases.
Before remedying the missing values in your data, you should figure out the source of the missingness and its pattern. After that, you can decide on your strategy.
In this part, we will have a look at some basic solutions. Note that better alternatives are available in practice.
The simplest but least preferable solution for the missing-value problem is to drop the rows or columns concerned.
You can remove the NA values with dropna
missing_data.dropna(axis = 0) #axis = 0 removes the rows with na
#axis = 1 drops the columns with na
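dropna also takes optional arguments such as how, thresh and subset, which give finer control than dropping every row that contains any NA; a short sketch on the same data:
missing_data.dropna(how = 'all')          # drop a row only if all of its values are missing
missing_data.dropna(thresh = 2)           # keep rows with at least 2 non-missing values
missing_data.dropna(subset = ['Number'])  # drop rows whose Number value is missing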
Imputing the missing values is, of course, a better way. You can fill your missing observations in different ways; however, we will talk about the simplest ones in the context of the course.
fillna() fills all NA values with a user-defined constant value.
missing_data.fillna(0)
You can use the function to fill the missing values with descriptive statistics.
mean = missing_data['Number'].mean()
median = missing_data['Number'].median()
mode = missing_data['Number'].mode()
print('Mean:',mean,'Median:',median,'Mode:',mode)
missing_data['Number'].fillna(mean) #mean imputation
missing_data['Number'].fillna(median) #median imputation
missing_data['Number'].fillna(mode[0]) #mode imputation (mode() returns a Series, so take its first element)
replace substitutes a specified value with a given replacement.
missing_data.replace(to_replace=np.nan,value=0) #replaces all missing values with a predefined value
Which one is better?
Although it depends on the task, we can still make a generalization about these three imputation methods. If a numerical variable does not have any outliers, mean imputation can be used. If it has outlying observations, we can impute the median instead of the mean. On the other hand, if the missing values occur in an object-type variable, we can use mode imputation. Mode imputation can also be an option for a numerical variable when it has an approximately symmetric distribution.
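As a rough sketch of that rule of thumb, you can inspect the skewness of the column before choosing between mean and median imputation (the cut-off of 1 below is an arbitrary illustrative choice, not a fixed rule):
number = missing_data['Number']
if abs(number.skew()) < 1:                    # roughly symmetric distribution
    imputed = number.fillna(number.mean())    # mean imputation
else:                                         # skewed / outlier-prone distribution
    imputed = number.fillna(number.median())  # median imputation
print(imputed)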
Import the tao.csv data.
tao = pd.read_csv('tao.csv')
a) Calculate the total number of missing observations in the data.
b) Which column has the highest number of missing observations in the data?
c) Fill the missing values with the mean of the corresponding columns.
String is a data type used to represent characters or text rather than numbers. Pandas provides an easy way of applying string methods to whole columns, which are just pandas Series objects. The .str accessor is used to apply string methods in pandas.
Let's consider the given sample data frame.
galatasaray_squad = pd.DataFrame({'Name':["Fernando,Muslera ", "Léo,Dubois" ,"Patrick van,Aanholt","Mathias,Ross","Lucas,Torreira","Fredrik,Midtsjø",
"Kerem,Aktürkoğlu","Haris,Seferović ","Dries,Mertens",
"Yunus,Akgün","Emre,Taşdemir","Bafétimbi,Gomis",
"Berkan,Kutlu","Victor,nelsson","Milot,Rashica ",
"Sérgio,Oliveira","Yusuf,Demir","Okan,Kocuk",
"Emin,bayram","Abdülkerim,bardakcı","Barış Alper,Yılmaz",
"Baran,Aksaka","Juan,Mata","Hamza,Akman",
"Kazımcan,Karataş","Metehan,baltacı","Sacha,Boey",
"Mauro,Icardi "]})
galatasaray_squad['Name']
Removing unnecessary white spaces
galatasaray_squad['Name'] = galatasaray_squad['Name'].str.strip() # .str.strip() removes unnecessary white spaces
galatasaray_squad['Name']
Changing the character formats
galatasaray_squad['Name'].str.upper().head(5) #Converts all characters to uppercase.
galatasaray_squad['Name'].str.lower().head(5) #Converts all characters to lowercase.
galatasaray_squad['Name'].str.title().head(5) #Converts first character of each word to uppercase and remaining to lowercase.
galatasaray_squad['Name'].str.capitalize().head(5) #Converts first character to uppercase and remaining to lowercase.
galatasaray_squad['Name'] = galatasaray_squad['Name'].str.title() #fix the case problem.
Splitting your string based on a pattern
galatasaray_squad['Name'].str.split(',', expand = True)
galatasaray_squad_list = galatasaray_squad['Name'].str.split(',', expand = True)
galatasaray_squad_list.columns = ['Name','Surname']
galatasaray_squad_list
For more functions, you can see the pandas documentation on working with text data.
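A few other commonly used .str methods, shown on the columns created above (a brief sketch, not an exhaustive list):
galatasaray_squad_list['Surname'].str.contains('a')    # does the surname contain the letter 'a'?
galatasaray_squad_list['Surname'].str.startswith('A')  # does the surname start with 'A'?
galatasaray_squad_list['Surname'].str.len()            # length of each surname
galatasaray_squad_list['Name'].str.cat(galatasaray_squad_list['Surname'], sep = ' ')  # glue name and surname back together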
Please create df1 by running the following code, then apply the following steps respectively.
df1 = pd.DataFrame({
'name': ['alice smith','bob jones','charlie joneson','daisy white'],
'age': [25,20,30,35]
})
df1
We should note that each data set will require different data cleaning procedures, as stated above. The steps we are about to apply cover the most common problems likely to occur. In general, you will use only a few of them when you clean your own data.
Before beginning the cleaning process, it's important to consider what your goals are and what you hope to learn or accomplish by analyzing the data. This will help you determine which data is relevant to your objectives, and which is not. By having a clear understanding of your objectives, you can ensure that the data you are cleaning and analyzing is useful and relevant.
With that in mind, let’s get started.
Step 1: Please read dirty_data.xlsx. Remember that we used a similar dataset in Recitation 3.
dirtydata = pd.read_excel('dirtydata.xlsx')
Step 2: After reading your data, display a couple of observations from both the top and the bottom to be sure that you imported the data correctly, and examine the variables. Then, check the variable types.
dirtydata.head()#first five observations
dirtydata.tail() #last five observations
dirtydata.info()
The data has been imported correctly. It is seen that we have some problems with the column names. The values under the columns are also problematic, in addition to the missingness.
Step 3: Fix the column names to provide an ease for the further steps.
dirtydata.columns = dirtydata.columns.str.title()
dirtydata.columns
After fixing the case problem, remove the trailing underscore in Software.
dirtydata = dirtydata.rename({'Software_': 'Software'},axis = "columns")
dirtydata.columns
Step 4: Check for duplicates and remove them if they are not needed.
dirtydata.duplicated().sum() #.duplicated detects the rows including the same observations
We have one duplicated observation. Since it is a survey data, each row should be unique. Thus, drop the duplicated row.
dirtydata = dirtydata.drop_duplicates() #remove the duplicated row.
dirtydata.duplicated().sum() #no duplicates
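Note that drop_duplicates can also restrict the duplicate check to a subset of columns and control which copy is kept; a small sketch of those options (the column choice below is only for illustration, not a rule to apply to this survey):
dirtydata.drop_duplicates(subset = ['Gender', 'Age'], keep = 'first')  # treat rows sharing Gender and Age as duplicates, keep the first one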
Step 5: Get rid of the irrelevant strings in the values.
In this data, we are working with string objects. The best way to examine the values of strings is to create their frequency tables.
As in many problems, there are several ways to create a frequency table using Pandas. One of them is illustrated below.
for i in dirtydata.select_dtypes(include = 'object').columns:
    print(dirtydata[i].value_counts())
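An equivalent alternative that builds all the frequency tables in a single expression is apply with value_counts; a short sketch:
dirtydata.select_dtypes(include = 'object').apply(pd.Series.value_counts)  # one column of counts per object variable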
The frequency table shows that the Gender, Age, Software and Buy columns contain irrelevant punctuation or strings to be removed or replaced.
Note that you should have at least an introductory level of regex experience to handle the punctuation.
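For instance, a single character class can strip most punctuation in one pass; a hedged sketch (the pattern keeps only letters, digits, spaces and hyphens, which may be too aggressive for some columns, so it is not applied to the data here):
dirtydata['Gender'].str.replace(r"[^A-Za-z0-9\s-]", "", regex = True)  # drops characters such as . _ , ? !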
Gender
dirtydata['Gender'] = dirtydata['Gender'].str.replace(r"\..", "", regex=True) #remove the .. suffix
dirtydata['Gender'] = dirtydata['Gender'].str.replace(r"\_", "", regex=True) #remove the _ suffix
dirtydata['Gender'].value_counts()
dirtydata.loc[dirtydata['Gender'] == 'M', 'Gender'] = 'Male' #replace the M's with Male
dirtydata.loc[dirtydata['Gender'] == 'F', 'Gender'] = 'Female' #replace the F's with Female
dirtydata['Gender'] = dirtydata['Gender'].str.title() #Make them title
dirtydata['Gender'].value_counts()
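An alternative to the two .loc assignments above is a single replace call with a mapping dictionary; a short sketch that gives the same result:
dirtydata['Gender'] = dirtydata['Gender'].replace({'M': 'Male', 'F': 'Female'})  # map the short codes to full labels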
Age
dirtydata.loc[dirtydata['Age'] == "1825", 'Age'] = '18-25' #replace the 1825 level with the string 18-25
dirtydata['Age'].value_counts()
dirtydata['Age'] = dirtydata['Age'].str.replace("1825","18-25") #equivalent fix using str.replace
dirtydata['Age'].value_counts()
Software
dirtydata['Software'] = dirtydata['Software'].str.replace("Apple IOS","Ios")
dirtydata['Software'] = dirtydata['Software'].str.replace("Apple","Ios")
dirtydata['Software'] = dirtydata['Software'].str.replace("iphone","Ios")
dirtydata['Software'].value_counts()
Buy
dirtydata.loc[dirtydata['Buy'] == "HAYIR", 'Buy'] = 'No' #replace the Turkish answer HAYIR with No
dirtydata['Buy'].value_counts()
Step 6: Remove the white spaces in the string values.
The frequency table shows that Age, Software and Buy have unnecessary white spaces in the values.
dirtydata['Age'] = dirtydata['Age'].str.strip()
dirtydata['Software'] = dirtydata['Software'].str.strip()
dirtydata['Buy'] = dirtydata['Buy'].str.strip()
print(dirtydata['Age'].value_counts())
print(dirtydata['Software'].value_counts())
print(dirtydata['Buy'].value_counts())
dirtydata['Age'] = dirtydata['Age'].str.replace("18- 25","18 - 25")
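Instead of stripping the columns one by one, the same idea can be applied to every object column at once; a compact sketch:
for col in dirtydata.select_dtypes(include = 'object').columns:
    dirtydata[col] = dirtydata[col].str.strip()  # strip leading/trailing white space in every text column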
Step 7: Make sure that all strings are in the same format (e.g. all in title case).
The consistency of the string format under the Gender column is already satisfied. The problem occurs only in the Software and Buy columns.
Software
dirtydata['Software'] = dirtydata['Software'].str.title()
dirtydata['Software'].value_counts()
Buy
dirtydata['Buy'] = dirtydata['Buy'].str.title()
dirtydata['Buy'].value_counts()
Step 8: Examine the descriptive statistics of the numerical variables and search for any unusual behavior. Also search for outliers; if there are outliers, replace them with the median.
dirtydata.describe().transpose()
The descriptive statistics show that we have an illogical value in Time (-2) and in Earn (9999999999999). Let's correct them.
dirtydata.loc[dirtydata['Time'] == -2, 'Time'] = 2 #replace the impossible -2 with 2
dirtydata.loc[dirtydata['Earn'] == dirtydata['Earn'].max(), 'Earn'] = dirtydata['Earn'].median() #replace the extreme maximum with the median
dirtydata.describe().transpose()
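If you prefer a rule-based check to eyeballing the summary table, the usual 1.5 x IQR fence is a simple option; a sketch for the Earn column (the 1.5 multiplier is a convention, not a strict rule):
q1 = dirtydata['Earn'].quantile(0.25)
q3 = dirtydata['Earn'].quantile(0.75)
iqr = q3 - q1
outlier_mask = (dirtydata['Earn'] < q1 - 1.5*iqr) | (dirtydata['Earn'] > q3 + 1.5*iqr)
print(dirtydata.loc[outlier_mask, 'Earn'])  # observations flagged as potential outliers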
Step 9: Check for uniformity; that is, the measured variables must be in the same unit.
In this data, we have the Time and Unit variables. Note that we have minute and second entries in the Unit column, while the majority of observations are in hours. So, let's convert them into hours and change the unit name.
dirtydata.loc[dirtydata['Unit'] == "minute", 'Time'] = dirtydata.loc[dirtydata['Unit'] == "minute", 'Time']/60 #minutes to hours
dirtydata.loc[dirtydata['Unit'] == "second", 'Time'] = dirtydata.loc[dirtydata['Unit'] == "second", 'Time']/3600 #seconds to hours
dirtydata['Unit'] = dirtydata['Unit'].str.replace('minute','Hour')
dirtydata['Unit'] = dirtydata['Unit'].str.replace('second','Hour')
dirtydata['Unit'] = dirtydata['Unit'].str.title()
dirtydata['Unit'].value_counts()
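The same conversion generalises to a dictionary of conversion factors combined with map, which avoids one .loc assignment per unit; a self-contained sketch on made-up values, since the real column has already been converted above:
units = pd.Series(['hour', 'minute', 'second'])           # hypothetical unit labels
times = pd.Series([2.0, 30.0, 7200.0])                    # hypothetical measurements
to_hour = {'hour': 1, 'minute': 1/60, 'second': 1/3600}   # conversion factors to hours
print(times * units.map(to_hour))                         # 2.0, 0.5 and 2.0 hours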
Step 10: Search for the missing values. Either impute or drop the observations based on the missing percentage.
Let's create a missing table for the data.
new_df = pd.DataFrame(dirtydata.isnull().sum())
new_df.columns = ['Missing Number']
new_df['Missing Percentage'] = (new_df['Missing Number']*100)/dirtydata.shape[0]
new_df
It is seen that we have 2 missing values in one object variable, Age, and 5 missing values in Spend. On the other hand, 96% of the observations in Download are missing.
Mode imputation for Age, since it is an object variable.
Mean imputation for Spend, since the column does not have outliers.
Drop the Download column.
Age
mode_for_age = dirtydata['Age'].value_counts().idxmax()#gives you the level with highest frequency
dirtydata['Age'] = dirtydata['Age'].fillna(mode_for_age)
Spend
mean_for_spend = dirtydata['Spend'].mean() #pandas mean skips the missing values by default
dirtydata['Spend'] = dirtydata['Spend'].fillna(mean_for_spend)
Download
dirtydata = dirtydata.drop('Download',axis = 1)
dirtydata.head()
Step 11: Examine the date-type column and be sure that all observations are in the same format.
Although we have a date column, it is not stored as a date type; it is an object column.
dirtydata.loc[dirtydata['Dates'] != dirtydata['Dates'].value_counts().idxmax(), 'Dates'] = dirtydata['Dates'].value_counts().idxmax() #replace the differently formatted entries with the most frequent date
dirtydata['Dates'].value_counts()
dirtydata['Dates'] = pd.to_datetime(dirtydata['Dates']) #convert the object column to datetime
dirtydata.info()
dirtydata.head()
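Before wrapping up, note that pd.to_datetime has options that are handy with messy date columns; a brief sketch on made-up strings (invalid entries become NaT instead of raising an error):
raw_dates = pd.Series(['2023-01-15', '2023-02-30', 'not a date'])  # the last two entries are invalid
parsed = pd.to_datetime(raw_dates, errors = 'coerce')              # unparseable values become NaT
print(parsed)
print(parsed.dt.year)                                              # the .dt accessor exposes the datetime components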
We cleaned the data!!!
Now, we will export the data for the future use.
dirtydata.to_csv("cleandata.csv") #export the clean data
Import the ex3_data.xlsx data. Detect the problems and clean the data using pandas functions.
ex3 = pd.read_excel('ex3_data.xlsx')
ex3.head()