Ozancan Ozdemir - ozancan@metu.edu.tr
The quality of the data is one of the most important conditions for producing good results in data science, data analytics or data visualization. No matter how well your algorithms work or how great your tools are, the quality of your output depends on your data input. That’s why it is important to be sure that the data under study is properly refined.
Data cleaning is a foundational process in the data science lifecycle and its role cannot be overemphasized when trying to uncover insights and generate reliable answers. More often than not, real-world data is dirty, and data cleaning cannot be avoided; it is often estimated to take up to 80% of the data analysis process.
This can involve finding and removing duplicate and incomplete records, and modifying data to rectify inaccurate records. Unclean or dirty data has always been a problem, and the exponential rise in data generation over the last decade has only made it more pressing.
There is no absolute way to dictate the exact steps of the data cleansing process, as the process is different for each dataset. However, it’s important to create a template for your data cleansing process and make sure you’re doing it right every time.
How do you clean data?
Step 1: Remove duplicate or irrelevant observations
Step 2: Fix structural errors
Step 3: Filter unwanted outliers
Step 4: Handle missing data
Components of quality data
Validity. The degree to which your data conforms to defined business rules or constraints.
Accuracy. Ensure your data is close to the true values.
Completeness. The degree to which all required data is known.
Consistency. Ensure your data is consistent within the same dataset and/or across multiple data sets.
Uniformity. The degree to which the data is specified using the same unit of measure.
Before data cleaning can properly be done, it is important to understand how the data got dirty in the first place. Then, you can determine your actions for cleaning.
Pandas and NumPy include functions that are useful for data cleaning.
Prior to the cleaning procedure, let's have a look at some concepts necessary for data cleaning: missing data and string operations.
#import the packages
import pandas as pd
import numpy as np
Missing data occurs when no information is provided. Pandas provides some tools that make dealing with missing values easier.
In pandas, the missing observations have two representations.
None: None is a Python singleton object that is often used for missing data in Python code.
NaN : NaN (an acronym for Not a Number), is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation
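As a quick illustration of the two representations, here is a minimal sketch using a small Series; both entries are counted as missing.
s = pd.Series([1.0, None, np.nan, 4.0])  # None and np.nan are both stored as NaN in a float Series
print(s.isnull())                        # True for the second and third entries
print(s.isnull().sum())                  # 2 missing values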
We can use the following functions for missing-value analysis, i.e. we can detect and replace missing values with these functions:
df.isnull()
df.notnull()
df.dropna()
df.fillna()
df.replace()
Consider the example,
missing_data = pd.read_csv('https://metustat112.github.io/stolen_cars.csv',sep = ";")
print(missing_data)
isnull and notnull work for the same purpose, but return reverse outputs. While isnull returns True for missing values, notnull returns True for available values.
missing_data.isnull()
missing_data.notnull()
Instead of having a matrix of boolean values, we can directly compute the total number of missing values for each column in the data as given below.
missing_data.isnull().sum()
## Extra; Missing Percentage
new_df = pd.DataFrame(missing_data.isnull().sum())
new_df.columns = ['Missing Number']
new_df['Missing Percentage'] = (new_df['Missing Number']*100)/missing_data.shape[0]
new_df
As seen above, we have two missing values in the Number column. In other words, 25% of the observations in the Number column are missing.
Apart from the per-column counts, if we would like to compute the total number of missing observations across the whole data frame:
missing_data.isnull().sum().sum() #total number of missing observations
Dealing with Missing Data
The missing data problem is one of the hardest problems in data science. There are several approaches to handling missingness, covering both simple and complex cases.
Before remedying the missing values in your data, you should figure out the source of the missingness and its pattern. After that, you can decide on your strategy.
In this part, we will have a look at some basic solutions. Note that better alternatives are available in practice.
The simplest but least preferable solution for the missing-value problem is to drop the rows or columns concerned.
You can remove the NA values with dropna
missing_data.dropna(axis = 0) #axis = 0 removes the rows with na
#axis = 1 drops the columns with na
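dropna also takes optional arguments such as how, thresh and subset, which give finer control than dropping every row that contains any NA; a short sketch on the same data:
missing_data.dropna(how = 'all')          # drop a row only if all of its values are missing
missing_data.dropna(thresh = 2)           # keep rows with at least 2 non-missing values
missing_data.dropna(subset = ['Number'])  # drop rows whose Number value is missing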
Imputing the missing values is, of course, a better way. You can fill your missing observations in different ways; however, we will talk about the simplest ones in the context of the course.
fillna() fills all NA values with a user-defined constant value.
missing_data.fillna(0)
You can use the function to fill the missing values with descriptive statistics.
mean = missing_data['Number'].mean()
median = missing_data['Number'].median()
mode = missing_data['Number'].mode()
print('Mean:',mean,'Median:',median,'Mode:',mode)
missing_data['Number'].fillna(mean) #mean imputation
missing_data['Number'].fillna(median) #median imputation
missing_data['Number'].fillna(mode[0]) #mode imputation (mode() returns a Series, so take its first element)
replace substitutes a specified value with a given replacement.
missing_data.replace(to_replace=np.nan,value=0) #replaces all missing values with a predefined value
Which one is better?
Although it depends on the task, we can still make a generalization about these three imputation methods. If a numerical variable does not have any outliers, mean imputation can be used. If it has outlying observations, we can impute the median instead of the mean. On the other hand, if the missing values occur in an object-type variable, we can use mode imputation. Mode imputation can also be an option for a numerical variable when it has an approximately symmetric distribution.
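As a rough sketch of that rule of thumb, you can inspect the skewness of the column before choosing between mean and median imputation (the cut-off of 1 below is an arbitrary illustrative choice, not a fixed rule):
number = missing_data['Number']
if abs(number.skew()) < 1:                    # roughly symmetric distribution
    imputed = number.fillna(number.mean())    # mean imputation
else:                                         # skewed / outlier-prone distribution
    imputed = number.fillna(number.median())  # median imputation
print(imputed)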
Import the tao.csv data.
tao = pd.read_csv('tao.csv')
a) Calculate the total number of missing observations in the data.
b) Which column has the highest number of missing observations in the data?
c) Fill the missing values with the mean of the corresponding columns.
String is a data type used to represent characters or text rather than numbers. Pandas provides an easy way of applying string methods to whole columns, which are just pandas Series objects. The .str accessor is used to apply string methods in pandas.
Let's consider the given sample data frame.
galatasaray_squad = pd.DataFrame({'Name':["Fernando,Muslera ", "Léo,Dubois" ,"Patrick van,Aanholt","Mathias,Ross","Lucas,Torreira","Fredrik,Midtsjø",
"Kerem,Aktürkoğlu","Haris,Seferović ","Dries,Mertens",
"Yunus,Akgün","Emre,Taşdemir","Bafétimbi,Gomis",
"Berkan,Kutlu","Victor,nelsson","Milot,Rashica ",
"Sérgio,Oliveira","Yusuf,Demir","Okan,Kocuk",
"Emin,bayram","Abdülkerim,bardakcı","Barış Alper,Yılmaz",
"Baran,Aksaka","Juan,Mata","Hamza,Akman",
"Kazımcan,Karataş","Metehan,baltacı","Sacha,Boey",
"Mauro,Icardi "]})
galatasaray_squad['Name']
Removing unnecessary white spaces
galatasaray_squad['Name'] = galatasaray_squad['Name'].str.strip() # .str.strip() removes unnecessary white spaces
galatasaray_squad['Name']
Changing the character formats
galatasaray_squad['Name'].str.upper().head(5) #Converts all characters to uppercase.
galatasaray_squad['Name'].str.lower().head(5) #Converts all characters to lowercase.
galatasaray_squad['Name'].str.title().head(5) #Converts first character of each word to uppercase and remaining to lowercase.
galatasaray_squad['Name'].str.capitalize().head(5) #Converts first character to uppercase and remaining to lowercase.
galatasaray_squad['Name'] = galatasaray_squad['Name'].str.title() #fix the case problem.
Splitting your string based on a pattern
galatasaray_squad['Name'].str.split(',', expand = True)
galatasaray_squad_list = galatasaray_squad['Name'].str.split(',', expand = True)
galatasaray_squad_list.columns = ['Name','Surname']
galatasaray_squad_list
For more functions, you can see the pandas documentation on working with text data.
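A few other commonly used .str methods, shown on the columns created above (a brief sketch, not an exhaustive list):
galatasaray_squad_list['Surname'].str.contains('a')    # does the surname contain the letter 'a'?
galatasaray_squad_list['Surname'].str.startswith('A')  # does the surname start with 'A'?
galatasaray_squad_list['Surname'].str.len()            # length of each surname
galatasaray_squad_list['Name'].str.cat(galatasaray_squad_list['Surname'], sep = ' ')  # glue name and surname back together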
Please create df1 by running the following code, then apply the following steps respectively.
df1 = pd.DataFrame({
'name': ['alice smith','bob jones','charlie joneson','daisy white'],
'age': [25,20,30,35]
})
df1
We should note that each data set will require different data cleaning procedures, as stated above. The steps we are about to apply cover the most common problems likely to occur. In general, you will use only a few of them when you clean your own data.
Before beginning the cleaning process, it's important to consider what your goals are and what you hope to learn or accomplish by analyzing the data. This will help you determine which data is relevant to your objectives, and which is not. By having a clear understanding of your objectives, you can ensure that the data you are cleaning and analyzing is useful and relevant.
With that in mind, let’s get started.
Step 1: Please read dirty_data.xlsx. Remember that we used a similar dataset in Recitation 3.
dirtydata = pd.read_excel('dirtydata.xlsx')
Step 2: After reading your data, display a couple of observations from both the top and the bottom to be sure that you imported the data correctly, and examine the variables. Then, check the variable types.
dirtydata.head()#first five observations
dirtydata.tail() #last five observations
dirtydata.info()
The data has been imported correctly. It is seen that we have some problems with the column names. The values under the columns are also problematic, in addition to the missingness.
Step 3: Fix the column names to provide an ease for the further steps.
dirtydata.columns = dirtydata.columns.str.title()
dirtydata.columns
After fixing the case problem, remove the trailing underscore in Software.
dirtydata = dirtydata.rename({'Software_': 'Software'},axis = "columns")
dirtydata.columns
Step 4: Check for duplicates and remove them if they are not needed.
dirtydata.duplicated().sum() #.duplicated detects the rows including the same observations
We have one duplicated observation. Since it is a survey data, each row should be unique. Thus, drop the duplicated row.
dirtydata = dirtydata.drop_duplicates() #remove the duplicated row.
dirtydata.duplicated().sum() #no duplicates
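Note that drop_duplicates can also restrict the duplicate check to a subset of columns and control which copy is kept; a small sketch of those options (the column choice below is only for illustration, not a rule to apply to this survey):
dirtydata.drop_duplicates(subset = ['Gender', 'Age'], keep = 'first')  # treat rows sharing Gender and Age as duplicates, keep the first one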
Step 5: Get rid of the irrelevant strings in the values.
In this data, we are working with string objects. The best way to examine the values of strings is to create their frequency tables.
As in many problems, there are several ways to create a frequency table using Pandas. One of them is illustrated below.
for i in dirtydata.select_dtypes(include = 'object').columns:
    print(dirtydata[i].value_counts())
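An equivalent alternative that builds all the frequency tables in a single expression is apply with value_counts; a short sketch:
dirtydata.select_dtypes(include = 'object').apply(pd.Series.value_counts)  # one column of counts per object variable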
The frequency table shows that the Gender, Age, Software and Buy columns contain irrelevant punctuation or strings to be removed or replaced.
Note that you should have at least an introductory level of regex experience to handle the punctuation.
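For instance, a single character class can strip most punctuation in one pass; a hedged sketch (the pattern keeps only letters, digits, spaces and hyphens, which may be too aggressive for some columns, so it is not applied to the data here):
dirtydata['Gender'].str.replace(r"[^A-Za-z0-9\s-]", "", regex = True)  # drops characters such as . _ , ? !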
Gender
dirtydata['Gender'] = dirtydata['Gender'].str.replace(r"\..", "", regex=True) #remove the .. suffix
dirtydata['Gender'] = dirtydata['Gender'].str.replace(r"\_", "", regex=True) #remove the _ suffix
dirtydata['Gender'].value_counts()
dirtydata.loc[dirtydata['Gender'] == 'M', 'Gender'] = 'Male' #replace the M's with Male
dirtydata.loc[dirtydata['Gender'] == 'F', 'Gender'] = 'Female' #replace the F's with Female
dirtydata['Gender'] = dirtydata['Gender'].str.title() #Make them title
dirtydata['Gender'].value_counts()
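An alternative to the two .loc assignments above is a single replace call with a mapping dictionary; a short sketch that gives the same result:
dirtydata['Gender'] = dirtydata['Gender'].replace({'M': 'Male', 'F': 'Female'})  # map the short codes to full labels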
Age
dirtydata.loc[dirtydata['Age'] == "1825", 'Age'] = '18-25' #replace the 1825 level with the string 18-25
dirtydata['Age'].value_counts()
dirtydata['Age'] = dirtydata['Age'].str.replace("1825","18-25") #equivalent fix using str.replace
dirtydata['Age'].value_counts()
Software
dirtydata['Software'] = dirtydata['Software'].str.replace("Apple IOS","Ios")
dirtydata['Software'] = dirtydata['Software'].str.replace("Apple","Ios")
dirtydata['Software'] = dirtydata['Software'].str.replace("iphone","Ios")
dirtydata['Software'].value_counts()
Buy
dirtydata.loc[dirtydata['Buy'] == "HAYIR", 'Buy'] = 'No' #replace the Turkish answer HAYIR with No
dirtydata['Buy'].value_counts()
Step 6: Remove the white spaces in the string values.
The frequency table shows that Age, Software and Buy have unnecessary white spaces in the values.
dirtydata['Age'] = dirtydata['Age'].str.strip()
dirtydata['Software'] = dirtydata['Software'].str.strip()
dirtydata['Buy'] = dirtydata['Buy'].str.strip()
print(dirtydata['Age'].value_counts())
print(dirtydata['Software'].value_counts())
print(dirtydata['Buy'].value_counts())
dirtydata['Age'] = dirtydata['Age'].str.replace("18- 25","18 - 25")
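Instead of stripping the columns one by one, the same idea can be applied to every object column at once; a compact sketch:
for col in dirtydata.select_dtypes(include = 'object').columns:
    dirtydata[col] = dirtydata[col].str.strip()  # strip leading/trailing white space in every text column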
Step 7: Make sure that all strings are in the same format (e.g. all in title case).
The consistency of the string format under the Gender column is already satisfied. The problem occurs only in the Software and Buy columns.
Software
dirtydata['Software'] = dirtydata['Software'].str.title()
dirtydata['Software'].value_counts()
Buy
dirtydata['Buy'] = dirtydata['Buy'].str.title()
dirtydata['Buy'].value_counts()
Step 8: Examine the descriptive statistics of the numerical variables and search for any unusual behavior. Also search for outliers; if there are outliers, replace them with the median.
dirtydata.describe().transpose()
The descriptive statistics show that we have an illogical value in Time (-2) and in Earn (9999999999999). Let's correct them.
dirtydata.loc[dirtydata['Time'] == -2, 'Time'] = 2 #replace the impossible -2 with 2
dirtydata.loc[dirtydata['Earn'] == dirtydata['Earn'].max(), 'Earn'] = dirtydata['Earn'].median() #replace the extreme maximum with the median
dirtydata.describe().transpose()
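If you prefer a rule-based check to eyeballing the summary table, the usual 1.5 x IQR fence is a simple option; a sketch for the Earn column (the 1.5 multiplier is a convention, not a strict rule):
q1 = dirtydata['Earn'].quantile(0.25)
q3 = dirtydata['Earn'].quantile(0.75)
iqr = q3 - q1
outlier_mask = (dirtydata['Earn'] < q1 - 1.5*iqr) | (dirtydata['Earn'] > q3 + 1.5*iqr)
print(dirtydata.loc[outlier_mask, 'Earn'])  # observations flagged as potential outliers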
Step 9: Check for uniformity; that is, the measured variables must be in the same unit.
In this data, we have the Time and Unit variables. Note that we have minute and second entries in the Unit column, while the majority of observations are in hours. So, let's convert them into hours and change the unit name.
dirtydata.loc[dirtydata['Unit'] == "minute", 'Time'] = dirtydata.loc[dirtydata['Unit'] == "minute", 'Time']/60 #minutes to hours
dirtydata.loc[dirtydata['Unit'] == "second", 'Time'] = dirtydata.loc[dirtydata['Unit'] == "second", 'Time']/3600 #seconds to hours
dirtydata['Unit'] = dirtydata['Unit'].str.replace('minute','Hour')
dirtydata['Unit'] = dirtydata['Unit'].str.replace('second','Hour')
dirtydata['Unit'] = dirtydata['Unit'].str.title()
dirtydata['Unit'].value_counts()
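The same conversion generalises to a dictionary of conversion factors combined with map, which avoids one .loc assignment per unit; a self-contained sketch on made-up values, since the real column has already been converted above:
units = pd.Series(['hour', 'minute', 'second'])           # hypothetical unit labels
times = pd.Series([2.0, 30.0, 7200.0])                    # hypothetical measurements
to_hour = {'hour': 1, 'minute': 1/60, 'second': 1/3600}   # conversion factors to hours
print(times * units.map(to_hour))                         # 2.0, 0.5 and 2.0 hours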
Step 10: Search for the missing values. Either impute or drop the observations based on the missing percentage.
Let's create a missing table for the data.
new_df = pd.DataFrame(dirtydata.isnull().sum())
new_df.columns = ['Missing Number']
new_df['Missing Percentage'] = (new_df['Missing Number']*100)/dirtydata.shape[0]
new_df
It is seen that we have 2 missing values in one object variable, Age, and 5 missing values in Spend. On the other hand, 96% of the observations in Download are missing.
Mode imputation for Age, since it is an object variable.
Mean imputation for Spend, since the column does not have outliers.
Drop the Download column.
Age
mode_for_age = dirtydata['Age'].value_counts().idxmax()#gives you the level with highest frequency
dirtydata['Age'] = dirtydata['Age'].fillna(mode_for_age)
Spend
mean_for_spend = dirtydata['Spend'].mean() #pandas mean skips the missing values by default
dirtydata['Spend'] = dirtydata['Spend'].fillna(mean_for_spend)
Download
dirtydata = dirtydata.drop('Download',axis = 1)
dirtydata.head()
Step 11: Examine the date-type column and be sure that all observations are in the same format.
Although we have a date column, it is not stored as a date type; it is an object column.
dirtydata.loc[dirtydata['Dates'] != dirtydata['Dates'].value_counts().idxmax(), 'Dates'] = dirtydata['Dates'].value_counts().idxmax() #replace the differently formatted entries with the most frequent date
dirtydata['Dates'].value_counts()
dirtydata['Dates'] = pd.to_datetime(dirtydata['Dates']) #convert the object column to datetime
dirtydata.info()
dirtydata.head()
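Before wrapping up, note that pd.to_datetime has options that are handy with messy date columns; a brief sketch on made-up strings (invalid entries become NaT instead of raising an error):
raw_dates = pd.Series(['2023-01-15', '2023-02-30', 'not a date'])  # the last two entries are invalid
parsed = pd.to_datetime(raw_dates, errors = 'coerce')              # unparseable values become NaT
print(parsed)
print(parsed.dt.year)                                              # the .dt accessor exposes the datetime components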
We cleaned the data!!!
Now, we will export the data for the future use.
dirtydata.to_csv("cleandata.csv") #export the clean data
Import the ex3_data.xlsx data. Detect the problems and clean the data using pandas functions.
ex3 = pd.read_excel('ex3_data.xlsx')
ex3.head()