Data Cleaning
Let me tell you about one of the biggest misconceptions aspiring data scientists have nowadays: real-world work is rarely what you imagine while studying in school. You may be surprised to find that real-world projects give you a headache with data cleaning. In fact, 70%-80% of your time is often spent just collecting and cleaning data. You have to work with messy, unclean data, and it eventually becomes part of your routine work.
In school, we are usually handed datasets that are clean and spotless. Hackathons and Kaggle, for example, provide tidy datasets that we can simply download and start working on. Moreover, accuracy is not always the primary goal in real projects. Sometimes you don't really need a result as exact as 1 plus 1 equals 2; your boss might just want to know how you can optimize your algorithm for impact, not for a marginal increase in accuracy.
Therefore, you should know how to deal with messy data, whether that means missing values, inconsistent formatting, malformed records, or nonsensical outliers. In this blog, I'll use Python's Pandas and NumPy libraries to clean a real dataset.
We will use a dataset of Sacramento real estate sales as an example. It contains information on each property's characteristics, its location, and the time of sale.
First of all, we need to import the NumPy and Pandas libraries.
import numpy as np
import pandas as pd
sac_csv = 'sacramento_real_estate_transactions.csv'
Check our Data … Quickly
The first thing you want to do when you get a new dataset is to quickly verify its contents with the .head(), .info(), and .describe() methods.
df = pd.read_csv(sac_csv)
df.head(5)
street | city | zip | state | beds | baths | sq__ft | type | sale_date | price | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3526 HIGH ST | SACRAMENTO | 95838 | CA | 2 | 1 | 836 | Residential | Wed May 21 00:00:00 EDT 2008 | 59222 | 38.631913 | -121.434879 |
1 | 51 OMAHA CT | SACRAMENTO | 95823 | CA | 3 | 1 | 1167 | Residential | Wed May 21 00:00:00 EDT 2008 | 68212 | 38.478902 | -121.431028 |
2 | 2796 BRANCH ST | SACRAMENTO | 95815 | CA | 2 | 1 | 796 | Residential | Wed May 21 00:00:00 EDT 2008 | 68880 | 38.618305 | -121.443839 |
3 | 2805 JANETTE WAY | SACRAMENTO | 95815 | CA | 2 | 1 | 852 | Residential | Wed May 21 00:00:00 EDT 2008 | 69307 | 38.616835 | -121.439146 |
4 | 6001 MCMAHON DR | SACRAMENTO | 95824 | CA | 2 | 1 | 797 | Residential | Wed May 21 00:00:00 EDT 2008 | 81900 | 38.519470 | -121.435768 |
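By the way, .head() only shows the top of the file. A random peek can reveal problems that hide further down; a small sketch using Pandas' built-in sampler:
# Five random rows; random_state makes the peek reproducible
df.sample(5, random_state=42)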
# Now let's quickly check the names and types of the columns.
# Most of the time you're going to get data that is not quite what you expected.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 985 entries, 0 to 984
Data columns (total 12 columns):
street 985 non-null object
city 985 non-null object
zip 985 non-null int64
state 985 non-null object
beds 985 non-null int64
baths 985 non-null int64
sq__ft 985 non-null int64
type 985 non-null object
sale_date 985 non-null object
price 985 non-null int64
latitude 985 non-null float64
longitude 985 non-null float64
dtypes: float64(2), int64(5), object(5)
memory usage: 92.4+ KB
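One more thing .info() reveals: sale_date is stored as a plain string (object). We won't need real datetimes for the cleaning below, but if you ever want to sort or filter by date, here is a minimal sketch, assuming every row uses the same layout as 'Wed May 21 00:00:00 EDT 2008':
# The 'EDT' token trips up strict parsers, so strip it before parsing
parsed_dates = pd.to_datetime(
    df['sale_date'].str.replace(' EDT', '', regex=False),
    format='%a %b %d %H:%M:%S %Y'
)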
df.describe()
# Take a look at the min row below: 'sq__ft' and 'price' both have negative values
zip | beds | baths | sq__ft | price | latitude | longitude | |
---|---|---|---|---|---|---|---|
count | 985.000000 | 985.000000 | 985.000000 | 985.000000 | 985.000000 | 985.000000 | 985.000000 |
mean | 95750.697462 | 2.911675 | 1.776650 | 1312.918782 | 233715.951269 | 38.445121 | -121.193371 |
std | 85.176072 | 1.307932 | 0.895371 | 856.123224 | 139088.818896 | 5.103637 | 5.100670 |
min | 95603.000000 | 0.000000 | 0.000000 | -984.000000 | -210944.000000 | -121.503471 | -121.551704 |
25% | 95660.000000 | 2.000000 | 1.000000 | 950.000000 | 145000.000000 | 38.482704 | -121.446119 |
50% | 95762.000000 | 3.000000 | 2.000000 | 1304.000000 | 213750.000000 | 38.625932 | -121.375799 |
75% | 95828.000000 | 4.000000 | 2.000000 | 1718.000000 | 300000.000000 | 38.695589 | -121.294893 |
max | 95864.000000 | 8.000000 | 5.000000 | 5822.000000 | 884790.000000 | 39.020808 | 38.668433 |
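Before blindly flipping signs, it is worth eyeballing the offending rows. A quick check (the fix below assumes these negatives are simple sign-entry errors, not meaningful codes):
# Inspect the rows with negative square footage or price
df[(df['sq__ft'] < 0) | (df['price'] < 0)]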
# Fixing the negative 'sq__ft' and 'price' values by taking absolute values
df['sq__ft'] = df['sq__ft'].abs()
df['price'] = df['price'].abs()
df.describe()
zip | beds | baths | sq__ft | price | latitude | longitude | |
---|---|---|---|---|---|---|---|
count | 985.000000 | 985.000000 | 985.000000 | 985.000000 | 985.000000 | 985.000000 | 985.000000 |
mean | 95750.697462 | 2.911675 | 1.776650 | 1314.916751 | 234144.263959 | 38.445121 | -121.193371 |
std | 85.176072 | 1.307932 | 0.895371 | 853.048243 | 138365.839085 | 5.103637 | 5.100670 |
min | 95603.000000 | 0.000000 | 0.000000 | 0.000000 | 1551.000000 | -121.503471 | -121.551704 |
25% | 95660.000000 | 2.000000 | 1.000000 | 952.000000 | 145000.000000 | 38.482704 | -121.446119 |
50% | 95762.000000 | 3.000000 | 2.000000 | 1304.000000 | 213750.000000 | 38.625932 | -121.375799 |
75% | 95828.000000 | 4.000000 | 2.000000 | 1718.000000 | 300000.000000 | 38.695589 | -121.294893 |
max | 95864.000000 | 8.000000 | 5.000000 | 5822.000000 | 884790.000000 | 39.020808 | 38.668433 |
We also need to check whether there are any empty values.
df.empty
False
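Keep in mind that .empty only reports whether the whole DataFrame has zero elements, so it says nothing about individual missing cells. To count missing (NaN) values column by column:
# NaN count per column; .info() above already suggests there are none here
df.isnull().sum()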
We should also check the unique values in each column, so we can detect any inconsistent entries and fix them.
df['city'].unique()
array(['SACRAMENTO', 'RANCHO CORDOVA', 'RIO LINDA', 'CITRUS HEIGHTS',
'NORTH HIGHLANDS', 'ANTELOPE', 'ELK GROVE', 'ELVERTA', 'GALT',
'CARMICHAEL', 'ORANGEVALE', 'FOLSOM', 'MATHER', 'POLLOCK PINES',
'GOLD RIVER', 'EL DORADO HILLS', 'RANCHO MURIETA', 'WILTON',
'GREENWOOD', 'FAIR OAKS', 'CAMERON PARK', 'LINCOLN', 'PLACERVILLE',
'MEADOW VISTA', 'ROSEVILLE', 'ROCKLIN', 'AUBURN', 'LOOMIS',
'EL DORADO', 'PENRYN', 'GRANITE BAY', 'FORESTHILL',
'DIAMOND SPRINGS', 'SHINGLE SPRINGS', 'COOL', 'WALNUT GROVE',
'GARDEN VALLEY', 'SLOUGHHOUSE', 'WEST SACRAMENTO'], dtype=object)
df['state'].unique()
# There is no US state with the postal abbreviation 'AC'; it looks like a transposed 'CA'
array(['CA', 'AC'], dtype=object)
df['beds'].unique()
array([2, 3, 1, 4, 0, 5, 8, 6], dtype=int64)
df['type'].unique()
# Typo "Unkown"
array(['Residential', 'Condo', 'Multi-Family', 'Unkown'], dtype=object)
# Fixing the typo (assigning back avoids Pandas' chained-assignment pitfalls)
df['type'] = df['type'].replace('Unkown', 'Unknown')
df['type'].unique()
array(['Residential', 'Condo', 'Multi-Family', 'Unknown'], dtype=object)
df['baths'].unique()
array([1, 2, 3, 4, 0, 5], dtype=int64)
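Rather than calling .unique() on every column by hand, one shortcut is to scan all the text columns at once. A small sketch, printing counts rather than full arrays since columns like street are nearly unique per row:
# Count the distinct values in each object (string) column
for col in df.select_dtypes(include='object').columns:
    print(col, ':', df[col].nunique(), 'distinct values')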
# Fixing the state typo the same way
df['state'] = df['state'].replace('AC', 'CA')
# Checking that the error was fixed
df['state'].unique()
array(['CA'], dtype=object)
Just an extra step to make the column name look nicer:
# sq__ft should only have one '_'
df.rename(columns={'sq__ft':'sq_ft'}, inplace=True)
df.head(3)
street | city | zip | state | beds | baths | sq_ft | type | sale_date | price | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3526 HIGH ST | SACRAMENTO | 95838 | CA | 2 | 1 | 836 | Residential | Wed May 21 00:00:00 EDT 2008 | 59222 | 38.631913 | -121.434879 |
1 | 51 OMAHA CT | SACRAMENTO | 95823 | CA | 3 | 1 | 1167 | Residential | Wed May 21 00:00:00 EDT 2008 | 68212 | 38.478902 | -121.431028 |
2 | 2796 BRANCH ST | SACRAMENTO | 95815 | CA | 2 | 1 | 796 | Residential | Wed May 21 00:00:00 EDT 2008 | 68880 | 38.618305 | -121.443839 |
Aha, we also need to check for duplicate rows.
# keep=False flags every occurrence of a repeated street address
duplicate = df[df.street.duplicated(keep=False)]
duplicate
street | city | zip | state | beds | baths | sq_ft | type | sale_date | price | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
341 | 4734 14TH AVE | SACRAMENTO | 95820 | CA | 2 | 1 | 834 | Residential | Mon May 19 00:00:00 EDT 2008 | 68000 | 38.539447 | -121.450858 |
342 | 4734 14TH AVE | SACRAMENTO | 95820 | CA | 2 | 1 | 834 | Residential | Mon May 19 00:00:00 EDT 2008 | 68000 | 38.539447 | -121.450858 |
393 | 1223 LAMBERTON CIR | SACRAMENTO | 95838 | CA | 3 | 2 | 1370 | Residential | Mon May 19 00:00:00 EDT 2008 | 155435 | 38.646677 | -121.437573 |
394 | 1223 LAMBERTON CIR | SACRAMENTO | 95838 | CA | 3 | 2 | 1370 | Residential | Mon May 19 00:00:00 EDT 2008 | 155500 | 38.646677 | -121.437573 |
403 | 8306 CURLEW CT | CITRUS HEIGHTS | 95621 | CA | 4 | 2 | 1280 | Residential | Mon May 19 00:00:00 EDT 2008 | 167293 | 38.715781 | -121.298519 |
404 | 8306 CURLEW CT | CITRUS HEIGHTS | 95621 | CA | 4 | 2 | 1280 | Residential | Mon May 19 00:00:00 EDT 2008 | 167293 | 38.715781 | -121.298519 |
600 | 7 CRYSTALWOOD CIR | LINCOLN | 95648 | CA | 0 | 0 | 0 | Residential | Mon May 19 00:00:00 EDT 2008 | 4897 | 38.885962 | -121.289436 |
601 | 7 CRYSTALWOOD CIR | LINCOLN | 95648 | CA | 0 | 0 | 0 | Residential | Mon May 19 00:00:00 EDT 2008 | 4897 | 38.885962 | -121.289436 |
# Dropping exact duplicates, keeping the first occurrence of each
df = df.drop_duplicates(keep='first')
df[df.duplicated()]
street | city | zip | state | beds | baths | sq_ft | type | sale_date | price | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
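One caveat: drop_duplicates() only removes rows that match in every column. Rows 393 and 394 above share an address and sale date but differ in price, so both survive. A sketch for catching such near-duplicates, assuming street plus city plus sale_date should identify a sale:
# Rows that repeat the same street, city, and sale_date but may differ elsewhere
df[df.duplicated(subset=['street', 'city', 'sale_date'], keep=False)]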
Sometimes we have to dig into the dataset to understand what individual values really mean.
# Latitude must lie in [-90, 90]; longitude in [-180, 180]
# The minimum latitude is far outside that range, so at least one row is wrong
df.loc[df['latitude'] == df['latitude'].min()]
street | city | zip | state | beds | baths | sq_ft | type | sale_date | price | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
703 | 1900 DANBROOK DR | SACRAMENTO | 95835 | CA | 1 | 1 | 984 | Condo | Fri May 16 00:00:00 EDT 2008 | 210944 | -121.503471 | 38.668433 |
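The latitude and longitude for this row were clearly transposed. Instead of hunting via .min(), a more general sanity check flags every coordinate outside its valid range:
# Any rows with out-of-range coordinates (here only row 703 shows up)
df[(df['latitude'].abs() > 90) | (df['longitude'].abs() > 180)]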
# Swapping the transposed values for row 703. Targeting the row with .loc is
# safer than the column-wide replace(), which would also touch any other row
# that happened to share these exact values.
lat = df.loc[703, 'longitude']
long = df.loc[703, 'latitude']
df.loc[703, 'latitude'] = lat
df.loc[703, 'longitude'] = long
We are always dealing with NaN values, but we also have to watch out for zero values that can distort our data. For missing values, the fillna() and dropna() methods are the standard tools. You can drop them from the dataframe entirely, either by row or by column, but keep a common rule of thumb in mind: avoid deleting more than about 5% of the dataset.
You can fill the NaNs with strings, or, for numeric columns, impute them with the mean or the median. In this case, we will treat zero values as if they were NaN values.
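For reference, here is a minimal sketch of the two standard options just mentioned (neither is applied below, since this dataset has no literal NaNs):
# Option 1: drop every row that contains a NaN
df_dropped = df.dropna()
# Option 2: fill numeric NaNs with each column's median
df_filled = df.fillna(df.median(numeric_only=True))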
# Inspecting rows where sq_ft is 0
zero_sqft = df.loc[df['sq_ft'] == 0]
zero_sqft.head(3)
street | city | zip | state | beds | baths | sq_ft | type | sale_date | price | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
73 | 17 SERASPI CT | SACRAMENTO | 95834 | CA | 0 | 0 | 0 | Residential | Wed May 21 00:00:00 EDT 2008 | 206000 | 38.631481 | -121.501880 |
89 | 2866 KARITSA AVE | SACRAMENTO | 95833 | CA | 0 | 0 | 0 | Residential | Wed May 21 00:00:00 EDT 2008 | 244500 | 38.626671 | -121.525970 |
100 | 12209 CONSERVANCY WAY | RANCHO CORDOVA | 95742 | CA | 0 | 0 | 0 | Residential | Wed May 21 00:00:00 EDT 2008 | 263500 | 38.553867 | -121.219141 |
zero_sqft['city'].unique()
array(['SACRAMENTO', 'RANCHO CORDOVA', 'EL DORADO HILLS', 'ELK GROVE',
'FOLSOM', 'RANCHO MURIETA', 'CAMERON PARK', 'LINCOLN',
'PLACERVILLE', 'GOLD RIVER', 'ROSEVILLE', 'ROCKLIN', 'AUBURN',
'WILTON', 'GALT', 'RIO LINDA', 'FORESTHILL', 'SHINGLE SPRINGS',
'GARDEN VALLEY', 'WEST SACRAMENTO'], dtype=object)
# Square footage varies a lot by city, so we impute zeros with each city's median sq_ft
sqft_median = df.groupby('city')['sq_ft'].median().to_frame()
sqft_median
sq_ft | |
---|---|
city | |
ANTELOPE | 1669.0 |
AUBURN | 960.0 |
CAMERON PARK | 0.0 |
CARMICHAEL | 1472.5 |
CITRUS HEIGHTS | 1292.5 |
COOL | 1457.0 |
DIAMOND SPRINGS | 1300.0 |
EL DORADO | 1332.0 |
EL DORADO HILLS | 0.0 |
ELK GROVE | 1722.5 |
ELVERTA | 1235.0 |
FAIR OAKS | 1450.0 |
FOLSOM | 1939.0 |
FORESTHILL | 0.0 |
GALT | 1479.0 |
GARDEN VALLEY | 0.0 |
GOLD RIVER | 1750.5 |
GRANITE BAY | 2896.0 |
GREENWOOD | 2846.0 |
LINCOLN | 0.0 |
LOOMIS | 1476.0 |
MATHER | 2093.0 |
MEADOW VISTA | 1216.0 |
NORTH HIGHLANDS | 1152.0 |
ORANGEVALE | 1574.0 |
PENRYN | 1320.0 |
PLACERVILLE | 474.0 |
POLLOCK PINES | 1320.0 |
RANCHO CORDOVA | 1222.0 |
RANCHO MURIETA | 2734.0 |
RIO LINDA | 1182.0 |
ROCKLIN | 0.0 |
ROSEVILLE | 1049.0 |
SACRAMENTO | 1318.0 |
SHINGLE SPRINGS | 0.0 |
SLOUGHHOUSE | 5822.0 |
WALNUT GROVE | 1727.0 |
WEST SACRAMENTO | 0.0 |
WILTON | 3741.0 |
# Several cities have a median sq_ft of 0, meaning there is no usable value
# to impute, so we drop those cities entirely
cities_no_sqft = ['FORESTHILL', 'CAMERON PARK', 'EL DORADO HILLS', 'GARDEN VALLEY',
                  'LINCOLN', 'ROCKLIN', 'SHINGLE SPRINGS', 'WEST SACRAMENTO']
df.drop(df[df['city'].isin(cities_no_sqft)].index, inplace=True)
# Replacing sq_ft values of 0 with each remaining city's median, looked up
# by city label rather than by fragile positional index
for city in df.loc[df['sq_ft'] == 0, 'city'].unique():
    df.loc[(df['city'] == city) & (df['sq_ft'] == 0), 'sq_ft'] = sqft_median.loc[city, 'sq_ft']
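As an aside, the same imputation can be written more compactly by literally converting zeros to NaN first, which is the "treat zeros as NaN" idea from above. A sketch on a copy, so we don't redo the work here:
# Equivalent approach: zeros become NaN, then a per-city median fill
alt = df.copy()
alt['sq_ft'] = alt['sq_ft'].replace(0, np.nan)
alt['sq_ft'] = alt.groupby('city')['sq_ft'].transform(lambda s: s.fillna(s.median()))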
# Now the same treatment for beds: inspect rows where beds is 0
zero_beds = df.loc[df['beds'] == 0]
zero_beds.head(3)
street | city | zip | state | beds | baths | sq_ft | type | sale_date | price | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
73 | 17 SERASPI CT | SACRAMENTO | 95834 | CA | 0 | 0 | 1318.0 | Residential | Wed May 21 00:00:00 EDT 2008 | 206000 | 38.631481 | -121.501880 |
89 | 2866 KARITSA AVE | SACRAMENTO | 95833 | CA | 0 | 0 | 1318.0 | Residential | Wed May 21 00:00:00 EDT 2008 | 244500 | 38.626671 | -121.525970 |
100 | 12209 CONSERVANCY WAY | RANCHO CORDOVA | 95742 | CA | 0 | 0 | 1222.0 | Residential | Wed May 21 00:00:00 EDT 2008 | 263500 | 38.553867 | -121.219141 |
zero_beds['city'].unique()
array(['SACRAMENTO', 'RANCHO CORDOVA', 'ELK GROVE', 'FOLSOM',
'RANCHO MURIETA', 'GOLD RIVER', 'ROSEVILLE', 'WILTON', 'GALT',
'RIO LINDA', 'PLACERVILLE'], dtype=object)
bed = df.groupby('city')['beds'].median().to_frame()
bed.head(3)
beds | |
---|---|
city | |
ANTELOPE | 3.0 |
AUBURN | 3.0 |
CARMICHAEL | 3.0 |
# Replacing beds values of 0 with each city's median beds
for city in df.loc[df['beds'] == 0, 'city'].unique():
    df.loc[(df['city'] == city) & (df['beds'] == 0), 'beds'] = bed.loc[city, 'beds']
# And the same for baths: inspect rows where baths is 0
zero_baths = df.loc[df['baths'] == 0]
zero_baths.head(3)
street | city | zip | state | beds | baths | sq_ft | type | sale_date | price | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
73 | 17 SERASPI CT | SACRAMENTO | 95834 | CA | 3.0 | 0 | 1318.0 | Residential | Wed May 21 00:00:00 EDT 2008 | 206000 | 38.631481 | -121.501880 |
89 | 2866 KARITSA AVE | SACRAMENTO | 95833 | CA | 3.0 | 0 | 1318.0 | Residential | Wed May 21 00:00:00 EDT 2008 | 244500 | 38.626671 | -121.525970 |
100 | 12209 CONSERVANCY WAY | RANCHO CORDOVA | 95742 | CA | 3.0 | 0 | 1222.0 | Residential | Wed May 21 00:00:00 EDT 2008 | 263500 | 38.553867 | -121.219141 |
zero_baths['city'].unique()
array(['SACRAMENTO', 'RANCHO CORDOVA', 'ELK GROVE', 'FOLSOM',
'RANCHO MURIETA', 'GOLD RIVER', 'ROSEVILLE', 'WILTON', 'GALT',
'RIO LINDA', 'PLACERVILLE'], dtype=object)
bath = df.groupby('city')['baths'].median().to_frame()
bath.head(3)
baths | |
---|---|
city | |
ANTELOPE | 2.0 |
AUBURN | 3.0 |
CARMICHAEL | 2.0 |
# Replacing baths values of 0 with each city's median baths
for city in df.loc[df['baths'] == 0, 'city'].unique():
    df.loc[(df['city'] == city) & (df['baths'] == 0), 'baths'] = bath.loc[city, 'baths']
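A final sanity check confirms the imputation worked:
# How many zeros remain in the imputed columns (ideally none)
(df[['sq_ft', 'beds', 'baths']] == 0).sum()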
After some work, our dataframe looks much nicer.
df.head(3)
street | city | zip | state | beds | baths | sq_ft | type | sale_date | price | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3526 HIGH ST | SACRAMENTO | 95838 | CA | 2.0 | 1.0 | 836.0 | Residential | Wed May 21 00:00:00 EDT 2008 | 59222 | 38.631913 | -121.434879 |
1 | 51 OMAHA CT | SACRAMENTO | 95823 | CA | 3.0 | 1.0 | 1167.0 | Residential | Wed May 21 00:00:00 EDT 2008 | 68212 | 38.478902 | -121.431028 |
2 | 2796 BRANCH ST | SACRAMENTO | 95815 | CA | 2.0 | 1.0 | 796.0 | Residential | Wed May 21 00:00:00 EDT 2008 | 68880 | 38.618305 | -121.443839 |
Dealing with messy data is a pain, but data cleaning is an essential part of every data science project. In this blog, we went over some ways to detect, summarize, and replace odd values. Usually we deal with NaN values, but here we also got the chance to treat zero values as if they were NaNs.
The more efficient you become at data cleaning, the more time you will have for exploring and modeling.