Data Cleaning

Let me tell you about one of the biggest misconceptions aspiring data scientists have: real-world work is rarely what you imagine while studying in school. You may be surprised to find that real-world projects give you a headache with data cleaning. In fact, 70%-80% of your time can go into just collecting and cleaning data. You have to work with messy, unclean data, and dealing with it eventually becomes part of your routine work.

In school, we are often handed datasets that are clean and spotless. Hackathons and Kaggle, for example, provide tidy datasets you can download and start working on right away. Moreover, accuracy is not always the primary goal. Sometimes you don't need a result as exact as 1 plus 1 equals 2; your boss might just want to know how you can optimize your algorithm for impact, not for a marginal gain in accuracy.

Therefore, you should know how to deal with messy data, whether that means missing values, inconsistent formatting, malformed records, or nonsensical outliers. In this blog, I'll use Python's Pandas and NumPy libraries to clean data.

We will use a dataset on Sacramento real estate sales as an example. It contains information on the qualities of each property, its location, and the time of sale.

First of all, we need to import the NumPy and Pandas libraries.

import numpy as np
import pandas as pd
sac_csv = 'sacramento_real_estate_transactions.csv'

Check our Data … Quickly

The first thing you want to do when you get a new dataset is to quickly verify its contents with the .head(), .info(), and .describe() methods.

df = pd.read_csv(sac_csv)
df.head(5)
street city zip state beds baths sq__ft type sale_date price latitude longitude
0 3526 HIGH ST SACRAMENTO 95838 CA 2 1 836 Residential Wed May 21 00:00:00 EDT 2008 59222 38.631913 -121.434879
1 51 OMAHA CT SACRAMENTO 95823 CA 3 1 1167 Residential Wed May 21 00:00:00 EDT 2008 68212 38.478902 -121.431028
2 2796 BRANCH ST SACRAMENTO 95815 CA 2 1 796 Residential Wed May 21 00:00:00 EDT 2008 68880 38.618305 -121.443839
3 2805 JANETTE WAY SACRAMENTO 95815 CA 2 1 852 Residential Wed May 21 00:00:00 EDT 2008 69307 38.616835 -121.439146
4 6001 MCMAHON DR SACRAMENTO 95824 CA 2 1 797 Residential Wed May 21 00:00:00 EDT 2008 81900 38.519470 -121.435768
# Now let's quickly see the names and types of the columns.
# Most of the time you're going to get data that is not quite what you expected.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 985 entries, 0 to 984
Data columns (total 12 columns):
street       985 non-null object
city         985 non-null object
zip          985 non-null int64
state        985 non-null object
beds         985 non-null int64
baths        985 non-null int64
sq__ft       985 non-null int64
type         985 non-null object
sale_date    985 non-null object
price        985 non-null int64
latitude     985 non-null float64
longitude    985 non-null float64
dtypes: float64(2), int64(5), object(5)
memory usage: 92.4+ KB
# We find that 'sq__ft' and 'price' have negative values. Take a look at the min values.
df.describe()
zip beds baths sq__ft price latitude longitude
count 985.000000 985.000000 985.000000 985.000000 985.000000 985.000000 985.000000
mean 95750.697462 2.911675 1.776650 1312.918782 233715.951269 38.445121 -121.193371
std 85.176072 1.307932 0.895371 856.123224 139088.818896 5.103637 5.100670
min 95603.000000 0.000000 0.000000 -984.000000 -210944.000000 -121.503471 -121.551704
25% 95660.000000 2.000000 1.000000 950.000000 145000.000000 38.482704 -121.446119
50% 95762.000000 3.000000 2.000000 1304.000000 213750.000000 38.625932 -121.375799
75% 95828.000000 4.000000 2.000000 1718.000000 300000.000000 38.695589 -121.294893
max 95864.000000 8.000000 5.000000 5822.000000 884790.000000 39.020808 38.668433
# Fixing the negative 'sq__ft' and 'price' values by taking absolute values,
# assuming the stray minus signs are data-entry errors
df['sq__ft'] = df['sq__ft'].abs()
df['price'] = df['price'].abs()
df.describe()
zip beds baths sq__ft price latitude longitude
count 985.000000 985.000000 985.000000 985.000000 985.000000 985.000000 985.000000
mean 95750.697462 2.911675 1.776650 1314.916751 234144.263959 38.445121 -121.193371
std 85.176072 1.307932 0.895371 853.048243 138365.839085 5.103637 5.100670
min 95603.000000 0.000000 0.000000 0.000000 1551.000000 -121.503471 -121.551704
25% 95660.000000 2.000000 1.000000 952.000000 145000.000000 38.482704 -121.446119
50% 95762.000000 3.000000 2.000000 1304.000000 213750.000000 38.625932 -121.375799
75% 95828.000000 4.000000 2.000000 1718.000000 300000.000000 38.695589 -121.294893
max 95864.000000 8.000000 5.000000 5822.000000 884790.000000 39.020808 38.668433

We also need to check whether there are any empty values. Note that .empty only tells us whether the DataFrame has zero rows; it says nothing about NaNs in individual cells.

df.empty
False
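
A more direct check is a per-column count of missing values; all zeros here (as the .info() output above already suggested) means there are no NaNs to fill:

# Count missing values in each column
df.isnull().sum()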

We also need to check the unique values in each column, so we can detect any errors and fix them.

df['city'].unique()
array(['SACRAMENTO', 'RANCHO CORDOVA', 'RIO LINDA', 'CITRUS HEIGHTS',
       'NORTH HIGHLANDS', 'ANTELOPE', 'ELK GROVE', 'ELVERTA', 'GALT',
       'CARMICHAEL', 'ORANGEVALE', 'FOLSOM', 'MATHER', 'POLLOCK PINES',
       'GOLD RIVER', 'EL DORADO HILLS', 'RANCHO MURIETA', 'WILTON',
       'GREENWOOD', 'FAIR OAKS', 'CAMERON PARK', 'LINCOLN', 'PLACERVILLE',
       'MEADOW VISTA', 'ROSEVILLE', 'ROCKLIN', 'AUBURN', 'LOOMIS',
       'EL DORADO', 'PENRYN', 'GRANITE BAY', 'FORESTHILL',
       'DIAMOND SPRINGS', 'SHINGLE SPRINGS', 'COOL', 'WALNUT GROVE',
       'GARDEN VALLEY', 'SLOUGHHOUSE', 'WEST SACRAMENTO'], dtype=object)
# There is no state with the postal abbreviation "AC"
df['state'].unique()
array(['CA', 'AC'], dtype=object)
df['beds'].unique()
array([2, 3, 1, 4, 0, 5, 8, 6], dtype=int64)
# Note the typo "Unkown"
df['type'].unique()
array(['Residential', 'Condo', 'Multi-Family', 'Unkown'], dtype=object)
# Fixing the typo
df['type'].replace('Unkown', 'Unknown', inplace=True)
df['type'].unique()
array(['Residential', 'Condo', 'Multi-Family', 'Unknown'], dtype=object)
df['baths'].unique()
array([1, 2, 3, 4, 0, 5], dtype=int64)
# Fixing the state typo
df['state'].replace('AC', 'CA', inplace=True)
# Checking that the error was fixed
df['state'].unique()
array(['CA'], dtype=object)

Just an extra step to make the column name look nicer.

# sq__ft should have only one underscore
df.rename(columns={'sq__ft': 'sq_ft'}, inplace=True)
df.head(3)
street city zip state beds baths sq_ft type sale_date price latitude longitude
0 3526 HIGH ST SACRAMENTO 95838 CA 2 1 836 Residential Wed May 21 00:00:00 EDT 2008 59222 38.631913 -121.434879
1 51 OMAHA CT SACRAMENTO 95823 CA 3 1 1167 Residential Wed May 21 00:00:00 EDT 2008 68212 38.478902 -121.431028
2 2796 BRANCH ST SACRAMENTO 95815 CA 2 1 796 Residential Wed May 21 00:00:00 EDT 2008 68880 38.618305 -121.443839

Aha, checking for duplicate rows is also needed.

# Flag every row whose street address appears more than once
duplicate = df[df.street.duplicated(keep=False)]
duplicate
street city zip state beds baths sq_ft type sale_date price latitude longitude
341 4734 14TH AVE SACRAMENTO 95820 CA 2 1 834 Residential Mon May 19 00:00:00 EDT 2008 68000 38.539447 -121.450858
342 4734 14TH AVE SACRAMENTO 95820 CA 2 1 834 Residential Mon May 19 00:00:00 EDT 2008 68000 38.539447 -121.450858
393 1223 LAMBERTON CIR SACRAMENTO 95838 CA 3 2 1370 Residential Mon May 19 00:00:00 EDT 2008 155435 38.646677 -121.437573
394 1223 LAMBERTON CIR SACRAMENTO 95838 CA 3 2 1370 Residential Mon May 19 00:00:00 EDT 2008 155500 38.646677 -121.437573
403 8306 CURLEW CT CITRUS HEIGHTS 95621 CA 4 2 1280 Residential Mon May 19 00:00:00 EDT 2008 167293 38.715781 -121.298519
404 8306 CURLEW CT CITRUS HEIGHTS 95621 CA 4 2 1280 Residential Mon May 19 00:00:00 EDT 2008 167293 38.715781 -121.298519
600 7 CRYSTALWOOD CIR LINCOLN 95648 CA 0 0 0 Residential Mon May 19 00:00:00 EDT 2008 4897 38.885962 -121.289436
601 7 CRYSTALWOOD CIR LINCOLN 95648 CA 0 0 0 Residential Mon May 19 00:00:00 EDT 2008 4897 38.885962 -121.289436
# Dropping exact duplicates (note that rows 393 and 394 differ in price, so both are kept)
df = df.drop_duplicates(keep='first')
df[df.duplicated()]
street city zip state beds baths sq_ft type sale_date price latitude longitude

Sometimes we have to explore the dataset in order to understand every single cell value.

# Latitude must lie in [-90, 90] and longitude in [-180, 180]
# A latitude of -121.5 is impossible, so latitude and longitude were likely swapped

df.loc[df['latitude'] == df['latitude'].min()]
street city zip state beds baths sq_ft type sale_date price latitude longitude
703 1900 DANBROOK DR SACRAMENTO 95835 CA 1 1 984 Condo Fri May 16 00:00:00 EDT 2008 210944 -121.503471 38.668433
# Switching the swapped coordinates in row 703 with direct assignments
lat = df.loc[703, 'longitude']
long = df.loc[703, 'latitude']
df.loc[703, 'latitude'] = lat
df.loc[703, 'longitude'] = long
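
More generally, a quick range check along these lines, shown here as a sketch, would flag any remaining out-of-range coordinates:

# Sketch: flag rows whose coordinates fall outside the valid ranges
df[(df['latitude'].abs() > 90) | (df['longitude'].abs() > 180)]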

We always deal with NaN values, but we also have to be aware of zero values that may distort our data. For missing values, if you need to fill in errors or blanks, use the fillna() and dropna() methods. You can also drop them from the DataFrame entirely, by row or by column, but keep a common rule of thumb in mind: avoid deleting more than about 5% of the dataset.

You could fill the NaNs with strings, or, for numeric columns, use imputation methods based on the mean or the median. In this case, we will treat zero values as NaN values.
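
As a quick illustration of those two methods (our frame has no literal NaNs at this point, so this sketch runs on a throwaway copy):

# Sketch: impute a numeric column with its median, then drop any rows still missing values
demo = df.copy()
demo['sq_ft'] = demo['sq_ft'].fillna(demo['sq_ft'].median())
demo = demo.dropna()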

# Collecting the rows where sq_ft is 0
zero = df.loc[df['sq_ft'] == 0]
zero.head(3)
street city zip state beds baths sq_ft type sale_date price latitude longitude
73 17 SERASPI CT SACRAMENTO 95834 CA 0 0 0 Residential Wed May 21 00:00:00 EDT 2008 206000 38.631481 -121.501880
89 2866 KARITSA AVE SACRAMENTO 95833 CA 0 0 0 Residential Wed May 21 00:00:00 EDT 2008 244500 38.626671 -121.525970
100 12209 CONSERVANCY WAY RANCHO CORDOVA 95742 CA 0 0 0 Residential Wed May 21 00:00:00 EDT 2008 263500 38.553867 -121.219141
zero['city'].unique()
array(['SACRAMENTO', 'RANCHO CORDOVA', 'EL DORADO HILLS', 'ELK GROVE',
       'FOLSOM', 'RANCHO MURIETA', 'CAMERON PARK', 'LINCOLN',
       'PLACERVILLE', 'GOLD RIVER', 'ROSEVILLE', 'ROCKLIN', 'AUBURN',
       'WILTON', 'GALT', 'RIO LINDA', 'FORESTHILL', 'SHINGLE SPRINGS',
       'GARDEN VALLEY', 'WEST SACRAMENTO'], dtype=object)
# Per-city median sq_ft (zeros are still included, so some medians come out as 0)
nan = df.groupby('city')['sq_ft'].median().to_frame()
nan
sq_ft
city
ANTELOPE 1669.0
AUBURN 960.0
CAMERON PARK 0.0
CARMICHAEL 1472.5
CITRUS HEIGHTS 1292.5
COOL 1457.0
DIAMOND SPRINGS 1300.0
EL DORADO 1332.0
EL DORADO HILLS 0.0
ELK GROVE 1722.5
ELVERTA 1235.0
FAIR OAKS 1450.0
FOLSOM 1939.0
FORESTHILL 0.0
GALT 1479.0
GARDEN VALLEY 0.0
GOLD RIVER 1750.5
GRANITE BAY 2896.0
GREENWOOD 2846.0
LINCOLN 0.0
LOOMIS 1476.0
MATHER 2093.0
MEADOW VISTA 1216.0
NORTH HIGHLANDS 1152.0
ORANGEVALE 1574.0
PENRYN 1320.0
PLACERVILLE 474.0
POLLOCK PINES 1320.0
RANCHO CORDOVA 1222.0
RANCHO MURIETA 2734.0
RIO LINDA 1182.0
ROCKLIN 0.0
ROSEVILLE 1049.0
SACRAMENTO 1318.0
SHINGLE SPRINGS 0.0
SLOUGHHOUSE 5822.0
WALNUT GROVE 1727.0
WEST SACRAMENTO 0.0
WILTON 3741.0
# Deleting rows for cities whose median sq_ft is 0 (a 0 median would make imputation meaningless)
for city in ['FORESTHILL', 'CAMERON PARK', 'EL DORADO HILLS', 'GARDEN VALLEY',
             'LINCOLN', 'ROCKLIN', 'SHINGLE SPRINGS', 'WEST SACRAMENTO']:
    df.drop(df[df['city'] == city].index, inplace=True)
# Replacing sq_ft values of 0 with the city's median, looked up by label instead of by position
for city in ['SACRAMENTO', 'RANCHO CORDOVA', 'ELK GROVE', 'FOLSOM', 'RANCHO MURIETA',
             'GOLD RIVER', 'ROSEVILLE', 'WILTON', 'RIO LINDA', 'PLACERVILLE', 'GALT']:
    df.loc[(df.city == city) & (df.sq_ft == 0), 'sq_ft'] = nan.loc[city, 'sq_ft']
# Doing the same for beds: first, find the rows where beds is 0
one = df.loc[df['beds'] == 0]
one.head(3)
street city zip state beds baths sq_ft type sale_date price latitude longitude
73 17 SERASPI CT SACRAMENTO 95834 CA 0 0 1318.0 Residential Wed May 21 00:00:00 EDT 2008 206000 38.631481 -121.501880
89 2866 KARITSA AVE SACRAMENTO 95833 CA 0 0 1318.0 Residential Wed May 21 00:00:00 EDT 2008 244500 38.626671 -121.525970
100 12209 CONSERVANCY WAY RANCHO CORDOVA 95742 CA 0 0 1222.0 Residential Wed May 21 00:00:00 EDT 2008 263500 38.553867 -121.219141
one['city'].unique()
array(['SACRAMENTO', 'RANCHO CORDOVA', 'ELK GROVE', 'FOLSOM',
       'RANCHO MURIETA', 'GOLD RIVER', 'ROSEVILLE', 'WILTON', 'GALT',
       'RIO LINDA', 'PLACERVILLE'], dtype=object)
# Per-city median beds
bed = df.groupby('city')['beds'].median().to_frame()
bed.head(3)
beds
city
ANTELOPE 3.0
AUBURN 3.0
CARMICHAEL 3.0
# Replacing beds values of 0 with the city's median
for city in ['SACRAMENTO', 'RANCHO CORDOVA', 'ELK GROVE', 'FOLSOM', 'RANCHO MURIETA',
             'GOLD RIVER', 'WILTON', 'GALT', 'RIO LINDA', 'PLACERVILLE', 'ROSEVILLE']:
    df.loc[(df.city == city) & (df.beds == 0), 'beds'] = bed.loc[city, 'beds']
# And for baths: first, find the rows where baths is 0
two = df.loc[df['baths'] == 0]
two.head(3)
street city zip state beds baths sq_ft type sale_date price latitude longitude
73 17 SERASPI CT SACRAMENTO 95834 CA 3.0 0 1318.0 Residential Wed May 21 00:00:00 EDT 2008 206000 38.631481 -121.501880
89 2866 KARITSA AVE SACRAMENTO 95833 CA 3.0 0 1318.0 Residential Wed May 21 00:00:00 EDT 2008 244500 38.626671 -121.525970
100 12209 CONSERVANCY WAY RANCHO CORDOVA 95742 CA 3.0 0 1222.0 Residential Wed May 21 00:00:00 EDT 2008 263500 38.553867 -121.219141
two['city'].unique()
array(['SACRAMENTO', 'RANCHO CORDOVA', 'ELK GROVE', 'FOLSOM',
       'RANCHO MURIETA', 'GOLD RIVER', 'ROSEVILLE', 'WILTON', 'GALT',
       'RIO LINDA', 'PLACERVILLE'], dtype=object)
# Per-city median baths
bath = df.groupby('city')['baths'].median().to_frame()
bath.head(3)
baths
city
ANTELOPE 2.0
AUBURN 3.0
CARMICHAEL 2.0
# Replacing baths values of 0 with the city's median
for city in ['SACRAMENTO', 'RANCHO CORDOVA', 'ELK GROVE', 'FOLSOM', 'RANCHO MURIETA',
             'GOLD RIVER', 'ROSEVILLE', 'WILTON', 'GALT', 'RIO LINDA', 'PLACERVILLE']:
    df.loc[(df.city == city) & (df.baths == 0), 'baths'] = bath.loc[city, 'baths']
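
As an aside, had we wanted one compact pass instead of the three manual blocks above, a sketch like this would do the same job. It runs on a copy, so it does not alter the steps we just took: zeros become NaN, and each column is filled with its per-city median via groupby and transform.

# Sketch: zero-as-NaN imputation with per-city medians, applied to a copy
alt = df.copy()
for col in ['sq_ft', 'beds', 'baths']:
    alt[col] = alt[col].replace(0, np.nan)
    alt[col] = alt[col].fillna(alt.groupby('city')[col].transform('median'))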

After some work, our dataframe looks much nicer.

df.head(3)

street city zip state beds baths sq_ft type sale_date price latitude longitude
0 3526 HIGH ST SACRAMENTO 95838 CA 2.0 1.0 836.0 Residential Wed May 21 00:00:00 EDT 2008 59222 38.631913 -121.434879
1 51 OMAHA CT SACRAMENTO 95823 CA 3.0 1.0 1167.0 Residential Wed May 21 00:00:00 EDT 2008 68212 38.478902 -121.431028
2 2796 BRANCH ST SACRAMENTO 95815 CA 2.0 1.0 796.0 Residential Wed May 21 00:00:00 EDT 2008 68880 38.618305 -121.443839

Dealing with messy data is a pain, but data cleaning is an essential part of any data science project. In this blog, we went over some ways to detect, summarize, and replace odd values. Usually we deal with NaN values, but here we also got the chance to treat zero values as NaNs.

The more efficiently you handle data cleaning, the more time you will have for exploring and modeling.