Data Cleaning
Let me tell you about one of the biggest misconceptions aspiring data scientists have nowadays: real-world work is rarely what you imagine while studying in school. You may be surprised to find that real-world projects give you a headache with data cleaning. In fact, 70%-80% of your time is often spent just collecting and cleaning data. You have to work with messy, unclean data, and it eventually becomes part of your routine work.
In school, we are usually handed datasets that are clean and spotless. Hackathons and Kaggle, for example, provide tidy datasets that we can simply download and start working on. Moreover, accuracy is not always the primary goal in real projects. Sometimes you don't really need a result as exact as 1 plus 1 equals 2; your boss might just want to know how you can optimize your algorithm for impact, not for a marginal increase in accuracy.
Therefore, you should know how to deal with messy data, whether that means missing values, inconsistent formatting, malformed records, or nonsensical outliers. In this blog, I'll use Python's Pandas and NumPy libraries to clean a real dataset.
We will use a dataset of Sacramento real estate sales as an example. It contains information on each property's characteristics, its location, and the time of sale.
First of all, we need to import the NumPy and Pandas libraries.
import numpy as np
import pandas as pd
sac_csv = 'sacramento_real_estate_transactions.csv'
Check our Data … Quickly
The first thing you want to do when you get a new dataset is to quickly verify its contents with the .head(), .info(), and .describe() methods.
df = pd.read_csv(sac_csv)
df.head(5)
street | city | zip | state | beds | baths | sq__ft | type | sale_date | price | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3526 HIGH ST | SACRAMENTO | 95838 | CA | 2 | 1 | 836 | Residential | Wed May 21 00:00:00 EDT 2008 | 59222 | 38.631913 | -121.434879 |
1 | 51 OMAHA CT | SACRAMENTO | 95823 | CA | 3 | 1 | 1167 | Residential | Wed May 21 00:00:00 EDT 2008 | 68212 | 38.478902 | -121.431028 |
2 | 2796 BRANCH ST | SACRAMENTO | 95815 | CA | 2 | 1 | 796 | Residential | Wed May 21 00:00:00 EDT 2008 | 68880 | 38.618305 | -121.443839 |
3 | 2805 JANETTE WAY | SACRAMENTO | 95815 | CA | 2 | 1 | 852 | Residential | Wed May 21 00:00:00 EDT 2008 | 69307 | 38.616835 | -121.439146 |
4 | 6001 MCMAHON DR | SACRAMENTO | 95824 | CA | 2 | 1 | 797 | Residential | Wed May 21 00:00:00 EDT 2008 | 81900 | 38.519470 | -121.435768 |
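By the way, .head() only shows the top of the file. A random peek can reveal problems that hide further down; a small sketch using Pandas' built-in sampler:
# Five random rows; random_state makes the peek reproducible
df.sample(5, random_state=42)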
# Now let's quickly check the names and types of the columns.
# Most of the time you're going to get data that is not quite what you expected.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 985 entries, 0 to 984
Data columns (total 12 columns):
street 985 non-null object
city 985 non-null object
zip 985 non-null int64
state 985 non-null object
beds 985 non-null int64
baths 985 non-null int64
sq__ft 985 non-null int64
type 985 non-null object
sale_date 985 non-null object
price 985 non-null int64
latitude 985 non-null float64
longitude 985 non-null float64
dtypes: float64(2), int64(5), object(5)
memory usage: 92.4+ KB
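One more thing .info() reveals: sale_date is stored as a plain string (object). We won't need real datetimes for the cleaning below, but if you ever want to sort or filter by date, here is a minimal sketch, assuming every row uses the same layout as 'Wed May 21 00:00:00 EDT 2008':
# The 'EDT' token trips up strict parsers, so strip it before parsing
parsed_dates = pd.to_datetime(
    df['sale_date'].str.replace(' EDT', '', regex=False),
    format='%a %b %d %H:%M:%S %Y'
)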
df.describe()
# Take a look at the min row below: 'sq__ft' and 'price' both have negative values
zip | beds | baths | sq__ft | price | latitude | longitude | |
---|---|---|---|---|---|---|---|
count | 985.000000 | 985.000000 | 985.000000 | 985.000000 | 985.000000 | 985.000000 | 985.000000 |
mean | 95750.697462 | 2.911675 | 1.776650 | 1312.918782 | 233715.951269 | 38.445121 | -121.193371 |
std | 85.176072 | 1.307932 | 0.895371 | 856.123224 | 139088.818896 | 5.103637 | 5.100670 |
min | 95603.000000 | 0.000000 | 0.000000 | -984.000000 | -210944.000000 | -121.503471 | -121.551704 |
25% | 95660.000000 | 2.000000 | 1.000000 | 950.000000 | 145000.000000 | 38.482704 | -121.446119 |
50% | 95762.000000 | 3.000000 | 2.000000 | 1304.000000 | 213750.000000 | 38.625932 | -121.375799 |
75% | 95828.000000 | 4.000000 | 2.000000 | 1718.000000 | 300000.000000 | 38.695589 | -121.294893 |
max | 95864.000000 | 8.000000 | 5.000000 | 5822.000000 | 884790.000000 | 39.020808 | 38.668433 |
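Before blindly flipping signs, it is worth eyeballing the offending rows. A quick check (the fix below assumes these negatives are simple sign-entry errors, not meaningful codes):
# Inspect the rows with negative square footage or price
df[(df['sq__ft'] < 0) | (df['price'] < 0)]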
# Fixing the negative 'sq__ft' and 'price' values by taking absolute values
df['sq__ft'] = df['sq__ft'].abs()
df['price'] = df['price'].abs()
df.describe()
zip | beds | baths | sq__ft | price | latitude | longitude | |
---|---|---|---|---|---|---|---|
count | 985.000000 | 985.000000 | 985.000000 | 985.000000 | 985.000000 | 985.000000 | 985.000000 |
mean | 95750.697462 | 2.911675 | 1.776650 | 1314.916751 | 234144.263959 | 38.445121 | -121.193371 |
std | 85.176072 | 1.307932 | 0.895371 | 853.048243 | 138365.839085 | 5.103637 | 5.100670 |
min | 95603.000000 | 0.000000 | 0.000000 | 0.000000 | 1551.000000 | -121.503471 | -121.551704 |
25% | 95660.000000 | 2.000000 | 1.000000 | 952.000000 | 145000.000000 | 38.482704 | -121.446119 |
50% | 95762.000000 | 3.000000 | 2.000000 | 1304.000000 | 213750.000000 | 38.625932 | -121.375799 |
75% | 95828.000000 | 4.000000 | 2.000000 | 1718.000000 | 300000.000000 | 38.695589 | -121.294893 |
max | 95864.000000 | 8.000000 | 5.000000 | 5822.000000 | 884790.000000 | 39.020808 | 38.668433 |
We also need to check whether there are any empty values.
df.empty
False
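Keep in mind that .empty only reports whether the whole DataFrame has zero elements, so it says nothing about individual missing cells. To count missing (NaN) values column by column:
# NaN count per column; .info() above already suggests there are none here
df.isnull().sum()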
We should also check the unique values in each column, so we can detect any inconsistent entries and fix them.
df['city'].unique()
array(['SACRAMENTO', 'RANCHO CORDOVA', 'RIO LINDA', 'CITRUS HEIGHTS',
'NORTH HIGHLANDS', 'ANTELOPE', 'ELK GROVE', 'ELVERTA', 'GALT',
'CARMICHAEL', 'ORANGEVALE', 'FOLSOM', 'MATHER', 'POLLOCK PINES',
'GOLD RIVER', 'EL DORADO HILLS', 'RANCHO MURIETA', 'WILTON',
'GREENWOOD', 'FAIR OAKS', 'CAMERON PARK', 'LINCOLN', 'PLACERVILLE',
'MEADOW VISTA', 'ROSEVILLE', 'ROCKLIN', 'AUBURN', 'LOOMIS',
'EL DORADO', 'PENRYN', 'GRANITE BAY', 'FORESTHILL',
'DIAMOND SPRINGS', 'SHINGLE SPRINGS', 'COOL', 'WALNUT GROVE',
'GARDEN VALLEY', 'SLOUGHHOUSE', 'WEST SACRAMENTO'], dtype=object)
df['state'].unique()
# There is no US state with the postal abbreviation 'AC'; it looks like a transposed 'CA'
array(['CA', 'AC'], dtype=object)
df['beds'].unique()
array([2, 3, 1, 4, 0, 5, 8, 6], dtype=int64)
df['type'].unique()
# Typo "Unkown"
array(['Residential', 'Condo', 'Multi-Family', 'Unkown'], dtype=object)
# Fixing the typo (assigning back avoids Pandas' chained-assignment pitfalls)
df['type'] = df['type'].replace('Unkown', 'Unknown')
df['type'].unique()
array(['Residential', 'Condo', 'Multi-Family', 'Unknown'], dtype=object)
df['baths'].unique()
array([1, 2, 3, 4, 0, 5], dtype=int64)
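Rather than calling .unique() on every column by hand, one shortcut is to scan all the text columns at once. A small sketch, printing counts rather than full arrays since columns like street are nearly unique per row:
# Count the distinct values in each object (string) column
for col in df.select_dtypes(include='object').columns:
    print(col, ':', df[col].nunique(), 'distinct values')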
# Fixing the state typo the same way
df['state'] = df['state'].replace('AC', 'CA')
# Checking that the error was fixed
df['state'].unique()
array(['CA'], dtype=object)
Just an extra step to make the column name look nicer:
# sq__ft should only have one '_'
df.rename(columns={'sq__ft':'sq_ft'}, inplace=True)
df.head(3)
street | city | zip | state | beds | baths | sq_ft | type | sale_date | price | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3526 HIGH ST | SACRAMENTO | 95838 | CA | 2 | 1 | 836 | Residential | Wed May 21 00:00:00 EDT 2008 | 59222 | 38.631913 | -121.434879 |
1 | 51 OMAHA CT | SACRAMENTO | 95823 | CA | 3 | 1 | 1167 | Residential | Wed May 21 00:00:00 EDT 2008 | 68212 | 38.478902 | -121.431028 |
2 | 2796 BRANCH ST | SACRAMENTO | 95815 | CA | 2 | 1 | 796 | Residential | Wed May 21 00:00:00 EDT 2008 | 68880 | 38.618305 | -121.443839 |
Aha, we also need to check for duplicate rows.
# keep=False flags every occurrence of a repeated street address
duplicate = df[df.street.duplicated(keep=False)]
duplicate
street | city | zip | state | beds | baths | sq_ft | type | sale_date | price | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
341 | 4734 14TH AVE | SACRAMENTO | 95820 | CA | 2 | 1 | 834 | Residential | Mon May 19 00:00:00 EDT 2008 | 68000 | 38.539447 | -121.450858 |
342 | 4734 14TH AVE | SACRAMENTO | 95820 | CA | 2 | 1 | 834 | Residential | Mon May 19 00:00:00 EDT 2008 | 68000 | 38.539447 | -121.450858 |
393 | 1223 LAMBERTON CIR | SACRAMENTO | 95838 | CA | 3 | 2 | 1370 | Residential | Mon May 19 00:00:00 EDT 2008 | 155435 | 38.646677 | -121.437573 |
394 | 1223 LAMBERTON CIR | SACRAMENTO | 95838 | CA | 3 | 2 | 1370 | Residential | Mon May 19 00:00:00 EDT 2008 | 155500 | 38.646677 | -121.437573 |
403 | 8306 CURLEW CT | CITRUS HEIGHTS | 95621 | CA | 4 | 2 | 1280 | Residential | Mon May 19 00:00:00 EDT 2008 | 167293 | 38.715781 | -121.298519 |
404 | 8306 CURLEW CT | CITRUS HEIGHTS | 95621 | CA | 4 | 2 | 1280 | Residential | Mon May 19 00:00:00 EDT 2008 | 167293 | 38.715781 | -121.298519 |
600 | 7 CRYSTALWOOD CIR | LINCOLN | 95648 | CA | 0 | 0 | 0 | Residential | Mon May 19 00:00:00 EDT 2008 | 4897 | 38.885962 | -121.289436 |
601 | 7 CRYSTALWOOD CIR | LINCOLN | 95648 | CA | 0 | 0 | 0 | Residential | Mon May 19 00:00:00 EDT 2008 | 4897 | 38.885962 | -121.289436 |
# Dropping exact duplicates, keeping the first occurrence of each
df = df.drop_duplicates(keep='first')
df[df.duplicated()]
street | city | zip | state | beds | baths | sq_ft | type | sale_date | price | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
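One caveat: drop_duplicates() only removes rows that match in every column. Rows 393 and 394 above share an address and sale date but differ in price, so both survive. A sketch for catching such near-duplicates, assuming street plus city plus sale_date should identify a sale:
# Rows that repeat the same street, city, and sale_date but may differ elsewhere
df[df.duplicated(subset=['street', 'city', 'sale_date'], keep=False)]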
Sometimes we have to dig into the dataset to understand what individual values really mean.
# Latitude must lie in [-90, 90]; longitude in [-180, 180]
# The minimum latitude is far outside that range, so at least one row is wrong
df.loc[df['latitude'] == df['latitude'].min()]
street | city | zip | state | beds | baths | sq_ft | type | sale_date | price | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
703 | 1900 DANBROOK DR | SACRAMENTO | 95835 | CA | 1 | 1 | 984 | Condo | Fri May 16 00:00:00 EDT 2008 | 210944 | -121.503471 | 38.668433 |
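The latitude and longitude for this row were clearly transposed. Instead of hunting via .min(), a more general sanity check flags every coordinate outside its valid range:
# Any rows with out-of-range coordinates (here only row 703 shows up)
df[(df['latitude'].abs() > 90) | (df['longitude'].abs() > 180)]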
# Swapping the transposed values for row 703. Targeting the row with .loc is
# safer than the column-wide replace(), which would also touch any other row
# that happened to share these exact values.
lat = df.loc[703, 'longitude']
long = df.loc[703, 'latitude']
df.loc[703, 'latitude'] = lat
df.loc[703, 'longitude'] = long
We are always dealing with NaN values, but we also have to watch out for zero values that can distort our data. For missing values, the fillna() and dropna() methods are the standard tools. You can drop them from the dataframe entirely, either by row or by column, but keep a common rule of thumb in mind: avoid deleting more than about 5% of the dataset.
You can fill the NaNs with strings, or, for numeric columns, impute them with the mean or the median. In this case, we will treat zero values as if they were NaN values.
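For reference, here is a minimal sketch of the two standard options just mentioned (neither is applied below, since this dataset has no literal NaNs):
# Option 1: drop every row that contains a NaN
df_dropped = df.dropna()
# Option 2: fill numeric NaNs with each column's median
df_filled = df.fillna(df.median(numeric_only=True))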
# Inspecting rows where sq_ft is 0
zero_sqft = df.loc[df['sq_ft'] == 0]
zero_sqft.head(3)
street | city | zip | state | beds | baths | sq_ft | type | sale_date | price | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
73 | 17 SERASPI CT | SACRAMENTO | 95834 | CA | 0 | 0 | 0 | Residential | Wed May 21 00:00:00 EDT 2008 | 206000 | 38.631481 | -121.501880 |
89 | 2866 KARITSA AVE | SACRAMENTO | 95833 | CA | 0 | 0 | 0 | Residential | Wed May 21 00:00:00 EDT 2008 | 244500 | 38.626671 | -121.525970 |
100 | 12209 CONSERVANCY WAY | RANCHO CORDOVA | 95742 | CA | 0 | 0 | 0 | Residential | Wed May 21 00:00:00 EDT 2008 | 263500 | 38.553867 | -121.219141 |
zero_sqft['city'].unique()
array(['SACRAMENTO', 'RANCHO CORDOVA', 'EL DORADO HILLS', 'ELK GROVE',
'FOLSOM', 'RANCHO MURIETA', 'CAMERON PARK', 'LINCOLN',
'PLACERVILLE', 'GOLD RIVER', 'ROSEVILLE', 'ROCKLIN', 'AUBURN',
'WILTON', 'GALT', 'RIO LINDA', 'FORESTHILL', 'SHINGLE SPRINGS',
'GARDEN VALLEY', 'WEST SACRAMENTO'], dtype=object)
# Square footage varies a lot by city, so we impute zeros with each city's median sq_ft
sqft_median = df.groupby('city')['sq_ft'].median().to_frame()
sqft_median
sq_ft | |
---|---|
city | |
ANTELOPE | 1669.0 |
AUBURN | 960.0 |
CAMERON PARK | 0.0 |
CARMICHAEL | 1472.5 |
CITRUS HEIGHTS | 1292.5 |
COOL | 1457.0 |
DIAMOND SPRINGS | 1300.0 |
EL DORADO | 1332.0 |
EL DORADO HILLS | 0.0 |
ELK GROVE | 1722.5 |
ELVERTA | 1235.0 |
FAIR OAKS | 1450.0 |
FOLSOM | 1939.0 |
FORESTHILL | 0.0 |
GALT | 1479.0 |
GARDEN VALLEY | 0.0 |
GOLD RIVER | 1750.5 |
GRANITE BAY | 2896.0 |
GREENWOOD | 2846.0 |
LINCOLN | 0.0 |
LOOMIS | 1476.0 |
MATHER | 2093.0 |
MEADOW VISTA | 1216.0 |
NORTH HIGHLANDS | 1152.0 |
ORANGEVALE | 1574.0 |
PENRYN | 1320.0 |
PLACERVILLE | 474.0 |
POLLOCK PINES | 1320.0 |
RANCHO CORDOVA | 1222.0 |
RANCHO MURIETA | 2734.0 |
RIO LINDA | 1182.0 |
ROCKLIN | 0.0 |
ROSEVILLE | 1049.0 |
SACRAMENTO | 1318.0 |
SHINGLE SPRINGS | 0.0 |
SLOUGHHOUSE | 5822.0 |
WALNUT GROVE | 1727.0 |
WEST SACRAMENTO | 0.0 |
WILTON | 3741.0 |
# Several cities have a median sq_ft of 0, meaning there is no usable value
# to impute, so we drop those cities entirely
cities_no_sqft = ['FORESTHILL', 'CAMERON PARK', 'EL DORADO HILLS', 'GARDEN VALLEY',
                  'LINCOLN', 'ROCKLIN', 'SHINGLE SPRINGS', 'WEST SACRAMENTO']
df.drop(df[df['city'].isin(cities_no_sqft)].index, inplace=True)
# Replacing sq_ft values of 0 with each remaining city's median, looked up
# by city label rather than by fragile positional index
for city in df.loc[df['sq_ft'] == 0, 'city'].unique():
    df.loc[(df['city'] == city) & (df['sq_ft'] == 0), 'sq_ft'] = sqft_median.loc[city, 'sq_ft']
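As an aside, the same imputation can be written more compactly by literally converting zeros to NaN first, which is the "treat zeros as NaN" idea from above. A sketch on a copy, so we don't redo the work here:
# Equivalent approach: zeros become NaN, then a per-city median fill
alt = df.copy()
alt['sq_ft'] = alt['sq_ft'].replace(0, np.nan)
alt['sq_ft'] = alt.groupby('city')['sq_ft'].transform(lambda s: s.fillna(s.median()))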
# Now the same treatment for beds: inspect rows where beds is 0
zero_beds = df.loc[df['beds'] == 0]
zero_beds.head(3)
street | city | zip | state | beds | baths | sq_ft | type | sale_date | price | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
73 | 17 SERASPI CT | SACRAMENTO | 95834 | CA | 0 | 0 | 1318.0 | Residential | Wed May 21 00:00:00 EDT 2008 | 206000 | 38.631481 | -121.501880 |
89 | 2866 KARITSA AVE | SACRAMENTO | 95833 | CA | 0 | 0 | 1318.0 | Residential | Wed May 21 00:00:00 EDT 2008 | 244500 | 38.626671 | -121.525970 |
100 | 12209 CONSERVANCY WAY | RANCHO CORDOVA | 95742 | CA | 0 | 0 | 1222.0 | Residential | Wed May 21 00:00:00 EDT 2008 | 263500 | 38.553867 | -121.219141 |
zero_beds['city'].unique()
array(['SACRAMENTO', 'RANCHO CORDOVA', 'ELK GROVE', 'FOLSOM',
'RANCHO MURIETA', 'GOLD RIVER', 'ROSEVILLE', 'WILTON', 'GALT',
'RIO LINDA', 'PLACERVILLE'], dtype=object)
bed = df.groupby('city')['beds'].median().to_frame()
bed.head(3)
beds | |
---|---|
city | |
ANTELOPE | 3.0 |
AUBURN | 3.0 |
CARMICHAEL | 3.0 |
# Replacing beds values of 0 with each city's median beds
for city in df.loc[df['beds'] == 0, 'city'].unique():
    df.loc[(df['city'] == city) & (df['beds'] == 0), 'beds'] = bed.loc[city, 'beds']
# And the same for baths: inspect rows where baths is 0
zero_baths = df.loc[df['baths'] == 0]
zero_baths.head(3)
street | city | zip | state | beds | baths | sq_ft | type | sale_date | price | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
73 | 17 SERASPI CT | SACRAMENTO | 95834 | CA | 3.0 | 0 | 1318.0 | Residential | Wed May 21 00:00:00 EDT 2008 | 206000 | 38.631481 | -121.501880 |
89 | 2866 KARITSA AVE | SACRAMENTO | 95833 | CA | 3.0 | 0 | 1318.0 | Residential | Wed May 21 00:00:00 EDT 2008 | 244500 | 38.626671 | -121.525970 |
100 | 12209 CONSERVANCY WAY | RANCHO CORDOVA | 95742 | CA | 3.0 | 0 | 1222.0 | Residential | Wed May 21 00:00:00 EDT 2008 | 263500 | 38.553867 | -121.219141 |
zero_baths['city'].unique()
array(['SACRAMENTO', 'RANCHO CORDOVA', 'ELK GROVE', 'FOLSOM',
'RANCHO MURIETA', 'GOLD RIVER', 'ROSEVILLE', 'WILTON', 'GALT',
'RIO LINDA', 'PLACERVILLE'], dtype=object)
bath = df.groupby('city')['baths'].median().to_frame()
bath.head(3)
baths | |
---|---|
city | |
ANTELOPE | 2.0 |
AUBURN | 3.0 |
CARMICHAEL | 2.0 |
# Replacing baths values of 0 with each city's median baths
for city in df.loc[df['baths'] == 0, 'city'].unique():
    df.loc[(df['city'] == city) & (df['baths'] == 0), 'baths'] = bath.loc[city, 'baths']
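A final sanity check confirms the imputation worked:
# How many zeros remain in the imputed columns (ideally none)
(df[['sq_ft', 'beds', 'baths']] == 0).sum()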
After some work, our dataframe looks much nicer.
df.head(3)
street | city | zip | state | beds | baths | sq_ft | type | sale_date | price | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3526 HIGH ST | SACRAMENTO | 95838 | CA | 2.0 | 1.0 | 836.0 | Residential | Wed May 21 00:00:00 EDT 2008 | 59222 | 38.631913 | -121.434879 |
1 | 51 OMAHA CT | SACRAMENTO | 95823 | CA | 3.0 | 1.0 | 1167.0 | Residential | Wed May 21 00:00:00 EDT 2008 | 68212 | 38.478902 | -121.431028 |
2 | 2796 BRANCH ST | SACRAMENTO | 95815 | CA | 2.0 | 1.0 | 796.0 | Residential | Wed May 21 00:00:00 EDT 2008 | 68880 | 38.618305 | -121.443839 |
Dealing with messy data is a pain, but data cleaning is an essential part of every data science project. In this blog, we went over some ways to detect, summarize, and replace odd values. Usually we deal with NaN values, but here we also got the chance to treat zero values as if they were NaNs.
The more efficient you become at data cleaning, the more time you will have for exploring and modeling.