🐼 Pandas Data Cleaning¶
Data cleaning is a vital step in the data analysis process, as the results of your analysis are only as good as the quality of your data. In this tutorial, we will go over how to clean data using the pandas library.
1. Importing the Pandas library¶
Before cleaning your data, you need to import the pandas library. This is typically imported under the pd alias.
import pandas as pd
2. Loading your data¶
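In practice you would usually load your data from a file, for example with pd.read_csv. A minimal sketch, assuming a hypothetical file called data.csv in the working directory:

df = pd.read_csv('data.csv')  # hypothetical file name; read_csv also accepts URLs and many options

For this tutorial, we will instead build a small DataFrame by hand so the examples are easy to follow: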
# Create a DataFrame from a list of lists with labeled columns, using the name column as the index
data = [
['John', 28, 1.82],
['Anna', None, 1.65],
['Peter', 35, 1.76],
['John', 28, 1.82],
['Linda', 32, 1.79],
['Alice', 41, None],
['Carl', 29, 1.72],
]
df = pd.DataFrame(data, columns=['name', 'age', 'height'])
df = df.set_index("name")
df
| name | age | height |
| --- | --- | --- |
| John | 28.0 | 1.82 |
| Anna | NaN | 1.65 |
| Peter | 35.0 | 1.76 |
| John | 28.0 | 1.82 |
| Linda | 32.0 | 1.79 |
| Alice | 41.0 | NaN |
| Carl | 29.0 | 1.72 |
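Before you start cleaning, it is often helpful to get a quick overview of the data, for example:

df.info()      # index, column dtypes, and non-null counts
df.describe()  # basic summary statistics for the numeric columns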
3. Handling Missing Values¶
3.1 Detecting missing values¶
Real-world datasets often contain missing values. You can detect them with the isna() function:
df.isna()
| name | age | height |
| --- | --- | --- |
| John | False | False |
| Anna | True | False |
| Peter | False | False |
| John | False | False |
| Linda | False | False |
| Alice | False | True |
| Carl | False | False |
This will return a DataFrame of the same shape as df, but with True in places where the original DataFrame has NaN or None and False elsewhere.
You can use isna().sum() to get a count of missing values in each column:
df.isna().sum()
age       1
height    1
dtype: int64
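If you prefer the share of missing values rather than the raw count, one option is to take the column-wise mean of the boolean mask:

df.isna().mean()  # fraction of missing values per column (True counts as 1, False as 0)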
3.2 Removing missing values¶
One way to handle missing values is to remove them using the dropna() function:
df.dropna()
| name | age | height |
| --- | --- | --- |
| John | 28.0 | 1.82 |
| Peter | 35.0 | 1.76 |
| John | 28.0 | 1.82 |
| Linda | 32.0 | 1.79 |
| Carl | 29.0 | 1.72 |
This will return a new DataFrame with rows containing NaN values dropped. You can specify the axis parameter as 1 to drop columns containing NaN values:
df.dropna(axis=1)
| name |
| --- |
| John |
| Anna |
| Peter |
| John |
| Linda |
| Alice |
| Carl |
Use it with caution: in this example both age and height contain a missing value, so dropping columns removes all of the data and leaves only the index. This could potentially remove a lot of your data.
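dropna() also accepts parameters that make it less aggressive. For example, subset restricts the check to certain columns and thresh keeps rows with at least a given number of non-missing values; a small sketch:

df.dropna(subset=['age'])  # only drop rows where 'age' is missing
df.dropna(thresh=2)        # keep rows that have at least 2 non-missing values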
3.3 Replacing missing values¶
Another way to handle missing values is to replace them with a valid value. This value can be a single number like zero, or some sort of imputation like the mean or median of the column. Use the fillna() function to do this:
df.fillna(0) # replace all NaN values with 0
| name | age | height |
| --- | --- | --- |
| John | 28.0 | 1.82 |
| Anna | 0.0 | 1.65 |
| Peter | 35.0 | 1.76 |
| John | 28.0 | 1.82 |
| Linda | 32.0 | 1.79 |
| Alice | 41.0 | 0.00 |
| Carl | 29.0 | 1.72 |
df.fillna(df.mean()) # replace all NaN values with the mean of each column
| name | age | height |
| --- | --- | --- |
| John | 28.000000 | 1.82 |
| Anna | 32.166667 | 1.65 |
| Peter | 35.000000 | 1.76 |
| John | 28.000000 | 1.82 |
| Linda | 32.000000 | 1.79 |
| Alice | 41.000000 | 1.76 |
| Carl | 29.000000 | 1.72 |
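fillna() also accepts a dictionary mapping column names to fill values, which lets you use a different strategy per column. For example, to fill age with its median and height with its mean:

df.fillna({'age': df['age'].median(), 'height': df['height'].mean()})

As with most pandas methods, fillna() returns a new DataFrame; assign the result back to df if you want to keep the filled values.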
4. Removing Duplicates¶
You can use the duplicated() function to check for duplicate rows:
df.duplicated()
name
John     False
Anna     False
Peter    False
John      True
Linda    False
Alice    False
Carl     False
dtype: bool
This will return a Series that is True where a row is duplicated.
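To count or inspect the duplicated rows before dropping them, you can combine this mask with sum() or with boolean indexing:

df.duplicated().sum()  # number of duplicate rows (not counting the first occurrence)
df[df.duplicated()]    # view the duplicated rows themselves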
You can remove the duplicate rows using the drop_duplicates() function:
df.drop_duplicates()
| name | age | height |
| --- | --- | --- |
| John | 28.0 | 1.82 |
| Anna | NaN | 1.65 |
| Peter | 35.0 | 1.76 |
| Linda | 32.0 | 1.79 |
| Alice | 41.0 | NaN |
| Carl | 29.0 | 1.72 |
This will return a new DataFrame with the duplicates removed.
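By default, drop_duplicates() compares all columns and keeps the first occurrence. You can change this with the subset and keep parameters; a small sketch:

df.drop_duplicates(subset=['age'], keep='last')  # compare only the 'age' column and keep the last occurrence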
5. Renaming Columns¶
You can rename columns using the rename() function, passing a dictionary of {old_name: new_name} pairs:
df.rename(columns={'height': 'tall'})
| name | age | tall |
| --- | --- | --- |
| John | 28.0 | 1.82 |
| Anna | NaN | 1.65 |
| Peter | 35.0 | 1.76 |
| John | 28.0 | 1.82 |
| Linda | 32.0 | 1.79 |
| Alice | 41.0 | NaN |
| Carl | 29.0 | 1.72 |
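Note that rename() returns a new DataFrame and leaves df unchanged. To keep the new column name you would assign the result back, for example:

df_renamed = df.rename(columns={'height': 'tall'})  # assign to a new variable; df itself still has 'height'

In this tutorial we keep the original column name, so the remaining examples still refer to height.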
6. Changing Data Types¶
Sometimes, you might need to change the data type of a column. Use the astype() function for this:
df['age'] = df['age'].astype(float)  # cast the age column to float
df
| name | age | height |
| --- | --- | --- |
| John | 28.0 | 1.82 |
| Anna | NaN | 1.65 |
| Peter | 35.0 | 1.76 |
| John | 28.0 | 1.82 |
| Linda | 32.0 | 1.79 |
| Alice | 41.0 | NaN |
| Carl | 29.0 | 1.72 |
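Because the age column contains a missing value, it cannot be stored as a plain int. If you need whole numbers, one option is pandas' nullable integer type; a small sketch:

df['age'].astype('Int64')  # nullable integer dtype that can hold missing values (<NA>)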
This concludes our brief tutorial on data cleaning using pandas. Remember, data cleaning is a very important step in the data analysis process and spending more time on this step can often save you time in the later stages of the project.