🐼 Pandas Data Cleaning¶
Data cleaning is a vital step in the data analysis process, as the results of your analysis are only as good as the quality of your data. In this tutorial, we will go over how to clean data using the pandas library.
1. Importing the Pandas library¶
Before cleaning your data, you need to import the pandas library. This is typically imported under the pd alias.
import pandas as pd
2. Loading your data¶
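In practice you would usually load your data from a file, for example with pd.read_csv. A minimal sketch, assuming a hypothetical file called data.csv in the working directory:

df = pd.read_csv('data.csv')  # hypothetical file name; read_csv also accepts URLs and many options

For this tutorial, we will instead build a small DataFrame by hand so the examples are easy to follow: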
# Create a DataFrame from a list of lists with labeled columns, using the name column as the index
data = [
['John', 28, 1.82],
['Anna', None, 1.65],
['Peter', 35, 1.76],
['John', 28, 1.82],
['Linda', 32, 1.79],
['Alice', 41, None],
['Carl', 29, 1.72],
]
df = pd.DataFrame(data, columns=['name', 'age', 'height'])
df = df.set_index("name")
df
| name | age | height |
| --- | --- | --- |
| John | 28.0 | 1.82 |
| Anna | NaN | 1.65 |
| Peter | 35.0 | 1.76 |
| John | 28.0 | 1.82 |
| Linda | 32.0 | 1.79 |
| Alice | 41.0 | NaN |
| Carl | 29.0 | 1.72 |
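Before you start cleaning, it is often helpful to get a quick overview of the data, for example:

df.info()      # index, column dtypes, and non-null counts
df.describe()  # basic summary statistics for the numeric columns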
3. Handling Missing Values¶
3.1 Detecting missing values¶
Real-world datasets often contain missing values. You can detect them with the isna() function:
df.isna()
| name | age | height |
| --- | --- | --- |
| John | False | False |
| Anna | True | False |
| Peter | False | False |
| John | False | False |
| Linda | False | False |
| Alice | False | True |
| Carl | False | False |
This will return a DataFrame of the same shape as df, but with True in places where the original DataFrame has NaN or None and False elsewhere.
You can use isna().sum() to get a count of missing values in each column:
df.isna().sum()
age       1
height    1
dtype: int64
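If you prefer the share of missing values rather than the raw count, one option is to take the column-wise mean of the boolean mask:

df.isna().mean()  # fraction of missing values per column (True counts as 1, False as 0)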
3.2 Removing missing values¶
One way to handle missing values is to remove them using the dropna() function:
df.dropna()
| name | age | height |
| --- | --- | --- |
| John | 28.0 | 1.82 |
| Peter | 35.0 | 1.76 |
| John | 28.0 | 1.82 |
| Linda | 32.0 | 1.79 |
| Carl | 29.0 | 1.72 |
This will return a new DataFrame with rows containing NaN values dropped. You can specify the axis parameter as 1 to drop columns containing NaN values:
df.dropna(axis=1)
| name |
| --- |
| John |
| Anna |
| Peter |
| John |
| Linda |
| Alice |
| Carl |
Use it with caution: in this example both age and height contain a missing value, so dropping columns removes all of the data and leaves only the index. This could potentially remove a lot of your data.
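dropna() also accepts parameters that make it less aggressive. For example, subset restricts the check to certain columns and thresh keeps rows with at least a given number of non-missing values; a small sketch:

df.dropna(subset=['age'])  # only drop rows where 'age' is missing
df.dropna(thresh=2)        # keep rows that have at least 2 non-missing values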
3.3 Replacing missing values¶
Another way to handle missing values is to replace them with a valid value. This value can be a single number like zero, or some sort of imputation like the mean or median of the column. Use the fillna() function to do this:
df.fillna(0) # replace all NaN values with 0
| name | age | height |
| --- | --- | --- |
| John | 28.0 | 1.82 |
| Anna | 0.0 | 1.65 |
| Peter | 35.0 | 1.76 |
| John | 28.0 | 1.82 |
| Linda | 32.0 | 1.79 |
| Alice | 41.0 | 0.00 |
| Carl | 29.0 | 1.72 |
df.fillna(df.mean()) # replace all NaN values with the mean of each column
| name | age | height |
| --- | --- | --- |
| John | 28.000000 | 1.82 |
| Anna | 32.166667 | 1.65 |
| Peter | 35.000000 | 1.76 |
| John | 28.000000 | 1.82 |
| Linda | 32.000000 | 1.79 |
| Alice | 41.000000 | 1.76 |
| Carl | 29.000000 | 1.72 |
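fillna() also accepts a dictionary mapping column names to fill values, which lets you use a different strategy per column. For example, to fill age with its median and height with its mean:

df.fillna({'age': df['age'].median(), 'height': df['height'].mean()})

As with most pandas methods, fillna() returns a new DataFrame; assign the result back to df if you want to keep the filled values.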
4. Removing Duplicates¶
You can use the duplicated() function to check for duplicate rows:
df.duplicated()
name
John     False
Anna     False
Peter    False
John      True
Linda    False
Alice    False
Carl     False
dtype: bool
This will return a Series that is True where a row is duplicated.
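To count or inspect the duplicated rows before dropping them, you can combine this mask with sum() or with boolean indexing:

df.duplicated().sum()  # number of duplicate rows (not counting the first occurrence)
df[df.duplicated()]    # view the duplicated rows themselves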
You can remove the duplicate rows using the drop_duplicates() function:
df.drop_duplicates()
| name | age | height |
| --- | --- | --- |
| John | 28.0 | 1.82 |
| Anna | NaN | 1.65 |
| Peter | 35.0 | 1.76 |
| Linda | 32.0 | 1.79 |
| Alice | 41.0 | NaN |
| Carl | 29.0 | 1.72 |
This will return a new DataFrame with the duplicates removed.
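By default, drop_duplicates() compares all columns and keeps the first occurrence. You can change this with the subset and keep parameters; a small sketch:

df.drop_duplicates(subset=['age'], keep='last')  # compare only the 'age' column and keep the last occurrence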
5. Renaming Columns¶
You can rename columns using the rename() function, passing a dictionary of {old_name: new_name} pairs:
df.rename(columns={'height': 'tall'})
| name | age | tall |
| --- | --- | --- |
| John | 28.0 | 1.82 |
| Anna | NaN | 1.65 |
| Peter | 35.0 | 1.76 |
| John | 28.0 | 1.82 |
| Linda | 32.0 | 1.79 |
| Alice | 41.0 | NaN |
| Carl | 29.0 | 1.72 |
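Note that rename() returns a new DataFrame and leaves df unchanged. To keep the new column name you would assign the result back, for example:

df_renamed = df.rename(columns={'height': 'tall'})  # assign to a new variable; df itself still has 'height'

In this tutorial we keep the original column name, so the remaining examples still refer to height.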
6. Changing Data Types¶
Sometimes, you might need to change the data type of a column. Use the astype() function for this:
df['age'] = df['age'].astype(float)  # cast the age column to float
df
| name | age | height |
| --- | --- | --- |
| John | 28.0 | 1.82 |
| Anna | NaN | 1.65 |
| Peter | 35.0 | 1.76 |
| John | 28.0 | 1.82 |
| Linda | 32.0 | 1.79 |
| Alice | 41.0 | NaN |
| Carl | 29.0 | 1.72 |
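Because the age column contains a missing value, it cannot be stored as a plain int. If you need whole numbers, one option is pandas' nullable integer type; a small sketch:

df['age'].astype('Int64')  # nullable integer dtype that can hold missing values (<NA>)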
This concludes our brief tutorial on data cleaning using pandas. Remember, data cleaning is a very important step in the data analysis process and spending more time on this step can often save you time in the later stages of the project.