🐼 Pandas Basics
Pandas is a popular Python library for data analysis. It is built on top of NumPy, which provides fast array-based numerical operations, and it integrates with Matplotlib for plotting. Pandas provides a flexible and efficient DataFrame object, which resembles a spreadsheet and supports many of the operations you would run on a SQL table, such as filtering and sorting.
Let's get started with some basic operations in pandas.
1. Installing and Importing Pandas
If you haven't installed pandas yet, you can do so using pip or poetry:
pip install pandas
poetry add pandas
Once installed, you can import pandas as:
import pandas as pd
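As a quick sanity check, a minimal sketch like the following confirms that the import works and shows which version is installed. The exact version string depends on your environment.

```python
import pandas as pd

# Print the installed pandas version to confirm the import works
print(pd.__version__)
```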
2. DataFrame and Series
A DataFrame is a two-dimensional table of entries (like an Excel spreadsheet) with labeled axes (rows and columns). A Series is a one-dimensional labeled array; each column of a DataFrame is a Series.
import numpy as np
# Create a Series
s = pd.Series([1, 2, 3, np.nan, 5, 6])
s
0    1.0
1    2.0
2    3.0
3    NaN
4    5.0
5    6.0
dtype: float64
# Create a DataFrame from a list of rows, with labeled columns and the default integer index
data = [
    ['John', 28, 1.82],
    ['Anna', 24, 1.65],
    ['Peter', 35, 1.76],
    ['Linda', 32, 1.79],
    ['Alice', 41, 1.69],
    ['Carl', 29, 1.72],
]
df = pd.DataFrame(data, columns=['name', 'age', 'height'])
More info about DataFrame Creation
There are multiple ways of creating a DataFrame. Check out the DataFrame creation tutorial to learn more.
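To illustrate two of those other ways, here is a minimal sketch that builds the same small table from a dict of lists and from a list of dicts. The variable names `df_from_dict` and `df_from_records` are made up for this example.

```python
import pandas as pd

# Option 1: a dict mapping column names to lists of values
df_from_dict = pd.DataFrame({
    "name": ["John", "Anna", "Peter"],
    "age": [28, 24, 35],
    "height": [1.82, 1.65, 1.76],
})

# Option 2: a list of dicts, one dict per row
df_from_records = pd.DataFrame([
    {"name": "John", "age": 28, "height": 1.82},
    {"name": "Anna", "age": 24, "height": 1.65},
    {"name": "Peter", "age": 35, "height": 1.76},
])

# Both approaches produce the same table
print(df_from_dict.equals(df_from_records))

# Each column of a DataFrame is a Series
print(type(df_from_dict["age"]))
```

The dict-of-lists form is convenient when your data is already organized by column, while the list-of-dicts form fits data that arrives record by record.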
3. Viewing Data
You can view the top and bottom rows of the DataFrame using the head() and tail() methods. Both return five rows by default:
# View top rows
df.head()
|   | name | age | height |
|---|---|---|---|
| 0 | John | 28 | 1.82 |
| 1 | Anna | 24 | 1.65 |
| 2 | Peter | 35 | 1.76 |
| 3 | Linda | 32 | 1.79 |
| 4 | Alice | 41 | 1.69 |
# View bottom rows
df.tail()
|   | name | age | height |
|---|---|---|---|
| 1 | Anna | 24 | 1.65 |
| 2 | Peter | 35 | 1.76 |
| 3 | Linda | 32 | 1.79 |
| 4 | Alice | 41 | 1.69 |
| 5 | Carl | 29 | 1.72 |
You can also display the index, columns, and the underlying numpy data:
# Display index, columns, and the underlying numpy data
print(df.index, "\n")
print(df.columns, "\n")
print(df.values, "\n")
RangeIndex(start=0, stop=6, step=1)

Index(['name', 'age', 'height'], dtype='object')

[['John' 28 1.82]
 ['Anna' 24 1.65]
 ['Peter' 35 1.76]
 ['Linda' 32 1.79]
 ['Alice' 41 1.69]
 ['Carl' 29 1.72]]
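Beyond the index, columns, and values, a few other attributes are handy for a first look at a DataFrame. The sketch below uses a shortened copy of the example table; `arr` is just an illustrative variable name.

```python
import pandas as pd

df = pd.DataFrame(
    [["John", 28, 1.82], ["Anna", 24, 1.65], ["Peter", 35, 1.76]],
    columns=["name", "age", "height"],
)

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # data type of each column
df.info()         # prints a summary: index, columns, non-null counts, memory

# to_numpy() returns the underlying data as a NumPy array,
# and is the method the pandas documentation recommends over .values
arr = df.to_numpy()
print(arr.shape)
```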
4. Statistics
A quick statistical summary of the numeric columns can be shown using describe():
df.describe()
|   | age | height |
|---|---|---|
| count | 6.000000 | 6.000000 |
| mean | 31.500000 | 1.738333 |
| std | 5.958188 | 0.063692 |
| min | 24.000000 | 1.650000 |
| 25% | 28.250000 | 1.697500 |
| 50% | 30.500000 | 1.740000 |
| 75% | 34.250000 | 1.782500 |
| max | 41.000000 | 1.820000 |
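If you only need one or two of these numbers, you can call the individual statistics methods directly instead of describe(). A minimal sketch, rebuilding the example table so that it runs on its own:

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["John", 28, 1.82],
        ["Anna", 24, 1.65],
        ["Peter", 35, 1.76],
        ["Linda", 32, 1.79],
        ["Alice", 41, 1.69],
        ["Carl", 29, 1.72],
    ],
    columns=["name", "age", "height"],
)

# Single statistics on one column
mean_age = df["age"].mean()
median_age = df["age"].median()
print(mean_age, median_age)

# Several statistics on several columns at once
summary = df[["age", "height"]].agg(["min", "max"])
print(summary)
```

The mean and median printed here should match the `mean` and `50%` rows of the describe() output above.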
5. Sorting
You can sort your data by the values in a particular column:
df.sort_values(by='name')
|   | name | age | height |
|---|---|---|---|
| 4 | Alice | 41 | 1.69 |
| 1 | Anna | 24 | 1.65 |
| 5 | Carl | 29 | 1.72 |
| 0 | John | 28 | 1.82 |
| 3 | Linda | 32 | 1.79 |
| 2 | Peter | 35 | 1.76 |
You can also sort by the index or column names:
# Sorting by index
df.sort_index(axis=0, ascending=False)
|   | name | age | height |
|---|---|---|---|
| 5 | Carl | 29 | 1.72 |
| 4 | Alice | 41 | 1.69 |
| 3 | Linda | 32 | 1.79 |
| 2 | Peter | 35 | 1.76 |
| 1 | Anna | 24 | 1.65 |
| 0 | John | 28 | 1.82 |
# Sorting by column names
df.sort_index(axis=1, ascending=False)
|   | name | height | age |
|---|---|---|---|
| 0 | John | 1.82 | 28 |
| 1 | Anna | 1.65 | 24 |
| 2 | Peter | 1.76 | 35 |
| 3 | Linda | 1.79 | 32 |
| 4 | Alice | 1.69 | 41 |
| 5 | Carl | 1.72 | 29 |
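The sorting methods return a new DataFrame and leave the original untouched. The following sketch sorts by age in descending order and then renumbers the rows; `by_age_desc` and `renumbered` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["John", 28, 1.82],
        ["Anna", 24, 1.65],
        ["Peter", 35, 1.76],
        ["Linda", 32, 1.79],
        ["Alice", 41, 1.69],
        ["Carl", 29, 1.72],
    ],
    columns=["name", "age", "height"],
)

# Sort by age, oldest first; the original index labels travel with the rows
by_age_desc = df.sort_values(by="age", ascending=False)
print(by_age_desc)

# Discard the old labels and number the sorted rows 0, 1, 2, ...
renumbered = by_age_desc.reset_index(drop=True)
print(renumbered)

# df itself still has its original row order
print(df["name"].tolist())
```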
6. Indexing
Indexing is used for selecting rows and columns of data from a DataFrame. There are several ways to do it.
More info about Data Indexing
If you want to dig deeper into DataFrame indexing, check out the data indexing tutorial.
You can select a single column by its label:
df['name']
0     John
1     Anna
2    Peter
3    Linda
4    Alice
5     Carl
Name: name, dtype: object
Or pass a slice to [], which selects rows by position:
df[0:3]
|   | name | age | height |
|---|---|---|---|
| 0 | John | 28 | 1.82 |
| 1 | Anna | 24 | 1.65 |
| 2 | Peter | 35 | 1.76 |
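For selection by label, you can use loc. Unlike ordinary Python slices, a label slice with loc includes both endpoints, so 0:3 returns four rows: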
df.loc[0:3, ['name', 'age']]
|   | name | age |
|---|---|---|
| 0 | John | 28 |
| 1 | Anna | 24 |
| 2 | Peter | 35 |
| 3 | Linda | 32 |
For selection by position, you can use iloc:
df.iloc[3]
name      Linda
age          32
height     1.79
Name: 3, dtype: object
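The difference between label-based and position-based slicing is easy to miss, so here is a small sketch that puts the two side by side, followed by boolean indexing, another common way to select rows. The variable names are made up for this example, and the table is a shortened copy of the one above.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["John", 28, 1.82],
        ["Anna", 24, 1.65],
        ["Peter", 35, 1.76],
        ["Linda", 32, 1.79],
    ],
    columns=["name", "age", "height"],
)

# loc slices by label and includes the end label: rows 0, 1, 2
by_label = df.loc[0:2]

# iloc slices by position and excludes the end position: rows 0, 1
by_position = df.iloc[0:2]

print(len(by_label), len(by_position))

# Boolean indexing: keep only the rows where the condition is True
over_30 = df[df["age"] > 30]
print(over_30)

# Combine a row condition with a column selection using loc
tall_names = df.loc[df["height"] > 1.75, "name"]
print(tall_names.tolist())
```

With the default integer index, labels and positions happen to coincide, which is why the inclusive and exclusive endpoints are the only visible difference here.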
7. Data Cleaning
Data cleaning is generally the most time-consuming part of a data analysis project. Pandas provides a number of features to make this easier.
More info about Data Cleaning
If you want to dig deeper into data cleaning, check out the data cleaning tutorial.
We set one value to NaN to demonstrate data cleaning. Because NaN is a floating-point value, the age column is displayed as floats afterwards.
# Replace one age with NaN
df.loc[0, 'age'] = np.nan
df
|   | name | age | height |
|---|---|---|---|
| 0 | John | NaN | 1.82 |
| 1 | Anna | 24.0 | 1.65 |
| 2 | Peter | 35.0 | 1.76 |
| 3 | Linda | 32.0 | 1.79 |
| 4 | Alice | 41.0 | 1.69 |
| 5 | Carl | 29.0 | 1.72 |
To check for missing data, you can use isna() or notna():
df.isna()
|   | name | age | height |
|---|---|---|---|
| 0 | False | True | False |
| 1 | False | False | False |
| 2 | False | False | False |
| 3 | False | False | False |
| 4 | False | False | False |
| 5 | False | False | False |
df.notna()
|   | name | age | height |
|---|---|---|---|
| 0 | True | False | True |
| 1 | True | True | True |
| 2 | True | True | True |
| 3 | True | True | True |
| 4 | True | True | True |
| 5 | True | True | True |
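On larger tables, a full True/False grid is hard to read, so the boolean result is usually summarized. The sketch below counts missing values per column and pulls out the affected rows. It builds a small table with the NaN already in place, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ["John", np.nan, 1.82],
        ["Anna", 24, 1.65],
        ["Peter", 35, 1.76],
    ],
    columns=["name", "age", "height"],
)

# True counts as 1 when summed, so this gives the number of NaNs per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Total number of missing values in the whole table
total_missing = int(df.isna().sum().sum())
print(total_missing)

# Rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```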
To drop any rows that have missing data, you can use dropna():
df.dropna(how='any')  # 'any' drops rows with at least one NaN; 'all' drops only rows where every value is NaN
|   | name | age | height |
|---|---|---|---|
| 1 | Anna | 24.0 | 1.65 |
| 2 | Peter | 35.0 | 1.76 |
| 3 | Linda | 32.0 | 1.79 |
| 4 | Alice | 41.0 | 1.69 |
| 5 | Carl | 29.0 | 1.72 |
To fill missing data with a specific value, you can use fillna():
df.fillna(value=50)
|   | name | age | height |
|---|---|---|---|
| 0 | John | 50.0 | 1.82 |
| 1 | Anna | 24.0 | 1.65 |
| 2 | Peter | 35.0 | 1.76 |
| 3 | Linda | 32.0 | 1.79 |
| 4 | Alice | 41.0 | 1.69 |
| 5 | Carl | 29.0 | 1.72 |
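A fixed value such as 50 is rarely the right replacement in practice. One common alternative, sketched below, is to fill each column with its own mean. Note that dropna() and fillna() return a new DataFrame rather than modifying `df`, so the result has to be assigned to a variable. The names `filled` and `dropped` are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ["John", np.nan, 1.82],
        ["Anna", 24, 1.65],
        ["Peter", 35, 1.76],
        ["Linda", 32, 1.79],
        ["Alice", 41, 1.69],
        ["Carl", 29, 1.72],
    ],
    columns=["name", "age", "height"],
)

# mean() ignores NaN by default, so this is the mean of the five known ages
mean_age = df["age"].mean()

# A dict restricts the fill to specific columns
filled = df.fillna({"age": mean_age})
print(filled)

# Drop rows only when a particular column is missing
dropped = df.dropna(subset=["age"])
print(len(dropped))

# The original DataFrame still contains the NaN
print(df["age"].isna().sum())
```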
8. Applying Functions
You can apply functions to the data:
df.apply(np.cumsum)  # cumulative sum down each column; strings are concatenated and NaN entries are skipped
|   | name | age | height |
|---|---|---|---|
| 0 | John | NaN | 1.82 |
| 1 | JohnAnna | 24.0 | 3.47 |
| 2 | JohnAnnaPeter | 59.0 | 5.23 |
| 3 | JohnAnnaPeterLinda | 91.0 | 7.02 |
| 4 | JohnAnnaPeterLindaAlice | 132.0 | 8.71 |
| 5 | JohnAnnaPeterLindaAliceCarl | 161.0 | 10.43 |
Or apply a lambda function:
df.apply(lambda x: x*2) # apply a function to each column
|   | name | age | height |
|---|---|---|---|
| 0 | JohnJohn | NaN | 3.64 |
| 1 | AnnaAnna | 48.0 | 3.30 |
| 2 | PeterPeter | 70.0 | 3.52 |
| 3 | LindaLinda | 64.0 | 3.58 |
| 4 | AliceAlice | 82.0 | 3.38 |
| 5 | CarlCarl | 58.0 | 3.44 |
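As the tables above show, applying a numeric function to the whole DataFrame also touches the string column. A minimal sketch of two ways to get more control: restrict apply() to the numeric columns, or pass axis=1 to work row by row. The variable names and the label format are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["John", 28, 1.82],
        ["Anna", 24, 1.65],
        ["Peter", 35, 1.76],
        ["Linda", 32, 1.79],
        ["Alice", 41, 1.69],
        ["Carl", 29, 1.72],
    ],
    columns=["name", "age", "height"],
)

# Column-wise (default axis=0): the function receives one column at a time
col_ranges = df[["age", "height"]].apply(lambda col: col.max() - col.min())
print(col_ranges)

# Row-wise (axis=1): the function receives one row at a time
labels = df.apply(lambda row: f"{row['name']}: {row['height']:.2f} m", axis=1)
print(labels.tolist())

# For simple arithmetic, operating on the column directly is usually
# simpler and faster than apply()
doubled_age = df["age"] * 2
print(doubled_age.tolist())
```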
This concludes our brief tutorial on the basics of pandas. Pandas has many more features that you can explore as your data manipulation and analysis needs grow.