🐼 Pandas Basics

Pandas is a popular Python library for data analysis. It is built on top of NumPy, which provides the underlying arrays and mathematical operations, and it integrates with Matplotlib for data visualization. Pandas provides a flexible and efficient DataFrame object, which resembles a spreadsheet and supports many of the operations you would run on a SQL table.

Let's get started with some basic operations in pandas.

1. Installing and Importing Pandas

If you haven't installed pandas yet, you can do so using pip or poetry:

pip install pandas
poetry add pandas

Once installed, you can import pandas as:

In [1]:

import pandas as pd

2. DataFrame and Series

A DataFrame is a table of entries (like an Excel spreadsheet), with labeled axes (rows and columns). A Series, on the other hand, is a one-dimensional labeled array; each column of a DataFrame is a Series.

In [2]:

import numpy as np

# Create a Series
s = pd.Series([1, 2, 3, np.nan, 5, 6])
s

Out[2]:

0    1.0
1    2.0
2    3.0
3    NaN
4    5.0
5    6.0
dtype: float64
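
The float64 dtype in the output above comes from the np.nan entry: NaN is a floating-point value, so pandas stores the whole Series as floats. A minimal sketch of that behavior, with illustrative variable names that are not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Without missing values, integers keep an integer dtype
ints = pd.Series([1, 2, 3])

# A single NaN forces the whole Series to a floating-point dtype
with_nan = pd.Series([1, 2, np.nan])

print(ints.dtype)
print(with_nan.dtype)
```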

In [23]:

# Create a DataFrame from a list of rows, with labeled columns
data = [
    ['John', 28, 1.82],
    ['Anna', 24, 1.65],
    ['Peter', 35, 1.76],
    ['Linda', 32, 1.79],
    ['Alice', 41, 1.69],
    ['Carl', 29, 1.72],
]

df = pd.DataFrame(data, columns=['name', 'age', 'height'])

More info about DataFrame Creation

There are multiple ways of creating a DataFrame. Check out the dataframe creation tutorial to learn more.
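
As one example of an alternative, the sketch below builds a table from a dict that maps column names to lists of values. Only the first three rows of the data above are reproduced, and the names `columns`, `df_from_dict` and `age` are illustrative:

```python
import pandas as pd

# Column-oriented input: each key becomes a column name
columns = {
    'name': ['John', 'Anna', 'Peter'],
    'age': [28, 24, 35],
    'height': [1.82, 1.65, 1.76],
}
df_from_dict = pd.DataFrame(columns)

# Selecting one column returns a Series
age = df_from_dict['age']
print(type(age).__name__)
print(df_from_dict)
```

The row-oriented list used earlier and this column-oriented dict produce the same kind of object; which one is more convenient usually depends on how the raw data arrives.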

3. Viewing Data

You can view the top and bottom rows of the DataFrame using the head() and tail() methods, which return five rows by default:

In [24]:

# View top rows
df.head()

Out[24]:

	name	age	height
0	John	28	1.82
1	Anna	24	1.65
2	Peter	35	1.76
3	Linda	32	1.79
4	Alice	41	1.69

In [25]:

# View bottom rows
df.tail()

Out[25]:

	name	age	height
1	Anna	24	1.65
2	Peter	35	1.76
3	Linda	32	1.79
4	Alice	41	1.69
5	Carl	29	1.72
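
Both methods accept an optional row count. A short sketch, rebuilding the same DataFrame so the snippet runs on its own (`first_two` and `last_one` are illustrative names):

```python
import pandas as pd

data = [
    ['John', 28, 1.82],
    ['Anna', 24, 1.65],
    ['Peter', 35, 1.76],
    ['Linda', 32, 1.79],
    ['Alice', 41, 1.69],
    ['Carl', 29, 1.72],
]
df = pd.DataFrame(data, columns=['name', 'age', 'height'])

# Pass n to control how many rows are returned
first_two = df.head(2)
last_one = df.tail(1)

print(first_two)
print(last_one)
```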

You can also display the index, columns, and the underlying numpy data:

In [26]:

# Display index, columns, and the underlying numpy data
print(df.index, "\n")
print(df.columns, "\n")
print(df.values, "\n")

RangeIndex(start=0, stop=6, step=1) 

Index(['name', 'age', 'height'], dtype='object') 

[['John' 28 1.82]
 ['Anna' 24 1.65]
 ['Peter' 35 1.76]
 ['Linda' 32 1.79]
 ['Alice' 41 1.69]
 ['Carl' 29 1.72]]
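
Because the columns hold different types, the array printed above has to use a generic object dtype. The sketch below inspects the per-column dtypes and uses to_numpy(), which the pandas documentation recommends over .values; the variable names are illustrative:

```python
import pandas as pd

data = [
    ['John', 28, 1.82],
    ['Anna', 24, 1.65],
    ['Peter', 35, 1.76],
]
df = pd.DataFrame(data, columns=['name', 'age', 'height'])

# Each column keeps its own dtype inside the DataFrame
print(df.dtypes)

# Mixed columns collapse to a single object array
mixed = df.to_numpy()

# Selecting only numeric columns gives a numeric array instead
numeric = df[['age', 'height']].to_numpy()

print(mixed.dtype, numeric.dtype)
```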

4. Statistics

A quick statistical summary of your data can be shown using describe():

In [27]:

df.describe()

Out[27]:

	age	height
count	6.000000	6.000000
mean	31.500000	1.738333
std	5.958188	0.063692
min	24.000000	1.650000
25%	28.250000	1.697500
50%	30.500000	1.740000
75%	34.250000	1.782500
max	41.000000	1.820000
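
Each figure in the summary is also available through its own method. A minimal sketch on the same data, assuming you only need one or two of the statistics (variable names are illustrative):

```python
import pandas as pd

data = [
    ['John', 28, 1.82],
    ['Anna', 24, 1.65],
    ['Peter', 35, 1.76],
    ['Linda', 32, 1.79],
    ['Alice', 41, 1.69],
    ['Carl', 29, 1.72],
]
df = pd.DataFrame(data, columns=['name', 'age', 'height'])

# Statistics for a single column
mean_age = df['age'].mean()
median_age = df['age'].median()

# Column means for the numeric columns only (skips 'name')
numeric_means = df.mean(numeric_only=True)

print(mean_age, median_age)
print(numeric_means)
```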

5. Sorting

You can sort your data by the values in a particular column:

In [32]:

df.sort_values(by='name')

Out[32]:

	name	age	height
4	Alice	41	1.69
1	Anna	24	1.65
5	Carl	29	1.72
0	John	28	1.82
3	Linda	32	1.79
2	Peter	35	1.76

You can also sort by the index or column names:

In [33]:

# Sorting by index
df.sort_index(axis=0, ascending=False)

Out[33]:

	name	age	height
5	Carl	29	1.72
4	Alice	41	1.69
3	Linda	32	1.79
2	Peter	35	1.76
1	Anna	24	1.65
0	John	28	1.82

In [34]:

# Sorting by column names
df.sort_index(axis=1, ascending=False)

Out[34]:

	name	height	age
0	John	1.82	28
1	Anna	1.65	24
2	Peter	1.76	35
3	Linda	1.79	32
4	Alice	1.69	41
5	Carl	1.72	29

6. Indexing

Indexing is used for selecting rows and columns of data from a DataFrame. There are several ways to do it.

More info about Indexing

If you want to dig deeper into dataframe indexing, check out the data indexing tutorial.

You can select a single column by its label:

In [35]:

df['name']

Out[35]:

0     John
1     Anna
2    Peter
3    Linda
4    Alice
5     Carl
Name: name, dtype: object

Passing a slice to [] selects rows by position instead:

In [36]:

df[0:3]

Out[36]:

	name	age	height
0	John	28	1.82
1	Anna	24	1.65
2	Peter	35	1.76

For selection by label, you can use loc. Note that a label slice such as 0:3 includes its end point:

In [39]:

df.loc[0:3, ['name', 'age']]

Out[39]:

	name	age
0	John	28
1	Anna	24
2	Peter	35
3	Linda	32

For selection by position, you can use iloc:

In [40]:

df.iloc[3]

Out[40]:

name      Linda
age          32
height     1.79
Name: 3, dtype: object
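
The three slicing styles differ in how they treat the end of the range, which explains why df[0:3] returned three rows while df.loc[0:3, ...] returned four. The sketch below compares them side by side and adds a boolean mask, another common selection style not covered above (all variable names are illustrative):

```python
import pandas as pd

data = [
    ['John', 28, 1.82],
    ['Anna', 24, 1.65],
    ['Peter', 35, 1.76],
    ['Linda', 32, 1.79],
    ['Alice', 41, 1.69],
    ['Carl', 29, 1.72],
]
df = pd.DataFrame(data, columns=['name', 'age', 'height'])

plain = df[0:3]          # positional, end excluded
by_label = df.loc[0:3]   # label-based, end included
by_pos = df.iloc[0:3]    # positional, end excluded

# Boolean mask: keep only rows where the condition holds
over_30 = df[df['age'] > 30]

print(len(plain), len(by_label), len(by_pos))
print(over_30)
```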

7. Data Cleaning

Data cleaning is generally the most time-consuming part of a data analysis project. Pandas provides a number of features to make this easier.

More info about Data Cleaning

If you want to dig deeper into data cleaning, check out the data cleaning tutorial.

We add some NaN values to our DataFrame to demonstrate data cleaning.

In [45]:

# add NaN values
# Cast to float first: an integer column cannot hold NaN, and recent pandas versions no longer upcast it silently
df['age'] = df['age'].astype(float)
df.loc[0, 'age'] = np.nan
df

Out[45]:

	name	age	height
0	John	NaN	1.82
1	Anna	24.0	1.65
2	Peter	35.0	1.76
3	Linda	32.0	1.79
4	Alice	41.0	1.69
5	Carl	29.0	1.72

To check for missing data, you can use isna() or notna():

In [47]:

df.isna()

Out[47]:

	name	age	height
0	False	True	False
1	False	False	False
2	False	False	False
3	False	False	False
4	False	False	False
5	False	False	False

In [48]:

df.notna()

Out[48]:

	name	age	height
0	True	False	True
1	True	True	True
2	True	True	True
3	True	True	True
4	True	True	True
5	True	True	True
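
On larger tables the full boolean grid is hard to read, so it is common to summarize it. A minimal sketch on a three-row version of the data, built with the NaN already in place (`missing_per_column` and `has_missing` are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['John', np.nan, 1.82], ['Anna', 24, 1.65], ['Peter', 35, 1.76]],
    columns=['name', 'age', 'height'],
)

# True counts as 1, so summing the mask counts missing values per column
missing_per_column = df.isna().sum()

# Flag the rows that contain at least one missing value
has_missing = df.isna().any(axis=1)

print(missing_per_column)
print(df[has_missing])
```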

To drop any rows that have missing data, you can use dropna():

In [49]:

df.dropna(how='any')  # how='any' drops rows with at least one NaN; how='all' drops only rows where every value is NaN

Out[49]:

	name	age	height
1	Anna	24.0	1.65
2	Peter	35.0	1.76
3	Linda	32.0	1.79
4	Alice	41.0	1.69
5	Carl	29.0	1.72

To fill missing data with a specific value, you can use fillna():

In [51]:

df.fillna(value=50)

Out[51]:

	name	age	height
0	John	50.0	1.82
1	Anna	24.0	1.65
2	Peter	35.0	1.76
3	Linda	32.0	1.79
4	Alice	41.0	1.69
5	Carl	29.0	1.72
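
Both dropna() and fillna() return a new DataFrame by default, so df itself still contains the NaN unless you assign the result back. The sketch below shows this and fills the gap with the column mean rather than a fixed constant, which is one possible choice and not the only sensible one (variable names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['John', np.nan, 1.82], ['Anna', 24, 1.65], ['Peter', 35, 1.76]],
    columns=['name', 'age', 'height'],
)

# Fill only the 'age' column, using the mean of the known ages
filled = df.fillna({'age': df['age'].mean()})

# Drop rows only when 'age' is missing, ignoring other columns
dropped = df.dropna(subset=['age'])

print(filled)
print(dropped)
print(df['age'].isna().sum())  # the original is unchanged
```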

8. Applying Functions

You can apply functions to the data:

In [52]:

df.apply(np.cumsum)  # cumulative sum: each row holds the total of that row and all rows above it

Out[52]:

	name	age	height
0	John	NaN	1.82
1	JohnAnna	24.0	3.47
2	JohnAnnaPeter	59.0	5.23
3	JohnAnnaPeterLinda	91.0	7.02
4	JohnAnnaPeterLindaAlice	132.0	8.71
5	JohnAnnaPeterLindaAliceCarl	161.0	10.43

Or apply a lambda function:

In [58]:

df.apply(lambda x: x*2)  # apply a function to each column

Out[58]:

	name	age	height
0	JohnJohn	NaN	3.64
1	AnnaAnna	48.0	3.30
2	PeterPeter	70.0	3.52
3	LindaLinda	64.0	3.58
4	AliceAlice	82.0	3.38
5	CarlCarl	58.0	3.44
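
As the outputs above show, apply() passes every column to the function, including the text column, where + and * mean concatenation and repetition. A sketch of two common variations: restricting the call to numeric columns, and using axis=1 to process one row at a time. It uses the original data without the NaN, and the names `ranges` and `labels` are illustrative:

```python
import pandas as pd

data = [
    ['John', 28, 1.82],
    ['Anna', 24, 1.65],
    ['Peter', 35, 1.76],
    ['Linda', 32, 1.79],
    ['Alice', 41, 1.69],
    ['Carl', 29, 1.72],
]
df = pd.DataFrame(data, columns=['name', 'age', 'height'])

# Column-wise on numeric columns only: spread between max and min
ranges = df[['age', 'height']].apply(lambda col: col.max() - col.min())

# Row-wise (axis=1): the function receives one row as a Series
labels = df.apply(lambda row: f"{row['name']} ({row['age']})", axis=1)

print(ranges)
print(labels)
```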

This concludes our brief tutorial on the basics of pandas. Pandas has many more features that you can explore as your data manipulation and analysis needs grow!