📉 Scikit-learn Basics¶
Let's dive into Scikit-learn, one of the most popular machine learning libraries for Python. Scikit-learn provides easy-to-use tools for data preprocessing, feature selection, model building, and evaluation.
In this tutorial, we'll cover the following topics:
- What is Scikit-learn?
- Installing Scikit-learn
- Loading Data
- Data Preprocessing
- Building a Simple Machine Learning Model
- Evaluating the Model
Let's get started!
1. What is Scikit-learn?¶
Scikit-learn (sklearn) is a Python library that provides simple and efficient tools for data mining and data analysis. It is built on top of other scientific Python libraries like NumPy and SciPy, and it integrates well with Pandas and Matplotlib. Scikit-learn supports various machine learning algorithms, including regression, classification, clustering, and more.
2. Installing Scikit-learn¶
You can install Scikit-learn with pip (or with Poetry, if that's how you manage dependencies). Open your terminal or command prompt and run one of the following:
pip install scikit-learn
poetry add scikit-learn
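To confirm the installation worked, you can print the installed version from Python (any recent release should be fine for this tutorial):
import sklearn
print(sklearn.__version__)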
3. Loading Data¶
For this tutorial, we'll use a simple dataset that comes with Scikit-learn: the Iris dataset. The Iris dataset consists of 150 samples of iris flowers, each with four features (sepal length, sepal width, petal length, and petal width), and a target variable specifying the species of iris (Setosa, Versicolor, or Virginica).
from sklearn.datasets import load_iris
# Load the Iris dataset
data = load_iris()
X = data.data # Features
y = data.target # Target variable
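Before moving on, it can help to peek at what load_iris returned. Here's an optional sketch that prints the shapes along with the feature and class names:
# Inspect the dataset: 150 samples, 4 features, 3 classes
print(X.shape)              # (150, 4)
print(data.feature_names)   # sepal/petal lengths and widths
print(data.target_names)    # ['setosa' 'versicolor' 'virginica']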
4. Data Preprocessing¶
Before building a machine learning model, it's essential to preprocess the data. Scikit-learn provides many tools for data preprocessing, but we'll keep it simple for this tutorial. We'll split the data into training and testing sets.
from sklearn.model_selection import train_test_split
# Split the data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
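Many real-world workflows also scale the features before training. We won't need it here, but as an illustrative sketch, this is how you could standardize the features with StandardScaler (fitting the scaler on the training set only, so no information leaks from the test set):
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Because k-NN (used in the next step) is distance-based, scaling can matter when features have very different ranges; the rest of this tutorial sticks with the unscaled features to keep things simple.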
5. Building a Simple Machine Learning Model¶
For this tutorial, we'll use a basic classification algorithm: k-Nearest Neighbors (k-NN). k-NN is a non-parametric, lazy learning algorithm used for classification and regression; to classify a new sample, it looks at the k closest training samples and takes a majority vote of their labels.
from sklearn.neighbors import KNeighborsClassifier
# Create the k-NN classifier
knn = KNeighborsClassifier(n_neighbors=3)
# Train the classifier on the training data
knn.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=3)
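The number of neighbors, n_neighbors, is the main hyperparameter of k-NN. As an optional sketch (not part of the original example), you could compare a few values of k with cross-validation on the training set:
from sklearn.model_selection import cross_val_score

# Compare a few values of k using 5-fold cross-validation on the training data
for k in (1, 3, 5, 7):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    print(f"k={k}: mean CV accuracy = {scores.mean():.3f}")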
6. Evaluating the Model¶
Now that we have trained our model, it's time to evaluate its performance on the test set.
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Make predictions on the test data
y_pred = knn.predict(X_test)
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
# Generate a classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Generate a confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
Accuracy: 1.00

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

Confusion Matrix:
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
The accuracy score tells us the fraction of test samples the model classified correctly. The classification report gives precision, recall, F1-score, and support for each class. The confusion matrix shows, for each true class, how many samples were assigned to each predicted class, so correct predictions sit on the diagonal and misclassifications appear off it.
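If you want to see where the accuracy number comes from, it is simply the proportion of predictions that match the true labels. A quick sanity check using NumPy (which Scikit-learn already depends on):
import numpy as np

# Accuracy is the fraction of correct predictions
manual_accuracy = np.mean(y_pred == y_test)
print(f"Manual accuracy: {manual_accuracy:.2f}")  # should match accuracy_score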
That's it! You've completed a simple tutorial on using Scikit-learn for machine learning tasks. You can explore other algorithms, preprocessing techniques, and more complex datasets as you advance your skills in machine learning.
Remember, Scikit-learn offers a vast array of tools and functionalities to explore, so keep practicing and experimenting with different datasets and algorithms to enhance your understanding of machine learning in Python. Happy learning!