Logging Datasets
In MLflow it is also possible to log the datasets used during training. This is useful for keeping track of the data a model was trained on and for reproducing results.
In [1]:
from sklearn.datasets import load_iris
# load iris dataset
X, y = load_iris(return_X_y=True, as_frame=True)
dataset = X.join(y)
dataset.head()
Out[1]:
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
Convert it to an MLflow dataset.
In [2]:
import mlflow

# Create the PandasDataset for use in mlflow evaluate
mlflow_dataset = mlflow.data.from_pandas(
    dataset,
    targets="target",  # we specify the target column
    name="iris dataset",  # we specify the name of the dataset
)
Log the dataset
In [ ]:
import mlflow

with mlflow.start_run():
    mlflow.log_input(mlflow_dataset, context="full_dataset")
In the MLflow server we can find the following information (taken from the official MLflow documentation):