Logging Datasets
In MLflow it is also possible to log the datasets used during training. This is useful for keeping track of the data a model was trained on and for reproducing results.
In [1]:
from sklearn.datasets import load_iris
# load iris dataset
X, y = load_iris(return_X_y=True, as_frame=True)
dataset = X.join(y)
dataset.head()
Out[1]:
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
Convert it to an MLflow dataset.
In [2]:
import mlflow

# Create the PandasDataset for use in mlflow evaluate
mlflow_dataset = mlflow.data.from_pandas(
    dataset,
    targets="target",  # we specify the target column
    name="iris dataset",  # we specify the name of the dataset
)
Log the dataset
In [ ]:
import mlflow

with mlflow.start_run():
    mlflow.log_input(mlflow_dataset, context="full_dataset")
In the MLflow server we can find the following information (taken from the official MLflow documentation):