TPOT
This example shows how TPOT can be used with Dask for distributed automated machine learning.
TPOT is an automated machine learning library. It evaluates many scikit-learn pipelines and hyperparameter combinations to find a model that works well for your data. Evaluating all of these candidates is computationally expensive but amenable to parallelism, so TPOT can use Dask to distribute the work across a cluster of machines.
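To make the parallelism concrete, here is a minimal sketch that uses only the Python standard library. It is not how TPOT or Dask are implemented: the `evaluate` function and the `(depth, rate)` search space are hypothetical stand-ins for cross-validating one candidate pipeline. The point is only that each candidate is scored independently, so the evaluations can be handed to a pool of workers.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def evaluate(candidate):
    """Toy stand-in for cross-validating one pipeline; returns a made-up score."""
    depth, rate = candidate
    return -(depth - 4) ** 2 - (rate - 0.1) ** 2


# Hypothetical search space: every (depth, rate) pair is one candidate.
candidates = list(product([2, 4, 8], [0.01, 0.1, 1.0]))

# Each evaluation depends only on its own candidate, so they can run concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(evaluate, candidates))

# Pick the candidate with the highest score.
best_score, best_candidate = max(zip(scores, candidates))
print(best_candidate, best_score)
```

Swapping the thread pool for a cluster scheduler changes where the work runs, not the result, which is why this kind of search distributes well.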
This notebook can be run interactively on the dask examples binder. The following video shows a larger version of this notebook on a cluster.
from IPython.display import HTML
HTML('<div style="position:relative;height:0;padding-bottom:56.25%"><iframe src="https://www.youtube.com/embed/uyx9nBuOYQQ?ecver=2" width="640" height="360" frameborder="0" allow="autoplay; encrypted-media" style="position:absolute;width:100%;height:100%;left:0" allowfullscreen></iframe></div>')
# Currently requires TPOT dev
!pip install git+https://github.com/EpistasisLab/tpot@development
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
Setup Dask
We first start a Dask client in order to get access to the Dask dashboard, which will provide progress and performance metrics.
You can view the dashboard by clicking on the dashboard link after you run the cell.
from dask.distributed import Client
client = Client(n_workers=4, threads_per_worker=1)
client
Create Data
We'll use the digits dataset. To ensure the example runs quickly, we'll make the training dataset relatively small.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data,
    digits.target,
    train_size=0.05,
    test_size=0.95,
)
These are just small, in-memory NumPy arrays. This example is not applicable to larger-than-memory Dask arrays.
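As a quick sanity check, the sketch below repeats the same split inside a helper function (so it does not overwrite the variables above) and reports what comes back. The helper name `split_summary` is hypothetical. The digits dataset has 1,797 samples with 64 features each, so a 5% training split holds roughly 90 rows.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split


def split_summary(train_size=0.05, test_size=0.95):
    """Repeat the split in a local scope and summarize the resulting arrays."""
    digits = load_digits()
    X_tr, X_te, y_tr, y_te = train_test_split(
        digits.data, digits.target, train_size=train_size, test_size=test_size
    )
    return {
        "is_numpy": isinstance(X_tr, np.ndarray),  # plain in-memory array
        "train_shape": X_tr.shape,
        "test_shape": X_te.shape,
        "train_kib": X_tr.nbytes / 1024,  # memory used by the training features
    }


summary = split_summary()
print(summary)
```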
Using Dask
TPOT follows the scikit-learn API; we specify a TPOTClassifier with a few hyperparameters, and then fit it on some data.
By default, TPOT trains on your single machine. To ensure your cluster is used, specify the use_dask keyword.
# scale up: Increase the TPOT parameters like population_size, generations
tp = TPOTClassifier(
    generations=2,
    population_size=10,
    cv=2,
    n_jobs=-1,
    random_state=0,
    verbosity=0,
    use_dask=True,
)
tp.fit(X_train, y_train)
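Once the search finishes, a natural next step is to score the result on the held-out data and save the winning pipeline. The helper below is a sketch, not part of the original notebook: it assumes the fitted estimator exposes `score(X, y)` and `export(path)` as the classic `TPOTClassifier` interface does, and both the function name `evaluate_and_export` and the default file name are hypothetical.

```python
def evaluate_and_export(fitted_tpot, X_test, y_test, path="tpot_digits_pipeline.py"):
    """Score a fitted TPOT estimator on held-out data and export its best pipeline.

    Assumes the classic TPOT interface: `score(X, y)` and `export(path)`.
    """
    # Accuracy of the best pipeline found during the search.
    accuracy = fitted_tpot.score(X_test, y_test)
    # Write the best pipeline to disk as Python source (assumed export behavior).
    fitted_tpot.export(path)
    return accuracy
```

With the objects defined above, this could be called as `evaluate_and_export(tp, X_test, y_test)`. Because the training split is so small, the reported accuracy mainly confirms that the workflow runs end to end.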