This example shows how TPOT can be used with Dask for distributed automated machine learning.

TPOT is an automated machine learning library. It evaluates many scikit-learn pipelines and hyperparameter combinations to find a model that works well for your data. Evaluating all of these candidates is computationally expensive, but amenable to parallelism. TPOT can use Dask to distribute the work across a cluster of machines.

This notebook can be run interactively on the Dask examples binder. The following video shows a larger version of this notebook running on a cluster.

In [1]:

from IPython.display import HTML

HTML('<div style="position:relative;height:0;padding-bottom:56.25%"><iframe src="https://www.youtube.com/embed/uyx9nBuOYQQ?ecver=2" width="640" height="360" frameborder="0" allow="autoplay; encrypted-media" style="position:absolute;width:100%;height:100%;left:0" allowfullscreen></iframe></div>')

Out[1]:

In [2]:

# Currently requires TPOT dev
!pip install git+https://github.com/EpistasisLab/tpot@development

Collecting git+https://github.com/EpistasisLab/tpot@development
  Cloning https://github.com/EpistasisLab/tpot (to revision development) to /tmp/pip-req-build-arx_ttjs
Requirement already satisfied: numpy>=1.12.1 in /home/travis/miniconda/envs/test/lib/python3.6/site-packages (from TPOT==0.9.4) (1.15.0)
Requirement already satisfied: scipy>=0.19.0 in /home/travis/miniconda/envs/test/lib/python3.6/site-packages (from TPOT==0.9.4) (1.1.0)
Requirement already satisfied: scikit-learn>=0.18.1 in /home/travis/miniconda/envs/test/lib/python3.6/site-packages (from TPOT==0.9.4) (0.19.1)
Collecting deap>=1.0 (from TPOT==0.9.4)
  Downloading https://files.pythonhosted.org/packages/af/29/e7f2ecbe02997b16a768baed076f5fc4781d7057cd5d9adf7c94027845ba/deap-1.2.2.tar.gz (936kB)
    100% |████████████████████████████████| 942kB 23.8MB/s 
Collecting update_checker>=0.16 (from TPOT==0.9.4)
  Downloading https://files.pythonhosted.org/packages/17/c9/ab11855af164d03be0ff4fddd4c46a5bd44799a9ecc1770e01a669c21168/update_checker-0.16-py2.py3-none-any.whl
Collecting tqdm>=4.11.2 (from TPOT==0.9.4)
  Downloading https://files.pythonhosted.org/packages/c7/e0/52b2faaef4fd87f86eb8a8f1afa2cd6eb11146822033e29c04ac48ada32c/tqdm-4.25.0-py2.py3-none-any.whl (43kB)
    100% |████████████████████████████████| 51kB 21.9MB/s 
Collecting stopit>=1.1.1 (from TPOT==0.9.4)
  Downloading https://files.pythonhosted.org/packages/35/58/e8bb0b0fb05baf07bbac1450c447d753da65f9701f551dca79823ce15d50/stopit-1.1.2.tar.gz
Requirement already satisfied: pandas>=0.20.2 in /home/travis/miniconda/envs/test/lib/python3.6/site-packages (from TPOT==0.9.4) (0.23.4)
Requirement already satisfied: requests>=2.3.0 in /home/travis/miniconda/envs/test/lib/python3.6/site-packages (from update_checker>=0.16->TPOT==0.9.4) (2.19.1)
Requirement already satisfied: python-dateutil>=2.5.0 in /home/travis/miniconda/envs/test/lib/python3.6/site-packages (from pandas>=0.20.2->TPOT==0.9.4) (2.7.3)
Requirement already satisfied: pytz>=2011k in /home/travis/miniconda/envs/test/lib/python3.6/site-packages (from pandas>=0.20.2->TPOT==0.9.4) (2018.5)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /home/travis/miniconda/envs/test/lib/python3.6/site-packages (from requests>=2.3.0->update_checker>=0.16->TPOT==0.9.4) (3.0.4)
Requirement already satisfied: urllib3<1.24,>=1.21.1 in /home/travis/miniconda/envs/test/lib/python3.6/site-packages (from requests>=2.3.0->update_checker>=0.16->TPOT==0.9.4) (1.23)
Requirement already satisfied: idna<2.8,>=2.5 in /home/travis/miniconda/envs/test/lib/python3.6/site-packages (from requests>=2.3.0->update_checker>=0.16->TPOT==0.9.4) (2.7)
Requirement already satisfied: certifi>=2017.4.17 in /home/travis/miniconda/envs/test/lib/python3.6/site-packages (from requests>=2.3.0->update_checker>=0.16->TPOT==0.9.4) (2018.8.24)
Requirement already satisfied: six>=1.5 in /home/travis/miniconda/envs/test/lib/python3.6/site-packages (from python-dateutil>=2.5.0->pandas>=0.20.2->TPOT==0.9.4) (1.11.0)
Building wheels for collected packages: TPOT, deap, stopit
  Running setup.py bdist_wheel for TPOT ... - done
  Stored in directory: /tmp/pip-ephem-wheel-cache-11rzqxzl/wheels/36/7d/ff/9064c383dafa37df13f7177a086505f65c8ef3eaee66686bb7
  Running setup.py bdist_wheel for deap ... - \ | / - \ | done
  Stored in directory: /home/travis/.cache/pip/wheels/22/ea/bf/dc7c8a2262025a0ab5da9ef02282c198be88902791ca0c6658
  Running setup.py bdist_wheel for stopit ... - done
  Stored in directory: /home/travis/.cache/pip/wheels/3c/85/2b/2580190404636bfc63e8de3dff629c03bb795021e1983a6cc7
Successfully built TPOT deap stopit
twisted 18.7.0 requires PyHamcrest>=1.9.0, which is not installed.
Installing collected packages: deap, update-checker, tqdm, stopit, TPOT
Successfully installed TPOT-0.9.4 deap-1.2.2 stopit-1.1.2 tqdm-4.25.0 update-checker-0.16

In [3]:

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

Setup Dask

We first start a Dask client in order to get access to the Dask dashboard, which will provide progress and performance metrics.

You can view the dashboard by clicking on the dashboard link after you run the cell.

In [4]:

from dask.distributed import Client
client = Client(n_workers=4, threads_per_worker=1)
client

Out[4]:

Client

Scheduler: tcp://127.0.0.1:35970
Dashboard: http://127.0.0.1:8787/status

Cluster

Workers: 4
Cores: 4
Memory: 5.61 GB
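
When this example is run as a plain script there is no rendered link to click. The following is a minimal sketch of reading the same information from the client object. `describe_client` is a hypothetical helper name, not part of Dask; it assumes a `dask.distributed.Client` like the one created above and a reasonably recent `distributed` release that provides `Client.nthreads()` and the `dashboard_link` attribute.

```python
def describe_client(client):
    """Summarize a Dask client: dashboard URL, worker count, total threads."""
    threads_per_worker = client.nthreads()  # maps worker address -> thread count
    return {
        "dashboard": client.dashboard_link,
        "n_workers": len(threads_per_worker),
        "n_threads": sum(threads_per_worker.values()),
    }

# Usage with the client from the cell above:
# print(describe_client(client))
```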

Create Data

We'll use the digits dataset. To ensure the example runs quickly, we'll make the training dataset relatively small.

In [5]:

digits = load_digits()

X_train, X_test, y_train, y_test = train_test_split(
    digits.data,
    digits.target,
    train_size=0.05,
    test_size=0.95,
)

These are just small, in-memory NumPy arrays. This example is not applicable to larger-than-memory Dask arrays.
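
To see how small the training set is, the shapes of the split can be inspected. This is a minimal sketch that repeats the split from the cell above; the `random_state=0` argument is an addition (the original split is unseeded) so that the result is reproducible.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# Same fractions as above; random_state is added only for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data,
    digits.target,
    train_size=0.05,
    test_size=0.95,
    random_state=0,
)

n_samples, n_features = digits.data.shape
print("total samples:", n_samples, "features:", n_features)
print("training rows:", X_train.shape[0])
print("test rows:", X_test.shape[0])
```

Roughly 5% of the rows end up in the training set, which keeps each pipeline fit cheap at the cost of a noisier search.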

Using Dask

TPOT follows the scikit-learn API: we create a TPOTClassifier with a few hyperparameters and then fit it on some data. By default, TPOT trains on your single machine. To ensure your cluster is used, pass use_dask=True.

In [6]:

# scale up: Increase the TPOT parameters like population_size, generations
tp = TPOTClassifier(
    generations=2,
    population_size=10,
    cv=2,
    n_jobs=-1,
    random_state=0,
    verbosity=0,
    use_dask=True
)
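
As a rough guide to how much work these settings create, the TPOT documentation states that a run evaluates `population_size + generations * offspring_size` pipelines, with `offspring_size` defaulting to `population_size`. The sketch below applies that formula to the values above. The helper name is hypothetical, and the result should be read as an estimate: TPOT may skip duplicate or invalid pipelines, so the real count can be lower.

```python
def estimated_pipeline_evaluations(population_size, generations, offspring_size=None):
    """Estimate how many pipelines TPOT evaluates for the given settings."""
    if offspring_size is None:
        # TPOT's documented default: as many offspring as the population size.
        offspring_size = population_size
    return population_size + generations * offspring_size

cv = 2  # number of cross-validation folds used above
n_pipelines = estimated_pipeline_evaluations(population_size=10, generations=2)
n_fits = n_pipelines * cv  # each pipeline is fit once per fold

print("pipelines:", n_pipelines)
print("model fits:", n_fits)
```

Scaling up `population_size` or `generations` grows this count quickly, which is where distributing the fits over a cluster starts to pay off.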

In [7]:

tp.fit(X_train, y_train)

/home/travis/miniconda/envs/test/lib/python3.6/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
  from numpy.core.umath_tests import inner1d

Out[7]:

TPOTClassifier(config_dict=None, crossover_rate=0.1, cv=2,
        disable_update_check=False, early_stop=None, generations=2,
        max_eval_time_mins=5, max_time_mins=None, memory=None,
        mutation_rate=0.9, n_jobs=-1, offspring_size=None,
        periodic_checkpoint_folder=None, population_size=10,
        random_state=0, scoring=None, subsample=1.0, use_dask=True,
        verbosity=0, warm_start=False)
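
Once fitting finishes, the fitted estimator can be scored on the held-out data and its best pipeline exported as Python code. This is a minimal sketch using TPOT's `score` and `export` methods; `score_and_export` and the output file name are hypothetical, and it assumes the `tp`, `X_test`, and `y_test` objects from the cells above.

```python
def score_and_export(tp, X_test, y_test, export_path="tpot_digits_pipeline.py"):
    """Score a fitted TPOT estimator on held-out data and export its best pipeline."""
    test_score = tp.score(X_test, y_test)  # uses the scoring configured on tp
    tp.export(export_path)                 # writes the best pipeline as a Python script
    return test_score

# Usage with the objects from the cells above:
# print(score_and_export(tp, X_test, y_test))
```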

Learn More

See the Dask-ML and TPOT documentation for more information on using Dask and TPOT.