Voting Classifier

A voting classifier combines multiple different models (i.e., sub-estimators) into a single model that is (ideally) stronger than any of the individual models alone.
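As a rough sketch of the idea (not part of the original notebook), with voting='hard' the ensemble predicts whichever class receives the most votes from the sub-estimators. For two classes, a majority vote over some hypothetical predictions looks like this:

import numpy as np

# Hypothetical 0/1 predictions from three already-fitted sub-estimators on five samples
preds = np.array([
    [0, 1, 1, 0, 1],   # e.g. an SGD classifier
    [0, 1, 0, 0, 1],   # e.g. a logistic regression
    [1, 1, 1, 0, 0],   # e.g. an SVC
])

# Hard voting: a sample is labelled 1 when more than half of the sub-estimators predict 1
majority = (preds.sum(axis=0) > preds.shape[0] / 2).astype(int)
print(majority)        # [0 1 1 0 1]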

Dask provides the software to train individual sub-estimators on different machines in a cluster. This enables users to train more models in parallel than would have been possible on a single machine. Note that users will only observe this benefit if they have a distributed cluster with more resources than their single machine (because sklearn already enables users to parallelize training across cores on a single machine).

What follows is an example of how one would deploy a voting classifier model in Dask (using a local cluster).


In [1]:
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

import sklearn.datasets

We create a synthetic dataset (with 1000 rows and 20 columns) that we can give to the voting classifier model.

In [2]:
X, y = sklearn.datasets.make_classification(n_samples=1_000, n_features=20)

We specify the VotingClassifier as a list of (name, sub-estimator) tuples. Fitting the VotingClassifier on the data fits each of the sub-estimators in turn. We set the n_jobs argument to -1, which instructs sklearn to use all available cores (notice that we haven't used Dask yet).

In [3]:
classifiers = [
    ('sgd', SGDClassifier(max_iter=1000)),
    ('logisticregression', LogisticRegression()),
    ('svc', SVC(gamma='auto')),
]
clf = VotingClassifier(classifiers, n_jobs=-1)

We call the classifier's fit method to train it.

In [4]:
%time clf.fit(X, y)
CPU times: user 80 ms, sys: 244 ms, total: 324 ms
Wall time: 621 ms
Out[4]:
VotingClassifier(estimators=[('sgd', SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=1000, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=T...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))],
         flatten_transform=None, n_jobs=-1, voting='hard', weights=None)
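
The notebook moves straight on to Dask after fitting, but as a brief aside (not an executed cell in the original), the fitted VotingClassifier can be used like any other sklearn estimator:

clf.predict(X[:5])   # majority-vote class labels for the first five rows
clf.score(X, y)      # mean accuracy on the training data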

Creating a Dask client provides performance and progress metrics via the dashboard. Because Client is given no arguments, its output refers to a local cluster (not a distributed cluster).

We can view the dashboard by clicking the link after running the cell.

In [5]:
import dask_ml.joblib
from sklearn.externals import joblib
from distributed import Client

client = Client()
client
Out[5]:

Client (local cluster) — Workers: 48, Cores: 48, Memory: 67.28 GB

To train the voting classifier, we call the classifier's fit method, but enclosed in joblib's parallel_backend context manager. This distributes training of the sub-estimators across the cluster. Passing the data via the scatter argument sends it to each worker in the cluster ahead of time.

In [6]:
%%time 
with joblib.parallel_backend("dask", scatter=[X, y]):
    clf.fit(X, y)

print(clf)
VotingClassifier(estimators=[('sgd', SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=1000, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=T...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))],
         flatten_transform=None, n_jobs=-1, voting='hard', weights=None)
CPU times: user 1.98 s, sys: 964 ms, total: 2.94 s
Wall time: 8.92 s

Note that we see no advantage from using Dask here because we are using a local cluster rather than a distributed cluster, and sklearn is already using all of the machine's cores. If we were using a distributed cluster, Dask would enable us to take advantage of multiple machines and train sub-estimators across them.
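
For reference, switching to a distributed cluster would only change how the Client is created; the scheduler address below is hypothetical, and the rest of the notebook would stay the same:

from distributed import Client

# Connect to an existing distributed scheduler instead of starting a local cluster
client = Client("tcp://scheduler-address:8786")   # hypothetical address

with joblib.parallel_backend("dask", scatter=[X, y]):
    clf.fit(X, y)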

