Training on Large Datasets

Most estimators in scikit-learn are designed to work with NumPy arrays or scipy sparse matrices. These data structures must fit in the RAM of a single machine.

Estimators implemented in Dask-ML work well with Dask Arrays and DataFrames. These can be much larger than a single machine's RAM, and can be distributed in memory across a cluster of machines.
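
As a rough illustration (not part of the original notebook), a Dask array is a collection of NumPy chunks, so its total size can exceed the RAM of any single worker:

import dask.array as da

# An 80 GB lazy array made of 100 chunks of roughly 800 MB each; only the
# chunks currently being computed on need to fit in any one worker's memory.
x = da.ones((100_000, 100_000), chunks=(10_000, 10_000))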

In [1]:
%matplotlib inline
In [2]:
from dask.distributed import Client

# Scale up: connect to your own cluster with more resources
# see http://dask.pydata.org/en/latest/setup.html
client = Client(processes=False, threads_per_worker=4, n_workers=1, memory_limit='2GB')
client
Out[2]:

Client

Cluster

  • Workers: 1
  • Cores: 4
  • Memory: 2.00 GB
In [3]:
import dask_ml.datasets
import dask_ml.cluster
import matplotlib.pyplot as plt

In this example, we'll use dask_ml.datasets.make_blobs to generate some random dask arrays.

In [4]:
# Scale up: increase n_samples or n_features
X, y = dask_ml.datasets.make_blobs(n_samples=1000000,
                                   chunks=100000,
                                   random_state=0,
                                   centers=3)
# Persist the generated data in worker memory so it isn't recomputed later
X = X.persist()
X
Out[4]:
dask.array<concatenate, shape=(1000000, 2), dtype=float64, chunksize=(100000, 2)>
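
Because X is persisted, later computations read the chunks from distributed memory instead of regenerating them. For example (a hypothetical check, not in the original notebook), reductions aggregate across all chunks in parallel:

X.mean(axis=0).compute()  # per-feature means, computed chunk by chunk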

We'll use the k-means algorithm implemented in Dask-ML to cluster the points. It uses the k-means|| (read: "k-means parallel") initialization algorithm, which scales better than k-means++. All of the computation, both during and after initialization, can be done in parallel.

In [5]:
km = dask_ml.cluster.KMeans(n_clusters=3, init_max_iter=2, oversampling_factor=10)
km.fit(X)
INFO:root:Starting _check_array
INFO:root:Finished _check_array in 0:00:00.299619
INFO:root:Starting init_scalable
INFO:dask_ml.cluster.k_means:Initializing with k-means||
INFO:dask_ml.cluster.k_means:Starting init iteration  1/ 2 ,  1 centers
INFO:dask_ml.cluster.k_means:Finished init iteration  1/ 2 ,  1 centers in 0:00:00.147220
INFO:dask_ml.cluster.k_means:Starting init iteration  2/ 2 , 14 centers
INFO:dask_ml.cluster.k_means:Finished init iteration  2/ 2 , 14 centers in 0:00:00.385016
INFO:root:Finished init_scalable in 0:00:00.725804
INFO:dask_ml.cluster.k_means:Starting Lloyd loop  0.
INFO:dask_ml.cluster.k_means:Shift: 1.1078
INFO:dask_ml.cluster.k_means:Finished Lloyd loop  0. in 0:00:00.441496
INFO:dask_ml.cluster.k_means:Starting Lloyd loop  1.
INFO:dask_ml.cluster.k_means:Shift: 0.0231
INFO:dask_ml.cluster.k_means:Finished Lloyd loop  1. in 0:00:00.288660
INFO:dask_ml.cluster.k_means:Starting Lloyd loop  2.
INFO:dask_ml.cluster.k_means:Shift: 0.0025
INFO:dask_ml.cluster.k_means:Finished Lloyd loop  2. in 0:00:00.232219
INFO:dask_ml.cluster.k_means:Starting Lloyd loop  3.
INFO:dask_ml.cluster.k_means:Shift: 0.0003
INFO:dask_ml.cluster.k_means:Finished Lloyd loop  3. in 0:00:00.228746
INFO:dask_ml.cluster.k_means:Starting Lloyd loop  4.
INFO:dask_ml.cluster.k_means:Shift: 0.0000
INFO:dask_ml.cluster.k_means:Finished Lloyd loop  4. in 0:00:00.233141
Out[5]:
KMeans(algorithm='full', copy_x=True, init='k-means||', init_max_iter=2,
    max_iter=300, n_clusters=3, n_jobs=1, oversampling_factor=10,
    precompute_distances='auto', random_state=None, tol=0.0001)
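
The fitted model exposes the learned centers as km.cluster_centers_ and the per-point assignments as km.labels_ (itself a Dask array). As a hypothetical follow-up, assuming predict behaves like its scikit-learn counterpart, labels for new points can also be computed lazily:

# Hypothetical example: label a fresh batch of points with the fitted model
X_new, _ = dask_ml.datasets.make_blobs(n_samples=10000, chunks=10000,
                                       random_state=1, centers=3)
labels_new = km.predict(X_new)   # lazy Dask array of cluster indices
labels_new.compute()[:10]        # materialize a small preview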

We'll plot a sample of points, colored by the cluster each falls into.

In [6]:
fig, ax = plt.subplots()
# Plot every 1000th point, colored by the cluster it was assigned to
ax.scatter(X[::1000, 0], X[::1000, 1], marker='.', c=km.labels_[::1000],
           cmap='viridis', alpha=0.25);

For all the estimators implemented in Dask-ML, see the API documentation.

