If I am using Dask-Jobqueue on an HPC, do I still need to use Dask-ML to run scikit-learn code?

If I am using Dask-Jobqueue on a High Performance Computing (HPC) system, do I still need to use Dask-ML (i.e. joblib.parallel_backend('dask')) to run scikit-learn code?
Say I have the following code:
from dask_jobqueue import PBSCluster
cluster = PBSCluster(cores=36,
                     memory='100GB',
                     project='P48500028',
                     queue='premium',
                     interface='ib0',
                     walltime='02:00:00')
cluster.scale(100)

from dask.distributed import Client
client = Client(cluster)

from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, n_classes=2, random_state=0)
param_grid = {"C": [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
              "kernel": ['rbf', 'poly', 'sigmoid'],
              "shrinking": [True, False]}

grid_search = GridSearchCV(SVC(gamma='auto', random_state=0, probability=True),
                           param_grid=param_grid,
                           return_train_score=False,
                           iid=True,
                           cv=3,
                           n_jobs=-1)

import joblib
with joblib.parallel_backend('dask'):
    grid_search.fit(X, y)
Since I am using Dask-Jobqueue on an HPC (i.e. I am connected to an instance of the HPC), when I run my code, is all of it already distributed to a cluster (since I have specified cluster.scale(100))? If yes, do I still need the last 3 lines of code above, which use Dask-ML? Or can my code be like this:
from dask_jobqueue import PBSCluster
cluster = PBSCluster(cores=36,
                     memory='100GB',
                     project='P48500028',
                     queue='premium',
                     interface='ib0',
                     walltime='02:00:00')
cluster.scale(100)

from dask.distributed import Client
client = Client(cluster)

from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, n_classes=2, random_state=0)
param_grid = {"C": [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
              "kernel": ['rbf', 'poly', 'sigmoid'],
              "shrinking": [True, False]}

grid_search = GridSearchCV(SVC(gamma='auto', random_state=0, probability=True),
                           param_grid=param_grid,
                           return_train_score=False,
                           iid=True,
                           cv=3,
                           n_jobs=-1)

grid_search.fit(X, y)
Will the last line of code above, grid_search.fit(X, y), not run on any Dask cluster since I have removed joblib.parallel_backend('dask')? Or will it still run on the cluster since I declared cluster.scale(100) earlier?

Many thanks in advance.
Will the last line of code above, grid_search.fit(X, y), not run on any Dask cluster since I have removed joblib.parallel_backend('dask')?
Correct. Scikit-Learn needs to be told to use Dask.
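To make that concrete, here is a minimal sketch (using plain joblib on a local machine, not the question's PBS setup) showing that scikit-learn's n_jobs only routes work through whatever joblib backend is currently active; until the 'dask' backend is activated, everything stays local:

```python
from joblib import Parallel, delayed, parallel_backend

def square(x):
    return x * x

# By default joblib runs on a local process/thread pool; scikit-learn's
# n_jobs parameter just feeds into this same machinery.
local = Parallel(n_jobs=2)(delayed(square)(i) for i in range(5))
print(local)  # [0, 1, 4, 9, 16]

# Only inside the 'dask' backend context do those joblib tasks get shipped
# to the Dask scheduler (and from there to the PBS workers). Sketch,
# assuming a dask.distributed Client is connected as in the question:
# with parallel_backend('dask'):
#     grid_search.fit(X, y)  # now runs on the Dask cluster
```

Declaring the cluster and cluster.scale(100) only provisions workers; nothing routes computation to them until a backend (or a Dask-native API) is used.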
Or will it still run on a cluster since I have earlier on declared cluster.scale(100)?
No. Dask is unable to automatically parallelize your code. You need to either tell Scikit-Learn to use Dask via the joblib context manager, or else use the equivalent dask_ml GridSearchCV object.