
If I am using Dask-Jobqueue on an HPC, do I still need to use Dask-ML to run scikit-learn code?

If I am using Dask-Jobqueue on a high-performance computing (HPC) system, do I still need to use Dask-ML (i.e. joblib.parallel_backend('dask')) to run scikit-learn code?

Say I have the following code:

from dask_jobqueue import PBSCluster

cluster = PBSCluster(cores=36,   
                     memory='100GB',   
                     project='P48500028',   
                     queue='premium',   
                     interface='ib0',
                     walltime='02:00:00')

cluster.scale(100)  
                   
from dask.distributed import Client
client = Client(cluster)   


from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, n_classes=2, random_state=0)

param_grid = {"C": [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
              "kernel": ['rbf', 'poly', 'sigmoid'],
              "shrinking": [True, False]}

grid_search = GridSearchCV(SVC(gamma='auto', random_state=0, probability=True),
                           param_grid=param_grid,
                           return_train_score=False,
                           cv=3,
                           n_jobs=-1)


import joblib

with joblib.parallel_backend('dask'):
    grid_search.fit(X, y)

Since I am using Dask-Jobqueue on an HPC (i.e. I am connected to an instance of the HPC), would all of my code already be distributed to the cluster when I run it, since I have specified cluster.scale(100)? If yes, do I still need the last three lines of code above, which use Dask-ML? Or can my code look like this:

from dask_jobqueue import PBSCluster

cluster = PBSCluster(cores=36,   
                     memory='100GB',   
                     project='P48500028',   
                     queue='premium',   
                     interface='ib0',
                     walltime='02:00:00')

cluster.scale(100)  
                   
from dask.distributed import Client
client = Client(cluster)   


from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, n_classes=2, random_state=0)

param_grid = {"C": [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
              "kernel": ['rbf', 'poly', 'sigmoid'],
              "shrinking": [True, False]}

grid_search = GridSearchCV(SVC(gamma='auto', random_state=0, probability=True),
                           param_grid=param_grid,
                           return_train_score=False,
                           cv=3,
                           n_jobs=-1)

grid_search.fit(X, y)

Will the last line of the code above, grid_search.fit(X, y), not run on any Dask cluster now that I have removed joblib.parallel_backend('dask')? Or will it still run on the cluster because I declared cluster.scale(100) earlier?

Many thanks in advance.

Will the last line of the code above, grid_search.fit(X, y), not run on any Dask cluster now that I have removed joblib.parallel_backend('dask')?

Correct. Scikit-Learn needs to be told to use Dask.

Or will it still run on the cluster because I declared cluster.scale(100) earlier?

No. Dask cannot automatically parallelize your code. You need to either tell Scikit-Learn to use Dask through the joblib.parallel_backend('dask') context manager, or use the equivalent GridSearchCV object from dask_ml.
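To make the first option concrete, here is a minimal sketch that can be tried without an HPC allocation. It assumes a threads-only local Dask cluster as a stand-in for the PBSCluster in the question, and it uses a smaller dataset and parameter grid than the original so that it finishes quickly; those sizes are illustrative choices, not recommendations.

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Assumption: a small threads-only local cluster stands in for the HPC cluster.
# On the HPC you would instead build the PBSCluster as in the question and
# connect with Client(cluster).
client = Client(processes=False, n_workers=2, threads_per_worker=1)

# Smaller dataset and grid than in the question, purely to keep the run short.
X, y = make_classification(n_samples=200, n_features=20, n_classes=2, random_state=0)
param_grid = {"C": [0.1, 1.0], "kernel": ["rbf", "poly"]}

grid_search = GridSearchCV(SVC(gamma="auto", random_state=0),
                           param_grid=param_grid,
                           cv=3,
                           n_jobs=-1)

# Only work started inside this block is sent to the Dask workers.
# Without it, scikit-learn falls back to joblib's default local backend,
# even though a Client is connected.
with joblib.parallel_backend("dask"):
    grid_search.fit(X, y)

print(grid_search.best_params_)
client.close()
```

Creating the Client only makes the workers available; the context manager is what routes scikit-learn's internal joblib calls to them.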

