If I am using Dask-Jobqueue on an HPC, do I still need to use Dask-ML to run scikit-learn code?

If I am using Dask-Jobqueue on a High Performance Computing (HPC) system, do I still need to use Dask-ML (i.e. joblib.parallel_backend('dask')) to run scikit-learn code?
Say I have the following code:
from dask_jobqueue import PBSCluster
cluster = PBSCluster(cores=36,
                     memory='100GB',
                     project='P48500028',
                     queue='premium',
                     interface='ib0',
                     walltime='02:00:00')
cluster.scale(100)

from dask.distributed import Client
client = Client(cluster)

from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, n_classes=2, random_state=0)
param_grid = {"C": [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
              "kernel": ['rbf', 'poly', 'sigmoid'],
              "shrinking": [True, False]}

grid_search = GridSearchCV(SVC(gamma='auto', random_state=0, probability=True),
                           param_grid=param_grid,
                           return_train_score=False,
                           iid=True,
                           cv=3,
                           n_jobs=-1)

import joblib
with joblib.parallel_backend('dask'):
    grid_search.fit(X, y)
Since I am using Dask-Jobqueue on an HPC (i.e. I am connected to an instance of the HPC), when I run my code, is all of it already distributed to a cluster (since I have specified cluster.scale(100))? If yes, do I still need the last 3 lines of code above, which use Dask-ML? Or can my code be like this:
from dask_jobqueue import PBSCluster
cluster = PBSCluster(cores=36,
                     memory='100GB',
                     project='P48500028',
                     queue='premium',
                     interface='ib0',
                     walltime='02:00:00')
cluster.scale(100)

from dask.distributed import Client
client = Client(cluster)

from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, n_classes=2, random_state=0)
param_grid = {"C": [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
              "kernel": ['rbf', 'poly', 'sigmoid'],
              "shrinking": [True, False]}

grid_search = GridSearchCV(SVC(gamma='auto', random_state=0, probability=True),
                           param_grid=param_grid,
                           return_train_score=False,
                           iid=True,
                           cv=3,
                           n_jobs=-1)

grid_search.fit(X, y)
Will the last line of code above, grid_search.fit(X, y), not run on any Dask cluster since I have removed joblib.parallel_backend('dask')? Or will it still run on the cluster since I declared cluster.scale(100) earlier?

Many thanks in advance.
Will the last line of code above, grid_search.fit(X, y), not run on any Dask cluster since I have removed joblib.parallel_backend('dask')?
Correct. Scikit-Learn needs to be told to use Dask.
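To make that concrete, here is a minimal sketch (using plain joblib on a local machine, not the question's PBS setup) showing that scikit-learn's n_jobs only routes work through whatever joblib backend is currently active; until the 'dask' backend is activated, everything stays local:

```python
from joblib import Parallel, delayed, parallel_backend

def square(x):
    return x * x

# By default joblib runs on a local process/thread pool; scikit-learn's
# n_jobs parameter just feeds into this same machinery.
local = Parallel(n_jobs=2)(delayed(square)(i) for i in range(5))
print(local)  # [0, 1, 4, 9, 16]

# Only inside the 'dask' backend context do those joblib tasks get shipped
# to the Dask scheduler (and from there to the PBS workers). Sketch,
# assuming a dask.distributed Client is connected as in the question:
# with parallel_backend('dask'):
#     grid_search.fit(X, y)  # now runs on the Dask cluster
```

Declaring the cluster and cluster.scale(100) only provisions workers; nothing routes computation to them until a backend (or a Dask-native API) is used.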
Or will it still run on a cluster since I have earlier on declared cluster.scale(100)?
No. Dask is unable to automatically parallelize your code. You need to either tell Scikit-Learn to use Dask via the joblib context manager, or else use the equivalent dask_ml GridSearchCV object.