
Sklearn kmeans with multiprocessing

I can't understand how n_jobs works:

import sklearn.cluster
import sklearn.datasets

data, labels = sklearn.datasets.make_blobs(n_samples=1000, n_features=416, centers=20)
k_means = sklearn.cluster.KMeans(n_clusters=10, max_iter=3, n_jobs=1).fit(data)

With n_jobs=1, it runs in less than a second.

With n_jobs=2, it takes nearly twice as long.

With n_jobs=8, it takes so long that it never finished on my computer... (I have 8 cores.)

Is there something I don't understand about how parallelization works?

n_jobs specifies the number of concurrent processes/threads that should be used for parallelized routines.

From the docs:

Some parallelism uses a multi-threading backend by default, some a multi-processing backend. It is possible to override the default backend by using sklearn.utils.parallel_backend.

With Python's GIL, more threads do not guarantee better speed. So check whether your backend is configured for threads or for processes. If it is threads, try changing it to processes (but you will also pay the overhead of inter-process communication).
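For example, here is a minimal sketch of forcing a process-based backend with sklearn.utils.parallel_backend ('loky' is joblib's process-based backend), assuming a scikit-learn version old enough that KMeans still accepts n_jobs (the parameter was deprecated in 0.23 and removed in 1.0), so whether this actually changes your timings is something to verify on your setup:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.utils import parallel_backend

data, _ = make_blobs(n_samples=1000, n_features=416, centers=20)

# 'loky' runs workers in separate processes, sidestepping the GIL;
# 'threading' would keep everything in one interpreter instead.
with parallel_backend('loky', n_jobs=2):
    KMeans(n_clusters=10, max_iter=3).fit(data)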

Again from the docs:

Whether parallel processing is helpful at improving runtime depends on many factors, and it's usually a good idea to experiment rather than assuming that increasing the number of jobs is always a good thing. It can be highly detrimental to performance to run multiple copies of some estimators or functions in parallel.

So n_jobs is not a silver bullet; you have to experiment to see whether it helps for your estimators and your kind of data.
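For instance, a small hypothetical timing loop like the one below can show where adding workers stops paying off (results depend heavily on your machine and scikit-learn version, and again assume a version where KMeans still takes n_jobs):

import time
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

data, _ = make_blobs(n_samples=1000, n_features=416, centers=20)

# Fit the same model with increasing worker counts and compare wall time.
for n_jobs in (1, 2, 4, 8):
    start = time.perf_counter()
    KMeans(n_clusters=10, max_iter=3, n_jobs=n_jobs).fit(data)
    print(f'n_jobs={n_jobs}: {time.perf_counter() - start:.2f}s')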

You can use n_jobs=-1 to use all CPUs, or n_jobs=-2 to use all CPUs except one.
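Under joblib's convention (which scikit-learn follows for n_jobs), a negative value maps to n_cpus + 1 + n_jobs workers; a quick illustration:

from joblib import cpu_count

# Negative n_jobs means n_cpus + 1 + n_jobs workers under joblib's rule,
# so -1 is all CPUs and -2 is all but one.
for n_jobs in (-1, -2):
    print(f'n_jobs={n_jobs} -> {cpu_count() + 1 + n_jobs} workers')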
