[英]Why is parallel implementation slower than serial? (Python multiprocessing module)
I am trying to extend the scikit-learn class KNeighborsClassifier
by introducing an alternative method of computing distances between neighbours (see here if interested). 我试图通过引入一种计算邻居之间距离的替代方法来扩展scikit-learn类
KNeighborsClassifier
(如果感兴趣,请参见此处 )。
The parallelization scheme is the following: given that we want to calculate distances between all elements of set A and set B, for each element in A (taken sequentially one after the other), calculate distances to all elements in B in parallel . 并行方案如下:假设我们想要计算集合A中的所有元素,并设置B,对于A中的每个元素之间的距离(一个顺序采取的其他后),计算距离在平行于B中的所有元素。 The time-consuming operation is calculating the distance between any two elements, so each process should carry out this fundamental operation.
耗时的操作正在计算任何两个元素之间的距离,因此每个过程都应执行此基本操作。
The problem is that parallel execution is much slower than serial execution (using Python's multiprocessing
module), both when using synchronous as well as asynchronous calls, regardless of the machine and the numbers of cores used. 问题在于,无论使用同步调用还是异步调用,无论使用何种计算机和内核数量,并行执行都比串行执行(使用Python的
multiprocessing
模块)慢得多。
I suspect that this is related to using shared variables, which are communicated in the background. 我怀疑这与使用共享变量有关,这些变量在后台进行通信。 The question is, which variables are being communicated and how could this be avoided?
问题是,正在传达哪些变量,如何避免这种情况?
Code: 码:
class WordMoversKNN(KNeighborsClassifier):
"""K nearest neighbors classifier using the Word Mover's Distance.
Parameters
----------
W_embed : array, shape: (vocab_size, embed_size)
Precomputed word embeddings between vocabulary items.
Row indices should correspond to the columns in the bag-of-words input.
n_neighbors : int
Number of neighbors to use by default for :meth:`k_neighbors` queries.
n_jobs : int
The number of parallel jobs to run for Word Mover's Distance computation.
If ``-1``, then the number of jobs is set to the number of CPU cores.
verbose : int, optional
Controls the verbosity; the higher, the more messages. Defaults to 0.
"""
def __init__(self, W_embed, n_neighbors=1, n_jobs=1, verbose=5):
self.W_embed = W_embed
self.verbose = verbose
if n_jobs == -1:
n_jobs = mp.cpu_count()
super(WordMoversKNN, self).__init__(n_neighbors=n_neighbors, n_jobs=n_jobs, metric='precomputed', algorithm='brute')
def _wmd(self, i, row, X_train):
"""Compute the WMD between training sample i and given test row.
Assumes that `row` and train samples are sparse BOW vectors summing to 1.
"""
union_idx = np.union1d(X_train[i].indices, row.indices)
W_minimal = self.W_embed[union_idx]
W_dist = euclidean_distances(W_minimal)
bow_i = X_train[i, union_idx].A.ravel()
bow_j = row[:, union_idx].A.ravel()
return emd(bow_i, bow_j, W_dist)
def _wmd_row(self, row, X_train):
"""Wrapper to compute the WMD of a row with all training samples.
Assumes that `row` and train samples are sparse BOW vectors summing to 1.
Useful for parallelization.
"""
n_samples_train = X_train.shape[0]
return [self._wmd(i, row, X_train) for i in range(n_samples_train)]
def _pairwise_wmd(self, X_test, X_train=None, ordered=True):
"""Computes the word mover's distance between all train and test points.
Parallelized over rows of X_test.
Assumes that train and test samples are sparse BOW vectors summing to 1.
Parameters
----------
X_test: scipy.sparse matrix, shape: (n_test_samples, vocab_size)
Test samples.
X_train: scipy.sparse matrix, shape: (n_train_samples, vocab_size)
Training samples. If `None`, uses the samples the estimator was fit with.
ordered: returns result keeping the order of the rows in dist (following X_test).
Otherwise, the rows of dist follow a potentially random order which does not follow the order
of indices in X_test. However, computation is faster in this case (asynchronous parallel execution)
Returns
-------
dist : array, shape: (n_test_samples, n_train_samples)
Distances between all test samples and all train samples.
"""
n_samples_test = X_test.shape[0]
if X_train is None: X_train = self._fit_X
if (self.n_jobs == 1) or (n_samples_test < 2*self.n_jobs): # to avoid parallelism overhead for small test samples
dist = [ self._wmd_row( test_sample , X_train ) for test_sample in X_test ]
else:
if self.verbose:
print("WordMoversKNN set to use {} parallel processes".format(self.n_jobs))
if ordered:
dist = Parallel(n_jobs=self.n_jobs, verbose=self.verbose)( delayed(self._wmd_row) (test_sample, X_train) for test_sample in X_test)
else: # Asynchronous call is faster but returns results in random order
pool = mp.Pool(processes=self.n_jobs)
results = [pool.apply_async(self._wmd_row, args=(test_sample, X_train)) for test_sample in X_test]
dist = [p.get() for p in results]
return np.array(dist)
def calculate(self, X):
"""Predict the class labels for the provided data
Parameters
----------
X : scipy.sparse matrix, shape (n_test_samples, vocab_size)
Test samples.
Returns
-------
y : array of shape [n_samples]
Class labels for each data sample.
"""
X = check_array(X, accept_sparse='csr', copy=True)
X = normalize(X, norm='l1', copy=False)
dist = self._pairwise_wmd(X)
# A matrix of distances given to predict in combination with metric = 'precomputed'
# means that no more distance calculations take place. Neighbors are found simply by sorting
return super(WordMoversKNN, self).predict(dist)
The main problem was that for each row of matrix X_test
a new process was spawned, each time requiring passing the full X_train
as well as other variables (eg self.X_embed
) for every process. 主要问题是, 对于矩阵
X_test
的每一行 X_test
产生一个新进程,每次都需要为每个进程传递完整的X_train
以及其他变量(例如self.X_embed
)。 Pickling and dispatching these variables is very time consuming, because of their size. 由于它们的大小,酸洗和分派这些变量非常耗时。 I got a tremendous speed-up when I split the matrix
X_test
in n_jobs
chunks of size X_test.shape[0]//n_jobs
, overall spawning only n_jobs
processes and having to pass the variables n_jobs
times instead of X_test.shape[0]
times. 当我将矩阵
X_test
n_jobs
为大小为X_test.shape[0]//n_jobs
n_jobs
块时,我得到了极大的提速,总体上只产生了n_jobs
进程,并且必须将变量n_jobs
次而不是X_test.shape[0]
次传递。 However, because of the size of the variables that have to be communicated, I believe that for this type of problem, data parallelism is a much more appropriate approach than task parallelism, and I therefore intend to use mpi4py
, so that each process separately creates its own self.W_embed
, X_train
and X_test
matrices, communicating only the results of the computation. 但是,由于必须传达的变量的大小,我认为对于这种类型的问题,数据并行性比任务并行性更合适,因此我打算使用
mpi4py
,以便每个进程分别创建它自己的self.W_embed
, X_train
和X_test
矩阵,仅传达计算结果。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.