
“bucketsort” with Python's multiprocessing

I have a data series with a uniform distribution. I wish to exploit the distribution to sort the data in parallel. For N CPUs, I essentially define N buckets and sort the buckets in parallel. My problem is that I do not get a speed-up.

What is wrong?

from multiprocessing import Process, Queue
from numpy import array, linspace, arange, where, cumsum, zeros
from numpy.random import rand
from time import time


def my_sort(x,y):
    y.put(x.get().argsort())

def my_par_sort(X, np):
    p_list = []
    Xq = Queue()
    Yq = Queue()
    bmin = linspace(X.min(), X.max(), np+1)   #bucket lower bounds
    bmax = array(bmin); bmax[-1] = X.max()+1  #bucket upper bounds
    B = []
    Bsz = [0]
    for i in range(np):
        b = array([bmin[i] <= X, X < bmax[i+1]]).all(0)
        B.append(where(b)[0])
        Bsz.append(len(B[-1]))
        Xq.put(X[b])
        p = Process(target=my_sort, args=(Xq, Yq))
        p.start()
        p_list.append(p)

    Bsz = cumsum(Bsz).tolist()
    Y = zeros(len(X))
    for i in range(np):
        Y[arange(Bsz[i], Bsz[i+1])] = B[i][Yq.get()]
        p_list[i].join()

    return Y


if __name__ == '__main__':
    num_el = 1e7
    mydata = rand(num_el)
    np = 4  #multiprocessing.cpu_count()
    starttime = time()
    I = my_par_sort(mydata, np)
    print "Sorting %0.0e keys took %0.1fs using %0.0f processes" % (len(mydata), time()-starttime, np)
    starttime = time()
    I2 = mydata.argsort()
    print "in serial it takes %0.1fs" % (time()-starttime)
    print (I==I2).all()

It looks like your problem is the amount of overhead you're adding when you break the original array into pieces. I took your code, and just removed all usage of multiprocessing:

def my_sort(x,y): 
    pass
    #y.put(x.get().argsort())

def my_par_sort(X,np, starttime):
    p_list=[]
    Xq = Queue()
    Yq = Queue()
    bmin = linspace(X.min(),X.max(),np+1) #bucket lower bounds
    bmax = array(bmin); bmax[-1] = X.max()+1 #bucket upper bounds
    B = []
    Bsz = [0] 
    for i in range(np):
        b = array([bmin[i] <= X, X < bmax[i+1]]).all(0)
        B.append(where(b)[0])
        Bsz.append(len(B[-1]))
        Xq.put(X[b])
        p = Process(target=my_sort, args=(Xq, Yq))
        p.start()
        p_list.append(p)
    return

if __name__ == '__main__':
    num_el = 1e7 
    mydata = rand(num_el)
    np = 4 #multiprocessing.cpu_count()
    starttime = time()
    I = my_par_sort(mydata,np, starttime)
    print "Sorting %0.0e keys took %0.1fs using %0.0f processes" % (len(mydata),time()-starttime,np)
    starttime = time()
    I2 = mydata.argsort()
    print "in serial it takes %0.1fs" % (time()-starttime)
    #print (I==I2).all()

With absolutely no sorting happening, the multiprocessing code takes just as long as the serial code:

Sorting 1e+07 keys took 2.2s using 4 processes
in serial it takes 2.2s

You may be thinking that the overhead of starting processes and passing values between them is the cause of the slowdown, but if I remove all usage of multiprocessing, including the Xq.put(X[b]) call, it ends up being only slightly faster:

Sorting 1e+07 keys took 1.9s using 4 processes
in serial it takes 2.2s

So it seems you need to investigate a more efficient way of breaking your array into pieces.
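To put a rough number on that, here is a minimal standalone sketch (not part of the answer above; the names bounds and nproc and the 10**7 array are my own choices) that times only the mask-based bucketing from the question against a single argpartition-based split of the same data:

from time import time
from numpy import linspace
from numpy.random import rand

X = rand(10**7)
nproc = 4
bounds = linspace(X.min(), X.max(), nproc + 1)
bounds[-1] += 1  # make the last bucket include the maximum

# time only the mask-based bucketing from the question (no sorting, no processes)
t0 = time()
buckets = [X[(bounds[i] <= X) & (X < bounds[i + 1])] for i in range(nproc)]
print("mask-based bucketing: %0.2fs" % (time() - t0))

# time an argpartition-based split of the same data for comparison
t0 = time()
k = list(range(0, len(X) + 1, len(X) // nproc))
I = X.argpartition(k[1:-1])
print("argpartition split:   %0.2fs" % (time() - t0))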

In my view there are two main problems.

  1. The overhead of multiple processes and communicating between them

    Spawning a couple of Python interpreters causes some overhead, but mainly passing data to and from the "worker" processes is what kills performance. Data that you pass through the Queue needs to be "pickled" and "unpickled", which is somewhat slow for larger data (and you need to do this twice).

    You don't need to use Queues if you use threads instead of processes. Using threads in CPython for CPU-heavy tasks is often regarded as inefficient, because generally you will run into the Global Interpreter Lock, but not always! Luckily, NumPy's sorting functions seem to release the GIL, so using threads is a viable option!

  2. The partitioning and joining of the dataset

    Partitioning and joining the data is an inevitable cost of this "bucketsort approach", but it can be relieved somewhat by doing it more efficiently. In particular, these two lines of code:

     b = array([bmin[i] <= X, X < bmax[i+1]]).all(0)
     Y[arange(Bsz[i],Bsz[i+1])] = ...

    can be rewritten to:

     b = (bmin[i] <= X) & (X < bmax[i+1])
     Y[Bsz[i] : Bsz[i+1]] = ...

    Improving things some more, I also found np.take to be faster than "fancy indexing", and np.partition to be useful as well (a short sketch follows this list).
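As a rough standalone illustration of those last two points (my own sketch, not part of the original answer; the array X and the permutation I below are made up for the example):

from numpy import partition
from numpy.random import rand, permutation

X = rand(10**6)
I = permutation(len(X))

# "fancy indexing" and take produce the same result; take is often a bit faster
assert (X[I] == X.take(I)).all()

# partition moves the element at position k into its sorted place, with everything
# smaller to its left and everything larger to its right -- exactly the bucket
# property needed here, without paying for a full sort
k = len(X) // 2
P = partition(X, k)
assert P[:k].max() <= P[k] <= P[k:].min()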

Summarizing, the fastest version I could make is the following (but it still doesn't scale linearly with the number of cores, like you would want):

from threading import Thread

def par_argsort(X, nproc):
    N = len(X)
    k = range(0, N+1, N//nproc)
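    # argpartition returns indices I such that X.take(I) is split at the positions in
    # k[1:-1]: every value in one bucket is <= every value in the next, without a full sort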
    I = X.argpartition(k[1:-1])
    P = X.take(I)

    def worker(i):
        s = slice(k[i], k[i+1])
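        # sort bucket i by reordering its slice of I in place according to its values in P;
        # NumPy's argsort releases the GIL, so the worker threads can actually run in parallel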
        I[s].take(P[s].argsort(), out=I[s])

    t_list = []
    for i in range(nproc):
        t = Thread(target=worker, args=(i,))
        t.start()
        t_list.append(t)

    for t in t_list:
        t.join()

    return I
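For completeness, a short usage sketch of par_argsort (my own addition, assuming the function above is in scope; the array size and the final check are illustrative):

from time import time
from numpy.random import rand

if __name__ == '__main__':
    X = rand(10**7)

    t0 = time()
    I_par = par_argsort(X, 4)
    print("threaded argsort: %0.2fs" % (time() - t0))

    t0 = time()
    I_ser = X.argsort()
    print("serial argsort:   %0.2fs" % (time() - t0))

    # both index arrays put X into the same sorted order
    assert (X[I_par] == X[I_ser]).all()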
