
Performance gain using MPI

I tested the performance gain of parallelizing the (nearly) "embarassingly parallel" (ie perfectly parallelizable) algorithm of summing up the first N integers: 我测试了并行化(几乎)“令人难以置信的并行”(即完全可并行化)算法的性能增益,该算法总结了前N整数:

The serial algorithm is simply:

N = 100000000
print sum(range(N))

Execution time on my dual core laptop (Lenovo X200): 0m21.111s.

The parallelized (with mpi4py) version uses 3 nodes; node 0 calculates the sum of the lower half of the integers, node 1 calculates the sum of the upper half. Both send their results (via comm.send) to node 2, which adds the two numbers and prints the result:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

N = 100000000

if rank == 0: 
  s = sum(range(N/2))
  comm.send(s,dest=2,tag=11)
elif rank == 1:
  s = sum(range(N/2+1,N))
  comm.send(s,dest=2,tag=11)
elif rank == 2:
  s1 = comm.recv(source=0, tag=11)
  s2 = comm.recv(source=1, tag=11)
  print s1+s2

Both cores of my dual-core laptop are fully used; execution time now: 15.746s.

My question: at least in theory, the execution time should nearly be halved. Which overhead eats the missing 4 seconds? (Surely not s1+s2.) Are those send/receive commands really that time-consuming?

Edit: After reading the answers and rethinking the question, I think the 4 seconds (in some runs even more than that) are eaten by the high memory traffic caused by generating two lists of length 50000000; the two cores of my laptop share a common memory (at least main memory; I think they have separate L2 caches), and exactly this is the bottleneck: very often, both cores want to access memory at the same time (to get the next list element) and one of them has to wait...

If I use xrange instead of range, the next list elements are generated lazily and little memory is allocated. I tested it, and running the same program as above with xrange takes just 11 seconds!

How are you doing the timing, and what's your laptop?

If you're doing the timing from the shell, you may be (as BiggAl suggests) hitting a delay just starting up python. That's real overhead and worth knowing about, but probably isn't your immediate concern. And I have trouble imagining that this contributes 4 seconds of overhead... [Edited to add: although BiggAl suggests it really may be, under Windows.]

I think a more likely concern is memory bandwidth limitation. While you are going to fully use both your cores with this setup, you only have so much memory bandwidth, and that may end up being the limitation here. Each core is trying to write a lot of data (the range(N/2)) and then read it back in (the sum) to do a fairly modest amount of computation (an integer addition), so I suspect computation isn't the bottleneck.

I ran your same setup using timeit on a Nehalem box with pretty good memory bandwidth per core, and did get the expected speedup:

from mpi4py import MPI
import timeit

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

N = 10000000

def parSum():
    if rank == 0:
        ...etc

def serSum():
    s = sum(range(N))

if rank == 0:
    print 'Parallel time:'
    tp = timeit.Timer("parSum()","from __main__ import parSum")
    print tp.timeit(number=10)

    print 'Serial time:'
    ts = timeit.Timer("serSum()","from __main__ import serSum")
    print ts.timeit(number=10)

from which I got

$ mpirun -np 3 python ./sum.py
Parallel time:
1.91955494881
Serial time:
3.84715008736

If you think it's a memory bandwidth issue, you can test that by making the computation artificially compute-heavy; say, using numpy and summing a more complicated function of the range: sum(numpy.sin(range(N/2+1,N))). That should tilt the balance from memory access to computation.
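
For instance, a minimal sketch of that compute-heavy variant (the use of numpy.arange and the smaller N are my own illustration, not the answerer's exact code); evaluating sin over every element makes the work CPU-bound instead of memory-bound:

from mpi4py import MPI
import numpy

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

N = 10000000   # smaller N: sin() is far more expensive per element than plain addition

if rank == 0:
    # lower half, but now each element costs a real floating-point sin()
    s = numpy.sin(numpy.arange(N/2)).sum()
    comm.send(s, dest=2, tag=11)
elif rank == 1:
    # upper half
    s = numpy.sin(numpy.arange(N/2, N)).sum()
    comm.send(s, dest=2, tag=11)
elif rank == 2:
    s1 = comm.recv(source=0, tag=11)
    s2 = comm.recv(source=1, tag=11)
    print s1 + s2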

In what follows, I assume you're using Python 2.x.

Depending on the hardware spec of your laptop, it is likely that there's heavy memory contention between processes 0 and 1.

range(100000000/2) creates a list that takes 1.5GB of RAM on my PC, so you're looking at 3GB of RAM between the two processes. Using two cores to iterate over the two lists will likely result in memory bandwidth issues (and/or swapping). This is the most likely cause of the imperfect parallelization.

Using xrange instead of range won't generate the lists and should parallelize a lot better by making the computation CPU-bound.

By the way, there's a bug in your code: the second (x)range should start at N/2, not N/2+1.
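
Putting the two suggestions together, a minimal sketch of the corrected, xrange-based version (still Python 2) might look like this; with the lazily generated ranges there are no 1.5GB lists to stream through memory:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

N = 100000000

if rank == 0:
    s = sum(xrange(N/2))            # lower half: 0 .. N/2-1
    comm.send(s, dest=2, tag=11)
elif rank == 1:
    s = sum(xrange(N/2, N))         # upper half: N/2 .. N-1, starting at N/2 as noted above
    comm.send(s, dest=2, tag=11)
elif rank == 2:
    s1 = comm.recv(source=0, tag=11)
    s2 = comm.recv(source=1, tag=11)
    print s1 + s2                   # 4999999950000000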

My question: at least in theory, the execution time should nearly be halved. Which overhead eats the missing 4 seconds?

Some thoughts:

  • Are you using Python 2? If so, use xrange, since it creates a lazy iterator-like object. It could save some time, because range builds a fully fledged list that it keeps adding to, whereas xrange doesn't. In Python 3, range is lazy by default. Likely this won't save you very much time/memory in practice, but the Python devs clearly thought it was worth making everything lazy, because that's one of the big changes in Python 3.
  • Theoretically the algorithm should be 2x faster. In practice, it is more complicated than that. There is a cost for setting up threads or processes at the start of the algorithm, which adds to your run time; and there is a cost for synchronising the result at the end (waiting on joins). So the 2x speedup will never actually be realised. For small inputs it is well known that serial algorithms outperform their threaded counterparts; it is only when the cost of thread creation becomes negligible compared to the work to be done that you notice a dramatic speed increase.
  • Balancing of work may be a problem. On a 32-bit system, the maximum size of number that fits into a register (and so is O(1) to add, given the size of the numbers) is 4294967296 (2^32). Your sum, at large values, is 4999999950000000. Bignum addition is O(n) in the number of limbs (elements in the array) that you need, so you hit a slowdown as soon as you start using bignums as opposed to anything you can handle in a single machine word:

     y = 0
     for x in xrange(1, 100000000):
         if (x+y) > 2**32:
             print "X is " + str(x)
             print "y is " + str(y)
             break
         else:
             y += x

    That shows you at what point the addition starts to become more expensive. I'd try timing the sum up to that value, and the sum of the values from there up to N, and then adjust your work queue so that you split at an appropriate point.

    Of course, on 64-bit systems you shouldn't notice this issue, since 2^64 is bigger than your total sum, unless Python internally does not use uint64_t. I would have thought it does (a quick check below shows what CPython 2 actually does).
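
    For reference, a quick check in a CPython 2 interpreter (the values shown assume a 64-bit Linux build) confirms that Python does not use a fixed uint64_t; it promotes int to an arbitrary-precision long transparently once the platform word size is exceeded:

     import sys
     print sys.maxint            # 9223372036854775807 on a 64-bit Linux build
     print type(sys.maxint)      # <type 'int'>
     print type(sys.maxint + 1)  # <type 'long'> -- promoted to a bignum automatically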

Please read this: Amdahl's Law.
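
As a rough illustration (the 90% parallel fraction below is an invented example value, not a measurement of this program), Amdahl's law caps the speedup by whatever fraction of the work stays serial:

# Amdahl's law: speedup <= 1 / ((1 - p) + p / n)
# p = parallel fraction of the work, n = number of workers
p = 0.9
n = 2
print 1.0 / ((1.0 - p) + p / n)   # ~1.82 -- already short of the hoped-for 2x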

Your OS includes a large number of non-parallelizable bottlenecks. Your language library may also have some bottlenecks.

Interestingly, your Intel hardware's memory write ordering may also impose some non-parallelizable bottlenecks.

Load balancing is one theory, and there is also going to be some obvious communication latency, but I wouldn't expect any of these, even in combination, to cause that great a performance loss. I would guess that your largest overhead is that of starting 2 more instances of the Python interpreter. Hopefully, if you experiment with larger numbers, you will find that the overhead does not in fact grow proportionally to N, but is actually a large constant plus a term dependent on N. For this reason you may want to stop the algorithm from going parallel for N below some threshold at which the performance improvement disappears.

I'm not intimately acquainted with MPI, however it may be that you are better off creating a pool of workers at the start of your application and having them wait for tasks, rather than creating them on the fly. This requires a more complex design, but only incurs the interpreter initialisation penalty once per application run.
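
A minimal sketch of that idea with mpi4py (the chunking scheme, the tag numbers and the None shutdown sentinel are my own assumptions, not a standard protocol): rank 0 hands out (start, stop) tasks and the other ranks loop until told to stop, so the interpreter start-up cost is paid only once per run:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

N = 100000000

if rank == 0:
    # split the interval into one chunk per worker and hand the chunks out
    chunk = N / (size - 1)                 # assumes at least one worker rank
    for p in range(1, size):
        start = (p - 1) * chunk
        stop = N if p == size - 1 else p * chunk
        comm.send((start, stop), dest=p, tag=1)
    total = sum(comm.recv(source=p, tag=2) for p in range(1, size))
    for p in range(1, size):
        comm.send(None, dest=p, tag=1)     # shutdown sentinel
    print total
else:
    # worker loop: keep accepting tasks until the sentinel arrives
    while True:
        task = comm.recv(source=0, tag=1)
        if task is None:
            break
        start, stop = task
        comm.send(sum(xrange(start, stop)), dest=0, tag=2)

With more tasks than workers, the same receive loop would double as a simple load balancer.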

I wrote a bit of code to test which bits of the MPI infrastructure take up time. This version of your code can use an arbitrary number of cores, from 1 to lots and lots. The work is divided up evenly amongst the cores and sent back to host 0 to be totalled. Host 0 also does work.

import time

t = time.time()
import pypar
print 'pypar init time', time.time()-t, 'seconds'

rank = pypar.rank()
hosts = pypar.size()

N = 100000000

nStart = (N/hosts) * rank
if rank==hosts-1:
    nStop = N
else:
    nStop = ( ((N/hosts) * (rank+1)) )
print rank, 'working on', nStart, 'to', nStop

t = time.time()
s = sum(xrange(nStart,nStop))
if rank == 0:
    for p in range(1,hosts):
        s += pypar.receive(p)
        pypar.send(s,p) 
else:
    pypar.send(s,0) 
    s = pypar.receive(0)
if rank==0:
    print rank, 'total', s, 'in', time.time()-t, 'seconds'
pypar.Finalize()

Results:

pypar init time 1.68600010872 seconds
1 working on 12500000 to 25000000
pypar init time 1.80400013924 seconds
2 working on 25000000 to 37500000
pypar init time 1.98699998856 seconds
3 working on 37500000 to 50000000
pypar init time 2.16499996185 seconds
4 working on 50000000 to 62500000
Pypar (version 2.1.4.7) initialised MPI OK with 8 processors
pypar init time 1.5720000267 seconds
0 working on 0 to 12500000
0 total 4999999950000000 in 1.40100002289 seconds
pypar init time 2.34000015259 seconds
6 working on 75000000 to 87500000
pypar init time 2.64600014687 seconds
7 working on 87500000 to 100000000
pypar init time 2.23900008202 seconds
5 working on 62500000 to 75000000

Starting up the pypar and MPI libraries takes about 2.5 seconds. Then the actual work takes 1.4 seconds to calculate and communicate back to host 0. Running it on a single core takes about 11 seconds. So using 8 cores scales nicely.

Starting mpiexec and python takes almost no time at all, as this pathetic test shows:

c:\Data\python speed testing>time  0<enter.txt
The current time is: 10:13:07.03
Enter the new time:

c:\Data\python speed testing>mpiexec -n 1 python printTime.py
time.struct_time(tm_year=2011, tm_mon=8, tm_mday=4, tm_hour=10, tm_min=13, tm_sec=7, tm_wday=3, tm_yday=216, tm_isdst=0)

Splitting the actual time to run the summation out from the time to set up the data and libraries yields good scaling of the performance improvement.

[Chart: runtime in seconds vs. number of hosts]

Probably it's bad load balancing: node 0 has less work than node 1, since summing the lower N/2 integers is faster than summing the upper N/2 integers. As a consequence, node 2 gets the message from node 0 quite early and has to wait relatively long for node 1.

EDIT: Sven Marnach is right; it's not the load balancing, since sum(range(N)) and sum(range(N,2*N)) take the same amount of time.
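
That check is easy to reproduce; a small timing sketch along these lines (xrange is used here so the check itself isn't dominated by building the lists, and the N is illustrative) shows the two halves cost essentially the same:

import timeit

N = 50000000   # an illustrative size, not the value from the question

setup = "from __main__ import N"
print timeit.timeit("sum(xrange(N))", setup, number=1)        # lower half
print timeit.timeit("sum(xrange(N, 2*N))", setup, number=1)   # upper half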
