Parallelizing for loop in Python 2.7

I'm very new to Python (and coding in general) and I need help parallelizing the code below. I looked around and found some packages (e.g. Multiprocessing and JobLib) which could be useful.

However, I have trouble using them in my example. My code creates an output file and updates it inside the loop(s). Therefore it is not directly parallelizable, so I think I need to make smaller files. After this, I could merge the files together.

I'm unable to find a way to do this. Could someone be so kind as to give me a decent start?

I appreciate any help. A code newbie

Code:

def delta(graph,n,t,nx,OutExt):
    fout_=open(OutExt+'Delta'+str(t)+'.txt','w')
    temp=nx.Graph(graph)
    for u in range(0,n):
        #print "stamp: "+str(t)+" node: "+str(u)
        for v in range(u+1,n):
            #print str(u)+"\t"+str(v)
            Stat = dict()
            temp.add_edge(u,v)
            MineDeltaGraphletTransitionsFromDynamicNetwork(graph,temp,Stat,u,v)
            for a in Stat:
                for b in Stat[a]:
                    fout_.write(str(t)+"\t"+str(u)+"\t"+str(v)+"\t"+str(a)+"\t"+str(b)+"\t"+str(Stat[a][b])+"\n")
            if not graph.has_edge(u,v):
                temp.remove_edge(u,v)
    del temp
    fout_.close()

As a start, find the part of the code that you want to be able to execute in parallel with something (perhaps with other invocations of that very same function). Then, figure out how to make this code not share mutable state with anything else.

Mutable state is the enemy of parallel execution. If two pieces of code are executing in parallel and share mutable state, you don't know what the outcome will be (and the outcome may be different each time you run the program). This is because you don't know what order the code from the parallel executions will run in. Perhaps the first will mutate something and then the second one will compute something. Or perhaps the second one will compute something and then the first one will mutate it. Who knows? There are solutions to that problem but they involve fine-grained locking and careful reasoning about what can change and when.
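To make the locking remark concrete, here is a minimal sketch using only the standard library threading module. The function name count_with_lock and the dict-based counter are made up for illustration and are not part of the question's code. Several threads update one shared counter, and a Lock ensures that each read-modify-write happens as a unit.

```python
import threading

def count_with_lock(n_threads, n_increments):
    """Increment one shared counter from several threads, guarded by a lock."""
    counter = {"value": 0}      # the shared mutable state
    lock = threading.Lock()

    def worker():
        for _ in range(n_increments):
            # Only one thread at a time may run the read-modify-write below.
            with lock:
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return counter["value"]
```

With the lock in place, count_with_lock(4, 1000) returns 4000 on every run. Without it, the total could come up short on some runs, which is exactly the kind of unpredictability described above.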

After you have an algorithm with a core that doesn't share mutable state, factor it into a separate function (turning locals into parameters).
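As a rough illustration of "turning locals into parameters", the row-formatting part of the question's loop could be pulled out into a function that receives everything it needs as arguments and returns its result instead of writing to a shared file. The name rows_for_pair is hypothetical, and the sketch assumes that Stat is a dict of dicts, as the indexing Stat[a][b] in the question suggests.

```python
def rows_for_pair(t, u, v, stat):
    """Build the output lines for one (u, v) pair.

    Everything the function needs arrives as a parameter and the result is
    returned, so the function touches no shared mutable state.
    """
    rows = []
    for a in sorted(stat):           # sorted only to make the order repeatable
        for b in sorted(stat[a]):
            fields = (t, u, v, a, b, stat[a][b])
            rows.append("\t".join(str(x) for x in fields))
    return rows
```

For example, rows_for_pair(1, 0, 2, {"g1": {"g2": 3}}) returns a list with the single line "1\t0\t2\tg1\tg2\t3". The caller decides what to do with the lines, such as writing them to a file once all workers have finished.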

Finally, use something like the threading module (if your computations are primarily in CPython extension modules with good GIL behavior) or the multiprocessing module (otherwise) to execute the algorithm core function (which you have abstracted out) at some level of parallelism.

The particular code example you've shared is a challenge because you use the NetworkX library and a lot of shared mutable state. Each iteration of your loop apparently depends on the results of the previous one. This is not obviously something you can parallelize. However, perhaps if you think about your goals more abstractly you will be able to think of a way to do it (remember, the key is to be able to express your algorithm without using shared mutable state).

Your function is called delta. Perhaps you can split your graph into sub-graphs and compute the deltas of each (which are now no longer shared) in parallel.

If the code within your outermost loop is concurrency-safe (I don't know whether it is), you could rewrite it like this for parallel execution:

from functools import partial
from multiprocessing import Pool

def do_one_step(nx, graph, n, t, OutExt, u):
    # Create a separate output file for this set of results.
    name = "{}Delta{}-{}.txt".format(OutExt, t, u)
    fout_ = open(name, 'w')
    temp = nx.Graph(graph)

    for v in range(u+1,n):
        Stat = dict()
        temp.add_edge(u,v)
        MineDeltaGraphletTransitionsFromDynamicNetwork(graph,temp,Stat,u,v)
        for a in Stat:
            for b in Stat[a]:
                fout_.write(str(t)+"\t"+str(u)+"\t"+str(v)+"\t"+str(a)+"\t"+str(b)+"\t"+str(Stat[a][b])+"\n")
        if not graph.has_edge(u,v):
            temp.remove_edge(u,v)
    fout_.close()

def delta(graph,n,t,nx,OutExt):
    pool = Pool()
    pool.map(
        partial(
            do_one_step,
            nx,
            graph,
            n,
            t,
            OutExt,
        ),
        range(0,n),
    )

This supposes that all of the arguments can be serialized across processes (required for any argument you pass to a function you call with multiprocessing). I suspect that nx and graph may be problems but I don't know what they are.
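One rough way to check the serialization concern ahead of time is to try pickling each argument, since multiprocessing relies on pickle to ship arguments to worker processes. The helper name is_picklable is made up for this sketch. If nx is the imported NetworkX module, as the call nx.Graph(graph) suggests (this is an assumption), it would fail this check, because module objects cannot be pickled by the standard pickle module.

```python
import pickle

def is_picklable(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
    except Exception:
        # pickle reports failures with several exception types
        # (PicklingError, TypeError, AttributeError), so catch broadly.
        return False
    return True
```

Plain data such as numbers, strings, lists and dicts passes the check, while module objects and lambdas do not. An argument that fails could be imported or rebuilt inside the worker function instead of being passed in.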

And again, this assumes it's actually correct to concurrently execute the inner loop.
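Since the rewrite above produces one file per value of u, the merge step mentioned in the question could look roughly like the sketch below. The function names merge_delta_files and demo are hypothetical. The part-file naming follows the "{}Delta{}-{}.txt" pattern used in do_one_step, and the demo writes two tiny stand-in files into a temporary directory rather than real results.

```python
import os
import shutil
import tempfile

def merge_delta_files(out_ext, t, n, merged_path):
    """Concatenate the per-u files, in increasing u, into merged_path."""
    with open(merged_path, "w") as merged:
        for u in range(n):
            part = "{}Delta{}-{}.txt".format(out_ext, t, u)
            with open(part) as src:
                shutil.copyfileobj(src, merged)

def demo():
    """Write two stand-in part files, merge them, and return the result."""
    workdir = tempfile.mkdtemp()
    try:
        out_ext = os.path.join(workdir, "run_")
        for u, text in enumerate(["0\t0\t1\n", "0\t1\t2\n"]):
            with open("{}Delta{}-{}.txt".format(out_ext, 0, u), "w") as f:
                f.write(text)
        merged_path = out_ext + "Delta0.txt"
        merge_delta_files(out_ext, 0, 2, merged_path)
        with open(merged_path) as f:
            return f.read()
    finally:
        shutil.rmtree(workdir)   # remove the temporary files again
```

Merging in increasing u keeps the rows in the same order the original sequential loop would have written them. A missing part file raises an error here on purpose, since it would indicate that a worker did not finish.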

It is best to use pool.map. Here is a simple example of how multiprocessing works with a pool:

Single-threaded, basic function:

def f(x):
    return x*x

if __name__ == '__main__':
    print(map(f, [1, 2, 3]))

>> [1, 4, 9]

Using multiple processors:

from multiprocessing import Pool 

def f(x):
    return x*x

if __name__ == '__main__':
    p = Pool(3)  # one pool with 3 worker processes
    print(p.map(f, [1, 2, 3]))

Using 1 processor (a thread pool inside a single process):

from multiprocessing.pool import ThreadPool as Pool 

def f(x):
    return x*x

if __name__ == '__main__':
    p = Pool(3)  # one pool with 3 worker threads
    print(p.map(f, [1, 2, 3]))

When you use map, you can easily get a list back from the results of your function.
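Because map hands back a list, one option is to let each worker return its rows and have the parent write a single file afterwards, which avoids the separate output files altogether. The sketch below uses the ThreadPool shown above so that it runs anywhere. The worker rows_for and the helpers gather_rows and demo are hypothetical stand-ins, and the u * v value is a placeholder for a real computation.

```python
from multiprocessing.pool import ThreadPool

def rows_for(u):
    """Hypothetical worker: return rows for one u instead of writing a file."""
    return [(u, v, u * v) for v in range(u + 1, 4)]

def gather_rows(pool, us):
    """Run rows_for over us and flatten the per-u lists into one list."""
    nested = pool.map(rows_for, us)   # results come back in input order
    return [row for rows in nested for row in rows]

def demo():
    pool = ThreadPool(2)
    try:
        return gather_rows(pool, range(3))
    finally:
        pool.close()
        pool.join()
```

Since map preserves the input order, the gathered rows come out in the same order a plain loop would produce. To use separate processes, the same gather_rows could be given a multiprocessing Pool created under the if __name__ == '__main__' guard, provided the worker and its return values can be pickled.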
