
Using threads for a for-loop in Python

item_list = [("a", 10, 20), ("b", 25, 40), ("c", 40, 100), ("d", 45, 90),
             ("e", 35, 65), ("f", 50, 110)]  # (name, weight, value)
results = [("", 0, 0)]  # placeholder best result (empty name, zero weight
                        # and value) that new candidates are compared against

class Rucksack(object):
    def __init__(self, B):
        self.B = B   #B=maximum weight
        self.pack(item_list, 0, ("", 0, 0))

    def pack(self, items, n, current):
        n += 1   # n is incremented to stop the recursion once all
                 # items have been considered
        if n >= len(items) - 1:
            if current[2] > results[0][2]:
                #substitutes the result, if current is bigger and starts no
                #new recursion
                results[0] = current
        else:
            for i in items:
                if current[1] + i[1] <= self.B and i[0] not in current[0]:
                    #first condition: current + the new value is not bigger
                    #than B; 2nd condition: the new value is not the same as
                    #current
                    i = (current[0] + " " + i[0], current[1] + i[1],
                         current[2] + i[2])
                    self.pack(items, n, i)
                else:
                    #substitutes the result, if current is bigger and starts no
                    #new recursion
                    if current[2] > results[0][2]:
                        results[0] = current

rucksack1 = Rucksack(100)

This is a small algorithm for the knapsack problem. I have to parallelize the code somehow, but so far I don't understand the thread module. I think the only place where parallelization can be applied is the for-loop, right? So I tried this:

def run(self, items, i, n, current):
    global num_threads, thread_started
    lock.acquire()
    num_threads += 1
    thread_started = True
    lock.release()
    if current[1] + i[1] <= self.B and i[0] not in current[0]:
        i = (current[0] + " " + i[0], current[1] + i[1], current[2] + i[2])
        self.pack(items, n, i)
    else:
        if current[2] > results[0][2]:
            results[0] = current
    lock.acquire()
    num_threads -= 1
    lock.release() 

but the results are strange. Nothing happens, and if I trigger a KeyboardInterrupt the result is correct, but that's definitely not the point of the implementation. Can you tell me what is wrong with the second code, or where else I could soundly use parallelization? Thanks.

First, since your code is CPU-bound, you will get very little benefit from using threads for parallelism, because of the GIL, as bereal explains; you want multiple processes instead. Fortunately, there are only a few differences between threads and processes: basically, all shared data must be passed or shared explicitly (see Sharing state between processes for details).

Second, if you want to data-parallelize your code, you have to lock all access to mutable shared objects. From a quick glance, while items and current look immutable, the results object is a shared global that you modify all over the place. If you can change your code to return a value up the chain, that's ideal. If not, accumulating a bunch of separate return values and merging them after processing has finished is usually good too. If neither is feasible, you will need to guard all access to results with a lock. See Synchronization between processes for details.
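As a sketch of the "return a value up the chain" approach: pack_best below is a hypothetical rewrite, not the original code. It keeps the question's depth counter but returns the best (name, weight, value) triple it finds instead of mutating the global results, so there is nothing left that needs locking:

```python
# Hypothetical rewrite of pack(): no shared state; each call returns the
# best (name, weight, value) triple found in its subtree, and the caller
# merges the candidates with max().
def pack_best(items, B, n=0, current=("", 0, 0)):
    n += 1                       # same depth counter as the question's code
    if n >= len(items) - 1:
        return current
    best = current
    for name, weight, value in items:
        if current[1] + weight <= B and name not in current[0]:
            candidate = (current[0] + " " + name,
                         current[1] + weight,
                         current[2] + value)
            best = max(best, pack_best(items, B, n, candidate),
                       key=lambda t: t[2])
    return best

item_list = [("a", 10, 20), ("b", 25, 40), ("c", 40, 100), ("d", 45, 90),
             ("e", 35, 65), ("f", 50, 110)]
print(pack_best(item_list, 100))  # (' a c f', 100, 230)
```

Because nothing is shared, each process (or thread) can run pack_best on its own inputs and the parent only has to take the max over the returned triples.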

Finally, you ask where to put the parallelism. The key is to find the right dividing line between independent tasks.

Ideally you want to find a large number of mid-sized jobs that you can queue up, and just have a pool of processes, each picking up the next one. From a quick glance, the obvious places to do that are either the recursive call to self.pack, or each iteration of the for i in items: loop. If they actually are independent, just use concurrent.futures, as in the ProcessPoolExecutor example. (If you're on Python 3.1 or earlier, you need the futures backport module, because concurrent.futures is not in the stdlib until 3.2.)

If there's no easy way to do this, it is often at least possible to create a small number (N, or 2N, if you have N cores) of long-running jobs of about equal size, and just give each one its own multiprocessing.Process. For example:

n = 8
procs = [Process(target=rucksack1.pack,
                 args=(item_list[i*len(item_list)//n:(i+1)*len(item_list)//n], 0, ("", 0, 0)))
         for i in range(n)]
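One thing that sketch glosses over: with separate processes, the global results list is not shared, so each worker has to send its answer back explicitly. Below is a hypothetical sketch using a multiprocessing.Queue; best_in_chunk is a made-up stand-in for a reworked pack that reports its result instead of writing to a global:

```python
# Hypothetical sketch: each process searches one chunk of the items and puts
# its best (name, weight, value) triple on a queue; the parent merges the
# per-chunk answers with max(). best_in_chunk is a deliberately simplified
# stand-in (it only picks the single best item in its chunk).
from multiprocessing import Process, Queue

def best_in_chunk(chunk, B, out):
    best = ("", 0, 0)
    for name, weight, value in chunk:
        if weight <= B and value > best[2]:
            best = (name, weight, value)
    out.put(best)

if __name__ == "__main__":
    item_list = [("a", 10, 20), ("b", 25, 40), ("c", 40, 100),
                 ("d", 45, 90), ("e", 35, 65), ("f", 50, 110)]
    n = 2
    chunk = len(item_list) // n
    q = Queue()
    procs = [Process(target=best_in_chunk,
                     args=(item_list[i*chunk:(i+1)*chunk], 100, q))
             for i in range(n)]
    for p in procs:
        p.start()
    merged = [q.get() for _ in procs]
    for p in procs:
        p.join()
    print(max(merged, key=lambda t: t[2]))  # ('f', 50, 110)
```

The same shape (workers push partial results, parent merges) works for the real recursive search as well.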

One last note: If you finish your code and it looks like you've gotten away with implicitly sharing globals, what you've actually done is written code that usually-but-not-always works on some platforms, and never works on others. See the Windows section of the multiprocessing docs to see what to avoid, and, if possible, test regularly on Windows, because it's the most restrictive platform.
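The main thing to avoid there is starting processes at import time. A minimal sketch of the usual guard (the work function and its argument are made up for illustration):

```python
# The "spawn" start method (the default on Windows) re-imports this module
# in every child process, so anything that starts processes must be kept
# out of module scope, behind the __main__ guard.
from multiprocessing import Process, Queue

def work(x, out):
    out.put(x * 2)   # made-up work, just for illustration

if __name__ == "__main__":
    q = Queue()
    p = Process(target=work, args=(21, q))
    p.start()
    print(q.get())   # 42, computed in the child process
    p.join()
```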


You also ask a second question:

Can you tell me what is wrong with the second code.

It's not entirely clear what you were trying to do here, but there are a few obvious problems (beyond what's mentioned above).

  • You don't create a thread anywhere in the code you showed us. Just creating variables with "thread" in the name doesn't give you parallelism. And neither does adding locks: if you don't have any threads, all locks can do is slow you down for no reason.
  • From your description, it sounds like you were trying to use the thread module instead of threading. There's a reason the very top of the thread documentation tells you not to use it and to use threading instead.
  • You have a lock protecting your thread count (which shouldn't be needed at all), but no lock protecting your results. You will get away with this in most cases in Python (because of the same GIL issue mentioned above: your threads are basically not going to run concurrently, and therefore they're not going to have races), but it's still a very bad idea (especially if you don't understand exactly what those "most cases" are).
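To illustrate that last point, here is a minimal sketch of guarding the shared results with a threading.Lock, so the compare-and-replace is atomic (the candidate tuples are made up):

```python
# Minimal sketch: every check-and-update of the shared `results` happens
# while holding the lock, so no two threads can interleave the comparison
# and the assignment.
import threading

results = [("", 0, 0)]
results_lock = threading.Lock()

def report(current):
    with results_lock:
        if current[2] > results[0][2]:
            results[0] = current

# Made-up candidate tuples, just to exercise the lock from several threads.
threads = [threading.Thread(target=report, args=(("x %d" % i, i, i * 10),))
           for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results[0])  # ('x 4', 4, 40): the candidate with the highest value
```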

However, it looks like your run function is based on the body of the for i in items: loop in pack. If that's a good place to parallelize, you're in luck, because creating a parallel task out of each iteration of a loop is exactly what futures and multiprocessing are best at. For example, this code:

results = []
for i in items:
    result = dostuff(i)
    results.append(result)

… can, of course, be written as:

results = list(map(dostuff, items))

And it can be trivially parallelized, without even having to understand what futures are about, as:

pool = concurrent.futures.ProcessPoolExecutor()
results = list(pool.map(dostuff, items))
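For completeness, a runnable version of that sketch: dostuff here is a made-up placeholder for the real per-item work (the pool needs a module-level, picklable function), and the pool itself is created under the __main__ guard discussed above:

```python
import concurrent.futures

def dostuff(item):
    # Placeholder for the real per-item work; pool workers need a
    # module-level, picklable function like this one.
    name, weight, value = item
    return value - weight

if __name__ == "__main__":
    items = [("a", 10, 20), ("b", 25, 40), ("c", 40, 100)]
    with concurrent.futures.ProcessPoolExecutor() as pool:
        results = list(pool.map(dostuff, items))
    print(results)  # [10, 15, 60]
```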
