
Share a dictionary storing objects between several processes in Python

I'm working on a large script whose main purpose is to read the contents of many files and store the count of each element in a dictionary. If an element is absent from the dictionary, we create a new instance of some object and then increment it; otherwise we only increment. Since each of the files to process is huge and I sometimes need to process 100+ of them, I wanted to speed things up a little by taking advantage of Python's multiprocessing module. Here is a heavily simplified version of the script (I hid the path with ..., it's not the real one):

import multiprocessing as mp
from os import listdir
from os.path import join

manager = mp.Manager()
queue = manager.Queue()
dictionary = manager.dict()

class TestClass:
    def __init__(self):
        self._number = 0

    def increment(self):
        self._number += 1

def worker(file):
    f = open(file, 'r')
    for line in f.readlines():
        if line not in dictionary:
            dictionary[line] = TestClass()

        dictionary[line].increment()

def _list_files():
    for f in listdir("..."):
        queue.put(join("...", f))

def pool():
    _list_files()
    _pool = mp.Pool(mp.cpu_count())    

    for i in range(len(queue)):
        _pool.apply(worker, args=(queue.get()))

    _pool.close()
    _pool.join()

pool()
print(dictionary)

The problem is that the script crashes with the message:

AttributeError: Can't get attribute 'TestClass' on <module '__main__' from '.../multiprocessing_test.py'>  

Is there any way that I can get this to work?
I'm not the one who created the initial version of the script; I'm just adding some functionality to it. Given that, the structure of the script must stay the same, because rewriting it would take too much time. That is, TestClass, worker and _list_files can't change their structure (except for anything connected with multiprocessing).

(It seems like you posted this question before.)

Your example code is nonfunctional for a bunch of reasons, not least of which is that ... just does not do anything useful:

$ python tst.py
Traceback (most recent call last):
  File "tst.py", line 38, in <module>
    pool()
  File "tst.py", line 29, in pool
    _list_files()
  File "tst.py", line 25, in _list_files
    for f in listdir("..."):
OSError: [Errno 2] No such file or directory: '...'

(It's not good form to post code that won't run, but it is a good idea to provide an MCVE.) So I fixed that:

index 39014ff..1ac9f4a 100644
--- a/tst.py
+++ b/tst.py
@@ -2,6 +2,8 @@ import multiprocessing as mp
 from os import listdir
 from os.path import join

+DIRPATH = 'inputs'
+
 manager = mp.Manager()
 queue = manager.Queue()
 dictionary = manager.dict()
@@ -22,8 +24,8 @@ def worker(file):
         dictionary[line].increment()

 def _list_files():
-    for f in listdir("..."):
-        queue.put(join("...", f))
+    for f in listdir(DIRPATH):
+        queue.put(join(DIRPATH, f))

 def pool():
     _list_files()

and created an inputs/ directory with one sample input file:

$ ls inputs
one
$ cat inputs/one
1
one
unum

and now this example produces:

$ python tst.py
Traceback (most recent call last):
  File "tst.py", line 40, in <module>
    pool()
  File "tst.py", line 34, in pool
    for i in range(len(queue)):
TypeError: object of type 'AutoProxy[Queue]' has no len()
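
That TypeError arises because the managed queue is a proxy for a queue.Queue, which defines no __len__. The next line of the original loop has a second problem: args=(queue.get()) passes a bare string, not a one-element tuple, because parentheses alone do not create a tuple. The sketch below is only an illustration. It uses a plain queue.Queue as a stand-in for the managed proxy, and the file names are made up. It shows both points, plus one way to drain a queue with a None sentinel instead of asking for its length.

```python
import queue

# Stand-in for manager.Queue(): the managed queue is a proxy for a
# queue.Queue, which likewise has no __len__.
q = queue.Queue()
for name in ("a.txt", "b.txt"):  # hypothetical file names
    q.put(name)

try:
    len(q)
    len_supported = True
except TypeError:
    len_supported = False

# Parentheses alone do not make a tuple; the trailing comma does.
not_a_tuple = ("a.txt")
one_tuple = ("a.txt",)

# Drain the queue without len(): put a None sentinel at the end and
# keep calling get() until the sentinel comes back.
q.put(None)
drained = list(iter(q.get, None))
```

Note also that Pool.apply blocks until the call has finished, so even with those two lines repaired the original loop would process the files one at a time.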

Now, I won't claim that this rewrite is good, but I went ahead and rewrote this into something that does work:

import multiprocessing as mp
from os import listdir
from os.path import join

DIRPATH = 'inputs'

class TestClass:
    def __repr__(self):
        return str(self._number)

    def __init__(self):
        self._number = 0

    def increment(self):
        self._number += 1

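def worker(dictionary, queue):
    # Pull file names until the None sentinel arrives.
    while True:
        path = queue.get()
        if path is None:
            return
        with open(path, 'r') as f:
            for line in f:
                # dictionary[line] returns a copy of the stored object, so
                # the updated copy has to be assigned back to be kept.
                # This read-modify-write is not atomic across workers.
                entry = dictionary.get(line)
                if entry is None:
                    entry = TestClass()
                entry.increment()
                dictionary[line] = entry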

def run_pool():
    manager = mp.Manager()
    queue = manager.Queue()
    dictionary = manager.dict()
    nworkers = mp.cpu_count()
    pool = mp.Pool(nworkers)

    for i in range(nworkers):
        pool.apply_async(worker, args=(dictionary, queue))

    for f in listdir(DIRPATH):
        queue.put(join(DIRPATH, f))
    for i in range(nworkers):
        queue.put(None)

    pool.close()
    pool.join()

    return dictionary

def main():
    dictionary = run_pool()
    print(dictionary)

if __name__ == '__main__':
    main()

The main differences are:

  • I removed all the global variables. The manager instance, the managed queue, and the managed dictionary are all local to run_pool.

  • I put the names of the files into the queue after creating nworkers workers. Each worker runs a loop, reading file names, until it reads a None name, then returns its (None) result.

  • The main loop drops the file names into the queue, so that workers can pull file names out of the queue as they finish each previous file. To signal all nworkers workers to exit, the main loop adds that many None entries to the queue.

  • run_pool returns the final (still managed) dictionary.

And of course I added a __repr__ to your TestClass object so that we can see the counts. I also made sure that the code should work on Windows by moving the main driver into a function, called only if __name__ == '__main__'.
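
One thing to watch when reading those counts: a value fetched from a managed dictionary is a copy, because objects cross the process boundary by being pickled. Calling a mutating method on the fetched object therefore changes only the copy, unless the object is assigned back. The sketch below imitates that behaviour without starting any processes. CopyingDict is a hypothetical stand-in written for this illustration, and copy.deepcopy stands in for the pickle round trip; neither is part of multiprocessing.

```python
import copy


class TestClass:
    """Same shape as the class in the question."""

    def __init__(self):
        self._number = 0

    def increment(self):
        self._number += 1


class CopyingDict:
    """Hypothetical stand-in for a managed dict: every get and set copies."""

    def __init__(self):
        self._data = {}

    def __getitem__(self, key):
        return copy.deepcopy(self._data[key])

    def __setitem__(self, key, value):
        self._data[key] = copy.deepcopy(value)


d = CopyingDict()
d["one\n"] = TestClass()

# Mutating the fetched copy leaves the stored object untouched.
d["one\n"].increment()
lost = d["one\n"]._number

# Fetch, mutate, then assign back so that the change is stored.
entry = d["one\n"]
entry.increment()
d["one\n"] = entry
kept = d["one\n"]._number
```

With a real manager.dict() the same fetch, mutate, assign-back sequence applies. If several workers can update the same key, the sequence is not atomic and would also need a lock, for example one obtained from manager.Lock().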
