python multiprocessing shared Counter, pickling error
I'm using multiprocessing to process some very large files.

I count occurrences of particular strings using a collections.Counter, which is shared between processes via a multiprocessing.BaseManager subclass.

Although I can share the Counter and it seems to get pickled, it does not get pickled properly. I can copy the dictionary into a new dictionary, and that one pickles fine.

I am trying to understand how to avoid "copying" the shared counter before pickling it.

Here is my (pseudo)code:
import multiprocessing
import pickle
from multiprocessing.managers import BaseManager
from collections import Counter

class MyManager(BaseManager):
    pass

MyManager.register('Counter', Counter)

def main(glob_pattern):
    # function that processes files
    def worker_process(files_split_to_allow_naive_parallelization, mycounterdict):
        # code that loops through files
        for line in file:
            # code that processes line
            my_line_items = line.split()
            index_for_read = (my_line_items[0], my_line_items[6])
            mycounterdict.update((index_for_read,))

    manager = MyManager()
    manager.start()
    mycounterdict = manager.Counter()

    # code to get glob files, split them with unix shell split and then chunk them
    procs = []
    for i in range(NUM_PROCS):
        p = multiprocessing.Process(
            target=worker_process,
            args=(all_index_file_tuples[chunksize * i:chunksize * (i + 1)], mycounterdict))
        procs.append(p)
        p.start()

    # Now we "join" the processes
    for p in procs:
        p.join()

    # This is the part I have trouble with
    # This yields a pickled file that fails with an error
    pickle.dump(mycounterdict, open("Combined_count_gives_error.p", "wb"))

    # This however works
    # How can I avoid doing it this way?
    mycopydict = Counter()
    mycopydict.update(mycounterdict.items())
    pickle.dump(mycopydict, open("Combined_count_that_works.p", "wb"))
When I try to load the pickled file that gave the error (it always comes out a fixed, smaller size), I get an error that doesn't make sense.

How can I pickle the shared dict without creating a new dict, as in the pseudocode above?
>>> p = pickle.load(open("Combined_count_gives_error.p"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1378, in load
return Unpickler(file).load()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 858, in load
dispatch[key](self)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 880, in load_eof
raise EOFError
EOFError
There are several problems with your code. First of all, you're not guaranteed to close your file if you leave it dangling. Secondly, mycounterdict is not an actual Counter but a proxy over it: pickle it and you'll run into lots of problems, because it is unpicklable outside of this process. However, you don't need to copy with update either: .copy makes a new Counter copy.
So you should use:
    with open("out.p", "wb") as f:
        pickle.dump(mycounterdict.copy(), f)
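To see the distinction concretely, here is a minimal sketch (reusing the manager setup from the question; the variable names are illustrative) showing that the managed object is a proxy, while .copy() hands back a real, picklable Counter:

    from multiprocessing.managers import BaseManager
    from collections import Counter

    class MyManager(BaseManager):
        pass

    MyManager.register('Counter', Counter)

    if __name__ == '__main__':
        manager = MyManager()
        manager.start()
        shared = manager.Counter()
        shared.update(['a', 'b', 'a'])
        print(type(shared))          # an AutoProxy, not a Counter
        print(type(shared.copy()))   # <class 'collections.Counter'>

Pickling the proxy serializes a reference to the manager process rather than the underlying counts, which is why the dumped file stays small and cannot be loaded back meaningfully.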
As to whether this is a good pattern: the answer is no. Instead of using a shared counter, you should count separately in each process, which gives simpler code:
    from multiprocessing import Pool
    from collections import Counter
    import pickle

    def calculate(file):
        counts = Counter()
        ...
        return counts

    pool = Pool(processes=NPROCS)

    counts = Counter()
    for result in pool.map(calculate, files):
        counts += result

    with open("out.p", "wb") as f:
        pickle.dump(counts, f)
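For instance, calculate could be filled in with the line-splitting logic from the question. A sketch, assuming the same whitespace-delimited input and column indices as the question's worker:

    from collections import Counter

    def calculate(path):
        counts = Counter()
        with open(path) as f:
            for line in f:
                items = line.split()
                # count the (column 0, column 6) pair, as in the question's worker
                counts[(items[0], items[6])] += 1
        return counts

Each worker returns a plain Counter, which pickles cleanly when Pool sends it back to the parent, so no manager or proxy is involved at all.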