I am using multiprocessing to process some very large files.
I can count occurrences of a particular string using a collections.Counter that is shared between the processes via a multiprocessing.BaseManager subclass.
Although I can share the Counter, and it seems picklable, it does not get pickled properly. I can copy its contents into a new dictionary, and that copy pickles fine.
I am trying to understand how to avoid this "copy" of the shared counter before pickling it.
Here is my (pseudo)code:
import multiprocessing
import pickle
from multiprocessing.managers import BaseManager
from collections import Counter

class MyManager(BaseManager):
    pass

MyManager.register('Counter', Counter)

# function that processes files
def worker_process(files_split_to_allow_naive_parallelization, mycounterdict):
    # code that loops through files
    for line in file:
        # code that processes line
        my_line_items = line.split()
        index_for_read = (my_line_items[0], my_line_items[6])
        mycounterdict.update((index_for_read,))

def main(glob_pattern):
    manager = MyManager()
    manager.start()
    mycounterdict = manager.Counter()
    # code to get glob files, split them with unix shell split and then chunk them
    procs = []
    for i in range(NUM_PROCS):
        p = multiprocessing.Process(target=worker_process,
                                    args=(all_index_file_tuples[chunksize * i:chunksize * (i + 1)], mycounterdict))
        procs.append(p)
        p.start()
    # Now we "join" the processes
    for p in procs:
        p.join()
    # This is the part I have trouble with
    # This yields a pickled file that fails with an error
    pickle.dump(mycounterdict, open("Combined_count_gives_error.p", "wb"))
    # This however works -- how can I avoid doing it this way?
    mycopydict = Counter()
    mycopydict.update(mycounterdict.items())
    pickle.dump(mycopydict, open("Combined_count_that_works.p", "wb"))
When I try to load the pickled file that gives the error (it is always a smaller, fixed size), I get an error that does not make sense.
How do I pickle the shared dict without creating a fresh dict as in the pseudocode above?
>>> p = pickle.load(open("Combined_count_gives_error.p"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1378, in load
return Unpickler(file).load()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 858, in load
dispatch[key](self)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 880, in load_eof
raise EOFError
EOFError
There are several problems with your code. First, you are not guaranteed to close the file if you leave it dangling. Second, mycounterdict is not an actual Counter but a proxy over it; pickle it and you will run into many problems, because it is unpicklable outside this process. However, you do not need to copy with update() either: .copy() makes a new Counter copy of it.
Thus you should use
with open("out.p", "wb") as f:
    pickle.dump(mycounterdict.copy(), f)
As for whether this is a good pattern, the answer is no. Instead of using a shared counter, you should count separately in each process, which gives much simpler code:
from multiprocessing import Pool
from collections import Counter
import pickle

def calculate(file):
    counts = Counter()
    ...
    return counts

pool = Pool(processes=NPROCS)
counts = Counter()
for result in pool.map(calculate, files):
    counts += result

with open("out.p", "wb") as f:
    pickle.dump(counts, f)