
Python multiprocessing shared Counter, pickling error

I am using multiprocessing to process some very large files.

I can count occurrences of a particular string using a collections.Counter that is shared between the processes via a multiprocessing.BaseManager subclass.

Although I can share the Counter and seemingly pickle it, it does not get pickled properly. However, if I copy its contents into a new dictionary, that copy pickles fine.

I am trying to understand how to avoid the "copy" of the shared counter before pickling it.

Here is my code (pseudocode):

import multiprocessing
import pickle
from multiprocessing.managers import BaseManager
from collections import Counter

class MyManager(BaseManager):
    pass

MyManager.register('Counter', Counter)

def main(glob_pattern):
    # function that processes files
    def worker_process(files_split_to_allow_naive_parallelization, mycounterdict):
        # code that loops through files
        for line in file:
            # code that processes line
            my_line_items = line.split()
            index_for_read = (my_line_items[0], my_line_items[6])
            # wrap the key in a tuple so update() counts it as one element
            mycounterdict.update((index_for_read,))

    manager = MyManager()
    manager.start()
    mycounterdict = manager.Counter()

    # code to get glob files, split them with unix shell split and then chunk them

    procs = []
    for i in range(NUM_PROCS):
        p = multiprocessing.Process(
            target=worker_process,
            args=(all_index_file_tuples[chunksize * i:chunksize * (i + 1)], mycounterdict))
        procs.append(p)
        p.start()
    # Now we "join" the processes
    for p in procs:
        p.join()

    # This is the part I have trouble with
    # This yields a pickled file that fails with an error
    pickle.dump(mycounterdict, open("Combined_count_gives_error.p", "wb"))

    # This however works
    # How can I avoid doing it this way?
    mycopydict = Counter()
    mycopydict.update(mycounterdict.items())
    pickle.dump(mycopydict, open("Combined_count_that_works.p", "wb"))

When I try to load the pickled file that gives the error (it is always a small, fixed size), I get an error that does not make sense.

How do I pickle the shared dict without creating a fresh dict first, as in the pseudocode above?

>>> p = pickle.load(open("Combined_count_gives_error.p"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1378, in load
    return Unpickler(file).load()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 880, in load_eof
    raise EOFError
EOFError

There are several problems with your code. First, you are not guaranteed that the file will ever be closed if you leave it dangling, as pickle.dump(mycounterdict, open(...)) does. Second, mycounterdict is not an actual Counter but a proxy over it; pickling the proxy will run into many problems, because it cannot be unpickled outside the manager's process, which is why you get the small, truncated file and the EOFError on load. However, you do not need to copy with update either: the proxy's .copy() method makes a new Counter copy of it.

Thus you should use

with open("out.p", "wb") as f:
    pickle.dump(mycounterdict.copy(), f)
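
Calling .copy() through the proxy runs in the manager process and returns a real Counter to the caller, and that plain Counter pickles normally. As a quick check (a sketch; the exact proxy class name may vary across Python versions), you can see that the shared object is a proxy rather than a Counter:

>>> from collections import Counter
>>> type(mycounterdict)          # an AutoProxy class generated by the manager
<class 'multiprocessing.managers.AutoProxy[Counter]'>
>>> isinstance(mycounterdict, Counter)
False
>>> type(mycounterdict.copy())   # the copy is a plain, picklable Counter
<class 'collections.Counter'>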

As for whether this is a good pattern, the answer is no. Instead of using a shared counter, you should count separately in each process and combine the results at the end, which makes for much simpler code:

from multiprocessing import Pool
from collections import Counter
import pickle

def calculate(file):
    # count occurrences within a single file; runs in a worker process
    counts = Counter()
    ...
    return counts

pool = Pool(processes=NPROCS)
counts = Counter()
for result in pool.map(calculate, files):
    counts += result  # merge each worker's Counter into the total

with open("out.p", "wb") as f:
    pickle.dump(counts, f)
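
For instance, a minimal sketch of calculate for the line format in the question might look like the following (an assumption on my part: whitespace-delimited lines where fields 0 and 6 form the key, as in your original worker_process):

def calculate(path):
    # count (field 0, field 6) pairs in one file, inside a worker process
    counts = Counter()
    with open(path) as f:  # 'with' guarantees the file is closed
        for line in f:
            items = line.split()
            counts[(items[0], items[6])] += 1
    return counts

Each worker returns an ordinary Counter, which multiprocessing pickles automatically on its way back from the pool, so no manager, proxy, or shared state is needed.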
