
Python multiprocessing - Return a dict

I'd like to parallelize a function that returns a flattened list of values (called "keys") in a dict, but I don't understand how to obtain the final result. I have tried:

def toParallel(ht, token):
    keys = []
    words = token[token['hashtag'] == ht]['word']
    for w in words:
        keys.append(checkString(w))
    return {ht: keys}

import multiprocessing

import pandas as pd

num_cores = multiprocessing.cpu_count()
pool = multiprocessing.Pool(num_cores)

token = pd.read_csv('/path', sep=",", header = None, encoding='utf-8')
token.columns = ['word', 'hashtag', 'count']
hashtag = pd.DataFrame(token.groupby(by='hashtag', as_index=False).count()['hashtag'])

result = pd.DataFrame(index = hashtag['hashtag'], columns = range(0, 21))
result = result.fillna(0)

final_result = []
final_result = [pool.apply_async(toParallel, args=(ht,token,)) for ht in hashtag['hashtag']]

The toParallel function should return a dict with the hashtag as key and a list of keys (where the keys are ints). But if I try to print final_result, I obtain only:

<bound method ApplyResult.get of <multiprocessing.pool.ApplyResult object at 0x10c4fa950>>

How can I do it?

final_result = [pool.apply_async(toParallel, args=(ht,token,)) for ht in hashtag['hashtag']]

You can either use Pool.apply() and get the result right away (in which case you do not really need multiprocessing; the method is there mostly for completeness), or use Pool.apply_async() followed by ApplyResult.get(). Pool.apply_async() is asynchronous: it returns immediately with an ApplyResult handle, and calling .get() on that handle blocks until the result is ready.

Something like this:

workers = [pool.apply_async(toParallel, args=(ht,token,)) for ht in hashtag['hashtag']]
final_result = [worker.get() for worker in workers]

Alternatively, you can also use Pool.map() which will do all this for you.
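A sketch of the Pool.map() route, using a simplified stand-in for toParallel (the original depends on checkString and the token DataFrame, which are not shown here, so the worker below just returns a dummy key list):

```python
import multiprocessing
from functools import partial

def to_parallel(token, ht):
    # Simplified stand-in for the OP's toParallel: pair the hashtag
    # with a dummy list of integer keys (here, just the tag's length).
    return {ht: [len(ht)]}

def collect(hashtags, token):
    # Pool.map dispatches the calls and gathers the results in one step,
    # replacing the apply_async/get pair. partial() pins the extra
    # `token` argument, since map passes exactly one item per call.
    with multiprocessing.Pool() as pool:
        results = pool.map(partial(to_parallel, token), hashtags)
    # Merge the per-hashtag dicts into a single dict.
    return {ht: keys for d in results for ht, keys in d.items()}

if __name__ == '__main__':
    print(collect(['python', 'pandas'], token=None))
```

Pool.map preserves input order, so the merged dict lines up with the original hashtag list.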

Either way, I recommend you read the documentation carefully.


Addendum: when answering this question I assumed the OP is using a Unix-like operating system such as Linux or macOS. If you are on Windows, you must not forget to guard your parent/worker code with if __name__ == '__main__'. This is because Windows lacks fork(), so each child process re-imports the module and starts executing it from the top, not from the point of forking as on Unix; the guard keeps the pool-creation code from running again in every child. See here.
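A minimal skeleton of that guard, with a hypothetical squaring worker standing in for the real function:

```python
import multiprocessing

def worker(x):
    # Stand-in task; on Windows the worker must live at module level
    # so child processes can import it after re-running the module.
    return x * x

if __name__ == '__main__':
    # Without this guard, each child spawned on Windows would re-execute
    # the module from the top and try to create its own Pool.
    with multiprocessing.Pool() as pool:
        print(pool.map(worker, range(4)))  # prints [0, 1, 4, 9]
```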


PS: this is unnecessary:

num_cores = multiprocessing.cpu_count()
pool = multiprocessing.Pool(num_cores)

If you call multiprocessing.Pool() without arguments (or with None), it already creates a pool of workers sized to your CPU count.
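A quick check of that default (inspecting the private _processes attribute purely for illustration; it is a CPython implementation detail, not public API):

```python
import multiprocessing

if __name__ == '__main__':
    with multiprocessing.Pool() as pool:  # no size given
        # With no argument, the pool is sized to cpu_count() by default.
        print(pool._processes, multiprocessing.cpu_count())
```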
