
How to apply multiprocessing to a Python for-loop?

I have a long list of users (about 200,000) and a corresponding data frame df with their attributes. Now I'd like to write a for loop to measure the pair-wise similarity of the users. The code is as follows:

df2record = pd.DataFrame(columns=['u1', 'u2', 'sim'])
for u1 in reversed(user_list):
    for u2 in reversed(list(range(1, u1))):
        sim = measure_sim(df[u1], df[u2])
        if sim < 0.6:
            continue
        else:
            df2record = df2record.append(pd.Series([u1, u2, sim], index=['u1', 'u2', 'sim']), ignore_index=True)

Now I want to run this for loop with multiprocessing, and I have read some tutorials, but I still have no idea how to handle it properly. It seems that I should first set a reasonable number of processes, like 6, and then feed each loop iteration into one process. But the problem is: how can I know that the task in a certain process has finished, so that a new iteration can begin? Could you help me with this? Thank you in advance!

You can use multiprocessing.Pool, which provides a map method that distributes work from a given iterable over a pool of processes. Here's some example code:

import multiprocessing

import pandas as pd

def pairGen():
    for u1 in reversed(user_list):
        for u2 in reversed(list(range(1, u1))):
            yield (u1, u2)

def processFun(pair):
    u1, u2 = pair
    sim = measure_sim(df[u1], df[u2])
    if sim < 0.6:
        return None
    else:
        return pd.Series([u1, u2, sim], index=['u1', 'u2', 'sim'])

def main():
    with multiprocessing.Pool(processes=6) as pool:
       vals = pool.map(processFun, pairGen())

    df2record = pd.DataFrame(columns=['u1', 'u2', 'sim'])
    for v in vals:
        if v is not None:
            df2record = df2record.append(v, ignore_index=True)
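For reference, here is a self-contained sketch of this pattern, with a toy measure_sim (Jaccard similarity) and a tiny attribute table standing in for the real data, which the question doesn't show. Collecting the non-None results into a list and building the DataFrame once is also much faster than calling append repeatedly:

```python
import multiprocessing
from itertools import combinations

import pandas as pd

# Toy attribute table and similarity function -- placeholders for the
# asker's real df and measure_sim, which are not shown in the question.
USER_ATTRS = {
    1: {"a", "b", "c"},
    2: {"a", "b", "c", "d"},
    3: {"x", "y"},
}

def measure_sim(s1, s2):
    # Jaccard similarity of two attribute sets
    return len(s1 & s2) / len(s1 | s2)

def process_pair(pair):
    u1, u2 = pair
    sim = measure_sim(USER_ATTRS[u1], USER_ATTRS[u2])
    return (u1, u2, sim) if sim >= 0.6 else None

def main():
    pairs = combinations(sorted(USER_ATTRS), 2)  # every unordered pair
    # chunksize batches many pairs per task, cutting inter-process overhead
    with multiprocessing.Pool(processes=2) as pool:
        vals = pool.map(process_pair, pairs, chunksize=100)
    # build the DataFrame once instead of appending row by row
    rows = [v for v in vals if v is not None]
    return pd.DataFrame(rows, columns=["u1", "u2", "sim"])

if __name__ == "__main__":
    print(main())
```

Note that worker functions must be defined at module level so they can be pickled, and on platforms that spawn rather than fork (e.g. Windows) the Pool must be created under the `if __name__ == "__main__":` guard.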

First of all, I would not recommend using multiprocessing on such a small amount of data, especially when you are working with a data frame, because DataFrame has a lot of functionality of its own that can help you in many ways. You just need to write a proper loop.
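To illustrate that point, here is a sketch of the plain-loop version, with a hypothetical measure_sim stub since the real function isn't shown. Appending rows to a Python list and constructing the DataFrame once at the end avoids the quadratic cost of DataFrame.append inside the loop:

```python
import pandas as pd

def measure_sim(a, b):
    # hypothetical stub for the asker's similarity function
    return 1.0 if a == b else 0.0

df = pd.Series({1: "x", 2: "y", 3: "x"})  # toy attribute lookup
user_list = [1, 2, 3]

rows = []
for u1 in reversed(user_list):
    for u2 in reversed(range(1, u1)):
        sim = measure_sim(df[u1], df[u2])
        if sim >= 0.6:
            rows.append((u1, u2, sim))

# one constructor call instead of per-row append
df2record = pd.DataFrame(rows, columns=["u1", "u2", "sim"])
```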

Use: multiprocessing.Pool

Just pass the list of users as the iterable to pool.map(). You only need to build that iterable with a little tweak.

from multiprocessing import Pool

with Pool(processes=6) as pool:
    pool.map(function, iterator)
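A minimal runnable version of the above, with a stand-in process_user function (the answer's `function` and `iterator` are placeholders):

```python
from multiprocessing import Pool

def process_user(u):
    # stand-in for whatever per-user work `function` does
    return u * u

user_list = list(range(8))

if __name__ == "__main__":
    with Pool(processes=6) as pool:
        results = pool.map(process_user, user_list)
        print(results)
```

Pool.map blocks until every task has finished and handles the scheduling the question asks about, so there is no need to track task completion manually.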
