
Multiprocessing a function execution with Python

I have a pandas DataFrame of approximately 1M rows containing information entered by users. I wrote a function that validates whether the number entered by a user is correct. I am trying to run this function on multiple processors to avoid doing the heavy computation on a single one. I split the DataFrame into chunks of 50K rows each and then used Python's multiprocessing module to process each chunk separately. The problem is that only the first process starts, and it still uses one processor instead of distributing the load across all of them. Here is the code I wrote:

 pool = multiprocessing.Pool(processes=16)
 r7 = pool.apply_async(validate.validate_phone_number, (has_phone_num_list[0],fields ,dictionary))
 r8 = pool.apply_async(validate.validate_phone_number, (has_phone_num_list[1],fields ,dictionary))
 print(r7.get())
 print(r8.get())
 pool.close()
 pool.join()

I have attached a screenshot that shows the CPU usage while the above code runs: [image: CPU usage screenshot]

Any advice on how I can overcome this issue?

I suggest you try this:

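from concurrent.futures import ProcessPoolExecutor
from itertools import repeat

# On platforms that spawn worker processes, run this under `if __name__ == "__main__":`.
with ProcessPoolExecutor() as executor:
    # map() takes one iterable per argument; repeat() supplies the shared ones for every chunk.
    results = executor.map(validate.validate_phone_number,
                           has_phone_num_list, repeat(fields), repeat(dictionary))
    for result in results:
        pass  # process results here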

Constructing the ProcessPoolExecutor with no arguments makes it default to one worker per processor on the machine, so all of your CPUs can be put to work as long as there are at least as many chunks as workers. This is a very portable approach because the code makes no explicit assumption about the number of CPUs available. You could, of course, construct it with max_workers=N, where N is a low number, to limit how many CPUs are used concurrently. You might do that if you're not too concerned about how long the overall process is going to take.
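For reference, a minimal self-contained sketch of this pattern follows. The validator `validate_chunk`, the helper `split_into_chunks`, and the ten-digit rule are hypothetical stand-ins for `validate.validate_phone_number` and the 50K-row DataFrame chunks, which are not shown in the question, so treat it as an illustration under those assumptions rather than a drop-in solution.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import repeat


def split_into_chunks(items, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def validate_chunk(numbers, required_digits):
    """Hypothetical validator: True where an entry is all digits of the given length."""
    return [n.isdigit() and len(n) == required_digits for n in numbers]


def validate_all(chunks, required_digits, max_workers=None):
    """Validate every chunk in a process pool and return the flattened results."""
    flat = []
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # One task per chunk; map() yields results in the order the chunks were given.
        for chunk_result in executor.map(validate_chunk, chunks, repeat(required_digits)):
            flat.extend(chunk_result)
    return flat


if __name__ == "__main__":
    # The guard matters on platforms where workers start by re-importing this module.
    numbers = ["0123456789", "12345", "98765abcde", "5551234567"] * 1000
    chunks = split_into_chunks(numbers, 500)
    results = validate_all(chunks, required_digits=10)
    print(sum(results), "of", len(results), "entries look valid")
```

Passing `max_workers=4` to `validate_all` would cap the pool at four processes, which is the low-N variant described above.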

As suggested in this answer, you can use pandarallel to run pandas' apply function in parallel. Unfortunately, as I cannot run your code, I am not able to find the problem. Did you try using fewer processes (for example 8 instead of 16)?

Note that in some cases the parallelization doesn't work.
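To experiment with different process counts while staying with `multiprocessing.Pool`, as in the question, something along these lines submits every chunk rather than only the first two. `check_chunk` and its length arguments are hypothetical placeholders for `validate.validate_phone_number`, `fields`, and `dictionary`; this is a sketch under those assumptions, not the asker's actual code.

```python
import multiprocessing


def check_chunk(chunk, min_len, max_len):
    """Hypothetical validator: count entries that are digits within a length range."""
    return sum(1 for n in chunk if n.isdigit() and min_len <= len(n) <= max_len)


def build_tasks(chunks, min_len, max_len):
    """Pair every chunk with the shared arguments, one tuple per task."""
    return [(chunk, min_len, max_len) for chunk in chunks]


def count_valid(chunks, min_len, max_len, processes):
    """Run check_chunk over all chunks using the given number of worker processes."""
    with multiprocessing.Pool(processes=processes) as pool:
        # starmap unpacks each tuple into positional arguments and blocks until all finish.
        return sum(pool.starmap(check_chunk, build_tasks(chunks, min_len, max_len)))


if __name__ == "__main__":
    chunks = [["0123456789", "12345", "abc"]] * 16
    for processes in (2, 4, 8):
        print(processes, "processes:", count_valid(chunks, 7, 15, processes))
```

Timing each call, for example with `time.perf_counter`, would show whether adding processes actually helps for your data.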


 