
Multiprocessing a function execution with Python

I have a pandas DataFrame of approximately 1M rows that contains information entered by users. I wrote a function that validates whether the number entered by the user is correct. What I am trying to do is execute the function on multiple processors, to avoid doing heavy computation on a single processor. I split my DataFrame into multiple chunks of 50K rows each and then used Python's multiprocessing module to process each chunk separately. The issue is that only the first process starts, and it still uses one processor instead of distributing the load across all processors. Here is the code I wrote:

pool = multiprocessing.Pool(processes=16)
r7 = pool.apply_async(validate.validate_phone_number, (has_phone_num_list[0], fields, dictionary))
r8 = pool.apply_async(validate.validate_phone_number, (has_phone_num_list[1], fields, dictionary))
print(r7.get())
print(r8.get())
pool.close()
pool.join()

I have attached a screenshot that shows the CPU usage when executing the above code.

Any advice on how I can overcome this issue?

I suggest you try this:

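from concurrent.futures import ProcessPoolExecutor
from itertools import repeat

with ProcessPoolExecutor() as executor:
    # map() takes one iterable per positional argument, so pass the chunks
    # directly and repeat the two shared arguments for every call.
    results = executor.map(validate.validate_phone_number,
                           has_phone_num_list, repeat(fields), repeat(dictionary))
    for result in results:
        pass  # process results here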

When the ProcessPoolExecutor is constructed with no parameters, the number of worker processes defaults to the number of processors on the machine, so all of your CPUs can be put to work as long as there are enough chunks to keep them busy. This is a very portable approach because there's no explicit assumption about the number of CPUs available. You could, of course, construct with max_workers=N, where N is a low number, to limit how many CPUs are used concurrently. You might do that if you're not too concerned about how long the overall process is going to take.
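As a minimal, self-contained sketch of this pattern: `validate_chunk`, its 9-to-15-digit rule, `split_into_chunks`, and the sample numbers are all hypothetical stand-ins, since the question does not show `validate.validate_phone_number` or the DataFrame. Plain lists are used in place of DataFrame chunks so that it runs without pandas.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from itertools import repeat


def validate_chunk(chunk, min_digits, max_digits):
    """Hypothetical stand-in for validate.validate_phone_number.

    Returns one bool per entry: True when the entry consists only of
    digits and its length lies within [min_digits, max_digits].
    """
    return [s.isdigit() and min_digits <= len(s) <= max_digits for s in chunk]


def split_into_chunks(items, chunk_size):
    """Split a list into consecutive slices of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def validate_in_parallel(items, chunk_size, max_workers=None):
    """Validate every chunk in a worker process and flatten the results."""
    chunks = split_into_chunks(items, chunk_size)
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # One iterable per positional argument; map() preserves chunk order.
        per_chunk = executor.map(validate_chunk, chunks, repeat(9), repeat(15))
        return [flag for chunk_flags in per_chunk for flag in chunk_flags]


if __name__ == "__main__":
    # The guard matters on platforms that start workers by re-importing
    # the main module (for example Windows).
    numbers = ["0123456789", "12ab", "5551234567", ""] * 5
    print("CPUs reported by os.cpu_count():", os.cpu_count())
    flags = validate_in_parallel(numbers, chunk_size=4, max_workers=2)
    print(sum(flags), "of", len(flags), "entries look valid")
```

Every chunk is submitted in one `map()` call, so the pool has work for each of its workers. Leaving `max_workers` as `None` lets the executor pick its default.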

As suggested in this answer, you can use pandarallel to run pandas' apply function in parallel. Unfortunately, as I cannot try your code, I am not able to find the problem. Did you try using fewer processors (for example 8 instead of 16)?

Note that in some cases the parallelization doesn't work.
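One way to check whether parallelization pays off for a given workload is to time a serial run against a pooled run on the same chunks. Starting worker processes and pickling the arguments both have a cost, so a cheap per-chunk task can come out slower in a pool. The sketch below is only an illustration: `count_digits` and the sample data are hypothetical, and the timings depend entirely on the machine.

```python
import time
from concurrent.futures import ProcessPoolExecutor


def count_digits(chunk):
    """Hypothetical cheap per-chunk task: count the digit characters."""
    return sum(ch.isdigit() for s in chunk for ch in s)


def run_serial(chunks):
    """Process every chunk in the current process."""
    return [count_digits(chunk) for chunk in chunks]


def run_parallel(chunks, max_workers=2):
    """Process the chunks in a pool of worker processes."""
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(count_digits, chunks))


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call to fn(*args)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    chunks = [["555-0100", "n/a"] * 1000 for _ in range(8)]
    serial, t_serial = timed(run_serial, chunks)
    parallel, t_parallel = timed(run_parallel, chunks)
    # Both strategies must agree before the timings are worth comparing.
    assert serial == parallel
    print(f"serial: {t_serial:.4f}s, parallel: {t_parallel:.4f}s")
```

If the pooled run is not faster, larger chunks or a heavier per-row function may be needed before the extra processes are worth their overhead.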
