
`multiprocessing.Pool.map()` seems to schedule wrongly

I have a function which requests a server, retrieves some data, processes it, and saves a csv file. This function should be launched 20k times. Each execution lasts a different amount of time: sometimes it lasts more than 20 minutes and other times less than a second. I decided to go with multiprocessing.Pool.map to parallelize the execution. My code looks like:

from multiprocessing import Pool

def get_data_and_process_it(filename):
    print('getting', filename)
    ...
    print(filename, 'has been processed')

with Pool(8) as p:
    p.map(get_data_and_process_it, long_list_of_filenames)

Looking at how the prints are generated, it seems that long_list_of_filenames is being split into 8 parts and assigned to each CPU, because sometimes the pool just gets blocked in one 20-minute execution with no other element of long_list_of_filenames being processed during those 20 minutes. What I was expecting is for map to schedule each element on a CPU core in a FIFO style.

Is there a better approach for my case?

map is blocking; instead of p.map you can use p.map_async. map will wait for all those function calls to finish so we see all the results in a row. map_async does the work in random order and does not wait for a running task to finish before starting a new task. This is the fastest approach. There is also a SO thread which discusses map and map_async in detail.
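As a rough sketch of that idea (reusing get_data_and_process_it and long_list_of_filenames from the question; the .get() call is what would block until everything finishes):

from multiprocessing import Pool

with Pool(8) as p:
    # map_async returns an AsyncResult immediately instead of blocking
    async_result = p.map_async(get_data_and_process_it, long_list_of_filenames)
    # .get() blocks until every task has finished; results come back
    # in input order even though the tasks ran in parallel
    results = async_result.get()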

The multiprocessing Pool class handles the queuing logic for us. It's perfect for running web scraping jobs in parallel, or really any job that can be broken up and distributed independently. If you need more control over the queue or need to share data between multiple processes, you may want to look at the Queue class.
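As a minimal sketch of that alternative (the worker function and filenames here are hypothetical, just to show the pattern), each process pulls filenames from a shared Queue until it sees a sentinel:

from multiprocessing import Process, Queue

def worker(q, results):
    # Pull filenames until a None sentinel is seen
    while True:
        filename = q.get()
        if filename is None:
            break
        results.put((filename, 'has been processed'))

if __name__ == '__main__':
    q, results = Queue(), Queue()
    filenames = ['a.csv', 'b.csv']       # stand-ins for real filenames
    for fn in filenames:
        q.put(fn)
    procs = [Process(target=worker, args=(q, results)) for _ in range(2)]
    for _ in procs:
        q.put(None)                      # one sentinel per worker
    for p in procs:
        p.start()
    for _ in filenames:                  # drain results before joining
        print(results.get())
    for p in procs:
        p.join()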

The map method only returns when all operations have finished.

And printing from a pool worker is not ideal. For one thing, files like stdout use buffering, so there might be a variable amount of time between printing a message and it actually appearing. Furthermore, since all workers inherit the same stdout, their output would become intermeshed and possibly even garbled.

So I would suggest using imap_unordered instead. That returns an iterator that begins yielding results as soon as they are available. The only catch is that it returns results in the order they finish, not in the order they started.

Your worker function (get_data_and_process_it) should return some kind of status indicator. For example, a tuple of the filename and the result.

def get_data_and_process_it(filename):
    ...
    if error:
        return (filename, f'has *failed* because of {reason}')
    return (filename, 'has been processed')

You could then do:

with Pool(8) as p:
    for fn, res in p.imap_unordered(get_data_and_process_it, long_list_of_filenames):
        print(fn, res)

That gives accurate information about when a job finishes, and since only the parent process writes to stdout, there is no chance of the output becoming garbled.

Additionally, I would suggest calling sys.stdout.reconfigure(line_buffering=True) somewhere near the beginning of your program. That ensures the stdout stream is flushed after every line of output.
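For example (a one-line sketch; reconfigure() on sys.stdout assumes Python 3.7+):

import sys

# Flush stdout after every newline so progress messages appear promptly
sys.stdout.reconfigure(line_buffering=True)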
