

Python multiprocessing with pandas: not all processes running at once


I am reading a CSV in chunks and passing each chunk to a pool of 4 processes.

pool = Pool(processes=4)
chunk_index = 1
for df in pd.read_csv(downloaded_file, chunksize=chunksize, compression='gzip',
                      skipinitialspace=True, encoding='utf-8'):
    output_file_name = output_path + merchant['output_file_format'].format(
        file_index, chunk_index)
    pool.map(wrapper_process, [(df, transformer, output_file_name)])
    chunk_index += 1

With this code, my understanding is that it should show 4 processes running continuously. But in the htop screenshot below, only 2 are ever running, and one of them is the htop command itself. That means only 1 Python process is actually running at a time.

[htop screenshot] Judging from the memory usage of about 12 GB, I think this is only possible when the 4 chunks are all loaded in memory, given that one chunk is almost 2 GB.

How can I use all 4 processors at once?

The problem is that you misunderstood how map works. From the docs:

map(func, iterable[, chunksize]) — This method chops the iterable into a number of chunks which it submits to the process pool as separate tasks. The (approximate) size of these chunks can be specified by setting chunksize to a positive integer.
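As a quick toy illustration of that behaviour (a hypothetical example, unrelated to the CSV code above): a pool only runs work in parallel when the iterable passed to map has more than one element.

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        # An iterable with 8 elements is chopped into tasks that are
        # spread across the 4 worker processes and run in parallel.
        print(pool.map(square, range(8)))   # [0, 1, 4, 9, 16, 25, 36, 49]
        # A single-element list, as in the question, produces only one
        # task, so only one worker ever does any work.
        print(pool.map(square, [3]))        # [9]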

As the iterable you provide a list with only one element: the tuple (df, ...). But you'd need to provide an iterable with many elements. To make this work, you'd need to prepare the list first and only then send it to the processes (hint: you can just write Pool() and let Python figure out the number of cores itself):

pool = Pool()
chunk_index = 1
tasks = []
for df in pd.read_csv(downloaded_file,
        chunksize=chunksize,
        compression='gzip',
        skipinitialspace=True,
        encoding='utf-8'):
    output_file_name = (output_path +
        merchant['output_file_format'].format(file_index, chunk_index))
    tasks.append((df, transformer, output_file_name))
    chunk_index += 1
pool.map(wrapper_process, tasks)

But now you have the problem that you need to hold the full CSV data in memory, which might be OK but usually isn't. To get around this you could switch to using a queue. You would:

  • build up an empty queue
  • start the processes and tell them to get items from the queue (which is still empty at the start)
  • feed the queue from your main process (and maybe check that the queue isn't getting too long, so memory consumption doesn't go through the roof)
  • put a STOP element on the queue so the processes quit themselves

There's a good example in the official docs (look at the last example on the page) that explains how you would approach that.
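A minimal sketch of that queue-based pattern, assuming a process_chunk function as a stand-in for your wrapper_process/transformer logic and reusing the variable names from your question (downloaded_file, chunksize, output_path, merchant, file_index), could look roughly like this:

from multiprocessing import Process, Queue
import pandas as pd

NUM_WORKERS = 4
STOP = None  # sentinel telling a worker to quit

def process_chunk(df, output_file_name):
    # hypothetical stand-in for the original transform + write step
    df.to_csv(output_file_name, index=False)

def worker(queue):
    while True:
        item = queue.get()            # blocks until an item is available
        if item is STOP:
            break
        df, output_file_name = item
        process_chunk(df, output_file_name)

if __name__ == '__main__':
    # maxsize bounds the queue, so the reader pauses instead of
    # loading the whole CSV into memory at once.
    queue = Queue(maxsize=NUM_WORKERS * 2)

    workers = [Process(target=worker, args=(queue,)) for _ in range(NUM_WORKERS)]
    for w in workers:
        w.start()

    chunk_index = 1
    for df in pd.read_csv(downloaded_file, chunksize=chunksize,
                          compression='gzip', skipinitialspace=True,
                          encoding='utf-8'):
        output_file_name = output_path + merchant['output_file_format'].format(
            file_index, chunk_index)
        queue.put((df, output_file_name))   # blocks while the queue is full
        chunk_index += 1

    for _ in workers:
        queue.put(STOP)                     # one STOP sentinel per worker
    for w in workers:
        w.join()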

One last word: are you sure your operation is CPU bound? Do you do a lot of processing in wrapper_process (and possibly also in transformer)? Because if you just split the CSV into separate files without much processing, your program is IO bound rather than CPU bound, and then multiprocessing wouldn't make any sense.
