How to stream results from Multiprocessing.Pool to csv?

I have a Python (2.7) process that takes a key, does a bunch of calculations and returns a list of results. Here is a very simplified version.

I am using multiprocessing to create threads so this can be processed faster. However, my production data has several million rows, and each loop takes progressively longer to complete. The last time I ran this, each loop took over 6 minutes to complete, while at the start it took a second or less. I think this is because all the threads are adding results into resultset, which continues to grow until it contains all the records.

Is it possible to use multiprocessing to stream the results of each thread (a list) into a csv, or to batch the resultset so it writes to the csv after a set number of rows?

Any other suggestions for speeding up or optimizing the approach would be appreciated.

import numpy as np
import pandas as pd
import csv
import os
import multiprocessing
from multiprocessing import Pool

global keys
keys = [1,2,3,4,5,6,7,8,9,10,11,12]

def key_loop(key):
    test_df = pd.DataFrame(np.random.randn(1,4), columns=['a','b','c','d'])
    test_list = test_df.ix[0].tolist()
    return test_list

if __name__ == "__main__":
    try:
        pool = Pool(processes=8)      
        resultset = pool.imap(key_loop,(key for key in keys) )

        loaddata = []
        for sublist in resultset:
            loaddata.append(sublist)

        with open("C:\\Users\\mp_streaming_test.csv", 'w') as file:
            writer = csv.writer(file)
            for listitem in loaddata:
                writer.writerow(listitem)
        file.close

        print "finished load"
    except:
        print 'There was a problem multithreading the key Pool'
        raise

Here is an answer consolidating the suggestions Eevee and I made:

import numpy as np
import pandas as pd
import csv
from multiprocessing import Pool

keys = [1,2,3,4,5,6,7,8,9,10,11,12]

def key_loop(key):
    test_df = pd.DataFrame(np.random.randn(1,4), columns=['a','b','c','d'])
    test_list = test_df.ix[0].tolist()
    return test_list

if __name__ == "__main__":
    try:
        pool = Pool(processes=8)      
        resultset = pool.imap(key_loop, keys, chunksize=200)

        with open("C:\\Users\\mp_streaming_test.csv", 'w') as file:
            writer = csv.writer(file)
            for listitem in resultset:
                writer.writerow(listitem)

        print "finished load"
    except:
        print 'There was a problem multithreading the key Pool'
        raise

Again, the changes here are:

  1. Iterate over resultset directly, rather than needlessly copying it to a list first.
  2. Feed the keys list directly to pool.imap instead of creating a generator comprehension out of it.
  3. Provide a larger chunksize to imap than the default of 1. The larger chunksize reduces the cost of the inter-process communication required to pass the values inside keys to the sub-processes in your pool, which can give big performance boosts when keys is very large (as it is in your case). You should experiment with different values for chunksize (try something considerably larger than 200, like 5000, etc.) and see how it affects performance; a rough way to compare values is sketched after this list. I'm making a wild guess with 200, though it should definitely do better than 1.
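
If it helps, here is a minimal sketch of one way to compare chunksize values. The dummy_worker function, the key count and the candidate chunk sizes are placeholders; substitute your real key_loop and keys when measuring.

import time
from multiprocessing import Pool

def dummy_worker(key):
    # Stand-in for key_loop; swap in the real function when measuring.
    return [key, key + 1, key * 2, key * 3]

def time_chunksize(keys, chunksize):
    # Time one full imap pass over keys for the given chunksize.
    pool = Pool(processes=8)
    start = time.time()
    for _ in pool.imap(dummy_worker, keys, chunksize=chunksize):
        pass  # just consume the iterator; results are discarded
    pool.close()
    pool.join()
    return time.time() - start

if __name__ == "__main__":
    keys = range(100000)
    for cs in (1, 200, 1000, 5000):
        print "chunksize=%d: %.2fs" % (cs, time_chunksize(keys, cs))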

The following very simple code collects many workers' data into a single CSV file. A worker takes a key and returns a list of rows. The parent processes several keys at a time, using several workers. When each key is done, the parent writes its output rows, in order, to a CSV file.

Be careful about order. If each worker writes to the CSV file directly, the rows will be out of order or the workers will stomp on each other. Having each worker write to its own CSV file would be fast, but would require merging all the data files together afterward.

source

import csv, multiprocessing, sys

def worker(key):
    return [ [key, 0], [key+1, 1] ]


pool = multiprocessing.Pool()   # default 1 proc per CPU
writer = csv.writer(sys.stdout)

for resultset in pool.imap(worker, [1,2,3,4]):
    for row in resultset:
        writer.writerow(row)

output

1,0
2,1
2,0
3,1
3,0
4,1
4,0
5,1

My bet would be that dealing with the large structure all at once using appending is what makes it slow. What I usually do is open as many files as there are cores and use modulo to write to each file immediately, so the streams don't cause trouble compared to directing them all into the same file (write errors), and you also avoid trying to store huge data in memory. Probably not the best solution, but really quite easy. In the end you just merge the results back together.

Define at the start of the run:

num_cores = 8
file_sep = ","
outFiles = [open('out' + str(x) + ".csv", "a") for x in range(num_cores)]

Then in the key_loop function:

def key_loop(key):
    test_df = pd.DataFrame(np.random.randn(1,4), columns=['a','b','c','d'])
    test_list = test_df.ix[0].tolist()
    outFiles[key % num_cores].write(file_sep.join([str(x) for x in test_list]) 
                                    + "\n")

Afterwards, don't forget to close: [x.close() for x in outFiles]
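
For the final merge step, a minimal sketch might look like the following (the out*.csv pattern matches the file names opened above; merged.csv is just a placeholder name). Note the merged rows will be grouped by worker file rather than ordered by key.

import glob

# Concatenate the per-process CSV files back into a single file.
with open("merged.csv", "w") as merged:
    for part in sorted(glob.glob("out*.csv")):
        with open(part) as f:
            merged.write(f.read())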

Improvements:

  • Iterate over blocks, as mentioned in the comments. Writing/processing 1 line at a time is going to be much slower than writing in blocks; a buffering sketch follows after this list.

  • Handle errors (closing of files).

  • IMPORTANT: I'm not sure of the meaning of the "keys" variable, but the numbers there will not allow modulo to guarantee that each process writes to its own individual stream (with 12 keys, modulo 8 will make 2 processes write to the same file).
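
On the block-writing point, here is a minimal sketch of buffering rows in memory and flushing them in batches; the batch size of 1000 and the output file name are arbitrary.

import csv

BATCH_SIZE = 1000   # arbitrary; tune for your workload
buffer = []

def write_row(writer, row):
    # Collect rows and flush them to the csv writer once the batch is full.
    buffer.append(row)
    if len(buffer) >= BATCH_SIZE:
        writer.writerows(buffer)
        del buffer[:]

def flush(writer):
    # Write out whatever is left over at the end of the run.
    if buffer:
        writer.writerows(buffer)
        del buffer[:]

if __name__ == "__main__":
    with open("batched_output.csv", "wb") as f:   # "wb" for the csv module on Python 2
        writer = csv.writer(f)
        for i in range(5000):
            write_row(writer, [i, i * 2])
        flush(writer)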
