
Multiprocessing with pandas read_csv and ThreadPoolExecutor

I have a huge csv to parse by chunk and write to multiple files.

I am using the pandas read_csv function to get the data chunk by chunk. It was working fine, but slower than the performance we need, so I decided to do this parsing in threads:

pool = ThreadPoolExecutor(2)
with ThreadPoolExecutor(max_workers=2) as executor:
    futures = executor.map(process, [df for df in pd.read_csv(
        downloaded_file, chunksize=chunksize, compression='gzip',
        low_memory=False, skipinitialspace=True, encoding='utf-8')], file_index)
    for future in concurrent.futures.as_completed(futures):
        pass

Here is my function that is responsible for parsing and writing to csv:

def process(df, file_index):
    """
    Process the csv chunk in a separate thread
        :param df:
        :param file_index:
    """
    chunk_index = random.randint(1, 200)
    print("start processing chunk")
    # some heavy processing...
    handle = open(outfile_name)
    df.to_csv(outfile_name, index=False,
              compression='gzip', sep='\t', quoting=1, encoding='utf-8')
    handle.close()
    del df
    print("end processing chunk")
    return True

I never see my print debug lines, the CPU and memory reach 100%, and my script gets killed.

It looks like read_csv itself keeps yielding chunks, while executor.map is still waiting for its first argument.
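
For illustration, a minimal sketch (assuming the same downloaded_file and chunksize as above) of what that list comprehension evaluates to before executor.map is even called:

# The list comprehension materializes every chunk up front, so the whole
# file ends up in memory at once before map() ever sees the first chunk.
chunks = [df for df in pd.read_csv(downloaded_file, chunksize=chunksize,
                                   compression='gzip', low_memory=False,
                                   skipinitialspace=True, encoding='utf-8')]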

Thanks

Have you considered keeping the second argument to the executor.map function lazy (a generator)?

from concurrent.futures import ThreadPoolExecutor

import pandas as pd

df_generator = pd.read_csv(downloaded_file,
                           chunksize=chunksize,
                           compression='gzip',
                           low_memory=False,
                           skipinitialspace=True,
                           encoding='utf-8')

with ThreadPoolExecutor(max_workers=2) as executor:
    # executor.map returns an iterator of results (not Future objects),
    # so iterate over it directly instead of passing it to as_completed.
    # file_index must be an iterable that yields one index per chunk.
    for result in executor.map(process, df_generator, file_index):
        pass

pd.read_csv with a given chunksize returns an iterator (a TextFileReader), so the chunks are produced lazily as you iterate over it. This should ideally not cause memory overflow, if your chunksize is chosen well.
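
If you are not sure whether the chunksize is small enough, one rough check (a sketch, assuming the same read_csv arguments as above) is to measure the in-memory footprint of a single chunk; with max_workers=2 you want roughly two such chunks, plus processing overhead, to fit comfortably in RAM:

import pandas as pd

# Read only the first chunk and report its in-memory size in MiB.
reader = pd.read_csv(downloaded_file, chunksize=chunksize, compression='gzip',
                     low_memory=False, skipinitialspace=True, encoding='utf-8')
first_chunk = next(iter(reader))
print(first_chunk.memory_usage(deep=True).sum() / 1024 ** 2, "MiB per chunk")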
