
Slow speed while parallelizing operation on pandas dataframe

I have a dataframe on which I perform some operation and print out the result. To do this, I have to iterate through each row.

for count, row in final_df.iterrows():
    x = row['param_a']
    y = row['param_b']
    # Perform operation
    # Write to output file

I decided to parallelize this using the Python multiprocessing module.

import multiprocessing

def write_site_files(row, pkg_num):
    x = row['param_a']
    y = row['param_b']
    # Perform operation
    # Write to output file

pkg_num = 0
total_runs = final_df.shape[0]  # Total number of rows in final_df
threads = []

while pkg_num < total_runs or len(threads):
    if len(threads) < num_proc and pkg_num < total_runs:
        print pkg_num, total_runs
        t = multiprocessing.Process(target=write_site_files,
                                    args=[final_df.iloc[pkg_num], pkg_num])
        pkg_num = pkg_num + 1
        t.start()
        threads.append(t)
    else:
        # Iterate over a copy so that removing finished processes
        # does not skip entries in the list
        for thread in threads[:]:
            if not thread.is_alive():
                threads.remove(thread)

However, the latter (parallelized) method is way slower than the simple iteration-based approach. Is there anything I am missing?

Thanks!

This will be way less efficient than doing it in a single process unless the actual operation takes a lot of time, like seconds per row.

Normally parallelization is the last tool in the box. After profiling, after local vectorization, after local optimization, then you parallelize.

You are spending time just doing the slicing, then spinning up new processes (which is generally a constant overhead), then pickling a single row (not clear how big it is from your example).
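
To get a feel for those fixed costs, here is a minimal timing sketch (the dataframe and sizes are made up for illustration; it is not part of the original question). It measures how long it takes to pickle one row and to start and join a process that does no work at all:

import pickle
import time
import multiprocessing

import pandas as pd

def noop(row):
    pass

if __name__ == '__main__':
    df = pd.DataFrame({'param_a': range(1000), 'param_b': range(1000)})
    row = df.iloc[0]

    # Cost of serializing a single row, which multiprocessing pays
    # every time a row is shipped to a child process
    start = time.time()
    payload = pickle.dumps(row)
    print('pickling one row: %.6f s, %d bytes' % (time.time() - start, len(payload)))

    # Fixed cost of spawning and joining one process that does nothing
    start = time.time()
    p = multiprocessing.Process(target=noop, args=(row,))
    p.start()
    p.join()
    print('spawn + join of an idle process: %.4f s' % (time.time() - start))

If the real per-row operation finishes in milliseconds, these fixed costs dominate, which matches the behaviour you are seeing.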

At the very least, you should chunk the rows, e.g. df.iloc[i:(i+1)*chunksize].
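
A minimal sketch of that chunked approach, assuming a multiprocessing.Pool and a hypothetical process_chunk helper standing in for your per-row operation (final_df is recreated here with dummy data so the example is self-contained):

import multiprocessing

import pandas as pd

def process_chunk(chunk):
    # Stand-in for the real work: iterate the small chunk locally
    # and collect (or write out) the results once per chunk
    results = []
    for _, row in chunk.iterrows():
        x = row['param_a']
        y = row['param_b']
        results.append(x + y)  # placeholder for "Perform operation"
    return results

if __name__ == '__main__':
    final_df = pd.DataFrame({'param_a': range(10000),
                             'param_b': range(10000)})
    chunksize = 1000
    chunks = [final_df.iloc[i:i + chunksize]
              for i in range(0, final_df.shape[0], chunksize)]

    pool = multiprocessing.Pool(processes=4)
    try:
        # Each worker unpickles one chunk instead of one row, so the
        # per-process overhead is amortized over chunksize rows
        all_results = pool.map(process_chunk, chunks)
    finally:
        pool.close()
        pool.join()

Whether this actually beats the single-process loop still depends on how expensive the per-row operation is; for cheap operations the plain iterrows loop will usually win.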

There hopefully will be some support for parallel apply in 0.14, see here: https://github.com/pydata/pandas/issues/5751
