
How to run the same function multiple times simultaneously?

I have a function that takes a dataframe as input and returns a dataframe. Like:

def process(df):
    <all the code for processing>
    return df
# input df has 250K rows and 30 columns
# saving it in a variable
result = process(df)
# transforms the input df into 10,000K rows and over 50 columns

It does a lot of processing and thus takes a long time to return the output. I am using a Jupyter notebook.

I have come up with a new approach that filters the original dataframe into 5 chunks, not of equal size but between 30K and 100K rows, based on a category filter on a column of the original df. I then call the function separately as process(df1), process(df2), etc., save the outputs as result1, result2, etc., and finally merge the results together into one final dataframe.
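The category-based split described above can be sketched with `groupby`; the tiny dataframe and the column name `category` below are hypothetical stand-ins for the real 250K-row df and its filter column:

```python
import pandas as pd

# Hypothetical stand-in for the real dataframe and filter column
df = pd.DataFrame({
    "category": ["a", "a", "b", "c", "b"],
    "value": [1, 2, 3, 4, 5],
})

# One sub-dataframe per category value, ready to pass to process()
chunks = [group for _, group in df.groupby("category")]

print([len(c) for c in chunks])  # [2, 2, 1]
```

Each element of `chunks` is an ordinary pandas DataFrame, so the chunks can be handed to the processing function one by one.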

But I want them to run simultaneously and combine the results automatically: code that runs the 5 process calls together and, once all are completed, joins their outputs into one, giving me the same "result" as before but with a lot of run time saved.

Even better would be splitting the original dataframe into equal parts and running each part through process(df) simultaneously: randomly split the 250K rows into 5 dataframes of 50K rows each, send each one as input to process(df), run the five calls in parallel, and get the same final output I would get right now without any of this optimization.
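The random equal split can be sketched as follows; the 10-row dataframe is a hypothetical stand-in for the real 250K-row one:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real 250K-row dataframe
df = pd.DataFrame({"x": range(10)})

# Shuffle the row positions, then cut them into 5 near-equal groups
rng = np.random.default_rng(0)
positions = rng.permutation(len(df))
parts = [df.iloc[idx] for idx in np.array_split(positions, 5)]

print([len(p) for p in parts])  # [2, 2, 2, 2, 2]
```

`np.array_split` tolerates lengths that do not divide evenly, so the same snippet works on 250K rows as well.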

I have read a lot about multi-threading and found some useful answers on Stack Overflow, but I wasn't able to really get it to work. I am very new to the concept of multi-threading.

You can use the multiprocessing library for this, which allows you to run a function on different cores of the CPU.

The following is an example:

from multiprocessing import Pool

def f(df):
    # Process dataframe
    return df

if __name__ == '__main__':
    dataframes = [df1, df2, df3]  # the pre-split chunks of your original df

    with Pool(len(dataframes)) as p:
        processed_dfs = p.map(f, dataframes)
    
    print(processed_dfs)

    # You would join them here
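The final join step can be done with `pandas.concat`; the small frames below are hypothetical stand-ins for the processed chunks:

```python
import pandas as pd

# Hypothetical per-chunk outputs standing in for processed_dfs
processed_dfs = [
    pd.DataFrame({"x": [1, 2]}),
    pd.DataFrame({"x": [3, 4]}),
]

# ignore_index=True renumbers the combined rows 0..N-1
result = pd.concat(processed_dfs, ignore_index=True)

print(result["x"].tolist())  # [1, 2, 3, 4]
```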

You should check out dask ( https://dask.org/ ), since it seems like you mostly have operations on dataframes. A big advantage is that you won't have to worry about all the details of manually splitting your dataframe.

Notice: the technical posts on this site follow the CC BY-SA 4.0 license; if you repost, please cite this site's URL or the original address. For any questions contact: yoyou2525@163.com.

 
© 2020-2024 STACKOOM.COM