
How to run the same function multiple times simultaneously?

I have a function that takes a dataframe as input and returns a dataframe, like:

def process(df):
    # <all the code for processing>
    return df

# input df has 250K rows and 30 columns;
# process() transforms it into 10,000K rows and over 50 columns
result = process(df)  # save the output in a variable

It does a lot of processing and thus takes a long time to return the output. I am using a Jupyter notebook.

I have come up with a new function that filters the original dataframe into 5 chunks, not of equal size but between 30K and 100K rows, based on a category filter on a column of the original df. I then call process(df1), process(df2), etc. separately, save the outputs as result1, result2, etc., and finally merge the results into one single final dataframe.

But I want them to run simultaneously and have the results combined automatically: some code that runs the 5 process calls together and, once all are completed, joins their outputs into one dataframe, giving me the same "result" as before but with a lot of run time saved.

Even better would be if I could split the original dataframe into equal parts and run each part through process(df) simultaneously: split those 250K rows randomly into 5 dataframes of 50K rows each, pass them to process(df) five times in parallel, and get the same final output I would get right now without any of this optimization.
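For the splitting and merging steps, something like this rough sketch is what I have in mind (assuming numpy and pandas; the parallel part is the piece I can't get working):

import numpy as np
import pandas as pd

# shuffle the rows, then split into 5 roughly equal dataframes
chunks = np.array_split(df.sample(frac=1), 5)

# today these would run sequentially; I want the 5 calls to run in parallel
results = [process(chunk) for chunk in chunks]
result = pd.concat(results, ignore_index=True)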

I was reading a lot about multi-threading and found some useful answers on Stack Overflow, but I wasn't able to really get it to work. I am very new to this whole concept of multi-threading.

You can use the multiprocessing library for this, which allows you to run a function on different cores of the CPU.

The following is an example:

from multiprocessing import Pool
import pandas as pd

def f(df):
    # process the dataframe
    return df

if __name__ == '__main__':
    # your pre-split chunks
    dataframes = [df1, df2, df3]

    # one worker process per chunk, each running f on a separate CPU core
    with Pool(len(dataframes)) as p:
        processed_dfs = p.map(f, dataframes)

    print(processed_dfs)

    # join the results back into a single dataframe
    result = pd.concat(processed_dfs)
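One caveat: since you are in a Jupyter notebook, multiprocessing may fail to pickle a function defined in the notebook itself (on platforms where the default start method is spawn). A common workaround, sketched below under that assumption, is joblib, whose loky backend can handle interactively defined functions; n_jobs=5 here is just an assumption matching your five chunks:

from joblib import Parallel, delayed
import pandas as pd

# chunks is a list of dataframes, e.g. from np.array_split(df, 5)
results = Parallel(n_jobs=5)(delayed(process)(chunk) for chunk in chunks)
result = pd.concat(results, ignore_index=True)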

You should check out Dask ( https://dask.org/ ), since it seems like you mostly have operations on dataframes. A big advantage is that you won't have to worry about the details of manually splitting your dataframe and all of that.
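A minimal sketch of that approach, assuming your process function can be applied to each partition independently:

import dask.dataframe as dd

# split the pandas dataframe into 5 partitions
ddf = dd.from_pandas(df, npartitions=5)

# apply process to each partition, then gather back into one pandas dataframe
result = ddf.map_partitions(process).compute()

Note that compute() runs on a thread pool by default; for CPU-bound processing you can pass scheduler='processes' to compute() instead.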
