How to map a function to an array of dataframes in parallel in python?

Question

My code has an array of dataframes, each of which I want to apply a function to. The dataframes are all in the same format; here's an example of one:

and the code I have for regular mapping is this:

def return_stat(df):
    return np.random.choice(df.iloc[:,1],p=df.iloc[:,0])


weather_df_list = [weather_df1,weather_df2,weather_df3,weather_df4]

expected_values = list(map(lambda i:return_stat(i), weather_df_list))

However, I have 16 cores on my computer and want to use them to make this code as fast as possible.

How would I implement this same code using parallel computing in Python?

Thanks!

Answer 1

Using multiprocessing.Pool lets you spread the work across all of your cores.

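import multiprocessing

import numpy as np
import pandas as pd

def return_stat(df):
    # Draw one value from column 1, weighted by the probabilities in column 0.
    return np.random.choice(df.iloc[:, 1], p = df.iloc[:, 0])

# The guard keeps worker processes from re-running this block when they
# import the module (needed on platforms that spawn workers).
if __name__ == '__main__':
    weather_df = pd.DataFrame({'rain_probability': [0.1,0.2,0.7], 'rain_inches': [1,2,3]})
    weather_df_list = [weather_df, weather_df, weather_df, weather_df]
    # Pool() with no arguments starts one worker per available CPU.
    with multiprocessing.Pool() as pool:
        expected_values = pool.map(return_stat, weather_df_list)
    print(expected_values)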
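One caveat with random sampling in worker processes: on platforms that start workers by forking, each worker can inherit the same global NumPy random state, so different workers may return identical draws. A minimal sketch of one way around this is below, assuming you are free to pass a seed alongside each dataframe. The names return_stat_seeded, sample_all and root_seed are illustrative and not part of the original code.

```python
import multiprocessing

import numpy as np
import pandas as pd


def return_stat_seeded(task):
    # task is a (DataFrame, SeedSequence) pair; this pairing is an
    # assumption made for the sketch, not part of the original code.
    df, seed = task
    rng = np.random.default_rng(seed)  # independent generator per task
    values = df.iloc[:, 1].to_numpy()
    probs = df.iloc[:, 0].to_numpy()
    return rng.choice(values, p=probs)


def sample_all(df_list, root_seed=12345, processes=None):
    # Spawn one child seed per dataframe so every task gets its own stream,
    # regardless of which worker ends up running it.
    seeds = np.random.SeedSequence(root_seed).spawn(len(df_list))
    with multiprocessing.Pool(processes) as pool:
        return pool.map(return_stat_seeded, list(zip(df_list, seeds)))


if __name__ == '__main__':
    weather_df = pd.DataFrame({'rain_probability': [0.1, 0.2, 0.7],
                               'rain_inches': [1, 2, 3]})
    weather_df_list = [weather_df] * 4
    print(sample_all(weather_df_list))
```

Because each task carries its own seed, the results for a given root_seed should not depend on how the pool schedules the tasks.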

Another fancy and also efficient way to solve the problem is to use Numba. It compiles Python into efficient machine code and also has a parallelization feature. However, it had no choice() variant that supports a probabilities array, so I had to implement choice() myself. You need to install Numba once with python -m pip install numba.

import pandas as pd, numpy as np
from numba import njit

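@njit(parallel = True, fastmath = True)
def choices(l):
    # One uniform random number in [0, 1) per array in the list.
    rnds = np.random.random((len(l),))
    def choice(i, a, p):
        # Inverse-CDF sampling: pick a[k] with probability proportional to p[k].
        assert p.shape == a.shape
        p = p.cumsum()
        p = p / p[-1]  # normalize so the cumulative sum ends at 1
        r = rnds[i]
        # The number of cumulative probabilities <= r is the chosen index.
        i = np.sum((p <= r).astype(np.int64))
        return a[i]
    res = np.empty((len(l),), dtype = np.float64)
    # Note: a plain range loop runs serially; Numba only parallelizes
    # explicit loops that are written with numba.prange.
    for i in range(len(l)):
        res[i] = choice(i, l[i][:, 1], l[i][:, 0])
    return res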

weather_df = pd.DataFrame({'rain_probability': [0.1, 0.2, 0.3, 0.4], 'rain_inches': [0, 1, 2, 3]})
weather_df_list = [weather_df, weather_df, weather_df, weather_df, weather_df, weather_df, weather_df, weather_df]
weather_df_arrays = [e.values[:, :2] for e in weather_df_list]
print(choices(weather_df_arrays))
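The hand-written choice() in the Numba code is a form of inverse-CDF sampling: build the cumulative distribution, draw a uniform number r, and take the first entry whose cumulative probability exceeds r. The same idea can be sketched in plain NumPy, which may make the logic easier to check outside of Numba. The function name weighted_choice and the explicit r argument are assumptions made for illustration.

```python
import numpy as np


def weighted_choice(values, probs, r):
    """Return the element of values selected by the uniform draw r in [0, 1)."""
    values = np.asarray(values)
    probs = np.asarray(probs, dtype=float)
    cdf = np.cumsum(probs)
    cdf = cdf / cdf[-1]  # normalize in case probs do not sum exactly to 1
    # side='right' counts how many cdf entries are <= r, which matches
    # np.sum(cdf <= r) in the Numba version.
    idx = np.searchsorted(cdf, r, side='right')
    return values[idx]


if __name__ == '__main__':
    rng = np.random.default_rng()
    print(weighted_choice([0, 1, 2, 3], [0.1, 0.2, 0.3, 0.4], rng.random()))
```

Passing r in explicitly keeps the function deterministic, so it can be tested with fixed draws.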

Try the Numba variant on your side and let me know how fast it is. If it is not faster than the multiprocessing variant, I have some extra ideas for improving its speed.
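To compare the variants, a small timing helper along the following lines may be enough. This is only a sketch: best_time and its repeats parameter are hypothetical names, and the serial baseline simply reuses return_stat from the question.

```python
import time

import numpy as np
import pandas as pd


def best_time(fn, *args, repeats=3):
    """Call fn(*args) several times; return the last result and the best time in seconds."""
    best = float('inf')
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - start)
    return result, best


def return_stat(df):
    return np.random.choice(df.iloc[:, 1], p=df.iloc[:, 0])


def serial_map(df_list):
    # Baseline: the plain, single-process mapping from the question.
    return list(map(return_stat, df_list))


if __name__ == '__main__':
    weather_df = pd.DataFrame({'rain_probability': [0.1, 0.2, 0.7],
                               'rain_inches': [1, 2, 3]})
    _, seconds = best_time(serial_map, [weather_df] * 1000)
    print(f'serial map: {seconds:.4f} s')
```

Taking the best of several runs also keeps one-off costs, such as Numba's compilation on the first call, out of the comparison.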

solution1
1 2020-10-06 13:14:19
