My code has an array of dataframes, each of which I want to apply a function to. The dataframes are all in the same format, here's an example of one:
and the code I have for regular mapping is this:
def return_stat(df):
return np.random.choice(df.iloc[:,1],p=df.iloc[:,0])
weather_df_list = [weather_df1,weather_df2,weather_df3,weather_df4]
expected_values = list(map(lambda i:return_stat(i), weather_df_list))
but I have 16 cores on my computer and want to make use of it to make this code super fast.
How would I implement this same code using parallel computing in Python?
Thanks!
Using multiprocessing.Pool can help to occupy all your cores.
import pandas as pd, numpy as np, multiprocessing
def return_stat(df):
return np.random.choice(df.iloc[:, 1], p = df.iloc[:, 0])
if __name__ == '__main__':
weather_df = pd.DataFrame({'rain_probability': [0.1,0.2,0.7], 'rain_inches': [1,2,3]})
weather_df_list = [weather_df, weather_df, weather_df, weather_df]
with multiprocessing.Pool() as pool:
expected_values = pool.map(return_stat, weather_df_list)
print(expected_values)
Another fancy and also efficient way to solve the problem is using Numba . It transcodes Python into efficient machine code and also has parallelization feature. Although it had no choice()
variant supporting probabilities array, hence I had to implement choice() myself. You need to install numba once through python -m pip install numba
.
import pandas as pd, numpy as np
from numba import njit
@njit(parallel = True, fastmath = True)
def choices(l):
rnds = np.random.random((len(l),))
def choice(i, a, p):
assert p.shape == a.shape
p = p.cumsum()
p = p / p[-1]
r = rnds[i]
i = np.sum((p <= r).astype(np.int64))
return a[i]
res = np.empty((len(l),), dtype = np.float64)
for i in range(len(l)):
res[i] = choice(i, l[i][:, 1], l[i][:, 0])
return res
weather_df = pd.DataFrame({'rain_probability': [0.1, 0.2, 0.3, 0.4], 'rain_inches': [0, 1, 2, 3]})
weather_df_list = [weather_df, weather_df, weather_df, weather_df, weather_df, weather_df, weather_df, weather_df]
weather_df_arrays = [e.values[:, :2] for e in weather_df_list]
print(choices(weather_df_arrays))
You may try numba variant on your side and tell me how fast it is, if it is not faster than multiprocessing variant then I have some extra ideas how to improve its speed.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.