Passing kwargs to a function with multiprocessing.pool.map for a pandas DataFrame
I have written a function that takes in a dataframe with keyword arguments, and I call it as shown below:
df1 = df1.apply(add_data_ip2, axis=1, result_type="expand")
The process takes 20 minutes. The function add_data_ip2 takes in a dataframe, reads the stock symbol under the "Ticker" column, makes an API call to retrieve financial info, and does some math on the data to calculate a score. The score is saved in the "Score" column of the same df. The function returns the same dataframe.

The df contains approximately 1500 ticker symbols, and I am trying to run the following parallel-processing code to reduce the waiting time, but with no luck. The function keeps running with no indication of any output. Can anyone advise what the problem is? Is there anything wrong with the way I am passing the kwargs into the function? I have tried searching Stack Overflow for answers with no luck. Appreciate the help.
import multiprocessing as mp
from functools import partial

mapfunc = partial(add_data_ip2, axis=1, result_type="expand")
p = mp.Pool(mp.cpu_count())
df1 = p.map(mapfunc, df1)
df1
Another alternative block doesn't give any output either:
import numpy as np
import pandas as pd
from multiprocessing import Pool

def parallelize_dataframe(df, func, n_cores=4):
    df_split = np.array_split(df, n_cores)
    pool = Pool(n_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

train = parallelize_dataframe(df1, mapfunc)
In my opinion your partial is wrong. You can't send df to partial(add_data_ip2, ...) and expect it to run as df.apply(add_data_ip2, ...), because partial will try to run it as add_data_ip2(..., df).
Another problem: axis=1 and result_type="expand" are parameters for df.apply(), but partial will pass them to the function itself, running it as add_data_ip2(..., axis=1, result_type="expand").
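To make the mismatch concrete, here is a minimal sketch. The real add_data_ip2 is replaced by a hypothetical stand-in that, like the original, expects only a single row; calling the partial object the way Pool.map would immediately raises a TypeError, because axis and result_type are forwarded to the function itself instead of to df.apply():

```python
from functools import partial

def add_data_ip2(row):
    # Hypothetical stand-in: like the real function, it expects only one row.
    return row

# This is what the question builds:
mapfunc = partial(add_data_ip2, axis=1, result_type="expand")

# Pool.map would call mapfunc(chunk), which expands to
# add_data_ip2(chunk, axis=1, result_type="expand"):
try:
    mapfunc("any value")
except TypeError as e:
    print(e)  # add_data_ip2() got an unexpected keyword argument 'axis'
```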
In my opinion you should define a normal function:

def mapfunc(dfx):
    return dfx.apply(add_data_ip2, axis=1, result_type="expand")

or a lambda:

mapfunc = lambda dfx: dfx.apply(add_data_ip2, axis=1, result_type="expand")
But as far as I know, Pool can't work with a lambda, because it has to serialize the function and data with pickle so the worker process can read them later, and pickle can't save a lambda.
Code which I used for tests:
import pandas as pd
from functools import partial
from multiprocessing import Pool
import numpy as np
data = {
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9],
}
df = pd.DataFrame(data)
print('--- original ---')
print(df)
def add_data_ip2(row):
    row['A'] += 10
    row['B'] += 100
    row['C'] += 1000
    return row
# --- test 1 ---
#new_df = df.apply(add_data_ip2, axis=1, result_type="expand")
#print(new_df) # OK
# --- test 2 ---
#mapfunc = partial(add_data_ip2, axis=1, result_type="expand")
#new_df = mapfunc(df) # ERROR
#print(new_df)
# --- test 3 ---
#mapfunc = lambda df: df.apply(add_data_ip2, axis=1, result_type="expand")
#new_df = mapfunc(df) # OK
#print(new_df)
# --- test 4 ---
def mapfunc(df):
    return df.apply(add_data_ip2, axis=1, result_type="expand")
new_df = mapfunc(df) # OK
print('--- mapfunc ---')
print(new_df)
# --- test Pool ---
p = Pool()
parts = np.array_split(df, 4)
results = p.map(mapfunc, parts)
print('--- Pool results ---')
for item in results:
print(item)
print('---')
print('--- concat new df ---')
new_df = pd.concat(results)
print(new_df)