
Passing kwargs to a function with multiprocessing.pool.map for a pandas DataFrame

I have written a function that takes in a dataframe with keyword arguments, as shown below:

df1 = df1.apply(add_data_ip2, axis=1, result_type="expand")

The process takes 20 minutes. The function add_data_ip2 takes in a dataframe, reads the stock symbol under the "Ticker" column, makes an API call to retrieve financial info, and does the math on that data to calculate a score. The score is saved in the "Score" column of the same df. The function returns the same dataframe.

The df contains approximately 1500 ticker symbols, and I am trying to run the following parallel-processing code to reduce the waiting time, but with no luck. The function keeps running with no indication of any output. Can anyone advise what the problem is? Is there anything wrong with the way I am passing the kwargs into the function? I have tried searching Stack Overflow for answers with no luck. Appreciate the help.

import multiprocessing as mp
from functools import partial

mapfunc = partial(add_data_ip2, axis=1, result_type="expand")

p = mp.Pool(mp.cpu_count())
df1 = p.map(mapfunc, df1)
df1

Another alternative block doesn't give any output either.

from multiprocessing import Pool
import numpy as np
import pandas as pd

def parallelize_dataframe(df, func, n_cores=4):
    df_split = np.array_split(df, n_cores)
    pool = Pool(n_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

train = parallelize_dataframe(df1, mapfunc)

As for me, your partial is wrong.

You can't send df to the partial partial(add_data_ip2, ...) and expect it to run like df.apply(add_data_ip2, ...), because the partial will run it as add_data_ip2(df, ...) - calling the function once on the whole dataframe instead of applying it row by row.

Other problem: axis=1 and result_type="expand" are parameters for df.apply(), but the partial will pass them straight to your function, running it as add_data_ip2(..., axis=1, result_type="expand"), which add_data_ip2 does not expect.
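
To make that concrete, here is a minimal sketch with a stand-in add_data_ip2 that only echoes how it gets called:

import pandas as pd
from functools import partial

# stand-in for the real add_data_ip2, only to show the call shape
def add_data_ip2(obj, **kwargs):
    print('called with:', type(obj).__name__, 'kwargs:', kwargs)

df1 = pd.DataFrame({'Ticker': ['AAA', 'BBB']})
mapfunc = partial(add_data_ip2, axis=1, result_type="expand")

# mapfunc(df1) is exactly add_data_ip2(df1, axis=1, result_type="expand"):
# the whole dataframe arrives as the first argument, the apply() keywords
# land on add_data_ip2 itself, and .apply() is never involved
mapfunc(df1)
# prints: called with: DataFrame kwargs: {'axis': 1, 'result_type': 'expand'}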

As for me, you should define a normal function:

def mapfunc(dfx):
    return dfx.apply(add_data_ip2, axis=1, result_type="expand")

or a lambda:

mapfunc = lambda dfx: dfx.apply(add_data_ip2, axis=1, result_type="expand")

But as far as I know, Pool can't work with a lambda, because it has to save the function and data with pickle so that the worker process can read them back - and pickle can't save a lambda.
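
A quick way to see that pickle limitation (the exact error message varies between Python versions):

import pickle

def normal_func(x):
    return x + 1

lambda_func = lambda x: x + 1

# a module-level function pickles fine - it is stored by module + name
print(len(pickle.dumps(normal_func)))

# a lambda has no importable name, so pickling it fails
try:
    pickle.dumps(lambda_func)
except Exception as e:  # pickle.PicklingError on CPython
    print('cannot pickle lambda:', e)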


Code which I used for tests:

import pandas as pd
from functools import partial
from multiprocessing import Pool
import numpy as np

data = {
    'A': [1,2,3], 
    'B': [4,5,6], 
    'C': [7,8,9]
}

df = pd.DataFrame(data)
print('--- original ---')
print(df)

def add_data_ip2(row):
    row['A'] += 10
    row['B'] += 100
    row['C'] += 1000
    return row

# --- test 1 ---
#new_df = df.apply(add_data_ip2, axis=1, result_type="expand")
#print(new_df)  # OK

# --- test 2 ---
#mapfunc = partial(add_data_ip2, axis=1, result_type="expand")
#new_df = mapfunc(df)  # ERROR
#print(new_df)
    
# --- test 3 ---
#mapfunc = lambda df: df.apply(add_data_ip2, axis=1, result_type="expand")
#new_df = mapfunc(df)  # OK
#print(new_df)

# --- test 4 ---
def mapfunc(df):
    return df.apply(add_data_ip2, axis=1, result_type="expand")

new_df = mapfunc(df)  # OK
print('--- mapfunc ---')
print(new_df)

# --- test Pool ---
# the guard is needed so worker processes don't re-run the Pool code on import
if __name__ == '__main__':
    p = Pool()

    parts = np.array_split(df, 4)

    results = p.map(mapfunc, parts)

    print('--- Pool results ---')
    for item in results:
        print(item)
        print('---')

    print('--- concat new df ---')
    new_df = pd.concat(results)
    print(new_df)
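
If the goal is still to pass the kwargs in via partial, a pattern that should work is to freeze them on a small module-level wrapper that forwards them to df.apply(), where they belong. A minimal sketch under the same setup as the test code above (apply_chunk is a hypothetical helper name, not part of pandas or multiprocessing):

# hypothetical wrapper: the frozen keywords go to df.apply(), not to add_data_ip2
def apply_chunk(chunk, **apply_kwargs):
    return chunk.apply(add_data_ip2, **apply_kwargs)

if __name__ == '__main__':
    mapfunc2 = partial(apply_chunk, axis=1, result_type="expand")
    with Pool() as p:
        new_df2 = pd.concat(p.map(mapfunc2, np.array_split(df, 4)))
    print('--- partial on wrapper ---')
    print(new_df2)

Both apply_chunk and the partial built from it can be pickled, because apply_chunk is defined at module level.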
