
Improving pandas performance with apply method

I'm using pandas for high-performance calculations. For 50,000 rows, the function below times at 7.24 s per loop (1 loop, best of 5).

I have to scale it to 1 million rows.

How can I vectorise the function and apply it to all rows so that overall performance improves?

from datetime import datetime

def weightedFlowAmt(startDate, endDate, tradeDate, tradeAmt):
  # Parse the "YYYY-MM-DD" date strings
  startInDays = datetime.strptime(startDate, "%Y-%m-%d")
  endInDays = datetime.strptime(endDate, "%Y-%m-%d")
  tradeInDays = datetime.strptime(tradeDate, "%Y-%m-%d")
  # Absolute day counts: trade-to-end and start-to-end
  differenceTradeAndEnd = abs((endInDays - tradeInDays).days)
  differenceStartAndEnd = abs((endInDays - startInDays).days)
  weighted_FlowAmt = (tradeAmt * differenceTradeAndEnd) / differenceStartAndEnd
  # Return the result so that apply can collect it
  return weighted_FlowAmt

mutatedCashFlow['flow'] = mutatedCashFlow.apply(lambda row:
        weightedFlowAmt(row['startDate'], row['EndDate'], row['tradeDate'],
                        row['tradeAmount']),
    axis=1)

I think you can remove apply and use vectorized functions:

mutatedCashFlow['startDate'] = pd.to_datetime(mutatedCashFlow['startDate'])
mutatedCashFlow['EndDate'] = pd.to_datetime(mutatedCashFlow['EndDate'])
mutatedCashFlow['tradeDate'] = pd.to_datetime(mutatedCashFlow['tradeDate'])

diffTradeAndEnd=((mutatedCashFlow['EndDate']-mutatedCashFlow['tradeDate']).dt.days).abs()
diffStartAndEnd=((mutatedCashFlow['EndDate']-mutatedCashFlow['startDate']).dt.days).abs()

mutatedCashFlow['flow'] = (mutatedCashFlow['tradeAmount']*diffTradeAndEnd)/diffStartAndEnd
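
As a quick sanity check, the vectorized version can be compared against a row-wise function on a small frame. This is only a sketch: the sample rows and the names df and weighted_flow_row are made up for illustration, while the column names follow the question. One difference worth keeping in mind: where startDate equals EndDate, the row-wise function can raise ZeroDivisionError, whereas the vectorized division produces inf or NaN instead, so such rows may need separate handling.

```python
from datetime import datetime

import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "startDate": ["2020-01-01", "2020-01-01", "2020-03-01"],
    "EndDate": ["2020-01-31", "2020-12-31", "2020-03-11"],
    "tradeDate": ["2020-01-16", "2020-07-01", "2020-03-06"],
    "tradeAmount": [100.0, 250.0, -40.0],
})


def weighted_flow_row(start, end, trade, amount):
    """Row-wise reference, mirroring the function in the question."""
    fmt = "%Y-%m-%d"
    end_dt = datetime.strptime(end, fmt)
    trade_to_end = abs((end_dt - datetime.strptime(trade, fmt)).days)
    start_to_end = abs((end_dt - datetime.strptime(start, fmt)).days)
    return amount * trade_to_end / start_to_end


# Slow path: one Python call per row.
expected = df.apply(
    lambda r: weighted_flow_row(
        r["startDate"], r["EndDate"], r["tradeDate"], r["tradeAmount"]
    ),
    axis=1,
)

# Fast path: whole-column datetime arithmetic.
start = pd.to_datetime(df["startDate"])
end = pd.to_datetime(df["EndDate"])
trade = pd.to_datetime(df["tradeDate"])
flow = df["tradeAmount"] * (end - trade).dt.days.abs() / (end - start).dt.days.abs()

# Raises AssertionError if the two approaches disagree.
pd.testing.assert_series_equal(flow, expected, check_names=False)
```

If the assertion passes on a representative sample, the same column expressions can be applied to the full frame.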

Alternative:

mutatedCashFlow['startDate'] = pd.to_datetime(mutatedCashFlow['startDate'])
mutatedCashFlow['EndDate'] = pd.to_datetime(mutatedCashFlow['EndDate'])
mutatedCashFlow['tradeDate'] = pd.to_datetime(mutatedCashFlow['tradeDate'])

diffTradeAndEnd=mutatedCashFlow['EndDate'].sub(mutatedCashFlow['tradeDate']).dt.days.abs()
diffStartAndEnd=mutatedCashFlow['EndDate'].sub(mutatedCashFlow['startDate']).dt.days.abs()

mutatedCashFlow['flow'] = (mutatedCashFlow['tradeAmount'].mul(diffTradeAndEnd)
                                                         .div(diffStartAndEnd))
print(mutatedCashFlow)
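
Since the question parses dates with "%Y-%m-%d", the same format can be passed to pd.to_datetime explicitly. This skips format inference and may help parsing speed on large frames, though any gain depends on the pandas version and is not measured here. A minimal sketch, assuming a made-up two-row frame named df with the question's column names:

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "startDate": ["2020-01-01", "2020-03-01"],
    "EndDate": ["2020-01-31", "2020-03-11"],
    "tradeDate": ["2020-01-16", "2020-03-06"],
    "tradeAmount": [100.0, -40.0],
})

# Parse each date column once, stating the format explicitly.
for col in ["startDate", "EndDate", "tradeDate"]:
    df[col] = pd.to_datetime(df[col], format="%Y-%m-%d")

# Absolute day counts, computed column-wise.
remaining = df["EndDate"].sub(df["tradeDate"]).dt.days.abs()
span = df["EndDate"].sub(df["startDate"]).dt.days.abs()

df["flow"] = df["tradeAmount"].mul(remaining).div(span)
print(df)
```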
