Pandas DataFrame - Add Column Containing Conditional Sum of “previous” Rows
I have a dataset of tennis match results as follows:
import pandas as pd
tennis_cols = ['Year','TourNo','MatchNo','Round','Winner','Loser']
tennis_rslts = [ [2018, 1, 1, 'QF', 'PlayerA', 'PlayerB']
,[2018, 1, 2, 'QF', 'PlayerC', 'PlayerD']
,[2018, 1, 3, 'QF', 'PlayerE', 'PlayerF']
,[2018, 1, 4, 'QF', 'PlayerG', 'PlayerH']
,[2018, 1, 5, 'SF', 'PlayerA', 'PlayerC']
,[2018, 1, 6, 'SF', 'PlayerE', 'PlayerG']
,[2018, 1, 7, 'F', 'PlayerA', 'PlayerE'] ]
dfTennis=pd.DataFrame(tennis_rslts,columns=tennis_cols)
dfTennis
Year TourNo MatchNo Round Winner Loser
0 2018 1 1 QF PlayerA PlayerB
1 2018 1 2 QF PlayerC PlayerD
2 2018 1 3 QF PlayerE PlayerF
3 2018 1 4 QF PlayerG PlayerH
4 2018 1 5 SF PlayerA PlayerC
5 2018 1 6 SF PlayerE PlayerG
6 2018 1 7 F PlayerA PlayerE
I want to add a column, WinsToDate, which contains the number of wins the winner of this match had before the current match, i.e.:
Year TourNo MatchNo Round Winner Loser WinsToDate
0 2018 1 1 QF PlayerA PlayerB 0
1 2018 1 2 QF PlayerC PlayerD 0
2 2018 1 3 QF PlayerE PlayerF 0
3 2018 1 4 QF PlayerG PlayerH 0
4 2018 1 5 SF PlayerA PlayerC 1 <-- PlayerA won MatchNo 1
5 2018 1 6 SF PlayerE PlayerG 1 <-- PlayerE won MatchNo 3
6 2018 1 7 F PlayerA PlayerE 2 <-- PlayerA won MatchNo 1 and 5
My real-world dataset is large enough that iterating through it row by row is too slow. Any ideas on how to achieve this result efficiently?
Essentially, I want to count the rows where the Winner matches that of the row being processed and the MatchNo is less than that of the row being processed.
** UPDATE ** I can get a count of the number of times the winner occurs in the DataFrame using:
dfTennis['Count'] = list(map(lambda x : len(dfTennis[(dfTennis['Winner'] == x)]), dfTennis['Winner']))
But this counts all occurrences rather than only the occurrences before the current row.
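As an aside, the same all-occurrences count can likely be obtained without filtering the DataFrame once per row. The sketch below rebuilds the sample data from above and uses groupby with transform('size'). It is only an illustration on the sample data, not a tested solution for the real dataset, and it still counts every win rather than only the earlier ones.

```python
import pandas as pd

tennis_cols = ['Year', 'TourNo', 'MatchNo', 'Round', 'Winner', 'Loser']
tennis_rslts = [
    [2018, 1, 1, 'QF', 'PlayerA', 'PlayerB'],
    [2018, 1, 2, 'QF', 'PlayerC', 'PlayerD'],
    [2018, 1, 3, 'QF', 'PlayerE', 'PlayerF'],
    [2018, 1, 4, 'QF', 'PlayerG', 'PlayerH'],
    [2018, 1, 5, 'SF', 'PlayerA', 'PlayerC'],
    [2018, 1, 6, 'SF', 'PlayerE', 'PlayerG'],
    [2018, 1, 7, 'F', 'PlayerA', 'PlayerE'],
]
dfTennis = pd.DataFrame(tennis_rslts, columns=tennis_cols)

# Size of each Winner group, broadcast back onto every row of that group.
# This is the total number of wins per player, not the wins to date.
dfTennis['Count'] = dfTennis.groupby('Winner')['Winner'].transform('size')
print(dfTennis[['MatchNo', 'Winner', 'Count']])
```

This replaces one boolean filter per row with a single grouped pass over the Winner column.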
Strangely, I am going to answer my own question.
The code needed to compute the WinsToDate column is:
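# For each row x, count the rows with the same Winner and a smaller MatchNo.
# x is an index label passed to .iloc, so this relies on the default 0..n-1 index.
dfTennis['WinsToDate'] = list(map(lambda x : len(dfTennis[(dfTennis['Winner'] == dfTennis.iloc[x]['Winner']) &
(dfTennis['MatchNo'] < dfTennis.iloc[x]['MatchNo'])]), dfTennis.index.values))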
Passing the index value to the lambda function lets me access both the Winner and MatchNo fields of each row, so I can apply the logic I need.
I'd welcome any better solutions, but this appears to work for my needs.
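One possible faster alternative, offered as a sketch rather than a verified drop-in replacement: once the rows are in chronological order, groupby('Winner').cumcount() numbers each player's wins 0, 1, 2, ... in row order, which is the number of earlier wins. This assumes that sorting by (Year, TourNo, MatchNo) puts the matches in time order and that MatchNo is unique within a tournament. Neither is guaranteed by the sample data alone.

```python
import pandas as pd

tennis_cols = ['Year', 'TourNo', 'MatchNo', 'Round', 'Winner', 'Loser']
tennis_rslts = [
    [2018, 1, 1, 'QF', 'PlayerA', 'PlayerB'],
    [2018, 1, 2, 'QF', 'PlayerC', 'PlayerD'],
    [2018, 1, 3, 'QF', 'PlayerE', 'PlayerF'],
    [2018, 1, 4, 'QF', 'PlayerG', 'PlayerH'],
    [2018, 1, 5, 'SF', 'PlayerA', 'PlayerC'],
    [2018, 1, 6, 'SF', 'PlayerE', 'PlayerG'],
    [2018, 1, 7, 'F', 'PlayerA', 'PlayerE'],
]
dfTennis = pd.DataFrame(tennis_rslts, columns=tennis_cols)

# Assumption: (Year, TourNo, MatchNo) orders the matches chronologically.
dfTennis = dfTennis.sort_values(['Year', 'TourNo', 'MatchNo']).reset_index(drop=True)

# cumcount numbers the rows of each Winner group 0, 1, 2, ... in row order,
# so each value is the number of earlier rows won by the same player.
dfTennis['WinsToDate'] = dfTennis.groupby('Winner').cumcount()
print(dfTennis)
```

Unlike the map/lambda version, this makes one grouped pass instead of filtering the whole DataFrame for every row. Note that it counts earlier wins across all tournaments in the sort order, whereas the lambda version compares MatchNo only.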