Pandas, quicker way to summarize sub-grouped data totals into merged df as separate columns
Calculating running totals in separate columns in a Pandas DataFrame
I have a pandas DataFrame like this.
import pandas as pd

frame = pd.DataFrame({'home' : ['CHI', 'ATL', 'SEA', 'DET', 'STL', 'HOU', 'CHI', 'CHI'],
                      'away' : ['DET', 'CHI', 'HOU', 'TOR', 'DAL', 'STL', 'MIA', 'SEA']})
Thanks to unutbu, I can keep a running count of the games each team has played, like this.
import collections

awayGP = collections.Counter()
homeGP = collections.Counter()

def count_games():
    # Walk the games in order and yield each team's running away/home counts.
    for idx, row in frame.iterrows():
        homeGP[row['home']] += 1
        awayGP[row['away']] += 1
        yield (awayGP[row['away']], awayGP[row['home']],
               homeGP[row['away']], homeGP[row['home']])

(frame['awayteamAwayGP'], frame['hometeamAwayGP'],
 frame['awayteamHomeGP'], frame['hometeamHomeGP']) = zip(*count_games())
# Total games played = games played away + games played at home.
frame['awayteamGames'] = frame['awayteamAwayGP'] + frame['awayteamHomeGP']
frame['hometeamGames'] = frame['hometeamAwayGP'] + frame['hometeamHomeGP']
del frame['awayteamAwayGP'], frame['hometeamAwayGP'], frame['awayteamHomeGP'], frame['hometeamHomeGP']
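Since the separate away/home counts are deleted at the end anyway, the same totals can probably be obtained with a single Counter. The following is only a minimal sketch under that assumption; the helper name `running_games` is made up for illustration and is not part of the original code.

```python
import collections
import pandas as pd

frame = pd.DataFrame({
    'home': ['CHI', 'ATL', 'SEA', 'DET', 'STL', 'HOU', 'CHI', 'CHI'],
    'away': ['DET', 'CHI', 'HOU', 'TOR', 'DAL', 'STL', 'MIA', 'SEA'],
})

def running_games(df):
    """Yield (away team's games so far, home team's games so far) per row.

    Hypothetical helper: keeps one tally per team, home and away combined.
    """
    games = collections.Counter()
    for home, away in zip(df['home'], df['away']):
        games[home] += 1
        games[away] += 1
        yield games[away], games[home]

# Unzip the per-row pairs into two lists, one per new column.
away_games, home_games = map(list, zip(*running_games(frame)))
frame['awayteamGames'] = away_games
frame['hometeamGames'] = home_games
```

This relies on the rows already being in chronological order, just like the original loop.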
I would also like to keep a running total of the points scored by each team.
frame['awayPTS'] = [88, 75, 105, 99, 110, 85, 95, 100]
frame['homePTS'] = [92, 88, 95, 97, 100, 74, 98, 110]
This is the desired output.
away  home  awayteamGP  hometeamGP  awayPTS  homePTS  awayteam_totalPTS  hometeam_totalPTS
 DET   CHI           1           1       88       92                 88                 92
 CHI   ATL           2           1       75       88                167                 88
 HOU   SEA           1           1      105       95                105                 95
 TOR   DET           1           2       99       97                 99                185
 DAL   STL           1           1      110      100                110                100
 STL   HOU           2           2       85       74                185                179
 MIA   CHI           1           3       95       98                 95                265
 SEA   CHI           2           4      100      110                195                375
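Create a defaultdict (with a default value of 0) in which you keep each team's current score, and apply along axis=1 a function that updates this dictionary and returns the resulting pair of running totals. Then just concatenate your DataFrame and the DataFrame returned by the apply call along axis=1.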
import collections
import pandas as pd

frame = pd.DataFrame({
    'home' : ['CHI', 'ATL', 'SEA', 'DET', 'STL', 'HOU', 'CHI', 'CHI'],
    'away' : ['DET', 'CHI', 'HOU', 'TOR', 'DAL', 'STL', 'MIA', 'SEA'],
    'awayPTS' : [88, 75, 105, 99, 110, 85, 95, 100],
    'homePTS' : [92, 88, 95, 97, 100, 74, 98, 110],
})

# Running points total per team; missing teams start at 0.
score = collections.defaultdict(int)

def calculate(row):
    away = row['away']
    home = row['home']
    score[away] += row['awayPTS']
    score[home] += row['homePTS']
    return pd.Series([score[away], score[home]],
                     index=['awayteam_totalPTS', 'hometeam_totalPTS'])

frame = pd.concat([frame, frame.apply(calculate, axis=1)], axis=1)
This gives:
   away  home  awayPTS  homePTS  awayteam_totalPTS  hometeam_totalPTS
0   DET   CHI       88       92                 88                 92
1   CHI   ATL       75       88                167                 88
2   HOU   SEA      105       95                105                 95
3   TOR   DET       99       97                 99                185
4   DAL   STL      110      100                110                100
5   STL   HOU       85       74                185                179
6   MIA   CHI       95       98                 95                265
7   SEA   CHI      100      110                195                375
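I think it makes sense to do a groupby and then a cumsum on each group. It is worth mentioning that this method is significantly faster than the Counter/defaultdict solution when the table has many rows (I measured it at about twice as fast with 100 rows and fifty times faster with 10,000 rows).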
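To make the idea concrete before the reshaping, here is a small sketch of groupby followed by cumsum on a made-up long-format table with one row per team appearance, in game order. The frame `long` and its values are illustrative assumptions, not the data from the question.

```python
import pandas as pd

# Hypothetical long-format data: one row per team appearance, in game order.
long = pd.DataFrame({
    'team': ['CHI', 'DET', 'CHI', 'DET', 'CHI'],
    'PTS':  [92, 88, 75, 97, 98],
})

# cumsum restarts for every team, so each row holds that team's total so far.
long['totalPTS'] = long.groupby('team')['PTS'].cumsum()
```

The result keeps the original row order, which is what lets the totals be written back next to each game later on.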
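First, we have to stack the frame so that the operation can be done regardless of whether a team played away or at home (the relabeling below assumes the columns are ordered away, awayPTS, home, homePTS):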
In [10]: frame.columns = [['away', 'away', 'home', 'home'],
                          ['team', 'PTS', 'team', 'PTS']]
In [11]: frame  # with nice descriptive column labels
Out[11]:
   away       home
   team  PTS  team  PTS
0   DET   88   CHI   92
1   CHI   75   ATL   88
2   HOU  105   SEA   95
3   TOR   99   DET   97
4   DAL  110   STL  100
5   STL   85   HOU   74
6   MIA   95   CHI   98
7   SEA  100   CHI  110
In [12]: frame_stacked = frame.stack(0)
In [13]: frame_stacked
Out[13]:
        PTS team
0 away   88  DET
  home   92  CHI
1 away   75  CHI
  home   88  ATL
2 away  105  HOU
  home   95  SEA
3 away   99  TOR
  home   97  DET
4 away  110  DAL
  home  100  STL
5 away   85  STL
  home   74  HOU
6 away   95  MIA
  home   98  CHI
7 away  100  SEA
  home  110  CHI
Now we can group by team here (and the running sum will include both their away and home games):
In [14]: total_pts = frame_stacked.groupby('team')['PTS'].cumsum()
In [15]: total_pts
Out[15]:
0  away     88
   home     92
1  away    167
   home     88
2  away    105
   home     95
3  away     99
   home    185
4  away    110
   home    100
5  away    185
   home    179
6  away     95
   home    265
7  away    195
   home    375
dtype: int64
Finally, we just need to insert them back into the frame under properly named columns:
In [16]: frame[('home', 'totalPTS')] = total_pts[:, 'home']
In [17]: frame[('away', 'totalPTS')] = total_pts[:, 'away']
In [18]: frame
Out[18]:
   away       home           away      home
   team  PTS  team  PTS  totalPTS  totalPTS
0   DET   88   CHI   92        88        92
1   CHI   75   ATL   88       167        88
2   HOU  105   SEA   95       105        95
3   TOR   99   DET   97        99       185
4   DAL  110   STL  100       110       100
5   STL   85   HOU   74       185       179
6   MIA   95   CHI   98        95       265
7   SEA  100   CHI  110       195       375
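For readers who would rather keep the flat column names from the question, the same groupby/cumsum idea can be sketched without MultiIndex columns. This is only one possible variant, not the answer's own code: the intermediate frame `long` and its `side` column are names introduced here for illustration, and the games-played column via cumcount is an addition that the answer above does not cover.

```python
import pandas as pd

frame = pd.DataFrame({
    'home': ['CHI', 'ATL', 'SEA', 'DET', 'STL', 'HOU', 'CHI', 'CHI'],
    'away': ['DET', 'CHI', 'HOU', 'TOR', 'DAL', 'STL', 'MIA', 'SEA'],
    'awayPTS': [88, 75, 105, 99, 110, 85, 95, 100],
    'homePTS': [92, 88, 95, 97, 100, 74, 98, 110],
})

# Build one row per (game, side), keeping the game number as the index.
away = (frame[['away', 'awayPTS']]
        .rename(columns={'away': 'team', 'awayPTS': 'PTS'})
        .assign(side='away'))
home = (frame[['home', 'homePTS']]
        .rename(columns={'home': 'team', 'homePTS': 'PTS'})
        .assign(side='home'))
# A stable sort on the index restores chronological (game) order.
long = pd.concat([away, home]).sort_index(kind='mergesort')

# Per-team running totals: points so far and games played so far.
long['totalPTS'] = long.groupby('team')['PTS'].cumsum()
long['GP'] = long.groupby('team').cumcount() + 1

# Write the results back; assignment aligns on the game index.
for side in ('away', 'home'):
    part = long[long['side'] == side]
    frame[side + 'teamGP'] = part['GP']
    frame[side + 'team_totalPTS'] = part['totalPTS']
```

As with the other approaches, this assumes the rows of `frame` are already in the order the games were played.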