How to merge rows (with strings) based on column value (int) in Pandas dataframe?
I have a dataset like this, where each row represents one specific match identified by gameID.
gameID      Won/Lost  Home  Away  metric2  metric3  metric4  team1  team2  team3  team4
2017020001  1         1     0     10       10       10       1      0      0      0
2017020001  0         0     1     10       10       10       0      1      0      0
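What I want to do is create a function that merges the rows sharing the same gameID. As you can see in the data sample, the two rows represent a single game, split into a home team (row_1) and an away team (row_2). I would like these two rows to sit on one row only.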
Won/Lost  h_metric2  h_metric3  h_metric4  a_metric2  a_metric3  a_metric4  h_team1  h_team2  h_team3  h_team4  a_team1  a_team2  a_team3  a_team4
1         10         10         10         10         10         10         1        0        0        0        0        1        0        0
How can I get this result?
Edit: I caused too much confusion, so I am posting my code so you can better understand what I am trying to solve.
import numpy as np
import pandas as pd
import requests
import json
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder

results = []
for game_id in range(2017020001, 2017020010, 1):
    url = 'https://statsapi.web.nhl.com/api/v1/game/{}/boxscore'.format(game_id)
    r = requests.get(url)
    game_data = r.json()
    # One row per team: first the home side, then the away side
    for homeaway in ['home', 'away']:
        game_dict = game_data.get('teams').get(homeaway).get('teamStats').get('teamSkaterStats')
        game_dict['team'] = game_data.get('teams').get(homeaway).get('team').get('name')
        game_dict['homeaway'] = homeaway
        game_dict['game_id'] = game_id
        results.append(game_dict)

df = pd.DataFrame(results)
# 1 for the team with the most goals in each game, 0 otherwise
df['Won/Lost'] = (df['goals'] == df.groupby('game_id')['goals'].transform('max')).astype(int)
df["faceOffWinPercentage"] = df["faceOffWinPercentage"].astype('float')
df["powerPlayPercentage"] = df["powerPlayPercentage"].astype('float')
df["team"] = df["team"].astype('category')
# One-hot encode the home/away flag and the team name
df = pd.get_dummies(df, columns=['homeaway'])
df = pd.get_dummies(df, columns=['team'])
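I am just assuming you are using the bread and butter: numpy, pandas & co?
If so, I also assume that your table is currently stored in a pandas.DataFrame instance named 'df':
Split your df into two dfs, then join them: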
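df_team1 = df[df['Won/Lost'] == 1].set_index('gameID')
df_team2 = df[df['Won/Lost'] == 0].set_index('gameID')
# join() matches against the other frame's index, so both are indexed by gameID first
final_df = df_team1.join(df_team2, lsuffix='_team1', rsuffix='_team2')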
Of course, you can edit this to better suit your purpose. For example, create the dfs based on the Home/Away columns, and so on.
BR Ben:]
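The Home/Away variant mentioned above could look roughly like the following. This is only a sketch under assumptions: the toy values are made up for illustration, the column names follow the question's sample table, and the `_home`/`_away` suffixes are arbitrary choices. `validate="one_to_one"` makes pandas raise if a gameID appears more than once on either side.

```python
import pandas as pd

# Toy frame mirroring the question's layout (values are illustrative only).
df = pd.DataFrame({
    "gameID":   [2017020001, 2017020001, 2017020002, 2017020002],
    "Won/Lost": [1, 0, 0, 1],
    "Home":     [1, 0, 1, 0],
    "Away":     [0, 1, 0, 1],
    "metric2":  [10, 11, 12, 13],
})

# Split on the Home flag; Won/Lost is kept only on the home side.
home = df[df["Home"] == 1].drop(columns=["Home", "Away"])
away = df[df["Home"] == 0].drop(columns=["Home", "Away", "Won/Lost"])

# Overlapping column names get the suffixes; gameID is the join key.
merged = home.merge(away, on="gameID", suffixes=("_home", "_away"),
                    validate="one_to_one")
print(merged)
```

Because `Won/Lost` exists only in the home frame, it keeps its name, while `metric2` becomes `metric2_home` and `metric2_away`.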
This is based on the assumption that each gameID has exactly two rows and that you want to group by that ID. (It also assumes I understood the question.)
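The two-rows-per-game assumption can be checked before merging. A minimal sketch, assuming the column is called `gameID` as in the sample; the toy data deliberately includes one incomplete game:

```python
import pandas as pd

# Toy data: the second game is missing its away row (illustrative only).
df = pd.DataFrame({
    "gameID": [2017020001, 2017020001, 2017020002],
    "Home":   [1, 0, 1],
})

# Count the rows per game and keep the games that do not have exactly two.
rows_per_game = df.groupby("gameID").size()
bad_games = rows_per_game[rows_per_game != 2]
print(bad_games)
```

If `bad_games` is empty, every game has exactly one pair of rows and the approaches below should apply.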
Improved solution
Given a dataframe df like
       gameID  Won/Lost  Home  Away  metric2  metric3  metric4  team1  team2  team3  team4
0  2017020001         1     1     0       10       10       10      1      0      0      0
1  2017020001         0     0     1       10       10       10      0      1      0      0
2  2017020002         1     1     0       10       10       10      1      0      0      0
3  2017020002         0     0     1       10       10       10      0      1      0      0
you can use pd.merge (and some data munging) like this:
>>> is_home = df['Home'] == 1
>>> home = df[is_home].drop(['Home', 'Away'], axis=1).add_prefix('h_').rename(columns={'h_gameID':'gameID'})
>>> away = df[~is_home].drop(['Won/Lost', 'Home', 'Away'], axis=1).add_prefix('a_').rename(columns={'a_gameID':'gameID'})
>>> pd.merge(home, away, on='gameID')
       gameID  h_Won/Lost  h_metric2  h_metric3  h_metric4  h_team1  h_team2  h_team3  h_team4  a_metric2  a_metric3  a_metric4  a_team1  a_team2  a_team3  a_team4
0  2017020001           1         10         10         10        1        0        0        0         10         10         10        0        1        0        0
1  2017020002           1         10         10         10        1        0        0        0         10         10         10        0        1        0        0
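(I kept the prefix on Won/Lost because it indicates that this is the home team's stat. Also, if anyone knows how to add the prefixes more elegantly without having to rename gameID, please leave a comment.)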
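One possible way to avoid the rename, offered as a sketch rather than a definitive answer: move gameID into the index first, since `DataFrame.add_prefix` only touches column labels, and then join on the index. The toy values below are illustrative.

```python
import pandas as pd

# Small frame in the same shape as the sample (values are illustrative).
df = pd.DataFrame({
    "gameID":   [2017020001, 2017020001],
    "Won/Lost": [1, 0],
    "Home":     [1, 0],
    "Away":     [0, 1],
    "metric2":  [10, 20],
    "team1":    [1, 0],
    "team2":    [0, 1],
})

# With gameID in the index, add_prefix leaves it untouched.
indexed = df.set_index("gameID")
is_home = indexed["Home"] == 1
home = indexed[is_home].drop(columns=["Home", "Away"]).add_prefix("h_")
away = indexed[~is_home].drop(columns=["Won/Lost", "Home", "Away"]).add_prefix("a_")

# join() aligns the two frames on their shared gameID index.
result = home.join(away).reset_index()
print(result)
```

The trade-off is an extra `set_index`/`reset_index` pair in exchange for dropping both `rename` calls.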
Original attempt
You can apply the following function after grouping
def munge(group):
    is_home = group.Home == 1
    # Won/Lost is taken from the home row only
    wonlost = group.loc[is_home, 'Won/Lost'].reset_index(drop=True)
    # Keep the metric and team columns (everything from metric2 onwards)
    group = group.loc[:, 'metric2':]
    home = group[is_home].add_prefix('h_').reset_index(drop=True)
    away = group[~is_home].add_prefix('a_').reset_index(drop=True)
    return pd.concat([wonlost, home, away], axis=1)
... like this:
>>> df.groupby('gameID').apply(munge).reset_index(level=1, drop=True)
            Won/Lost  h_metric2  h_metric3  h_metric4  h_team1  h_team2  h_team3  h_team4  a_metric2  a_metric3  a_metric4  a_team1  a_team2  a_team3  a_team4
gameID
2017020001         1         10         10         10        1        0        0        0         10         10         10        0        1        0        0
2017020002         1         10         10         10        1        0        0        0         10         10         10        0        1        0        0