[英]How to merge rows in a dataframe based on a column value?
I have a data-set that is in the shape of this, where each row represents a in a specific match that is specified by the gameID
. 我有一个这样的数据集,其中每一行代表一个由
gameID
指定的特定匹配gameID
。
gameID Won/Lost Home Away metric2 metric3 metric4 team1 team2 team3 team4
2017020001 1 1 0 10 10 10 1 0 0 0
2017020001 0 0 1 10 10 10 0 1 0 0
The thing I want to do is create a function that takes the rows with the same gameID
and joins them. 我要做的是创建一个函数,该函数将具有相同
gameID
的行合并在一起。 As you can see in data example below, the two rows represents one game that is split up into a home team (row_1 ) and an away team (row_2). 正如您在下面的数据示例中看到的那样,这两行代表一个比赛,该比赛分为一个主队(row_1)和一个客队(row_2)。 I want these two rows to sit on one row only.
我希望这两行仅坐在一行上。
Won/Lost h_metric2 h_metric3 h_metric4 a_metric2 a_metric3 a_metric4 h_team1 h_team2 h_team3 h_team4 a_team1 a_team2 a_team3 a_team4
1 10 10 10 10 10 10 1 0 0 0 0 1 0 0
How do I get this result? 我如何得到这个结果?
EDIT: I created too much confusion, posting my code so you can get a better grasp of the problem I want to solve. 编辑:我造成了太多的混乱,发布我的代码,以便您可以更好地了解我要解决的问题。
import numpy as np
import pandas as pd
import requests
import json
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder
results = []
for game_id in range(2017020001, 2017020010, 1):
url = 'https://statsapi.web.nhl.com/api/v1/game/{}/boxscore'.format(game_id)
r = requests.get(url)
game_data = r.json()
for homeaway in ['home','away']:
game_dict = game_data.get('teams').get(homeaway).get('teamStats').get('teamSkaterStats')
game_dict['team'] = game_data.get('teams').get(homeaway).get('team').get('name')
game_dict['homeaway'] = homeaway
game_dict['game_id'] = game_id
results.append(game_dict)
df = pd.DataFrame(results)
df['Won/Lost'] = df.groupby('game_id')['goals'].apply(lambda g: (g == g.max()).map({True: 1, False: 0}))
df["faceOffWinPercentage"] = df["faceOffWinPercentage"].astype('float')
df["powerPlayPercentage"] = df["powerPlayPercentage"].astype('float')
df["team"] = df["team"].astype('category')
df = pd.get_dummies(df, columns=['homeaway'])
df = pd.get_dummies(df, columns=['team'])
i just suppose, you are working with bread and butter: numpy, pandas & co? 我只是想,您正在使用面包和黄油:numpy,pandas&co?
if so, i furthermore assume, that your table currently is being stored in a pandas.DataFrame-instance called 'df': 如果是这样,我还假设您的表当前存储在名为'df'的pandas.DataFrame-instance中:
Divide your df into two df's and then join them: 将您的df分为两个df,然后将它们加入:
df_team1 = df[df['Won/Lost']==1]
df_team2 = df[df['Won/Lost']==0]
final_df = df_team1.join(df_team2, lsuffix='_team1', rsuffix='_team2', on='gameID')
You can, of course, edit it to better match your purposes. 当然,您可以对其进行编辑以更好地满足您的目的。 For instance create the df's based on Home/Away columns, etc.
例如,根据“居家/离开”列创建df,等等。
BR Ben :] BR Ben:]
This is under the assumption that you have exactly two rows per gameID
and that you want to group by that ID. 这是基于以下假设:每个
gameID
恰好有两行,并且您gameID
该ID分组。 (It also assumes that I understand the question.) (它也假设我理解这个问题。)
Improved solution 改进的解决方案
Given a dataframe df
such as 给定一个数据帧
df
如
gameID Won/Lost Home Away metric2 metric3 metric4 team1 team2 team3 team4
0 2017020001 1 1 0 10 10 10 1 0 0 0
1 2017020001 0 0 1 10 10 10 0 1 0 0
2 2017020002 1 1 0 10 10 10 1 0 0 0
3 2017020002 0 0 1 10 10 10 0 1 0 0
you can use pd.merge
(and some data munging) like this: 您可以像这样使用
pd.merge
(和一些数据处理):
>>> is_home = df['Home'] == 1
>>> home = df[is_home].drop(['Home', 'Away'], axis=1).add_prefix('h_').rename(columns={'h_gameID':'gameID'})
>>> away = df[~is_home].drop(['Won/Lost', 'Home', 'Away'], axis=1).add_prefix('a_').rename(columns={'a_gameID':'gameID'})
>>> pd.merge(home, away, on='gameID')
gameID h_Won/Lost h_metric2 h_metric3 h_metric4 h_team1 h_team2 h_team3 h_team4 a_metric2 a_metric3 a_metric4 a_team1 a_team2 a_team3 a_team4
0 2017020001 1 10 10 10 1 0 0 0 10 10 10 0 1 0 0
1 2017020002 1 10 10 10 1 0 0 0 10 10 10 0 1 0 0
(I kept the prefix for Won/Lost
because it indicates that it's the statistic for the home team. Also, if anybody knows how to add the prefixes more elegantly without having to re-rename the gameID
please leave a comment.) (我保留了“
Won/Lost
的前缀,因为它表示这是主队的统计信息。此外,如果有人知道如何更优雅地添加前缀而不必重命名gameID
请发表评论。)
Original Attempt 原始尝试
You can apply the following function after grouping 分组后可以应用以下功能
def munge(group):
is_home = group.Home == 1
wonlost = group.loc[is_home, 'Won/Lost'].reset_index(drop=True)
group = group.loc[:, 'metric2':]
home = group[is_home].add_prefix('h_').reset_index(drop=True)
away = group[~is_home].add_prefix('a_').reset_index(drop=True)
return pd.concat([wonlost, home, away], axis=1)
... like this: ... 像这样:
>>> df.groupby('gameID').apply(munge).reset_index(level=1, drop=True)
Won/Lost h_metric2 h_metric3 h_metric4 h_team1 h_team2 h_team3 h_team4 a_metric2 a_metric3 a_metric4 a_team1 a_team2 a_team3 a_team4
gameID
2017020001 1 10 10 10 1 0 0 0 10 10 10 0 1 0 0
2017020002 1 10 10 10 1 0 0 0 10 10 10 0 1 0 0
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.