How to merge rows in a dataframe based on a column value?

I have a dataset shaped like this, where each row represents a team in a specific match identified by the gameID.

  gameID          Won/Lost   Home   Away  metric2 metric3 metric4   team1 team2 team3 team4
2017020001         1          1      0      10      10      10      1     0     0      0
2017020001         0          0      1      10      10      10      0     1     0      0

What I want to do is create a function that takes the rows with the same gameID and joins them. As you can see in the data example above, the two rows represent one game that is split into a home team (row_1) and an away team (row_2). I want these two rows to sit on a single row.

Won/Lost  h_metric2 h_metric3 h_metric4 a_metric2 a_metric3 a_metric4 h_team1 h_team2 h_team3 h_team4 a_team1 a_team2 a_team3 a_team4
1            10       10         10        10         10        10      1       0        0      0         0      1        0      0

How do I get this result?

EDIT: I created too much confusion, so I am posting my code to give you a better grasp of the problem I want to solve.

import numpy as np
import pandas as pd
import requests
import json
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder

results = []
for game_id in range(2017020001, 2017020010, 1):
    url = 'https://statsapi.web.nhl.com/api/v1/game/{}/boxscore'.format(game_id)
    r = requests.get(url)
    game_data = r.json()

    # One row per team: first the home side, then the away side
    for homeaway in ['home','away']:

        game_dict = game_data.get('teams').get(homeaway).get('teamStats').get('teamSkaterStats')
        game_dict['team'] = game_data.get('teams').get(homeaway).get('team').get('name')
        game_dict['homeaway'] = homeaway
        game_dict['game_id'] = game_id
        results.append(game_dict)

df = pd.DataFrame(results)

# transform returns a result aligned with df's index, so it can be assigned directly
df['Won/Lost'] = df.groupby('game_id')['goals'].transform(lambda g: (g == g.max()).astype(int))

df["faceOffWinPercentage"] = df["faceOffWinPercentage"].astype('float')
df["powerPlayPercentage"] = df["powerPlayPercentage"].astype('float')
df["team"] = df["team"].astype('category')
df = pd.get_dummies(df, columns=['homeaway'])
df = pd.get_dummies(df, columns=['team'])

I assume you are working with the bread and butter: numpy, pandas and co?

If so, I further assume that your table is currently stored in a pandas.DataFrame instance called 'df':

Divide your df into two dfs and then join them:

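df_team1 = df[df['Won/Lost']==1]
df_team2 = df[df['Won/Lost']==0]
# join matches the 'on' column against the other frame's index, so index df_team2 by gameID
final_df = df_team1.join(df_team2.set_index('gameID'), lsuffix='_team1', rsuffix='_team2', on='gameID')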

You can, of course, edit it to better match your purposes. For instance, create the dfs based on the Home/Away columns instead.
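As a minimal sketch of that Home/Away variant, the snippet below splits on the Home flag and uses merge with suffixes. The sample values, the reduced column set, and the names df_home, df_away and the '_home'/'_away' suffixes are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# Illustrative data in the question's layout (fewer columns for brevity).
df = pd.DataFrame({
    "gameID": [2017020001, 2017020001],
    "Won/Lost": [1, 0],
    "Home": [1, 0],
    "Away": [0, 1],
    "metric2": [10, 10],
    "team1": [1, 0],
    "team2": [0, 1],
})

# Split on the Home flag rather than on Won/Lost.
df_home = df[df["Home"] == 1].drop(columns=["Home", "Away"])
# Won/Lost is dropped from the away side, so the kept value is the home team's.
df_away = df[df["Home"] == 0].drop(columns=["Home", "Away", "Won/Lost"])

# merge matches on the gameID column and suffixes the other overlapping columns.
final_df = df_home.merge(df_away, on="gameID", suffixes=("_home", "_away"))
print(final_df)
```

Splitting on Home instead of Won/Lost keeps exactly one row per side even if both rows happened to carry the same Won/Lost value.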

BR Ben :]

This is under the assumption that you have exactly two rows per gameID and that you want to group by that ID. (It also assumes that I understand the question.)
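The two-rows-per-gameID assumption can be checked before merging. The following is only a sketch on made-up data; the variable names rows_per_game, home_rows_per_game and bad_games are hypothetical.

```python
import pandas as pd

# Illustrative data: two games, each with one home row and one away row.
df = pd.DataFrame({
    "gameID": [2017020001, 2017020001, 2017020002, 2017020002],
    "Home": [1, 0, 1, 0],
    "Away": [0, 1, 0, 1],
})

# Number of rows and number of home rows for every gameID.
rows_per_game = df.groupby("gameID").size()
home_rows_per_game = df.groupby("gameID")["Home"].sum()

# Games that do not consist of exactly one home row plus one away row.
bad_games = rows_per_game.index[(rows_per_game != 2) | (home_rows_per_game != 1)]
print(list(bad_games))
```

An empty list means every game has exactly one home and one away row. pd.merge also accepts validate='one_to_one', which raises an error if a gameID occurs more than once on either side of the merge.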

Improved solution

Given a dataframe df such as

       gameID  Won/Lost  Home  Away  metric2  metric3  metric4  team1  team2  team3  team4
0  2017020001         1     1     0       10       10       10      1      0      0      0
1  2017020001         0     0     1       10       10       10      0      1      0      0
2  2017020002         1     1     0       10       10       10      1      0      0      0
3  2017020002         0     0     1       10       10       10      0      1      0      0

you can use pd.merge (and some data munging) like this:

>>> is_home = df['Home'] == 1                                                                                                                                                                                                                   
>>> home = df[is_home].drop(['Home', 'Away'], axis=1).add_prefix('h_').rename(columns={'h_gameID':'gameID'})                                                                                                                                    
>>> away = df[~is_home].drop(['Won/Lost', 'Home', 'Away'], axis=1).add_prefix('a_').rename(columns={'a_gameID':'gameID'})                                                                                                                       
>>> pd.merge(home, away, on='gameID')                                                                                                                                                                                                           
       gameID  h_Won/Lost  h_metric2  h_metric3  h_metric4  h_team1  h_team2  h_team3  h_team4  a_metric2  a_metric3  a_metric4  a_team1  a_team2  a_team3  a_team4
0  2017020001           1         10         10         10        1        0        0        0         10         10         10        0        1        0        0
1  2017020002           1         10         10         10        1        0        0        0         10         10         10        0        1        0        0

(I kept the prefix for Won/Lost because it indicates that it's the statistic for the home team. Also, if anybody knows how to add the prefixes more elegantly without having to re-rename the gameID, please leave a comment.)
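One possible way to avoid re-renaming gameID, offered as a sketch rather than as the answer's own method, is to move gameID into the index first: add_prefix only touches column labels, so the index name is left alone. The sample values and the names indexed and merged are assumptions for illustration.

```python
import pandas as pd

# Illustrative data in the question's layout (fewer columns for brevity).
df = pd.DataFrame({
    "gameID": [2017020001, 2017020001, 2017020002, 2017020002],
    "Won/Lost": [1, 0, 0, 1],
    "Home": [1, 0, 1, 0],
    "Away": [0, 1, 0, 1],
    "metric2": [10, 12, 8, 9],
})

# With gameID as the index, add_prefix leaves it untouched.
indexed = df.set_index("gameID")
is_home = indexed["Home"] == 1

home = indexed[is_home].drop(columns=["Home", "Away"]).add_prefix("h_")
away = indexed[~is_home].drop(columns=["Won/Lost", "Home", "Away"]).add_prefix("a_")

# join aligns the two frames on their shared gameID index.
merged = home.join(away).reset_index()
print(merged)
```

The trade-off is an extra set_index/reset_index pair in exchange for dropping the two rename calls.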


Original Attempt

You can apply the following function after grouping

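def munge(group):
    # Boolean mask selecting the home-team row of this game
    is_home = group.Home == 1
    # Keep the home team's result as the game's Won/Lost value
    wonlost = group.loc[is_home, 'Won/Lost'].reset_index(drop=True)
    # Keep only the metric and team columns (everything from 'metric2' onwards)
    group = group.loc[:, 'metric2':]
    home = group[is_home].add_prefix('h_').reset_index(drop=True)
    away = group[~is_home].add_prefix('a_').reset_index(drop=True)
    # Place the home and away values side by side on one row
    return pd.concat([wonlost, home, away], axis=1)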

... like this:

>>> df.groupby('gameID').apply(munge).reset_index(level=1, drop=True)                                                                                                                                                                           
            Won/Lost  h_metric2  h_metric3  h_metric4  h_team1  h_team2  h_team3  h_team4  a_metric2  a_metric3  a_metric4  a_team1  a_team2  a_team3  a_team4
gameID                                                                                                                                                        
2017020001         1         10         10         10        1        0        0        0         10         10         10        0        1        0        0
2017020002         1         10         10         10        1        0        0        0         10         10         10        0        1        0        0
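The same reshape can also be written without groupby/apply by pivoting on a home/away label. This is a different technique from the one above (DataFrame.pivot instead of a custom function), sketched on made-up data; the side column and the names long and wide are hypothetical.

```python
import pandas as pd

# Illustrative data in the question's layout (fewer columns for brevity).
df = pd.DataFrame({
    "gameID": [2017020001, 2017020001, 2017020002, 2017020002],
    "Won/Lost": [1, 0, 0, 1],
    "Home": [1, 0, 1, 0],
    "Away": [0, 1, 0, 1],
    "metric2": [10, 12, 8, 9],
})

# Label each row as home ('h') or away ('a'), then drop the indicator columns.
side = df["Home"].map({1: "h", 0: "a"})
long = df.drop(columns=["Home", "Away"]).assign(side=side)

# One row per gameID; the columns become (original column, side) pairs.
wide = long.pivot(index="gameID", columns="side")

# Flatten the two-level column labels into 'h_metric2', 'a_metric2', ...
wide.columns = [f"{s}_{c}" for c, s in wide.columns]

# Keep only the home team's result, as in the desired output.
wide = wide.drop(columns=["a_Won/Lost"]).rename(columns={"h_Won/Lost": "Won/Lost"})
print(wide)
```

pivot raises an error if a gameID has two rows with the same side label, so it also acts as a check on the one-home-one-away assumption.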
