
Pandas merge dataframes with multiple columns

I am trying to merge two data frames and cannot figure out how, as it is not straightforward. One data frame has match results for over 25000 games. The second one has team performance metrics, but only for around 1500 games. As I am not allowed to post pictures yet, here are the column names of interest:

df_match['date', 'home_team_api_id', 'away_team_api_id']
df_team_attributes['date', 'team_api_id']

Both data frames have additional columns with results or performance metrics. To merge correctly, I need to merge by date and by checking whether 'team_api_id' matches either 'home_team_api_id' or 'away_team_api_id'.

This is what I have tried so far:

df_team_performance = pd.merge(df_team_attributes, df_match,
                               how = 'left',
                               left_on = ['date', 'team_api_id', 'team_api_id'],
                               right_on = ['date', 'home_team_api_id', 'home_team_api_id'])

I have also tried with only 2 columns, but without success. What I would like to get is a new data frame with only the rows of df_team_attributes and the columns from both data frames. Thank you in advance!

Added at Correlien's request: output of print(df_match[['date', 'home_team_api_id', 'away_team_api_id', 'win_home', 'win_away', 'draw', 'win']].head(10).to_dict())

{'date': {0: '2008-08-17 00:00:00', 1: '2008-08-16 00:00:00', 2: '2008-08-16 00:00:00', 3: '2008-08-17 00:00:00', 4: '2008-08-16 00:00:00', 5: '2008-09-24 00:00:00', 6: '2008-08-16 00:00:00', 7: '2008-08-16 00:00:00', 8: '2008-08-16 00:00:00', 9: '2008-11-01 00:00:00'}, 'home_team_api_id': {0: 9987, 1: 10000, 2: 9984, 3: 9991, 4: 7947, 5: 8203, 6: 9999, 7: 4049, 8: 10001, 9: 8342}, 'away_team_api_id': {0: 9993, 1: 9994, 2: 8635, 3: 9998, 4: 9985, 5: 8342, 6: 8571, 7: 9996, 8: 9986, 9: 8571}, 'win_home': {0: 0, 1: 0, 2: 0, 3: 1, 4: 0, 5: 0, 6: 0, 7: 0, 8: 1, 9: 1}, 'win_away': {0: 0, 1: 0, 2: 1, 3: 0, 4: 1, 5: 0, 6: 0, 7: 1, 8: 0, 9: 0}, 'draw': {0: 1, 1: 1, 2: 0, 3: 0, 4: 0, 5: 1, 6: 1, 7: 0, 8: 0, 9: 0}, 'win': {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 1, 8: 1, 9: 1}}

Output of print(df_team_attributes[['date', 'team_api_id', 'buildUpPlaySpeed', 'buildUpPlaySpeedClass']].head(10).to_dict())

{'date': {0: '2010-02-22 00:00:00', 1: '2014-09-19 00:00:00', 2: '2015-09-10 00:00:00', 3: '2010-02-22 00:00:00', 4: '2011-02-22 00:00:00', 5: '2012-02-22 00:00:00', 6: '2013-09-20 00:00:00', 7: '2014-09-19 00:00:00', 8: '2015-09-10 00:00:00', 9: '2010-02-22 00:00:00'}, 'team_api_id': {0: 9930, 1: 9930, 2: 9930, 3: 8485, 4: 8485, 5: 8485, 6: 8485, 7: 8485, 8: 8485, 9: 8576}, 'buildUpPlaySpeed': {0: 60, 1: 52, 2: 47, 3: 70, 4: 47, 5: 58, 6: 62, 7: 58, 8: 59, 9: 60}, 'buildUpPlaySpeedClass': {0: 'Balanced', 1: 'Balanced', 2: 'Balanced', 3: 'Fast', 4: 'Balanced', 5: 'Balanced', 6: 'Balanced', 7: 'Balanced', 8: 'Balanced', 9: 'Balanced'}}

You can concatenate the multiple fields into one key field per table, then merge the two tables with a left join on that key.
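One way to read this suggestion is sketched below. The toy values, the `make_key` helper, and the `key` and `side` columns are assumptions made for illustration; only the column names `date`, `team_api_id`, `home_team_api_id` and `away_team_api_id` come from the question. Because a team can be on either side of a match, the sketch builds one key for the home side and one for the away side and stacks them before the left join.

```python
import pandas as pd

# Toy data (hypothetical values, column names taken from the question)
df_match = pd.DataFrame({
    "date": ["2010-02-22", "2010-02-22", "2011-02-22"],
    "home_team_api_id": [9930, 8485, 8576],
    "away_team_api_id": [8576, 9999, 9930],
    "win_home": [1, 0, 0],
})
df_team_attributes = pd.DataFrame({
    "date": ["2010-02-22", "2010-02-22", "2011-02-22"],
    "team_api_id": [9930, 8576, 8485],
    "buildUpPlaySpeed": [60, 70, 47],
})


def make_key(dates, team_ids):
    """Combine a date column and a team id column into one string key."""
    day = pd.to_datetime(dates).dt.strftime("%Y-%m-%d")
    return day + "_" + team_ids.astype(str)


df_team_attributes["key"] = make_key(
    df_team_attributes["date"], df_team_attributes["team_api_id"]
)

# Build one key per side and stack them, so every match appears twice:
# once keyed by its home team and once keyed by its away team.
home = df_match.assign(
    key=make_key(df_match["date"], df_match["home_team_api_id"]), side="home"
)
away = df_match.assign(
    key=make_key(df_match["date"], df_match["away_team_api_id"]), side="away"
)
matches_by_key = pd.concat([home, away], ignore_index=True).drop(columns="date")

# Left join: keep every row of df_team_attributes, matched or not
merged = df_team_attributes.merge(matches_by_key, on="key", how="left")
print(merged)
```

Rows of `df_team_attributes` without a match on the same date keep NaN in the match columns, which is the usual left-join behaviour. A string key is not strictly needed, since `merge` accepts a list of columns, but it mirrors the approach described above.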

Have you tried casting your date columns into the correct format and then attempting the merge? The following worked for me based on the example that you provided:

# Casting to date
df_match["date"] = pd.to_datetime(df_match["date"])
df_team_attributes["date"] = pd.to_datetime(df_team_attributes["date"])

# Merging on the date field alone
df_team_performance = pd.merge(df_team_attributes, df_match,
                               how = 'left',
                               on = 'date')

# Filtering out the required rows
result = df_team_performance.query("(team_api_id == home_team_api_id) | (team_api_id == away_team_api_id)")

Please let me know if my understanding of your question is correct.
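Merging on the date alone pairs every team attribute row with every match played on that date before the filter is applied, which may get large with 25000 matches. A possible alternative, sketched here with hypothetical toy values, is to reshape df_match to one row per team with `DataFrame.melt` and then merge on both `date` and `team_api_id`. The `side` column name is an assumption introduced for this sketch.

```python
import pandas as pd

# Toy data (hypothetical values, column names taken from the question)
df_match = pd.DataFrame({
    "date": ["2010-02-22", "2010-02-22", "2011-02-22"],
    "home_team_api_id": [9930, 8485, 8576],
    "away_team_api_id": [8576, 9999, 9930],
    "win_home": [1, 0, 0],
})
df_team_attributes = pd.DataFrame({
    "date": ["2010-02-22", "2010-02-22", "2011-02-22"],
    "team_api_id": [9930, 8576, 8485],
    "buildUpPlaySpeed": [60, 70, 47],
})

# Make sure both date columns have the same dtype before merging
df_match["date"] = pd.to_datetime(df_match["date"])
df_team_attributes["date"] = pd.to_datetime(df_team_attributes["date"])

team_cols = ["home_team_api_id", "away_team_api_id"]
other_cols = [c for c in df_match.columns if c not in team_cols]

# Long format: one row per (match, team); "side" records which
# original column the team id came from.
match_long = df_match.melt(
    id_vars=other_cols,
    value_vars=team_cols,
    var_name="side",
    value_name="team_api_id",
)

# Inner join keeps only the attribute rows that have a match on that date,
# like the filter in the answer above.
result = df_team_attributes.merge(
    match_long, on=["date", "team_api_id"], how="inner"
)
print(result)
```

Use `how="left"` instead if the unmatched rows of df_team_attributes should be kept with NaN in the match columns.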

I would first select the rows from df_match in two steps: once based on the home ID and once based on the away ID.

As an example, first I will reproduce your data frames based on your input:

import pandas as pd
import numpy as np

df_match = pd.DataFrame.from_dict({'date': {0: '2008-08-17 00:00:00', 1: '2008-08-16 00:00:00', 2: '2008-08-16 00:00:00', 3: '2008-08-17 00:00:00', 4: '2008-08-16 00:00:00', 5: '2008-09-24 00:00:00', 6: '2008-08-16 00:00:00', 7: '2008-08-16 00:00:00', 8: '2008-08-16 00:00:00', 9: '2008-11-01 00:00:00'}, 'home_team_api_id': {0: 9987, 1: 10000, 2: 9984, 3: 9991, 4: 7947, 5: 8203, 6: 9999, 7: 4049, 8: 10001, 9: 8342}, 'away_team_api_id': {0: 9993, 1: 9994, 2: 8635, 3: 9998, 4: 9985, 5: 8342, 6: 8571, 7: 9996, 8: 9986, 9: 8571}, 'win_home': {0: 0, 1: 0, 2: 0, 3: 1, 4: 0, 5: 0, 6: 0, 7: 0, 8: 1, 9: 1}, 'win_away': {0: 0, 1: 0, 2: 1, 3: 0, 4: 1, 5: 0, 6: 0, 7: 1, 8: 0, 9: 0}, 'draw': {0: 1, 1: 1, 2: 0, 3: 0, 4: 0, 5: 1, 6: 1, 7: 0, 8: 0, 9: 0}, 'win': {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 1, 8: 1, 9: 1}})
df_team_attributes = pd.DataFrame.from_dict( {'date': {0: '2010-02-22 00:00:00', 1: '2014-09-19 00:00:00', 2: '2015-09-10 00:00:00', 3: '2010-02-22 00:00:00', 4: '2011-02-22 00:00:00', 5: '2012-02-22 00:00:00', 6: '2013-09-20 00:00:00', 7: '2014-09-19 00:00:00', 8: '2015-09-10 00:00:00', 9: '2010-02-22 00:00:00'}, 'team_api_id': {0: 9930, 1: 9930, 2: 9930, 3: 8485, 4: 8485, 5: 8485, 6: 8485, 7: 8485, 8: 8485, 9: 8576}, 'buildUpPlaySpeed': {0: 60, 1: 52, 2: 47, 3: 70, 4: 47, 5: 58, 6: 62, 7: 58, 8: 59, 9: 60}, 'buildUpPlaySpeedClass': {0: 'Balanced', 1: 'Balanced', 2: 'Balanced', 3: 'Fast', 4: 'Balanced', 5: 'Balanced', 6: 'Balanced', 7: 'Balanced', 8: 'Balanced', 9: 'Balanced'}})

df_match['date'] = pd.DatetimeIndex(df_match['date'])
df_team_attributes['date'] = pd.DatetimeIndex(df_team_attributes['date'])

df_team_attributes.set_index(['date', 'team_api_id'], drop=True, inplace=True)

I set date and team_api_id as a multi-index, which makes processing easier. The problem for me at the moment is that there is no overlap between the data frames, because you only gave the first 10 rows. Therefore, as a first step (which you don't have to do), I select a few date/team_id combinations for both home_team_api_id and away_team_api_id from the df_match data frame and add them to df_team_attributes like this:

n_pick = 2
i_start = 0
total_index = df_team_attributes.index
for api_id in ['home_team_api_id', 'away_team_api_id']:
    df = df_match.iloc[i_start:i_start + n_pick].copy()
    df.rename(columns={api_id: 'team_api_id'}, inplace=True)
    df.set_index(['date', 'team_api_id'], inplace=True, drop=True)
    total_index = total_index.append(df.index)
    i_start += n_pick
total_index = total_index.drop_duplicates()
df_team_attributes = df_team_attributes.reindex(total_index)
df_team_attributes.ffill(inplace=True)

This step extends the df_team_attributes data frame by 4 rows, so we know that we should get an overlap of 4 rows.

Now, your algorithm could look like this:

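selections = []
for api_id in ['home_team_api_id', 'away_team_api_id']:
    # Index the matches by (date, team id) for the current side
    df = df_match.set_index(['date', api_id], drop=False)
    # Keep only the (date, team) combinations present in df_team_attributes
    df = df.reindex(df_team_attributes.index)
    df.dropna(axis=0, inplace=True)
    selections.append(df)

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the parts
df_selection = pd.concat(selections)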

df_team_attributes = df_team_attributes.join(df_selection, how='inner')

print(df_team_attributes)

The resulting data frame is now:

                        buildUpPlaySpeed buildUpPlaySpeedClass  ... draw  win
date       team_api_id                                          ...          
2008-08-17 9987                     60.0              Balanced  ...  1.0  0.0
2008-08-16 10000                    60.0              Balanced  ...  1.0  0.0
           8635                     60.0              Balanced  ...  0.0  1.0
2008-08-17 9998                     60.0              Balanced  ...  0.0  1.0
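The same two-step idea (once for the home ID, once for the away ID) can also be written with two plain merges and `pd.concat`, without setting a multi-index. This is only a sketch on hypothetical toy values, not the code that produced the output above.

```python
import pandas as pd

# Toy data (hypothetical values, column names taken from the question)
df_match = pd.DataFrame({
    "date": ["2010-02-22", "2010-02-22", "2011-02-22"],
    "home_team_api_id": [9930, 8485, 8576],
    "away_team_api_id": [8576, 9999, 9930],
    "win_home": [1, 0, 0],
})
df_team_attributes = pd.DataFrame({
    "date": ["2010-02-22", "2010-02-22", "2011-02-22"],
    "team_api_id": [9930, 8576, 8485],
    "buildUpPlaySpeed": [60, 70, 47],
})

df_match["date"] = pd.to_datetime(df_match["date"])
df_team_attributes["date"] = pd.to_datetime(df_team_attributes["date"])

parts = []
for id_col in ["home_team_api_id", "away_team_api_id"]:
    # Match team_api_id against the home column first, then the away column
    part = df_team_attributes.merge(
        df_match,
        left_on=["date", "team_api_id"],
        right_on=["date", id_col],
        how="inner",
    )
    parts.append(part)

# Stack the home matches and the away matches into one data frame
df_team_performance = pd.concat(parts, ignore_index=True)
print(df_team_performance)
```

Note that `left_on` and `right_on` take one entry per key pair, so the date is listed once and the team ID once in each list.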
