
Pandas merge dataframes with multiple columns

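I am trying to merge two dataframes but am having trouble figuring out how, because it is not straightforward. One dataframe holds match results for more than 25,000 games. The second has team performance metrics, but only for about 1,500 games. Since I can't post images yet, here are the column names of interest: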

df_match['date', 'home_team_api_id', 'away_team_api_id']
df_team_attributes['date', 'team_api_id']

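Both dataframes have additional columns with results or performance metrics. To merge correctly, I need to merge on date and check whether "team_api_id" matches either "home_team_api_id" or "away_team_api_id".

This is what I have tried so far: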

df_team_performance = pd.merge(df_team_attributes, df_match,
                               how = 'left',
                               left_on = ['date', 'team_api_id', 'team_api_id'],
                               right_on = ['date', 'home_team_api_id', 'home_team_api_id'])

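I also tried with only 2 columns, without success. What I want is a new dataframe containing only the rows of df_team_attributes and the columns of both dataframes. Thanks in advance!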

Added at Correlien's request. Output of print(df_match[['date', 'home_team_api_id', 'away_team_api_id', 'win_home', 'win_away', 'draw', 'win']].head(10).to_dict()):

{'date': {0: '2008-08-17 00:00:00', 1: '2008-08-16 00:00:00', 2: '2008-08-16 00:00:00', 3: '2008-08-17 00:00:00', 4: '2008-08-16 00:00:00', 5: '2008-09-24 00:00:00', 6: '2008-08-16 00:00:00', 7: '2008-08-16 00:00:00', 8: '2008-08-16 00:00:00', 9: '2008-11-01 00:00:00'}, 'home_team_api_id': {0: 9987, 1: 10000, 2: 9984, 3: 9991, 4: 7947, 5: 8203, 6: 9999, 7: 4049, 8: 10001, 9: 8342}, 'away_team_api_id': {0: 9993, 1: 9994, 2: 8635, 3: 9998, 4: 9985, 5: 8342, 6: 8571, 7: 9996, 8: 9986, 9: 8571}, 'win_home': {0: 0, 1: 0, 2: 0, 3: 1, 4: 0, 5: 0, 6: 0, 7: 0, 8: 1, 9: 1}, 'win_away': {0: 0, 1: 0, 2: 1, 3: 0, 4: 1, 5: 0, 6: 0, 7: 1, 8: 0, 9: 0}, 'draw': {0: 1, 1: 1, 2: 0, 3: 0, 4: 0, 5: 1, 6: 1, 7: 0, 8: 0, 9: 0}, 'win': {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 1, 8: 1, 9: 1}}

Output of print(df_team_attributes[['date', 'team_api_id', 'buildUpPlaySpeed', 'buildUpPlaySpeedClass']].head(10).to_dict()):

{'date': {0: '2010-02-22 00:00:00', 1: '2014-09-19 00:00:00', 2: '2015-09-10 00:00:00', 3: '2010-02-22 00:00:00', 4: '2011-02-22 00:00:00', 5: '2012-02-22 00:00:00', 6: '2013-09-20 00:00:00', 7: '2014-09-19 00:00:00', 8: '2015-09-10 00:00:00', 9: '2010-02-22 00:00:00'}, 'team_api_id': {0: 9930, 1: 9930, 2: 9930, 3: 8485, 4: 8485, 5: 8485, 6: 8485, 7: 8485, 8: 8485, 9: 8576}, 'buildUpPlaySpeed': {0: 60, 1: 52, 2: 47, 3: 70, 4: 47, 5: 58, 6: 62, 7: 58, 8: 59, 9: 60}, 'buildUpPlaySpeedClass': {0: 'Balanced', 1: 'Balanced', 2: 'Balanced', 3: 'Fast', 4: 'Balanced', 5: 'Balanced', 6: 'Balanced', 7: 'Balanced', 8: 'Balanced', 9: 'Balanced'}}

You can concatenate the multiple fields into a single field in each table, and then merge the two tables with a left join.
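A minimal sketch of the single-key idea, assuming the column names from the question. The sample values and the make_key helper are made up for illustration; because a team can be either the home or the away side, the sketch builds the key once per side and stacks the matched rows.

```python
import pandas as pd

# Tiny made-up frames using the question's column names (values are illustrative only)
df_match = pd.DataFrame({
    "date": ["2010-02-22 00:00:00", "2010-02-22 00:00:00", "2011-02-22 00:00:00"],
    "home_team_api_id": [9930, 8485, 1111],
    "away_team_api_id": [2222, 3333, 8485],
    "win": [1, 0, 1],
})
df_team_attributes = pd.DataFrame({
    "date": ["2010-02-22 00:00:00", "2011-02-22 00:00:00"],
    "team_api_id": [9930, 8485],
    "buildUpPlaySpeed": [60, 47],
})


def make_key(dates: pd.Series, ids: pd.Series) -> pd.Series:
    """Join date and team id into one string key (hypothetical helper)."""
    return dates.astype(str) + "_" + ids.astype(str)


df_team_attributes["key"] = make_key(
    df_team_attributes["date"], df_team_attributes["team_api_id"]
)

# One left join per side, keeping only the attribute rows that found a match
parts = []
for side in ["home_team_api_id", "away_team_api_id"]:
    right = df_match.assign(key=make_key(df_match["date"], df_match[side]))
    right = right.drop(columns="date")  # avoid date_x / date_y after the merge
    merged = df_team_attributes.merge(right, how="left", on="key")
    parts.append(merged.dropna(subset=[side]))

result = pd.concat(parts, ignore_index=True).drop(columns="key")
print(result)
```

The string key only works if both tables format the date identically, so converting both date columns with pd.to_datetime first is probably safer on real data.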

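Have you tried converting the date columns to a proper datetime format and then merging? Based on the sample you provided, the following worked for me: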

# Casting to date
df_match["date"] = pd.to_datetime(df_match["date"])
df_team_attributes["date"] = pd.to_datetime(df_team_attributes["date"])

# Merging on the date field alone
df_team_performance = pd.merge(df_team_attributes, df_match,
                               how = 'left',
                               on = 'date')

# Filtering out the required rows
result = df_team_performance.query("(team_api_id == home_team_api_id) | (team_api_id == away_team_api_id)")

Please let me know whether I have understood your question correctly.
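Merging on date alone pairs every attribute row with every match played that day before the filter runs, which may get large with 25,000 matches. As an alternative sketch (not part of the answer above), the matches can be stacked once per side so that an ordinary two-key left merge does the work. The sample values are made up, and the sketch assumes a team plays at most one match per date.

```python
import pandas as pd

# Made-up sample data with the question's column names
df_match = pd.DataFrame({
    "date": pd.to_datetime(["2010-02-22", "2010-02-22", "2011-02-22"]),
    "home_team_api_id": [9930, 8485, 1111],
    "away_team_api_id": [2222, 3333, 8485],
    "win": [1, 0, 1],
})
df_team_attributes = pd.DataFrame({
    "date": pd.to_datetime(["2010-02-22", "2011-02-22", "2012-02-22"]),
    "team_api_id": [9930, 8485, 8485],
    "buildUpPlaySpeed": [60, 47, 58],
})

# Stack two views of the matches: one keyed by the home team, one by the away team.
# Each view gets a team_api_id column while keeping the original home/away columns.
sides = ["home_team_api_id", "away_team_api_id"]
long = pd.concat(
    [df_match.assign(team_api_id=df_match[side]) for side in sides],
    ignore_index=True,
)

# A left merge on both keys keeps every row of df_team_attributes
result = df_team_attributes.merge(long, how="left", on=["date", "team_api_id"])
print(result)
```

Attribute rows without a match on that date keep NaN in the match columns; passing how="inner" instead would drop them.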

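I would first select the rows in df_match in two steps: once based on the home ID and then once based on the away ID.

For example, first I reproduce your dataframes from your input: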

import pandas as pd
import numpy as np

df_match = pd.DataFrame.from_dict({'date': {0: '2008-08-17 00:00:00', 1: '2008-08-16 00:00:00', 2: '2008-08-16 00:00:00', 3: '2008-08-17 00:00:00', 4: '2008-08-16 00:00:00', 5: '2008-09-24 00:00:00', 6: '2008-08-16 00:00:00', 7: '2008-08-16 00:00:00', 8: '2008-08-16 00:00:00', 9: '2008-11-01 00:00:00'}, 'home_team_api_id': {0: 9987, 1: 10000, 2: 9984, 3: 9991, 4: 7947, 5: 8203, 6: 9999, 7: 4049, 8: 10001, 9: 8342}, 'away_team_api_id': {0: 9993, 1: 9994, 2: 8635, 3: 9998, 4: 9985, 5: 8342, 6: 8571, 7: 9996, 8: 9986, 9: 8571}, 'win_home': {0: 0, 1: 0, 2: 0, 3: 1, 4: 0, 5: 0, 6: 0, 7: 0, 8: 1, 9: 1}, 'win_away': {0: 0, 1: 0, 2: 1, 3: 0, 4: 1, 5: 0, 6: 0, 7: 1, 8: 0, 9: 0}, 'draw': {0: 1, 1: 1, 2: 0, 3: 0, 4: 0, 5: 1, 6: 1, 7: 0, 8: 0, 9: 0}, 'win': {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 1, 8: 1, 9: 1}})
df_team_attributes = pd.DataFrame.from_dict( {'date': {0: '2010-02-22 00:00:00', 1: '2014-09-19 00:00:00', 2: '2015-09-10 00:00:00', 3: '2010-02-22 00:00:00', 4: '2011-02-22 00:00:00', 5: '2012-02-22 00:00:00', 6: '2013-09-20 00:00:00', 7: '2014-09-19 00:00:00', 8: '2015-09-10 00:00:00', 9: '2010-02-22 00:00:00'}, 'team_api_id': {0: 9930, 1: 9930, 2: 9930, 3: 8485, 4: 8485, 5: 8485, 6: 8485, 7: 8485, 8: 8485, 9: 8576}, 'buildUpPlaySpeed': {0: 60, 1: 52, 2: 47, 3: 70, 4: 47, 5: 58, 6: 62, 7: 58, 8: 59, 9: 60}, 'buildUpPlaySpeedClass': {0: 'Balanced', 1: 'Balanced', 2: 'Balanced', 3: 'Fast', 4: 'Balanced', 5: 'Balanced', 6: 'Balanced', 7: 'Balanced', 8: 'Balanced', 9: 'Balanced'}})

df_match['date'] = pd.DatetimeIndex(df_match['date'])
df_team_attributes['date'] = pd.DatetimeIndex(df_team_attributes['date'])

df_team_attributes.set_index(['date', 'team_api_id'], drop=True, inplace=True)

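I set date and team_api_id as a MultiIndex, which makes things easier to handle. The problem for me at the moment is that there is no overlap between the dataframes, because you only gave the first 10 rows. So, as a first step (you don't have to do this), I select a few date/team_id combinations of home_team_api_id and away_team_api_id from the df_match dataframe and add them to df_team_attributes, as follows: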

n_pick = 2
i_start = 0
total_index = df_team_attributes.index
for api_id in ['home_team_api_id', 'away_team_api_id']:
    df = df_match.iloc[i_start:i_start + n_pick].copy()
    df.rename(columns={api_id: 'team_api_id'}, inplace=True)
    df.set_index(['date', 'team_api_id'], inplace=True, drop=True)
    total_index = total_index.append(df.index)
    i_start += n_pick
total_index = total_index.drop_duplicates()
df_team_attributes = df_team_attributes.reindex(total_index)
df_team_attributes.ffill(inplace=True)

This step adds 4 rows to the df_team_attributes dataframe, so we know the overlap should be 4 rows.

Now your algorithm could look like this:

selections = []
for api_id in ['home_team_api_id', 'away_team_api_id']:
    # Index the matches by (date, team id) for this side, keeping the columns
    df = df_match.set_index(['date', api_id], drop=False)
    # Align to the attribute rows; rows without a match become all-NaN
    df = df.reindex(df_team_attributes.index)
    df.dropna(axis=0, inplace=True)
    selections.append(df)

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the pieces instead
df_selection = pd.concat(selections)

df_team_attributes = df_team_attributes.join(df_selection, how='inner')

print(df_team_attributes)

The resulting dataframe is now:

                        buildUpPlaySpeed buildUpPlaySpeedClass  ... draw  win
date       team_api_id                                          ...          
2008-08-17 9987                     60.0              Balanced  ...  1.0  0.0
2008-08-16 10000                    60.0              Balanced  ...  1.0  0.0
           8635                     60.0              Balanced  ...  0.0  1.0
2008-08-17 9998                     60.0              Balanced  ...  0.0  1.0

