How can I label DataFrame rows whose sets of column values match?

Question

I have a DataFrame containing columns that overlap in a sense:

import pandas as pd

df = pd.DataFrame({
    'Date': ['2017-04-02', '2017-04-02', '2017-04-02', '2017-04-02', '2017-04-02', '2017-04-02'],
    'Team': ['CHC', 'ARI', 'NYY', 'TBR', 'STL', 'SFG'],
    'Home': [True, False, True, False, False, True],
    'Opp': ['STL', 'SFG', 'TBR', 'NYY', 'CHC', 'ARI'],
    'Rslt': ['L', 'W', 'L', 'W', 'W', 'L']
})

df['Date'] = pd.to_datetime(df['Date'])

print(df)

OUTPUT:

        Date Team   Home  Opp Rslt
0 2017-04-02  CHC   True  STL    L
1 2017-04-02  ARI  False  SFG    W
2 2017-04-02  NYY   True  TBR    L
3 2017-04-02  TBR  False  NYY    W
4 2017-04-02  STL  False  CHC    W
5 2017-04-02  SFG   True  ARI    L

On 2017-04-02, there were 3 games played. The DataFrame contains the game results for each day for each team, so these 3 games produce 6 rows. Take rows 2 and 3, which describe the game between NYY and TBR:

Row 2 gives the NYY result of L, meaning they lost
Row 3 gives the TBR result of W, meaning they won

What I'm trying to do is group all row pairs that relate to the same game. My initial idea was to create a new column that would act as a label for the pair and then use that to group on or to set a MultiIndex. I considered concatenating the three columns (Date, Team and Opp) into a single string for each row and then, using sets, looking through all rows for each date in Date to find the other row that contains the same elements:

df['Match'] = df['Date'].dt.strftime('%Y-%m-%d') + ',' + df['Team'] + ',' + df['Opp']

print(df)

OUTPUT:

        Date Team   Home  Opp Rslt               Match
0 2017-04-02  CHC   True  STL    L  2017-04-02,CHC,STL
1 2017-04-02  ARI  False  SFG    W  2017-04-02,ARI,SFG
2 2017-04-02  NYY   True  TBR    L  2017-04-02,NYY,TBR
3 2017-04-02  TBR  False  NYY    W  2017-04-02,TBR,NYY
4 2017-04-02  STL  False  CHC    W  2017-04-02,STL,CHC
5 2017-04-02  SFG   True  ARI    L  2017-04-02,SFG,ARI

From here, I'm not sure how to proceed. I have a method in mind using sets that I've used in the past. If we focus on rows 2 and 3 again, splitting each string on the comma, subtracting the resulting sets, and taking bool() of the difference returns False for two sets containing the same string elements and True for anything else (different sets):

print(
    bool(set('2017-04-02,NYY,TBR'.split(',')) - set('2017-04-02,TBR,NYY'.split(',')))
)
print(
    bool(set('2017-04-02,NYY,TBR'.split(',')) - set('2017-04-02,CHC,STL'.split(',')))
)

OUTPUT:

False
True

Is there a better way to take a row value in a column, look up all other row values in that same column, and label the rows that are related? The kind of label I would like is a unique numbering of games. Since these three games happen on the same day, labelling the pairs as 1, 2, 3 would be great, so that each game pair for each day has a unique ID.

PS I've also seen this post that kinda looks like what I'm trying to do... I've tried using .isin() but kept running into errors, so I scrapped that approach. I thought about pd.DataFrame.lookup but I'm not quite sure if that's the right approach either. I just need a way to group up each pair of rows.
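A minimal sketch of the per-day numbering described above, assuming each pair of teams meets at most once per date (a doubleheader would need an extra tiebreaker). The Pair and Game column names are made up for illustration. Sorting the two team names gives both rows of a game the same key, which sidesteps the pairwise set comparison:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2017-04-02'] * 6),
    'Team': ['CHC', 'ARI', 'NYY', 'TBR', 'STL', 'SFG'],
    'Opp': ['STL', 'SFG', 'TBR', 'NYY', 'CHC', 'ARI'],
    'Rslt': ['L', 'W', 'L', 'W', 'W', 'L'],
})

# Order-independent key: 'NYY'/'TBR' and 'TBR'/'NYY' both become 'NYY-TBR'.
# 'Pair' is a hypothetical helper column, not part of the original data.
df['Pair'] = ['-'.join(sorted(p)) for p in zip(df['Team'], df['Opp'])]

# Number the games within each date, starting at 1, in order of first appearance.
df['Game'] = df.groupby('Date')['Pair'].transform(lambda s: s.factorize()[0] + 1)

print(df[['Date', 'Team', 'Opp', 'Rslt', 'Game']])
# Game column: 1, 2, 3, 3, 1, 2
```

From there, df.set_index(['Date', 'Game']) or df.groupby(['Date', 'Game']) treats each pair of rows as one game.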

Answer 1

Merge the DataFrame with itself, swap the values where it's not a home game, keep the columns you want, and then drop the duplicates:

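import numpy as np

# Pair each row with its opponent's row for the same date
df2 = df.merge(df, left_on=['Date', 'Team'], right_on=['Date', 'Opp'], suffixes=('_Home', '_Away'))

# Where the left-hand row is the away team, swap the two sides.
# 'Home' goes last because Home_Home is the condition for every swap.
swap_cols = ['Team', 'Opp', 'Rslt', 'Home']
for col in swap_cols:
    df2[f'{col}_Home'], df2[f'{col}_Away'] = np.where(df2.Home_Home, [df2[f'{col}_Home'], df2[f'{col}_Away']], [df2[f'{col}_Away'], df2[f'{col}_Home']])

# Both rows of a game are now identical, so keep one of each
df2 = df2[['Date', 'Team_Home', 'Rslt_Home', 'Team_Away', 'Rslt_Away']].drop_duplicates()
print(df2)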

Output:

        Date Team_Home Rslt_Home Team_Away Rslt_Away
0 2017-04-02       CHC         L       STL         W
1 2017-04-02       SFG         L       ARI         W
2 2017-04-02       NYY         L       TBR         W

Answer 2

IIUC, you can do it this way using merge and query:

df_match = df.merge(df, 
                    left_on=['Date', 'Team'], 
                    right_on=['Date', 'Opp'], 
                    suffixes=('','_opp'))

df_match.query('Home')

Output:

        Date Team  Home  Opp Rslt Team_opp  Home_opp Opp_opp Rslt_opp
0 2017-04-02  CHC  True  STL    L      STL     False     CHC        W
2 2017-04-02  NYY  True  TBR    L      TBR     False     NYY        W
5 2017-04-02  SFG  True  ARI    L      ARI     False     SFG        W
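If a compact one-row-per-game table with a game number is the goal, the merged result can be trimmed further. This is only a sketch: the Game column name is an assumption, and validate='one_to_one' is added so that the merge raises an error if a team appears more than once on the same date rather than silently duplicating rows.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2017-04-02'] * 6),
    'Team': ['CHC', 'ARI', 'NYY', 'TBR', 'STL', 'SFG'],
    'Home': [True, False, True, False, False, True],
    'Opp': ['STL', 'SFG', 'TBR', 'NYY', 'CHC', 'ARI'],
    'Rslt': ['L', 'W', 'L', 'W', 'W', 'L'],
})

games = (
    df.merge(df,
             left_on=['Date', 'Team'],
             right_on=['Date', 'Opp'],
             suffixes=('', '_opp'),
             validate='one_to_one')   # fail loudly on duplicate (Date, Team) keys
      .query('Home')                  # keep the home team's view of each game
      [['Date', 'Team', 'Rslt', 'Team_opp', 'Rslt_opp']]
      .reset_index(drop=True)
)

# Number the games within each date, starting at 1 ('Game' is a made-up name)
games['Game'] = games.groupby('Date').cumcount() + 1
print(games)
```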

Answer 3

To match the other answers better, you can deduplicate by selecting only the rows where the team was at home and then creating new columns for Result_home and Result_away.

I haven't tested, but I think this should be faster than approaches that merge the table onto itself.

I'm not sure what you want your final output to look like, but here's one option:

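rename_cols = {
    'Team': 'Home_Team',
    'Opp': 'Away_Team',
    'Rslt': 'Result_home',
}

# Keep only the home team's row for each game; copy to avoid SettingWithCopyWarning
df = df[df['Home']].copy()
# The away result is the opposite of the home result (assumes no ties)
df['Result_away'] = df['Rslt'].replace({'L': 'W', 'W': 'L'})
df = df.rename(columns=rename_cols).drop(columns=['Home'])
df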
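Dropping the away rows relies on the away result always being the opposite of the home result. If that is in doubt, a one-off check like the following could be run first. It is a sketch only: it does use a merge, but just for validation, and the variable names are made up.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2017-04-02'] * 6),
    'Team': ['CHC', 'ARI', 'NYY', 'TBR', 'STL', 'SFG'],
    'Home': [True, False, True, False, False, True],
    'Opp': ['STL', 'SFG', 'TBR', 'NYY', 'CHC', 'ARI'],
    'Rslt': ['L', 'W', 'L', 'W', 'W', 'L'],
})

home = df[df['Home']]
away = df[~df['Home']]

# Line up each home row with the away row of the same game
check = home.merge(away,
                   left_on=['Date', 'Opp'],
                   right_on=['Date', 'Team'],
                   suffixes=('_home', '_away'))

# Flip the home result and compare it with the recorded away result
flipped = check['Rslt_home'].map({'L': 'W', 'W': 'L'})
consistent = bool((flipped == check['Rslt_away']).all())
print(consistent)  # True for the sample data
```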
