How to label rows of DataFrame where the set of column values match?

I have a DataFrame containing columns that overlap in a sense:

import pandas as pd

df = pd.DataFrame({
    'Date': ['2017-04-02', '2017-04-02', '2017-04-02', '2017-04-02', '2017-04-02', '2017-04-02'],
    'Team': ['CHC', 'ARI', 'NYY', 'TBR', 'STL', 'SFG'],
    'Home': [True, False, True, False, False, True],
    'Opp': ['STL', 'SFG', 'TBR', 'NYY', 'CHC', 'ARI'],
    'Rslt': ['L', 'W', 'L', 'W', 'W', 'L']
})

df['Date'] = pd.to_datetime(df['Date'])

print(df)

OUTPUT:

        Date Team   Home  Opp Rslt
0 2017-04-02  CHC   True  STL    L
1 2017-04-02  ARI  False  SFG    W
2 2017-04-02  NYY   True  TBR    L
3 2017-04-02  TBR  False  NYY    W
4 2017-04-02  STL  False  CHC    W
5 2017-04-02  SFG   True  ARI    L

On 2017-04-02, 3 games were played. The DataFrame contains one game result per team per day, so these 3 games produce 6 rows. Take rows 2 and 3, which both describe the game between NYY and TBR:

  • Row 2 gives the NYY result of L, meaning they lost
  • Row 3 gives the TBR result of W, meaning they won

What I'm trying to do is group all row pairs that relate to the same game. My initial idea was to create a new column that would act as a label for the pair and then use that to group on or set as a MultiIndex. I considered concatenating the three columns (Date, Team, Opp) into a single string for each row and then, using sets, looking through all rows for each date in Date to find the other row that contains the same elements:

df['Match'] = df['Date'].dt.strftime('%Y-%m-%d') + ',' + df['Team'] + ',' + df['Opp']

print(df)

OUTPUT:

        Date Team   Home  Opp Rslt               Match
0 2017-04-02  CHC   True  STL    L  2017-04-02,CHC,STL
1 2017-04-02  ARI  False  SFG    W  2017-04-02,ARI,SFG
2 2017-04-02  NYY   True  TBR    L  2017-04-02,NYY,TBR
3 2017-04-02  TBR  False  NYY    W  2017-04-02,TBR,NYY
4 2017-04-02  STL  False  CHC    W  2017-04-02,STL,CHC
5 2017-04-02  SFG   True  ARI    L  2017-04-02,SFG,ARI

From here, I'm not sure how to proceed. I have a method in mind using sets that I've used in the past. If we focus on rows 2 and 3 again: splitting each string on ',', subtracting the resulting sets, and taking bool() of the difference returns False for two sets containing the same string elements and True for anything else (different sets):

print(
    bool(set('2017-04-02,NYY,TBR'.split(',')) - set('2017-04-02,TBR,NYY'.split(',')))
)
print(
    bool(set('2017-04-02,NYY,TBR'.split(',')) - set('2017-04-02,CHC,STL'.split(',')))
)

OUTPUT:

False
True

Is there a better way to take a row value in a column, look up all other row values in that same column, and label the rows that are related? The kind of label I would like is a unique numbering of games. Since these three games happen on the same day, labelling the pairs as 1, 2, 3 would be great, so that each game pair for each day has a unique ID.

PS: I've also seen this post that kinda looks like what I'm trying to do... I've tried using .isin() but kept running into errors, so I scrapped that approach. I thought about pd.DataFrame.lookup but I'm not quite sure if that's the right approach either. I just need a way to group up each pair of rows.

Merge the DataFrame on itself, swap values where it's not a home game, take the information you want, and then drop the duplicates:

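import numpy as np

# Pair each row with its opponent's row for the same date.
df2 = df.merge(df, left_on=['Date', 'Team'], right_on=['Date', 'Opp'], suffixes=['_Home', '_Away'])

# Where the left-hand row is the away team, swap the two sides.
# 'Home' is swapped last because Home_Home is the swap condition.
swap_cols = ['Team', 'Opp', 'Rslt', 'Home']
for col in swap_cols:
    df2[f'{col}_Home'], df2[f'{col}_Away'] = np.where(
        df2.Home_Home,
        [df2[f'{col}_Home'], df2[f'{col}_Away']],
        [df2[f'{col}_Away'], df2[f'{col}_Home']],
    )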

df2 = df2[['Date','Team_Home', 'Rslt_Home', 'Team_Away', 'Rslt_Away']].drop_duplicates()
print(df2)

Output:

        Date Team_Home Rslt_Home Team_Away Rslt_Away
0 2017-04-02       CHC         L       STL         W
1 2017-04-02       SFG         L       ARI         W
2 2017-04-02       NYY         L       TBR         W
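
The question also asks for a per-day game number. A minimal sketch of one way to add it, assuming a one-row-per-game frame shaped like the output above (the `games` frame and the `Game` column name are stand-ins for illustration, not part of the original answer):

```python
import pandas as pd

# Stand-in for the deduplicated frame printed above (one row per game).
games = pd.DataFrame({
    'Date': pd.to_datetime(['2017-04-02'] * 3),
    'Team_Home': ['CHC', 'SFG', 'NYY'],
    'Rslt_Home': ['L', 'L', 'L'],
    'Team_Away': ['STL', 'ARI', 'TBR'],
    'Rslt_Away': ['W', 'W', 'W'],
})

# Number the games within each date, starting at 1.
games['Game'] = games.groupby('Date').cumcount() + 1
print(games)
```

Because cumcount() restarts for every group, each date gets its own 1, 2, 3, ... sequence, in the order the rows appear.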

IIUC, you can do it this way using merge and query.

df_match = df.merge(df, 
                    left_on=['Date', 'Team'], 
                    right_on=['Date', 'Opp'], 
                    suffixes=('','_opp'))

df_match.query('Home')

Output:

        Date Team  Home  Opp Rslt Team_opp  Home_opp Opp_opp Rslt_opp
0 2017-04-02  CHC  True  STL    L      STL     False     CHC        W
2 2017-04-02  NYY  True  TBR    L      TBR     False     NYY        W
5 2017-04-02  SFG  True  ARI    L      ARI     False     SFG        W
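
One caveat, stated as an assumption about the data rather than something the question confirms: a self-merge on (Date, Team) pairs rows correctly only if each team plays at most once per date. The sketch below uses a made-up doubleheader (hypothetical rows, not from the question) to show how merge's validate argument can surface that case instead of silently multiplying rows:

```python
import pandas as pd

# Hypothetical doubleheader: NYY and TBR meet twice on the same date.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2017-04-02'] * 4),
    'Team': ['NYY', 'TBR', 'NYY', 'TBR'],
    'Home': [True, False, True, False],
    'Opp': ['TBR', 'NYY', 'TBR', 'NYY'],
    'Rslt': ['L', 'W', 'W', 'L'],
})

merge_kwargs = dict(left_on=['Date', 'Team'],
                    right_on=['Date', 'Opp'],
                    suffixes=('', '_opp'))

# Without validation, every row matches both of the opponent's rows.
unchecked = df.merge(df, **merge_kwargs)
print(len(unchecked))

# With validation, pandas raises because the merge keys are not unique.
try:
    df.merge(df, validate='one_to_one', **merge_kwargs)
    merge_ok = True
except pd.errors.MergeError as exc:
    merge_ok = False
    print(f'Merge keys are not unique: {exc}')
```

If your data can contain such days, an extra key (for example a game-of-day column, if one is available) would be needed to pair the rows unambiguously.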

To match the other answers better, you can deduplicate by selecting only the rows where the team was at home and then creating new columns for Result_home and Result_away.

I haven't tested it, but I think this should be faster than approaches that merge the table onto itself.

I'm not sure what you want your final output to look like, but here's one option:

rename_cols = {
    'Team':'Home_Team',
    'Opp':'Away_Team',
    'Rslt':'Result_home',
}

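# Keep only the home rows; .copy() avoids a SettingWithCopyWarning below.
df = df[df['Home']].copy()
# The away result is the opposite of the home result.
df['Result_away'] = df['Rslt'].replace({'L':'W','W':'L'})
df = df.rename(columns=rename_cols).drop(columns=['Home'])
df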
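
If all six rows should be kept and only labelled, as the question describes, a merge is not strictly needed either. The following is a minimal sketch, not a tested benchmark: it builds an order-independent key by sorting Team and Opp within each row, then numbers the distinct keys within each date. The column names `Pair` and `Game` are illustrative, and the sketch assumes each pair of teams meets at most once per date (a doubleheader would share one ID).

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2017-04-02'] * 6,
    'Team': ['CHC', 'ARI', 'NYY', 'TBR', 'STL', 'SFG'],
    'Home': [True, False, True, False, False, True],
    'Opp': ['STL', 'SFG', 'TBR', 'NYY', 'CHC', 'ARI'],
    'Rslt': ['L', 'W', 'L', 'W', 'W', 'L'],
})
df['Date'] = pd.to_datetime(df['Date'])

# Sort the two team codes in each row so both rows of a game get the same key,
# e.g. ('NYY', 'TBR') and ('TBR', 'NYY') both become 'NYY-TBR'.
df['Pair'] = ['-'.join(sorted(pair)) for pair in zip(df['Team'], df['Opp'])]

# Within each date, number the distinct pairs in order of first appearance.
df['Game'] = df.groupby('Date')['Pair'].transform(lambda s: pd.factorize(s)[0] + 1)

print(df[['Date', 'Team', 'Opp', 'Rslt', 'Game']])
```

From there, df.groupby(['Date', 'Game']) or df.set_index(['Date', 'Game']) gives the grouping or MultiIndex mentioned in the question.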

