How to label rows of DataFrame where the set of column values match?
I have a DataFrame containing columns that overlap in a sense:
import pandas as pd
df = pd.DataFrame({
    'Date': ['2017-04-02', '2017-04-02', '2017-04-02', '2017-04-02', '2017-04-02', '2017-04-02'],
    'Team': ['CHC', 'ARI', 'NYY', 'TBR', 'STL', 'SFG'],
    'Home': [True, False, True, False, False, True],
    'Opp': ['STL', 'SFG', 'TBR', 'NYY', 'CHC', 'ARI'],
    'Rslt': ['L', 'W', 'L', 'W', 'W', 'L']
})
df['Date'] = pd.to_datetime(df['Date'])
print(df)
OUTPUT:
        Date Team   Home  Opp Rslt
0 2017-04-02  CHC   True  STL    L
1 2017-04-02  ARI  False  SFG    W
2 2017-04-02  NYY   True  TBR    L
3 2017-04-02  TBR  False  NYY    W
4 2017-04-02  STL  False  CHC    W
5 2017-04-02  SFG   True  ARI    L
For the date of 2017-04-02, there were 3 games played. The DataFrame contains the game results for each day for each team, which gives 6 rows. Take rows 2 and 3, which describe a game between NYY and TBR:
Row 2 gives the NYY result of L, meaning they lost
Row 3 gives the TBR result of W, meaning they won
What I'm trying to do is group all row pairs that relate to the same game. My initial idea was to create a new column that would act as a label for the pair and then use that to group on or to set a MultiIndex. I considered concatenating the three columns into a single string for each row and then, using sets, looking through all rows for each date in Date to find the other row that contains the same elements:
df['Match'] = df['Date'].dt.strftime('%Y-%m-%d') + ',' + df['Team'] + ',' + df['Opp']
print(df)
OUTPUT:
        Date Team   Home  Opp Rslt               Match
0 2017-04-02  CHC   True  STL    L  2017-04-02,CHC,STL
1 2017-04-02  ARI  False  SFG    W  2017-04-02,ARI,SFG
2 2017-04-02  NYY   True  TBR    L  2017-04-02,NYY,TBR
3 2017-04-02  TBR  False  NYY    W  2017-04-02,TBR,NYY
4 2017-04-02  STL  False  CHC    W  2017-04-02,STL,CHC
5 2017-04-02  SFG   True  ARI    L  2017-04-02,SFG,ARI
From here, I'm not sure how to proceed. I have a method in mind using sets that I've used in the past. If we focus on rows 2 and 3 again, splitting each string on the comma, subtracting the resulting sets, and taking bool() of the difference returns False for two sets containing the same string elements and True for anything else (different sets):
print(
    bool(set('2017-04-02,NYY,TBR'.split(',')) - set('2017-04-02,TBR,NYY'.split(',')))
)
print(
    bool(set('2017-04-02,NYY,TBR'.split(',')) - set('2017-04-02,CHC,STL'.split(',')))
)
OUTPUT:
False
True
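A side note on the check above, not part of the original question: set difference only looks in one direction, so the result is also empty when the first set is a strict subset of the second. Comparing the sets with == tests both directions. A minimal sketch, where the subset value is made up purely for illustration:

```python
a = set('2017-04-02,NYY,TBR'.split(','))
b = set('2017-04-02,TBR,NYY'.split(','))
c = set('2017-04-02,CHC,STL'.split(','))

# Hypothetical set that is missing one team, used only to show the pitfall.
subset = {'2017-04-02', 'NYY'}

same_game = a == b              # same elements, order does not matter
other_game = a == c             # different teams, so not equal
diff_is_empty = not (subset - a)  # difference is empty although the sets differ
subset_equal = subset == a      # equality catches the mismatch

print(same_game, other_game, diff_is_empty, subset_equal)
```

For the rows in this DataFrame every string has exactly three parts, so the one-directional check happens to give the right answer here.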
Is there a better way to take a row value in a column, look up all other row values in that same column, and label the rows that are related? The kind of label I would like is a unique numbering of games. Since these three games happen on the same day, labelling the pairs as 1, 2, 3 would be great, so that each game pair for each day has a unique ID.
PS: I've also seen this post, which looks somewhat like what I'm trying to do. I tried using .isin() but kept running into errors, so I scrapped that approach. I thought about pd.DataFrame.lookup, but I'm not sure that's the right approach either. I just need a way to group each pair of rows.
Merge the DataFrame with itself, swap the values where it's not a home game, keep the columns you want, and then drop the duplicates:
import numpy as np

# Self-merge: pair each team's row with its opponent's row for the same date
df2 = df.merge(df, left_on=['Date', 'Team'], right_on=['Date', 'Opp'], suffixes=('_Home', '_Away'))
# 'Home' goes last because df2.Home_Home is the swap condition
swap_cols = ['Team', 'Opp', 'Rslt', 'Home']
for col in swap_cols:
    df2[f'{col}_Home'], df2[f'{col}_Away'] = np.where(df2.Home_Home, [df2[f'{col}_Home'], df2[f'{col}_Away']], [df2[f'{col}_Away'], df2[f'{col}_Home']])
df2 = df2[['Date', 'Team_Home', 'Rslt_Home', 'Team_Away', 'Rslt_Away']].drop_duplicates()
print(df2)
Output:
        Date Team_Home Rslt_Home Team_Away Rslt_Away
0 2017-04-02       CHC         L       STL         W
1 2017-04-02       SFG         L       ARI         W
2 2017-04-02       NYY         L       TBR         W
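The question also asks for a per-day game number. With one row per game, as in the output above, one possible way to get it is groupby with cumcount. This is only a sketch: the games frame below is a stand-in rebuilt by hand from the printed output, and the Game column name is an assumption.

```python
import pandas as pd

# Stand-in for the deduplicated result printed above (values copied from it).
games = pd.DataFrame({
    'Date': pd.to_datetime(['2017-04-02'] * 3),
    'Team_Home': ['CHC', 'SFG', 'NYY'],
    'Rslt_Home': ['L', 'L', 'L'],
    'Team_Away': ['STL', 'ARI', 'TBR'],
    'Rslt_Away': ['W', 'W', 'W'],
})

# cumcount() numbers the rows within each Date group from 0, so add 1.
games['Game'] = games.groupby('Date').cumcount() + 1
print(games)
```

The numbering restarts at 1 for every date, so Date and Game together identify a game.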
IIUC, you can do it this way using merge and query.
df_match = df.merge(df,
                    left_on=['Date', 'Team'],
                    right_on=['Date', 'Opp'],
                    suffixes=('', '_opp'))
df_match.query('Home')
Output:
        Date Team  Home  Opp Rslt Team_opp  Home_opp Opp_opp Rslt_opp
0 2017-04-02  CHC  True  STL    L      STL     False     CHC        W
2 2017-04-02  NYY  True  TBR    L      TBR     False     NYY        W
5 2017-04-02  SFG  True  ARI    L      ARI     False     SFG        W
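The wide result above still carries redundant columns. A possible follow-up, sketched here under assumptions, is to rename and keep only the per-game columns. The output column names are illustrative, and validate='one_to_one' is an optional extra that makes the merge raise if a team appears more than once on a date (for example in a doubleheader).

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2017-04-02'] * 6),
    'Team': ['CHC', 'ARI', 'NYY', 'TBR', 'STL', 'SFG'],
    'Home': [True, False, True, False, False, True],
    'Opp': ['STL', 'SFG', 'TBR', 'NYY', 'CHC', 'ARI'],
    'Rslt': ['L', 'W', 'L', 'W', 'W', 'L'],
})

# Same self-merge as above, plus a uniqueness check on the merge keys.
df_match = df.merge(df,
                    left_on=['Date', 'Team'],
                    right_on=['Date', 'Opp'],
                    suffixes=('', '_opp'),
                    validate='one_to_one')

# Keep the home rows only, then rename and select one set of columns per game.
games = (
    df_match.query('Home')
    .rename(columns={'Team': 'Home_Team', 'Opp': 'Away_Team',
                     'Rslt': 'Result_home', 'Rslt_opp': 'Result_away'})
    [['Date', 'Home_Team', 'Result_home', 'Away_Team', 'Result_away']]
    .reset_index(drop=True)
)
print(games)
```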
To match the other answers better, you can deduplicate by selecting only the rows where the team was at home and then creating new columns for Result_home and Result_away.
I haven't tested it, but I think this should be faster than approaches that merge the table onto itself.
I'm not sure what you want your final output to look like, but here's one option:
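rename_cols = {
    'Team': 'Home_Team',
    'Opp': 'Away_Team',
    'Rslt': 'Result_home',
}
# .copy() avoids a SettingWithCopyWarning when the new column is added below
df = df[df['Home']].copy()
# The away result is the opposite of the home result (assumes only 'W'/'L', no ties)
df['Result_away'] = df['Rslt'].replace({'L': 'W', 'W': 'L'})
df = df.rename(columns=rename_cols).drop(columns=['Home'])
df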
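If the goal is to keep all six original rows and only attach a game label, as the question describes, a merge-free sketch is to build an order-independent key from Team and Opp and number the keys within each date. This is one possible approach rather than something taken from the answers above. The Match and Game column names are assumptions, and it presumes the same two teams meet at most once per date.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2017-04-02'] * 6),
    'Team': ['CHC', 'ARI', 'NYY', 'TBR', 'STL', 'SFG'],
    'Home': [True, False, True, False, False, True],
    'Opp': ['STL', 'SFG', 'TBR', 'NYY', 'CHC', 'ARI'],
    'Rslt': ['L', 'W', 'L', 'W', 'W', 'L'],
})

# Sort each (Team, Opp) pair alphabetically so both rows of a game get the same key.
pair = np.sort(df[['Team', 'Opp']].to_numpy(), axis=1)
df['Match'] = [f'{a}-{b}' for a, b in pair]

# Number the distinct keys within each date, in order of first appearance.
df['Game'] = df.groupby('Date')['Match'].transform(lambda s: pd.factorize(s)[0] + 1)
print(df)
```

Grouping on Date and Game (or setting them as a MultiIndex) then brings the two rows of each game together.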