Filter rows from a grouped data frame based on string columns
I have a data frame grouped by multiple columns, but in this example it is grouped only by Year.
   Year Animal1     Animal2
0  2002     Dog  Mouse,Lion
1  2002   Mouse
2  2002    Lion
3  2002    Duck
4  2010     Dog         Cat
5  2010     Cat
6  2010    Lion
7  2010   Mouse
For each group, among the rows where Animal2 is empty, I would like to filter out the rows whose Animal1 value does not appear in that group's Animal2 column.
The expected output would be:
   Year Animal1     Animal2
0  2002     Dog  Mouse,Lion
1  2002   Mouse
2  2002    Lion
3  2010     Dog         Cat
4  2010     Cat
Rows 0 & 3 stayed since Animal2 is not empty.
Rows 1 & 2 stayed since Mouse & Lion are in Animal2 for the first group.
Row 4 stayed since Cat appears in Animal2 for the second group.
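To make the question reproducible, the sample frame can be rebuilt as below. This is a minimal sketch that assumes the empty Animal2 cells are missing values (None), which is how they appear in the outputs further down.

```python
import pandas as pd

# Sample data from the question.
# Assumption: empty Animal2 cells are missing values (None), not empty strings.
df = pd.DataFrame({
    "Year": [2002, 2002, 2002, 2002, 2010, 2010, 2010, 2010],
    "Animal1": ["Dog", "Mouse", "Lion", "Duck", "Dog", "Cat", "Lion", "Mouse"],
    "Animal2": ["Mouse,Lion", None, None, None, "Cat", None, None, None],
})

print(df)
```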
You can use masks and regexes:
# rows with a non-empty Animal2
m1 = df['Animal2'].notna()
# build one regex per Year from the non-empty Animal2 values
patterns = df[m1].groupby('Year')['Animal2'].agg('|'.join).str.replace(',', '|')
# for each Year, flag the Animal1 values that fully match that Year's regex
m2 = (df.groupby('Year', group_keys=False)['Animal1']
        .apply(lambda g: g.str.fullmatch(patterns[g.name]))
      )
out = df.loc[m1|m2]
Or sets:
m1 = df['Animal2'].notna()
# one set of animal names per Year, built from the non-empty Animal2 values
sets = (df.loc[m1, 'Animal2'].str.split(',')
          .groupby(df['Year'])
          .agg(lambda x: set().union(*x))
        )
m2 = (df.groupby('Year', group_keys=False)['Animal1']
        .apply(lambda g: g.isin(sets[g.name]))
      )
out = df.loc[m1|m2]
Output:
Year Animal1 Animal2
0 2002 Dog Mouse,Lion
1 2002 Mouse None
2 2002 Lion None
4 2010 Dog Cat
5 2010 Cat None
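One caveat with the regex variant: a name containing regex metacharacters (for example a dot or parentheses) would change the pattern, and a Year with no non-empty Animal2 has no pattern to look up. The sketch below is one possible way to guard against both, assuming the same sample data. The helper name year_pattern is made up for illustration and is not part of pandas.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "Year": [2002, 2002, 2002, 2002, 2010, 2010, 2010, 2010],
    "Animal1": ["Dog", "Mouse", "Lion", "Duck", "Dog", "Cat", "Lion", "Mouse"],
    "Animal2": ["Mouse,Lion", None, None, None, "Cat", None, None, None],
})

m1 = df["Animal2"].notna()

def year_pattern(values):
    """Hypothetical helper: turn 'A,B' entries into an escaped alternation."""
    names = {name for entry in values for name in entry.split(",")}
    return "|".join(re.escape(name) for name in sorted(names))

# One escaped regex per Year that has at least one non-empty Animal2.
patterns = df.loc[m1].groupby("Year")["Animal2"].agg(year_pattern)

# Start from all False so Years without a pattern simply match nothing.
m2 = pd.Series(False, index=df.index)
for year, pattern in patterns.items():
    in_year = df["Year"].eq(year)
    m2.loc[in_year] = df.loc[in_year, "Animal1"].str.fullmatch(pattern)

out = df.loc[m1 | m2]
print(out)
```

On the sample data this keeps the same rows as the output above (index 0, 1, 2, 4, 5).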
Here is a solution using a list comprehension:
(df.loc[
[a1 in a2 for a1,a2 in zip(df['Animal1'],df['Year'].map(df['Animal2'].str.split(',').groupby(df['Year']).sum()))] |
df['Animal2'].notna()]
)
or
d = df['Animal2'].str.split(',').groupby(df['Year']).sum()
(df.loc[df.groupby('Year')['Animal1'].transform(lambda x: x.isin(d.loc[x.name])) |
df['Animal2'].notna()]
)
Output:
Year Animal1 Animal2
0 2002 Dog Mouse,Lion
1 2002 Mouse None
2 2002 Lion None
4 2010 Dog Cat
5 2010 Cat None
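Note that the outputs above keep the original index (0, 1, 2, 4, 5), while the expected output in the question is numbered 0 to 4. The sketch below shows one way to get that numbering with reset_index(drop=True). It is not the code from the answers: it swaps the per-group lookup for a plain Python set of (Year, animal) pairs built with explode, and it assumes the same sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2002, 2002, 2002, 2002, 2010, 2010, 2010, 2010],
    "Animal1": ["Dog", "Mouse", "Lion", "Duck", "Dog", "Cat", "Lion", "Mouse"],
    "Animal2": ["Mouse,Lion", None, None, None, "Cat", None, None, None],
})

m1 = df["Animal2"].notna()

# One row per (Year, animal) listed in a non-empty Animal2 cell.
pairs = (df.loc[m1, ["Year", "Animal2"]]
           .assign(Animal2=lambda d: d["Animal2"].str.split(","))
           .explode("Animal2"))
allowed = set(zip(pairs["Year"], pairs["Animal2"]))

# True where Animal1 is listed in Animal2 for the same Year.
m2 = pd.Series(
    [(year, animal) in allowed for year, animal in zip(df["Year"], df["Animal1"])],
    index=df.index,
)

# Renumber the kept rows 0..n-1 to match the expected output.
out = df.loc[m1 | m2].reset_index(drop=True)
print(out)
```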