Filter rows from a grouped data frame based on string columns

I have a data frame grouped by multiple columns, but in this example it is grouped only by Year.

   Year Animal1     Animal2
0  2002     Dog  Mouse,Lion
1  2002   Mouse
2  2002    Lion
3  2002    Duck
4  2010     Dog         Cat
5  2010     Cat
6  2010    Lion
7  2010   Mouse
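To reproduce the example, the frame can be rebuilt as in this minimal sketch. It assumes the empty Animal2 cells are missing values (None), which is how they print in the answers' output; if they are empty strings instead, any notna() check on Animal2 would need to be adapted.

```python
import pandas as pd

# Sample data from the question.
# Assumption: blank Animal2 cells are missing values (None), not empty strings.
df = pd.DataFrame({
    "Year": [2002, 2002, 2002, 2002, 2010, 2010, 2010, 2010],
    "Animal1": ["Dog", "Mouse", "Lion", "Duck", "Dog", "Cat", "Lion", "Mouse"],
    "Animal2": ["Mouse,Lion", None, None, None, "Cat", None, None, None],
})

print(df)
```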

For each group, I would like to keep the rows where Animal2 is not empty and, among the rows where Animal2 is empty, filter out those whose Animal1 does not appear in the group's Animal2 values.

The expected output would be:

   Year Animal1     Animal2
0  2002     Dog  Mouse,Lion
1  2002   Mouse
2  2002    Lion
3  2010     Dog         Cat
4  2010     Cat

Rows 0 & 3 stayed since Animal2 is not empty.

Rows 1 & 2 stayed since Mouse & Lion are in Animal2 for the first group.

Row 4 stayed since Cat appears in Animal2 for the second group.
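The rule can be spelled out in plain Python as a rough reference, independent of pandas. The tuple layout and the names `rows`, `allowed`, and `kept` are only illustrative and not part of the question.

```python
from collections import defaultdict

# (Year, Animal1, Animal2) tuples; None marks an empty Animal2 cell (assumption).
rows = [
    (2002, "Dog", "Mouse,Lion"),
    (2002, "Mouse", None),
    (2002, "Lion", None),
    (2002, "Duck", None),
    (2010, "Dog", "Cat"),
    (2010, "Cat", None),
    (2010, "Lion", None),
    (2010, "Mouse", None),
]

# Per year, collect every animal named in a non-empty Animal2 cell.
allowed = defaultdict(set)
for year, _, animal2 in rows:
    if animal2:
        allowed[year].update(animal2.split(","))

# Keep a row if its Animal2 is non-empty, or if its Animal1 is listed for its year.
kept = [(year, a1, a2) for year, a1, a2 in rows if a2 or a1 in allowed[year]]

for row in kept:
    print(row)
```

This keeps Dog, Mouse, and Lion for 2002 and Dog and Cat for 2010, matching the expected output above.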

You can use masks and regexes:

# non empty Animal2
m1 = df['Animal2'].notna()

# build one regex alternation per Year from the non-empty Animal2 values
patterns = df[m1].groupby('Year')['Animal2'].agg('|'.join).str.replace(',', '|')

# for each Year select with the matching regex
m2 = (df.groupby('Year', group_keys=False)['Animal1']
        .apply(lambda g: g.str.fullmatch(patterns[g.name]))
     )

out = df.loc[m1|m2]
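Two edge cases may be worth guarding against in the regex approach: animal names containing regex metacharacters, and a Year with no non-empty Animal2 at all (where `patterns[g.name]` would raise a KeyError). The sketch below is one possible hardening, not part of the original answer. The helpers `year_pattern` and `match_group` are hypothetical names, and the extra 2015 row is added purely for illustration.

```python
import re
import pandas as pd

# Sample data plus one illustrative 2015 row that has no Animal2 in its group.
df = pd.DataFrame({
    "Year": [2002, 2002, 2002, 2002, 2010, 2010, 2010, 2010, 2015],
    "Animal1": ["Dog", "Mouse", "Lion", "Duck", "Dog", "Cat", "Lion", "Mouse", "Owl"],
    "Animal2": ["Mouse,Lion", None, None, None, "Cat", None, None, None, None],
})

m1 = df["Animal2"].notna()

def year_pattern(cells):
    # Hypothetical helper: escape each name so it is matched literally.
    names = {name for cell in cells for name in cell.split(",")}
    return "|".join(re.escape(name) for name in sorted(names))

patterns = df.loc[m1, "Animal2"].groupby(df["Year"]).agg(year_pattern)

def match_group(g):
    # Hypothetical helper: years without any Animal2 value match nothing.
    pattern = patterns.get(g.name)
    if pattern is None:
        return pd.Series(False, index=g.index)
    return g.str.fullmatch(pattern)

m2 = df.groupby("Year", group_keys=False)["Animal1"].apply(match_group)

out = df.loc[m1 | m2]
print(out)
```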

Or sets:

m1 = df['Animal2'].notna()

sets = (df.loc[m1, 'Animal2'].str.split(',')
          .groupby(df['Year'])
          .agg(lambda x: set().union(*x))
       )

m2 = (df.groupby('Year', group_keys=False)['Animal1']
        .apply(lambda g: g.isin(sets[g.name]))
     )

out = df.loc[m1|m2]
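As a variation on the set idea, the lookup can also be keyed on (Year, animal) pairs, which avoids the per-group `apply` and tolerates years that have no Animal2 values. This is only a sketch under the same assumption that empty cells are None; `exploded` and `allowed` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2002, 2002, 2002, 2002, 2010, 2010, 2010, 2010],
    "Animal1": ["Dog", "Mouse", "Lion", "Duck", "Dog", "Cat", "Lion", "Mouse"],
    "Animal2": ["Mouse,Lion", None, None, None, "Cat", None, None, None],
})

m1 = df["Animal2"].notna()

# One row per animal named in Animal2; the original row index is repeated.
exploded = df.loc[m1, "Animal2"].str.split(",").explode()

# Set of (Year, animal) pairs that are allowed to stay.
allowed = set(zip(df.loc[exploded.index, "Year"], exploded))

# Membership test per row, wrapped in a Series so it aligns with df.
m2 = pd.Series(
    [(year, animal) in allowed for year, animal in zip(df["Year"], df["Animal1"])],
    index=df.index,
)

out = df.loc[m1 | m2]
print(out)
```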

Output:

   Year Animal1     Animal2
0  2002     Dog  Mouse,Lion
1  2002   Mouse        None
2  2002    Lion        None
4  2010     Dog         Cat
5  2010     Cat        None

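Here is a solution using a list comprehension:

import pandas as pd
# animals listed in Animal2, concatenated into one list per Year and mapped back to each row
listed = df['Year'].map(df['Animal2'].str.split(',').groupby(df['Year']).sum())

# wrap the list in a Series: logical ops between a plain list and a Series are deprecated in recent pandas
(df.loc[
    pd.Series([a1 in a2 for a1, a2 in zip(df['Animal1'], listed)], index=df.index) |
    df['Animal2'].notna()]
    )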

or

d = df['Animal2'].str.split(',').groupby(df['Year']).sum()

(df.loc[df.groupby('Year')['Animal1'].transform(lambda x: x.isin(d.loc[x.name])) | 
df['Animal2'].notna()]
)

Output:

   Year Animal1     Animal2
0  2002     Dog  Mouse,Lion
1  2002   Mouse        None
2  2002    Lion        None
4  2010     Dog         Cat
5  2010     Cat        None
