Filtering rows in a Pandas DataFrame

Question

I am looking for a way to filter rows in a DataFrame. I have the following data:

data = [
    {'year':2015, 'v1':'str1', 'v2':'str2', 'v3':'str3', 'val': 6}, 
    {'year':2016, 'v1':'str1', 'v2':'str2', 'v3':'str3', 'val': 5}, 
    {'year':2017, 'v1':'str1', 'v2':'str2', 'v3':'str3', 'val': 3},
    {'year':2015, 'v1':'str11', 'v2':'str2', 'v3':'str3', 'val': 4},
    {'year':2016, 'v1':'str11', 'v2':'str2', 'v3':'str3', 'val': 9},
    {'year':2017, 'v1':'str12', 'v2':'str2', 'v3':'str3', 'val': 1},
    {'year':2016, 'v1':'str1', 'v2':'str21', 'v3':'str3', 'val': 9},
    {'year':2017, 'v1':'str1', 'v2':'str21', 'v3':'str3', 'val': 7},
    {'year':2018, 'v1':'str1', 'v2':'str21', 'v3':'str3', 'val': 8},
    {'year':2015, 'v1':'str1', 'v2':'str2', 'v3':'str31', 'val': 6}, 
    {'year':2016, 'v1':'str1', 'v2':'str2', 'v3':'str31', 'val': 5},
    {'year':2016, 'v1':'str1', 'v2':'str2', 'v3':'str31', 'val': 6}, 
    {'year':2017, 'v1':'str1', 'v2':'str2', 'v3':'str31', 'val': 3},
    {'year':2018, 'v1':'str1', 'v2':'str2', 'v3':'str31', 'val': 4}
]

The filtering rule: if there are not at least three consecutive years, starting with 2015, with rows that match in v1, v2 and v3, then those rows should be removed. Rows that match in v1, v2 and v3 for at least three consecutive years from 2015 on should be kept.

The expected output after filtering for the example above is:

import pandas as pd
df = pd.DataFrame(data)
# filtering step
print(df)

    year     v1     v2     v3  val
0   2015   str1   str2   str3    6
1   2016   str1   str2   str3    5
2   2017   str1   str2   str3    3
3   2015   str1   str2  str31    6
4   2016   str1   str2  str31    5
5   2016   str1   str2  str31    6
6   2017   str1   str2  str31    3
7   2018   str1   str2  str31    4

Any ideas?

Answer 1

You can chain two groupby + filter calls:

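import numpy as np

v = ['v1', 'v2', 'v3']

# first keep groups that contain 2015, then groups with >= 3 distinct years and no gaps
(df.groupby(v).filter(lambda s: 2015 in s['year'].values)
   .groupby(v).filter(lambda s: s.year.nunique() >= 3
                      and s.year.diff().isin([0, 1, np.nan]).all()))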

    year    v1    v2     v3  val
0   2015  str1  str2   str3    6
1   2016  str1  str2   str3    5
2   2017  str1  str2   str3    3
9   2015  str1  str2  str31    6
10  2016  str1  str2  str31    5
11  2016  str1  str2  str31    6
12  2017  str1  str2  str31    3
13  2018  str1  str2  str31    4
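
To see why the second filter in this answer rejects groups with missing years, it may help to look at the diff() check in isolation. The snippet below is only a minimal sketch: the two year Series are made up for illustration, and it assumes the rows of each group are already sorted by year, as they are in the question's data.

```python
import numpy as np
import pandas as pd

# Hypothetical year columns for two groups (illustration only).
no_gap = pd.Series([2015, 2016, 2016, 2017, 2018])
with_gap = pd.Series([2015, 2017, 2018])

# diff() gives NaN for the first row, 0 for a repeated year,
# 1 for the next year, and something larger when a year is skipped.
no_gap_ok = no_gap.diff().isin([0, 1, np.nan]).all()
with_gap_ok = with_gap.diff().isin([0, 1, np.nan]).all()

print(no_gap_ok)    # True
print(with_gap_ok)  # False
```

If the rows were not sorted by year within a group, the differences could be negative and the check would fail even for a complete run, so sorting with sort_values first would be a reasonable precaution.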

Answer 2

I feel like we can shorten the filter as below:

df.groupby(['v1','v2','v3']).filter(lambda x : pd.Series([2015,2016,2017]).isin(x['year']).all())
Out[142]: 
    year    v1    v2     v3  val
0   2015  str1  str2   str3    6
1   2016  str1  str2   str3    5
2   2017  str1  str2   str3    3
9   2015  str1  str2  str31    6
10  2016  str1  str2  str31    5
11  2016  str1  str2  str31    6
12  2017  str1  str2  str31    3
13  2018  str1  str2  str31    4
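
For anyone who wants to run this end to end, here is a self-contained sketch that rebuilds the question's data and applies the same isin-based filter. The helper name keep_groups_with_years and the constant REQUIRED_YEARS are hypothetical, introduced only for this example. The final reset_index(drop=True) is an addition that is not part of the answer; it renumbers the rows 0 to 7 as in the question's expected output.

```python
import pandas as pd

# Same rows as in the question, written as tuples to keep the example short.
rows = [
    (2015, 'str1', 'str2', 'str3', 6), (2016, 'str1', 'str2', 'str3', 5),
    (2017, 'str1', 'str2', 'str3', 3), (2015, 'str11', 'str2', 'str3', 4),
    (2016, 'str11', 'str2', 'str3', 9), (2017, 'str12', 'str2', 'str3', 1),
    (2016, 'str1', 'str21', 'str3', 9), (2017, 'str1', 'str21', 'str3', 7),
    (2018, 'str1', 'str21', 'str3', 8), (2015, 'str1', 'str2', 'str31', 6),
    (2016, 'str1', 'str2', 'str31', 5), (2016, 'str1', 'str2', 'str31', 6),
    (2017, 'str1', 'str2', 'str31', 3), (2018, 'str1', 'str2', 'str31', 4),
]
df = pd.DataFrame(rows, columns=['year', 'v1', 'v2', 'v3', 'val'])

# Assumption: "three consecutive years starting with 2015" means 2015-2017.
REQUIRED_YEARS = [2015, 2016, 2017]


def keep_groups_with_years(frame, keys, years):
    """Keep only the groups whose 'year' column contains every required year."""
    required = pd.Series(years)
    return frame.groupby(keys).filter(
        lambda g: required.isin(g['year']).all()
    )


result = keep_groups_with_years(df, ['v1', 'v2', 'v3'], REQUIRED_YEARS)
result = result.reset_index(drop=True)  # renumber rows 0..7
print(result)
```

Because this version only tests for the presence of 2015, 2016 and 2017, extra rows such as the 2018 row or the duplicated 2016 row stay in the kept groups.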

Answer 1
2 2019-11-28 17:22:55

Answer 2
2 Accepted 2019-11-28 17:55:31