Filter rows in a pandas DataFrame
I am looking for a way to filter rows in a DataFrame. I have the following data:
data = [
{'year':2015, 'v1':'str1', 'v2':'str2', 'v3':'str3', 'val': 6},
{'year':2016, 'v1':'str1', 'v2':'str2', 'v3':'str3', 'val': 5},
{'year':2017, 'v1':'str1', 'v2':'str2', 'v3':'str3', 'val': 3},
{'year':2015, 'v1':'str11', 'v2':'str2', 'v3':'str3', 'val': 4},
{'year':2016, 'v1':'str11', 'v2':'str2', 'v3':'str3', 'val': 9},
{'year':2017, 'v1':'str12', 'v2':'str2', 'v3':'str3', 'val': 1},
{'year':2016, 'v1':'str1', 'v2':'str21', 'v3':'str3', 'val': 9},
{'year':2017, 'v1':'str1', 'v2':'str21', 'v3':'str3', 'val': 7},
{'year':2018, 'v1':'str1', 'v2':'str21', 'v3':'str3', 'val': 8},
{'year':2015, 'v1':'str1', 'v2':'str2', 'v3':'str31', 'val': 6},
{'year':2016, 'v1':'str1', 'v2':'str2', 'v3':'str31', 'val': 5},
{'year':2016, 'v1':'str1', 'v2':'str2', 'v3':'str31', 'val': 6},
{'year':2017, 'v1':'str1', 'v2':'str2', 'v3':'str31', 'val': 3},
{'year':2018, 'v1':'str1', 'v2':'str2', 'v3':'str31', 'val': 4}
]
The filtering rule: rows that match in v1, v2 and v3 form a group, and a group is kept only if it covers at least three consecutive years starting with 2015. Groups that do not meet this condition should be removed.
The expected output after filtering for the example above is:
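To make the rule concrete, one reading of it for the years of a single (v1, v2, v3) group can be sketched in plain Python. The function name `qualifies` and its parameters `start` and `n_years` are hypothetical, used only for illustration:

```python
def qualifies(years, start=2015, n_years=3):
    """Return True if `years` contains start, start + 1, ..., start + n_years - 1."""
    present = set(years)
    return all(start + k in present for k in range(n_years))

print(qualifies([2015, 2016, 2017]))              # True
print(qualifies([2015, 2016]))                    # False: only two years
print(qualifies([2016, 2017, 2018]))              # False: does not start with 2015
print(qualifies([2015, 2016, 2016, 2017, 2018]))  # True: duplicates and extra years are fine
```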
import pandas as pd
df = pd.DataFrame(data)
# filtering step
print(df)
year v1 v2 v3 val
0 2015 str1 str2 str3 6
1 2016 str1 str2 str3 5
2 2017 str1 str2 str3 3
3 2015 str1 str2 str31 6
4 2016 str1 str2 str31 5
5 2016 str1 str2 str31 6
6 2017 str1 str2 str31 3
7 2018 str1 str2 str31 4
Any ideas?
You can chain two groupby + filter calls:
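import numpy as np

v = ['v1', 'v2', 'v3']
# 1st filter: groups that contain 2015; 2nd filter: at least 3 distinct years with no gaps
(df.groupby(v).filter(lambda s: 2015 in s['year'].values)
   .groupby(v).filter(lambda s: s.year.nunique() >= 3 and s.year.diff().isin([0, 1, np.nan]).all()))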
year v1 v2 v3 val
0 2015 str1 str2 str3 6
1 2016 str1 str2 str3 5
2 2017 str1 str2 str3 3
3 2015 str1 str2 str31 6
4 2016 str1 str2 str31 5
5 2016 str1 str2 str31 6
6 2017 str1 str2 str31 3
7 2018 str1 str2 str31 4
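Note that the diff() check above relies on the rows of each group already being ordered by year, as they are in the example data. If that cannot be assumed, the same conditions can be written as one predicate that sorts the distinct years first. This is only a sketch: the helper name `consecutive_from`, its parameters, and the small demo frame are made up for illustration.

```python
import numpy as np
import pandas as pd

def consecutive_from(group, start=2015, min_years=3):
    """True if the group's distinct years include `start`, number at least
    `min_years`, and have no gaps once sorted."""
    years = np.sort(group['year'].unique())
    return bool(start in years
                and len(years) >= min_years
                and (np.diff(years) == 1).all())

# Hypothetical data: group 'a' has 2015-2017 in shuffled order, group 'b' has only two years.
demo = pd.DataFrame({
    'year': [2017, 2015, 2016, 2015, 2016],
    'v1':   ['a', 'a', 'a', 'b', 'b'],
})

result = demo.groupby('v1').filter(consecutive_from)
print(result)  # only the rows of group 'a' remain
```

With the question's data, the call would be df.groupby(['v1', 'v2', 'v3']).filter(consecutive_from).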
I think we can shorten the filter as follows:
df.groupby(['v1','v2','v3']).filter(lambda x : pd.Series([2015,2016,2017]).isin(x['year']).all())
Out[142]:
year v1 v2 v3 val
0 2015 str1 str2 str3 6
1 2016 str1 str2 str3 5
2 2017 str1 str2 str3 3
9 2015 str1 str2 str31 6
10 2016 str1 str2 str31 5
11 2016 str1 str2 str31 6
12 2017 str1 str2 str31 3
13 2018 str1 str2 str31 4
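One difference between the two approaches is worth keeping in mind: the chained version rejects a group that has any gap in its years, while the isin version only requires 2015, 2016 and 2017 to be present. Both give the same result on the data in the question. A small sketch with made-up data (the group labels 'a' and 'b' are hypothetical) shows where they diverge:

```python
import numpy as np
import pandas as pd

# Hypothetical data: group 'a' has 2015-2017 plus 2019 (a gap), group 'b' has 2015-2017.
demo = pd.DataFrame({
    'year': [2015, 2016, 2017, 2019, 2015, 2016, 2017],
    'v1':   ['a', 'a', 'a', 'a', 'b', 'b', 'b'],
    'val':  [1, 2, 3, 4, 5, 6, 7],
})

# Chained rule: 2015 present, at least 3 distinct years, no gaps (rows sorted by year).
strict = (demo.groupby('v1').filter(lambda s: 2015 in s['year'].values)
              .groupby('v1').filter(lambda s: s['year'].nunique() >= 3
                                    and s['year'].diff().isin([0, 1, np.nan]).all()))

# isin rule: 2015, 2016 and 2017 are all present; later gaps are ignored.
loose = demo.groupby('v1').filter(
    lambda x: pd.Series([2015, 2016, 2017]).isin(x['year']).all())

print(strict['v1'].unique())  # only 'b'
print(loose['v1'].unique())   # 'a' and 'b'
```

Which behaviour is right depends on whether a later gap should disqualify a group; the question does not say.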