Create a Python DataFrame using a conditional groupby
I have the following data:
year code value
2003 A 12
2003 B 11
2003 C 12
2004 A 14
2004 B 15
2004 C 13
2004 E 16
2005 A 9
2005 B 18
2005 C 16
2005 F 8
2005 G 19
I want to retain only those codes that are present in every year.
From the above dataframe I need to extract all the rows whose code appears in every one of the years (2003, 2004, 2005). That means I should end up with a new df with 9 rows, for codes A, B and C. I tried using groupby and isin() but was unable to get exactly what I need.
Without groupby:
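To try the answers below, the sample data can be rebuilt as a DataFrame. This is only a minimal sketch: the name `df` and the default integer index (0 to 11) are assumptions, chosen because they match the row labels shown in the answer outputs.

```python
import pandas as pd

# Rebuild the sample data from the question; the default RangeIndex (0..11)
# is an assumption that matches the row labels in the answer outputs.
df = pd.DataFrame({
    "year": [2003] * 3 + [2004] * 4 + [2005] * 5,
    "code": list("ABC") + list("ABCE") + list("ABCFG"),
    "value": [12, 11, 12, 14, 15, 13, 16, 9, 18, 16, 8, 19],
})
print(df)
```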
df.set_index(['year','code']).unstack().dropna(axis=1).stack().reset_index()
Out[528]:
year code value
0 2003 A 12.0
1 2003 B 11.0
2 2003 C 12.0
3 2004 A 14.0
4 2004 B 15.0
5 2004 C 13.0
6 2005 A 9.0
7 2005 B 18.0
8 2005 C 16.0
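The one-liner above can be easier to follow when split into steps. The sketch below is not from the original answer; the intermediate names `wide` and `complete` are made up for illustration. It also shows why the values come back as floats: unstacking introduces NaN for the missing year/code pairs, which forces a float dtype.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "year": [2003] * 3 + [2004] * 4 + [2005] * 5,
    "code": list("ABC") + list("ABCE") + list("ABCFG"),
    "value": [12, 11, 12, 14, 15, 13, 16, 9, 18, 16, 8, 19],
})

# Step 1: pivot to one row per year and one column per code.
# Codes missing in a year become NaN, so the column dtype turns into float.
wide = df.set_index(["year", "code"])["value"].unstack()

# Step 2: drop every code column that has a NaN, i.e. is missing in some year.
complete = wide.dropna(axis=1)

# Step 3: go back to long format and restore the integer dtype.
result = complete.stack().astype(int).reset_index(name="value")
print(result)
```

The `astype(int)` cast is safe here only because the original values are integers and no NaN remains after `dropna`.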
I believe you need to filter with isin, but if you want to dynamically get all the codes that are present in every year, use reduce:
from functools import reduce

# collect the codes seen in each year
s = df.groupby('year')['code'].apply(list)
# intersect the per-year code lists to keep only codes present in every year
a = reduce(lambda x, y: set(x) & set(y), s)
print(list(a))
['C', 'A', 'B']
df = df[df['code'].isin(list(a))]
print (df)
year code value
0 2003 A 12
1 2003 B 11
2 2003 C 12
3 2004 A 14
4 2004 B 15
5 2004 C 13
7 2005 A 9
8 2005 B 18
9 2005 C 16
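The same intersection can also be written without `reduce`, using `set.intersection` on one set per year. This is a sketch of an equivalent formulation, not part of the original answer; `codes_per_year`, `common` and `filtered` are illustrative names.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "year": [2003] * 3 + [2004] * 4 + [2005] * 5,
    "code": list("ABC") + list("ABCE") + list("ABCFG"),
    "value": [12, 11, 12, 14, 15, 13, 16, 9, 18, 16, 8, 19],
})

# One set of codes per year, built by iterating over the groups
codes_per_year = [set(group) for _, group in df.groupby("year")["code"]]

# Codes present in every year
common = set.intersection(*codes_per_year)

# Keep only the rows whose code is in the common set
filtered = df[df["code"].isin(sorted(common))]
print(filtered)
```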
You could use either of the following (both assume `import pandas as pd` and `import numpy as np`):
Option 1
In [647]: codes = pd.crosstab(df.year, df.code).replace({0: np.nan}).dropna(axis=1).columns
In [648]: df.query('code in @codes')
Out[648]:
year code value
0 2003 A 12
1 2003 B 11
2 2003 C 12
3 2004 A 14
4 2004 B 15
5 2004 C 13
7 2005 A 9
8 2005 B 18
9 2005 C 16
Option 2
In [657]: codes = df.groupby(['year', 'code']).size().unstack().dropna(axis=1).columns
In [658]: df[df.code.isin(codes)]
Out[658]:
year code value
0 2003 A 12
1 2003 B 11
2 2003 C 12
3 2004 A 14
4 2004 B 15
5 2004 C 13
7 2005 A 9
8 2005 B 18
9 2005 C 16
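Since the question asks for a groupby-based solution, one more variant may be worth noting. It is not from the answers above, just a sketch: count the distinct years per code with `transform('nunique')` and compare that count with the total number of years. The names `n_years`, `years_per_code` and `result` are illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "year": [2003] * 3 + [2004] * 4 + [2005] * 5,
    "code": list("ABC") + list("ABCE") + list("ABCFG"),
    "value": [12, 11, 12, 14, 15, 13, 16, 9, 18, 16, 8, 19],
})

# Total number of distinct years in the data
n_years = df["year"].nunique()

# For every row, the number of distinct years in which its code appears
years_per_code = df.groupby("code")["year"].transform("nunique")

# Keep rows whose code appears in all years
result = df[years_per_code == n_years]
print(result)
```

This keeps the original index and integer dtype, because the frame is never reshaped.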