Pandas new column based on condition after groupby
I have a dataset that is grouped on two columns: code and group. Sample data can be generated as follows:
import pandas as pd
# Sample dataframe
df = pd.DataFrame({'code': [12] * 5 + [20] * 5,
'group': ['A', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'B'],
'options': ['x,y', 'x', 'x', 'y', 'y', 'z', 'z', 'x', 'y', 'z']})
print(df)
code group options
0 12 A x,y
1 12 A x
2 12 A x
3 12 B y
4 12 B y
5 20 A z
6 20 A z
7 20 B x
8 20 B y
9 20 B z
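The first thing I want to do is generate a new column containing all possible options for each group. I could not do it in a single step, but this is what I did: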
# First generate a new column joining all the options by group in temporary strings
df['group_options'] = df.groupby(['code','group'])['options'].transform(lambda x: ','.join(x))
# Transform these temporary strings into lists containing unique values
df['group_options'] = df['group_options'].map(lambda x: list(set(x.split(','))))
Result:
code group options group_options
0 12 A x,y [x, y]
1 12 A x [x, y]
2 12 A x [x, y]
3 12 B y [y]
4 12 B y [y]
5 20 A z [z]
6 20 A z [z]
7 20 B x [x, z, y]
8 20 B y [x, z, y]
9 20 B z [x, z, y]
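Now I want to generate two new columns for later use, group_a_options and group_b_options. For every row, these columns should contain the group_options of group A and of group B within that row's code: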
code group options group_options group_a_options group_b_options
0 12 A x,y [x, y] [x, y] [y]
1 12 A x [x, y] [x, y] [y]
2 12 A x [x, y] [x, y] [y]
3 12 B y [y] [x, y] [y]
4 12 B y [y] [x, y] [y]
5 20 A z [z] [z] [x, y, z]
6 20 A z [z] [z] [x, y, z]
7 20 B x [x, z, y] [z] [x, y, z]
8 20 B y [x, z, y] [z] [x, y, z]
9 20 B z [x, z, y] [z] [x, y, z]
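I have been trying to generate these new columns with groupby and transform, but without success. How can I add a condition on the group column to the groupby to get the desired output? Any help is appreciated.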
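First, create a Series of lists: for each (code, group) pair, join the values, split them on the comma, drop duplicates with set, and convert the result to a list: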
s = df.groupby(['code','group'])['options'].agg(lambda x: list(set(','.join(x).split(','))))
Then reshape with Series.unstack, so that each group becomes a column, and change the column names:
df1 = s.unstack().add_prefix('group_').add_suffix('_options').rename(columns=str.lower)
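To make the reshaping step easier to follow, here is a minimal, self-contained sketch that rebuilds the sample data and inspects the intermediate df1. It is not part of the original answer. It uses sorted(set(...)) instead of list(set(...)), an assumption made only so that the result is reproducible from run to run.

```python
import pandas as pd

df = pd.DataFrame({'code': [12] * 5 + [20] * 5,
                   'group': ['A', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'B'],
                   'options': ['x,y', 'x', 'x', 'y', 'y', 'z', 'z', 'x', 'y', 'z']})

# One list of unique options per (code, group) pair.
# sorted() is an assumption for this sketch: it makes the element order deterministic.
s = df.groupby(['code', 'group'])['options'].agg(
    lambda x: sorted(set(','.join(x).split(','))))

# unstack() moves the inner index level ('group') into the columns,
# leaving one row per code and one column per group.
df1 = (s.unstack()
        .add_prefix('group_')
        .add_suffix('_options')
        .rename(columns=str.lower))

print(df1.columns.tolist())
print(df1.index.tolist())
```

Because df1 is indexed by code alone, it can later be attached to every row that shares that code.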
Finally, use DataFrame.join twice: first on both columns (code and group), then on the column code alone:
df = df.join(s.rename('group_options'), on=['code','group']).join(df1, on='code')
print(df)
code group options group_options group_a_options group_b_options
0 12 A x,y [y, x] [y, x] [y]
1 12 A x [y, x] [y, x] [y]
2 12 A x [y, x] [y, x] [y]
3 12 B y [y] [y, x] [y]
4 12 B y [y] [y, x] [y]
5 20 A z [z] [z] [y, x, z]
6 20 A z [z] [z] [y, x, z]
7 20 B x [y, x, z] [z] [y, x, z]
8 20 B y [y, x, z] [z] [y, x, z]
9 20 B z [y, x, z] [z] [y, x, z]
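The two join calls above rely on the on parameter, which matches columns of the calling DataFrame against the index of the object being joined. The following sketch shows that behaviour in isolation. The names left, per_pair, per_code, pair_value and code_value are hypothetical and chosen only for illustration.

```python
import pandas as pd

left = pd.DataFrame({'code': [12, 12, 20], 'group': ['A', 'B', 'A']})

# A named Series indexed by (code, group): matched against two columns of left.
per_pair = pd.Series(
    ['p', 'q', 'r'],
    index=pd.MultiIndex.from_tuples([(12, 'A'), (12, 'B'), (20, 'A')],
                                    names=['code', 'group']),
    name='pair_value')

# A DataFrame indexed by code only: matched against the single column 'code'.
per_code = pd.DataFrame({'code_value': ['low', 'high']},
                        index=pd.Index([12, 20], name='code'))

out = left.join(per_pair, on=['code', 'group']).join(per_code, on='code')
print(out)
```

Each row of left picks up the value whose index entry equals its own key columns, so a value stored once per code is repeated on every row with that code.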
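If the order of the values matters, remove duplicates with the dict.fromkeys trick instead, which keeps the first occurrence of each value in its original position: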
s = (df.groupby(['code','group'])['options']
.agg(lambda x: list(dict.fromkeys(','.join(x).split(',')))))
df1 = s.unstack().add_prefix('group_').add_suffix('_options').rename(columns=str.lower)
df = df.join(s.rename('group_options'), on=['code','group']).join(df1, on='code')
print(df)
code group options group_options group_a_options group_b_options
0 12 A x,y [x, y] [x, y] [y]
1 12 A x [x, y] [x, y] [y]
2 12 A x [x, y] [x, y] [y]
3 12 B y [y] [x, y] [y]
4 12 B y [y] [x, y] [y]
5 20 A z [z] [z] [x, y, z]
6 20 A z [z] [z] [x, y, z]
7 20 B x [x, y, z] [z] [x, y, z]
8 20 B y [x, y, z] [z] [x, y, z]
9 20 B z [x, y, z] [z] [x, y, z]
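Since the question asks how to put a condition on the group column, a possible alternative is to filter on group first and then map the per-code result back onto every row. This is a sketch under assumptions rather than the answer's method: the helper unique_options and the variables a_opts and b_opts are hypothetical names, and the sketch assumes only the groups A and B need their own column.

```python
import pandas as pd

df = pd.DataFrame({'code': [12] * 5 + [20] * 5,
                   'group': ['A', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'B'],
                   'options': ['x,y', 'x', 'x', 'y', 'y', 'z', 'z', 'x', 'y', 'z']})

def unique_options(values):
    """Join comma-separated strings, split them, and drop duplicates in first-seen order."""
    return list(dict.fromkeys(','.join(values).split(',')))

# Condition on group first, then aggregate per code.
a_opts = df[df['group'] == 'A'].groupby('code')['options'].agg(unique_options)
b_opts = df[df['group'] == 'B'].groupby('code')['options'].agg(unique_options)

# Series.map looks up each code in the index of the aggregated Series.
df['group_a_options'] = df['code'].map(a_opts)
df['group_b_options'] = df['code'].map(b_opts)
print(df)
```

This version needs one line per group, so the unstack approach scales better when there are many groups, while the filter-and-map form may be easier to read when only one or two groups matter.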