groupby function returns undesired result for pandas dataframe
So I have this dataframe here:
>>> df
uniprot_id protein_group protein_family protein_subfamily
0 Q8TAS1 Other KIS NaN
1 P35916 TK VEGFR NaN
2 Q96SB4 CMGC SRPK NaN
3 Q6P3W7 Other SCY1 NaN
4 Q9UKI8 Other TLK NaN
.. ... ... ... ...
561 Q96S53 TKL LISK TESK
562 Q13163 STE STE7 NaN
563 P45985 STE STE7 NaN
564 Q5VT25 AGC DMPK GEK
565 O00141 AGC SGK NaN
There are some duplicate values in the uniprot_id column. I want to combine those rows so that identical values are merged and differing values are separated by a semicolon, because the rows for these duplicate uniprot_id values are similar but not identical.
After applying the code below I don't get the result I am looking for, and I'm wondering what I'm doing wrong:
df2 = df.groupby(['uniprot_id'])['protein_group','protein_family','protein_subfamily'].apply(lambda x: '; '.join(set(x))).reset_index()
>>> print(df2)
uniprot_id 0
0 A0A0B4J2F2 protein_subfamily; protein_family; protein_group
1 A4QPH2 protein_subfamily; protein_family; protein_group
2 B5MCJ9 protein_subfamily; protein_family; protein_group
3 O00141 protein_subfamily; protein_family; protein_group
4 O00238 protein_subfamily; protein_family; protein_group
.. ... ...
547 Q9Y616 protein_subfamily; protein_family; protein_group
548 Q9Y6E0 protein_subfamily; protein_family; protein_group
549 Q9Y6M4 protein_subfamily; protein_family; protein_group
550 Q9Y6R4 protein_subfamily; protein_family; protein_group
551 Q9Y6S9 protein_subfamily; protein_family; protein_group
I need the duplicate rows to be combined and to look like this:
uniprot_id protein_group protein_family protein_subfamily
133 Q9UK32 Other RSK; RSKb RSKp90; RSKb
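Your apply call passes each group to the lambda as a DataFrame, and iterating over a DataFrame yields its column names, which is why every row contains the three column names. Use GroupBy.agg instead, which calls the function once per column, and remove missing values with Series.dropna (note the double brackets used to select several columns):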
df2 = (df.groupby(['uniprot_id'])[['protein_group','protein_family','protein_subfamily']]
         .agg(lambda x: '; '.join(set(x.dropna())))
         .reset_index())
print(df2)
uniprot_id protein_group protein_family protein_subfamily
0 O00141 AGC SGK
1 P35916 TK VEGFR
2 P45985 STE STE7
3 Q13163 STE STE7
4 Q5VT25 AGC DMPK GEK
5 Q6P3W7 Other SCY1
6 Q8TAS1 Other KIS
7 Q96S53 TKL LISK TESK
8 Q96SB4 CMGC SRPK
9 Q9UKI8 Other TLK
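To make the difference between the two calls concrete, here is a minimal, self-contained sketch. The sample rows are hypothetical: the two Q9UK32 rows are only guessed from the desired output in the question, and the names cols and labels are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the Q9UK32 rows are inferred from the desired output.
df = pd.DataFrame({
    "uniprot_id": ["Q9UK32", "Q9UK32", "O00141"],
    "protein_group": ["Other", "Other", "AGC"],
    "protein_family": ["RSK", "RSKb", "SGK"],
    "protein_subfamily": ["RSKp90", "RSKb", np.nan],
})
cols = ["protein_group", "protein_family", "protein_subfamily"]

# Iterating over a DataFrame yields its column labels, not its values.
# This is what set(x) saw when x was a whole group DataFrame.
labels = list(df[cols])
print(labels)

# agg calls the lambda once per column per group, so x is a Series of values.
df2 = (df.groupby("uniprot_id")[cols]
         .agg(lambda x: "; ".join(set(x.dropna())))
         .reset_index())
print(df2)
```

Because a set has no defined order, the joined string for Q9UK32 may come out as either "RSK; RSKb" or "RSKb; RSK".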
If order is important, don't use sets, because their iteration order is not defined; use the dict.fromkeys trick instead:
df2 = (df.groupby(['uniprot_id'])[['protein_group','protein_family','protein_subfamily']]
         .agg(lambda x: '; '.join(dict.fromkeys(x.dropna())))
         .reset_index())
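As a small sketch of the order-preserving variant, the lambda can be pulled out into a named helper. The name join_unique and the sample rows are assumptions made for this example, not part of the original question.

```python
import pandas as pd

def join_unique(values: pd.Series, sep: str = "; ") -> str:
    """Join non-missing values, dropping repeats and keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return sep.join(dict.fromkeys(values.dropna()))

# Hypothetical rows with a repeated value and a missing value.
df = pd.DataFrame({
    "uniprot_id": ["Q9UK32", "Q9UK32", "Q9UK32", "Q9UK32"],
    "protein_family": ["RSK", "RSKb", "RSK", None],
})

out = (df.groupby("uniprot_id")[["protein_family"]]
         .agg(join_unique)
         .reset_index())
print(out.loc[0, "protein_family"])  # RSK; RSKb
```

Unlike the set version, the result here always follows the row order of the original dataframe.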