
groupby function returns undesired result for pandas dataframe

So I have this dataframe here:

>>> df
    uniprot_id protein_group protein_family protein_subfamily
0       Q8TAS1         Other            KIS               NaN
1       P35916            TK          VEGFR               NaN
2       Q96SB4          CMGC           SRPK               NaN
3       Q6P3W7         Other           SCY1               NaN
4       Q9UKI8         Other            TLK               NaN
..         ...           ...            ...               ...
561     Q96S53           TKL           LISK              TESK
562     Q13163           STE           STE7               NaN
563     P45985           STE           STE7               NaN
564     Q5VT25           AGC           DMPK               GEK
565     O00141           AGC            SGK               NaN

There are some duplicate values in the uniprot_id column. The rows for these duplicate uniprot_id values are similar but not identical, so I want to combine them into one row per uniprot_id: identical values should be merged into one, and different values should be kept and separated by a semicolon.

After applying the code below I don't get the result I am looking for, and I'm wondering what I'm doing wrong:

df2 = df.groupby(['uniprot_id'])['protein_group','protein_family','protein_subfamily'].apply(lambda x: '; '.join(set(x))).reset_index()
>>> print(df2)
     uniprot_id                                                 0
0    A0A0B4J2F2  protein_subfamily; protein_family; protein_group
1        A4QPH2  protein_subfamily; protein_family; protein_group
2        B5MCJ9  protein_subfamily; protein_family; protein_group
3        O00141  protein_subfamily; protein_family; protein_group
4        O00238  protein_subfamily; protein_family; protein_group
..          ...                                               ...
547      Q9Y616  protein_subfamily; protein_family; protein_group
548      Q9Y6E0  protein_subfamily; protein_family; protein_group
549      Q9Y6M4  protein_subfamily; protein_family; protein_group
550      Q9Y6R4  protein_subfamily; protein_family; protein_group
551      Q9Y6S9  protein_subfamily; protein_family; protein_group

I need the duplicate rows to be combined so that they look like this:

    uniprot_id protein_group protein_family protein_subfamily
133     Q9UK32         Other      RSK; RSKb      RSKp90; RSKb

Use GroupBy.agg, which applies the function to each column separately, and remove missing values with Series.dropna before joining:

df2 = (df.groupby(['uniprot_id'])[['protein_group','protein_family','protein_subfamily']]
         .agg(lambda x: '; '.join(set(x.dropna())))
         .reset_index())

print (df2)
  uniprot_id protein_group protein_family protein_subfamily
0     O00141           AGC            SGK                  
1     P35916            TK          VEGFR                  
2     P45985           STE           STE7                  
3     Q13163           STE           STE7                  
4     Q5VT25           AGC           DMPK               GEK
5     Q6P3W7         Other           SCY1                  
6     Q8TAS1         Other            KIS                  
7     Q96S53           TKL           LISK              TESK
8     Q96SB4          CMGC           SRPK                  
9     Q9UKI8         Other            TLK      
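
To see why the original attempt printed column names, here is a minimal sketch on a hypothetical three-row frame (the data and the names `cols`, `labels`, `via_apply` and `via_agg` are made up for illustration). It assumes the usual behaviour that `apply` hands each group to the function as a whole DataFrame, while `agg` with a lambda is called once per column. `sorted` is added here only so the joined strings come out in a predictable order; it is not part of the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Q9UK32 appears twice with different family/subfamily.
df = pd.DataFrame({
    "uniprot_id": ["Q9UK32", "Q9UK32", "O00141"],
    "protein_group": ["Other", "Other", "AGC"],
    "protein_family": ["RSK", "RSKb", "SGK"],
    "protein_subfamily": ["RSKp90", "RSKb", np.nan],
})
cols = ["protein_group", "protein_family", "protein_subfamily"]

# Iterating over a DataFrame yields its column labels, not its values.
labels = set(df[cols])

# apply receives each group as a DataFrame, so set(x) is the set of labels
# and every group collapses to the same string of column names.
via_apply = df.groupby("uniprot_id")[cols].apply(
    lambda x: "; ".join(sorted(set(x)))
)

# agg receives one column of one group at a time (a Series), so set(x.dropna())
# holds the distinct non-missing values of that column.
via_agg = (
    df.groupby("uniprot_id")[cols]
      .agg(lambda x: "; ".join(sorted(set(x.dropna()))))
      .reset_index()
)

print(via_apply)
print(via_agg)
```

In `via_agg`, the Q9UK32 row keeps a single `Other` for protein_group and joins the two distinct families, while the all-NaN subfamily of O00141 becomes an empty string.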

If order matters, don't use a set, because the order of a set's elements is not defined; use the dict.fromkeys trick instead:

df2 = (df.groupby(['uniprot_id'])[['protein_group','protein_family','protein_subfamily']]
         .agg(lambda x: '; '.join(dict.fromkeys(x.dropna()).keys()))
         .reset_index())            
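
The dict.fromkeys variant can be checked on a small example as well. This is only a sketch: the rows and the helper name `join_unique` are hypothetical. It relies on dicts keeping insertion order (guaranteed since Python 3.7), so duplicates are dropped while the first-seen order of the values is preserved.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for one duplicated ID, with a repeated value and a NaN.
df = pd.DataFrame({
    "uniprot_id": ["Q9UK32", "Q9UK32", "Q9UK32"],
    "protein_family": ["RSK", "RSKb", "RSK"],
    "protein_subfamily": ["RSKp90", np.nan, "RSKb"],
})

def join_unique(s):
    # dict.fromkeys keeps the first occurrence of each value, in order;
    # joining a dict iterates over its keys.
    return "; ".join(dict.fromkeys(s.dropna()))

out = (
    df.groupby("uniprot_id")[["protein_family", "protein_subfamily"]]
      .agg(join_unique)
      .reset_index()
)

print(out)
```

Here protein_family comes out as `RSK; RSKb` and protein_subfamily as `RSKp90; RSKb`, in the order the values first appear in the rows.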
