
How to group by certain columns in a DataFrame in pandas?

I have the following DataFrame with different genes, drugs, IDs and citations. I essentially need rows with the same gene and the same drug to be merged, keeping both citations for that drug when more than one occurs. For example, given the pharmacogenomic table below:

      Gene                          Drug                     ID     Cite
1  MAD1L1                       Lithium[17]           34718328     [17]
2    OAS1                       Lithium[17]           34718328     [17]
3    OAS1                       Lithium[7]            27401222      [7]

MAD1L1 has lithium with citation 17, but OAS1 has lithium with citations 17 and 7. I would like to combine the table into something similar to the one below:

      Gene                          Drug                     ID     Cite
1  MAD1L1                       Lithium[17]           34718328     [17]
2    OAS1                       Lithium[17][7]        34718328     [17]

OAS1 has lithium, but both citations are next to each other, and MAD1L1 is unchanged because it has only the one lithium citation.
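To reproduce the example, the input table can be built roughly as follows. This is a minimal sketch: the dtypes (integer ID, plain-string Cite) and the 1-based index are assumptions, since the post only shows the printed table.

```python
import pandas as pd

# Input table from the question; dtypes are assumed (ID as int, Cite as str).
df = pd.DataFrame(
    {
        "Gene": ["MAD1L1", "OAS1", "OAS1"],
        "Drug": ["Lithium[17]", "Lithium[17]", "Lithium[7]"],
        "ID": [34718328, 34718328, 27401222],
        "Cite": ["[17]", "[17]", "[7]"],
    },
    index=[1, 2, 3],  # matches the row labels shown in the question
)

print(df)
```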

Here is one way to do it:

# concatenate the citations of each gene into a temporary column
df['cite2'] = df.groupby('Gene')['Cite'].transform('sum')

# group by gene and take the first row for each gene
df2 = df.groupby('Gene').first()

# strip the citation from the drug name and append cite2 (created above)
df2['Drug'] = df2['Drug'].str.split('[', expand=True)[0] + df2['cite2']

# drop the temporary cite2 column
df2.drop(columns='cite2', inplace=True)
df2.reset_index()

     Gene            Drug        ID  Cite
0  MAD1L1     Lithium[17]  34718328  [17]
1    OAS1  Lithium[17][7]  34718328  [17]
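One caveat with this approach: it groups on 'Gene' alone, so a gene that appears with more than one drug would have all of its citations merged together. Below is a sketch of a variant that groups on the gene and the bare drug name instead. The helper column "name" and the extra OAS1/DrugX row are hypothetical, added only to illustrate the point, and "".join is swapped in for 'sum' to concatenate the strings.

```python
import pandas as pd

# Data from the question plus ONE HYPOTHETICAL row (OAS1 / DrugX, ID 0),
# included only to show a gene that has two different drugs.
df = pd.DataFrame({
    "Gene": ["MAD1L1", "OAS1", "OAS1", "OAS1"],
    "Drug": ["Lithium[17]", "Lithium[17]", "Lithium[7]", "DrugX[3]"],
    "ID": [34718328, 34718328, 27401222, 0],
    "Cite": ["[17]", "[17]", "[7]", "[3]"],
})

# Drug name without the citation markers (hypothetical helper column).
df["name"] = df["Drug"].str.split("[").str[0]

# Concatenate citations per (Gene, drug name) pair and broadcast to each row.
df["cite2"] = df.groupby(["Gene", "name"])["Cite"].transform("".join)

# Keep the first row of each pair, then rebuild the Drug label.
df2 = df.groupby(["Gene", "name"], as_index=False).first()
df2["Drug"] = df2["name"] + df2["cite2"]
df2 = df2.drop(columns=["name", "cite2"])

print(df2)
```

With this grouping, the DrugX citation stays on its own row rather than being appended to the lithium entry.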

Remove the citation from "Drug", then use groupby.agg to aggregate each column, either taking the 'first' value or joining the strings. Then add back the citations:

out = (df
 .assign(Drug=df['Drug'].str.extract(r'(^[^\[\]]+)', expand=False))
 .groupby(['Gene', 'Drug'], as_index=False)
 .agg({'ID': 'first', 'Cite': ''.join})
 .assign(Drug=lambda d: d['Drug']+d['Cite'])
)

Output:

     Gene            Drug        ID     Cite
0  MAD1L1     Lithium[17]  34718328     [17]
1    OAS1  Lithium[17][7]  34718328  [17][7]
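In this output the Cite column also holds the joined citations, whereas the desired table in the question keeps only the first one there. If that matters, a possible variant is to use named aggregation and keep the joined citations in a temporary column. This is a sketch under that assumption; the column name all_cites is hypothetical.

```python
import pandas as pd

# Input table from the question (dtypes assumed).
df = pd.DataFrame({
    "Gene": ["MAD1L1", "OAS1", "OAS1"],
    "Drug": ["Lithium[17]", "Lithium[17]", "Lithium[7]"],
    "ID": [34718328, 34718328, 27401222],
    "Cite": ["[17]", "[17]", "[7]"],
})

out = (
    df
    # keep only the drug name, i.e. everything before the first bracket
    .assign(Drug=df["Drug"].str.extract(r"(^[^\[\]]+)", expand=False))
    .groupby(["Gene", "Drug"], as_index=False)
    .agg(
        ID=("ID", "first"),
        Cite=("Cite", "first"),         # first citation only, as in the question
        all_cites=("Cite", "".join),    # hypothetical temporary column
    )
    # append every citation to the drug name, then drop the helper column
    .assign(Drug=lambda d: d["Drug"] + d["all_cites"])
    .drop(columns="all_cites")
)

print(out)
```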
