How do I group certain columns of a DataFrame in pandas?

Question

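I have the following dataframe with different genes, drugs, IDs, and citations. I basically need rows with the same gene and the same drug merged, and when a drug appears more than once for a gene, both of its citations should be included. For example, from the following pharmacogenomics table: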

      Gene                          Drug                     ID     Cite
1  MAD1L1                       Lithium[17]           34718328     [17]
2    OAS1                       Lithium[17]           34718328     [17]
3    OAS1                       Lithium[7]            27401222      [7]

MAD1L1 has Lithium with citation 17, but OAS1 has Lithium with citations 17 and 7. I want to combine the table into something like the following:

      Gene                          Drug                     ID     Cite
1  MAD1L1                       Lithium[17]           34718328     [17]
2    OAS1                       Lithium[17][7]        34718328     [17]

OAS1 has Lithium with both citations next to each other, while MAD1L1 is unchanged because it does not share OAS1's second Lithium citation.

Answer 1

Here is one way to do it:

# concatenate each gene's citation strings ('sum' on strings joins them)
df['cite2'] = df.groupby('Gene')['Cite'].transform('sum')

# group by gene and keep the first row for each gene
df2 = df.groupby('Gene').first()

# strip the citation from the drug name and append cite2 (created above)
df2['Drug'] = df2['Drug'].str.split('[', expand=True)[0] + df2['cite2']

# drop the temporary cite2 column and restore Gene as a regular column
df2 = df2.drop(columns='cite2').reset_index()
df2

    Gene    Drug    ID  Cite
0   MAD1L1  Lithium[17]     34718328    [17]
1   OAS1    Lithium[17][7]  34718328    [17]
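
To try this end to end, here is a minimal self-contained sketch that rebuilds the sample frame from the question and applies the same idea. The values are copied from the question; the helper name `merge_citations` is hypothetical. Note that it groups on Gene only, which is enough for this sample because each gene has a single drug; with several drugs per gene you would likely want to group on the drug name as well.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Gene": ["MAD1L1", "OAS1", "OAS1"],
    "Drug": ["Lithium[17]", "Lithium[17]", "Lithium[7]"],
    "ID": [34718328, 34718328, 27401222],
    "Cite": ["[17]", "[17]", "[7]"],
})


def merge_citations(frame):
    """Hypothetical helper: one row per gene, all citations appended to Drug."""
    # Concatenate the citation strings of each gene (result is indexed by Gene)
    cites = frame.groupby("Gene")["Cite"].agg("".join)
    # Keep the first row of each gene (also indexed by Gene)
    out = frame.groupby("Gene").first()
    # Strip the existing citation from the drug name, then append all citations
    out["Drug"] = out["Drug"].str.split("[").str[0] + cites
    return out.reset_index()


result = merge_citations(df)
print(result)
```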

Answer 2

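Remove the citation from "Drug", then aggregate with groupby.agg, taking 'first' for ID and joining the Cite strings. Then append the citations back to the drug name: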

out = (df
 .assign(Drug=df['Drug'].str.extract(r'(^[^\[\]]+)', expand=False))
 .groupby(['Gene', 'Drug'], as_index=False)
 .agg({'ID': 'first', 'Cite': ''.join})
 .assign(Drug=lambda d: d['Drug']+d['Cite'])
)

Output:

     Gene            Drug        ID     Cite
0  MAD1L1     Lithium[17]  34718328     [17]
1    OAS1  Lithium[17][7]  34718328  [17][7]
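
In this output the Cite column holds the joined citations, whereas the table requested in the question keeps only the first citation there. If that matters, one possible variant (a sketch, not part of the original answer) uses named aggregation to keep both values. The temporary column name `AllCites` is an assumption made for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Gene": ["MAD1L1", "OAS1", "OAS1"],
    "Drug": ["Lithium[17]", "Lithium[17]", "Lithium[7]"],
    "ID": [34718328, 34718328, 27401222],
    "Cite": ["[17]", "[17]", "[7]"],
})

# Drug name with the trailing citation removed, e.g. "Lithium[17]" -> "Lithium"
base = df["Drug"].str.extract(r"^([^\[\]]+)", expand=False)

out = (
    df.assign(Drug=base)
      .groupby(["Gene", "Drug"], as_index=False)
      .agg(
          ID=("ID", "first"),
          Cite=("Cite", "first"),        # keep only the first citation here
          AllCites=("Cite", "".join),    # hypothetical temporary column
      )
      # Append every citation of the group to the drug name
      .assign(Drug=lambda d: d["Drug"] + d["AllCites"])
      .drop(columns="AllCites")
)
print(out)
```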
