I have the following dataframe containing genes, drugs, IDs and citations. I need rows with the same gene and the same drug to be merged into one row that includes every citation for that drug. For example:
Gene Drug ID Cite
1 MAD1L1 Lithium[17] 34718328 [17]
2 OAS1 Lithium[17] 34718328 [17]
3 OAS1 Lithium[7] 27401222 [7]
MAD1L1 has lithium with citation 17, but OAS1 has lithium with citations 17 and 7. I would like to collapse the table into something like the following:
Gene Drug ID Cite
1 MAD1L1 Lithium[17] 34718328 [17]
2 OAS1 Lithium[17][7] 34718328 [17]
OAS1 has lithium with both citations next to each other, and MAD1L1 is unchanged because it has no second citation for lithium.
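For anyone who wants to reproduce this, the sample table can be built roughly as follows. This is only a sketch: the column names are taken from the table above, and storing ID as integers is an assumption, since the post does not state the dtypes.

```python
import pandas as pd

# Sample data copied from the question.
# Assumption: ID is stored as integers and Cite as plain strings.
df = pd.DataFrame({
    "Gene": ["MAD1L1", "OAS1", "OAS1"],
    "Drug": ["Lithium[17]", "Lithium[17]", "Lithium[7]"],
    "ID": [34718328, 34718328, 27401222],
    "Cite": ["[17]", "[17]", "[7]"],
})
print(df)
```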
Here is one way to do it:
# concatenate all citations for each gene into a temporary column
df['cite2'] = df.groupby('Gene')['Cite'].transform('sum')
# group by gene and keep the first row for each gene
df2 = df.groupby('Gene').first()
# strip the citation from the drug name and append the combined citations
df2['Drug'] = df2['Drug'].str.split('[', expand=True)[0] + df2['cite2']
# drop the temporary cite2 column and restore Gene as a regular column
df2 = df2.drop(columns='cite2').reset_index()
df2
Gene Drug ID Cite
0 MAD1L1 Lithium[17] 34718328 [17]
1 OAS1 Lithium[17][7] 34718328 [17]
Remove the citation from "Drug", then use groupby.agg, either with 'first' or with a join of the strings. Then add back the citations:
out = (df
    .assign(Drug=df['Drug'].str.extract(r'(^[^\[\]]+)', expand=False))
    .groupby(['Gene', 'Drug'], as_index=False)
    .agg({'ID': 'first', 'Cite': ''.join})
    .assign(Drug=lambda d: d['Drug'] + d['Cite'])
)
Output:
Gene Drug ID Cite
0 MAD1L1 Lithium[17] 34718328 [17]
1 OAS1 Lithium[17][7] 34718328 [17][7]
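Note that this output has [17][7] in the Cite column, whereas the desired table in the question keeps [17] there. If the Cite column should keep only the first citation, one possible variant is sketched below. The function name merge_citations and the temporary AllCites column are made up for illustration, and the sample data assumes integer IDs.

```python
import pandas as pd

def merge_citations(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse rows sharing a gene and bare drug name, joining their citations."""
    # Bare drug name: everything before the first square bracket.
    base = df["Drug"].str.extract(r"^([^\[\]]+)", expand=False)
    out = (
        df.assign(Drug=base, AllCites=df["Cite"])  # AllCites is a hypothetical helper column
          .groupby(["Gene", "Drug"], as_index=False, sort=False)
          .agg(
              ID=("ID", "first"),            # keep the first ID of each group
              Cite=("Cite", "first"),        # keep the first citation, as in the question
              AllCites=("AllCites", "".join),  # concatenate every citation in the group
          )
    )
    # Append the joined citations to the bare drug name, then drop the helper.
    out["Drug"] = out["Drug"] + out["AllCites"]
    return out.drop(columns="AllCites")

# Sample data from the question (integer IDs are an assumption).
df = pd.DataFrame({
    "Gene": ["MAD1L1", "OAS1", "OAS1"],
    "Drug": ["Lithium[17]", "Lithium[17]", "Lithium[7]"],
    "ID": [34718328, 34718328, 27401222],
    "Cite": ["[17]", "[17]", "[7]"],
})
result = merge_citations(df)
print(result)
```

Grouping on both Gene and the bare drug name means a gene with several different drugs keeps one row per drug instead of mixing their citations together.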