How to merge (join) two rows in pandas with different values in each column?
Pandas merge rows with different operations for each column
import pandas as pd
df = pd.DataFrame({'case_id':['1', '1', '1','2','2','2'],
'Gene':['KRAS','SMAD4','TP53','TP000','SMAD000','TP000'],
'ch_a':[0,1,0,0,0,0], 'ch_b':[0,0,0,1,1,0], 'ch_c':[0,0,0,1,1,0]})
  case_id     Gene  ch_a  ch_b  ch_c
0       1     KRAS     0     0     0
1       1    SMAD4     1     0     0
2       1     TP53     0     0     0
3       2    TP000     0     1     1
4       2  SMAD000     0     1     1
5       2    TP000     0     0     0
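1) Sort by case_id and Gene
2) Apply a lambda that joins the unique, sorted Gene strings within each group
3) Apply max within each group to the binary columns (selected with a column mask)
4) Merge the two results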
binary_cols = df.columns[df.columns.str.contains('^ch_')]
df_case_gene = df.groupby('case_id')['Gene'].agg(lambda x: '->'.join(x.sort_values().unique())).reset_index()
df_case_binary_cols = df.groupby('case_id')[binary_cols].agg('max').reset_index()
df_final = df_case_gene.merge(df_case_binary_cols)
df_final:
  case_id               Gene  ch_a  ch_b  ch_c
0       1  KRAS->SMAD4->TP53     1     0     0
1       2     SMAD000->TP000     0     1     1
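As a possible alternative to building two grouped frames and merging them, the same per-column operations can be sketched in a single `groupby(...).agg(...)` call by passing a mapping from column name to aggregation. The helper name `join_unique_sorted` and the variables `agg_spec` and `df_single_pass` are made up for this illustration and are not part of the answer above.

```python
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame({
    'case_id': ['1', '1', '1', '2', '2', '2'],
    'Gene': ['KRAS', 'SMAD4', 'TP53', 'TP000', 'SMAD000', 'TP000'],
    'ch_a': [0, 1, 0, 0, 0, 0],
    'ch_b': [0, 0, 0, 1, 1, 0],
    'ch_c': [0, 0, 0, 1, 1, 0],
})


def join_unique_sorted(genes):
    """Hypothetical helper: join the unique values of a group, sorted, with '->'."""
    return '->'.join(sorted(genes.unique()))


# One aggregation per column: a string join for Gene, max for every ch_* flag.
binary_cols = [col for col in df.columns if col.startswith('ch_')]
agg_spec = {'Gene': join_unique_sorted}
agg_spec.update({col: 'max' for col in binary_cols})

df_single_pass = df.groupby('case_id').agg(agg_spec).reset_index()
print(df_single_pass)
```

With this sample data the result should match `df_final` above. Whether one call or two reads better is a matter of taste; the mapping mainly helps when the number of per-column rules grows.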
Based on my understanding of the input data, I prepared a sample DataFrame. Below it you can see the aggregation that builds the new DataFrame.
orig_df = pd.DataFrame({'case_id':[1,2,3,2,1],'Gene':['KRAS','SMAD4','TP53','SMAD4','BLAH'],'col_X':[1,0,0,1,0], 'col_X2':[0,0,0,0,1]})
   case_id   Gene  col_X  col_X2
0        1   KRAS      1       0
1        2  SMAD4      0       0
2        3   TP53      0       0
3        2  SMAD4      1       0
4        1   BLAH      0       1
new_df = pd.DataFrame()
# the first lambda collects the unique Gene values per case_id and sorts them; the second joins them with '->'
new_df['Strings'] = orig_df.groupby('case_id')['Gene'].apply(lambda x: sorted(x.unique())).transform(lambda x: '->'.join(x))
# max yields 1 for a group if any of its rows has a 1 in that column
cols_to_agg = [col for col in orig_df if col.startswith('col_')]
# pass 'max' as a string; recent pandas versions warn when given the builtin max
new_df[cols_to_agg] = orig_df.groupby('case_id')[cols_to_agg].agg('max')
            Strings  col_X  col_X2
case_id
1        BLAH->KRAS      1       1
2             SMAD4      1       0
3              TP53      0       0