Finding intersection of two Data Frames based on columns
Suppose I have the following two DataFrames:
df1:
Composite Beta_value Chromosome Start End Gene_Symbol
0 cg00000029 0.297449111 chr16 53434200 53434201 RBL2
1 cg00000108 0.660066803 chr3 37417715 37417716 C3orf35
2 cg00000109 0.660066803 chr3 172198247 172198248 FNDC3B
3 cg00000165 0.660066803 chr1 90729117 90729118 C3orf35
4 cg00000236 0.905679244 chr8 42405776 42405777 VDAC3
df2:
Composite Beta_value Chromosome Start End Gene_Symbol
2 cg00000109 0.660066803 chr3 172198247 172198248 FNDC3B
3 cg00000165 0.660066803 chr1 90729117 90729118 C3orf35
4 cg00000236 0.905679244 chr8 42405776 42405777 VDAC3
46 cg00002116 0.017114732 chr17 81703380 81703381 MRPL12
47 cg00002145 0.780230816 chr2 237340893 237340894 COL6A3
48 cg00002190 0.781140134 chr8 19697522 19697523 CSGALNACT1
49 cg00002224 0.220786047 chr8 143038982 143038983 C8orf31
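What I want is the intersection of these two DataFrames based on the "Start" and "Gene_Symbol" columns: keep only the rows of df1 whose "Start" and "Gene_Symbol" match a row in df2. For example, I would like my result to look like this: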
Composite Beta_value Chromosome Start End Gene_Symbol
2 cg00000109 0.660066803 chr3 172198247 172198248 FNDC3B
3 cg00000165 0.660066803 chr1 90729117 90729118 C3orf35
4 cg00000236 0.905679244 chr8 42405776 42405777 VDAC3
By intersection, I do not mean merging the DataFrames and ending up with 12 columns, as happens with the following approach:
intersection = pd.merge(df1, df2, how='inner', on=['Start','Gene_Symbol'])
intersection.dropna(inplace=True)
which combines the columns from both of my DataFrames, for example:
intersection.columns
Index(['Composite Element REF_x', 'Beta_value_x', 'Chromosome_x', 'Start',
'End_x', 'Gene_Symbol', 'Gene_Type_x', 'Transcript_ID_x',
'Position_to_TSS_x', 'CGI_Coordinate_x', 'Feature_Type_x',
'Composite Element REF_y', 'Beta_value_y', 'Chromosome_y', 'End_y',
'Gene_Type_y', 'Transcript_ID_y', 'Position_to_TSS_y',
'CGI_Coordinate_y', 'Feature_Type_y'],
dtype='object')
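The suffixed columns above can be reproduced on a cut-down example. This is a minimal sketch that assumes only four of the columns from the question; the frames `left` and `right` are illustrative names, not part of the original data. When both frames are merged in full, every non-key column they share appears twice, once with `_x` and once with `_y`.

```python
import pandas as pd

# Cut-down versions of df1 and df2 (illustrative subset of the question's columns).
left = pd.DataFrame({
    "Composite": ["cg00000029", "cg00000109"],
    "Beta_value": [0.297449111, 0.660066803],
    "Start": [53434200, 172198247],
    "Gene_Symbol": ["RBL2", "FNDC3B"],
})
right = pd.DataFrame({
    "Composite": ["cg00000109", "cg00002116"],
    "Beta_value": [0.660066803, 0.017114732],
    "Start": [172198247, 81703380],
    "Gene_Symbol": ["FNDC3B", "MRPL12"],
})

# Merging the full frames keeps the key columns once and duplicates the rest.
merged = pd.merge(left, right, how="inner", on=["Start", "Gene_Symbol"])
print(list(merged.columns))
```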
When using DataFrame.merge, make sure to select only the key columns from df2, so that the rest of df2's columns are not merged in:
keys = ['Start', 'Gene_Symbol']
intersection = df1.merge(df2[keys], on=keys)
Composite Beta_value Chromosome Start End Gene_Symbol
0 cg00000109 0.660067 chr3 172198247 172198248 FNDC3B
1 cg00000165 0.660067 chr1 90729117 90729118 C3orf35
2 cg00000236 0.905679 chr8 42405776 42405777 VDAC3
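One caveat with this approach: if a (Start, Gene_Symbol) pair occurs more than once in df2, the merge repeats the matching df1 row once per occurrence. The sketch below assumes a hypothetical df2 containing such a duplicate (the probe ID cg99999999 is made up for illustration) and shows that calling drop_duplicates on the key columns first is one way to avoid it.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Composite": ["cg00000029", "cg00000109", "cg00000165"],
    "Start": [53434200, 172198247, 90729117],
    "Gene_Symbol": ["RBL2", "FNDC3B", "C3orf35"],
})
# Hypothetical df2 in which the FNDC3B key pair appears twice.
df2 = pd.DataFrame({
    "Composite": ["cg00000109", "cg99999999", "cg00000165"],
    "Start": [172198247, 172198247, 90729117],
    "Gene_Symbol": ["FNDC3B", "FNDC3B", "C3orf35"],
})

keys = ["Start", "Gene_Symbol"]

# Without de-duplication the FNDC3B row of df1 is returned twice.
naive = df1.merge(df2[keys], on=keys)

# De-duplicating the keys first keeps each matching df1 row exactly once.
deduped = df1.merge(df2[keys].drop_duplicates(), on=keys)

print(len(naive), len(deduped))
```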
Use only the required columns from df2.
pd.merge(df1, df2[['Start','Gene_Symbol']], on=['Start','Gene_Symbol'])
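Note that the merge-based answers return a fresh 0, 1, 2 index, whereas the desired output in the question keeps df1's original labels 2, 3, 4. If that matters, one alternative (not taken from the answers above) is to build a boolean mask from the key pairs instead of merging. The following is a minimal sketch on a cut-down version of the question's data, assuming the key columns have the same dtypes in both frames.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "Composite": ["cg00000029", "cg00000108", "cg00000109", "cg00000165", "cg00000236"],
        "Start": [53434200, 37417715, 172198247, 90729117, 42405776],
        "Gene_Symbol": ["RBL2", "C3orf35", "FNDC3B", "C3orf35", "VDAC3"],
    },
    index=[0, 1, 2, 3, 4],
)
df2 = pd.DataFrame(
    {
        "Composite": ["cg00000109", "cg00000165", "cg00000236", "cg00002116"],
        "Start": [172198247, 90729117, 42405776, 81703380],
        "Gene_Symbol": ["FNDC3B", "C3orf35", "VDAC3", "MRPL12"],
    },
    index=[2, 3, 4, 46],
)

keys = ["Start", "Gene_Symbol"]

# Treat each (Start, Gene_Symbol) pair as one composite key and test membership.
mask = df1.set_index(keys).index.isin(df2.set_index(keys).index)

# Boolean indexing keeps df1's columns and original index labels, with no row duplication.
result = df1[mask]
print(result)
```

Because this only filters df1, duplicate key pairs in df2 cannot multiply rows, and no columns from df2 are ever introduced.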