
Finding intersection of two Data Frames based on columns

Consider that I have the following two data frames:

df1:
    Composite   Beta_value  Chromosome  Start       End     Gene_Symbol
0   cg00000029  0.297449111 chr16       53434200    53434201    RBL2
1   cg00000108  0.660066803 chr3        37417715    37417716    C3orf35
2   cg00000109  0.660066803 chr3        172198247   172198248   FNDC3B
3   cg00000165  0.660066803 chr1        90729117    90729118    C3orf35
4   cg00000236  0.905679244 chr8        42405776    42405777    VDAC3



df2:     
    Composite   Beta_value  Chromosome  Start       End     Gene_Symbol
2   cg00000109  0.660066803 chr3        172198247   172198248   FNDC3B
3   cg00000165  0.660066803 chr1        90729117    90729118    C3orf35
4   cg00000236  0.905679244 chr8        42405776    42405777    VDAC3
46  cg00002116  0.017114732 chr17       81703380    81703381    MRPL12
47  cg00002145  0.780230816 chr2        237340893   237340894   COL6A3
48  cg00002190  0.781140134 chr8        19697522    19697523    CSGALNACT1
49  cg00002224  0.220786047 chr8        143038982   143038983   C8orf31

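What I want is the intersection of these two data frames based on the "Start" and "Gene_Symbol" columns: keep only the rows of df1 whose "Start" and "Gene_Symbol" match a row in df2. For example, I would like my result to look like this: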

    Composite   Beta_value  Chromosome  Start       End     Gene_Symbol
2   cg00000109  0.660066803 chr3        172198247   172198248   FNDC3B
3   cg00000165  0.660066803 chr1        90729117    90729118    C3orf35
4   cg00000236  0.905679244 chr8        42405776    42405777    VDAC3

By intersection, I do not mean merging the data frames and ending up with 12 columns, as happens with the following approach:

intersection = pd.merge(df1, df2, how='inner', on=['Start','Gene_Symbol'])
intersection.dropna(inplace=True)

which combines the columns from both of my data frames, for example:

intersection.columns
Index(['Composite Element REF_x', 'Beta_value_x', 'Chromosome_x', 'Start',
       'End_x', 'Gene_Symbol', 'Gene_Type_x', 'Transcript_ID_x',
       'Position_to_TSS_x', 'CGI_Coordinate_x', 'Feature_Type_x',
       'Composite Element REF_y', 'Beta_value_y', 'Chromosome_y', 'End_y',
       'Gene_Type_y', 'Transcript_ID_y', 'Position_to_TSS_y',
       'CGI_Coordinate_y', 'Feature_Type_y'],
      dtype='object')

When using DataFrame.merge, make sure you select only the key columns from df2, so that the rest of df2's columns are not merged in:

keys = ['Start', 'Gene_Symbol']
intersection = df1.merge(df2[keys], on=keys)
    Composite  Beta_value Chromosome      Start        End Gene_Symbol
0  cg00000109    0.660067       chr3  172198247  172198248      FNDC3B
1  cg00000165    0.660067       chr1   90729117   90729118     C3orf35
2  cg00000236    0.905679       chr8   42405776   42405777       VDAC3
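
For a self-contained check, the sketch below rebuilds a reduced version of the two frames from the question (df2 is cut down to four rows for brevity) and applies the same key-only merge. The `drop_duplicates()` call is an addition that is not part of the answer above: assuming df2 could contain repeated (Start, Gene_Symbol) pairs, it keeps rows of df1 from being duplicated in the result.

```python
import pandas as pd

# Reduced copies of the frames shown in the question.
df1 = pd.DataFrame({
    "Composite": ["cg00000029", "cg00000108", "cg00000109", "cg00000165", "cg00000236"],
    "Beta_value": [0.297449111, 0.660066803, 0.660066803, 0.660066803, 0.905679244],
    "Chromosome": ["chr16", "chr3", "chr3", "chr1", "chr8"],
    "Start": [53434200, 37417715, 172198247, 90729117, 42405776],
    "End": [53434201, 37417716, 172198248, 90729118, 42405777],
    "Gene_Symbol": ["RBL2", "C3orf35", "FNDC3B", "C3orf35", "VDAC3"],
})
df2 = pd.DataFrame({
    "Composite": ["cg00000109", "cg00000165", "cg00000236", "cg00002116"],
    "Beta_value": [0.660066803, 0.660066803, 0.905679244, 0.017114732],
    "Chromosome": ["chr3", "chr1", "chr8", "chr17"],
    "Start": [172198247, 90729117, 42405776, 81703380],
    "End": [172198248, 90729118, 42405777, 81703381],
    "Gene_Symbol": ["FNDC3B", "C3orf35", "VDAC3", "MRPL12"],
}, index=[2, 3, 4, 46])

keys = ["Start", "Gene_Symbol"]

# Merge against the key columns only, so no *_x / *_y columns appear.
# drop_duplicates() is an extra safeguard (an assumption, see above).
intersection = df1.merge(df2[keys].drop_duplicates(), on=keys)
print(intersection)
```

Note that the merged frame gets a fresh 0-based index, as in the output shown above, and not the original df1 row labels 2, 3, 4.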

Use only the required columns from df2.

pd.merge(df1, df2[['Start','Gene_Symbol']], on=['Start','Gene_Symbol'])
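
If the original row labels of df1 (2, 3, 4 in the question's desired output) should be kept, one alternative that does not come from the answers here is to filter df1 with a boolean mask and skip the merge. The sketch below assumes a pandas version that provides `MultiIndex.from_frame`, and uses trimmed-down frames with only the columns needed to illustrate the idea.

```python
import pandas as pd

# Trimmed-down frames; only the columns relevant to the match are kept.
df1 = pd.DataFrame({
    "Composite": ["cg00000029", "cg00000108", "cg00000109", "cg00000165", "cg00000236"],
    "Start": [53434200, 37417715, 172198247, 90729117, 42405776],
    "Gene_Symbol": ["RBL2", "C3orf35", "FNDC3B", "C3orf35", "VDAC3"],
})
df2 = pd.DataFrame({
    "Composite": ["cg00000109", "cg00000165", "cg00000236", "cg00002116"],
    "Start": [172198247, 90729117, 42405776, 81703380],
    "Gene_Symbol": ["FNDC3B", "C3orf35", "VDAC3", "MRPL12"],
}, index=[2, 3, 4, 46])

keys = ["Start", "Gene_Symbol"]

# Treat each (Start, Gene_Symbol) pair as one composite key and test
# which of df1's pairs also occur in df2.
mask = pd.MultiIndex.from_frame(df1[keys]).isin(pd.MultiIndex.from_frame(df2[keys]))

# Boolean filtering keeps df1's own columns and its original index.
result = df1[mask]
print(result)
```

Because this only filters df1, duplicate key pairs in df2 cannot multiply rows, and no columns from df2 are ever added.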

