Merge two files based on partial matching

I have two files.

FileA.txt

ID
479432_Sros_4274
330214_NIDE2792
517722_CJLT1_010100003977
257310_BB0482
...

FileB.txt (the ** markers are only there to help you spot the matches)

members   category
6085.XP_002168109,**479432_Sros_4274**,4956.XP_002495993.1,457425.SSHG_03214,51511.ENSCSAVP000  P
7159.AAEL006372-PA,**257310_BB0482** J
**517722_CJLT1_010100003977**,701176.VIBRN418_17773,9785.ENSLAFP00000010769,28377.ENSACAP00000014901,4081.Solyc03g120250.2.1,3847.GLYMA18G02240.1 U
500485.XP_002561312.1,1042876.PPS_0730,222929.XP_003071446.1,**330214_NIDE2792**  S
...

Expected output

Output.txt

ID  category
479432_Sros_4274  P
330214_NIDE2792  S
517722_CJLT1_010100003977  U
257310_BB0482  J
...

I have tried some code in awk and R based on answers to other questions, but I could not get the desired output.

Here is one way:

$ awk '
NR==FNR {                  # process file1
    if(FNR==1)             # print header, no newline
        printf "%s", $1    # explicit format, so a "%" in the data is not interpreted
    a[$1]                  # hash data
    next
}
{                          # process file2
    if(FNR==1)             # print the other half of the header
        print OFS $2
    for(i in a)            # loop all items in hash
        if($1 ~ i)         # check for partial match
            print i,$2     # if found, output
}' file1 file2             # mind the order

Output (in file2 order; note the partial match on the last line of the output, left in as a warning):

ID category
479432_Sros_4274 P
257310_BB0482 J
517722_CJLT1_010100003977 U
330214_NIDE2792 S
ID S
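
The stray `ID S` line appears because the header `ID` from file1 is also stored in the hash and, used as a regex, matches inside `330214_NIDE2792`. Below is a minimal sketch of one way to avoid that, assuming the members column is comma-separated and each ID must equal a whole member. It reads FileB.txt first, maps every member to its category, then walks FileA.txt so the output follows FileA.txt order as in the expected output. The array names `m` and `category` are illustrative, and the sample data is copied from the question without the ** markers.

```shell
# Sample data from the question (without the ** markers).
cat > FileA.txt <<'EOF'
ID
479432_Sros_4274
330214_NIDE2792
517722_CJLT1_010100003977
257310_BB0482
EOF

cat > FileB.txt <<'EOF'
members   category
6085.XP_002168109,479432_Sros_4274,4956.XP_002495993.1,457425.SSHG_03214,51511.ENSCSAVP000  P
7159.AAEL006372-PA,257310_BB0482 J
517722_CJLT1_010100003977,701176.VIBRN418_17773,9785.ENSLAFP00000010769,28377.ENSACAP00000014901,4081.Solyc03g120250.2.1,3847.GLYMA18G02240.1 U
500485.XP_002561312.1,1042876.PPS_0730,222929.XP_003071446.1,330214_NIDE2792  S
EOF

awk '
NR==FNR {                          # first file on the command line: FileB.txt
    if (FNR > 1) {                 # skip its header line
        n = split($1, m, ",")      # break the members column into single IDs
        for (j = 1; j <= n; j++)
            category[m[j]] = $2    # map each member to its category
    }
    next
}
FNR==1 { print $1, "category"; next }        # header taken from FileA.txt
($1 in category) { print $1, category[$1] }  # exact lookup, FileA.txt order
' FileB.txt FileA.txt > Output.txt

cat Output.txt
```

Because the lookup is an exact hash test rather than a regex match, the header `ID` no longer matches anything, and IDs that are substrings of other members are not reported. IDs in FileA.txt with no match are silently skipped in this sketch.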

Could you please try the following.

awk '
BEGIN{
  print "ID  category"
}
FNR==NR{
  a[$0]
  next
}
{
  for(i in a){
    if(match($0,i)){
      print i,$NF
    }
  }
}
'  Input_filea   Input_fileb

Explanation: adding an explanation for the above code.

awk '                               ##Starting awk program here.
BEGIN{                              ##Starting BEGIN section from here.
  print "ID  category"              ##Printing string ID, category here.
}                                   ##Closing BLOCK for BEGIN section.
FNR==NR{                            ##Checking condition FNR==NR which will be TRUE when 1st Input_file is being read.
  a[$0]                             ##Creating an array named a whose index is $0.
  next                              ##next will skip all further statements from here.
}
{
  for(i in a){                      ##Traversing through array a with for loop.
    if(match($0,i)){                ##Checking whether i, used as a regex, matches the current line; if so, do the following.
      print i,$NF                   ##Printing variable i and $NF of current line.
    }
  }
}
'  Input_filea   Input_fileb        ##Mentioning Input_file names here.
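
Since `match($0,i)` treats each ID as a regular expression, characters such as `.` in an ID would act as wildcards, and the header line `ID` is stored in the array as well. Here is a hedged variant of the same idea, assuming a literal substring test is what is wanted: it uses `index()` instead of `match()` and skips the header line of both files. The sample data is the question's (without the ** markers), written to the file names used in the command above.

```shell
# Sample data from the question (without the ** markers).
cat > Input_filea <<'EOF'
ID
479432_Sros_4274
330214_NIDE2792
517722_CJLT1_010100003977
257310_BB0482
EOF

cat > Input_fileb <<'EOF'
members   category
6085.XP_002168109,479432_Sros_4274,4956.XP_002495993.1,457425.SSHG_03214,51511.ENSCSAVP000  P
7159.AAEL006372-PA,257310_BB0482 J
517722_CJLT1_010100003977,701176.VIBRN418_17773,9785.ENSLAFP00000010769,28377.ENSACAP00000014901,4081.Solyc03g120250.2.1,3847.GLYMA18G02240.1 U
500485.XP_002561312.1,1042876.PPS_0730,222929.XP_003071446.1,330214_NIDE2792  S
EOF

result=$(awk '
BEGIN { print "ID  category" }
FNR==NR {                    # first file: collect the IDs
    if (FNR > 1)             # skip the "ID" header so it is never searched for
        a[$0]
    next
}
FNR > 1 {                    # second file, header skipped
    for (i in a)
        if (index($0, i))    # literal substring test, no regex interpretation
            print i, $NF
}' Input_filea Input_fileb)

printf '%s\n' "$result"
```

This still reports an ID that happens to be a substring of a longer member, so it suits data where that cannot occur. The output follows Input_fileb order, as in the original command.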

