awk match three columns from two files and append matching lines to a new file
There are many posts similar to this one. After hours of troubleshooting I'm desperate, as it seems like it should be simple.
I have one file that looks like this:
tig00000005 15310 16162 XP_012153921.1 NW_003797090.1 LOC105664333 PREDICTED: elastin-like
tig00000005 23339 23974 XP_012152584.1 NW_003797083.1 LOC100878991 PREDICTED: LOW QUALITY PROTEIN
tig00000005 24600 25138 XP_012143166.1 NW_003797196.1 LOC100881279 PREDICTED: ankyrin-2 isoform X2
tig00000005 2685 4511 XP_012144644.1 NW_003797249.1 LOC105662970 PREDICTED: fibrinogen alpha chain-like isoform X2
tig00000005 28923 29432 XP_012148395.1 NW_003797444.1 LOC100881617 PREDICTED: eukaryotic translation initiation factor 4 gamma 3-like isoform X12
tig00000005 32415 34324 XP_012153921.1 NW_003797090.1 LOC105664333 PREDICTED: elastin-like
And a second file that looks like this:
tig00000005 maker gene 15310 16162 . + . ID=snap_masked-tig00000005-processed-gene-0.2;Name=snap_masked-tig00000005-processed-gene-0.2
tig00000005 maker gene 16764 17237 . + . ID=snap_masked-tig00000005-processed-gene-0.3;Name=snap_masked-tig00000005-processed-gene-0.3
tig00000005 maker gene 23339 23974 . + . ID=snap_masked-tig00000005-processed-gene-0.4;Name=snap_masked-tig00000005-processed-gene-0.4
tig00000005 maker gene 24600 25138 . - . ID=snap_masked-tig00000005-processed-gene-0.10;Name=snap_masked-tig00000005-processed-gene-0.10
tig00000005 maker gene 25472 26900 . + . ID=snap_masked-tig00000005-processed-gene-0.5;Name=snap_masked-tig00000005-processed-gene-0.5
I would like to match columns 1, 2, and 3 of the first file with columns 1, 4, and 5 of the second and, if they match, append the second file's line to the first file's line, like so:
tig00000005 15310 16162 XP_012153921.1 NW_003797090.1 LOC105664333 PREDICTED: elastin-like tig00000005 maker gene 15310 16162 . + . ID=snap_masked-tig00000005-processed-gene-0.2;Name=snap_masked-tig00000005-processed-gene-0.2
Some example code that has not worked:
awk 'OFS="\t"; FS="\t"; NR==FNR{a[$1,$2,$3]=$0; next} (($1,$4,$5) in a){print $0,a[$0]}' file 1 file 2
awk 'OFS="\t"; FS="\t"; NR==FNR{a[$1,$2,$3]=($1,$4,$5)} {print $0,a[$0]}' file 1 file 2
The first outputs every line from file 1 followed by (not appended to) file 2; the second throws errors related to the = operator. I've tried every permutation of this I can imagine. Thank you for any help you can provide.
Like this?
awk 'NR==FNR{a[$1 $2 $3]=$0; next}; {if($1 $4 $5 in a){print a[$1 $4 $5],$0}}' file1 file2
tig00000005 15310 16162 XP_012153921.1 NW_003797090.1 LOC105664333 PREDICTED: elastin-like tig00000005 maker gene 15310 16162 . + . ID=snap_masked-tig00000005-processed-gene-0.2;Name=snap_masked-tig00000005-processed-gene-0.2
tig00000005 23339 23974 XP_012152584.1 NW_003797083.1 LOC100878991 PREDICTED: LOW QUALITY PROTEIN tig00000005 maker gene 23339 23974 . + . ID=snap_masked-tig00000005-processed-gene-0.4;Name=snap_masked-tig00000005-processed-gene-0.4
tig00000005 24600 25138 XP_012143166.1 NW_003797196.1 LOC100881279 PREDICTED: ankyrin-2 isoform X2 tig00000005 maker gene 24600 25138 . - . ID=snap_masked-tig00000005-processed-gene-0.10;Name=snap_masked-tig00000005-processed-gene-0.10
To write to a new file, just redirect the output:
awk 'NR==FNR{a[$1 $2 $3]=$0; next}; {if($1 $4 $5 in a){print a[$1 $4 $5],$0}}' file1 file2 > file3
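One caveat with the concatenated key $1 $2 $3: with no separator between the fields, different field combinations can produce the same string. Below is a minimal sketch of the same approach using comma-separated subscripts, which awk joins with SUBSEP. The sample file contents and the temporary directory are made up for illustration and are not from the question's data:

```shell
# Hypothetical sample data, created in a temporary directory.
tmp=$(mktemp -d)
printf 'tig1 12 3 first\ntig5 100 200 second\n' > "$tmp/file1"
printf 'tig11 maker gene 2 3 collides\ntig5 maker gene 100 200 matches\n' > "$tmp/file2"

# Concatenated key: "tig1" "12" "3" and "tig11" "2" "3" both become "tig1123".
loose=$(awk 'NR==FNR{a[$1 $2 $3]=$0; next} ($1 $4 $5) in a {print a[$1 $4 $5], $0}' "$tmp/file1" "$tmp/file2")

# Comma-separated subscripts are joined with SUBSEP, so the fields stay distinct.
strict=$(awk 'NR==FNR{a[$1,$2,$3]=$0; next} ($1,$4,$5) in a {print a[$1,$4,$5], $0}' "$tmp/file1" "$tmp/file2")

printf 'concatenated key:\n%s\n\nSUBSEP key:\n%s\n' "$loose" "$strict"
rm -rf "$tmp"
```

With data like the question's, where the first column has a fixed width, the two forms should give the same result; the comma form is simply the safer default.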
A couple of small changes to OP's first awk script:
# old:
awk 'OFS="\t"; FS="\t"; NR==FNR{a[$1,$2,$3]=$0; next} (($1,$4,$5) in a){print $0,a[$0]}' file1 file2
# new - add BEGIN block, modify print statement:
awk 'BEGIN {FS=OFS="\t"} NR==FNR{a[$1,$2,$3]=$0; next} (($1,$4,$5) in a){print a[$1,$4,$5],$0}' file1 file2
The modified awk script generates:
tig00000005 15310 16162 XP_012153921.1 NW_003797090.1 LOC105664333 PREDICTED: elastin-like tig00000005 maker gene 15310 16162 . + . ID=snap_masked-tig00000005-processed-gene-0.2;Name=snap_masked-tig00000005-processed-gene-0.2
tig00000005 23339 23974 XP_012152584.1 NW_003797083.1 LOC100878991 PREDICTED: LOW QUALITY PROTEIN tig00000005 maker gene 23339 23974 . + . ID=snap_masked-tig00000005-processed-gene-0.4;Name=snap_masked-tig00000005-processed-gene-0.4
tig00000005 24600 25138 XP_012143166.1 NW_003797196.1 LOC100881279 PREDICTED: ankyrin-2 isoform X2 tig00000005 maker gene 24600 25138 . - . ID=snap_masked-tig00000005-processed-gene-0.10;Name=snap_masked-tig00000005-processed-gene-0.10
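For reference, a small sketch of one reason the original attempt printed every line: outside a BEGIN block, OFS="\t"; is a pattern with no action. The assignment evaluates to a non-empty string, so the pattern is true for every record and awk applies its default action of printing the line. The two-line input below is made up for illustration:

```shell
# A bare assignment used as a pattern is true for every record,
# so each input line is printed once before the action block runs.
as_pattern=$(printf 'a b\nc d\n' | awk 'OFS="\t"; {print $1, $2}')

# In a BEGIN block the assignment runs once, before any input is read,
# and nothing extra is printed.
in_begin=$(printf 'a b\nc d\n' | awk 'BEGIN {OFS="\t"} {print $1, $2}')

printf 'pattern form:\n%s\n\nBEGIN form:\n%s\n' "$as_pattern" "$in_begin"
```

The pattern form prints each input line unchanged and then the tab-joined fields, while the BEGIN form prints only the tab-joined fields.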