Compare file1 with one column to two columns from file2
I have two files: file1.txt and file2.txt
#file1.txt
xap1
NM_121
abc4
xxx0
uvw
#file2.txt
A123 001 xap1 mmmmm
B123 xxx0 nnnnn
C123 003 yyy1 ppppp
D123 004 zzz1 NM_121
E123 005 abc4 llllll
F123 jjjj www
I want the following output: match column 1 of file1 against column 3 and column 4 of file2, get column 2 from file2, and print both:
#file3.txt
xap1 001
NM_121 004
abc4 005
xxx0 NA
uvw NA
I used the following command, but don't know how to print column 1 from file1:
grep -w -F -f file1.txt file2.txt | awk '{print $2}' > file3.txt
Thanks.
Could you please try the following? This can be done in a single awk (written and tested with the shown samples only).
awk '
FNR==NR{
a[$3]=$2
a[$4]=$2
next
}
{
printf("%s\n",$1 in a?$1 OFS a[$1]:$1 OFS "NA")
}
' Input_file2 Input_file1
One-liner form:
awk 'FNR==NR{a[$3]=$2;a[$4]=$2;next}{printf("%s\n",$1 in a?$1 OFS a[$1]:$1 OFS "NA")}' file2.txt file1.txt
The output will be as follows.
xap1 001
NM_121 004
abc4 005
xxx0 NA
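To make the two-pass FNR==NR idiom easier to follow, here is a self-contained sketch that recreates the question's sample files and runs a commented version of the same logic. The temporary directory and the result variable are only there for illustration; they are not part of the original answer.

```shell
#!/bin/sh
# Work in a throwaway directory (hypothetical; any location works).
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Recreate the sample inputs from the question (space-separated).
printf '%s\n' xap1 NM_121 abc4 xxx0 uvw > file1.txt
cat > file2.txt <<'EOF'
A123 001 xap1 mmmmm
B123 xxx0 nnnnn
C123 003 yyy1 ppppp
D123 004 zzz1 NM_121
E123 005 abc4 llllll
F123 jjjj www
EOF

result=$(awk '
FNR==NR {            # true only while reading the first file (file2.txt)
    a[$3] = $2       # remember column 2 under the column-3 key
    a[$4] = $2       # ... and under the column-4 key
    next             # skip the second block for file2.txt lines
}
{                    # second file (file1.txt): look up each key
    print $1, ($1 in a ? a[$1] : "NA")
}' file2.txt file1.txt)

printf '%s\n' "$result"
```

Note that with default whitespace splitting, the line "B123 xxx0 nnnnn" has xxx0 in $2 rather than $3, so xxx0 is reported as NA simply because it is never stored as a key.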
OR: On Windows systems, try the following by making an .awk script (named script.txt). Here is a very nice link that explains more about how to run awk on Windows systems too: Convert Text to Table (Space Delimited or Fixed length)
If you have installed Windows Subsystem for Linux, you can run the awk script directly on the bash command line as described above. If you have installed (or are going to install) gawk as a standalone application, the following guidance will help:
First, download Gawk for Windows from an appropriate server such as Sourceforge. There are two types of installation: with an installer or without one. The choice is up to you. The following description is based on the case without an installer.
Unzip the downloaded file to extract the binaries and modules to an arbitrary location (Download folder, desktop, or wherever).
Create a working folder with an arbitrary name (such as "myawk") on your desktop or wherever convenient.
Copy the script below to a file with an arbitrary name (such as "script.txt").
As the awk executable doesn't care about the extension of the script file, you can keep ".txt" so it stays associated with a text editor, or change it to ".awk" to make its purpose explicit.
FNR==NR{
a[$3]=$2
a[$4]=$2
next
}
{
printf("%s\n",$1 in a?$1 OFS a[$1]:$1 OFS "NA")
}
Now type the following command in your terminal:
C:\your\path\to\gawk.exe -f script.txt Input_file2 Input_file1
$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==FNR {
map[$3] = map[$4] = ($2 == "" ? "NA1" : $2)
next
}
{ print $1, ($1 in map ? map[$1] : "NA2") }
$ awk -f tst.awk file2 file1
xap1 001
NM_121 004
abc4 005
xxx0 NA1
I used 2 different NA values to distinguish between cases where $1 exists in file2 but has a blank entry (e.g. xxx0) vs where it just doesn't exist in file2, e.g. some random string like foobar:
$ cat file1
xap1
NM_121
abc4
xxx0
foobar
$ awk -f tst.awk file2 file1
xap1 001
NM_121 004
abc4 005
xxx0 NA1
foobar NA2
Massage to suit.
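The script above depends on file2 being tab-separated, so that an empty second column remains an empty field instead of disappearing. A minimal sketch of the difference, using a single made-up record with an empty column 2 (the variable names are illustrative only):

```shell
#!/bin/sh
# One tab-separated record whose second column is empty (illustrative only).
line=$(printf 'B123\t\txxx0\tnnnnn')

# Default splitting collapses runs of blanks, so the fields shift left.
default_third=$(printf '%s\n' "$line" | awk '{ print $3 }')

# With FS set to a tab, the empty column is preserved as an empty field.
tab_third=$(printf '%s\n' "$line" | awk -F'\t' '{ print $3 }')

printf 'default FS: $3 = %s\n' "$default_third"
printf 'tab FS:     $3 = %s\n' "$tab_third"
```

With the tab separator, xxx0 stays in column 3 and can be matched, which is what lets the script report NA1 for it rather than treating it as absent.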
The following:
{
join -t$'\t' -12 -23 -o1.1,1.2,2.2 <(nl -w1 file1.txt | sort -t$'\t' -k2) <(sort -t$'\t' -k3 file2.txt)
join -t$'\t' -12 -24 -o1.1,1.2,2.2 <(nl -w1 file1.txt | sort -t$'\t' -k2) <(sort -t$'\t' -k4 file2.txt)
} |
sort -t$'\t' -k1 | cut -f2- |
# insert NA if the value is missing
sed 's/\t$/\tNA/'
With the following recreation of the input files:
cat <<EOF >file1.txt
xap1
NM_121
abc4
xxx0
EOF
# used tr to recreate a tab separated file
tr ' ' '\t' <<EOF >file2.txt
A123 001 xap1 mmmmm
B123 xxx0 nnnnn
C123 003 yyy1 ppppp
D123 004 zzz1 NM_121
E123 005 abc4 llllll
EOF
Outputs:
xap1 001
NM_121 004
abc4 005
xxx0 NA
Short explanation of main points:
nl -w1 file1.txt | sort -t$'\t' -k2 - number the lines in file1.txt and sort on the second field.
join - join files. We join file1.txt with file2.txt twice: first matching the key from file1.txt (its first field, which is the second field after numbering) against the 3rd field of file2.txt, then against the 4th field of file2.txt. For join, the inputs have to be sorted on the joined fields.
sort -t$'\t' -k1 | cut -f2- - the lines in file1.txt are numbered, so later we can sort them using the line numbers (i.e. restore the original order of file1.txt) and then remove the line numbers.
sed 's/\t$/\tNA/' - the field is empty in file2.txt, while OP specified the output to be "NA". If the second column is missing from the output, insert the characters NA there.
<( ... ) and >( ... ) are process substitutions.
The output may be piped through sort -k1 -u to remove duplicates, depending on needs. The sorting and numbering of file1.txt could be optimized with some tee, like nl | sort | tee >(join - file2.txt) >(join - file2.txt)
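One caveat when restoring the original order: sort -k1 compares the line numbers as text, so for a file1.txt with ten or more lines, line 10 would sort before line 2. Below is a sketch of a numeric variant, assuming a sort that accepts a numeric key restricted to the first field (-k1,1n); the twelve sample records are made up, and LC_ALL=C is set only to make the text comparison predictable.

```shell
#!/bin/sh
# Twelve numbered records in shuffled order, tab-separated like `nl -w1` output.
tab=$(printf '\t')
numbered=$(printf '%s\tkey%s\n' 10 j 2 b 1 a 12 l 3 c 11 k 4 d 5 e 6 f 7 g 8 h 9 i)

# Text comparison of the line numbers: "10" sorts before "2".
text_order=$(printf '%s\n' "$numbered" | LC_ALL=C sort -t"$tab" -k1 | cut -f2- | paste -s -d ' ' -)

# Numeric comparison on the first field only restores the real order.
numeric_order=$(printf '%s\n' "$numbered" | LC_ALL=C sort -t"$tab" -k1,1n | cut -f2- | paste -s -d ' ' -)

printf 'text:    %s\n' "$text_order"
printf 'numeric: %s\n' "$numeric_order"
```

For inputs of nine lines or fewer, such as the samples here, both forms give the same order.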