Compare file1 with one column against two columns from file2

Question

I have two files, file1.txt and file2.txt:

#file1.txt
xap1
NM_121
abc4
xxx0
uvw


#file2.txt
A123  001  xap1    mmmmm
B123       xxx0    nnnnn
C123  003  yyy1    ppppp
D123  004  zzz1    NM_121
E123  005  abc4    llllll
F123       jjjj    www

I want the following output: match column 1 of file1 against column 3 and column 4 of file2, take column 2 from the matching line of file2, and print both:

#file3.txt
xap1    001
NM_121  004
abc4    005
xxx0    NA
uvw     NA

I used the following command, but don't know how to print column 1 from file1:

grep -w -F -f file1.txt file2.txt | awk '{print $2}' > file3.txt

Thanks.

Answer 1

Could you please try the following? This can be done in a single awk (written and tested with the shown samples only).

awk '
FNR==NR{
  a[$3]=$2
  a[$4]=$2
  next
}
{
  printf("%s\n",$1 in a?$1 OFS a[$1]:$1 OFS "NA")
}
' Input_file2  Input_file1

One-liner form:

awk 'FNR==NR{a[$3]=$2;a[$4]=$2;next}{printf("%s\n",$1 in a?$1 OFS a[$1]:$1 OFS "NA")}' file2.txt file1.txt

Output will be as follows.

xap1 001
NM_121 004
abc4 005
xxx0 NA
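For anyone who wants to reproduce this, here is a minimal self-contained sketch of the same two-pass lookup. The scratch directory, the `result` variable, and the use of `print` instead of `printf` are my own choices for illustration, not part of the answer. It assumes file2.txt is whitespace-separated exactly as shown in the question.

```shell
#!/bin/sh
# Recreate the sample inputs in a scratch directory (illustrative paths).
tmp=$(mktemp -d)
cat > "$tmp/file1.txt" <<'EOF'
xap1
NM_121
abc4
xxx0
uvw
EOF
cat > "$tmp/file2.txt" <<'EOF'
A123  001  xap1    mmmmm
B123       xxx0    nnnnn
C123  003  yyy1    ppppp
D123  004  zzz1    NM_121
E123  005  abc4    llllll
F123       jjjj    www
EOF

# Pass 1 (file2, while FNR==NR): store column 2 under the keys in columns 3 and 4.
# Pass 2 (file1): print the key and its stored value, or NA when the key is unknown.
result=$(awk '
  FNR==NR { a[$3]=$2; a[$4]=$2; next }
  { print $1, (($1 in a) ? a[$1] : "NA") }
' "$tmp/file2.txt" "$tmp/file1.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Note that with default whitespace splitting, a line whose second column is blank (such as the B123 line) has its remaining fields shifted left, so xxx0 is never stored as a key and gets NA only because the lookup fails. If file2 is really tab-separated, setting FS to a tab keeps the empty column in place.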

OR: For Windows systems, try the following by putting the awk program in a script file (named script.txt here). This is a very nice link that explains more about how to run awk on Windows systems: Convert Text to Table (Space Delimited or Fixed length)

If you have installed Windows Subsystem for Linux, you can run the awk script directly on the bash command line as described above. If you have installed (or are going to install) gawk as a standalone application, the following guidance will help:
First download Gawk for Windows from an appropriate server such as Sourceforge. There are two types of installation: with an installer or without one. The choice is up to you. The following description is based on the case without an installer.
Unzip the downloaded file to extract the binaries and modules to an arbitrary location (Downloads folder, desktop, or wherever).
Create a working folder with an arbitrary name (such as "myawk") on your desktop or wherever convenient.
Copy the script below to a file with an arbitrary name (such as "script.txt").

Because the awk executable doesn't care about the extension of the script file, you can keep ".txt" so that the file opens in a text editor, or change it to ".awk" to make its purpose clear.

FNR==NR{
  a[$3]=$2
  a[$4]=$2
  next
}
{
  printf("%s\n",$1 in a?$1 OFS a[$1]:$1 OFS "NA")
}

Now type the following command in your terminal:

C:\your\path\to\gawk.exe -f script.txt Input_file2  Input_file1

Answer 2

$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==FNR {
    map[$3] = map[$4] = ($2 == "" ? "NA1" : $2)
    next
}
{ print $1, ($1 in map ? map[$1] : "NA2") }

$ awk -f tst.awk file2 file1
xap1    001
NM_121  004
abc4    005
xxx0    NA1

I used 2 different NA values to distinguish between cases where $1 exists in file2 but has a blank entry (e.g. xxx0) vs where it just doesn't exist in file2, e.g. some random string like foobar:

$ cat file1
xap1
NM_121
abc4
xxx0
foobar

$ awk -f tst.awk file2 file1
xap1    001
NM_121  004
abc4    005
xxx0    NA1
foobar  NA2

Massage to suit.

Answer 3

The following:

{
join -t$'\t' -12 -23 -o1.1,1.2,2.2 <(nl -w1 file1.txt | sort -t$'\t' -k2) <(sort -t$'\t' -k3 file2.txt)
join -t$'\t' -12 -24 -o1.1,1.2,2.2 <(nl -w1 file1.txt | sort -t$'\t' -k2) <(sort -t$'\t' -k4 file2.txt)
} |
sort -t$'\t' -k1 | cut -f2- |
# insert NA if the value is missing
sed 's/\t$/\tNA/'

With the following recreation of the input files:

cat <<EOF >file1.txt
xap1
NM_121
abc4
xxx0
EOF

# used tr to recreate a tab separated file
tr ' ' '\t' <<EOF >file2.txt
A123 001 xap1 mmmmm
B123  xxx0 nnnnn
C123 003 yyy1 ppppp
D123 004 zzz1 NM_121
E123 005 abc4 llllll
EOF

Outputs:

xap1    001
NM_121  004
abc4    005
xxx0    NA

Tested on repl.

Short explanation of main points:

nl -w1 file1.txt | sort -t$'\t' -k2 - number the lines in file1.txt and sort on the second field
join - join files. We join file1.txt with file2.txt twice: first matching the file1.txt value (field 2 after numbering) against the 3rd field of file2.txt, and then against the 4th field of file2.txt. For join, the inputs have to be sorted on the joined fields.
sort -t$'\t' -k1 | cut -f2- - the lines in file1.txt are numbered, so later we can sort them using the line numbers (i.e. restore the original order of file1.txt) and then remove the line numbers
sed 's/\t$/\tNA/' - the field is empty in file2.txt, while the OP specified the output to be "NA". If the second column is missing from the output, insert the characters NA there.
<( ... ) is a process substitution
If a value is matched by both the 3rd and the 4th field, it will appear twice in the output. It could just be piped through sort -k1 -u to remove the duplicates, depending on needs.
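The number, sort, restore, strip idea from the points above can be tried in isolation. This is a minimal sketch, not the answer's exact pipeline: the `tab` and `restored` variables are my own names, and I use a numeric key (`-k1,1n`) for the restoring sort, whereas the answer's plain `-k1` compares the numbers as text and would place line 10 before line 2 on longer inputs.

```shell
#!/bin/sh
# A literal tab, built portably so the sketch does not rely on bash's $'\t'.
tab=$(printf '\t')

restored=$(
  printf '%s\n' xap1 NM_121 abc4 xxx0 |
    nl -w1 |                      # prefix each line with its number and a tab
    sort -t "$tab" -k2,2 |        # sort on the value, as join would require
    sort -t "$tab" -k1,1n |       # sort numerically on the line number again
    cut -f2-                      # drop the line number
)

printf '%s\n' "$restored"
```

Whatever order the middle sort produces, the final numeric sort brings the lines back to their original sequence before the numbers are cut off.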

Answer 1
0 2019-11-28 14:46:00

Answer 2
0 2019-11-28 17:44:02

Answer 3
-1 2019-11-28 15:28:46
