Compare file1 with one column to two columns from file2

I have two files, file1.txt and file2.txt:

#file1.txt
xap1
NM_121
abc4
xxx0
uvw


#file2.txt
A123  001  xap1    mmmmm
B123       xxx0    nnnnn
C123  003  yyy1    ppppp
D123  004  zzz1    NM_121
E123  005  abc4    llllll
F123       jjjj    www

I want the following output: match column 1 of file1 against column 3 and column 4 of file2, take column 2 from file2, and print both:

#file3.txt
xap1    001
NM_121  004
abc4    005
xxx0    NA
uvw     NA

I used the following command, but don't know how to print column 1 from file1:

grep -w -F -f file1.txt file2.txt | awk '{print $2}' > file3.txt

Thanks.

Could you please try the following? This can be done in a single awk (written and tested with the shown samples only).

awk '
FNR==NR{
  a[$3]=$2
  a[$4]=$2
  next
}
{
  printf("%s\n",$1 in a?$1 OFS a[$1]:$1 OFS "NA")
}
' Input_file2  Input_file1

In one-liner form:

awk 'FNR==NR{a[$3]=$2;a[$4]=$2;next}{printf("%s\n",$1 in a?$1 OFS a[$1]:$1 OFS "NA")}' file2.txt file1.txt

The output will be as follows.

xap1 001
NM_121 004
abc4 005
xxx0 NA
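
A note on why xxx0 ends up as NA here: with awk's default field splitting, runs of whitespace count as one separator, so a blank second column disappears and the later columns shift left. The sketch below illustrates this on a single row, assuming the real file2 is tab-separated (the question does not say). The variable names are only for illustration.

```shell
# One row from file2 with an empty second column, written with explicit tabs
# (assumption: the real file is tab-separated).
row=$(printf 'B123\t\txxx0\tnnnnn')

# Default splitting on runs of whitespace: the empty column vanishes,
# so xxx0 becomes $2 and nnnnn becomes $3.
default_split=$(printf '%s\n' "$row" | awk '{ print NF ":" $2 ":" $3 }')

# Splitting on single tabs: the empty column is kept as an empty $2.
tab_split=$(printf '%s\n' "$row" | awk -F'\t' '{ print NF ":" $2 ":" $3 }')

printf '%s\n' "$default_split" "$tab_split"
```

With default splitting, xxx0 is never stored as a key of the array a, so the NA comes from the failed lookup rather than from the blank column. Setting the field separator to a tab keeps the columns aligned.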


OR: On Windows systems, save the code as an awk script (here named script.txt). The post "Convert Text to Table (Space Delimited or Fixed length)" is a very nice guide to running awk on Windows.

  • If you have installed Windows Subsystem for Linux, you can run the awk script directly on the bash command line as described above. If you have installed (or are going to install) gawk as a standalone application, the following guidance will help:

  • First download Gawk for Windows from an appropriate server such as SourceForge. There are two types of installation: with or without an installer. The choice is up to you. The following description assumes the variant without an installer.

  • Unzip the downloaded file to extract the binaries and modules to an arbitrary location (download folder, desktop, or wherever).

  • Create a working folder with an arbitrary name (such as "myawk") on your desktop or wherever convenient.

  • Copy the script below to a file with an arbitrary name (such as "script.txt").

As the awk executable doesn't care about the extension of the script file, you can keep ".txt" so that the file opens in a text editor, or change it to ".awk" to make its purpose clear.

FNR==NR{
  a[$3]=$2
  a[$4]=$2
  next
}
{
  printf("%s\n",$1 in a?$1 OFS a[$1]:$1 OFS "NA")
}

Now type the following command in your terminal:

C:\your\path\to\gawk.exe -f script.txt Input_file2  Input_file1

$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==FNR {
    map[$3] = map[$4] = ($2 == "" ? "NA1" : $2)
    next
}
{ print $1, ($1 in map ? map[$1] : "NA2") }

$ awk -f tst.awk file2 file1
xap1    001
NM_121  004
abc4    005
xxx0    NA1

I used 2 different NA values to distinguish between the case where $1 exists in file2 but has a blank entry (e.g. xxx0) and the case where it doesn't exist in file2 at all, e.g. some random string like foobar:

$ cat file1
xap1
NM_121
abc4
xxx0
foobar

$ awk -f tst.awk file2 file1
xap1    001
NM_121  004
abc4    005
xxx0    NA1
foobar  NA2

Massage to suit.
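
As one possible adjustment, the sketch below prints a plain NA for both cases, which is what the question's expected output shows. It recreates a few sample rows with explicit tabs, on the assumption that the data is tab-separated. The workdir and result names are only for illustration.

```shell
# Recreate sample inputs with explicit tabs (assumption: tab-separated data).
workdir=$(mktemp -d)
printf '%s\n' xap1 NM_121 abc4 xxx0 uvw > "$workdir/file1"
{
    printf 'A123\t001\txap1\tmmmmm\n'
    printf 'B123\t\txxx0\tnnnnn\n'
    printf 'D123\t004\tzzz1\tNM_121\n'
    printf 'E123\t005\tabc4\tllllll\n'
} > "$workdir/file2"

# Same lookup as tst.awk, but a blank entry and a missing key both print "NA".
result=$(awk '
BEGIN { FS = OFS = "\t" }
NR == FNR {                       # first file on the command line: file2
    map[$3] = map[$4] = ($2 == "" ? "NA" : $2)
    next
}
{ print $1, ($1 in map ? map[$1] : "NA") }   # second file: file1
' "$workdir/file2" "$workdir/file1")

printf '%s\n' "$result"
rm -rf "$workdir"
```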

The following:

{
join -t$'\t' -12 -23 -o1.1,1.2,2.2 <(nl -w1 file1.txt | sort -t$'\t' -k2) <(sort -t$'\t' -k3 file2.txt)
join -t$'\t' -12 -24 -o1.1,1.2,2.2 <(nl -w1 file1.txt | sort -t$'\t' -k2) <(sort -t$'\t' -k4 file2.txt)
} |
sort -t$'\t' -k1 | cut -f2- |
# insert NA if the value is missing
sed 's/\t$/\tNA/'

With the following recreation of the input files:

cat <<EOF >file1.txt
xap1
NM_121
abc4
xxx0
EOF

# used tr to recreate a tab separated file
tr ' ' '\t' <<EOF >file2.txt
A123 001 xap1 mmmmm
B123  xxx0 nnnnn
C123 003 yyy1 ppppp
D123 004 zzz1 NM_121
E123 005 abc4 llllll
EOF

Outputs:

xap1    001
NM_121  004
abc4    005
xxx0    NA

Tested on repl.

Short explanation of the main points:

  • nl -w1 file1.txt | sort -t$'\t' -k2 - number the lines in file1.txt and sort them on the second field (the original text).
  • join - join files. file1.txt is joined with file2.txt twice: first its value (field 2 after numbering) against field 3 of file2.txt, then against field 4. join requires both inputs to be sorted on the join field.
  • sort -t$'\t' -k1 | cut -f2- - because the lines of file1.txt were numbered, the joined lines can be sorted by line number to restore the original order of file1.txt, after which the line numbers are removed. For a file1.txt with ten or more lines the key should be numeric (-k1,1n), otherwise line 10 sorts before line 2.
  • sed 's/\t$/\tNA/' - the field is empty in file2.txt, while the OP specified the output to be "NA". If the second column is missing from the output, insert the characters NA there.
  • <( ... ) is a process substitution.
  • If a value matches both the 3rd and the 4th field, it appears twice in the output. Depending on needs, the result could be piped through sort -k1 -u to remove the duplicates.
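
The number, sort, restore idea from the points above can be tried in isolation. The sketch below uses a hypothetical 11-line input so that two-digit line numbers occur, and numbers the lines with awk's NR as a stand-in for nl -w1. The variable names are only for illustration.

```shell
tab=$(printf '\t')

# Hypothetical 11-line input, so that line numbers 10 and 11 appear.
input=$(printf '%s\n' k j i h g f e d c b a)

# Number the lines, sort by the text (as join would require), then restore
# the original order with a numeric sort on the line number and strip it.
restored=$(printf '%s\n' "$input" |
    awk '{ print NR "\t" $0 }' |
    sort -t "$tab" -k2,2 |
    sort -t "$tab" -k1,1n |
    cut -f2-)

printf '%s\n' "$restored"
```

The final sort uses a numeric key limited to the first field, so the restored order matches the input even past line 9.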

The sorting and numbering of file1.txt could be optimized with some tee, like nl | sort | tee >(join - file2.txt) >(join - file2.txt).
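
A simpler way to avoid numbering and sorting file1.txt twice is to keep the sorted result in a temporary file and reuse it for both joins. This is a swapped-in technique, not the tee variant hinted at above. The sketch below is a minimal version under a few assumptions: tab-separated input, the C locale for both sort and join, and the illustrative names workdir and file1.sorted.

```shell
export LC_ALL=C                      # same collation for sort and join
tab=$(printf '\t')
workdir=$(mktemp -d)

printf '%s\n' xap1 NM_121 abc4 xxx0 > "$workdir/file1.txt"
{
    printf 'A123\t001\txap1\tmmmmm\n'
    printf 'B123\t\txxx0\tnnnnn\n'
    printf 'C123\t003\tyyy1\tppppp\n'
    printf 'D123\t004\tzzz1\tNM_121\n'
    printf 'E123\t005\tabc4\tllllll\n'
} > "$workdir/file2.txt"

# Number and sort file1.txt once; both joins read this file.
nl -w1 "$workdir/file1.txt" | sort -t "$tab" -k2,2 > "$workdir/file1.sorted"

result=$(
    {
        # Match against column 3 of file2.txt, then against column 4.
        sort -t "$tab" -k3,3 "$workdir/file2.txt" |
            join -t "$tab" -1 2 -2 3 -o 1.1,1.2,2.2 "$workdir/file1.sorted" -
        sort -t "$tab" -k4,4 "$workdir/file2.txt" |
            join -t "$tab" -1 2 -2 4 -o 1.1,1.2,2.2 "$workdir/file1.sorted" -
    } |
    sort -t "$tab" -k1,1n |          # restore the order of file1.txt
    cut -f2- |                       # drop the line numbers
    sed "s/${tab}\$/${tab}NA/"       # empty second column becomes NA
)

printf '%s\n' "$result"
rm -rf "$workdir"
```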
