How do I compare two text files with multiple fields in UNIX?

Question

I have two text files.

file 1

number,name,account id,vv,sfee,dac acc,TDID
7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1
8001,nike,1,2,4,1,8
8002,paul,2,0,0,2,7

file 2

number,account id,dac acc,TDID
7000,2,1,6
7001,2,1,7
7002,2,1,6
7003,1,2,1

I want to compare those two text files. If the four columns of file 2 are present in file 1 with equal values, I want output like this:

7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1

EDIT: From the OP's comments:

nawk -F"," 'NR==FNR {a[$1];next} ($1 in a)' file2.txt file1.txt

This works well for comparing a single column across two files, but I want to compare multiple columns. Do you have any suggestions?

Answer 1

This awk one-liner works for multiple columns on unsorted files:

awk -F, 'NR==FNR{a[$1,$2,$3,$4]++;next} (a[$1,$3,$6,$7])' file1.txt file2.txt

For this to work, the first input file (file1.txt in my example) must be the one that has only 4 fields, like so:

file1.txt

7000,2,1,6
7001,2,1,7
7002,2,1,6
7003,1,2,1

file2.txt

7000,john,2,0,0,1,6
7000,john,2,0,0,1,7
7000,john,2,0,0,1,8
7000,john,2,0,0,1,9
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1
7003,mike,1,0,0,2,2
7003,mike,1,0,0,2,3
7003,mike,1,0,0,2,4
8001,nike,1,2,4,1,8
8002,paul,2,0,0,2,7

Output

$ awk -F, 'NR==FNR{a[$1,$2,$3,$4]++;next} (a[$1,$3,$6,$7])' file1.txt file2.txt
7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1

Alternatively, you could use the following syntax, which more closely matches the one in your question but is not very readable, IMHO:

awk -F, 'NR==FNR{a[$1,$2,$3,$4];next} ($1SUBSEP$3SUBSEP$6SUBSEP$7 in a)' file1.txt file2.txt
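For the files exactly as posted in the question (7-column file first described, 4-column file second, both with a header line), the same idea might be sketched as below. The sample rows are copied from the question; the scratch directory, the array name seen, and the FNR == 1 header skip are assumptions added for illustration, and the parenthesized (i, j) in array form is standard awk syntax for a multi-field subscript test.

```shell
# Sample data from the question, written to a scratch directory (assumed for this demo).
tmp=$(mktemp -d)
cat > "$tmp/file1.txt" <<'EOF'
number,name,account id,vv,sfee,dac acc,TDID
7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1
8001,nike,1,2,4,1,8
8002,paul,2,0,0,2,7
EOF
cat > "$tmp/file2.txt" <<'EOF'
number,account id,dac acc,TDID
7000,2,1,6
7001,2,1,7
7002,2,1,6
7003,1,2,1
EOF

# First file read (4 columns): remember each 4-field key, skipping the header.
# Second file read (7 columns): print rows whose fields 1, 3, 6, 7 form a remembered key.
matches=$(awk -F, '
    FNR == 1 { next }
    NR == FNR { seen[$1, $2, $3, $4]; next }
    ($1, $3, $6, $7) in seen
' "$tmp/file2.txt" "$tmp/file1.txt")
printf '%s\n' "$matches"
rm -r "$tmp"
```

Note that the 4-column file is passed first here, so with the question's own naming the argument order is file2.txt then file1.txt.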

Answer 2

TxtSushi looks like what you want. It lets you work with CSV files using SQL.

Answer 3

It's not an elegant one-liner, but you could do it with perl.

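#!/usr/bin/perl
use strict;
use warnings;

# Arguments: file1 (7 fields per row), then file2 (4 fields per row)
my %k;
open my $fa, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
while (my $line = <$fa>) {
    chomp $line;
    my @f = split /,/, $line;
    $k{$f[0]} = [@f] if @f;    # index each file1 row by its first field
}
close $fa;

open my $fb, '<', $ARGV[1] or die "Cannot open $ARGV[1]: $!";
while (my $line = <$fb>) {
    chomp $line;
    my @f = split /,/, $line;
    next unless @f >= 4 && defined $k{$f[0]};
    my $row = $k{$f[0]};
    # Print the file1 row when its fields 3, 6, 7 equal file2 fields 2, 3, 4.
    print join(',', @$row), "\n"
        if $row->[2] eq $f[1] && $row->[5] eq $f[2] && $row->[6] eq $f[3];
}
close $fb;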

Answer 4

Quick answer: use cut to split out the fields you need, then use diff to compare the results.
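A rough sketch of that idea, assuming the key columns of the 7-column file are 1, 3, 6 and 7 as in the question. The data is a shortened copy of the question's files, and the scratch directory and the file1.keys name are made up for the demo. Keep in mind that diff reports the rows that differ rather than printing the matching rows, so this shows which keys are missing from file2 instead of producing the output the question asks for.

```shell
# Shortened copy of the question's data in a scratch directory (assumed for this demo).
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
number,name,account id,vv,sfee,dac acc,TDID
7000,john,2,0,0,1,6
7003,mike,1,0,0,2,1
8001,nike,1,2,4,1,8
EOF
cat > "$tmp/file2" <<'EOF'
number,account id,dac acc,TDID
7000,2,1,6
7003,1,2,1
EOF

# Keep only the key columns of the 7-column file so both sides share one layout.
cut -d, -f1,3,6,7 "$tmp/file1" > "$tmp/file1.keys"

# diff exits non-zero when the inputs differ, so its status is deliberately ignored.
report=$(diff "$tmp/file1.keys" "$tmp/file2" || true)
printf '%s\n' "$report"
rm -r "$tmp"
```

Lines in the report that start with < are keys found only in the 7-column file.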

Answer 5

This is neither efficient nor pretty, but it will get the job done. It is not the most efficient implementation because it parses file1 multiple times; however, it does not read the entire file into RAM either, so it has some benefits over the simple scripting approaches.

sed -n '2,$p' file1 | awk -F, '{print $1 "," $3 "," $6 "," $7 " " $0 }' | \
sort | join file2 - |awk '{print $2}'

This works as follows:

sed -n '2,$p' file1 sends file1 to STDOUT without the header line.
The first awk command prints the 4 "key fields" from file1 in the same format as they appear in file2, followed by a space, followed by the full file1 line.
The sort command ensures that file1 is in the same order as file2.
The join command joins file2 with standard input (the - argument), writing only the records that have a matching record in file2.
The final awk command prints just the original part of file1.

For this to work, you must ensure that file2 is sorted before running the command.
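To make the stages concrete, here is a sketch that runs the pipeline step by step on a shortened copy of the question's data. The scratch directory and the file2.sorted name are assumptions for the demo, and the header is stripped from file2 before sorting so that only data rows take part in the join.

```shell
# Shortened, deliberately unsorted copy of the question's data (assumed for this demo).
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
number,name,account id,vv,sfee,dac acc,TDID
7003,mike,1,0,0,2,1
8001,nike,1,2,4,1,8
7000,john,2,0,0,1,6
EOF
cat > "$tmp/file2" <<'EOF'
number,account id,dac acc,TDID
7003,1,2,1
7000,2,1,6
EOF

# join needs both inputs ordered on the join field, so sort file2 without its header.
sed -n '2,$p' "$tmp/file2" | sort > "$tmp/file2.sorted"

# Strip the header, prefix each row with its 4-field key and a space, then sort.
keyed=$(sed -n '2,$p' "$tmp/file1" |
    awk -F, '{print $1 "," $3 "," $6 "," $7 " " $0}' | sort)
printf '%s\n' "$keyed"

# Join on the key (the first space-separated field) and keep only the original row.
joined=$(printf '%s\n' "$keyed" | join "$tmp/file2.sorted" - | awk '{print $2}')
printf '%s\n' "$joined"
rm -r "$tmp"
```

Because the key and the original row are separated by a space, this approach relies on the data itself containing no spaces.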

Running this against your example data gave the following result:

7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1

EDIT

I note from your comments that you are getting a sorting error. If this error occurs when sorting file2 before running the pipeline command, you could split the file, sort each part, and then cat the parts back together.

Something like this would do that for you:

mv file2 file2.orig
for i in 0 1 2 3 4 5 6 7 8 9
do
  grep "^${i}" file2.orig |sort > file2.$i
done
cat file2.[0-9] >file2
rm file2.[0-9] file2.orig

You may need to modify the values passed to the for loop if your file is not distributed evenly across the full range of leading digits.

Answer 6

Not really well tested, but this might work:

join -t, file1 file2 | awk -F, 'BEGIN{OFS=","} {if ($3==$8 && $6==$9 && $7==$10) print $1,$2,$3,$4,$5,$6,$7}'

(Of course, this assumes the input files are sorted.)
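A minimal sketch of how the sorting requirement could be met before the join, using a shortened copy of the question's data. The scratch directory and the .sorted file names are assumptions for the demo, and the last file2 row is altered on purpose (TDID 9 instead of 1) to show the field comparison rejecting a row whose number matches but whose other fields do not.

```shell
# Shortened sample data; the 7003 row in file2 is intentionally changed for the demo.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
number,name,account id,vv,sfee,dac acc,TDID
7003,mike,1,0,0,2,1
7000,john,2,0,0,1,6
8001,nike,1,2,4,1,8
EOF
cat > "$tmp/file2" <<'EOF'
number,account id,dac acc,TDID
7003,1,2,9
7000,2,1,6
EOF

# join expects both inputs sorted on the join field, so drop the headers and sort first.
sed -n '2,$p' "$tmp/file1" | sort -t, -k1,1 > "$tmp/file1.sorted"
sed -n '2,$p' "$tmp/file2" | sort -t, -k1,1 > "$tmp/file2.sorted"

# After the join, fields 1-7 come from file1 and fields 8-10 from file2.
joined=$(join -t, "$tmp/file1.sorted" "$tmp/file2.sorted" |
    awk -F, 'BEGIN { OFS = "," }
        $3 == $8 && $6 == $9 && $7 == $10 { print $1, $2, $3, $4, $5, $6, $7 }')
printf '%s\n' "$joined"
rm -r "$tmp"
```

Since join matches on the first field only, the awk filter is what enforces equality of the remaining three key columns.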

Answer 7

The statistical package R handles processing multiple CSV tables really easily. See An Introduction to R or R for Beginners.

7 answers

Answer 1: score 3, accepted, 2010-07-07 07:24:52
Answer 2: score 1, 2010-07-06 13:41:39
Answer 3: score 1, 2010-07-06 14:40:54
Answer 4: score 0, 2010-07-06 12:43:27
Answer 5: score 0, 2010-07-06 18:09:27
Answer 6: score 0, 2010-07-06 18:25:10
Answer 7: score 0, 2011-12-21 22:16:21
