Joining multiple fields in text files on Unix
How can I do it?
File1 looks like this:
foo 1 scaf 3
bar 2 scaf 3.3
File2 looks like this:
foo 1 scaf 4.5
foo 1 boo 2.3
bar 2 scaf 1.00
What I want to do is to find lines that co-occur in File1 and File2 when fields 1, 2, and 3 are the same.
Is there a way to do it?
Here is the correct answer (in terms of using standard GNU coreutils tools, and not writing a custom script in perl/awk/you name it).
$ join -j1 -o1.2,1.3,1.4,1.5,2.5 <(<file1 awk '{print $1"-"$2"-"$3" "$0}' | sort -k1,1) <(<file2 awk '{print $1"-"$2"-"$3" "$0}' | sort -k1,1)
bar 2 scaf 3.3 1.00
foo 1 scaf 3 4.5
OK, how does it work?
First of all we will use a great tool, join, which can merge two lines. join has two requirements: the lines must share a common key field to join on, and the input must be sorted by that key.
We need to generate keys in the input files, and for that we use a simple awk script:
$ cat file1
foo 1 scaf 3
bar 2 scaf 3.3
$ <file1 awk '{print $1"-"$2"-"$3" "$0}'
foo-1-scaf foo 1 scaf 3
bar-2-scaf bar 2 scaf 3.3
You see, we added a 1st column with a key like "foo-1-scaf". We do the same with file2. BTW, <file awk is just a fancy way of writing awk file, or cat file | awk.
We also should sort our files by the key; in our case this is column 1, so we add | sort -k1,1 (sort by text from column 1 to column 1) to the end of the command.
At this point we could just generate the files file1.with.key and file2.with.key and join them, but suppose those files are huge; we don't want to copy them over the filesystem. Instead we can use bash process substitution to generate output into a named pipe (this avoids any unnecessary intermediate file creation). For more info please read the provided link.
Our target syntax is: join <( some command ) <( some other command )
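As a minimal standalone illustration of process substitution before applying it to join (using paste just to show the shape; the data here is made up):

```shell
# Each <( ... ) is presented to the command as a readable file
# (a /dev/fd entry backed by a pipe), so no temporary files are created.
paste <(printf 'a\nb\n') <(printf '1\n2\n')
```

Both substitutions run concurrently, and paste sees them as two ordinary file arguments.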
The last thing is to explain the fancy join arguments: -j1 -o1.2,1.3,1.4,1.5,2.5
-j1 - join by the key in the 1st column (in both files)
-o - output only the listed fields: 1.2 (1st file, field 2), 1.3 (1st file, field 3), and so on.
This way we joined the lines, but join outputs only the necessary columns.
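A tiny self-contained sketch of those two flags on one-line inputs (the keys and values here are invented):

```shell
# -j1: join on field 1 of both files; -o 1.2,2.2: print only field 2
# of each file, dropping the join key itself from the output.
join -j1 -o 1.2,2.2 <(printf 'k1 left\n') <(printf 'k1 right\n')
```

This should print "left right": the shared key k1 pairs the two lines, and -o selects which fields survive.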
The lessons learned from this post should be:
Extensive experimentation plus close scrutiny of the manual pages indicates that you cannot directly join multiple columns - and all my working examples of join, funnily enough, use just one joining column.
Consequently, any solution will require the columns-to-be-joined to be concatenated into one column, somehow. The standard join command also requires its inputs to be in the correct sorted order - there's a remark in GNU join (info coreutils join) about it not always requiring sorted data:
However, as a GNU extension, if the input has no unpairable lines the sort order can be any order that considers two fields to be equal if and only if the sort comparison described above considers them to be equal.
One possible way to do it with the given files is:
awk '{printf("%s:%s:%s %s %s %s %s\n", $1, $2, $3, $1, $2, $3, $4);}' file1 |
sort > sort1
awk '{printf("%s:%s:%s %s %s %s %s\n", $1, $2, $3, $1, $2, $3, $4);}' file2 |
sort > sort2
join -1 1 -2 1 -o 1.2,1.3,1.4,1.5,2.5 sort1 sort2
This creates a composite sort field at the start, using ':' to separate the sub-fields, and then sorts the file - for each of the two files. The join command then joins on the two composite fields, but prints out only the non-composite (non-join) fields.
The output is:
bar 2 scaf 3.3 1.00
foo 1 scaf 3 4.5
join -1 1 -2 1 -1 2 -2 2 -1 3 -2 3 -o 1.1,1.2,1.3,1.4,2.4 file1 file2
On MacOS X 10.6.3, this gives:
$ cat file1
foo 1 scaf 3
bar 2 scaf 3.3
$ cat file2
foo 1 scaf 4.5
foo 1 boo 2.3
bar 2 scaf 1.00
$ join -1 1 -2 1 -1 2 -2 2 -1 3 -2 3 -o 1.1,1.2,1.3,1.4,2.4 file1 file2
foo 1 scaf 3 4.5
bar 2 scaf 3.3 4.5
$
This is joining on field 3 (only) - which is not what is wanted.
You do need to ensure that the input files are in the correct sorted order.
It's probably easiest to combine the first three fields with awk:
awk '{print $1 "_" $2 "_" $3 " " $4}' filename
Then you can use join normally on "field 1".
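A sketch of the whole pipeline under that approach, assuming space-separated input, that '_' never occurs in the key fields, and that file1 and file2 are the sample files from the question:

```shell
# Glue fields 1-3 into one key per file, sort by it, join on it
# (-o 0 is the key itself), then turn the '_' separators back into spaces.
join -o 0,1.2,2.2 \
  <(awk '{print $1 "_" $2 "_" $3, $4}' file1 | sort -k1,1) \
  <(awk '{print $1 "_" $2 "_" $3, $4}' file2 | sort -k1,1) |
  tr '_' ' '
```

With the sample data this should print each common line followed by file2's value.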
You can try this:
awk '{
o1=$1;o2=$2;o3=$3
$1=$2=$3="";gsub(" +","")
_[o1 FS o2 FS o3]=_[o1 FS o2 FS o3] FS $0
}
END{ for(i in _) print i,_[i] }' file1 file2
Output:
$ ./shell.sh
foo 1 scaf 3 4.5
bar 2 scaf 3.3 1.00
foo 1 boo 2.3
If you want to omit uncommon lines:
awk 'FNR==NR{
s=""
for(i=4;i<=NF;i++){ s=s FS $i }
_[$1$2$3] = s
next
}
{
printf $1 FS $2 FS $3 FS
for(o=4;o<NF;o++){
printf $o" "
}
printf $NF FS _[$1$2$3]"\n"
} ' file2 file1
Output:
$ ./shell.sh
foo 1 scaf 3 4.5
bar 2 scaf 3.3 1.00
How about:
cat file1 file2 |
    awk '{print $1" "$2" "$3}' |
    sort |
    uniq -c |
    grep -v '^ *1 ' |
    awk '{print $2" "$3" "$4}'
This is assuming you're not too worried about the white space between fields (in other words, three tabs and a space is no different to a space and 7 tabs). This is usually the case when you're talking about fields within a text file.
What it does is output both files, stripping off the last field (since you don't care about that one in terms of comparisons). It then sorts that so that similar lines are adjacent, then uniquifies them (replaces each group of adjacent identical lines with one copy and a count).
It then gets rid of all those that had a count of one (no duplicates) and prints each out with the count stripped off. That gives you your "keys" to the duplicate lines, and you can then use another awk iteration to locate those keys in the files if you wish.
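A sketch of that second awk pass, assuming the keys from the pipeline above are saved to a hypothetical file named keys (one field-1/2/3 triple per line), and that file1 and file2 are the sample files:

```shell
# First pass: keys whose first three fields occur more than once overall
cat file1 file2 | awk '{print $1" "$2" "$3}' | sort | uniq -c |
    grep -v '^ *1 ' | awk '{print $2" "$3" "$4}' > keys

# Second pass: print every original line whose first three fields match
# a collected key. NR==FNR is true only while reading the first argument
# (the keys file), which is when we fill the lookup array.
awk 'NR==FNR { seen[$1" "$2" "$3]; next }
     ($1" "$2" "$3) in seen' keys file1 file2
```

This prints matching lines from both files in their original order, with the same false-positive caveat described below.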
This won't work as expected if two identical keys are only in one file, since the files are combined early on. In other words, if you have duplicate keys in file1 but not in file2, that will be a false positive.
Then, the only real solution I can think of is one which checks file2 for each line in file1, although I'm sure others may come up with cleverer solutions.
And, for those who enjoy a little bit of sado-masochism, here's the afore-mentioned not-overly-efficient solution:
cat file1 |
    sed -e 's/ [^ ]*$/ "/' \
        -e 's/ / */g' \
        -e 's/^/grep "^/' \
        -e 's/$/ file2 | awk "{print \\$1\\" \\"\\$2\\" \\"\\$3}"/' \
    >xx99
bash xx99
rm xx99
This one constructs a separate script file to do the work. For each line in file1, it creates a line in the script to look for that line in file2. If you want to see how it works, just have a look at xx99 before you delete it.
And, in this one, the spaces do matter, so don't be surprised if it doesn't work for lines where the spaces differ between file1 and file2 (though, as with most "hideous" scripts, that can be fixed with just another link in the pipeline). It's more here as an example of the ghastly things you can create for quick'n'dirty jobs.
This is not what I would do for production-quality code, but it's fine for a once-off, provided you destroy all evidence of it before The Daily WTF finds out about it :-)
Here is a way to do it in Perl:
#!/usr/local/bin/perl
use warnings;
use strict;

open my $file1, "<", "file1" or die $!;
my %file1keys;
while (<$file1>) {
    my @keys = split /\s+/, $_;
    next unless @keys;
    $file1keys{$keys[0]}{$keys[1]}{$keys[2]} = [$., $_];
}
close $file1 or die $!;

open my $file2, "<", "file2" or die $!;
while (<$file2>) {
    my @keys = split /\s+/, $_;
    next unless @keys;
    if (my $found = $file1keys{$keys[0]}{$keys[1]}{$keys[2]}) {
        print "Keys occur at file1:$found->[0] and file2:$..\n";
    }
}
close $file2 or die $!;
Simple method (no awk, join, sed, or perl), using the software tools cut, grep, and sort:
cut -d ' ' -f1-3 File1 | grep -h -f - File1 File2 | sort -t ' ' -k 1,2g
Output (does not print unmatched lines):
bar 2 scaf 1.00
bar 2 scaf 3.3
foo 1 scaf 3
foo 1 scaf 4.5
How it works...
cut makes a list of all the lines to search for. grep's -f - switch reads the lines from cut and searches File1 and File2 for them. sort isn't necessary, but makes the data easier to read.
Condensed results with datamash:
cut -d ' ' -f1-3 File1 | grep -h -f - File1 File2 | \
datamash -t ' ' -s -g1,2,3 collapse 4
Output:
bar 2 scaf 3.3,1.00
foo 1 scaf 3,4.5
If File1 is huge and somewhat redundant, adding sort -u should speed things up:
cut -d ' ' -f1-3 File1 | sort -u | grep -h -f - File1 File2 | sort -t ' ' -k 1,2g
A professor I used to work with created a set of perl scripts that can perform a lot of database-like operations on column-oriented flat text files. It's called Fsdb. It can definitely do this, and it's especially worth looking into if this isn't just a one-off need (so you're not constantly writing custom scripts).
A similar solution to the one Jonathan Leffler offered.
Create 2 temporary sorted files with a different delimiter, with the matching columns combined in the first field. Then join the temp files on the first field, and output the second field.
$ cat file1.txt |awk -F" " '{print $1"-"$2"-"$3";"$0}' |sort >file1.tmp
$ cat file2.txt |awk -F" " '{print $1"-"$2"-"$3";"$0}' |sort >file2.tmp
$ join -t';' -o 1.2 file1.tmp file2.tmp >file1.same.txt
$ join -t';' -o 2.2 file1.tmp file2.tmp >file2.same.txt
$ rm -f file1.tmp file2.tmp
$ cat file1.same.txt
bar 2 scaf 3.3
foo 1 scaf 3
$ cat file2.same.txt
bar 2 scaf 1.00
foo 1 scaf 4.5
Using datamash's collapse operation, plus a bit of cosmetic sort-ing and tr-ing:
cat File* | datamash -t ' ' -s -g1,2,3 collapse 4 |
sort -g -k2 | tr ',' ' '
Output (common lines have a 5th field, uncommon lines do not):
foo 1 boo 2.3
foo 1 scaf 3 4.5
bar 2 scaf 3.3 1.00
The OP doesn't show the expected output, so idk if this is exactly the desired output, but this is the way to approach the problem:
$ awk '
{ key=$1 FS $2 FS $3 }
NR==FNR { val[key]=$4; next }
key in val {print $0, val[key] }
' file1 file2
foo 1 scaf 4.5 3
bar 2 scaf 1.00 3.3
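If the output should instead keep file1's lines (as in the join-based answers above), the same idiom works with the file arguments swapped. A sketch, assuming file1 and file2 are the sample files and file2's keys are unique:

```shell
# Read file2 first to build a lookup table keyed on fields 1-3, then
# print each matching file1 line with file2's 4th field appended.
awk '
    { key = $1 FS $2 FS $3 }
    NR==FNR { val[key] = $4; next }
    key in val { print $0, val[key] }
' file2 file1
```

With the sample data this should print "foo 1 scaf 3 4.5" and "bar 2 scaf 3.3 1.00".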