
Joining multiple fields in text files on Unix

How can I do it?

File1 looks like this:

foo 1 scaf 3 
bar 2 scaf 3.3

File2 looks like this:

foo 1 scaf 4.5
foo 1 boo 2.3
bar 2 scaf 1.00

What I want to do is to find lines that co-occur in File1 and File2 when fields 1, 2, and 3 are the same.

Is there a way to do it?

Here is the correct answer (in terms of using standard GNU coreutils tools, rather than writing a custom script in perl/awk or whatever).

$ join -j1 -o1.2,1.3,1.4,1.5,2.5 <(<file1 awk '{print $1"-"$2"-"$3" "$0}' | sort -k1,1) <(<file2 awk '{print $1"-"$2"-"$3" "$0}' | sort -k1,1)
bar 2 scaf 3.3 1.00
foo 1 scaf 3 4.5

OK, how does it work:

  1. First of all we will use a great tool, join, which can merge lines from two files. join has two requirements:

    • We can join only by a single field.
    • Both files must be sorted by the key column!
  2. We need to generate keys in the input files, and for that we use a simple awk script:

     $ cat file1
     foo 1 scaf 3
     bar 2 scaf 3.3
     $ <file1 awk '{print $1"-"$2"-"$3" "$0}'
     foo-1-scaf foo 1 scaf 3
     bar-2-scaf bar 2 scaf 3.3

    You see, we added a 1st column with a key like "foo-1-scaf". We do the same with file2. BTW, <file awk is just a fancy way of writing awk file, or cat file | awk.

    We also should sort our files by the key; in our case this is column 1, so we add | sort -k1,1 (sort by text, from column 1 to column 1) to the end of the command.

  3. At this point we could just generate the files file1.with.key and file2.with.key and join them, but suppose those files are huge; we don't want to copy them over the filesystem. Instead we can use something called bash process substitution to generate the output into a named pipe (this avoids any unnecessary intermediate file creation; for comparison, a temp-file version is sketched after this list). For more info please read the provided link.

    Our target syntax is: join <( some command ) <(some other command)

  4. The last thing is to explain the fancy join arguments: -j1 -o1.2,1.3,1.4,1.5,2.5

    • -j1 - join by the key in the 1st column (of both files)
    • -o - output only the listed fields: 1.2 (1st file, field 2), 1.3 (1st file, field 3), etc.

      This way we joined the lines, but join outputs only the necessary columns.
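For comparison, here is a sketch of the same recipe using the intermediate files mentioned in step 3 (file1.with.key and file2.with.key are scratch files):

# keyed, sorted copies of each input
<file1 awk '{print $1"-"$2"-"$3" "$0}' | sort -k1,1 > file1.with.key
<file2 awk '{print $1"-"$2"-"$3" "$0}' | sort -k1,1 > file2.with.key
# join on the synthetic key and print only the original columns
join -j1 -o1.2,1.3,1.4,1.5,2.5 file1.with.key file2.with.key
rm file1.with.key file2.with.key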

The lessons learned from this post should be:

  • you should master the coreutils package; those tools are very powerful when combined, and you almost never need to write a custom program to deal with such cases,
  • coreutils tools are also blazing fast and heavily tested, so they are always the best choice.

The join command is hard to use and only joins on one column

Extensive experimentation plus close scrutiny of the manual pages indicates that you cannot directly join multiple columns - and, funnily enough, all my working examples of join use just one joining column.

Consequently, any solution will require the columns-to-be-joined to be concatenated into one column, somehow. The standard join command also requires its inputs to be in the correct sorted order - though there's a remark in the GNU join documentation (info coreutils join) about it not always requiring sorted data:

However, as a GNU extension, if the input has no unpairable lines the sort order can be any order that considers two fields to be equal if and only if the sort comparison described above considers them to be equal.

One possible way to do it with the given files is:

awk '{printf("%s:%s:%s %s %s %s %s\n", $1, $2, $3, $1, $2, $3, $4);}' file1 |
sort > sort1
awk '{printf("%s:%s:%s %s %s %s %s\n", $1, $2, $3, $1, $2, $3, $4);}' file2 |
sort > sort2
join -1 1 -2 1 -o 1.2,1.3,1.4,1.5,2.5 sort1 sort2

This creates a composite sort field at the start, using ':' to separate the sub-fields, and then sorts the file - for each of the two files. The join command then joins on the two composite fields, but prints out only the non-composite (non-join) fields.

The output is:

bar 2 scaf 3.3 1.00
foo 1 scaf 3 4.5
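The temporary files sort1 and sort2 can also be avoided entirely via the bash process substitution used in the answer above; a sketch of the same pipeline:

join -1 1 -2 1 -o 1.2,1.3,1.4,1.5,2.5 \
    <(awk '{printf("%s:%s:%s %s %s %s %s\n", $1, $2, $3, $1, $2, $3, $4);}' file1 | sort) \
    <(awk '{printf("%s:%s:%s %s %s %s %s\n", $1, $2, $3, $1, $2, $3, $4);}' file2 | sort)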

Failed attempts to make join do what it won't do

join -1 1 -2 1 -1 2 -2 2 -1 3 -2 3 -o 1.1,1.2,1.3,1.4,2.4 file1 file2

On MacOS X 10.6.3, this gives:

 $ cat file1
 foo 1 scaf 3
 bar 2 scaf 3.3
 $ cat file2
 foo 1 scaf 4.5
 foo 1 boo 2.3
 bar 2 scaf 1.00
 $ join -1 1 -2 1 -1 2 -2 2 -1 3 -2 3 -o 1.1,1.2,1.3,1.4,2.4 file1 file2
 foo 1 scaf 3 4.5
 bar 2 scaf 3.3 4.5
 $

This is joining on field 3 (only) - which is not what is wanted.

You do need to ensure that the input files are in the correct sorted order.
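A quick way to verify the ordering (a sketch; sort -c exits non-zero and reports the first out-of-order line it finds):

sort -c sort1 && sort -c sort2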

It's probably easiest to combine the first three fields with awk:

awk '{print $1 "_" $2 "_" $3 " " $4}' filename

Then you can use join normally on "field 1".
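Putting the two steps together with process substitution, a sketch (the final tr assumes the data itself contains no underscores):

join <(awk '{print $1 "_" $2 "_" $3 " " $4}' file1 | sort) \
     <(awk '{print $1 "_" $2 "_" $3 " " $4}' file2 | sort) |
    tr '_' ' '

With the sample files this should print:

bar 2 scaf 3.3 1.00
foo 1 scaf 3 4.5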

You can try this:

awk '{
 # remember the first three fields, then blank them out of the record
 o1=$1;o2=$2;o3=$3
 $1=$2=$3="";gsub(" +","")
 # append the remaining value(s) under the 3-field key
 _[o1 FS o2 FS o3]=_[o1 FS o2 FS o3] FS $0
}
END{ for(i in _) print i,_[i] }' file1 file2

Output:

$ ./shell.sh
foo 1 scaf  3 4.5
bar 2 scaf  3.3 1.00
foo 1 boo  2.3

If you want to omit lines that are not common to both files:

awk 'FNR==NR{
 # first file (file2): save everything after field 3 under the 3-field key
 s=""
 for(i=4;i<=NF;i++){ s=s FS $i }
 _[$1$2$3] = s
 next
}
{
  # second file (file1): print the line, then any value saved from file2
  printf $1 FS $2 FS $3 FS
  for(o=4;o<NF;o++){
   printf $o" "
  }
  printf $NF FS _[$1$2$3]"\n"
 } ' file2 file1

Output:

$ ./shell.sh
foo 1 scaf 3  4.5
bar 2 scaf 3.3  1.00

How about:

cat file1 file2 |
    awk '{print $1" "$2" "$3}' |
    sort |
    uniq -c |
    grep -v '^ *1 ' |
    awk '{print $2" "$3" "$4}'

This is assuming you're not too worried about the white space between fields (in other words, three tabs and a space is no different from a space and seven tabs). This is usually the case when you're talking about fields within a text file.

What it does is output both files, stripping off the last field (since you don't care about that one in terms of comparisons). It then sorts the result so that similar lines are adjacent, then uniquifies them (replaces each group of adjacent identical lines with one copy and a count).

It then gets rid of all those that had a count of one (no duplicates) and prints each remaining line with the count stripped off. That gives you your "keys" to the duplicate lines, and you can then use another awk iteration to locate those keys in the files if you wish, as sketched below.
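That second pass might look like this (a sketch; keys is a hypothetical scratch file holding the output of the pipeline above):

cat file1 file2 |
    awk '{print $1" "$2" "$3}' |
    sort |
    uniq -c |
    grep -v '^ *1 ' |
    awk '{print $2" "$3" "$4}' > keys
# print every line of either file whose first three fields form a known key
awk 'NR==FNR { k[$0]; next } ($1" "$2" "$3) in k' keys file1 file2
rm keys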

This won't work as expected if two identical keys are only in one file, since the files are combined early on. In other words, if you have duplicate keys in file1 but not in file2, that will be a false positive.

Then, the only real solution I can think of is one which checks file2 for each line in file1, although I'm sure others may come up with cleverer solutions.


And, for those who enjoy a little bit of sado-masochism, here's the afore-mentioned not-overly-efficient solution:

cat file1 |
    sed -e 's/ [^ ]*$/ "/' \
        -e 's/ /  */g' \
        -e 's/^/grep "^/' \
        -e 's/$/ file2 | awk "{print \\$1\\" \\"\\$2\\" \\"\\$3}"/' \
    >xx99
bash xx99
rm xx99

This one constructs a separate script file to do the work. For each line in file1, it creates a line in the script to look for that line in file2. If you want to see how it works, just have a look at xx99 before you delete it.

And, in this one, the spaces do matter, so don't be surprised if it doesn't work for lines where the spacing differs between file1 and file2 (though, as with most "hideous" scripts, that can be fixed with just another link in the pipeline). It's more here as an example of the ghastly things you can create for quick'n'dirty jobs.

This is not what I would do for production-quality code, but it's fine for a once-off, provided you destroy all evidence of it before The Daily WTF finds out about it :-)

Here is a way to do it in Perl:

#!/usr/local/bin/perl
use warnings;
use strict;
open my $file1, "<", "file1" or die $!;
my %file1keys;
# index file1 by its first three fields, remembering line number and text
while (<$file1>) {
    my @keys = split /\s+/, $_;
    next unless @keys;
    $file1keys{$keys[0]}{$keys[1]}{$keys[2]} = [$., $_];
}
close $file1 or die $!;
open my $file2, "<", "file2" or die $!;
# report every file2 line whose first three fields were seen in file1
while (<$file2>) {
    my @keys = split /\s+/, $_;
    next unless @keys;
    if (my $found = $file1keys{$keys[0]}{$keys[1]}{$keys[2]}) {
        print "Keys occur at file1:$found->[0] and file2:$..\n";
    }
}
close $file2 or die $!;
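With the sample files above, this should report something like:

Keys occur at file1:1 and file2:1.
Keys occur at file1:2 and file2:3.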

Simple method (no awk, join, sed, or perl), using the software tools cut, grep, and sort:

cut -d ' ' -f1-3 File1 | grep -h -f - File1 File2 | sort -t ' ' -k 1,2g

Output (does not print unmatched lines):

bar 2 scaf 1.00
bar 2 scaf 3.3
foo 1 scaf 3 
foo 1 scaf 4.5

How it works...

  1. cut makes a list of all the lines to search for.
  2. grep's -f - switch reads those lines from cut (via standard input) as patterns and searches File1 and File2 for them.
  3. sort isn't necessary, but makes the data easier to read.
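For example, the pattern list that cut pipes into grep is just the first three fields of each File1 line:

$ cut -d ' ' -f1-3 File1
foo 1 scaf
bar 2 scaf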

Condensed results with datamash:

cut -d ' ' -f1-3 File1 | grep -h -f - File1 File2 | \
datamash -t ' ' -s -g1,2,3 collapse 4

Output:

bar 2 scaf 3.3,1.00
foo 1 scaf 3,4.5

If File1 is huge and somewhat redundant, adding sort -u should speed things up:

cut -d ' ' -f1-3 File1 | sort -u | grep -h -f - File1 File2 | sort -t ' ' -k 1,2g
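For instance, with a hypothetical File1 that repeats a key, sort -u collapses the duplicate patterns before grep sees them:

$ printf 'foo 1 scaf 3\nfoo 1 scaf 9\nbar 2 scaf 3.3\n' | cut -d ' ' -f1-3 | sort -u
bar 2 scaf
foo 1 scaf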

A professor I used to work with created a set of perl scripts that can perform a lot of database-like operations on column-oriented flat text files. It's called Fsdb. It can definitely do this, and it's especially worth looking into if this isn't just a one-off need (so you're not constantly writing custom scripts).

A similar solution to the one Jonathan Leffler offered.

Create 2 temporary sorted files with a different delimiter, which have the matching columns combined in the first field. Then join the temp files on the first field, and output the second field.

$ cat file1.txt |awk -F" " '{print $1"-"$2"-"$3";"$0}' |sort >file1.tmp
$ cat file2.txt |awk -F" " '{print $1"-"$2"-"$3";"$0}' |sort >file2.tmp

$ join -t';' -o 1.2 file1.tmp file2.tmp >file1.same.txt
$ join -t';' -o 2.2 file1.tmp file2.tmp >file2.same.txt
$ rm -f file1.tmp file2.tmp

$ cat file1.same.txt
bar 2 scaf 3.3
foo 1 scaf 3

$ cat file2.same.txt
bar 2 scaf 1.00
foo 1 scaf 4.5
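If a single combined listing is preferred, both matched lines can also be printed side by side in one pass (a sketch, run before the temp files are removed; the ';' between the two halves comes from the -t delimiter):

$ join -t';' -o 1.2,2.2 file1.tmp file2.tmp
bar 2 scaf 3.3;bar 2 scaf 1.00
foo 1 scaf 3;foo 1 scaf 4.5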

Using datamash's collapse operation, plus a bit of cosmetic sorting and tr-ing:

cat File* | datamash -t ' ' -s -g1,2,3  collapse 4 | 
sort -g -k2 | tr ',' ' '

Output (common lines have a 5th field, uncommon lines do not):

foo 1 boo 2.3
foo 1 scaf 3 4.5
bar 2 scaf 3.3 1.00

The OP doesn't show the expected output, so idk if this is exactly the desired output, but this is the way to approach the problem - read file1 first (NR==FNR), save field 4 under a key built from the first three fields, then append the saved value to every file2 line whose key matches:

$ awk '
    { key=$1 FS $2 FS $3 }
    NR==FNR { val[key]=$4; next }
    key in val {print $0, val[key] }
' file1 file2
foo 1 scaf 4.5 3
bar 2 scaf 1.00 3.3
