Easiest way to join two files from the unix command line, inserting zero entries for missing keys

I'm trying to join two files, each of which contains rows of the form <key> <count>. Each file contains a few lines that are missing from the other, and I would like to have zero inserted for all such values rather than omitting these lines (I've seen -a, but this isn't quite what I'm looking for). Is there a simple way to accomplish this?

Here is some sample input:

a.txt

apple 5
banana 7

b.txt

apple 6
cherry 4

Expected output:

apple 5 6
banana 7 0
cherry 0 4

join -o 0,1.2,2.2 -e 0 -a1 -a2 a.txt b.txt
  • -o 0,1.2,2.2 → output the join field, then the 2nd field of the 1st file, then the 2nd field of the 2nd file.
  • -e 0 → output 0 in place of empty (missing) fields.
  • -a1 -a2 → also show unpaired lines from file 1 and file 2.
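
One caveat worth spelling out: join expects both inputs to be sorted on the join field. The sample files already are, but for arbitrary input a wrapper along these lines may help. This is only a sketch: join_counts is a hypothetical helper name, and it assumes mktemp is available for the temporary files.

```shell
# join_counts FILE1 FILE2
# Hypothetical helper (not part of join): sorts both inputs on the key,
# then runs the join command from the answer above.
join_counts() {
    sorted_a=$(mktemp)
    sorted_b=$(mktemp)
    # LC_ALL=C makes sort and join agree on the ordering;
    # -k 1b,1 sorts on the first field, ignoring leading blanks.
    LC_ALL=C sort -k 1b,1 "$1" > "$sorted_a"
    LC_ALL=C sort -k 1b,1 "$2" > "$sorted_b"
    LC_ALL=C join -o 0,1.2,2.2 -e 0 -a 1 -a 2 "$sorted_a" "$sorted_b"
    rm -f "$sorted_a" "$sorted_b"
}

# Usage: join_counts a.txt b.txt
```

Forcing the C locale on both commands is a design choice to avoid the "input is not in sorted order" complaints that can appear when sort and join run under different collation rules.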

Write a script, in whatever language you want. You will parse both files using a map/hashtable/dictionary data structure (let's just say dictionary). Each dictionary will have the first word as the key and the count (or even a string of counts) as the value. Here is some pseudocode of the algorithm:

Dict fileA, fileB; // Already parsed
while (!fileA.isEmpty()) {
    string check = fileA.top().key();
    int val1 = fileA.top().value();
    if (fileB.contains(check)) {
        // Key is in both files: print both counts
        printToFile(check + " " + val1 + " " + fileB.getValue(check));
        fileB.remove(check);
    }
    else {
        // Key is only in fileA
        printToFile(check + " " + val1 + " 0");
    }
    fileA.pop();
}
while (!fileB.isEmpty()) { // Remaining keys are known not to exist in fileA
    string check = fileB.top().key();
    int val1 = fileB.top().value();
    printToFile(check + " 0 " + val1);
    fileB.pop();
}

You can use any type of iterator to go through the data structure instead of pop and top. Obviously you may need to access the data a different way depending on what language/data structure you need to use.
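
As one possible concrete version of the pseudocode above, here is a sketch using awk's associative arrays as the dictionaries. merge_counts is a hypothetical helper name; the sketch assumes whitespace-separated <key> <count> lines and, if a key repeats within one file, keeps the last count seen.

```shell
# merge_counts FILE1 FILE2
# Hypothetical helper: loads each file into an awk associative array,
# then prints "key count1 count2" with 0 for keys missing on one side.
merge_counts() {
    awk '
        NF < 2 { next }                           # skip blank or short lines
        FILENAME == ARGV[1] { a[$1] = $2; next }  # first file: key -> count
        { b[$1] = $2 }                            # second file: key -> count
        END {
            for (k in a) {
                if (k in b) {
                    print k, a[k], b[k]
                    delete b[k]                   # handled, drop it from b
                } else {
                    print k, a[k], 0
                }
            }
            # Whatever is left in b has no match in the first file.
            for (k in b) print k, 0, b[k]
        }
    ' "$1" "$2" | LC_ALL=C sort
}

# Usage: merge_counts a.txt b.txt
```

Unlike join, this does not need sorted input. The final sort is there only because awk's for-in traversal order is unspecified.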

@ninjalj's answer is much saner, but here's a shell script implementation just for fun:

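# Merges two files that are already sorted by key, one line at a time.
# Needs bash: '<' inside [ ] is a bash extension.
exec 8< a.txt
exec 9< b.txt

while true; do
   # Refill whichever side was consumed on the previous pass.
   if [ -z "$k1" ]; then
    read -r k1 v1 <&8
   fi
   if [ -z "$k2" ]; then
    read -r k2 v2 <&9
   fi
   if [ -z "$k1$k2" ]; then break; fi   # both files exhausted
   if [ "$k1" = "$k2" ]; then
    echo "$k1 $v1 $v2"
    k1=
    k2=
   elif [ -z "$k2" ] || { [ -n "$k1" ] && [ "$k1" '<' "$k2" ]; }; then
    # Key only in a.txt, or b.txt has run out.
    echo "$k1 $v1 0"
    k1=
   else
    # Key only in b.txt, or a.txt has run out.
    echo "$k2 0 $v2"
    k2=
   fi
done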
