Unix join command with dummy entries

Question

I have a poor understanding of the semantics behind the join command. I would like to merge through a shell script and add dummy values.

I have two files that I want to merge together. File A has 4 columns with a KEY column. File B has 60K+ columns, with the very first column being the KEY column.

The keys in the two files overlap at ~80%.

Goal: create File C, which is every entry from File A and the matching rows from File B. If an entry in A has no match in B, I'd like the dummy value "0" to be inserted into every missing field (60K+ fields).

Approach:

As a newbie to shell scripting, I figured a simple join would be effective. I sorted File A and B by the KEY value first using sort -k# appropriately.

join -a1 -1 2 -2 1 -e "0" file.A file.B > file.C

Now, how does join see the fields/columns it's looking at? File B has 60k-1 columns that are spaced as:

KEY 1 0     1 1     2 4     0 1 ...

Now, when I tried my command, file C has the correct number of entries, but I couldn't figure out how to add the missing values. File A has entries that file B does not have, and I'd like to place the null value 0 in every column that was not matched in file A by file B.

Thus, in file C, the result should be (according to my understanding of join):

KEY A1 A2 A3 A4 1 0     1 1     2 4     0 1 ... 
KEY A1 A2 A3 A4 0 0     0 0     0 0     0 0 ...

The spacing AFTER the joining doesn't matter to me, but file B is created with the alternating tab-space-tab-space format.

Why isn't join -e "0" adding in my dummy values when I asked it to? I would also appreciate any other shell strategies to do this. I know I can merge in perl by running through it line by line (or R if it didn't take so long to load), but I feel shell is more powerfully equipped for this.

EDIT

The data in the file is mixed. The first 5 columns are identifying strings in file A, and in file B there is a key string and single alphanumeric characters in each of the additional columns. File A will always be small (no more than 1 MB), but file B can stretch up to 2+ GB.

Attempted R: df <- read.table("file.B", header=FALSE, fill=TRUE)

Answer 1

I read in the join info page:

`-e STRING'
     Replace those output fields that are missing in the input with
     STRING.  I.E. missing fields specified with the `-12jo' options.
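
The behaviour described in that excerpt can be reproduced on a toy input. The sketch below assumes GNU join; the two small files, their names under a temporary directory, and their contents are invented for illustration. With `-e` but no `-o`, the unpaired line from the first file is printed with nothing appended for the second file.

```shell
#!/bin/sh
# Toy reproduction; file names and contents are made up for illustration.
workdir=$(mktemp -d)

# file.A: key in column 2, already sorted by key.
printf '%s\n' 'a1 k1 x1 y1' 'a2 k2 x2 y2' 'a3 k3 x3 y3' > "$workdir/file.A"

# file.B: key in column 1, then alternating tab/space separators; k2 is absent.
printf 'k1\t1 0\t1 1\nk3\t2 4\t0 1\n' > "$workdir/file.B"

# -e alone: no output format is given, so the unpaired k2 line gets no padding.
without_o=$(join -a1 -1 2 -2 1 -e "0" "$workdir/file.A" "$workdir/file.B")
printf '%s\n' "$without_o"
```

With GNU join the k2 line comes out as `k2 a2 x2 y2`, four fields instead of the eight on the paired lines.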

I inferred that -o was required. Try this:

join -a1 -1 2 -2 1 -o auto -e "0" file.A file.B > file.C
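
A minimal end-to-end sketch of that command, assuming GNU coreutils: `-o auto` is a GNU extension that takes the number of output fields from the first line of each file. The toy files and their contents are hypothetical, and `LC_ALL=C` is set here only so that sort and join agree on the ordering.

```shell
#!/bin/sh
# Toy data; names and contents are assumptions made for illustration.
export LC_ALL=C            # same collation for sort and join
workdir=$(mktemp -d)

# Unsorted inputs: A has its key in column 2, B in column 1 (k2 is missing in B).
printf '%s\n' 'a3 k3 x3 y3' 'a1 k1 x1 y1' 'a2 k2 x2 y2' > "$workdir/A.unsorted"
printf 'k3\t2 4\t0 1\nk1\t1 0\t1 1\n' > "$workdir/B.unsorted"

# Sort each file on its own key column before joining.
sort -k2,2 "$workdir/A.unsorted" > "$workdir/file.A"
sort -k1,1 "$workdir/B.unsorted" > "$workdir/file.B"

# -o auto fixes the output width, so -e "0" has fields to fill for k2.
join -a1 -1 2 -2 1 -o auto -e "0" \
    "$workdir/file.A" "$workdir/file.B" > "$workdir/file.C"
cat "$workdir/file.C"
```

For the key that is missing from file.B, the output line is `k2 a2 x2 y2 0 0 0 0`. The join field is printed first, so file.A's columns appear reordered relative to the input.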

Solution 1
1 2013-07-30 16:28:09
