Unix join command with dummy entries

I have a poor understanding of the semantics behind the join command. I would like to merge two files through a shell script and add dummy values.

I have two files that I want to merge together. File A has 4 columns, one of which is the KEY column. File B has 60K+ columns, with the very first column being the KEY column.

The keys in the two files overlap at ~80%.

Goal: create File C, which contains every entry from File A plus the matching rows from File B. If an entry in A has no match in B, I'd like the dummy value "0" to be inserted into every missing field (60K+ fields).

Approach:

As a newbie to shell scripting, I figured a simple join would be effective. I first sorted File A and File B by their KEY values, using sort -k# with the appropriate column number.
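As a rough sketch of that sorting step: the sample rows and the .sorted file names below are made up for illustration, and LC_ALL=C is an assumption, chosen so that sort and join agree on collation order. The key is taken to be column 2 in file A and column 1 in file B, matching the join command used here.

```shell
# Work in a scratch directory so nothing real is overwritten.
cd "$(mktemp -d)"

# Hypothetical miniature inputs: key in column 2 of file.A, column 1 of file.B.
printf 'a1 k3 a3 a4\na1 k1 a3 a4\na1 k2 a3 a4\n' > file.A
printf 'k2 1 0\nk1 2 4\n' > file.B

# -kN,N restricts the sort key to field N alone.
LC_ALL=C sort -k2,2 file.A > file.A.sorted
LC_ALL=C sort -k1,1 file.B > file.B.sorted

cat file.A.sorted
```

Writing -k2,2 rather than -k2 matters: a bare -k2 sorts on everything from field 2 to the end of the line, which can order the keys differently from what join expects.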

join -a1 -1 2 -2 1 -e "0" file.A file.B > file.C

Now, how does join see the fields/columns it's looking at? File B has 60k-1 columns that are spaced as:

KEY 1 0     1 1     2 4     0 1 ...  

Now, when I tried my command, file C had the correct number of entries, but I couldn't figure out how to add the missing values. File A has entries that file B does not have, and I'd like to place the null value 0 in every file B column for the file A rows that found no match in file B.

Thus, in file C, the result should be (according to my understanding of join):

KEY A1 A2 A3 A4 1 0     1 1     2 4     0 1 ... 
KEY A1 A2 A3 A4 0 0     0 0     0 0     0 0 ...

The spacing AFTER the join doesn't matter to me, but file B is created with the alternating tab-space-tab-space format.
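On the spacing question, a quick experiment suggests the mixed separators are harmless: when -t is not given, join splits fields on runs of blanks (spaces and tabs alike) and writes its output with single spaces. The two one-line files below (left and right) are invented for illustration.

```shell
cd "$(mktemp -d)"

# left uses plain spaces; right mimics file B's alternating space/tab layout.
printf 'k1 A1 A2\n' > left
printf 'k1 1 0\t1 1\t2 4\n' > right

# No -t option: any run of spaces or tabs separates fields.
joined=$(join left right)
printf '%s\n' "$joined"
```

Every token from the right-hand file comes through as its own field, so each of file B's single-character values counts as one column as far as join is concerned.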

Why isn't join -e "0" adding my dummy values when I asked it to? I would also appreciate any other shell strategies for doing this. I know I could merge in perl by going through the files line by line (or in R, if it didn't take so long to load), but I feel the shell is better equipped for this.
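One possible shell strategy besides join is awk, sketched below on invented sample data. It assumes the keys in file A are unique and that file A fits in memory (it is described as small), while file B is streamed one line at a time. Unlike join it needs no sorted input, but unmatched rows are written at the end, in no particular order, and the output keeps file A's own column order.

```shell
cd "$(mktemp -d)"

# Hypothetical sample data: key in column 2 of file.A, column 1 of file.B.
printf 'a1 k1 a3 a4\nb1 k2 b3 b4\nc1 k3 c3 c4\n' > file.A
printf 'k1 1 0\t1 1\nk3 2 4\t0 1\nk9 7 7\t7 7\n' > file.B

awk '
    # First file (file.A): remember each whole row under its key (column 2).
    NR == FNR { a[$2] = $0; next }

    # Second file (file.B): processed line by line, never held in memory.
    {
        if (NF - 1 > nb) nb = NF - 1   # widest count of non-key fields seen in B
        if ($1 in a) {
            key = $1
            $1 = a[key]                # swap the key for the full row from A
            print                      # fields are re-joined with single spaces
            delete a[key]              # mark this key as matched
        }
    }

    # Rows of A that never matched: pad with nb zeros.
    END {
        for (k in a) {
            row = a[k]
            for (i = 1; i <= nb; i++) row = row " 0"
            print row
        }
    }
' file.A file.B > file.C

cat file.C
```

If the original order of file A matters, the result can be sorted on the key column afterwards.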

EDIT

The data in the file is mixed. The first 5 columns are identifying strings in file A, and in file B there is a key string followed by single alphanumeric characters in each of the additional columns. File A will always be small (no more than 1 MB), but file B can stretch up to 2+ GB.

Attempted R: df <- read.table("file.B", header=FALSE, fill=TRUE)

I read in the join info page:

`-e STRING'
     Replace those output fields that are missing in the input with
     STRING.  I.E. missing fields specified with the `-12jo' options.

I inferred that -o was required. Try this:

join -a1 -1 2 -2 1 -o auto -e "0" file.A file.B > file.C
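To see the difference on a small scale, here is a sketch with made-up three-row and two-row files (already sorted on the key). It assumes GNU coreutils join, since -o auto is a GNU extension; the names with_o.txt and without_o.txt are only for the demonstration.

```shell
cd "$(mktemp -d)"

# Hypothetical inputs: key in column 2 of file.A, column 1 of file.B.
printf 'a1 k1 a3 a4\nb1 k2 b3 b4\nc1 k3 c3 c4\n' > file.A
printf 'k1 1 0 1 1\nk3 2 4 0 1\n' > file.B

# Without -o, the unpaired k2 line is printed as-is and -e has nothing to fill.
join -a1 -1 2 -2 1 -e "0" file.A file.B > without_o.txt

# With -o auto, the field count is taken from the first line of each file,
# and the missing file.B fields of the k2 line are filled with "0".
join -a1 -1 2 -2 1 -o auto -e "0" file.A file.B > with_o.txt

echo "without -o:"; cat without_o.txt
echo "with -o auto:"; cat with_o.txt
```

Note that join prints the key first, followed by the remaining fields of file A and then those of file B, so the key column moves to the front. Because -o auto sizes the output from the first line of each file, it relies on every line of file B having the same number of fields.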
