Inner join two files based on one column in Unix when row names don't match with sort

Question

I'm running into an issue trying to join two datasets in Unix and could use your help. I've spent a long time searching the forum for a solution but turned up empty-handed.

I have a list of accession numbers in one dataset and need to convert these to gene symbols. To do so, I downloaded gene2accession.gz from NCBI. The uncompressed file is ~7 GB, so first I cut out the accession and gene symbol columns from this dataset:

cut -f 2,16 gene2accession > accession2genesymbol

There are ~70 million lines as per wc -l accession2genesymbol, with many duplicates, so I removed these with sort accession2genesymbol | uniq, which resulted in ~20 million lines.

Normally I would do an inner_join() using the dplyr package in R (return all rows from x where there are matching values in y, and all columns from x and y); however, this dataset is far too large for me to work with.

Here is a sample of the unsorted accession2genesymbol dataset:

100000492       mafaa
1000004         XCC3444
110047139       LOC110047139
110047140       LOC110047140
9951915         LOAG_14435
9951916         LOAG_14436
999999          gndA
999             CDH1
9               NAT1

A short example of the unsorted Accessions looks like this (for the whole dataset, 1,576 lines, see the gist):

EDIT: Only accessions 9951915 and 110047140 have matches here, so my expected output would be:

9951915         LOAG_14435
110047140       LOC110047140

Not having worked much with Unix for data manipulation or joining, I searched Stack Overflow for similar problems.

For example, this one. It's my understanding that the Unix join command can only be used if the files are sorted, so I tried the following:

join -t "\t" <(dos2unix <accession) <(dos2unix <accession2genesymbol.txt)

Perhaps this is not working because I would need exactly the same row numbers in both datasets (i.e. if row 100 of dataset doesn't match row 100 of dataset2 it won't work), but perhaps I'm wrong and this didn't work for some other reason?

Perhaps awk is a better solution, so I tried a suggestion from this post:

awk '{a[$1]=a[$1] FS $2} END {for (i in a) print i a[i]}' accession accession2genesymbol | sort > file3

This produces a file with ~20 million lines, and since my accession file is only 9000 lines I would expect 9000 (or potentially fewer if some of those accessions no longer exist, for example).

I tried another awk solution from the first post:

awk -F, 'FNR==NR{a[$1];next}($1 in a){print $2}' accession accession2genesymbol > file3
awk: warning: escape sequence `\+' treated as plain `+'

But I end up with an empty file.

I'd appreciate an awk(ward) solution, Python, or whatever would help me solve this problem. Thank you very much.

Answer 1

join should work for your case. Since your input files don't have matches, here is a made-up example using your map file:

$ head file
100000009
100000061
100000030

$ join <(sed 1d map) <(sort file)
100000009 sema5bb+
100000030 btr24+
100000061 si:ch211-133n4.9+

Assuming your map file is already sorted, you need to remove the header (sed 1d) and sort your input file. Note that both files must be sorted the same way; join expects its inputs in the lexical (collation) order that plain sort produces, not the numeric order of sort -n.
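As a rough illustration of the join approach, here is a minimal sketch built on the sample rows from the question. The file names (accession.txt, map.tsv), the temporary directory, and the unmatched accession 12345 are invented for the example, and it assumes a tab-separated map without a header line. Setting LC_ALL=C is one way to make sort and join agree on the ordering.

```shell
# Sketch only: file names and the 12345 accession are invented for illustration.
export LC_ALL=C                 # make sort and join agree on byte-wise ordering
tab=$(printf '\t')              # a literal tab to pass to -t
tmp=$(mktemp -d)

# Two-column, tab-separated map (accession -> gene symbol), unsorted.
printf '%s\t%s\n' \
  100000492 mafaa \
  110047140 LOC110047140 \
  9951915 LOAG_14435 \
  999 CDH1 > "$tmp/map.tsv"

# Accessions to look up, unsorted; 12345 has no match and is dropped.
printf '%s\n' 9951915 12345 110047140 > "$tmp/accession.txt"

# join needs both inputs sorted lexically on the join field.
sort "$tmp/accession.txt" > "$tmp/accession.sorted"
sort -t "$tab" -k1,1 "$tmp/map.tsv" > "$tmp/map.sorted"

# Inner join on column 1 of both files, using tab as the field separator.
result=$(join -t "$tab" "$tmp/accession.sorted" "$tmp/map.sorted")
printf '%s\n' "$result"
rm -rf "$tmp"
```

The row order of the two files does not have to line up; join only needs each file sorted on the join field. Passing a real tab to -t matters as well, since "\t" in double quotes reaches join as a backslash followed by t.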

Another alternative, which doesn't require sorting, is grep:

$ grep -wFf file map
100000009       sema5bb+
100000030       btr24+
100000061       si:ch211-133n4.9+

As long as the accession numbers and gene symbols don't share a format (so a symbol can never equal an accession), there won't be false matches.
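A small sketch of the grep approach on the sample rows from the question, with hypothetical file names (accession.txt, map.tsv) and an invented accession 99 added to show the effect of -w. Here -F treats each pattern as a fixed string, -f reads the patterns from a file, and -w only accepts whole-word matches.

```shell
# Sketch only: file names and the accession 99 are invented for illustration.
tmp=$(mktemp -d)

# Tab-separated map (accession -> gene symbol); no sorting needed for grep.
printf '%s\t%s\n' \
  100000492 mafaa \
  110047140 LOC110047140 \
  9951915 LOAG_14435 \
  999 CDH1 > "$tmp/map.tsv"

# Accessions to look up; 99 must not match 999 or 9951915.
printf '%s\n' 9951915 110047140 99 > "$tmp/accession.txt"

# Print every map line containing one of the accessions as a whole word.
result=$(grep -wFf "$tmp/accession.txt" "$tmp/map.tsv")
printf '%s\n' "$result"
rm -rf "$tmp"
```

The output follows the order of the map file, not the order of the accession list. Patterns are matched anywhere on the line, not just in the first column, which is why the approach relies on symbols never looking like accessions.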

Answer 2

We haven't seen a sample of your original gene2accession file yet, but let's assume it's a tab-separated file with the accession in the 2nd column and the gene in the 16th (since that's what your cut is selecting), with a header line. Let's also assume that your Accessions file isn't absolutely enormous.

Given that, this script should do what you want:

awk -F'\t' 'NR==FNR{a[$1];next} ($2 in a) && !seen[$2]++{print $2, $16}' Accessions gene2accession
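To see how the two-file idiom behaves, here is a sketch on a toy stand-in for gene2accession. The toy file has only three columns (a made-up tax id, the accession, the symbol), so the symbol is read from $3 here instead of $16; the file names, tax ids, and the array name want are assumptions for illustration. OFS is set to a tab so the output stays tab-separated, which the one-liner above does not do.

```shell
# Sketch only: a three-column toy file stands in for the 16-column original.
tmp=$(mktemp -d)

# Header line plus rows of: tax_id, accession, symbol (tax ids are made up).
printf '%s\t%s\t%s\n' \
  tax_id accession symbol \
  7955 100000492 mafaa \
  6334 9951915 LOAG_14435 \
  6334 9951915 LOAG_14435 \
  6334 110047140 LOC110047140 > "$tmp/gene2accession.tsv"

# Accessions to look up, one per line.
printf '%s\n' 9951915 110047140 > "$tmp/Accessions"

result=$(awk -F'\t' -v OFS='\t' '
  NR==FNR { want[$1]; next }       # first file: remember the wanted accessions
  ($2 in want) && !seen[$2]++ {    # second file: wanted and not printed before
    print $2, $3                   # the real file would use $16 for the symbol
  }' "$tmp/Accessions" "$tmp/gene2accession.tsv")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Only the small Accessions list is held in memory; the large file is streamed once, with no sorting, and the duplicated 9951915 row is printed a single time.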

But you could try this to see if it's faster:

awk -F'\t' 'NR==FNR{a[$1];next} $2 in a{print $2, $16}' Accessions <(sort -u -t$'\t' -k2,2 gene2accession)

If it is, and you want an intermediate file holding the output of the sort to use in subsequent runs:

sort -u -t$'\t' -k2,2 gene2accession > unq_gene2accession &&
awk -F'\t' 'NR==FNR{a[$1];next} $2 in a{print $2, $16}' Accessions unq_gene2accession
