Inner join two files based on one column in unix when row names don't match with sort
I'm running into an issue trying to join two datasets in unix and could use your help. I've spent a long time searching the forum for a solution but turned up empty-handed.
I have a list of accession numbers in one dataset and need to convert these to gene symbols. To do so I downloaded gene2accession.gz from NCBI. The uncompressed file is ~7 GB, so first I cut out the accession and gene symbol columns from this dataset:
cut -f 2,16 gene2accession > accession2genesymbol
There are ~70 million lines as per wc -l accession2genesymbol, with many duplicates, so I removed these with
sort accession2genesymbol | uniq
which resulted in ~20 million lines.
Now normally I would do an inner_join() using the dplyr package in R (return all rows from x where there are matching values in y, and all columns from x and y); however, this dataset is far too large for me to work with.
Here is a sample of the unsorted accession2genesymbol dataset:
100000492 mafaa
1000004 XCC3444
110047139 LOC110047139
110047140 LOC110047140
9951915 LOAG_14435
9951916 LOAG_14436
999999 gndA
999 CDH1
9 NAT1
A short example of the unsorted Accessions file looks like this (for the whole dataset, 1,576 lines, see the gist):
Accessions
10047140
100913206
10092617
10190704
10190704
103471987
103471997
103472005
103472005
105990514
45006951
45006986
45006986
45007007
45007007
4501883
4501887
94721250
94721261
9558733
9845516
98986457
98986457
98986464
99028871
9910242
9951915
9966805
9966827
9966867
9994185
EDIT: Only accessions 9951915 and 110047140 here have matches, so my expected output would be:
9951915 LOAG_14435
110047140 LOC110047140
Not having worked with unix much for data manipulation/joining, I searched Stack Overflow for similar problems, for example this one. It's my understanding that the unix join command can only be used if the files are sorted, so I tried the following:
join -t "\t" <(dos2unix <accession) <(dos2unix <accession2genesymbol.txt)
Perhaps this is not working because I would need exactly the same row numbers in both datasets (i.e. if row 100 of dataset1 doesn't match row 100 of dataset2 it won't work), but perhaps I'm wrong and this didn't work for some other reason?
Perhaps awk is a better solution, so I tried a suggestion from this post:
awk '{a[$1]=a[$1] FS $2} END {for (i in a) print i a[i]}' accession accession2genesymbol | sort > file3
This produces a file with ~20 million lines, and since my accession file is only 9000 lines I would expect 9000 (or potentially fewer if, for example, some of those accessions no longer exist).
I tried another awk solution from the first post:
awk -F, 'FNR==NR{a[$1];next}($1 in a){print $2}' accession accession2genesymbol > file3
awk: warning: escape sequence `\+' treated as plain `+'
But I end up with an empty file.
I'd appreciate an awk(ward) solution, python, or whatever would help me solve this problem. Thank you very much.
join should work for your case. Since your input files don't have matches, here is a made-up example using your map file:
$ head file
100000009
100000061
100000030
$ join <(sed 1d map) <(sort file)
100000009 sema5bb+
100000030 btr24+
100000061 si:ch211-133n4.9+
Assuming your map file is already sorted, you need to remove the header (sed 1d) and sort your input file. Note that both files must be sorted in the same way, either both numerically or both lexically; join compares keys as strings, so a plain lexical sort is the safe choice.
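As a rough sketch of the join approach on the data from the question (the temporary file names and the cut-down sample rows are made up for illustration, and the map is assumed to be tab-separated with no header), both files are sorted lexically under the same locale before joining:

```shell
# Hypothetical sample files built from rows shown in the question.
tmp=$(mktemp -d)
tab=$(printf '\t')

# Two-column map: accession <TAB> gene symbol (no header line here).
printf '%s\t%s\n' \
    100000492 mafaa \
    110047140 LOC110047140 \
    9951915 LOAG_14435 \
    999 CDH1 > "$tmp/map"

# One accession per line, with a header line and a duplicate entry.
printf '%s\n' Accessions 9951915 110047140 110047140 4501883 > "$tmp/acc"

# Sort both inputs lexically; LC_ALL=C keeps sort and join in agreement.
LC_ALL=C sort -t "$tab" -k1,1 "$tmp/map" > "$tmp/map.sorted"
sed 1d "$tmp/acc" | LC_ALL=C sort -u > "$tmp/acc.sorted"

# Inner join on the first field of each file; unmatched accessions are dropped.
joined=$(LC_ALL=C join -t "$tab" "$tmp/acc.sorted" "$tmp/map.sorted")
printf '%s\n' "$joined"
rm -rf "$tmp"
```

The row numbers in the two files do not need to line up; join only needs both inputs ordered on the key. The output comes back in sorted order (110047140 before 9951915 here), not in the order of the accession list.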
Another alternative, which doesn't require sorting, is grep:
$ grep -wFf file map
100000009 sema5bb+
100000030 btr24+
100000061 si:ch211-133n4.9+
If the numbers and codes are not in the same format, there won't be false matches.
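To illustrate what -w buys you, here is a small sketch on made-up files (the names ids and map and the temporary directory are assumptions for the example). The short accession 999 must not match inside 999999:

```shell
# Hypothetical sample data; rows taken from the question's map sample.
tmp=$(mktemp -d)

printf '%s\t%s\n' \
    999999 gndA \
    999 CDH1 \
    9 NAT1 \
    9951915 LOAG_14435 > "$tmp/map"

# Accessions to look up, one per line, no header and no blank lines.
printf '%s\n' 999 9951915 > "$tmp/ids"

# -F: fixed strings, -f: read patterns from a file, -w: whole-word matches only.
matches=$(grep -wFf "$tmp/ids" "$tmp/map")
printf '%s\n' "$matches"
rm -rf "$tmp"
```

Only the 999 and 9951915 rows are printed, in the order they appear in map. Keep in mind that grep looks at the whole line, so this relies on the gene symbols never being identical to one of the accessions.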
We haven't seen a sample of your original gene2accession file yet, but let's assume it's a tab-separated file with accession in the 2nd column and gene in the 16th (since that's what your cut is selecting), with a header line. Let's also assume that your Accessions file isn't absolutely enormous.
Given that, this script should do what you want:
awk -F'\t' 'NR==FNR{a[$1];next} ($2 in a) && !seen[$2]++{print $2, $16}' Accessions gene2accession
but you could try this to see if it's faster (note -t$'\t': sort needs a literal tab character as its separator, which bash produces with $'\t'):
awk -F'\t' 'NR==FNR{a[$1];next} $2 in a{print $2, $16}' Accessions <(sort -u -t$'\t' -k2,2 gene2accession)
and if it is, and you want an intermediate file holding the output of the sort to use in subsequent runs:
sort -u -t$'\t' -k2,2 gene2accession > unq_gene2accession &&
awk -F'\t' 'NR==FNR{a[$1];next} $2 in a{print $2, $16}' Accessions unq_gene2accession
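For a quick check of the idiom before running it on the full file, here is a minimal sketch applied to the two-column accession2genesymbol file that the cut in the question produces (so the fields are $1 and $2 rather than $2 and $16). The sample rows and temporary paths are made up for the example:

```shell
# Hypothetical sample data with a duplicated map row and a header in Accessions.
tmp=$(mktemp -d)

printf '%s\t%s\n' \
    9951915 LOAG_14435 \
    9951915 LOAG_14435 \
    110047140 LOC110047140 \
    999 CDH1 > "$tmp/accession2genesymbol"

printf '%s\n' Accessions 110047140 9951915 9951915 4501883 > "$tmp/Accessions"

result=$(awk -F'\t' '
    NR == FNR { wanted[$1]; next }      # first file: remember every accession
    ($1 in wanted) && !seen[$1]++       # second file: print the first hit per accession
' "$tmp/Accessions" "$tmp/accession2genesymbol")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Only the small Accessions list is held in memory; the large file is streamed once and needs no sorting. The header line "Accessions" ends up as a key too, which is harmless as long as no accession in the map has that literal value.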