
Combine two files based on a matching column when one file's column has the same entry more than once

I would like to match two files on one column and combine the matching lines. However, one of the files (file1.txt) contains the same entry more than once. For example:

file1.txt

chr:123 a
chr:123 b
chr:456 a

file2.txt

chr:123 aa
chr:456 bb

I would like to match the lines on the first column and append the file2.txt value to every matching file1.txt line.

The final output should look like:

chr:123 a aa
chr:123 b aa
chr:456 a bb

I tried intersect in R, but couldn't figure out how to combine the matching lines when file1.txt has the same entry more than once. I am currently using two nested for loops, but the files are very big and this takes a long time.

Is there a quicker way to do this in Perl or R?

Try this:

one <- data.frame(
id=c("chr:123","chr:123","chr:456"),
value=c("a","b","a")
)

two <- data.frame(
id=c("chr:123","chr:456"),
value=c("aa","bb")
)

merge(one,two,by="id",all.x=TRUE)

#result
       id value.x value.y
1 chr:123       a      aa
2 chr:123       b      aa
3 chr:456       a      bb
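The example above builds the data frames by hand. A minimal sketch of the same merge applied to the files themselves might look like the following, assuming both files are whitespace-separated with no header line. The column names value1 and value2 and the output name merged.txt are illustrative, not from the question, and the first two lines only recreate the sample data so the sketch runs on its own.

```r
# Recreate the sample files from the question so the sketch is self-contained.
writeLines(c("chr:123 a", "chr:123 b", "chr:456 a"), "file1.txt")
writeLines(c("chr:123 aa", "chr:456 bb"), "file2.txt")

# Column names are illustrative; the files themselves have no header line.
one <- read.table("file1.txt", header = FALSE,
                  col.names = c("id", "value1"), stringsAsFactors = FALSE)
two <- read.table("file2.txt", header = FALSE,
                  col.names = c("id", "value2"), stringsAsFactors = FALSE)

# Default merge is an inner join on id: each row of `one` is paired with
# the matching row of `two`, so repeated ids in file1.txt are all kept.
merged <- merge(one, two, by = "id")

# Write space-separated lines without quotes, row names, or a header
# ("merged.txt" is a hypothetical output name).
write.table(merged, "merged.txt", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```

Note that merge sorts the result on the key column by default; pass sort = FALSE if that is not wanted, and all.x = TRUE if unmatched file1.txt lines should be kept with NA.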

Here's another option, in Perl:

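use Modern::Perl;

# Map each key in file1.txt to the list of values seen for it.
my %file1Hash;

open my $file1, '<', 'file1.txt' or die "Cannot open file1.txt: $!";
while (<$file1>) {
    my ( $key, $value ) = split;
    push @{ $file1Hash{$key} }, $value;
}
close $file1;

# Read file2.txt line by line and print one combined line
# for every file1.txt value stored under the same key.
open my $file2, '<', 'file2.txt' or die "Cannot open file2.txt: $!";
while (<$file2>) {
    my ( $key, $value ) = split;
    next unless $file1Hash{$key};
    say "$key $_ $value" for @{ $file1Hash{$key} };
}
close $file2;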

Output:

chr:123 a aa
chr:123 b aa
chr:456 a bb
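The same lookup idea can also be sketched in R with match(), which finds, for each file1 key, the position of that key in file2. This is only a sketch: it looks keys up in file2 rather than hashing file1 as the Perl code does, and it assumes each id appears at most once in file2, because match() returns only the first hit. The data frames repeat the sample data from the question.

```r
one <- data.frame(id = c("chr:123", "chr:123", "chr:456"),
                  value = c("a", "b", "a"), stringsAsFactors = FALSE)
two <- data.frame(id = c("chr:123", "chr:456"),
                  value = c("aa", "bb"), stringsAsFactors = FALSE)

# Position in `two` of each id from `one`; NA where there is no match.
idx <- match(one$id, two$id)
keep <- !is.na(idx)

# Build the combined lines in file1 order, dropping unmatched rows.
result <- paste(one$id[keep], one$value[keep], two$value[idx[keep]])
writeLines(result)
```

Because match() is vectorised, this avoids the nested for loops described in the question.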
