
Multiple column comparison in Perl

I'm trying to remove matching data from 2 files whose lines are formatted like this:

scaffold_1  21786   .   A   G   198 .   DP=44;VDB=0.0402;AF1=1 

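What I want to do is check the 2 files against each other on columns 1-5 (in this case, from scaffold_1 to G). If all of those columns match, I want to skip the lines; if any of the 5 differ, each line should go to its own output file.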

#!/usr/bin/perl 
#bothfixed.pl 
use strict; use warnings; 

die "usage <file1> <file2> <write1> <write2>\n" unless @ARGV ==4; 

open (my $file1, "<", "$ARGV[0]") or die "Can't open $file1:$!";
open (my $file2, "<", "$ARGV[1]") or die "Can't open $file1:$!"; 
open (my $write1, ">", "$ARGV[2]");
open (my $write2, ">", "$ARGV[3]"); 


my $line1;  
my $line2; 
my @array1; 
my @array2; 
while (not eof $file1 && not eof $file2) {
    $line1 = <$file1> ;$line2 = <$file2>;  #print $line1; 

    @array1 = split (/\t/, $line1); 
    my @slice1 = @array1[0..4];

    @array2 = split (/\t/, $line2); 
    my @slice2 = @array2[0..4];

    my $string1 = join ('',@slice1);
    my $string2 = join ('',@slice2); 

    if ($string1 eq $string2) {next}
    else {print $write1 "$line1" ; print $write2 "$line2"}  
}

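I only see one line removed in this process, which I take to mean something is wrong with my while loop, but that's about as far as my guessing goes.

Any ideas?

I've added some test data.

Lines from file 1: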

scaffold_1  721 .   T   C   222 .   DP=67;VDB=0.0411;AF1=1;AC1=2;DP4=0,0,30,33;MQ=60;FQ=-217    GT:PL:DP:GQ 1/1:255,190,0:63:99

scaffold_1  1282    .   T   G   36  .   DP=67;VDB=0.0396;AF1=0.5;AC1=1;DP4=23,23,15,5;MQ=26;FQ=39;PV4=0.1,0.097,0.039,1 GT:PL:DP:GQ 0/1:66,0,255:66:69

scaffold_1  15917   .   AATATATATATATATATATATATATATATATATATAT   AATATATATATATATATATATATATATATATATATATAT 127 .   INDEL;DP=71;VDB=0.0387;AF1=0.5;AC1=1;DP4=3,13,5,6;MQ=57;FQ=130;PV4=0.21,1,0.21,1    GT:PL:DP:GQ 0/1:165,0,255:27:99

scaffold_1  19183   .   TAC TACAC   217 .   INDEL;DP=83;VDB=0.0408;AF1=0.5;AC1=1;DP4=24,18,16,19;MQ=60;FQ=217;PV4=0.36,1,1,0.0074   GT:PL:DP:GQ 0/1:255,0,255:77:99

scaffold_1  21786   .   A   G   198 .   DP=44;VDB=0.0402;AF1=1;AC1=2;DP4=0,0,21,22;MQ=60;FQ=-156    GT:PL:DP:GQ 1/1:231,129,0:43:99

scaffold_1  26031   .   G   A   169 .   DP=83;VDB=0.0263;AF1=0.5;AC1=1;DP4=23,14,21,24;MQ=60;FQ=172;PV4=0.19,6.7e-28,1,1    GT:PL:DP:GQ 0/1:199,0,255:82:99

scaffold_1  33033   .   A   T   206 .   DP=61;VDB=0.0411;AF1=0.5;AC1=1;DP4=17,22,13,8;MQ=60;FQ=209;PV4=0.28,3.4e-05,1,1 GT:PL:DP:GQ 0/1:236,0,255:60:99

scaffold_1  33799   .   C   A   146 .   DP=56;VDB=0.0394;AF1=0.5;AC1=1;DP4=13,14,13,14;MQ=60;FQ=149;PV4=1,6.2e-28,0.16,0.14 GT:PL:DP:GQ 0/1:176,0,255:54:99

scaffold_1  35051   .   CAAAAAAAAAA CAAAAAAAAAAA    32.5    .   INDEL;DP=51;VDB=0.0447;AF1=1;AC1=2;DP4=1,1,22,13;MQ=60;FQ=-118;PV4=1,1,1,1  GT:PL:DP:GQ 1/1:73,83,0:37:99

Lines from file 2:

scaffold_1  721 .   T   C   221 .   DP=57;VDB=0.0407;AF1=1;AC1=2;DP4=0,0,23,32;MQ=60;FQ=-193    GT:PL:DP:GQ 1/1:254,166,0:55:99

scaffold_1  1282    .   T   G   80  .   DP=82;VDB=0.0383;AF1=0.5;AC1=1;DP4=29,30,13,9;MQ=26;FQ=83;PV4=0.46,0.19,0.5,1   GT:PL:DP:GQ 0/1:110,0,238:81:99

scaffold_1  10472   .   A   C   23  .   DP=44;VDB=0.0402;AF1=0.5;AC1=1;DP4=6,8,1,11;MQ=60;FQ=26;PV4=0.081,9.7e-12,1,1   GT:PL:DP:GQ 0/1:53,0,246:26:56

scaffold_1  15917   .   AATATATATATATATATATATATATATATATATATAT   AATATATATATATATATATATATATATATATATATATAT 186 .   INDEL;DP=39;VDB=0.0416;AF1=0.5;AC1=1;DP4=5,6,5,5;MQ=60;FQ=189;PV4=1,1,1,0.43    GT:PL:DP:GQ 0/1:224,0,237:21:99

scaffold_1  19183   .   TAC TACAC   217 .   INDEL;DP=76;VDB=0.0383;AF1=0.5;AC1=1;DP4=16,12,20,23;MQ=60;FQ=217;PV4=0.47,1,1,0.11 GT:PL:DP:GQ 0/1:255,0,255:71:99
scaffold_1  21786   .   A   G   196 .   DP=58;VDB=0.0365;AF1=1;AC1=2;DP4=0,0,33,24;MQ=60;FQ=-199    GT:PL:DP:GQ 1/1:229,172,0:57:99

scaffold_1  26031   .   G   A   169 .   DP=70;VDB=0.0407;AF1=0.5;AC1=1;DP4=13,12,22,23;MQ=60;FQ=172;PV4=1,6.1e-26,1,1   GT:PL:DP:GQ 0/1:199,0,255:70:99

scaffold_1  33033   .   A   T   225 .   DP=41;VDB=0.0404;AF1=0.5;AC1=1;DP4=8,9,13,10;MQ=60;FQ=225;PV4=0.75,0.0052,1,1   GT:PL:DP:GQ 0/1:255,0,255:40:99

scaffold_1  33799   .   C   A   116 .   DP=61;VDB=0.0410;AF1=0.5;AC1=1;DP4=18,20,11,12;MQ=60;FQ=119;PV4=1,3.6e-28,1,0.46    GT:PL:DP:GQ 0/1:146,0,255:61:99
scaffold_1  35051   .   CAAAAAAAAAA CAAAAAAAAAAA,CAAAAAAAAAAAA  47.5    .   INDEL;DP=40;VDB=0.0384;AF1=1;AC1=2;DP4=0,0,16,17;MQ=60;FQ=-128  GT:PL:DP:GQ 1/1:88,93,0,82,60,76:33:99

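This should do what you want. You need to read the first file into memory and store it in a hash keyed by the comparison string ($string1); then you can check each line of the second file to see whether we've seen it before. This code will shuffle the output lines from the first file, because hash keys come back in no particular order. If you need to preserve the order, you can add another hash that records line numbers and use it to sort the final output loop.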

#!/usr/bin/perl
# bothfixed.pl
use strict;
use warnings;

die "usage: <file1> <file2> <write1> <write2>\n" unless @ARGV == 4;

open(my $file1,  "<", $ARGV[0]) or die "Can't open $ARGV[0]: $!";
open(my $file2,  "<", $ARGV[1]) or die "Can't open $ARGV[1]: $!";
open(my $write1, ">", $ARGV[2]) or die "Can't open $ARGV[2]: $!";
open(my $write2, ">", $ARGV[3]) or die "Can't open $ARGV[3]: $!";

my %file1;    # key (columns 1-5) => full line from the first file
my %match;    # keys that also appear in the second file

# Read the first file into memory, keyed on columns 1-5
while (my $line1 = <$file1>) {
    my @array1 = split(/\t/, $line1);
    next if @array1 < 5;                        # skip blank or short lines
    my $string1 = join("\t", @array1[0..4]);    # tab keeps column boundaries
    $file1{$string1} = $line1;
}

# Run through the second file
while (my $line2 = <$file2>) {
    my @array2 = split(/\t/, $line2);
    next if @array2 < 5;                        # skip blank or short lines
    my $string2 = join("\t", @array2[0..4]);

    if (exists $file1{$string2}) {
        # If we've matched then we won't want to print it out at the end
        $match{$string2}++;
    } else {
        print $write2 $line2;
    }
}

# Print the first-file lines that never matched (hash order, so shuffled)
foreach my $string1 (keys %file1) {
    next if $match{$string1};
    print $write1 $file1{$string1};
}
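
The order-preserving variant mentioned above could look roughly like the sketch below. It is only a sketch under a few assumptions: the sub names (`key_for`, `filter_unmatched`, `read_lines`, `write_lines`) are made up for illustration, both files are read fully into memory rather than just the first, and each five-column key is assumed to occur at most once per file (a repeated key in file 1 would keep only its last line).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the comparison key from the first five tab-separated columns.
# Returns undef (in scalar context) for blank or short lines.
sub key_for {
    my ($line) = @_;
    chomp(my $copy = $line);
    my @cols = split /\t/, $copy;
    return if @cols < 5;
    return join "\t", @cols[0 .. 4];
}

# Takes two array refs of lines and returns two array refs:
# lines only in the first list and lines only in the second,
# each in its original order.
sub filter_unmatched {
    my ($lines1, $lines2) = @_;
    my (%file1, %order, %match);

    my $n = 0;
    for my $line (@$lines1) {
        my $key = key_for($line);
        next unless defined $key;
        $file1{$key} = $line;
        $order{$key} = $n++;    # remember where the line appeared
    }

    my @only2;
    for my $line (@$lines2) {
        my $key = key_for($line);
        next unless defined $key;
        if (exists $file1{$key}) {
            $match{$key} = 1;
        } else {
            push @only2, $line;
        }
    }

    # Sort the unmatched keys back into first-file order.
    my @only1 = map  { $file1{$_} }
                sort { $order{$a} <=> $order{$b} }
                grep { !$match{$_} } keys %file1;

    return (\@only1, \@only2);
}

sub read_lines {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Can't open $path: $!";
    my @lines = <$fh>;
    close $fh;
    return \@lines;
}

sub write_lines {
    my ($path, $lines) = @_;
    open(my $fh, '>', $path) or die "Can't open $path: $!";
    print {$fh} @$lines;
    close($fh) or die "Can't close $path: $!";
}

# Same command-line shape as the script above:
#   <file1> <file2> <write1> <write2>
if (@ARGV == 4) {
    my ($only1, $only2) =
        filter_unmatched(read_lines($ARGV[0]), read_lines($ARGV[1]));
    write_lines($ARGV[2], $only1);
    write_lines($ARGV[3], $only2);
}
```

The `%order` hash plays the role of the "line number" hash: it costs one extra integer per line of file 1 and lets the final loop sort the surviving keys back into input order.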
