
Match columns on different lines and sum

I have a CSV file of about 160,000 lines that looks like this:

chr1,160,161,3,0.333333333333333,+         
chr1,161,162,4,0.5,-      
chr1,309,310,14,0.0714285714285714,+     
chr1,311,312,2,0.5,-     
chr1,499,500,39,0.717948717948718,+     
chr2,500,501,8,0.375,-      
chr2,510,511,18,0.5,+         
chr2,511,512,6,0.333333333333333,-    

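I would like to pair two lines when column 1 is the same on both, column 3 of one line matches column 2 of the other, and column 6 is '+' on one line and '-' on the other. When all of that holds, I want to sum columns 4 and 5.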

My desired output is:

chr1,160,161,7,0.833333333333333,+         
chr1,309,310,14,0.0714285714285714,+     
chr1,311,312,2,0.5,-     
chr1,499,500,39,0.717948717948718,+     
chr2,500,501,8,0.375,-      
chr2,510,511,24,0.833333333333333,-  

The best solution I could come up with was to duplicate the file and then match columns between the file and its copy using Perl:

#!/usr/bin/perl             
use strict;      
use warnings;          
open my $firstfile, '<', $ARGV[0] or die "$!";         
open my $secondfile, '<', $ARGV[1] or die "$!";            
my ($chr_a, $chr_b,$start,$end,$begin,$finish, $sum_a, $sum_b, $total_a, 
    $total_b,$sign_a,$sign_b);             

while (<$firstfile>) {
    my @col = split /,/;
    $chr_a  = $col[0];
    $start  = $col[1];
    $end    = $col[2];
    $sum_a  = $col[3];
    $total_a = $col[4];
    $sign_a = $col[5];

    seek($secondfile,0,0);
    while (<$secondfile>) {
       my @seccol = split /,/;
       $chr_b     = $seccol[0];
       $begin     = $seccol[1];
       $finish    = $seccol[2];
       $sum_b     = $seccol[3];
       $total_b   = $seccol[4];
       $sign_b    = $seccol[5];

       print join ("\t", $col[0], $col[1], $col[2], $col[3]+=$seccol[3], 
                         $col[4]+=$seccol[4], $col[5]), 
           "\n" if ($chr_a eq $chr_b and $end==$begin and $sign_a ne $sign_b);
    }

}

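This works well, but ideally I would like to do it on the file itself without having to duplicate it. I have many files, so I would like to run one script over all of them, which would be less time-consuming. Thank you.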

In the absence of a reply to my comment, this program does what you ask with the data you provided.

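use strict;
use warnings;

my @last;    # fields of the previous unmatched line, empty if none

while (<DATA>) {
  s/\s+\z//;    # strip the newline and any trailing whitespace
  my @line = split /,/;

  # Merge when the previous line is '+', this line is '-', both are on the
  # same chromosome, and the previous end equals this start
  if (@last
      and $last[0] eq $line[0]
      and $last[2] eq $line[1]
      and $last[5] eq '+' and $line[5] eq '-') {

    $last[3] += $line[3];
    $last[4] += $line[4];
    print join(',', @last), "\n";
    @last = ();
  }
  else {
    print join(',', @last), "\n" if @last;
    @last = @line;
  }
}

# Flush the final pending line
print join(',', @last), "\n" if @last;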

__DATA__
chr1,160,161,3,0.333333333333333,+         
chr1,161,162,4,0.5,-      
chr1,309,310,14,0.0714285714285714,+     
chr1,311,312,2,0.5,-     
chr1,499,500,39,0.717948717948718,+     
chr2,500,501,8,0.375,-      
chr2,510,511,18,0.5,+         
chr2,511,512,6,0.333333333333333,-

Output

chr1,160,161,7,0.833333333333333,+
chr1,309,310,14,0.0714285714285714,+
chr1,311,312,2,0.5,-
chr1,499,500,39,0.717948717948718,+
chr2,500,501,8,0.375,-
chr2,510,511,24,0.833333333333333,+
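
The question also asked for a way to run one script over many files without making copies. What follows is only a sketch of how the same pairing logic might be adapted for that, not part of the original answer. The subroutine name `merge_pairs` and the `.tmp` suffix are chosen here for illustration. It assumes that matching lines are always adjacent, as in the sample data, and that each file fits comfortably in memory.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Merge each '+' line with the immediately following '-' line when both are
# on the same chromosome and the first line's end equals the second's start.
# Takes a list of CSV lines and returns the merged lines without newlines.
sub merge_pairs {
    my @out;
    my @last;    # fields of the pending line, empty if none

    for my $raw (@_) {
        (my $line = $raw) =~ s/\s+\z//;    # strip newline and trailing blanks
        next if $line eq '';               # skip empty lines
        my @cur = split /,/, $line;

        if (    @last
            and $last[0] eq $cur[0]
            and $last[2] eq $cur[1]
            and $last[5] eq '+'
            and $cur[5] eq '-')
        {
            $last[3] += $cur[3];
            $last[4] += $cur[4];
            push @out, join(',', @last);
            @last = ();
        }
        else {
            push @out, join(',', @last) if @last;
            @last = @cur;
        }
    }
    push @out, join(',', @last) if @last;
    return @out;
}

# Rewrite every file named on the command line.
for my $file (@ARGV) {
    open my $in, '<', $file or die "Cannot read $file: $!";
    my @merged = merge_pairs(<$in>);
    close $in;

    # Write to a temporary file first (assumed name: "<file>.tmp"),
    # then rename it over the original.
    my $tmp = "$file.tmp";
    open my $out, '>', $tmp or die "Cannot write $tmp: $!";
    print {$out} "$_\n" for @merged;
    close $out or die "Cannot close $tmp: $!";
    rename $tmp, $file or die "Cannot replace $file: $!";
}
```

Saved under a name such as merge_pairs.pl (hypothetical), it could be run as `perl merge_pairs.pl *.csv`. Because the originals are overwritten, it is worth trying it on a copy of one file first.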
