简体   繁体   English

使用指向Perl中文件行的指针

[英]Using pointers to lines in a file in perl

I'm trying to use some sort of pointers in perl so that I can look at two at two files that are sorted in alphabetical order and match things in both the files if they have the same name in the first column. 我试图在perl中使用某种指针,以便我可以查看两个按字母顺序排序的文件中的两个,如果两个文件在第一列中具有相同的名称,则它们将匹配这两个文件中的内容。 The way i'm searching through each file though is I'm looking at which lines first column is lower in alphabetical order and then moving the pointer on that file to the next line. 我搜索每个文件的方式是,我要查看第一行中哪些行的字母顺序较低,然后将文件上的指针移至下一行。 Somewhat similar to the pointers in merge sort. 有点类似于合并排序中的指针。 The code below is an example of what I want. 下面的代码是我想要的示例。

Using these two files. 使用这两个文件。

set1 设置1

 apple  17  20
 boombox  23  29
 carl  25  29
 cat  22  33
 dog  27  44

set2 SET2

 ants  yes
 boombox  no
 carl  yes
 dentist  yes
 dice  no
 dog  no

I can make a script that does something like this 我可以做一个像这样的脚本

($name, $affirmation) = first line in set2; #part I'm confused about I just kind of need some sort of command of something that will do this
while (<>){
        @set1 = split;
        while ($name < set1[0]){
             ($name, $affirmation) = next line in set2;  # part i'm confused about I just kind of need some sort of command of something that will do this
        }
        if ($name = $set[0]{
               print @set1, $affirmation;
        }

This is how I would run it 这就是我的运行方式

./script.txt set1

I would end up with 我最终会

boombox  23  29  no
carl  25  29  yes
dog  27  44  no

.

.

Edit: 编辑:

I tried some code in some of the answers to see if I could make some functional code out of it but I seem to be running into problems, and some of the syntax in the answers I could not understand so I'm having a lot of trouble figuring out how to debug or solve this. 我在一些答案中尝试了一些代码,以查看是否可以用它编写一些功能代码,但是我似乎遇到了问题,并且答案中的某些语法我无法理解,因此我遇到了很多弄清楚如何调试或解决此问题。

This is my specific example using the folllowing two text files 这是我使用以下两个文本文件的特定示例

text.txt 的text.txt

Apples 0       -1      -1      0       0       0       0       -1 
Apricots 0       1      1      0       0       0       0       1
Fruit        0       -1      -1      0       0       0       0       -1
Grapes        0       -2      -1      0       0       0       0       -2
Oranges   0       1      1      0       0       0       0       -1
Peaches  0       -2      -1      0       0       0       0       -2

text2.txt text2.txt

Apples      CHR1    +       1167628 1170420 1       1       N
Apricots      CHR1    -       2115898 2144159 1       1       N
Oranges       CHR1    -       19665266        19812066        1       1      N
Noidberry     CHR1    -       1337728 1329993 1       1       N
Peaches       CHR1    -       1337275 1342693 1       1       N

And this script script.pl 而这个脚本script.pl

#!/usr/bin/perl
use warnings;
my $file_1 = $ARGV[0];
my $file_2 = $ARGV[1];

open my $fh1, '<', $file_1 or die "Can't open $file_1: $!";
open my $fh2, '<', $file_2 or die "Can't open $file_2: $!";

open(my $single, '>', 'text.txt');
open(my $deep, '>', 'text2.txt');
OUTER: while (my $outer = <$fh1>){
        chomp $outer;
        @CopyNumber = split(' ', $outer);
        ($title, $title2) = split('\|', $CopyNumber[0]);
        #print 'title: ',$title,' title2: ',$title2,"\n";
        my $numLoss = 0;
        my $deepLoss = 0;
        for ($i = 1; $i <= $#CopyNumber; $i++){
                #print "$CopyNumber[$i], $#CopyNumber, $i, \n";
                if ($CopyNumber[$i] < 0){
                        $numLoss = $numLoss + 1;
                        if ($CopyNumber[$i] <-1){
                                $deepLoss = $deepLoss + 1;
                        }
                }
        }
        if ($GeneSym and (($GeneSym cmp $title)==0)){ #or (($GeneSym cmp $title2)==0))){
                print $single $Chrom,"\t",$Start,"\t",$Stop,"\t",$numLoss/$#CopyNumber,"\n";
                print $deep $Chrom,"\t",$Start,"\t",$Stop,"\t",$deepLoss/$#CopyNumber,"\n";
                next OUTER;
        }

        INNER: while (my $inner = <$fh2>){
                ($GeneSym, $Chrom, $Strand, $Start, $Stop, $MapId, $TotalMap, $AbnormalMerge, $Overlap) = split(' ', $inner);
                $Chrom =~ s/CHR/hs/ee;
                my $cmp = ($GeneSym cmp $title);
                next OUTER if $cmp < 0;
                if ($cmp==0){ #or (($GeneSym cmp $title2)==0)){
                        print $single $Chrom,"\t",$Start,"\t",$Stop,"\t",$numLoss/$#CopyNumber,"\n";
                        print $deep $Chrom,"\t",$Start,"\t",$Stop,"\t",$deepLoss/$#CopyNumber,"\n";
                        next OUTER;
                }
        }
}

If I run ./script.pl text.txt text2.txt I should get this printed into Number.txt 如果我运行./script.pl text.txt text2.txt,我应该将此打印到Number.txt中

//corresponding to columns 2,4,5 of text2.txt and the last column being the percentage of columns which have a number lower than 0 //对应于text2.txt的第2、4、5列,最后一列是数字小于0的列的百分比

hs1     1167628     1170420    0.375 //For Apples
hs1     2115898     2144159    0 //For Apricots
hs1     19665266    19812066   0.125 //For Oranges
hs1     1337275     1342693    0.375 //For Peaches

Instead I get this 相反,我得到这个

hs1     1167628 1170420 0.375
hs1     2115898 2144159 0
hs1     1337275 1342693 0.375

So I'm just getting an error where 所以我只是出现一个错误

hs1     19665266    19812066   0.125 //For Oranges

isn't printing 没有打印

Quite like you state, with: use cmp for comparison, split line into two terms. 就像您陈述的那样,使用:使用cmp进行比较, split行分为两部分。

For each line of FILE1 file go through lines of FILE2 file, exiting when a match is found. 对于FILE1文件的每一行,都要经过FILE2文件的各行,找到匹配项后退出。 Once the FILE2 overshoots alphabetically move to the next line of FILE1 . 一旦FILE2超调按字母顺序移动到FILE1的下一行。

use warnings 'all';
use strict;

sub process {
   my ($name, $affirm_1, $affirm_2) = @_;
   print "$name $affirm_1 $affirm_2\n";
}

my $file_1 = 'set1.txt';
my $file_2 = 'set2.txt';

open my $fh1, '<', $file_1 or die "Can't open $file_1: $!";
open my $fh2, '<', $file_2 or die "Can't open $file_2: $!";

my ($name_2, $affirm_2);
FILE1: while (my $line1 = <$fh1>) {
    chomp $line1;
    my ($name_1, $affirm_1) = split ' ', $line1, 2;

    if ($name_2) {
        my $cmp = $name_1 cmp $name_2;
        next FILE1 if $cmp < 0;
        if ($cmp == 0) {
            process($name_1, $affirm_1, $affirm_2);
            next FILE1;
        }
    }

    FILE2: while (my $line2 = <$fh2>) {
        chomp $line2;
        ($name_2, $affirm_2) = split ' ', $line2, 2;
        my $cmp = $name_1 cmp $name_2;
        next FILE1 if $cmp < 0;
        if ($cmp == 0) {
            process($name_1, $affirm_1, $affirm_2);
            next FILE1;
        }
    }
}

Comments on a few remaining details. 对其他一些细节发表评论。

Once a FILE2 line "overshoots," in the next iteration of FILE1 we need to first check that line, before entering the FILE2 loop to iterate over its remaining lines. 一旦FILE2行“超调”,在FILE1的下一次迭代中,我们需要先检查该行,然后再进入FILE2循环以遍历其其余行。 For the first FILE1 line the $name_2 is still undef thus if ($name_2) . 对于第一行FILE1$name_2仍然是undef,因此if ($name_2)


Updated for edited post. 已更新为已编辑帖子。

use warnings 'all';
use strict;

sub process_line {
    my ($single, $deep, $rline, $GeneSym, $Chrom, $Start, $Stop) = @_;
    my ($numLoss, $deepLoss) = calc_loss($rline);
    $Chrom =~ s/CHR/hs/;
    print $single (join "\t", $Chrom, $Start, $Stop, $numLoss/$#$rline), "\n";
    print $deep   (join "\t", $Chrom, $Start, $Stop, $deepLoss/$#$rline), "\n";
}

sub calc_loss {
    my ($rline) = @_; 
    my ($numLoss, $deepLoss) = (0, 0); 
    for my $i (1.. $#$rline) {
        $numLoss  += 1  if $rline->[$i] < 0;
        $deepLoss += 1  if $rline->[$i] < -1; 
    }   
    return $numLoss, $deepLoss;
}

my ($Number,  $NumberDeep) = ('Number.txt', 'NumberDeep.txt');
open my $single, '>', $Number      or die "Can't open $Number: $!";
open my $deep,   '>', $NumberDeep  or die "Can't open $NumberDeep: $!";

my ($file_1, $file_2) = ('set1_new.txt', 'set2_new.txt');    
open my $fh1, '<', $file_1 or die "Can't open $file_1: $!";
open my $fh2, '<', $file_2 or die "Can't open $file_2: $!";

my ($GeneSym, $Chrom, $Strand, $Start, $Stop, 
    $MapId, $TotalMap, $AbnormalMerge, $Overlap);

FILE1: while (my $line1 = <$fh1>) {
    next if $line1 =~ /^\s*$/;
    chomp $line1;

    my @line = split ' ', $line1;

    if ($GeneSym) {
        my $cmp = $line[0] cmp $GeneSym;
        next FILE1 if $cmp < 0;
        if ($cmp == 0) {
            process_line($single, $deep, \@line,
                         $GeneSym, $Chrom, $Start, $Stop);
            next FILE1;
        }   
    }   

    FILE2: while (<$fh2>) {
        next if /^\s*$/;
        chomp;
        ($GeneSym, $Chrom, $Strand, $Start, $Stop, 
             $MapId, $TotalMap, $AbnormalMerge, $Overlap) = split;
        my $cmp = $line[0] cmp $GeneSym;
        next FILE1 if $cmp < 0;
        if ($cmp == 0) {
            process_line($single, $deep, \@line,
                         $GeneSym, $Chrom, $Start, $Stop);
            next FILE1;
        }
    }
}

This produces the desired output with given sample files. 这将在给定的示例文件中产生所需的输出。 Some shortcuts are taken, please let me know if comments would be helpful. 采取了一些捷径,请让我知道注释是否有帮助。 Here are a few 这里有一些

  • Much error checking should be added around. 应该增加很多错误检查。

  • I assume the first field of FILE1 to be used as it stands. 我假设要使用FILE1的第一个字段。 Otherwise changes are needed. 否则需要更改。

  • Processing is split into two functions, calculations being separate. 处理分为两个功能,计算是分开的。 This is not necessary. 这不是必需的。

  • $#$rline is the index of the last element of $rline arrayref. $#$rline$rline arrayref的最后一个元素的索引。 If this is too much syntax to stomach use @$rline - 1 , for example as (0..@$rline-1) 如果语法太多,请使用@$rline - 1 ,例如(0..@$rline-1)

Some comments on the code posted in the question: 对问题中发布的代码的一些评论:

  • Always, always , please use warnings; 总是, 总是 ,请use warnings; (and use strict; ) (并use strict;

  • loop over indices is best written foreach my $i (0..$#array) 最好在foreach my $i (0..$#array)写索引循环

  • The regex modifier /ee is very involved. regex修饰符/ee非常有用。 There is absolutely no need for it here. 这里绝对没有必要。

You're right. 你是对的。 It's exactly like a merge sort, except only matching lines are output. 完全类似于合并排序,只输出匹配的行。

sub read_and_parse1 {
   my ($fh) = @_;
   defined( my $line = <$fh> )
      or return undef;

   my ($id, @copy) = split(' ', $line);   # Use split(/\t/, $line) if tab-separated data
   my ($gene_sym) = split(/\|/, $id);

   return [ $gene_sym, @copy ];
}

sub read_and_parse2 {
   my ($fh) = @_;
   defined( my $line = <$fh> )
      or return undef;

   return [ split(' ', $line) ];         # Use split(/\t/, $line) if tab-separated data
}

my $fields1 = read_and_parse1($fh1);
my $fields2 = read_and_parse2($fh2);
while ($fields1 && $fields2) {
   my $cmp = $fields1->[0] cmp $fields2->[0];
   if    ($cmp < 0) { $fields1 = read_and_parse1($fh1); }
   elsif ($cmp > 0) { $fields2 = read_and_parse2($fh2); }
   else {
      my ($gene_sym, @copy) = @$fields1;
      my (undef, $chrom, $strand, $start, $stop, $map_id, $total_map, $abnormal_merge, $overlap) = @$fields2;

      $chrom =~ s/^CHR/hs/;

      my $num_loss  = grep { $_ <  0 } @copy;
      my $deep_loss = grep { $_ < -1 } @copy;

      print($single_fh join("\t", $chrom, $start, $stop, $num_loss/@copy  ) . "\n");
      print($deep_fh   join("\t", $chrom, $start, $stop, $deep_loss/@copy ) . "\n");

      $fields1 = read_and_parse1($fh1);
      $fields2 = read_and_parse2($fh2);
   }
}

Output: 输出:

$ cat single.txt
hs1 1167628 1170420 0.375
hs1 2115898 2144159 0
hs1 19665266 19812066 0.125
hs1 1337275 1342693 0.375

$ cat deep.txt
hs1 1167628 1170420 0
hs1 2115898 2144159 0
hs1 19665266 19812066 0
hs1 1337275 1342693 0.25

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM