Perl Multidimensional array column compare and display whole content with satisfied condition

I am having a small problem using array indexes for comparison and displaying the result. I have a tab-delimited file with 9 columns and more than 100 rows. I want to compare the 8th column element of the ith row with the 7th column element of the (i+1)th row. If it is smaller than the 7th column element, print the entire row; otherwise, if it is greater than the 7th column element, compare the 6th element of both rows and print only the row whose 6th element is bigger.

Sample File

Recep_L_domain  PF01030.22      112     sp|P00533|EGFR_HUMAN    2.50E-30        104.7   57      167     Receptor
Furin-like      PF00757.18      149     sp|P00533|EGFR_HUMAN    4.10E-29        101.3   185     338     Furin-like
Recep_L_domain  PF01030.22      112     sp|P00533|EGFR_HUMAN    3.60E-28        97.8    361     480     Receptor
GF_recep_IV     PF14843.4       132     sp|P00533|EGFR_HUMAN    1.60E-46        157.2   505     636     Growth
Pkinase PF00069.23      264     sp|P00533|EGFR_HUMAN    2.70E-39        135     712     964     Protein
Pkinase_Tyr     PF07714.15      260     sp|P00533|EGFR_HUMAN    8.40E-88        293.9   714     965     Protein

For example, if we compare the last two rows, the 8th column element is bigger than the next row's 7th column element, so in this case it should compare the two 6th column elements and print only the row where that element is bigger. So of these two rows it should print only the last row. My code below only prints the values if the 8th column is smaller; how can I compare the 6th element and print results if the 8th column is bigger?

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

open(IN,"<samplecode.txt");

my @Alifrom;
my @Alito;
my @data; ## multidimensional array

while(<IN>){
    chomp $_;
    #next if $_=undef;
    my @line = split("\t", $_);
    ##my($a, $b, $c, $d, $e, $f, $g, $h, $i) = split(/\t/,$_); // catch data and storing into multiple scalar variable

    push @data, [@line];
}

for (my $i = 0; $i < @data ; $i++){

    if ($data[$i][7] gt $data[$i][6]){

        for (my $j = 0; $j < @{$data[$i]}; $j++){
            #@Alifrom = map $data[$i][$j+6], @data;
            print "$data[$i][$j]\t";
        }
    }
    #else
    print "\n";
}

The description in your question is not entirely clear, but I'm taking an educated guess.

First, you should not read the whole file into an array. If your file really only has 100 rows, it's not a problem, but if there are more rows this will consume a lot of memory.

You say you want to compare values in every row i to values in row i+1 , so essentially in every row you want to look at values in the next row. That means you need to keep a maximum of two rows in memory at one time. Since that's linear, you can just read the first row, then read the second row, compare, and when you're done make the second row the new first row.

In each iteration of the loop you read a new second row, and the row that was the second row in the previous iteration becomes the first row.

For that, it makes sense to turn the reading and splitting into a function that takes a file handle. In my example below, I've used DATA with the __DATA__ section, but you can just open my $fh, '<', 'samplecode.txt' and pass $fh around.
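As a minimal sketch of the file-handle variant, assuming the input lives in samplecode.txt as in the question, the handle could be opened like this. The helper name open_table is hypothetical and not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: open a file for reading with a lexical handle and
# three-argument open, dying with the OS error message on failure.
sub open_table {
    my ($path) = @_;
    open( my $fh, '<', $path ) or die "Cannot open '$path': $!";
    return $fh;
}

# Usage (assumes the file exists):
#   my $fh = open_table('samplecode.txt');
#   ... then pass $fh to the reading function instead of \*DATA
```

A lexical handle is closed automatically when it goes out of scope, and checking the return value of open reports a missing file instead of silently reading nothing.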

Because you want to print the whole row in some cases, you should not chomp and split it destructively, but keep the full original row including the line break. We therefore make the read-and-split function return two values: the full row as a scalar string, and an array reference of the cleaned-up columns.

If there are no more lines to read, the function does a bare return, which yields an empty list and makes the while loop stop. Note that the last row of the file is therefore never processed as row i, because there is no row i+1 to compare it with.

When comparing, note that array indexes in Perl start at zero, so column 7 is index [6] .
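A short illustration of the index mapping, using one row from the sample data. The helper column is hypothetical and only exists to make the off-by-one explicit. The last two lines also show why the numeric operators < and > are the ones to use here rather than the string operator gt from the question's code: gt compares character by character.

```perl
use strict;
use warnings;

# Hypothetical helper: fetch a field by its 1-based column number.
sub column {
    my ( $cols, $number ) = @_;
    return $cols->[ $number - 1 ];
}

my @cols = split /\t/,
    "Pkinase\tPF00069.23\t264\tsp|P00533|EGFR_HUMAN\t2.70E-39\t135\t712\t964\tProtein";

printf "column 7 is index 6: %s\n", column( \@cols, 7 );    # 712
printf "column 8 is index 7: %s\n", column( \@cols, 8 );    # 964

# 'gt' compares as strings, character by character; '>' compares as numbers.
printf "string:  %s\n", ( "57" gt "167" ) ? "57 sorts after 167" : "57 sorts before 167";
printf "numeric: %s\n", ( 57 > 167 )      ? "57 is larger"       : "57 is smaller";
```

As strings, "57" sorts after "167" because the character 5 is greater than 1, which is not the ordering wanted for coordinates.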

Here's an example implementation.

use strict;
use warnings;

# this function reads a line from the filehandle that's passed in and returns
# the row as a string and an array ref of all columns, or undef if there are
# no more lines to read
sub read_and_split {
    my $fh = shift;

    # read one line and return undef if there is no more data
    my $row = <$fh>;
    return unless defined $row;

    # split into columns
    my @cols = split /\s+/, $row;    # the sample here is space-separated; use /\t/ for a tab-delimited file

    # only chomp after splitting so we retain the original line for printing
    chomp $cols[-1];

    # return both things
    return $row, \@cols;
}

# read the first line
my ( $row_i, $cols_i ) = read_and_split( \*DATA );

# read subsequent lines
while ( my ( $row_i_plus_one, $cols_i_plus_one ) = read_and_split( \*DATA ) ) {

    # 8th column (index 7) of row i is smaller than 7th column (index 6) of row i+1
    if ( $cols_i->[7] < $cols_i_plus_one->[6] ) {
        print $row_i;
    }
    else {
        # compare the 6th column (index 5) of both rows and print row i
        # only if its 6th column is bigger
        if ( $cols_i->[5] > $cols_i_plus_one->[5] ) {
            print $row_i;
        }
    }

    # turn the current i+1 into i for the next iteration
    $row_i  = $row_i_plus_one;
    $cols_i = $cols_i_plus_one;
}

__DATA__
Recep_L_domain  PF01030.22      112     sp|P00533|EGFR_HUMAN    2.50E-30        104.7   57      167     Receptor
Furin-like      PF00757.18      149     sp|P00533|EGFR_HUMAN    4.10E-29        101.3   185     338     Furin-like
Recep_L_domain  PF01030.22      112     sp|P00533|EGFR_HUMAN    3.60E-28        97.8    361     480     Receptor
GF_recep_IV     PF14843.4       132     sp|P00533|EGFR_HUMAN    1.60E-46        157.2   505     636     Growth
Pkinase PF00069.23      264     sp|P00533|EGFR_HUMAN    2.70E-39        135     712     964     Protein
Pkinase_Tyr     PF07714.15      260     sp|P00533|EGFR_HUMAN    8.40E-88        293.9   714     965     Protein

It outputs these lines:

Recep_L_domain  PF01030.22      112     sp|P00533|EGFR_HUMAN    2.50E-30        104.7   57      167     Receptor
Furin-like      PF00757.18      149     sp|P00533|EGFR_HUMAN    4.10E-29        101.3   185     338     Furin-like
Recep_L_domain  PF01030.22      112     sp|P00533|EGFR_HUMAN    3.60E-28        97.8    361     480     Receptor
GF_recep_IV     PF14843.4       132     sp|P00533|EGFR_HUMAN    1.60E-46        157.2   505     636     Growth

Note that the part about comparing the 6th columns was not very clear in your question. I assumed that we compare the 6th column of both rows and print row i if its value is bigger. If we were to print row i+1 instead, we might end up printing that line twice.
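If the intent is instead to keep whichever of two overlapping rows has the larger 6th column (an assumption; the question does not settle this), one way to avoid printing a row twice is to hold back a candidate row and print it only once the next row no longer overlaps it. The following is a sketch under that assumption, not a definitive implementation; filter_rows is a hypothetical name, and it takes tab-separated rows as a list for brevity rather than reading from a file handle.

```perl
use strict;
use warnings;

# Hypothetical helper: given a list of tab-separated rows, return the rows to
# keep. Two neighbouring rows are treated as overlapping when the 8th column
# of the held-back row is not smaller than the 7th column of the next row; of
# two overlapping rows only the one with the larger 6th column survives.
sub filter_rows {
    my @rows = @_;
    my @kept;
    my ( $current, $current_cols );

    for my $row (@rows) {
        my @cols = split /\t/, $row;

        if ( !defined $current ) {
            # first row: nothing to compare against yet
            ( $current, $current_cols ) = ( $row, \@cols );
        }
        elsif ( $current_cols->[7] < $cols[6] ) {
            # no overlap: the candidate is final, the new row becomes the candidate
            push @kept, $current;
            ( $current, $current_cols ) = ( $row, \@cols );
        }
        elsif ( $cols[5] > $current_cols->[5] ) {
            # overlap and the new row has the larger 6th column: replace the candidate
            ( $current, $current_cols ) = ( $row, \@cols );
        }
        # otherwise: overlap and the candidate wins, so the new row is dropped
    }

    # the last candidate has no successor, so it is kept
    push @kept, $current if defined $current;
    return @kept;
}

my @rows = (
    "Pkinase\tPF00069.23\t264\tsp|P00533|EGFR_HUMAN\t2.70E-39\t135\t712\t964\tProtein",
    "Pkinase_Tyr\tPF07714.15\t260\tsp|P00533|EGFR_HUMAN\t8.40E-88\t293.9\t714\t965\tProtein",
);
print "$_\n" for filter_rows(@rows);
```

For the last two rows of the sample, this prints only the Pkinase_Tyr row, which is what the question describes. Unlike the loop above, it also emits the final row of the file, since the held-back candidate is flushed after the loop ends.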
