
Perl substitution not working

I have the following piece of code:

use warnings;
use strict;

my %hash;

open TRADUCTOR, $ARGV[0];
open GTF, $ARGV[1];

while (my $line = <TRADUCTOR>) {
    chomp $line;
    my @chompline = split(/\t/, $line);
    $hash{$chompline[2]} = $chompline[1];
}

while (my $line = <GTF>) {
    my @chompline1 = split(/\t/, $line);
    my @chompline2 = split(/ /, $chompline1[8]);
    my @chompline3 = split(/;/, $chompline2[2]);
    my $transcript = $chompline3[0];
    $transcript =~ s/"//g;
    $transcript =~ s/PAC://g;
    my $transcript2 = $transcript;
    $transcript2 =~ s/(_[jox])/&&&$1/g;
    my @chompline4 = split(/&&&/, $transcript2);
    my $coletilla = $chompline4[1];
    my $transl = $hash{$chompline4[0]};
    if (defined $chompline4[1]) {
        $line =~ s/PAC:$transcript;/$transl.$coletilla;/ee;
    } else {
        $line =~ s/PAC:$transcript;/$hash{$chompline4[0]};/ee;
    }
    print $line;
}

First argument is the following file (first ten lines):

Sb01g017490 Sb01g017490.1   1951419
Sb02g039360 Sb02g039360.1   1959410
Sb01g037620 Sb01g037620.1   1953645
Sb03g003880 Sb03g003880.1   1960464
Sb01g001330 Sb01g001330.1   1949441
Sb01g049890 Sb01g049890.1   1955138
Sb09g030646 Sb09g030646.1   1982110
Sb02g011950 Sb02g011950.1   1956744
Sb04g008540 Sb04g008540.1   1965938

Second argument is the following file (first ten lines):

A01 greenc1.0   exon    5409    5518    .   -   .   gene_id "Bra.XLOC_002074";transcript_id "Bra.TCONS_00002741";transcript_biotype "protein_coding"
A01 greenc1.0   exon    5616    5654    .   -   .   gene_id "Bra.XLOC_002074";transcript_id "Bra.TCONS_00002741";transcript_biotype "protein_coding"
A01 greenc1.0   exon    8307    8530    .   -   .   gene_id "Bra011902";transcript_id "PAC:22703627_j.1";transcript_biotype "protein_coding"
A01 greenc1.0   exon    8426    8530    .   -   .   gene_id "Bra011902";transcript_id "PAC:22703627";transcript_biotype "protein_coding"
A01 greenc1.0   exon    8599    8844    .   -   .   gene_id "Bra011902";transcript_id "PAC:22703627_j.1";transcript_biotype "protein_coding"
A01 greenc1.0   exon    8599    8823    .   -   .   gene_id "Bra011902";transcript_id "PAC:22703627";transcript_biotype "protein_coding"
A01 greenc1.0   exon    8919    9056    .   -   .   gene_id "Bra011902";transcript_id "PAC:22703627_j.1";transcript_biotype "protein_coding"
A01 greenc1.0   exon    8919    9056    .   -   .   gene_id "Bra011902";transcript_id "PAC:22703627";transcript_biotype "protein_coding"
A01 greenc1.0   exon    9151    9413    .   -   .   gene_id "Bra011902";transcript_id "PAC:22703627_j.1";transcript_biotype "protein_coding"
A01 greenc1.0   exon    9151    9413    .   -   .   gene_id "Bra011902";transcript_id "PAC:22703627";transcript_biotype "protein_coding"

I want to replace a string in file 2 with another string coming from file 1. However, the text replacement is not working: nothing is replaced. What is going on?

This is the problematic piece of code:

if (defined $chompline4[1]) {
    $line =~ s/PAC:$transcript;/$transl.$coletilla;/ee;
} else {
    $line =~ s/PAC:$transcript;/$hash{$chompline4[0]};/ee;
}

You are using /ee on your substitution regex. That is horrible practice, and should only be used as a discouraging example of how Perl can be abused. Or encouraging examples of how Perl can be abused by skilled (?) programmers.

You do not need to use evaluation to interpolate variables. You do not need evaluation to concatenate strings. And you are evaluating twice, which means:

$foo = "bar";   # assume
s/.../$foo/ee   # before
s/.../bar/e     # first eval
s/.../bar/      # second eval - bareword becomes string, if you are lucky

The warning issued at the second eval is the same as if you had typed bar into your program:

Unquoted string "bar" may clash with future reserved word
Name "main::bar" used only once: possible typo
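For comparison, here is a minimal sketch of the same kind of replacement with no /e at all. The sample values and the sample line are made up for illustration. The point is only that the replacement side of s/// already interpolates like a double-quoted string, and that \Q...\E keeps regex metacharacters in an interpolated pattern (the . in _j.1, for example) literal.

```perl
use strict;
use warnings;

# Hypothetical sample values, not taken from the real input files.
my $transcript = '22703627_j.1';
my $transl     = 'Sb01g017490.1';
my $coletilla  = '_j.1';

my $line = qq{transcript_id "PAC:22703627_j.1";transcript_biotype "protein_coding"\n};

# The replacement side interpolates like a double-quoted string, so
# $transl$coletilla concatenates without any /e. \Q...\E escapes regex
# metacharacters in $transcript. The closing quote before the semicolon
# is matched explicitly and written back.
$line =~ s/PAC:\Q$transcript\E";/$transl$coletilla";/;

print $line;
```

Note that the pattern includes the closing double quote: in the sample data the ID is followed by "; and not by a bare semicolon.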

Unless you like your data broken, I would suggest using a different strategy. Your current code allows partial matches, removes delimiters, and does not take quotes into consideration where it should. I assume this is why you are mangling your data in preparation for this regex substitution.

I would give you some advice on how to fix the problem, but I do not feel I have enough valid information about what you are trying to do, except that it looks like you are attempting to replace, e.g.,

"Bra011902";transcript_id "PAC:22703627";transcript_biotype "protein_coding"

with

Sb01g017490.1   
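If the goal is narrower than that, namely to swap only the PAC number inside transcript_id for a name taken from the first file while keeping any _j/_o/_x suffix, one possible shape is a single substitution with one /e. This is a sketch under that assumption, not a confirmed reading of the question. The sub translate_line and the hash %name_for are hypothetical names, and the ID 22703627 is paired with Sb01g017490.1 purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical lookup table: PAC number => replacement name.
# In the real script this would be filled from the first file.
my %name_for = ( '22703627' => 'Sb01g017490.1' );

# Replace the PAC number inside transcript_id "PAC:...", keeping any
# _j/_o/_x suffix. Lines whose ID is not in the table come back unchanged.
sub translate_line {
    my ($line) = @_;
    $line =~ s{(transcript_id ")PAC:(\d+)((?:_[jox][^"]*)?)(")}
              { exists $name_for{$2} ? "$1$name_for{$2}$3$4" : $& }e;
    return $line;
}

print translate_line(qq{gene_id "Bra011902";transcript_id "PAC:22703627_j.1";transcript_biotype "protein_coding"\n});
print translate_line(qq{gene_id "Bra011902";transcript_id "PAC:22703627";transcript_biotype "protein_coding"\n});
```

The single /e is there only to choose between the translated name and the original text. The quotes are part of the pattern, and \d+ followed by the closing quote rules out partial matches on longer IDs.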

I've tried running your program, but it's difficult since I have no idea which spaces are spaces and which ones are tabs.

I suggest that instead of using random names such as chomplines, you use the actual field names. Also, use /\s+/ to break up your fields, which lets you handle tabs and spaces at the same time:

chomp $line; # You forgot it the first time around!
my ($f1, $f2, $f3, $f4, $f5, $f6) = split /\s+/, $line;   # ...and so on for the remaining fields

Where $f1, $f2, etc. are the actual names of the fields being split.

I personally would create a hash keyed by field names:

my @fields = qw(field1 field2 field3 field4 field5 field6); 

Then, split the line and create a hash out of it. (I'd love to figure out a way to do this with map):

my %values;
my @field_values = split /\s+/, $line;
for my $index ( 0 .. $#fields ) {
    # Pair each field name with the value at the same position.
    $values{ $fields[$index] } = $field_values[$index];
}
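One way to build the same hash without an explicit loop is a hash slice, which assigns a list of values to a list of keys in a single statement. A map version is shown next to it. The field names and the sample line are placeholders, as above.

```perl
use strict;
use warnings;

# Placeholder field names and a made-up sample line, for illustration only.
my @fields = qw(field1 field2 field3 field4 field5 field6);
my $line   = "A01\tgreenc1.0 exon  5409\t5518 .\n";

chomp $line;
my @field_values = split /\s+/, $line;

# Hash slice: pairs each name in @fields with the value at the same index.
my %values;
@values{@fields} = @field_values;

# The same thing with map, building a list of (name, value) pairs.
my %values_from_map = map { $fields[$_] => $field_values[$_] } 0 .. $#fields;

print "$_ = $values{$_}\n" for @fields;
```

The slice is the shorter of the two; the map form may be handier when the values need transforming on the way in.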

This will make your program a lot easier to debug and make it much easier to keep track of what is happening. Plus, by splitting once on whitespace with /\s+/, you don't have to worry about tabs and spaces getting mixed up, or about two tabs separating a single field. You will need a second split to separate the fields delimited by semicolons, but using field names instead of array positions will make things a lot easier to work with.
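A rough sketch of that second split follows. It assumes the attribute column is the ninth field and that each attribute has the form key "value", as in the sample lines from the question. Because the attribute column itself contains spaces, the first split is given a limit of 9 so the whole attribute string ends up in one piece. The names @columns and %attr are hypothetical.

```perl
use strict;
use warnings;

# Sample line modelled on the question's second file (tabs assumed between columns).
my $line = qq{A01\tgreenc1.0\texon\t8307\t8530\t.\t-\t.\tgene_id "Bra011902";transcript_id "PAC:22703627_j.1";transcript_biotype "protein_coding"\n};
chomp $line;

# First split: at most 9 pieces, so the attribute column keeps its inner spaces.
my @columns    = split /\s+/, $line, 9;
my $attributes = $columns[8];

# Second split: one piece per attribute, each of the form  key "value".
my %attr;
for my $pair ( split /;\s*/, $attributes ) {
    my ($key, $value) = $pair =~ /^\s*(\S+)\s+"([^"]*)"\s*$/ or next;
    $attr{$key} = $value;
}

print "$attr{transcript_id}\n";
```

With the attributes in a hash, the transcript ID can be looked up by name instead of by position in a chain of splits.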
