
Perl: doing a substitution inside the replacement part of another substitution

I am running a regex substitution on an HTML snippet in Perl.

This is how I match the wanted part: (class="p_hw"><a href=")(http://[^<>"]*?xxxx\.com\/[^<>"]*[=/])([^<>"]*)(">(?:<b>)?)(.*?)(?=<)

I need to replace the http:// URL with entry:// followed by a certain parameter value from the URL ($3 in this case) if that value exists in a hash (%hw_f); otherwise, the first word (or phrase) from $5 is used, provided it exists in %hw_f. If neither condition is met, the snippet stays unchanged.

I have tried the following:

s#(class="p_hw"><a href=")(http://[^<>"]*?xxxx\.com\/[^<>"]*[=/])([^<>"]*)(">(?:<b>)?)(.*?)(?=<)#
        my @n = split(/\,|;/, $5);
    my @m = map {s,^\s+|\s+$,,mgr} @n;
    my $new = $3 =~ s/^\s+|\s+$//mgr;
    my $new2 = $new =~ s/\+/ /mgr;
    exists $hw_f{$new2} ? "$1entry://$new2$4$5" : (exists $hw_f{$m[0]} ? "$1entry://$m[0]$4$5" : "$1$2$3$4$5") #eg;

%hw_f is the hash that both conditions are checked against.

It gives the following error:

Use of uninitialized value $1 in concatenation (.) or string

I need to derive a new value from $3 inside the substitution and then continue with that new value. How can I do that?

I'm not going to try to fix the logic of what you're trying to accomplish, because the approach is rather ill-advised. What I will do is offer some semantic and coding advice.

1: Use Regexp::Common and URI to deal with URLs. It is almost never worth it to write your own regexes. Parsing HTML with regex requires that you seriously know what you're doing. https://metacpan.org/search?q=regexp%3A%3Acommon

2: Always use {} or // to delimit a regex. (A 99% rule)

3: Always immediately copy the numbered variables into meaningfully named my() variables unless the expression is trivial.

4: Modify arrays in place with a postfix foreach.

5: Spread out the code formatting to make it visually appealing.

6: Use sprintf for complicated variable recombinations. It makes it a lot easier to see what variable is used where and for what.

HTH

#  1                        2                                     3        4           5
s{(class="p_hw"><a href=\")(http://[^<>"]*?xxxx\.com/[^<>"]*[=/])([^<>\"]*)(\">(?:<b>)?)(.*?)(?=<)}{
    my ($m1, $m2, $m3, $m4, $m5) = ($1, $2, $3, $4, $5);
    my @n = split /[,;]/, $m5;
    s/^\s+|\s+$//mg foreach @n;
    (my $new = $m3) =~ s/^\s+|\s+$//mg;
    (my $new2 = $new) =~ s/\+/ /g;
    exists $hw_f{$new2} ?
        sprintf("%sentry://%s%s%s", $m1, $new2, $m4, $m5) :
    exists $hw_f{$n[0]} ?
        sprintf("%sentry://%s%s%s", $m1, $n[0], $m4, $m5) :
        "$m1$m2$m3$m4$m5";
}ige;
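
For anyone who wants to try the advice above on real input, here is a minimal self-contained sketch. The sample lines, the contents of %hw_f, and the helper names trim and rewrite_link are all made up for illustration; they are not from the original post. It uses plain string interpolation instead of sprintf, but follows the same idea of copying the captures first.

```perl
use strict;
use warnings;

# Hypothetical lookup table and input lines, made up for illustration.
my %hw_f = map { $_ => 1 } ('ab ovo', 'abacus');

my @lines = (
    '<span class="p_hw"><a href="http://www.xxxx.com/find?q=ab+ovo"><b>ab ovo, from the egg</b></a>',
    '<span class="p_hw"><a href="http://www.xxxx.com/unit/123"><b>abacus; abaci</b></a>',
    '<span class="p_hw"><a href="http://www.xxxx.com/unit/456"><b>unknown</b></a>',
);

# Hypothetical helper: strip leading and trailing whitespace.
sub trim { my ($s) = @_; $s =~ s/^\s+|\s+$//g; return $s }

# Hypothetical helper: apply the substitution to one line and return the result.
sub rewrite_link {
    my ($line) = @_;
    $line =~ s{(class="p_hw"><a href=")(http://[^<>"]*?xxxx\.com/[^<>"]*[=/])([^<>"]*)(">(?:<b>)?)(.*?)(?=<)}{
        # Copy the captures before running any other regex.
        my ($pre, $url, $param, $mid, $text) = ($1, $2, $3, $4, $5);
        (my $key = trim($param)) =~ tr/+/ /;
        my ($first) = map { trim($_) } split /[,;]/, $text;
        $first = '' unless defined $first;
        exists $hw_f{$key}   ? "${pre}entry://$key$mid$text"
      : exists $hw_f{$first} ? "${pre}entry://$first$mid$text"
      :                        "$pre$url$param$mid$text";
    }eg;
    return $line;
}

print rewrite_link($_), "\n" for @lines;
```

With this sample data, the first line is rewritten from the URL parameter, the second from the first phrase of the link text, and the third is left unchanged.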

Update:

while (<DICT>) {
    s#(class="p_hw"><a href=")(http://[^<>"]*?wordinfo\.info\/[^<>"]*[=/])([^<>"]*)(">(?:<b>)?)(.*?)(?=<)#
        my $one   = $1;
        my $two   = $2;
        my $three = $3;
        my $four  = $4;
        my $five  = $5;
        my @n     = split(/\,|;/, $five);
        my @m     = map {s,^\s+|\s+$,,mgr} @n;
        my $new   = $three =~ s/^\s+|\s+$//mgr;
        my $new2  = $new =~ s/\+/ /mgr;
        exists $hw_f{$new2} ? $one."entry://$new2$four$five" : (exists $hw_f{$m[0]} ? $one."entry://$m[0]$four$five" : "$one$two$three$four$five") #eg;

    print $FH $_;
}

Assigning all the capture variables to lexical variables before any further regex engine invocation, as @DavidO mentioned in the comments, finally made it work. Thanks.
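
To see why copying first matters, here is a small sketch with a made-up input string ($text and the other variable names are hypothetical). A successful regex operation inside the replacement code becomes the "last successful match", so $1 then refers to that inner match, which has no capture groups. The broken variant is run with a warning handler so the script keeps going.

```perl
use strict;
use warnings;

my $text = 'key= a+b ';    # made-up sample input

# Broken: the inner s///r succeeds and resets the capture variables,
# so $1 is expected to be undefined when it is interpolated afterwards.
my @warnings;
{
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    (my $broken = $text) =~ s{(\w+)=(.*)}{
        my $v = $2 =~ s/^\s+|\s+$//gr;
        "$1=$v";
    }e;
}
print "warnings from the broken variant: ", scalar(@warnings), "\n";

# Working: copy $1 and $2 first, then run any further regex operations.
(my $fixed = $text) =~ s{(\w+)=(.*)}{
    my ($name, $value) = ($1, $2);
    $value =~ s/^\s+|\s+$//g;    # trim
    $value =~ tr/+/ /;           # '+' becomes a space
    "$name=$value";
}e;

print "$fixed\n";
```

The /r modifier used in the broken variant needs Perl 5.14 or later.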

It is not obvious from your post what you are trying to achieve. If you described the problem in the following format, it would be easier to understand:

--- Example -----------------------

I extract from a web page a snippet containing <a href="http://....... which I would like to transform into the following format: <a href="http://....... .

That way, at least we know what the INPUT is and what OUTPUT is expected.

--- End of the example ------------

When you use a regex with capture groups, it is easier to store the captured values in an array or, better, a hash:

use strict;
use warnings;

use Data::Dumper;

my %href;

my $data = shift;

if( $data =~ m{<a href="(\w+)://([\w.-]+)/([\w.-]+)/(.+)">([^<]+)<} ) {
    @href{qw(protocol dns dir rest desc)} = ($1,$2,$3,$4,$5);
    print Dumper(\%href);
} else {
    print "No match found\n";
}
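
A possible variation on the same idea is to use named captures (available since Perl 5.10), so the hash keys live in the pattern itself. This is only a sketch: the input string is made up, it assumes ordinary forward-slash URLs, and the group names are my own choice.

```perl
use strict;
use warnings;

# Hypothetical input, made up for illustration.
my $data = '<a href="http://example.com/dir/page.html?x=1">Example page</a>';

my %href;
if ( $data =~ m{<a href="(?<protocol>\w+)://(?<host>[\w.-]+)/(?<dir>[\w.-]+)/(?<rest>[^"]+)">(?<desc>[^<]+)<} ) {
    %href = %+;    # copy the named captures before any other regex runs
    print "$_ = $href{$_}\n" for sort keys %href;
}
else {
    print "No match found\n";
}
```

Copying %+ into a lexical hash serves the same purpose as copying $1..$5: the values survive later regex operations.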

