Perl regex seems to get into infinite loop

I'm trying to figure out why this code won't run on some sites. Here is a working version:

my $url = "http://www.bbc.co.uk/news/uk-36263685";

`curl -L '$url' > ./foo.txt`;

my $html;
open (READPAGE,"<:encoding(UTF-8)","./foo.txt");
    $html = join "\n", <READPAGE>;
close(READPAGE);

# works ok with the BBC page, and almost all others
my $head;
while( $html =~ m/<head.*?>(.*?)<\/head>/gis ) {
   print qq|FOO: got header...\n|;
}

..and then this broken version just seems to lock up (exactly the same code, just a different URL):

my $url = "http://www.sport.pl/euro2016/1,136510,20049098,euro-2016-polsat-odkryl-karty-24-mecze-w-kanalach-otwartych.html";

`curl -L '$url' > ./foo.txt`;

my $html;
open (READPAGE,"<:encoding(UTF-8)","./foo.txt");
    $html = join "\n", <READPAGE>;
close(READPAGE);

# Locks up with this regex. Just seems to be some pages it does it on
my $head;
while( $html =~ m/<head.*?>(.*?)<\/head>/gis ) {
   print qq|FOO: got header...\n|;
}

I can't work out what's going on with it. Any ideas?

Thanks!

UPDATE: For anyone interested, I ended up moving away from the Perl module I was using to extract the info, and went for a more robust HTML::Parser method. Here is the module, if anyone wants to use it as a base:

 package MetaExtractor;
 use base "HTML::Parser";
 use Data::Dumper;

 sub start {
     my ($self, $tag, $attr, $attrseq, $origtext) = @_;

     if ($tag eq "img") {
         #print Dumper($tag,$attr);

         if ($attr->{src} =~ /\.(jpe?g|png)/i) {
            $attr->{src} =~ s|^//|http://|i; # fix urls like //foo.com
            push @{$Links::COMMON->{images}}, $attr->{src};
         }
     }

     if ($tag =~ /^meta$/i &&  $attr->{'name'} =~ /^description$/i) {
         # set if we find <META NAME="DESCRIPTION"
         $Links::COMMON->{META}->{description} = $attr->{'content'};
     } elsif ($tag =~ /^title$/i && !$Links::COMMON->{META}->{title}) {
         $Links::COMMON->{META}->{title_flag} = 1;
     } elsif ($tag =~ /^meta$/i && $attr->{'property'} =~ /^og:description$/i) {
         $Links::COMMON->{META}->{og_desc} = $attr->{content}
     } elsif ($tag =~ /^meta$/i && $attr->{'property'} =~ /^og:image$/i) {
         $Links::COMMON->{META}->{og_image} = $attr->{content}
     } elsif ($tag =~ /^meta$/i && $attr->{'name'} =~ /^twitter:description$/i) {
         $Links::COMMON->{META}->{tw_desc} = $attr->{content}
     } elsif ($tag =~ /^meta$/i && $attr->{'name'} =~ /^twitter:image:src$/i) {
         $Links::COMMON->{META}->{tw_image} = $attr->{content}
     }
 }

 sub text {
     my ($self, $text) = @_;
     # If we're in <H1>...</H1> or  <TITLE>...</TITLE>, save text
     if ($Links::COMMON->{META}->{title_flag}) { $Links::COMMON->{META}->{title} .= $text; }
 }

 sub end {
     my ($self, $tag, $origtext) = @_;

     #print qq|END TAG: '$tag'\n|;

     # reset appropriate flag if we see </H1> or </TITLE>
     if ($tag =~ /^title$/i) { $Links::COMMON->{META}->{title_flag} = 0; }
 }

 1;

It will extract:

Title
Meta description (not meta keywords, but it's simple enough to add)
FB image
FB description
Twitter image
Twitter description
All the images found (it doesn't do anything too fancy with them, e.g. pages that have relative URLs, but I'm going to have a play with that as time permits)

Simply call with:

        open (READPAGE, "<:encoding(UTF-8)", "/home/aycrca/public_html/cgi-bin/admin/tmp/$now.txt");

        my $p = MetaExtractor->new;
        while (<READPAGE>) {
            $p->parse($_);
        }
        $p->eof;

        close(READPAGE);
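Once eof has been called, everything the handlers collected is sitting in the $Links::COMMON package variable, so a quick way to check the results (illustration only) is to dump it:

        use Data::Dumper;

        # Everything the handlers collected lives under $Links::COMMON
        print Dumper($Links::COMMON->{META});
        printf "%d images found\n", scalar @{ $Links::COMMON->{images} || [] };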

It isn't an infinite loop, it is just slow. It is finding <header> tags too, and for each one it has to go through the rest of the file looking for a closing </head> tag (which isn't there). Change it to:

`m/<head\b.*?>(.*?)<\/head>/gis`
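For example, dropping that back into the loop from the question (the \b is the only change):

while ( $html =~ m/<head\b.*?>(.*?)<\/head>/gis ) {
    # \b stops the pattern matching <header> (or <heading>) start tags,
    # so the engine no longer scans to the end of the file looking for
    # a closing </head> that never arrives
    print qq|FOO: got header...\n|;
}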

The problem also seems to be exacerbated by treating the non-UTF-8 file as UTF-8.
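One rough way around that (a sketch only: it assumes the page declares its charset somewhere in the markup, and the one-line sniff below is deliberately naive) is to read the bytes raw and decode them explicitly:

use Encode qw(decode);

# Slurp the downloaded page as raw bytes instead of forcing a UTF-8 layer
open my $fh, '<:raw', './foo.txt' or die "Can't open ./foo.txt: $!";
my $bytes = do { local $/; <$fh> };
close $fh;

# Naive charset sniff; falls back to UTF-8 if nothing is declared
my ($charset) = $bytes =~ /charset=["']?([\w-]+)/i;
my $html = decode($charset || 'UTF-8', $bytes);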

You have found an instance of catastrophic backtracking (qv)

Even for those sites where your regex pattern works, the matching will be very slow and CPU-intensive. You should avoid the .*? where possible and use a negated character class instead.

If you use this, all should be well

$html =~ m| <head\b[^<>]*> (.*) </head> |gisx

<head.*?> is supposed to match just one HTML tag, but there is nothing to prevent the regex engine from searching right to the end of the file. Changing it to <head[^<>]*> only allows it to match non-angle-bracket characters after head, which will be only a few characters, if any.

The captured expression is less simple, as you presumably want to match tags contained within the <head> element, so a negated character class won't work there. However, catastrophic backtracking is almost always the result of multiple wildcards acting simultaneously, so that every possible match from one wildcard must be tried against every possible match from another, resulting in exponential complexity. With just one wildcard left, the regex should work fine.

Note also that I have used an alternative delimiter for the regex so that the slash doesn't need to be escaped, and I have added a word boundary \b after <head to prevent it from matching <header or similar.
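A quick illustration of what the \b buys you:

# Only the first two tags match: in <header ...> the word boundary fails
# between the "d" and the "e", so the pattern skips it
for my $tag ( '<head>', '<head lang="en">', '<header class="top">' ) {
    print "$tag matches\n" if $tag =~ /<head\b[^<>]*>/i;
}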
