
Perl regex seems to get into infinite loop

I'm trying to figure out why this code won't run on some sites. Here is a working version:

my $url = "http://www.bbc.co.uk/news/uk-36263685";

`curl -L '$url' > ./foo.txt`;

my $html;
open (READPAGE,"<:encoding(UTF-8)","./foo.txt");
    $html = join "\n", <READPAGE>;
close(READPAGE);

# works ok with the BBC page, and almost all others
my $head;
while( $html =~ m/<head.*?>(.*?)<\/head>/gis ) {
   print qq|FOO: got header...\n|;
}

...and then this broken version just seems to lock up (exactly the same code, just a different URL):

my $url = "http://www.sport.pl/euro2016/1,136510,20049098,euro-2016-polsat-odkryl-karty-24-mecze-w-kanalach-otwartych.html";

`curl -L '$url' > ./foo.txt`;

my $html;
open (READPAGE,"<:encoding(UTF-8)","./foo.txt");
    $html = join "\n", <READPAGE>;
close(READPAGE);

# Locks up with this regex. Just seems to be some pages it does it on
my $head;
while( $html =~ m/<head.*?>(.*?)<\/head>/gis ) {
   print qq|FOO: got header...\n|;
}

I can't work out what's going on with it. Any ideas?

Thanks!

UPDATE: For anyone interested, I ended up moving away from the Perl module I was using to extract the info, and went for a more robust HTML::Parser method. Here is the module, if anyone wants to use it as a base:

 package MetaExtractor;
 use base "HTML::Parser";
 use Data::Dumper;

 sub start {
     my ($self, $tag, $attr, $attrseq, $origtext) = @_;

     if ($tag eq "img") {
         #print Dumper($tag,$attr);

         if ($attr->{src} =~ /\.(jpe?g|png)/i) {
            $attr->{src} =~ s|^//|http://|i; # fix urls like //foo.com
            push @{$Links::COMMON->{images}}, $attr->{src};
         }
     }

     if ($tag =~ /^meta$/i &&  $attr->{'name'} =~ /^description$/i) {
         # set if we find <META NAME="DESCRIPTION"
         $Links::COMMON->{META}->{description} = $attr->{'content'};
     } elsif ($tag =~ /^title$/i && !$Links::COMMON->{META}->{title}) {
         $Links::COMMON->{META}->{title_flag} = 1;
     } elsif ($tag =~ /^meta$/i && $attr->{'property'} =~ /^og:description$/i) {
         $Links::COMMON->{META}->{og_desc} = $attr->{content}
     } elsif ($tag =~ /^meta$/i && $attr->{'property'} =~ /^og:image$/i) {
         $Links::COMMON->{META}->{og_image} = $attr->{content}
     } elsif ($tag =~ /^meta$/i && $attr->{'name'} =~ /^twitter:description$/i) {
         $Links::COMMON->{META}->{tw_desc} = $attr->{content}
     } elsif ($tag =~ /^meta$/i && $attr->{'name'} =~ /^twitter:image:src$/i) {
         $Links::COMMON->{META}->{tw_image} = $attr->{content}
     }
 }

 sub text {
     my ($self, $text) = @_;
     # If we're inside <TITLE>...</TITLE>, save the text
     if ($Links::COMMON->{META}->{title_flag}) { $Links::COMMON->{META}->{title} .= $text; }
 }

 sub end {
     my ($self, $tag, $origtext) = @_;

     #print qq|END TAG: '$tag'\n|;

     # reset the flag when we see </TITLE>
     if ($tag =~ /^title$/i) { $Links::COMMON->{META}->{title_flag} = 0; }
 }

It will extract:

- Title
- Meta description (not meta keywords, but that's simple enough to add)
- FB image
- FB description
- Twitter image
- Twitter description
- All the images found (it doesn't do anything too fancy with them, e.g. for pages that have relative URLs, but I'm going to have a play with that as time permits)

Simply call with:

        my $html;
        open (READPAGE,"<:encoding(UTF-8)","/home/aycrca/public_html/cgi-bin/admin/tmp/$now.txt");

            my $p = new MetaExtractor;
            while (<READPAGE>) {
                $p->parse($_);
            }
            $p->eof;

        close(READPAGE);

It isn't an infinite loop, it is just slow. It is finding <header> tags too, and for each one it has to go through the rest of the file looking for an ending </head> tag (which isn't there). Change it to:

`m/<head\b.*?>(.*?)<\/head>/gis`

The problem seems to be exacerbated by treating the non-UTF-8 file as UTF-8.
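To illustrate why the word boundary matters, here is a minimal sketch that counts how many opening tags each version of the pattern treats as a candidate `<head>`. The sample page is made up for the demonstration, not taken from the sites in the question.

```perl
use strict;
use warnings;

# Hypothetical sample page: one real <head> plus several <header> elements.
my $html = join "\n",
    '<html><head lang="en"><title>Demo</title></head>',
    '<body>',
    '<header>Top banner</header>',
    '<header class="sub">Section banner</header>',
    '</body></html>';

# Count the opening tags each pattern accepts as a starting point.
my $loose  = () = $html =~ /<head.*?>/gis;    # also accepts <header ...>
my $strict = () = $html =~ /<head\b.*?>/gis;  # accepts only <head ...>

print "without \\b: $loose candidate tags\n";
print "with \\b:    $strict candidate tag\n";
```

Every extra candidate is another place where the original regex starts a search for `</head>`, so cutting the candidates down to the real `<head>` tag removes the wasted scans.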

You have found an instance of catastrophic backtracking (q.v.).

Even for those sites for which your regex pattern works, the matching will be very lengthy and CPU-intensive. You should avoid `.*?` where possible and use a negated character class instead.

If you use this, all should be well:

$html =~ m| <head\b[^<>]*> (.*) </head> |gisx

<head.*?> is supposed to match just one HTML tag, but there is nothing to prevent the regex engine from searching right to the end of the file. Changing this to <head[^<>]*> will only allow it to match non-angle-brackets after head, which will be only a few characters, if any.
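A small sketch of that difference, using a made-up fragment: when something later in the pattern fails to match, the lazy dot is free to stretch the "tag" across further tags, while the negated character class cannot cross an angle bracket.

```perl
use strict;
use warnings;

# Hypothetical fragment: <title> does not directly follow the <head> tag.
my $html = '<head lang="en"><meta charset="utf-8"><title>Demo</title></head>';

# With .*? the tag part may grow past the first '>' so that the rest of
# the pattern (<title>) can succeed.
my ($lazy_tag) = $html =~ /(<head.*?>)<title>/is;

# With [^<>]* the tag part must end at the first '>', so this pattern
# cannot match here at all.
my ($class_tag) = $html =~ /(<head\b[^<>]*>)<title>/is;

print "lazy dot:      ", $lazy_tag  // '(no match)', "\n";
print "negated class: ", $class_tag // '(no match)', "\n";
```

The lazy version reports the `<head>` tag and the `<meta>` tag together as one "tag", which is the kind of over-reach that lets the engine wander through the whole file.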

The captured expression is less simple, as you presumably want to match tags contained within the <head> element, so the negated character class won't work there. However, catastrophic backtracking is almost always the result of multiple wildcards acting simultaneously, so that every possible match from one wildcard must be combined with every possible match from another, resulting in exponential complexity. With just one wildcard left, the regex should work fine.

Note also that I have used an alternative delimiter for the regex so that the slash doesn't need to be escaped, and I have added a word boundary \b after <head to prevent it from matching <header or similar.
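Putting those pieces together, a minimal sketch might look like the following. `extract_head` is a hypothetical helper name and the sample page is invented. The sketch also uses a lazy `(.*?)` capture so that it stops at the first `</head>`, which is a choice made here rather than something the pattern above requires.

```perl
use strict;
use warnings;

# extract_head() is a hypothetical helper, not part of any module.
# It returns the text between <head ...> and </head>, or undef if the
# page has no <head> element.
sub extract_head {
    my ($html) = @_;
    # | as delimiter avoids escaping the slash; /x lets us space it out.
    return $html =~ m| <head\b [^<>]* > (.*?) </head> |isx ? $1 : undef;
}

my $page = join "\n",
    '<html><head lang="en">',
    '<title>Demo</title>',
    '</head><body><header>Banner</header></body></html>';

my $head = extract_head($page);
print defined $head ? "head content: [$head]\n" : "no <head> found\n";
```

For anything beyond a quick extraction like this, a real parser such as HTML::Parser (as in the question's update) is the more robust route.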
