How can I remove external links from HTML using Perl?

Question

I am trying to remove external links from an HTML document while keeping the anchors, but I'm not having much luck. The following regex

$html =~ s/<a href=".+?\.htm">(.+?)<\/a>/$1/sig;

will match from the beginning of an anchor tag to the end of an external link tag, e.g.

<a HREF="#FN1" name="01">1</a>
some other html
<a href="155.htm">No. 155
</a> <!-- end tag not necessarily on the same line -->

so I end up with nothing rather than

<a HREF="#FN1" name="01">1</a>
some other html

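As it happens, all of the anchors have their href attribute in uppercase, so I know I could do a case-sensitive match, but I don't want to rely on that always being the case in future.

Is there something I can change so that it only matches within a single a tag?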

Answer 1

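In response to Chris Lutz's comment, I hope the following shows that it is really very straightforward to use a parser (especially if you want to be able to deal with input you have not yet seen, such as <a class="external" href="...">) rather than putting together a fragile solution using s///.

If you are going to take the s/// route, at least be honest: do depend on the href attribute being all caps instead of putting up an illusion of flexibility.

Edit: By popular demand ;-), here is the version using HTML::TokeParser::Simple. See the edit history for the version using only HTML::TokeParser.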

#!/usr/bin/perl

use strict; use warnings;
use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new(\*DATA);

while ( my $token = $parser->get_token ) {
    if ($token->is_start_tag('a')) {
        my $href = $token->get_attr('href');
        if (defined $href and $href !~ /^#/) {
            print $parser->get_trimmed_text('/a');
            $parser->get_token; # discard </a>
            next;
        }
    }
    print $token->as_is;
}

__DATA__
<a HREF="#FN1" name="01">1</a>
some other html
<a href="155.htm">No. 155
</a> <!-- end tag not necessarily on the same line -->
<a class="external" href="http://example.com">An example you
might not have considered</a>

<p>Maybe you did not consider <a
href="test.html">click here >>></a>
either</p>

Output:

C:\Temp> hjk
<a HREF="#FN1" name="01">1</a>
some other html
No. 155 <!-- end tag not necessarily on the same line -->
An example you might not have considered

<p>Maybe you did not consider click here >>>
either</p>

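NB: The regex-based solution you checked as "correct" breaks if the linked files have the .html extension rather than .htm. Given that, I find your concern about not relying on the uppercase HREF attribute unwarranted. If you really want quick and dirty, don't bother with anything else: rely on the all-caps HREF and be done with it. If, however, you want your code to work with a wider variety of documents and for a longer time, you should use a proper parser.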
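The .html caveat can be checked with a few lines of core Perl. This is a minimal sketch that assumes the pattern from the question stands in for the accepted regex (which is not shown in this thread); the variable names are illustrative only.

```perl
use strict;
use warnings;

# A link to a .html file, as in the sample data above.
my $html = '<a href="test.html">click here</a>';

# Pattern from the question: it requires the href to end in .htm
# immediately before the closing quote.
(my $out = $html) =~ s/<a href=".+?\.htm">(.+?)<\/a>/$1/sig;

# ".htm" is followed by "l" rather than '">', so nothing matches.
print $out eq $html ? "link left in place\n" : "link removed\n";
```

Run as written, this prints "link left in place": the substitution never matches, so the link survives.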

Answer 2

HTML::Parser is more of a SAX-type parser:

use strict;
use warnings;

use English qw<$OS_ERROR>;
use HTML::Parser;
use List::Util qw<first>;

# The input file is taken from the command line.
my $path_to_file = shift @ARGV
    or die "Usage: $0 file.html\n";

# True while we are inside an <a> element whose start tag was dropped.
my $omitted;

sub tag_handler {
    my ( $self, $tag_name, $text, $attr_hashref ) = @_;
    if ( $tag_name eq 'a' ) {
        my $href = first {; defined } @$attr_hashref{ qw<href HREF> };
        # Anchors without an href are never treated as external.
        $omitted = defined $href && substr( $href, 0, 7 ) eq 'http://';
        return if $omitted;
    }
    print $text;
}

sub end_handler {
    my ( $tag_name, $text ) = @_;
    if ( $tag_name eq 'a' && $omitted ) {
        # Drop the </a> that closes an omitted start tag.
        $omitted = 0;
        return;
    }
    print $text;
}

my $parser
    = HTML::Parser->new( api_version => 3
                       , default_h   => [ sub { print shift; }, 'text' ]
                       , start_h     => [ \&tag_handler, 'self,tagname,text,attr' ]
                       , end_h       => [ \&end_handler, 'tagname,text' ]
                       );
$parser->parse_file( $path_to_file ) or die $OS_ERROR;

Answer 3

Another solution. I love HTML::TreeBuilder and family.

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

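my $root = HTML::TreeBuilder->new_from_file(\*DATA);
foreach my $link ($root->find_by_tag_name('a')) {
    my $href = $link->attr('href');
    # Keep anchors without an href and in-page (#fragment) links.
    next if !defined $href or $href =~ /^#/;
    # Replace the <a> element with its own children.
    $link->replace_with_content;
}
print $root->as_HTML(undef, "\t");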

__DATA__
<a HREF="#FN1" name="01">1</a>
some other html
<a href="155.htm">No. 155
</a> <!-- end tag not necessarily on the same line -->
<a class="external" href="http://example.com">An example you
might not have considered</a>

<p>Maybe you did not consider <a
href="test.html">click here >>></a>
either</p>

Answer 4

Why not just remove only the links whose href attribute does not begin with a pound sign? Like this:

$html =~ s/<a href="[^#][^"]*?">(.+?)<\/a>/$1/sig;
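To see what this substitution does, here is a minimal sketch that runs it against the sample input from the question. The variable names are illustrative only.

```perl
use strict;
use warnings;

# Sample input from the question.
my $html = <<'END_HTML';
<a HREF="#FN1" name="01">1</a>
some other html
<a href="155.htm">No. 155
</a> <!-- end tag not necessarily on the same line -->
END_HTML

# [^#] rejects hrefs that start with '#', and [^"]*? cannot run past
# the closing quote, so the match stays inside a single start tag.
(my $stripped = $html) =~ s/<a href="[^#][^"]*?">(.+?)<\/a>/$1/sig;

print $stripped;
```

The #FN1 anchor is left untouched and the 155.htm link is reduced to its text. Note that the pattern expects href to be the first and only attribute, so a tag such as <a class="external" href="..."> would not be matched.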

Answer 5

Simpler still, if you don't care about the tag attributes:

$html =~ s/<a[^>]+>(.+?)<\/a>/$1/sig;
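As a quick check of how this behaves, the sketch below applies the substitution to part of the question's sample input. The variable names are illustrative only.

```perl
use strict;
use warnings;

# Part of the sample input from the question.
my $html = <<'END_HTML';
<a HREF="#FN1" name="01">1</a>
some other html
<a href="155.htm">No. 155
</a>
END_HTML

# <a[^>]+> matches any start tag beginning with "<a", whatever its attributes.
(my $stripped = $html) =~ s/<a[^>]+>(.+?)<\/a>/$1/sig;

print $stripped;
```

Because the attributes are ignored, every a element is unwrapped, including the in-page #FN1 anchor that the question wanted to keep.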

5 answers

Answer 1 11 2009-10-21 01:24:27

Answer 2 6 2009-10-21 02:14:46

Answer 3 1 2009-10-22 20:13:34

Answer 4 0 2009-10-21 00:24:47

Answer 5 0 2016-11-24 13:15:45