How do I extract text between tags using HTML::Parser?

Question

I need to parse some data off web pages. How do I extract text between tags using HTML::Parser?

Consider the following sample code:

#!/usr/bin/perl

use strict;
use warnings;

use HTML::Parser;
use Data::Dumper;

my $find_title = HTML::Parser->new(
    api_version => 3,
    start_h => [ 
        sub {
             my ($tag, $attr) = @_;
             print Dumper \@_;
            }, 
        'tag'
               ],
  );

my $html = join '',
    "<html><head><title>Extract me!</title></head><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

$find_title->report_tags('title');
$find_title->parse($html);

How do I fix this so I can extract the title? As written, it only extracts the tag.

Answer 1

You need a text_h handler to collect the text, and an end_h handler to act when the </title> tag appears (at which point the text inside the tag has been collected).
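A minimal sketch of that approach, assuming you only care about the title element. The $in_title flag and the $title accumulator are illustrative names, not part of HTML::Parser; the start_h handler sets the flag, text_h appends text while it is set, and end_h clears it:

```perl
use strict;
use warnings;

use HTML::Parser;

my $in_title = 0;     # true while the parser is between <title> and </title>
my $title    = '';    # accumulates the text seen inside the title element

my $parser = HTML::Parser->new(
    api_version => 3,
    # 'tagname' passes the lowercased element name without any leading '/'
    start_h => [ sub { $in_title = 1 if $_[0] eq 'title' }, 'tagname' ],
    # 'dtext' passes the text with entities decoded; text may arrive in
    # several chunks, so append rather than assign
    text_h  => [ sub { $title .= $_[0] if $in_title }, 'dtext' ],
    end_h   => [ sub { $in_title = 0 if $_[0] eq 'title' }, 'tagname' ],
);

my $html = '<html><head><title>Extract me!</title></head>'
         . '<body><a href="http://foo.com">foo</a></body></html>';

$parser->parse($html);
$parser->eof;    # signal end of document so buffered text is flushed

print "$title\n";
```

If you want to react as soon as the title is complete, the end_h callback is the place to do it, since by then all of the title text has been appended.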

HTML::Parser is a fairly low-level module; you may be happier with one of the many modules built on top of it, like HTML::TreeBuilder or HTML::TokeParser.

For example, HTML::HeadParser makes extracting the title trivial:

use strict;
use warnings;

use HTML::HeadParser;

my $html = join '',
    "<html><head><title>Extract me!</title></head><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

my $p = HTML::HeadParser->new;
$p->parse($html);

my $title = $p->header('Title');
print "$title\n";
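For comparison, here is a rough sketch of the same extraction with HTML::TokeParser, one of the higher-level modules mentioned earlier. It assumes the document is already in a string and that the first title element is the one you want; the shortened $html value is just sample input:

```perl
use strict;
use warnings;

use HTML::TokeParser;

my $html = '<html><head><title>Extract me!</title></head><body></body></html>';

# A reference to a scalar is parsed as a literal document, not a filename.
my $p = HTML::TokeParser->new(\$html);

my $title;
# get_tag skips ahead to the next <title> start tag, if there is one.
if ($p->get_tag('title')) {
    # get_trimmed_text returns the text up to the next tag, whitespace-trimmed.
    $title = $p->get_trimmed_text;
}

print "$title\n" if defined $title;
```

Unlike the callback style of HTML::Parser, this pulls tokens on demand, so no state flag is needed to know when you are inside the element.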

Answer 1: 1 accepted, 2010-12-27 07:55:26