How can I extract text between tags using HTML::Parser?
I need to parse some data off web pages. How do I extract text between tags using HTML::Parser?
Consider the following sample code:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;
use Data::Dumper;

my $find_title = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ($tag, $attr) = @_;
            print Dumper \@_;
        },
        'tag'
    ],
);

my $html = join '',
    "<html><head><title>Extract me!</title></head><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

$find_title->report_tags('title');
$find_title->parse($html);
How do I fix this so I can extract the title? As written, it only extracts the tag.
You need a text_h handler to collect the text, and an end_h handler to do something when the </title> tag appears (at which point the text inside the tag has been collected).
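As a minimal sketch of that approach (not necessarily the only way to wire it up), the three handlers can share a flag that records whether the parser is currently inside the title element. The variable names $in_title and $title are illustrative choices, and the 'tagname' and 'dtext' argspecs are used here instead of the question's 'tag' so that start and end tags compare against the same string and entities arrive decoded.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

my $html = join '',
    "<html><head><title>Extract me!</title></head><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

my $in_title = 0;     # true while we are between <title> and </title>
my $title    = '';    # accumulated title text (illustrative name)

my $parser = HTML::Parser->new(
    api_version => 3,
    # Opening tag: start collecting once <title> is seen.
    start_h => [ sub { $in_title = 1 if $_[0] eq 'title' }, 'tagname' ],
    # Text may arrive in several chunks, so append rather than assign.
    text_h  => [ sub { $title .= $_[0] if $in_title }, 'dtext' ],
    # Closing tag: the text is complete at this point.
    end_h   => [ sub { $in_title = 0 if $_[0] eq 'title' }, 'tagname' ],
);

$parser->parse($html);
$parser->eof;    # flush anything still buffered

print "$title\n";    # prints: Extract me!
```

The link text (foo, bar, baz) also reaches the text handler, but it is ignored because $in_title is false by then.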
HTML::Parser is a fairly low-level module; you may be happier with one of the many modules built on top of it, like HTML::TreeBuilder or HTML::TokeParser.
For example, HTML::HeadParser makes extracting the title trivial:
use strict;
use warnings;
use HTML::HeadParser;

my $html = join '',
    "<html><head><title>Extract me!</title></head><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

my $p = HTML::HeadParser->new;
$p->parse($html);

# The <title> element is exposed as the 'Title' pseudo-header.
my $title = $p->header('Title');
print "$title\n";
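HTML::TokeParser, mentioned above, offers a similar shortcut when the text you want is not in the document head. The following is a rough sketch assuming the same sample document; it passes a reference to the HTML string so the module parses the string itself rather than treating it as a file name.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = join '',
    "<html><head><title>Extract me!</title></head><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

# A scalar reference is treated as the literal document to parse.
my $p = HTML::TokeParser->new(\$html);

my $title = '';
# get_tag skips ahead to the next <title> start tag, if there is one.
if ($p->get_tag('title')) {
    # Text up to the next tag, with surrounding whitespace trimmed.
    $title = $p->get_trimmed_text;
}

print "$title\n";    # prints: Extract me!
```

The same pattern works for other elements: calling get_tag('a') in a loop followed by get_trimmed_text would walk the link texts in turn.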