WWW::Mechanize not handling apostrophes or dashes
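I have been trying to extract information from Metacritic, but I have run into a problem: text containing apostrophes or dashes is not extracted correctly.
The following code demonstrates the problem: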
use WWW::Mechanize;
$reviewspage = 'http://www.metacritic.com/movie/a-band-called-death/critic-reviews';
$Review = 'In the end Death triumphs, but its allure and obsession remain a mystery.';
$l = WWW::Mechanize->new();
$l->get($reviewspage);
$k = $l->content;
@Review = $k =~ m{$Review.*?<div class="review_body">(.*?)</div>}s;
print "@Review\n";
Output:
Too much of the doc takes our taste for granted; Alice Cooper, Henry Rollins and others won’t persuade you that Death could have been huge, nor does a clichéd last-act reunion show. But the film’s alternating inquiry — into family love, slow compromise and, yes, death — resonates strongly.
Even though the markup on the site is:
<div class="review_body">
Too much of the doc takes our taste for granted; Alice Cooper, Henry Rollins and others won’t persuade you that Death could have been huge, nor does a clichéd last-act reunion show. But the film’s alternating inquiry — into family love, slow compromise and, yes, death — resonates strongly.
</div>
I have written similar scripts with WWW::Mechanize before, but none of them replaced characters like this.
Metacritic serves its pages in the UTF-8 character set:
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
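So to print that content to the console, the output has to be encoded in a character set the console expects.
On a Windows machine, I have to run chcp 65001 in the console before executing the Perl script. I also have to specify the encoding for STDOUT: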
use strict;
use warnings;
use utf8;
use WWW::Mechanize;
binmode STDOUT, ':utf8'; # output should be in UTF-8
my $url = 'http://www.metacritic.com/movie/a-band-called-death/critic-reviews';
my $Review = 'In the end Death triumphs, but its allure and obsession remain a mystery.';
my $lwp = WWW::Mechanize->new();
$lwp->get($url);
my $data = $lwp->content;
if ($data =~ m{$Review.*?<div class="review_body">(.*?)</div>}s) {
print "$1\n";
} else {
warn "Review not found";
}
Output (line breaks added):
Too much of the doc takes our taste for granted; Alice Cooper, Henry Rollins
and others won’t persuade you that Death could have been huge, nor does a
clichéd last-act reunion show. But the film’s alternating inquiry — into
family love, slow compromise and, yes, death — resonates strongly.
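To make the cause concrete, here is a minimal sketch that needs no network access. It assumes the console decodes output as cp1252, which is only a stand-in for whatever legacy code page happens to be active; the variable names are illustrative and not part of the scripts above. It shows how one decoded character becomes three UTF-8 bytes, and what those bytes turn into when they are read under the wrong encoding.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A decoded character string; \x{2019} is the right single quotation
# mark that appears in "won't" on the review page.
my $text = "won\x{2019}t";

# Encoding turns the single character U+2019 into three UTF-8 bytes.
my $bytes = encode('UTF-8', $text);
printf "characters: %d, bytes: %d\n", length($text), length($bytes);

# ASSUMPTION: the console treats the bytes as cp1252. Each byte is then
# shown as its own character, which is the garbling seen in the output.
my $garbled = decode('cp1252', $bytes);
my @wrong   = map { sprintf 'U+%04X', ord } split //, $garbled;
printf "read as cp1252: %d characters, apostrophe became %s\n",
    scalar(@wrong), join(' ', @wrong[3 .. 5]);
```

Running it should print "characters: 5, bytes: 7" followed by "read as cp1252: 7 characters, apostrophe became U+00E2 U+20AC U+2122", which is why a single apostrophe shows up as three unrelated symbols.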
Clearly, this is a Unicode problem.
Following the suggestions in this answer, I was able to get this version of the code working:
use v5.12 ;
use utf8 ;
use open qw( :encoding(UTF-8) :std ) ;
use WWW::Mechanize ;
my $reviewspage = 'http://www.metacritic.com/movie/a-band-called-death/critic-reviews' ;
my $Review = 'In the end Death triumphs, but its allure and obsession remain a mystery.' ;
my $l = WWW::Mechanize->new() ;
$l->get($reviewspage) ;
my $k = $l->content ;
my @Review = $k =~ m{$Review.*?<div class="review_body">(.*?)</div>}s ;
print "@Review\n" ;
Output:
Too much of the doc takes our taste for granted; Alice Cooper, Henry Rollins and others won’t persuade you that Death could have been huge, nor does a clichéd last-act reunion show. But the film’s alternating inquiry — into family love, slow compromise and, yes, death — resonates strongly.
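For anyone who wants to check what the encoding layer does without relying on the console, the following is a rough sketch and not part of the original scripts. It writes to an in-memory filehandle with an explicit :encoding(UTF-8) layer, which is the same kind of layer that use open qw( :encoding(UTF-8) :std ) or binmode arranges for STDOUT. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Decoded characters: U+2019 (apostrophe) and U+2014 (dash), both non-ASCII.
my $text = "won\x{2019}t \x{2014} ok";

# Write through an explicit UTF-8 layer into an in-memory buffer.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer
    or die "Cannot open in-memory handle: $!";
print {$out} $text;
close $out;

# The buffer holds encoded bytes: each non-ASCII character takes 3 bytes,
# so the byte count is the character count plus 4.
printf "characters: %d, bytes written: %d\n", length($text), length($buffer);
```

If the byte count is larger than the character count, the layer has encoded the text on the way out, which is what the console needs once it is set to UTF-8.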