WWW::Mechanize not handling apostrophes or dashes
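I have been trying to pull information from Metacritic, but I have run into a problem: text containing apostrophes or dashes is not extracted correctly.
The following code illustrates the problem: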
use WWW::Mechanize;
$reviewspage = 'http://www.metacritic.com/movie/a-band-called-death/critic-reviews';
$Review = 'In the end Death triumphs, but its allure and obsession remain a mystery.';
$l = WWW::Mechanize->new();
$l->get($reviewspage);
$k = $l->content;
@Review = $k =~ m{$Review.*?<div class="review_body">(.*?)</div>}s;
print "@Review\n";
Output:
Too much of the doc takes our taste for granted; Alice Cooper, Henry Rollins and others won’t persuade you that Death could have been huge, nor does a clichéd last-act reunion show. But the film’s alternating inquiry — into family love, slow compromise and, yes, death — resonates strongly.
even though the markup on the site is:
<div class="review_body">
Too much of the doc takes our taste for granted; Alice Cooper, Henry Rollins and others won’t persuade you that Death could have been huge, nor does a clichéd last-act reunion show. But the film’s alternating inquiry — into family love, slow compromise and, yes, death — resonates strongly.
</div>
I have written similar scripts with WWW::Mechanize before, but none of them replaced characters like this.
Metacritic uses the UTF-8 character set:
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
So, to print that content to the console, the output has to be encoded in that character set.
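To illustrate why the apostrophe gets mangled, here is a minimal sketch using only the core Encode module. It assumes the console falls back to Windows-1252, which is a guess about the asker's setup and not something the question states; the helper names utf8_hex and misread_as_cp1252 are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: hex dump of the UTF-8 bytes for a character string.
sub utf8_hex {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# Hypothetical helper: what a reader sees when those UTF-8 bytes are
# interpreted as Windows-1252 (an assumed console code page).
sub misread_as_cp1252 {
    my ($text) = @_;
    return decode('cp1252', encode('UTF-8', $text));
}

# U+2019 RIGHT SINGLE QUOTATION MARK, the apostrophe in "won’t"
my $apostrophe = "\x{2019}";

print utf8_hex($apostrophe), "\n";    # prints: E2 80 99

# Print the misread characters as code points so the output stays ASCII.
print join(' ', map { sprintf 'U+%04X', ord } split //, misread_as_cp1252($apostrophe)), "\n";
# prints: U+00E2 U+20AC U+2122
```

One character has become three (â, € and ™), which is the typical shape of this kind of garbling. The page content itself is fine; only the interpretation of the bytes on output is wrong.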
On a Windows machine, I first have to run chcp 65001 in the console before executing the Perl script. I also have to specify the character set for STDOUT:
use strict;
use warnings;
use utf8;
use WWW::Mechanize;
binmode STDOUT, ':utf8'; # output should be in UTF-8
my $url = 'http://www.metacritic.com/movie/a-band-called-death/critic-reviews';
my $Review = 'In the end Death triumphs, but its allure and obsession remain a mystery.';
my $lwp = WWW::Mechanize->new();
$lwp->get($url);
my $data = $lwp->content;
if ($data =~ m{$Review.*?<div class="review_body">(.*?)</div>}s) {
print "$1\n";
} else {
warn "Review not found";
}
Output (line breaks added):
Too much of the doc takes our taste for granted; Alice Cooper, Henry Rollins
and others won’t persuade you that Death could have been huge, nor does a
clichéd last-act reunion show. But the film’s alternating inquiry — into
family love, slow compromise and, yes, death — resonates strongly.
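To see what an encoding layer such as the one set with binmode does, the following sketch writes to an in-memory filehandle instead of STDOUT and dumps the resulting bytes. The helper name bytes_written is hypothetical, and the layer used here is :encoding(UTF-8), the strict variant, not the :utf8 layer from the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the raw bytes that a handle with an
# :encoding(UTF-8) layer emits for a given character string.
sub bytes_written {
    my ($text) = @_;
    my $buffer = '';
    open my $fh, '>:encoding(UTF-8)', \$buffer
        or die "Cannot open in-memory handle: $!";
    print {$fh} $text;
    close $fh;
    return $buffer;
}

# "won’t" with the apostrophe written as U+2019
my $bytes = bytes_written("won\x{2019}t");
print join(' ', map { sprintf '%02X', ord } split //, $bytes), "\n";
# prints: 77 6F 6E E2 80 99 74
```

The layer turns the single character U+2019 into the three bytes E2 80 99 on the way out. A console set to code page 65001 interprets those bytes as UTF-8 and shows the apostrophe correctly.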
Clearly, this is a Unicode problem.
Following the suggestions in this answer, I was able to get this version of the code working:
use v5.12 ;
use utf8 ;
use open qw( :encoding(UTF-8) :std ) ;
use WWW::Mechanize ;
my $reviewspage = 'http://www.metacritic.com/movie/a-band-called-death/critic-reviews' ;
my $Review = 'In the end Death triumphs, but its allure and obsession remain a mystery.' ;
my $l = WWW::Mechanize->new() ;
$l->get($reviewspage) ;
my $k = $l->content ;
my @Review = $k =~ m{$Review.*?<div class="review_body">(.*?)</div>}s ;
print "@Review\n" ;
Output:
Too much of the doc takes our taste for granted; Alice Cooper, Henry Rollins and others won’t persuade you that Death could have been huge, nor does a clichéd last-act reunion show. But the film’s alternating inquiry — into family love, slow compromise and, yes, death — resonates strongly.