WWW::Mechanize not handling apostrophes or dashes

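I have been trying to extract information from Metacritic, but I have run into a problem: text containing apostrophes or dashes is not extracted correctly.

The following code demonstrates the problem: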

use WWW::Mechanize;
 $reviewspage = 'http://www.metacritic.com/movie/a-band-called-death/critic-reviews';
 $Review = 'In the end Death triumphs, but its allure and obsession remain a mystery.';
 $l = WWW::Mechanize->new();
    $l->get($reviewspage);
    $k = $l->content;
    @Review = $k =~ m{$Review.*?<div class="review_body">(.*?)</div>}s;
    print "@Review\n";

Output:

                                Too much of the doc takes our taste for granted; Alice Cooper, Henry Rollins and others won’t persuade you that Death could have been huge, nor does a clichéd last-act reunion show. But the film’s alternating inquiry — into family love, slow compromise and, yes, death — resonates strongly.

Even though the markup on the site is:

<div class="review_body">
                                Too much of the doc takes our taste for granted; Alice Cooper, Henry Rollins and others won’t persuade you that Death could have been huge, nor does a clichéd last-act reunion show. But the film’s alternating inquiry — into family love, slow compromise and, yes, death — resonates strongly.
                            </div>

I have written similar scripts with WWW::Mechanize before, and none of them replaced characters like this.

Metacritic uses the UTF-8 character set:

<meta http-equiv="content-type" content="text/html; charset=UTF-8">

So, to print that content to the console, the output has to be set up for that character set.
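To see why the characters come out wrong otherwise, here is a minimal offline sketch of the effect. It assumes the console is interpreting the bytes as Windows code page 1252, which is only a guess at a typical setup; the sample string is made up from the characters in the review and written with escapes so the script does not depend on the source file's encoding.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Sample text with a curly apostrophe (U+2019), an em dash (U+2014)
# and an accented e (U+00E9), as found in the review.
my $text  = "won\x{2019}t \x{2014} clich\x{E9}d";

# The bytes that are sent when the text is written out as UTF-8.
my $bytes = encode('UTF-8', $text);

# Assumption: the console reads those bytes as cp1252, one byte per character.
my $garbled = decode('cp1252', $bytes);

binmode STDOUT, ':encoding(UTF-8)';
printf "characters: %d, bytes: %d\n", length($text), length($bytes);
print "$garbled\n";
```

Each non-ASCII character takes two or three bytes in UTF-8, so a console that treats every byte as a separate character shows several junk characters in place of one apostrophe or dash.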

On a Windows machine, I have to run chcp 65001 in the console before executing the Perl script, and I also have to specify the character set for STDOUT:

use strict;
use warnings;
use utf8;

use WWW::Mechanize;

binmode STDOUT, ':utf8';   # output should be in UTF-8

my $url = 'http://www.metacritic.com/movie/a-band-called-death/critic-reviews';
my $Review = 'In the end Death triumphs, but its allure and obsession remain a mystery.';

my $lwp = WWW::Mechanize->new();
$lwp->get($url);
my $data = $lwp->content;

if ($data =~ m{$Review.*?<div class="review_body">(.*?)</div>}s) {
    print "$1\n";
} else {
    warn "Review not found";
}

Output (line breaks added):

Too much of the doc takes our taste for granted; Alice Cooper, Henry Rollins
and others won’t persuade you that Death could have been huge, nor does a
clichéd last-act reunion show. But the film’s alternating inquiry — into
family love, slow compromise and, yes, death — resonates strongly.

Clearly, this is a Unicode problem.

Following the suggestions in this answer, I was able to get this version of the code working:

use v5.12 ;
use utf8 ;
use open qw( :encoding(UTF-8) :std ) ;

use WWW::Mechanize ;
 my $reviewspage = 'http://www.metacritic.com/movie/a-band-called-death/critic-reviews' ;
 my $Review = 'In the end Death triumphs, but its allure and obsession remain a mystery.' ;
 my $l = WWW::Mechanize->new() ;
    $l->get($reviewspage) ;   
    my $k = $l->content ;   
    my @Review = $k =~ m{$Review.*?<div class="review_body">(.*?)</div>}s ;
    print "@Review\n" ;

Output:

                                    Too much of the doc takes our taste for granted; Alice Cooper, Henry Rollins and others won’t persuade you that Death could have been huge, nor does a clichéd last-act reunion show. But the film’s alternating inquiry — into family love, slow compromise and, yes, death — resonates strongly.
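As a rough illustration of what the use open line arranges, the sketch below writes a string through an explicit UTF-8 layer into an in-memory buffer and then applies the same layer to the standard handles by hand. The sample string and the variable names are made up for the example, and the final binmode line only approximates the :std part of the pragma; the pragma also sets default layers for later open calls in its scope.

```perl
use strict;
use warnings;

my $text = "film\x{2019}s";    # 6 characters, one of them a curly apostrophe

# Write through a UTF-8 encoding layer into an in-memory buffer,
# so the resulting bytes can be inspected.
open my $out, '>:encoding(UTF-8)', \my $buffer or die "open failed: $!";
print {$out} $text;
close $out;

printf "%d characters became %d bytes\n", length($text), length($buffer);

# Roughly what :std does: put the same layer on the standard handles.
binmode $_, ':encoding(UTF-8)' for \*STDIN, \*STDOUT, \*STDERR;
```

The pragma form is shorter and covers file handles opened later in the same scope, while the explicit binmode form makes it obvious which handles are affected.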
