
WWW::Mechanize not handling apostrophes or dashes

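I have been trying to extract information from Metacritic, but I have run into a problem: text containing apostrophes or dashes is not extracted correctly.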

The following code demonstrates the problem:

use WWW::Mechanize;
 $reviewspage = 'http://www.metacritic.com/movie/a-band-called-death/critic-reviews';
 $Review = 'In the end Death triumphs, but its allure and obsession remain a mystery.';
 $l = WWW::Mechanize->new();
    $l->get($reviewspage);
    $k = $l->content;
    @Review = $k =~ m{$Review.*?<div class="review_body">(.*?)</div>}s;
    print "@Review\n";

Output:

                                Too much of the doc takes our taste for granted; Alice Cooper, Henry Rollins and others won’t persuade you that Death could have been huge, nor does a clichéd last-act reunion show. But the film’s alternating inquiry — into family love, slow compromise and, yes, death — resonates strongly.

even though the markup on the site is:

<div class="review_body">
                                Too much of the doc takes our taste for granted; Alice Cooper, Henry Rollins and others won’t persuade you that Death could have been huge, nor does a clichéd last-act reunion show. But the film’s alternating inquiry — into family love, slow compromise and, yes, death — resonates strongly.
                            </div>

I have written similar scripts with WWW::Mechanize before, but none of them replaced characters like this.

Metacritic uses the UTF-8 character set:

<meta http-equiv="content-type" content="text/html; charset=UTF-8">

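So to print that content to the console, the output has to be encoded in that character set as well.

On a Windows machine, I have to run chcp 65001 in the console before executing the Perl script, and I also have to specify the character set for STDOUT: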

use strict;
use warnings;
use utf8;

use WWW::Mechanize;

binmode STDOUT, ':utf8';   # output should be in UTF-8

my $url = 'http://www.metacritic.com/movie/a-band-called-death/critic-reviews';
my $Review = 'In the end Death triumphs, but its allure and obsession remain a mystery.';

my $lwp = WWW::Mechanize->new();
$lwp->get($url);
my $data = $lwp->content;

if ($data =~ m{$Review.*?<div class="review_body">(.*?)</div>}s) {
    print "$1\n";
} else {
    warn "Review not found";
}

Output (line breaks added):

Too much of the doc takes our taste for granted; Alice Cooper, Henry Rollins
and others won’t persuade you that Death could have been huge, nor does a
clichéd last-act reunion show. But the film’s alternating inquiry — into
family love, slow compromise and, yes, death — resonates strongly.
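
To see why the output layer matters, here is a minimal sketch that needs no network access. It writes a short sample string through an encoding layer into an in-memory filehandle and compares the character count with the byte count. The helper name bytes_written and the sample text are assumptions made for illustration; they are not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: write $text to an in-memory filehandle through
# the given PerlIO layer and return the raw bytes that were produced.
sub bytes_written {
    my ($text, $layer) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer
        or die "Cannot open in-memory handle: $!";
    print {$fh} $text;
    close $fh;
    return $buffer;
}

# Sample text with a right single quote (U+2019), an em dash (U+2014)
# and an e-acute (U+00E9), written with escapes so the source file's
# own encoding does not matter.
my $text  = "won\x{2019}t \x{2014} clich\x{e9}d";
my $bytes = bytes_written($text, ':encoding(UTF-8)');

printf "%d characters became %d bytes\n", length $text, length $bytes;
```

Each of the three non-ASCII characters becomes a multi-byte sequence on output. When STDOUT has no encoding layer, Perl has to guess how to write such characters and warns with "Wide character in print" for code points above 255, which is why the binmode call is needed.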

Clearly, this is a Unicode issue.

Following the suggestions in this answer, I was able to get this version of the code working:

use v5.12 ;
use utf8 ;
use open qw( :encoding(UTF-8) :std ) ;

use WWW::Mechanize ;
 my $reviewspage = 'http://www.metacritic.com/movie/a-band-called-death/critic-reviews' ;
 my $Review = 'In the end Death triumphs, but its allure and obsession remain a mystery.' ;
 my $l = WWW::Mechanize->new() ;
    $l->get($reviewspage) ;   
    my $k = $l->content ;   
    my @Review = $k =~ m{$Review.*?<div class="review_body">(.*?)</div>}s ;
    print "@Review\n" ;

Output:

                                    Too much of the doc takes our taste for granted; Alice Cooper, Henry Rollins and others won’t persuade you that Death could have been huge, nor does a clichéd last-act reunion show. But the film’s alternating inquiry — into family love, slow compromise and, yes, death — resonates strongly.
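
The use open pragma above handles the output side. The input side, turning the bytes of the response into Perl characters, is as far as I understand done by WWW::Mechanize itself, since its content method normally returns the decoded content. The sketch below shows the same decode, match, encode round trip by hand on a hard-coded byte string, so it runs without fetching anything. The sample bytes and the helper name review_body are assumptions for illustration only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes as they might arrive over the wire (sample data, not fetched).
my $raw = "<div class=\"review_body\">\n"
        . "    won\xE2\x80\x99t \xE2\x80\x94 clich\xC3\xA9d\n"
        . "</div>";

# Decode the bytes into Perl characters before matching against them.
my $html = decode('UTF-8', $raw);

# Hypothetical helper: return the text of the first review_body div
# with surrounding whitespace trimmed, or undef if there is none.
sub review_body {
    my ($html) = @_;
    return $html =~ m{<div class="review_body">\s*(.*?)\s*</div>}s ? $1 : undef;
}

my $body = review_body($html);

# Encode explicitly on the way out, since STDOUT has no layer here.
print encode('UTF-8', "$body\n");
```

Encoding explicitly with Encode and setting a layer with binmode or use open are alternatives; applying both to the same handle would encode the text twice and bring the garbled characters back.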

