WWW::Mechanize not handling apostrophes or dashes

I've been working on extracting info from Metacritic, but I've now run into a problem: I can't cleanly extract text that contains apostrophes or dashes.

This problem is illustrated in the following code:

use WWW::Mechanize;

$reviewspage = 'http://www.metacritic.com/movie/a-band-called-death/critic-reviews';
$Review = 'In the end Death triumphs, but its allure and obsession remain a mystery.';
$l = WWW::Mechanize->new();
$l->get($reviewspage);
$k = $l->content;
@Review = $k =~ m{$Review.*?<div class="review_body">(.*?)</div>}s;
print "@Review\n";

Outputs:

                                Too much of the doc takes our taste for granted; Alice Cooper, Henry Rollins and others won’t persuade you that Death could have been huge, nor does a clichéd last-act reunion show. But the film’s alternating inquiry — into family love, slow compromise and, yes, death — resonates strongly.

Even though the HTML source on the website is:

<div class="review_body">
                                Too much of the doc takes our taste for granted; Alice Cooper, Henry Rollins and others won’t persuade you that Death could have been huge, nor does a clichéd last-act reunion show. But the film’s alternating inquiry — into family love, slow compromise and, yes, death — resonates strongly.
                            </div>

I've created similar scripts before that have used WWW::Mechanize and none of them have substituted out characters like this.

Metacritic uses the UTF-8 character set:

<meta http-equiv="content-type" content="text/html; charset=UTF-8">

Therefore, to print this content to the console, the output must be encoded in that character set.

On my Windows machine, I must run chcp 65001 in the console before executing my Perl script, and I must also set the encoding of STDOUT:

use strict;
use warnings;
use utf8;

use WWW::Mechanize;

binmode STDOUT, ':utf8';   # output should be in UTF-8

my $url = 'http://www.metacritic.com/movie/a-band-called-death/critic-reviews';
my $Review = 'In the end Death triumphs, but its allure and obsession remain a mystery.';

my $lwp = WWW::Mechanize->new();
$lwp->get($url);
my $data = $lwp->content;

# \Q...\E makes the '.' and other metacharacters in $Review match literally
if ($data =~ m{\Q$Review\E.*?<div class="review_body">(.*?)</div>}s) {
    print "$1\n";
} else {
    warn "Review not found";
}

Outputs (newlines added):

Too much of the doc takes our taste for granted; Alice Cooper, Henry Rollins
and others won’t persuade you that Death could have been huge, nor does a
clichéd last-act reunion show. But the film’s alternating inquiry — into
family love, slow compromise and, yes, death — resonates strongly.
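
As a minimal offline sketch of why the output layer matters (it does not contact Metacritic; the sample string and variable names are assumptions made up for illustration), the following prints decoded text to an in-memory filehandle through a UTF-8 encoding layer and compares the character count with the number of bytes actually written:

```perl
use strict;
use warnings;
use utf8;   # the string literal below contains non-ASCII characters

# Hypothetical sample standing in for the decoded text from $lwp->content.
my $text = "won’t — clichéd";

# Print to an in-memory filehandle through a UTF-8 encoding layer,
# the same kind of layer that binmode puts on STDOUT.
my $bytes = '';
open my $fh, '>:encoding(UTF-8)', \$bytes
    or die "Cannot open in-memory handle: $!";
print {$fh} $text;
close $fh;

# The apostrophe and the dash each take 3 bytes in UTF-8, the é takes 2.
printf "characters: %d, bytes: %d\n", length($text), length($bytes);
# prints "characters: 15, bytes: 20"
```

The console only ever receives the bytes, so it has to be told (here via chcp 65001) to interpret them as UTF-8.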

Apparently it's a Unicode issue.

Per the recommendations in this answer, I was able to get this version of your code to work:

use v5.12;
use utf8;
use open qw( :encoding(UTF-8) :std );

use WWW::Mechanize;

my $reviewspage = 'http://www.metacritic.com/movie/a-band-called-death/critic-reviews';
my $Review = 'In the end Death triumphs, but its allure and obsession remain a mystery.';
my $l = WWW::Mechanize->new();
$l->get($reviewspage);
my $k = $l->content;
my @Review = $k =~ m{$Review.*?<div class="review_body">(.*?)</div>}s;
print "@Review\n";

Output:

                                    Too much of the doc takes our taste for granted; Alice Cooper, Henry Rollins and others won’t persuade you that Death could have been huge, nor does a clichéd last-act reunion show. But the film’s alternating inquiry — into family love, slow compromise and, yes, death — resonates strongly.
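
To see how this kind of character substitution can arise, here is a small sketch using the core Encode module. It is only an illustration under stated assumptions: the sample string is made up, and Windows-1252 is assumed as the wrong encoding (your console's actual code page may differ). It encodes the text as UTF-8 and then decodes the same bytes two ways:

```perl
use strict;
use warnings;
use utf8;                            # source contains non-ASCII literals
use Encode qw(encode decode);

binmode STDOUT, ':encoding(UTF-8)';  # so this demo itself prints cleanly

# Hypothetical sample text; not fetched from the site.
my $text = "won’t — clichéd";

# What is actually sent to a terminal or over the wire: UTF-8 bytes.
my $bytes = encode('UTF-8', $text);

# What a viewer shows if it wrongly assumes Windows-1252
# (an assumption chosen for illustration).
my $garbled = decode('cp1252', $bytes);

# Decoding with the correct encoding restores the original characters.
my $restored = decode('UTF-8', $bytes);

print "garbled:  $garbled\n";
print "restored: $restored\n";
```

In the garbled line each multi-byte character turns into two or three unrelated characters, while plain ASCII text comes through unchanged, which is why only the apostrophes, dashes and accented letters look wrong.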
