
Perl not printing the special characters

My scraped content does not display special characters correctly. Junk values appear in their place (for example, € is printed as -aA). Thanks in advance.

#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder::XPath;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new(agent => "Mozilla/5.0");
my $req = HTTP::Request->new(GET => 'http://www.infanziabimbo.it/costi-modalita-e-tempi-di-spedizione.html');
my $res = $ua->request($req);

die("error") unless $res->is_success;

my $xp = HTML::TreeBuilder::XPath->new_from_content($res->content);
my @node =  $xp->findnodes_as_strings('//div[@class="mainbox-body"]');
die("node doesn't exist") if $#node == -1; # Line 18
open HTML, ">C:/Users/jeyakuma/Desktop/kjk.html";
foreach(<@node>)
{

print HTML "$_";


}
close HTML;


Here are some observations on your code that I hope will help you.

  • You must always check that a call to open succeeded; otherwise your program will carry on silently without reading or writing anything. Instead of the idiomatic open ... or die $!, you may prefer simply to add use autodie at the top of your code.

  • If the HTTP request fails, it is more informative for your program to say why it failed instead of just saying "error". I suggest you write this instead:

     $res->is_success or die $res->status_line;

  • If you don't need any special LWP or parse options, then you can just write:

     my $url = 'http://www.infanziabimbo.it/costi-modalita-e-tempi-di-spedizione.html';
     my $xp  = HTML::TreeBuilder::XPath->new_from_url($url);

    although that gives you no way to specify the user agent string as you do currently.

  • Rather than testing $#node for equality with -1, it is much neater to check the truth of @node:

     die "node doesn't exist" unless @node; # Line 18

  • If your data contains non-ASCII characters such as €, then your output file handle must be given an encoding layer so that they are written as UTF-8. You can set one on an open handle using binmode, like this:

     open HTML, ">C:/Users/jeyakuma/Desktop/kjk.html";
     binmode HTML, ':encoding(utf-8)';

    But the best way is to use the preferred three-parameter form of open, which would look like this, assuming that you have use autodie in place at the start of your program:

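     open HTML, '>:encoding(utf-8)', 'C:/Users/jeyakuma/Desktop/kjk.html';

  • Lexical file handles (open my $html_fh, ...) are far superior to the old-fashioned global file handles: they are scoped to the enclosing block and are closed automatically when they go out of scope.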

  • The loop foreach(<@node>) { ... } is completely wrong because it is equivalent to foreach (glob join ' ', @node) { ... }. It only appears to work because, in general, glob leaves a word untouched if it contains no wildcards. It also splits the text at whitespace, so the printed output loses its spacing. What you meant was just for (@node) { ... }.

  • In addition, it is bad practice to enclose a variable in quotes unless you specifically want to call its stringification method, so "$_" should be just $_.

  • You may as well replace your final output loop with a single statement:

     print HTML @node; 
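
To see what the glob form of the loop actually does, here is a small self-contained sketch. The two strings in @node are made up for illustration and stand in for whatever findnodes_as_strings returns; nothing here touches the network or the file system.

```perl
use strict;
use warnings;

# Hypothetical node strings standing in for the scraped text.
my @node = ('Spedizione in Italia', 'Costo 5 euro');

# What the original loop iterated over: <@node> is glob("@node"),
# so the strings are joined with spaces and then split into words.
my @globbed = <@node>;

# What was intended: the strings themselves, untouched.
my @direct = @node;

printf "glob gave %d items, direct iteration gives %d\n",
    scalar @globbed, scalar @direct;
print "glob item: [$_]\n" for @globbed;
```

Each string is broken up at whitespace, so printing the globbed list runs the words together. Text containing *, ? or [ would additionally be treated as a filename pattern.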

Putting these changes in place, the result looks like this, which I believe will fix your problem:

use strict;
use warnings;
use autodie;

use HTML::TreeBuilder::XPath;

my $url = 'http://www.infanziabimbo.it/costi-modalita-e-tempi-di-spedizione.html';
my $xp  = HTML::TreeBuilder::XPath->new_from_url($url);

my @node = $xp->findnodes_as_strings('//div[@class="mainbox-body"]');
die "node doesn't exist" unless @node;

open my $html_fh, '>:encoding(utf-8)', 'C:/Users/jeyakuma/Desktop/kjk.html';
print $html_fh @node;
close $html_fh;
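
If you do need to keep your own LWP::UserAgent (for the agent string, say), the following sketch may help show where the bytes-versus-characters distinction comes in. It assumes the page is served as UTF-8. The HTTP::Response object is built by hand as a stand-in for what $ua->request($req) returns, and an in-memory file handle stands in for the real output file, so nothing is fetched or written to disk. It uses decoded_content, which the code above does not, so treat it as an illustration rather than a drop-in replacement.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hand-built stand-in for a real response. The body holds the UTF-8
# byte sequence for "5 €" (E2 82 AC is the euro sign).
my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    "<p>5 \xE2\x82\xAC</p>",
);

my $bytes = $res->content;          # raw octets, as used in the question
my $chars = $res->decoded_content;  # decoded using the declared charset

# Write the character string through an encoding layer. The in-memory
# handle is an assumption made so the sketch has no side effects.
my $written = '';
open my $out_fh, '>:encoding(UTF-8)', \$written
    or die "Cannot open in-memory file: $!";
print $out_fh $chars;
close $out_fh or die "Cannot close in-memory file: $!";

printf "content length: %d, decoded length: %d\n",
    length($bytes), length($chars);
print "round trip ok\n" if $written eq $bytes;
```

The decoded string is shorter than the raw content because the three-byte euro sign becomes a single character. Passing such a decoded string to new_from_content and writing through the :encoding layer keeps the two ends consistent.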


 