My scraped content is not displaying special characters. Junk values appear in their place (for example, € is printed as -aA). Thanks in advance.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new(agent => "Mozilla/5.0");
my $req = HTTP::Request->new(GET => 'http://www.infanziabimbo.it/costi-modalita-e-tempi-di-spedizione.html');
my $res = $ua->request($req);
die("error") unless $res->is_success;
my $xp = HTML::TreeBuilder::XPath->new_from_content($res->content);
my @node = $xp->findnodes_as_strings('//div[@class="mainbox-body"]');
die("node doesn't exist") if $#node == -1; # Line 18
open HTML, ">C:/Users/jeyakuma/Desktop/kjk.html";
foreach(<@node>)
{
print HTML "$_";
}
close HTML;
Here are some observations on your code that I hope will help you.
You must always check that a call to open succeeded, otherwise your program will just continue to run silently without any input or output. Rather than the idiomatic open ... or die $! you may prefer just to add use autodie at the top of your code.
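To illustrate the difference between the two styles, here is a minimal sketch. The path is hypothetical and is only assumed to point into a directory that does not exist, so that both calls to open fail.

```perl
use strict;
use warnings;

# Hypothetical path; its parent directory is assumed not to exist.
my $path = '/no/such/directory/out.html';

# Explicit check: open returns false on failure and sets $!.
my $explicit_error = '';
open(my $fh1, '>', $path) or $explicit_error = "Cannot open $path: $!";

# With autodie in scope, a failed open throws an exception instead.
my $autodie_error = '';
{
    use autodie;    # lexically scoped, so it applies to this block only
    eval { open(my $fh2, '>', $path); 1 } or $autodie_error = "$@";
}

print "explicit: $explicit_error\n";
print "autodie:  $autodie_error\n";
```

In a real program you would normally let the autodie exception propagate rather than trap it with eval; it is caught here only so that both messages can be printed side by side.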
If the HTTP request fails, it is more informative if your program indicates why it failed instead of just saying "error". I suggest you write this instead
$res->is_success or die $res->status_line;
If you don't need any special LWP or parse options, then you can just write
my $url = 'http://www.infanziabimbo.it/costi-modalita-e-tempi-di-spedizione.html';
my $xp = HTML::TreeBuilder::XPath->new_from_url($url);
although that doesn't give you any way to specify the user agent string as you do currently.
Rather than testing $#node for equality to -1, it is much neater to check for the truth of @node, so
die "node doesn't exist" unless @node; # Line 18
If your data contains non-ASCII characters such as €, then your output file handle must be given an encoding layer so that they are written as UTF-8. You can set the layer using binmode, like this
open HTML, ">C:/Users/jeyakuma/Desktop/kjk.html";
binmode HTML, ':encoding(utf-8)';
But the best way is to use the preferred three-parameter form of open, which would look like this, assuming that you have use autodie in place at the start of your program
open HTML, '>:encoding(utf-8)', 'C:/Users/jeyakuma/Desktop/kjk.html';
Lexical file handles (open my $html_fh, ...) are far superior to old-fashioned global file handles such as HTML.
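As a small self-contained check of what the encoding layer does, the sketch below writes a single euro sign to a temporary file through a lexical file handle and then reads the file back as raw bytes. The temporary file and the variable names are only for illustration and are not part of your program.

```perl
use strict;
use warnings;
use autodie;
use File::Temp qw(tempfile);

# Sample text: "\x{20AC}" is the euro sign as a single Perl character.
my $text = "\x{20AC}";

# tempfile() returns an open handle and a file name; only the name is needed.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write through an encoding layer so the character becomes UTF-8 bytes.
open my $out_fh, '>:encoding(UTF-8)', $path;
print $out_fh $text;
close $out_fh;

# Read the file back without any decoding to see the bytes on disk.
open my $in_fh, '<:raw', $path;
my $bytes = do { local $/; <$in_fh> };
close $in_fh;

# Format each byte as two hex digits.
my $hex = join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
print "$hex\n";    # prints "E2 82 AC", the UTF-8 encoding of U+20AC
```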
The loop foreach(<@node>) { ... } is completely wrong because it is equivalent to foreach (glob join ' ', @node) { ... } and only appears to work because, in general, glob will leave a filename untouched if it doesn't contain any wildcards. What you meant was just for (@node) { ... }
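The effect is easy to see with strings that contain spaces. This is a minimal sketch using made-up data in place of your scraped nodes:

```perl
use strict;
use warnings;

# Hypothetical scraped strings; the first one contains a space.
my @node = ('hello world', 'foo');

# <@node> interpolates the array into one string and hands it to glob,
# which splits on whitespace and returns each word as a separate item.
my @via_glob = <@node>;

# Using the array directly keeps each element intact.
my @direct = @node;

print scalar(@via_glob), " items via glob: @via_glob\n";
print scalar(@direct), " items directly\n";
```

The two-element array comes back from glob as three items, and any element containing a wildcard such as * would be expanded against the files in the current directory.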
In addition, it is bad practice to enclose a variable in quotes unless you specifically want to call its stringification method, so "$_" should be just $_
You may as well write your final output loop as
print HTML @node;
Putting these changes in place, the result looks like this, which I believe will fix your problem:
use strict;
use warnings;
use autodie;
use HTML::TreeBuilder::XPath;
my $url = 'http://www.infanziabimbo.it/costi-modalita-e-tempi-di-spedizione.html';
my $xp = HTML::TreeBuilder::XPath->new_from_url($url);
my @node = $xp->findnodes_as_strings('//div[@class="mainbox-body"]');
die "node doesn't exist" unless @node;
open my $html_fh, '>:encoding(utf-8)', 'C:/Users/jeyakuma/Desktop/kjk.html';
print $html_fh @node;
close $html_fh;