How can I access the contents of a JavaScript driven web page with Perl?
I was trying to make a little app with Perl to fetch League of Legends summoner names from LolKing.
The HTML code has lines like
<tr data-summonername="MatLife TriHard" class="lb_row_rank_4">
so I was just going with something like
use strict;
use warnings;
use LWP::Simple;
use HTML::Parser;
my $find_links = HTML::Parser->new(
    start_h => [
        sub {
            my ($tag, $attr) = @_;
            if ($tag eq 'tr' and exists $attr->{'data-summonername'}) {
                print "$attr->{'data-summonername'}\n";
            }
        },
        "tag, attr"
    ]
);
my $html = get('http://www.lolking.net/leaderboards/#/na/1') or die 'nope';
$find_links->parse($html);
but this gives me nothing. Even with attr=class, it gives me nothing. I can't fetch the tr element's class for some reason.
Using $attr->{data-summonername} without the single quotes gave me some errors, due to the hyphen I suppose. If I fetch $attr->{href} it works just fine.
Can someone help me out?
The problem is that the HTML for that page is mostly built by your browser using JavaScript after the page has been downloaded. Using LWP::Simple::get will just retrieve the skeleton HTML and the JavaScript code. You can see that if you print $html instead of parsing it.
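A quick way to confirm this without reading the whole dump is to count how often the attribute appears in the raw download. The sketch below is only an illustration: `count_summoner_rows` is a hypothetical helper, and the two sample strings are made-up stand-ins for the skeleton page and for the browser-built page, not copies of the real LolKing markup.

```perl
use strict;
use warnings;

# Hypothetical helper: count <tr> start tags carrying a data-summonername attribute.
# A regex is enough for a yes/no diagnosis; use a real parser for extraction.
sub count_summoner_rows {
    my ($html) = @_;
    my $count = () = $html =~ /<tr\b[^>]*\bdata-summonername\s*=/gi;
    return $count;
}

# Made-up stand-in for what a plain HTTP fetch returns: skeleton plus script.
my $skeleton = <<'HTML';
<html><body>
<table id="leaderboard"></table>
<script src="leaderboards.js"></script>
</body></html>
HTML

# Made-up stand-in for the page after the browser has run the JavaScript.
my $rendered = <<'HTML';
<html><body>
<table id="leaderboard">
<tr data-summonername="Doublelift" class="lb_row_rank_1"></tr>
<tr data-summonername="MatLife TriHard" class="lb_row_rank_4"></tr>
</table>
</body></html>
HTML

printf "skeleton: %d rows\n", count_summoner_rows($skeleton);
printf "rendered: %d rows\n", count_summoner_rows($rendered);
```

Passing the $html returned by get to the same helper would be expected to report zero rows, which matches the empty output from the parser.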
The usual solution is to use WWW::Mechanize::Firefox, which gets an installed Firefox to download and build the page, which you can then query. It's a lot more complex than a simple get though, as you have to install Firefox if you don't already have it, as well as the Mozilla MozRepl add-on, which enables remote control. Even then you may still have problems with accessing the contents of the page before the browser has finished building it, so it's not for the faint of heart.
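One common way to deal with that timing problem is to poll until the content you need shows up, and to give up after a deadline. The following is a generic sketch in plain Perl, not part of WWW::Mechanize::Firefox: `poll_until`, its parameters, and the simulated condition are all assumptions made for illustration.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep time);

# Hypothetical helper: call $check repeatedly until it returns a true value
# or $timeout seconds have passed. Returns that value, or undef on timeout.
sub poll_until {
    my ($check, $timeout, $interval) = @_;
    $interval //= 0.5;
    my $deadline = time() + $timeout;
    while (1) {
        my $result = $check->();
        return $result if $result;
        return if time() >= $deadline;
        sleep($interval);
    }
}

# Simulated page that only becomes "ready" on the third look.
my $looks = 0;
my $ready = poll_until(sub { ++$looks >= 3 ? 'rows present' : 0 }, 5, 0.1);

print defined $ready ? "got: $ready after $looks checks\n" : "timed out\n";
```

With a $mech object, the check could be something like sub { $mech->content =~ /data-summonername/ }, run after get and before parsing. This assumes content reflects the browser's current DOM, so verify it against the module's documentation.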
Update
For your interest, here is a solution using WWW::Mechanize::Firefox.
use strict;
use warnings;
use WWW::Mechanize::Firefox;
use HTML::TreeBuilder::XPath;
my $url = 'http://www.lolking.net/leaderboards/#/na/1';
my $mech = WWW::Mechanize::Firefox->new;
my $resp = $mech->get($url);
die $resp->status_line unless $resp->is_success;
my $tree = HTML::TreeBuilder::XPath->new_from_content($resp->content);
for my $node ( $tree->findnodes('//tr[starts-with(@class, "lb_row_rank")]') ) {
    printf "Rank %2d: %s\n",
        $node->attr('class') =~ /(\d+)/,
        $node->attr('data-summonername');
}
output
Rank 1: Doublelift
Rank 2: F5 Veritas
Rank 3: Life Love Live
Rank 4: MatLife TriHard
Rank 5: TDK Kyle
Rank 6: Liquid FeniX
Rank 7: Liquid Inori TV
Rank 8: dawoofsclaw
Rank 9: who is he
Rank 10: Ohhhq