How can I access the content of a JavaScript-driven web page with Perl?

Question

I am trying to write a small Perl app that fetches League of Legends summoner names from LolKing.

The HTML contains lines like

<tr data-summonername="MatLife TriHard" class="lb_row_rank_4">

so I went with something like

use strict;
use warnings;

use LWP::Simple;
use HTML::Parser;

my $find_links = HTML::Parser->new(
  start_h => [
    sub {
      my ($tag, $attr) = @_;
      if ($tag eq 'tr' and exists $attr->{'data-summonername'}) {
        print "$attr->{'data-summonername'}\n";
      }
    },
    "tag, attr"
  ]
);

my $html = get('http://www.lolking.net/leaderboards/#/na/1') or die 'nope';

$find_links->parse($html);

but this gives me nothing. Even when I look for the class attribute instead, I get nothing. For some reason I can't fetch the tr element's class.

Using $attr->{data-summonername} without the single quotes gave me some errors, because of the hyphen I suppose. If I fetch $attr->{href} it works just fine.
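The hyphen is indeed the cause of those errors. Perl auto-quotes a hash key inside braces only when it looks like a simple identifier; data-summonername is instead parsed as the subtraction data - summonername, which fails under use strict. A minimal sketch, using a made-up attribute hash shaped like the one HTML::Parser passes to a start handler:

```perl
use strict;
use warnings;

# Hypothetical attribute hash, for illustration only.
my $attr = {
    'data-summonername' => 'MatLife TriHard',
    class               => 'lb_row_rank_4',
};

# A simple identifier is auto-quoted inside {}, so no quotes are needed.
my $class = $attr->{class};

# A key containing a hyphen must be quoted; unquoted, Perl would read it
# as the expression "data - summonername".
my $name = $attr->{'data-summonername'};

print "$class: $name\n";
```

So the quoting in the original handler is correct, and it is not the reason the script prints nothing.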

Can someone help me out?

Answer 1

The problem is that most of the HTML for that page is built by your browser using JavaScript after the page has been downloaded. LWP::Simple::get retrieves only the skeleton HTML and the JavaScript code. You can see this if you print $html instead of parsing it.
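One way to confirm this without reading the whole dump is to test the raw download for the rows you expect. The sketch below is only an illustration: has_summoner_rows is a hypothetical helper name, and the regular expression assumes the rows look like the tr line quoted in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the HTML text already contains
# <tr> elements carrying a data-summonername attribute, otherwise 0.
sub has_summoner_rows {
    my ($html) = @_;
    return $html =~ /<tr\b[^>]*\bdata-summonername=/i ? 1 : 0;
}

# Pass a URL on the command line to check what a plain HTTP fetch returns.
if ( my $url = shift @ARGV ) {
    require LWP::Simple;    # loaded only when a URL is actually given
    my $html = LWP::Simple::get($url);
    defined $html or die "Could not fetch $url\n";

    print has_summoner_rows($html)
        ? "Rows are present in the raw HTML\n"
        : "No rows found: the page is probably built by JavaScript\n";
}
```

If the check fails on the raw download while the rows are visible in the browser's element inspector, the content is being generated client-side and an HTML parser alone cannot reach it.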

The usual solution is to use WWW::Mechanize::Firefox, which drives an installed copy of Firefox to download and build the page, which you can then query. It's a lot more complex than a simple get, though: you have to install Firefox if you don't already have it, as well as the MozRepl add-on, which enables remote control. Even then you may run into problems if you access the contents of the page before the browser has finished building it, so it's not for the faint of heart.
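A common way to cope with the timing problem is to poll until the content you need has appeared. The helper below is a sketch, not part of WWW::Mechanize::Firefox: wait_until and its timeout and interval options are names made up for this example.

```perl
use strict;
use warnings;

use Time::HiRes ();    # fractional-second sleep and clock

# Hypothetical helper: call $check repeatedly until it returns a true value
# or the timeout (in seconds) expires. Returns that true value, or undef
# if the timeout is reached first.
sub wait_until {
    my ( $check, %opt ) = @_;
    my $timeout  = $opt{timeout}  // 10;
    my $interval = $opt{interval} // 0.5;

    my $deadline = Time::HiRes::time() + $timeout;
    while (1) {
        my $result = $check->();
        return $result if $result;
        return if Time::HiRes::time() >= $deadline;
        Time::HiRes::sleep($interval);
    }
}

# Possible use with a WWW::Mechanize::Firefox object in $mech (assumed here):
#   wait_until( sub { $mech->content =~ /data-summonername=/ }, timeout => 30 )
#       or die "Leaderboard rows never appeared\n";
```

Polling with a deadline avoids both failure modes: reading the page too early and waiting forever on a page that never finishes loading.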

Update

For your interest, here is a solution using WWW::Mechanize::Firefox.

use strict;
use warnings;

use WWW::Mechanize::Firefox;
use HTML::TreeBuilder::XPath;

my $url = 'http://www.lolking.net/leaderboards/#/na/1';

my $mech = WWW::Mechanize::Firefox->new;
my $resp = $mech->get($url);
die $resp->status_line unless $resp->is_success;

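# Parse the HTML that Firefox built after running the page's JavaScript
my $tree = HTML::TreeBuilder::XPath->new_from_content($resp->content);

# Each leaderboard row is a <tr> whose class starts with "lb_row_rank"
for my $node ( $tree->findnodes('//tr[starts-with(@class, "lb_row_rank")]') ) {
  # The rank is the number embedded in the class name, e.g. lb_row_rank_4
  printf "Rank %2d: %s\n",
      $node->attr('class') =~ /(\d+)/,
      $node->attr('data-summonername');
}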

output

Rank  1: Doublelift
Rank  2: F5 Veritas
Rank  3: Life Love Live
Rank  4: MatLife TriHard
Rank  5: TDK Kyle
Rank  6: Liquid FeniX
Rank  7: Liquid Inori TV
Rank  8: dawoofsclaw
Rank  9: who is he
Rank 10: Ohhhq

Answer 1
3 2015-03-19 11:39:20