
Web crawler: Using Perl's MozRepl module to deal with Javascript

I am trying to save a couple of web pages by using a web crawler. Usually I prefer doing it with Perl's WWW::Mechanize module. However, as far as I can tell, the site I am trying to crawl has a lot of JavaScript on it, which seems to be hard to avoid. Therefore I looked into Perl modules that can drive Firefox, in particular MozRepl.

The Firefox MozRepl extension itself works perfectly. I can use the terminal to navigate the web site just the way it is shown in the developer's tutorial, at least in theory. However, I know nothing about JavaScript and am therefore having a hard time using the modules properly.

So here is the source I would like to start from: Morgan Stanley

For a couple of the firms listed beneath 'Companies - as of 10/14/2011' I would like to save their respective pages. E.g. clicking on the first listed company (i.e. '1-800-Flowers.com, Inc') calls a JavaScript function with two arguments, dtxt('FLWS.O','2011-10-14'), which produces the desired new page. That is the page I would now like to save locally.

With Perl's MozRepl module I had something like this in mind:

use strict;
use warnings;
use MozRepl;

my $repl = MozRepl->new;
$repl->setup;
$repl->execute('window.open("http://www.morganstanley.com/eqr/disclosures/webapp/coverage")');

# Enter the page's own context so that functions defined by the page,
# such as dtxt(), are directly callable
$repl->repl_enter({ source => "content" });
$repl->execute('dtxt("FLWS.O", "2011-10-14")');

Now I would like to save the HTML page this produces.

So, again, the desired code should visit the HTML pages of a couple of firms and simply save each page. (Here are, e.g., three firms: MMM.N, FLWS.O, SSRX.O.)

  1. Is it correct that I cannot get around the page's JavaScript functions and therefore cannot use WWW::Mechanize?
  2. Following question 1, are the mentioned Perl modules a plausible approach to take?
  3. And finally, if the first two questions can be answered with yes, it would be really nice if you could help me out with the actual coding. E.g. in the above code, the essential part that is missing is a 'save' command. (Maybe using Firefox's saveDocument function?)
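For what it's worth, here is a sketch of what I imagine the loop could look like. It rests on two unverified assumptions: that MozRepl's execute() returns the value of the last JavaScript expression as a string, and that a fixed sleep is enough for the page produced by dtxt() to finish loading.

```perl
use strict;
use warnings;
use MozRepl;

my $repl = MozRepl->new;
$repl->setup;

$repl->execute('window.open("http://www.morganstanley.com/eqr/disclosures/webapp/coverage")');
$repl->repl_enter({ source => "content" });

for my $firm ('MMM.N', 'FLWS.O', 'SSRX.O') {
    # Trigger the page's own navigation function for this firm
    $repl->execute(qq{dtxt("$firm", "2011-10-14")});
    sleep 5;    # crude wait; assumes the new page has loaded by now

    # Serialize the current DOM and write it to a local file
    my $html = $repl->execute('content.document.documentElement.innerHTML');
    (my $fname = "$firm.html") =~ tr{/}{_};
    open my $fh, '>', $fname or die "Cannot write $fname: $!";
    print {$fh} $html;
    close $fh;
}
```

Whether innerHTML comes back through the REPL in one piece, or needs the dates/tickers escaped differently, is exactly the kind of detail I cannot judge without knowing JavaScript.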

The web works via HTTP requests and responses.

If you can discover the proper request to send, then you will get the proper response.

If the target site uses JS to form the request, then you can either execute the JS, or analyse what it does so that you can do the same in the language that you are using.

An even easier approach is to use a tool that will capture the resulting request for you, whether the request is created by JS or not; then you can craft your scraping code to create the request that you want.

The "Web Scraping Proxy" from AT&T is such a tool.

You set it up, then navigate the website as normal to get to the page you want to scrape, and the WSP will log all requests and responses for you.

It logs them in the form of Perl code, which you can then modify to suit your needs.
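Once the WSP log has revealed what dtxt() actually sends, replaying that request needs no browser at all. The sketch below is hypothetical: the URL and the parameter names (ric, date) are placeholders standing in for whatever the captured request really contains.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

for my $firm ('MMM.N', 'FLWS.O', 'SSRX.O') {
    # Placeholder endpoint and form fields -- substitute the real ones
    # from the Web Scraping Proxy's log of the dtxt() call
    my $res = $ua->post(
        'http://www.morganstanley.com/eqr/disclosures/webapp/coverage',
        { ric => $firm, date => '2011-10-14' },
    );
    die "Request for $firm failed: ", $res->status_line
        unless $res->is_success;

    (my $fname = "$firm.html") =~ tr{/}{_};
    open my $fh, '>', $fname or die "Cannot write $fname: $!";
    print {$fh} $res->decoded_content;
    close $fh;
}
```

The same pattern works with a GET request via $ua->get if that is what the log shows; the point is only that once the request is known, plain LWP (or WWW::Mechanize) suffices.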
