Scraping a dynamic form with WWW::Mechanize in Perl

Question

I am attempting to scrape a form and its fields from a page using the WWW::Mechanize module.

Because the main body of the webpage is created using document.write JS calls, the form methods from this module aren't finding the form I am looking for, and a call to the content method returns only the raw page source. I need to access the HTML that results from the document.write calls.
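To make the problem concrete, here is a minimal sketch in Ruby using only the standard library. The page source, the form name, and both helper names (`static_markup`, `written_html`) are hypothetical and not taken from the original page. It shows that the form is invisible to a static parser because it exists only inside a JavaScript string. It also shows a fallback that may work in the simplest case, where `document.write` is called with plain single-quoted string literals. Nothing here executes JavaScript, so it will not help if the arguments are built from variables or function calls.

```ruby
# Hypothetical page source: the form exists only inside JavaScript strings.
PAGE_SOURCE = <<~HTML
  <html>
    <body>
      <script>
        document.write('<form name="login" action="/login">');
        document.write('<input type="text" name="user">');
        document.write('</form>');
      </script>
    </body>
  </html>
HTML

# Markup as a static parser sees it: script bodies are opaque text,
# so removing them leaves only the real elements.
def static_markup(source)
  source.gsub(%r{<script\b[^>]*>.*?</script>}mi, '')
end

# Fallback for the simplest case only: concatenate the single-quoted string
# literals passed directly to document.write. Escape sequences inside the
# literals are left as-is, and no JavaScript is evaluated.
def written_html(source)
  source.scan(/document\.write\(\s*'((?:[^'\\]|\\.)*)'\s*\)/m).flatten.join
end

puts static_markup(PAGE_SOURCE).include?('<form')  # false: no form element in the static markup
puts written_html(PAGE_SOURCE)                     # the HTML the script would have written
```

For anything beyond literal strings, the page needs to be loaded in something that actually runs the JavaScript.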

Is this possible using the Mechanize module, and if so, how would I go about doing it? If not, are there other Perl modules that would help me? Thanks!

Answer 1

I know that you are looking for a Perl solution, but you might consider Ruby. I have written multiple web scraping scripts in both Perl and Ruby, and I found that Ruby does a better job of web scraping than Perl.

Since you are running on Linux, Ruby should either be installed already or be simple to install (assuming you are allowed to install software on the server).

You can use these three Ruby gems for automation:

require 'watir-webdriver'
require 'selenium-webdriver'
require 'headless'

These do a very good job at web scraping.
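As a rough sketch of how these gems might fit together: the example below assumes the `watir-webdriver` and `headless` gems, Firefox, and Xvfb are installed, and that the form of interest is the first form on the page. The helper name `scrape_rendered_form` and the URL passed on the command line are placeholders, not part of the original answer. Because a real browser loads the page, the `document.write` calls run before the form is queried. Note that `watir-webdriver` has since been folded into the `watir` gem, so newer setups may need `require 'watir'` instead.

```ruby
# Sketch only: needs the watir-webdriver and headless gems, Firefox, and Xvfb.

# Load the page in a real browser so document.write runs, then return the
# rendered HTML and the names of the text fields in the first form.
def scrape_rendered_form(url)
  require 'watir-webdriver'  # loaded lazily so this file parses without the gems
  require 'headless'

  headless = Headless.new    # virtual X display for servers without a screen
  headless.start
  browser = nil
  begin
    browser = Watir::Browser.new :firefox
    browser.goto url
    form = browser.form(index: 0)          # assumption: the first form is the target
    names = form.text_fields.map(&:name)   # name attribute of each text input
    { html: browser.html, text_field_names: names }
  ensure
    browser.close if browser
    headless.destroy
  end
end

# Run only when a URL is supplied, e.g. `ruby scrape.rb http://example.com/page`
if ARGV.first
  result = scrape_rendered_form(ARGV.first)
  puts result[:text_field_names]
end
```

The `html` value in the returned hash is the rendered DOM, so it can also be handed to an ordinary HTML parser if more than the text fields is needed.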

Answer 1
0 2014-08-06 18:05:21
