Extracting data from JavaScript web pages

Question

I need to build a system to extract vast amounts of data from a collection of webpages. A lot of these sites (maybe 90% or so) are powered by various JavaScript systems. What is the most efficient method to extract this data?

Since every site is different, I am looking for a flexible solution, and since there are many sites, I am looking for one that puts as little stress on my network as possible.

Most of my programming experience is in C, C++ and Perl, but I'm happy to use whatever gives the best result.

The webpages have constantly updating numbers and statistics that I wish to extract and analyze, so I need to be able to store them easily in a database.

I've done some research of my own, but I'm really coming up blank here. I'm hoping someone else can help me! :)

Answer 1

You will need a browser that interprets the JavaScript and makes the actual requests for you. You will then need to take a DOM snapshot of the interpreted result. It's not going to be trivial, and it's going to be impossible in pure PHP.

I have no experience of my own with it, but maybe the Selenium suite can help. It's an automation suite used for software testing, but according to this article, it can also be used for scraping to some extent.
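
The two steps described above (let a browser run the JavaScript, then snapshot the resulting DOM) could look roughly like this. This is only a sketch in Python, assuming the Selenium package and a Firefox driver are installed; the names `render_page`, `TextCollector` and `extract_numbers` are made up for illustration, and pulling out every number in the visible text is just a placeholder for whatever site-specific extraction you need.

```python
import re
from html.parser import HTMLParser


def render_page(url):
    """Load url in a real browser so its JavaScript runs, then return the page as HTML."""
    # Imported here so the parsing helpers below work without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # page_source is the browser's view of the page when it is called;
        # content that scripts load later may need an explicit wait (not shown).
        return driver.page_source
    finally:
        driver.quit()


class TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def extract_numbers(html):
    """Return every integer or decimal found in the text of an HTML snapshot."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    return [float(m) for m in re.findall(r"-?\d+(?:\.\d+)?", text)]
```

Something like `extract_numbers(render_page(url))` would then give you the numbers for one page. Starting a full browser per page is heavy, so for many sites you would probably want to reuse one driver across requests.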

Answer 2

Maybe you should try the PHP DOMDocument class. For example, this code will "steal" the text of all the table tags from the URL.

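$data = array();
$url = 'http://your.site.com';    // the scheme is required, otherwise this is read as a local file
$out = file_get_contents($url);   // raw HTML as sent by the server; no JavaScript is executed
libxml_use_internal_errors(true); // silence warnings about malformed real-world HTML
$dom = new DOMDocument();
$dom->loadHTML($out);
foreach ($dom->getElementsByTagName('table') as $table) {
    $data[] = $table->nodeValue;  // text content of the whole table
}
print_r($data);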

You can take and manipulate the whole DOM and parse the entire HTML document. Consider calling this script asynchronously with an AJAX approach.
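
Since the question also asks about getting the values into a database, here is a rough sketch of going one step further than collecting whole tables: splitting the HTML into individual cells and storing them. It is written in Python rather than PHP and uses only the standard library. The class `TableCellParser`, the function `store_rows` and the `cells` table layout are all hypothetical, and the HTML string stands in for whatever markup you have already fetched.

```python
import sqlite3
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collects the text of each <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def store_rows(conn, source_url, rows):
    """Insert one database row per table cell, tagged with where it came from."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cells ("
        "source TEXT, row_idx INTEGER, col_idx INTEGER, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO cells VALUES (?, ?, ?, ?)",
        [(source_url, r, c, value)
         for r, row in enumerate(rows)
         for c, value in enumerate(row)],
    )
    conn.commit()


# Stand-in for HTML that has already been downloaded.
html = ("<table><tr><th>Name</th><th>Hits</th></tr>"
        "<tr><td>home</td><td>42</td></tr></table>")
parser = TableCellParser()
parser.feed(html)
parser.close()

conn = sqlite3.connect(":memory:")  # use a file path to keep the data
store_rows(conn, "http://your.site.com", parser.rows)
```

Storing one row per cell keeps the schema the same for every site, which may help when each site lays out its tables differently; a per-site schema would be easier to query but more work to maintain.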

2 answers

Answer 1
0 2011-04-25 13:12:24

Answer 2
-1 2011-04-25 13:08:14
