Extracting data from JavaScript webpages

I need to build a system to extract vast amounts of data from a collection of webpages. A lot of these sites (maybe 90% or so) are powered by various JavaScript systems. What is the most efficient method to extract this data?

Since every site is different, I am looking for a flexible solution, and since there are many sites, I am looking for a solution that will put as little stress on my network as possible.

Most of my programming experience is in C, C++ and Perl, but I'm happy to use whatever gives the best result.

The webpages have constantly updating numbers and statistics that I wish to extract and analyse, so I need to be able to store them easily in a database.

I've done some research of my own, but I'm really coming up blank here. I'm hoping someone else can help me! :)

You will need a browser that interprets the JavaScript and makes the actual requests for you. You will then need to take a DOM snapshot of the interpreted result. It's not going to be trivial, and it's going to be impossible in pure PHP.

I have no experience with it myself, but maybe the Selenium suite can help. It's an automation suite used for software testing, but according to this article, it can also be used for scraping to some extent.
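As a rough illustration of the "browser plus DOM snapshot" idea, here is a minimal Python sketch. It assumes the third-party selenium package and a local Firefox install, neither of which is named in the thread beyond "Selenium"; the function name take_dom_snapshot and the URL are placeholders made up for this example.

```python
def take_dom_snapshot(driver, url):
    """Load url in a real browser and return the page HTML after scripts have run."""
    driver.get(url)            # the browser makes the requests and runs the JavaScript
    return driver.page_source  # serialized page as the browser currently holds it


def main():
    # Assumption: the selenium package and Firefox are installed on this machine.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        html = take_dom_snapshot(driver, "http://your.site.com")  # placeholder URL
        print(len(html), "characters of rendered HTML")
    finally:
        driver.quit()  # always close the browser, even if the page load fails
```

Calling main() opens a browser window, loads the page, and prints the size of the snapshot. Pages that keep updating their numbers after load may need an explicit wait before the snapshot is taken, and driving a full browser per page is heavy on both CPU and network, so this is a starting point rather than a tuned solution.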

Maybe you should try the PHP DOMDocument class. For example, this code will "steal" the contents of all the table tags from the URL.

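$data = array();
// file_get_contents() needs a scheme to treat this as a URL rather than a local path
$url = 'http://your.site.com';
$out = file_get_contents($url);
// Real-world HTML is rarely valid; collect parser warnings instead of printing them
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($out);
// Collect the text content of every <table> element
foreach ($dom->getElementsByTagName('table') as $table) {
    $data[] = $table->nodeValue;
}
print_r($data);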

You can traverse and manipulate the whole DOM and parse the entire HTML document. Consider calling this script asynchronously with an AJAX approach.
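For comparison, the same table extraction can be sketched in Python using only the standard library, with the results written to SQLite since the question asks for database storage. The names TableText, extract_tables, store_tables and the snapshots table are hypothetical, and unlike the PHP snippet this version only reports top-level tables (nested tables are folded into their parent) and joins cell text with spaces. Parsing fetched HTML this way only sees what the server sent, so values written into the page by JavaScript will be missing unless the HTML comes from a browser.

```python
import sqlite3
from html.parser import HTMLParser


class TableText(HTMLParser):
    """Collect the text of every top-level <table>, roughly like nodeValue in PHP."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <table> elements we are currently inside
        self.parts = []   # text pieces of the table being read
        self.tables = []  # finished tables, one string each

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:  # outermost table just closed
                self.tables.append(" ".join(self.parts))
                self.parts = []

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.parts.append(data.strip())


def extract_tables(html):
    """Return one text string per top-level table found in html."""
    parser = TableText()
    parser.feed(html)
    parser.close()
    return parser.tables


def store_tables(conn, url, tables):
    """Append each table's text to a timestamped snapshots table (assumed schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snapshots ("
        "url TEXT, fetched_at TEXT DEFAULT CURRENT_TIMESTAMP, content TEXT)"
    )
    conn.executemany(
        "INSERT INTO snapshots (url, content) VALUES (?, ?)",
        [(url, text) for text in tables],
    )
    conn.commit()
```

A typical call would be store_tables(sqlite3.connect("stats.db"), url, extract_tables(html)), where html may come from a plain HTTP fetch or from a browser-rendered snapshot. Keeping one row per table per fetch makes it easy to query how a statistic changed over time.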
