
Extracting data from JavaScript webpages

I need to build a system to extract large amounts of data from a collection of webpages. A lot of these sites (maybe 90% or so) are powered by various JavaScript frameworks. I am wondering what the most efficient method to extract this data would be?

Since every site is different, I am looking for a flexible solution, and since there are many sites, I want a solution that puts as little stress on my network as possible.

Most of my programming experience is in C, C++ and Perl, but I'm happy to use whatever gives the best result.

The webpages have constantly updating numbers and statistics that I wish to extract and perform some analysis on, so I need to be able to easily store them in a database.

I've done some research of my own, but I'm really coming up blank here. I'm hoping someone else can help me! :)

You will need a browser that interprets the JavaScript and makes the actual requests for you. You will then need to take a DOM snapshot of the interpreted result. It's not going to be trivial, and it's going to be impossible in pure PHP.

I have no experience with it myself, but maybe the Selenium suite can help. It's an automation suite used for software testing, but according to this article it can, to some extent, also be used for scraping.
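
To give a concrete idea, here is a minimal sketch of driving a browser through Selenium from PHP with the php-webdriver client. The Selenium server URL, the choice of Chrome, and the target URL are assumptions for illustration; the point is just that the DOM snapshot is taken after the page's JavaScript has executed.

<?php
// Assumes: composer require php-webdriver/webdriver, and a Selenium server running locally.
require_once 'vendor/autoload.php';

use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Remote\DesiredCapabilities;

// Connect to the (assumed) local Selenium server and start a Chrome session.
$driver = RemoteWebDriver::create('http://localhost:4444/wd/hub', DesiredCapabilities::chrome());

// Load the page; the browser executes its JavaScript before we read anything.
$driver->get('http://your.site.com');

// Snapshot of the DOM as rendered after the scripts have run.
$html = $driver->getPageSource();

$driver->quit();

// $html can now be fed into an HTML parser (e.g. DOMDocument) and the values stored in a database.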

Maybe you should try PHP's DOMDocument class. For example, this code will "steal" all the table tags from the URL.

$data = array();
$url = 'your.site.com';

// Fetch the raw HTML of the page.
$out = file_get_contents($url);

// Parse it; suppress warnings about malformed real-world markup.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($out);

// Collect the text content of every <table> element.
foreach ($dom->getElementsByTagName('table') as $table) {
    $data[] = $table->nodeValue;
}
print_r($data);

You can traverse and manipulate the whole DOM this way and parse the entire HTML document. Consider calling this script asynchronously with an AJAX approach.
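
Since the goal is to store the extracted numbers in a database, here is a minimal sketch of how the same DOMDocument result could be queried with DOMXPath and written out with PDO. The XPath expression, the SQLite file, and the stats table/column names are assumptions for illustration; adjust them to your actual markup and schema.

// Assumes $dom already holds the document loaded in the snippet above.
$xpath = new DOMXPath($dom);

// Hypothetical query: grab every cell in the second column of each table row.
$cells = $xpath->query('//table//tr/td[2]');

// Hypothetical SQLite database with a "stats" table for the extracted values.
$pdo = new PDO('sqlite:stats.db');
$pdo->exec('CREATE TABLE IF NOT EXISTS stats (value TEXT, fetched_at TEXT)');
$stmt = $pdo->prepare('INSERT INTO stats (value, fetched_at) VALUES (?, ?)');

foreach ($cells as $cell) {
    $stmt->execute(array(trim($cell->nodeValue), date('c')));
}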
