
Extracting data from JavaScript webpages

I need to build a system to extract large amounts of data from a collection of webpages. A lot of these sites (maybe 90% or so) are powered by various JavaScript frameworks. I am wondering what the most efficient method to extract this data would be?

Since every site is different, I am looking for a flexible solution, and since there are many sites, I want a solution that puts as little stress on my network as possible.

Most of my programming experience is in C, C++ and Perl, but I'm happy to use whatever gives the best result.

The webpages have constantly updating numbers and statistics that I wish to extract and perform some analysis on, so I need to be able to easily store them in a database.

I've done some research of my own, but I'm really coming up blank here. I'm hoping someone else can help me! :)

You will need a browser that interprets the JavaScript and makes the actual requests for you. You will then need to take a DOM snapshot of the interpreted result. It's not going to be trivial, and it's going to be impossible in pure PHP.

I have no experience with it myself, but maybe the Selenium suite can help. It's an automation suite used for software testing, but according to this article it can, to some extent, also be used for scraping.
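
To give a concrete idea, here is a minimal sketch of driving a browser through Selenium from PHP with the php-webdriver client. The Selenium server URL, the choice of Chrome, and the target URL are assumptions for illustration; the point is just that the DOM snapshot is taken after the page's JavaScript has executed.

<?php
// Assumes: composer require php-webdriver/webdriver, and a Selenium server running locally.
require_once 'vendor/autoload.php';

use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Remote\DesiredCapabilities;

// Connect to the (assumed) local Selenium server and start a Chrome session.
$driver = RemoteWebDriver::create('http://localhost:4444/wd/hub', DesiredCapabilities::chrome());

// Load the page; the browser executes its JavaScript before we read anything.
$driver->get('http://your.site.com');

// Snapshot of the DOM as rendered after the scripts have run.
$html = $driver->getPageSource();

$driver->quit();

// $html can now be fed into an HTML parser (e.g. DOMDocument) and the values stored in a database.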

Maybe you should try PHP's DOMDocument class. For example, this code will "steal" all the table tags from the URL.

$data = array();
$url = 'your.site.com';

// Fetch the raw HTML of the page.
$out = file_get_contents($url);

// Parse it; suppress warnings about malformed real-world markup.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($out);

// Collect the text content of every <table> element.
foreach ($dom->getElementsByTagName('table') as $table) {
    $data[] = $table->nodeValue;
}
print_r($data);

You can traverse and manipulate the whole DOM this way and parse the entire HTML document. Consider calling this script asynchronously with an AJAX approach.
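
Since the goal is to store the extracted numbers in a database, here is a minimal sketch of how the same DOMDocument result could be queried with DOMXPath and written out with PDO. The XPath expression, the SQLite file, and the stats table/column names are assumptions for illustration; adjust them to your actual markup and schema.

// Assumes $dom already holds the document loaded in the snippet above.
$xpath = new DOMXPath($dom);

// Hypothetical query: grab every cell in the second column of each table row.
$cells = $xpath->query('//table//tr/td[2]');

// Hypothetical SQLite database with a "stats" table for the extracted values.
$pdo = new PDO('sqlite:stats.db');
$pdo->exec('CREATE TABLE IF NOT EXISTS stats (value TEXT, fetched_at TEXT)');
$stmt = $pdo->prepare('INSERT INTO stats (value, fetched_at) VALUES (?, ?)');

foreach ($cells as $cell) {
    $stmt->execute(array(trim($cell->nodeValue), date('c')));
}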
