
Fetching web pages with JavaScript links from Java

I have a web crawler application in Java that needs to follow all links on a web page. The problem is that on some pages the links are generated by a JavaScript function, something like:

<a href="someJavascriptFunction()"> Lorem Ipsum </a>

I'm aware of HtmlUnit, but in my tests it was just way too slow for my purposes: a local page (http://localhost/test.html) took almost 2 seconds to fetch, and remote pages took much longer.

I would like the simplest/fastest way to find all links on a web page in Java, including the JavaScript-generated ones (solutions in C/C++ are also welcome). I'm also aware that Nutch (the crawler) has a JavaScript link extractor, but I'm not sure whether that code can be "extracted" out of Nutch and used in another context.
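One fast alternative to running a full JS engine is to scan the page source for URL-like string literals inside scripts and event handlers, which is essentially the static approach Nutch takes. A minimal sketch along those lines is below; it assumes jsoup for parsing, and the class name and regex are my own illustration, not Nutch's code. Note this only finds URLs that appear literally in the source; links computed at runtime still need something like HtmlUnit.

import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkExtractor {

    // Rough heuristic: quoted absolute URLs or root-relative paths inside JS.
    private static final Pattern JS_URL =
            Pattern.compile("[\"'](https?://[^\"'\\s]+|/[^\"'\\s]+)[\"']");

    public static Set<String> extractLinks(String url) throws Exception {
        Set<String> links = new LinkedHashSet<>();
        Document doc = Jsoup.connect(url).get();

        // Ordinary anchors; "abs:href" resolves relative URLs against the page.
        for (Element a : doc.select("a[href]")) {
            links.add(a.attr("abs:href"));
        }

        // Collect inline script bodies and onclick handlers, then scan them.
        StringBuilder js = new StringBuilder();
        for (Element script : doc.select("script")) {
            js.append(script.data()).append('\n');
        }
        for (Element el : doc.select("[onclick]")) {
            js.append(el.attr("onclick")).append('\n');
        }
        Matcher m = JS_URL.matcher(js);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        for (String link : extractLinks("http://localhost/test.html")) {
            System.out.println(link);
        }
    }
}

Since this is a single parse plus a regex pass, it should be orders of magnitude faster than executing the page's JavaScript, at the cost of missing dynamically constructed URLs.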

It seems possible to extract useful code from Nutch: look at the JSParseFilter class (in the parse-js plugin) and how its main method can be used as a standalone JS link extractor.
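If you want to try that directly, a minimal driver could look like the sketch below. Assumptions: Nutch 1.x with the core and parse-js jars on the classpath, and the exact arguments expected by JSParseFilter.main (it prints the links it extracts) may differ across Nutch versions, so check the source.

// Hypothetical driver around Nutch's parse-js plugin; requires the
// Nutch core and parse-js jars (plus their dependencies) on the classpath.
import org.apache.nutch.parse.js.JSParseFilter;

public class NutchJsLinkDemo {
    public static void main(String[] args) throws Exception {
        // JSParseFilter.main prints the outlinks it extracts from a script
        // file; the argument order/meaning may vary by Nutch version.
        JSParseFilter.main(new String[] { "page.js", "http://localhost/" });
    }
}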
