
Fetching web pages with JavaScript links from Java

I have a web crawler application in Java that needs to follow all links on a web page. The problem is that on some pages the links are generated by a JavaScript function, something like:

<a href="someJavascriptFunction()"> Lorem Ipsum </a>

I'm aware of HtmlUnit, but in my tests it was just way too slow for my purposes: a local page (http://localhost/test.html) took almost 2 seconds to fetch, and remote pages took much longer.

I would like the simplest/fastest way to find all links on a web page in Java, including the JavaScript-generated ones (solutions in C/C++ are also welcome). I'm also aware that Nutch (the crawler) has a JavaScript link extractor, but I'm not sure whether that code can be "extracted" out of Nutch and used in another context.
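One fast alternative to running a full JS engine is to scan the page source for URL-like string literals inside scripts and event handlers, which is essentially the static approach Nutch takes. A minimal sketch along those lines is below; it assumes jsoup for parsing, and the class name and regex are my own illustration, not Nutch's code. Note this only finds URLs that appear literally in the source; links computed at runtime still need something like HtmlUnit.

import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkExtractor {

    // Rough heuristic: quoted absolute URLs or root-relative paths inside JS.
    private static final Pattern JS_URL =
            Pattern.compile("[\"'](https?://[^\"'\\s]+|/[^\"'\\s]+)[\"']");

    public static Set<String> extractLinks(String url) throws Exception {
        Set<String> links = new LinkedHashSet<>();
        Document doc = Jsoup.connect(url).get();

        // Ordinary anchors; "abs:href" resolves relative URLs against the page.
        for (Element a : doc.select("a[href]")) {
            links.add(a.attr("abs:href"));
        }

        // Collect inline script bodies and onclick handlers, then scan them.
        StringBuilder js = new StringBuilder();
        for (Element script : doc.select("script")) {
            js.append(script.data()).append('\n');
        }
        for (Element el : doc.select("[onclick]")) {
            js.append(el.attr("onclick")).append('\n');
        }
        Matcher m = JS_URL.matcher(js);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        for (String link : extractLinks("http://localhost/test.html")) {
            System.out.println(link);
        }
    }
}

Since this is a single parse plus a regex pass, it should be orders of magnitude faster than executing the page's JavaScript, at the cost of missing dynamically constructed URLs.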

It seems possible to extract useful code from Nutch: look at the JSParseFilter class (in the parse-js plugin) and how its main method can be used as a standalone JS link extractor.
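If you want to try that directly, a minimal driver could look like the sketch below. Assumptions: Nutch 1.x with the core and parse-js jars on the classpath, and the exact arguments expected by JSParseFilter.main (it prints the links it extracts) may differ across Nutch versions, so check the source.

// Hypothetical driver around Nutch's parse-js plugin; requires the
// Nutch core and parse-js jars (plus their dependencies) on the classpath.
import org.apache.nutch.parse.js.JSParseFilter;

public class NutchJsLinkDemo {
    public static void main(String[] args) throws Exception {
        // JSParseFilter.main prints the outlinks it extracts from a script
        // file; the argument order/meaning may vary by Nutch version.
        JSParseFilter.main(new String[] { "page.js", "http://localhost/" });
    }
}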
