
Does Nutch crawl over forms?

I was wondering whether Nutch 1.4 crawls over forms right out of the box. For example, if a page contains a drop-down list, would Nutch try to fetch every page reachable from the items in that list?

Thanks

Nutch fetches the HTML source of the desired page with an HTTP request. That source may contain a drop-down list. If the list is built with client-side scripting such as Dojo or Ajax, Nutch will not be able to interpret it the way a browser would. If the outlinks of the drop-down list appear directly in the HTML source, Nutch will crawl those pages. Apart from normal textual content, Nutch also parses the JavaScript portions of the HTML page.

To verify this, open the page in a browser or fetch it with wget, then view the page source in a text editor such as Notepad or vi. Can you see the outlinks for the drop-down box there? If yes, Nutch will crawl those outlinks; otherwise it will not.
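
If reading the source by eye is tedious, the same check can be roughly scripted. The following is only a minimal sketch using Python's standard `html.parser`; it is not Nutch's parser and does not reproduce its outlink rules. The names `StaticLinkCollector` and `collect_static_links` and the sample markup are hypothetical, made up for illustration. It lists the `href` values of `<a>` tags and the `value` attributes of `<option>` tags that are literally present in the raw HTML, which is the text a crawler receives before any script runs.

```python
from html.parser import HTMLParser


class StaticLinkCollector(HTMLParser):
    """Collect URLs that are literally present in the raw HTML source."""

    def __init__(self):
        super().__init__()
        self.anchor_hrefs = []   # href values of <a> tags
        self.option_values = []  # value attributes of <option> tags

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.anchor_hrefs.append(attrs["href"])
        elif tag == "option" and attrs.get("value"):
            self.option_values.append(attrs["value"])


def collect_static_links(html):
    """Return (anchor hrefs, option values) found in the given HTML text."""
    collector = StaticLinkCollector()
    collector.feed(html)
    return collector.anchor_hrefs, collector.option_values


# Hypothetical page source, as saved by wget or copied from "view source".
SAMPLE = """
<a href="/about.html">About</a>
<select onchange="location = this.value">
  <option value="/products/a.html">A</option>
  <option value="/products/b.html">B</option>
</select>
<script>buildMenu();</script>
"""

anchors, options = collect_static_links(SAMPLE)
print(anchors)
print(options)
```

Anything a script such as the hypothetical `buildMenu()` would add at run time never shows up in either list, because nothing here executes JavaScript. Whether a URL that appears only as an `<option>` value is actually followed depends on how the Nutch parse plugins are configured, so treat the second list as a hint about what is visible in the source rather than a guarantee of what gets crawled.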
