
Does nutch crawl over forms?

Original post 2012-05-10 15:07:03 2 1 solr/ lucene/ web-crawler/ nutch

I was wondering whether Nutch 1.4 crawls over forms right out of the box. For example, if a page contains a drop-down list, would Nutch try to fetch every page reachable from the items in that list?

Thanks

1 answer

Nutch fetches the HTML source of the desired page via an HTTP request. That source may contain a drop-down list. If the list is built with client-side scripting such as Dojo or Ajax, Nutch will not be able to interpret it the way a browser would. If the outlinks of the drop-down list appear directly in the HTML source, Nutch will crawl those pages. Apart from normal textual content, Nutch also parses the JavaScript portions of the HTML page.

To verify this, open the page in a browser or fetch it with wget, then view the page source in a text editor such as Notepad or vi. Can you see the drop-down list's outlinks there? If yes, Nutch will crawl those outlinks; otherwise it will not.
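The manual check above can also be roughly automated. The following is a minimal Python sketch, not Nutch's own parser: it only lists the URLs that appear literally in the raw HTML, which is what a non-browser client sees. The names `StaticLinkCollector`, `collect_static_links`, `SAMPLE_HTML`, and the `example.com` URLs are made up for illustration, and whether Nutch turns a particular `<option>` value into an outlink depends on its parse plugins, which this sketch does not model.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class StaticLinkCollector(HTMLParser):
    """Collect URLs that appear literally in the HTML source (no script execution)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.anchor_links = []   # absolute URLs from <a href="...">
        self.option_values = []  # raw value attributes from <option value="...">

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("href"):
            self.anchor_links.append(urljoin(self.base_url, attr_map["href"]))
        elif tag == "option" and attr_map.get("value"):
            self.option_values.append(attr_map["value"])


def collect_static_links(html_source, base_url):
    """Return (anchor_links, option_values) found in the static HTML."""
    collector = StaticLinkCollector(base_url)
    collector.feed(html_source)
    return collector.anchor_links, collector.option_values


# Hypothetical page: one plain link, one drop-down whose targets are in the
# source, and one menu that is only filled in by a script at runtime.
SAMPLE_HTML = """
<html><body>
  <a href="/products/1.html">Product 1</a>
  <select id="category" onchange="location = this.value;">
    <option value="/category/books.html">Books</option>
    <option value="/category/music.html">Music</option>
  </select>
  <div id="menu"></div>
  <script>buildMenu("menu");</script>
</body></html>
"""

if __name__ == "__main__":
    links, options = collect_static_links(SAMPLE_HTML, "http://example.com/")
    print("Links visible in source:", links)
    print("Drop-down values visible in source:", options)
```

In this sample, the entries generated by the hypothetical `buildMenu` script never show up in either list, which mirrors the case where a script-built drop-down is invisible in the page source. To check a real page, the HTML saved with wget can be read from disk and passed to `collect_static_links` in place of `SAMPLE_HTML`.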

solution1 1 2012-05-11 03:16:33