Nutch regex for crawl

Question

I am using Apache Nutch to crawl a website. I want to crawl the page that is returned when I search for a particular name. For example, if I search for bill gates, I want to get the links from that search result. I have a URL like

www.mysite.com/search?name=bill+gates

But when crawling, Nutch reports that there are no more URLs to fetch. It does not actually fetch any results.

Is there any way to crawl that page? I have already changed regex-urlfilter.txt to accept everything. How can I crawl the link? Thanks in advance.

Answer 1

As far as I remember, Nutch has an extra rule that filters out URLs with query parameters such as ?q=bill+gates. I think this rule is in automaton-urlfilter.txt:

# skip URLs containing certain characters as probable queries, etc.
-.*[?*!@=].*

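So you have to change this line, for example by removing ? and = from the character class so that URLs with a query string are no longer rejected.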
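
To see why the search URL gets dropped, here is a rough Python sketch of how such a filter file behaves. It is not Nutch code: the url_filter function, the rule lists and the http:// prefix on the URL are made up for illustration. It assumes that rules are checked top to bottom, that the first matching rule decides, that a pattern has to match the whole URL, and that a URL matching no rule is rejected. The relaxed rule is only one possible edit.

```python
import re

def url_filter(rules, url):
    """Return True if the URL is accepted, False if it is rejected.

    Each rule is '+' or '-' followed by a regular expression.
    Assumption: the first rule whose pattern matches the whole URL decides,
    and a URL that matches no rule is rejected.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.fullmatch(pattern, url):
            return sign == "+"
    return False

# The rule quoted above, followed by an accept-everything rule.
DEFAULT_RULES = ["-.*[?*!@=].*", "+.*"]

# Hypothetical edit: '?' and '=' removed from the character class.
RELAXED_RULES = ["-.*[*!@].*", "+.*"]

URL = "http://www.mysite.com/search?name=bill+gates"

print(url_filter(DEFAULT_RULES, URL))  # rejected because of '?' and '='
print(url_filter(RELAXED_RULES, URL))  # accepted by the '+.*' rule
```

The point is that an accept-everything rule does not help while a reject rule earlier in the list (or in another active filter file) still matches the URL.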

Hope this helps.

solution1
1 ACCEPTED 2013-05-23 10:53:57