Nutch regex for crawl

I am using Apache Nutch to crawl a web page. I want to crawl the page that results from searching for a particular name: for example, if I search for bill gates, I want to get the links in that search result. I have a URL like

www.mysite.com/search?name=bill+gates

but when crawling, Nutch reports that there are no more URLs to fetch. It does not actually fetch any results.

Is there any option to crawl that page? I have edited regex-urlfilter.txt to accept everything. How can I crawl the link? Thanks in advance.

If I remember correctly, Nutch has an extra setting that filters out URLs with query parameters such as ?q=bill+gates. I think this setting is located in automaton-urlfilter.txt:

# skip URLs containing certain characters as probable queries, etc.
-.*[?*!@=].*

So you have to change this line, for example by removing ? and = from the character class or by commenting the rule out.
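To see why that rule blocks the search URL, here is a minimal Python sketch of how such filter rules are evaluated. It is a simplified model, not Nutch's own code: it assumes the first matching rule decides, that a URL matching no rule is rejected, and that automaton patterns must match the whole URL. The `accepts` function, both rule lists, and the relaxed pattern `-.*[*!@].*` are hypothetical names and values chosen for illustration.

```python
import re

def accepts(url, rules):
    """Return True if the first rule matching the whole URL is a '+' rule.

    Simplified model (assumption): first match wins, no match means reject.
    """
    for sign, pattern in rules:
        if re.fullmatch(pattern, url):
            return sign == "+"
    return False

# The rule quoted above, followed by an accept-everything rule.
DEFAULT_RULES = [("-", r".*[?*!@=].*"), ("+", r".*")]

# Hypothetical relaxed variant: '?' and '=' removed from the character class.
RELAXED_RULES = [("-", r".*[*!@].*"), ("+", r".*")]

url = "http://www.mysite.com/search?name=bill+gates"
print(accepts(url, DEFAULT_RULES))  # False: the URL contains '?' and '='
print(accepts(url, RELAXED_RULES))  # True: no excluded character remains
```

If the URL is still skipped after relaxing the rule, it may be worth checking whether another URL filter file contains a similar rule.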

Hope this helps.
