
Exclude URLs without 'www' from Nutch 1.7 crawl

I'm currently using Nutch 1.7 to crawl my domain. My issue is specific to URLs being indexed as www vs. non-www.

Specifically, after running the crawl, indexing to Solr 4.5, and then validating the results on the front-end with AJAX Solr, the search results page lists pages under both 'www' and non-'www' URLs, such as:

www.mywebsite.com
mywebsite.com
www.mywebsite.com/page1.html
mywebsite.com/page1.html

My understanding is that the URL filtering, i.e. regex-urlfilter.txt, needs modification. Are there any regex/Nutch experts who could suggest a solution?

Here is the code on pastebin.

There are at least a couple of solutions.

1.) urlfilter-regex plugin

If you don't want to crawl the non-www pages at all, or want to filter them out at a later stage such as index time, that is what the urlfilter-regex plugin is for. It lets you mark URLs matching regex patterns prefixed with "+" as eligible to be crawled. Anything that does not match a regex prefixed with "+" will not be crawled. Additionally, if you want to specify a general pattern but exclude certain URLs, you can use a "-" prefix to mark URLs to exclude.

In your case you would use a rule like:

+^(https?://)?www\.

This will match anything that starts with:

https://www.
http://www.
www.

and therefore will only allow such URLs to be crawled.
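
For example, a minimal regex-urlfilter.txt built around that rule might look like the sketch below. Note the final "-." rule: it replaces the usual "+." catch-all that ships in the default file, so anything that does not match the "www" pattern is rejected.

# accept only URLs whose host starts with www.
+^(https?://)?www\.

# reject everything else (instead of the default "+." catch-all)
-.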

Given that the URLs you listed were not being excluded by your regex-urlfilter.txt, either the plugin wasn't turned on in your nutch-site.xml, or it is not pointed at that file.

In nutch-site.xml you have to include urlfilter-regex in the list of plugins, e.g.:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-basic|query-(basic|site|url)|response-(json|xml)|urlnormalizer-(pass|regex|basic)</value>
</property>

Additionally, check that the property specifying which file to use is not overridden in nutch-site.xml and is correct in nutch-default.xml. It should be:

<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing regular expressions
  used by urlfilter-regex (RegexURLFilter) plugin.</description>
</property>

and regex-urlfilter.txt should be in Nutch's conf directory.

There is also the option to perform the filtering only at certain steps, e.g. at index time, if that is the only place you want to filter.

2.) solrdedup command

If the URLs point to the exact same page, which I am guessing is the case here, they can be removed by running the nutch command to delete duplicates after crawling: http://wiki.apache.org/nutch/bin/nutch%20solrdedup

This will use the digest values computed from the text of each indexed page to find any pages that were the same and delete all but one.
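
For example, assuming Solr is running at http://localhost:8983/solr (substitute your own Solr URL), the invocation would look like:

bin/nutch solrdedup http://localhost:8983/solr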

However, you would have to modify the dedup job's code to change which duplicate is kept if you want to specifically keep the "www" versions.

3.) Write a custom indexing filter plugin

You can write a plugin that reads the URL field of a Nutch document and converts it however you want before indexing. This gives you more flexibility than using an existing plugin like urlnormalizer-regex.

It is actually very easy to make plugins and add them to Nutch, which is one of the great things about it. As a starting point you can copy and study one of the other plugins included with Nutch that implement IndexingFilter, such as the index-basic plugin.
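
To give a concrete idea, here is a minimal sketch of such a filter against the Nutch 1.7 IndexingFilter API. The package, class name, and regex are made up for illustration, and the exact NutchDocument accessors should be verified against your Nutch source:

// Hypothetical filter that rewrites the indexed "url" field so the
// host always carries a "www." prefix before the document reaches Solr.
package org.example.nutch.indexer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class WwwUrlIndexingFilter implements IndexingFilter {

  private Configuration conf;

  @Override
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    Object value = doc.getFieldValue("url");
    if (value == null) {
      return doc;
    }
    String original = value.toString();
    // Insert "www." right after the scheme when the host lacks it.
    // (Note: a naive rule like this would also prefix subdomains.)
    String rewritten = original.replaceFirst("^(https?://)(?!www\\.)", "$1www.");
    if (!rewritten.equals(original)) {
      doc.removeField("url");
      doc.add("url", rewritten);
    }
    return doc;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}

The plugin would then need its own plugin.xml descriptor, and its id would have to be added to plugin.includes, the same way urlfilter-regex is enabled above.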

You can also find a lot of examples:

http://wiki.apache.org/nutch/WritingPluginExample
http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html
