
How to skip documents with an empty content field during Nutch to Solr indexing?

Original post: 2013-10-15 18:39:04 | Tags: apache, solr, indexing, nutch, web-crawler

During solrindex, how can I tell Nutch to skip indexing documents that have an empty content field?

I found http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/ , but the index-omit plugin described there only lets Nutch filter out documents that lack certain metatag fields, not general fields such as content.

1 Answer

You might need to implement a new Nutch indexing filter that discards the document if its content is empty.

You can find more information on how to write a plugin at this link: https://wiki.apache.org/nutch/AboutPlugins

EDIT:
I wrote a simple plugin as an example. It looks at the "content" field and, if it is empty, ignores the document so that it is not indexed.

You can get it here: https://github.com/nimeshjm/index-discardemptycontent
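
For readers who just want the idea, here is a minimal sketch of the emptiness check such a filter would perform. It is not the code from the linked repository; the class and method names (`EmptyContentCheck`, `shouldIndex`) are hypothetical, and the check is kept free of Nutch types so it runs without Nutch on the classpath. It also assumes that whitespace-only content should count as empty, which the original answer does not specify.

```java
// Hypothetical helper; names are illustrative and NOT taken from the linked plugin.
final class EmptyContentCheck {

    private EmptyContentCheck() {}

    /**
     * Returns true when the value of the "content" field holds some
     * non-whitespace text, i.e. the document is worth indexing.
     * The parameter is an Object because an index document field
     * is not guaranteed to be a String.
     */
    static boolean shouldIndex(Object contentFieldValue) {
        if (contentFieldValue == null) {
            return false; // field is missing entirely
        }
        // Assumption: whitespace-only content is treated as empty.
        return !contentFieldValue.toString().trim().isEmpty();
    }

    public static void main(String[] args) {
        String[] samples = {null, "", "   \n", "Hello Nutch"};
        for (String s : samples) {
            String shown = (s == null) ? "null" : "\"" + s.replace("\n", "\\n") + "\"";
            System.out.println(shown + " -> " + (shouldIndex(s) ? "index" : "skip"));
        }
    }
}
```

In a real Nutch 1.x plugin, my understanding is that you would call a check like this from your `IndexingFilter` implementation's `filter(...)` method, passing it the document's "content" field value, and return `null` instead of the document when the check fails, since a `null` return tells Nutch to discard the document. Please verify the exact method signature against the Javadoc for your Nutch version.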