How to skip documents with an empty content field during Nutch to Solr indexing?

During solrindex, how can I tell Nutch to skip documents whose content field is empty?

I found http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/ , but the index-omit plugin described there only lets Nutch filter out documents that lack certain metatag fields, not general fields such as content.

You might need to implement a new Nutch indexing filter that discards a document if its content is empty.

You can find more information on how to write a plugin at this link: https://wiki.apache.org/nutch/AboutPlugins

EDIT:
I wrote a simple plugin as an example. It looks at the "content" field and, if that field is empty, skips the document so that it is not indexed.

You can get it from here: https://github.com/nimeshjm/index-discardemptycontent
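
The repository's code is not reproduced here. As a rough illustration of the idea only, the sketch below models the check in plain Java so it can be run without Nutch on the classpath. `Doc` is a hypothetical stand-in for Nutch's `NutchDocument`, and the static `filter` method is a hypothetical stand-in for the `filter` method of an `org.apache.nutch.indexer.IndexingFilter`. It assumes the Nutch 1.x behavior, as far as I know it, that a document is dropped from indexing when an indexing filter returns `null`. Treating whitespace-only content as empty is also an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Map;

final class DiscardEmptyContentSketch {

    /** Hypothetical stand-in for NutchDocument: maps a field name to its value. */
    static final class Doc {
        private final Map<String, String> fields = new HashMap<>();

        Doc with(String name, String value) {
            fields.put(name, value);
            return this;
        }

        String get(String name) {
            return fields.get(name);
        }
    }

    /**
     * Mirrors the contract assumed for a Nutch indexing filter:
     * return the document to keep it, or null to discard it.
     */
    static Doc filter(Doc doc) {
        String content = doc.get("content");
        // A missing field and a whitespace-only field both count as empty here.
        if (content == null || content.trim().isEmpty()) {
            return null;
        }
        return doc;
    }

    public static void main(String[] args) {
        Doc empty = new Doc().with("title", "No body").with("content", "   ");
        Doc full = new Doc().with("title", "Has body").with("content", "Some text");

        System.out.println("empty discarded: " + (filter(empty) == null));
        System.out.println("full kept: " + (filter(full) == full));
    }
}
```

In an actual plugin, the same check would presumably live in a class implementing `IndexingFilter`, and the plugin would have to be enabled through the `plugin.includes` property in `nutch-site.xml` before solrindex picks it up.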
