
Apache Lucene TokenStream Filters

I have some questions regarding the Apache Lucene library:

1) How can I concatenate two TokenStream objects into one TokenStream object?

2) Which filter can be used to remove all duplicate tokens (tokens with the same value) from a TokenStream object?

Thanks in advance.

As for concatenating two sources, just add two Field instances with the same name to the Document. Lucene analyzes the values one after another into the same field, so the result is effectively the same as a single field holding the concatenated value.
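To make that equivalence concrete, here is a minimal plain-Java model. It is not Lucene code: `tokenize` is a hypothetical whitespace splitter standing in for an analyzer, and `indexValues` is an assumed helper that mimics how several values added under one field name end up as a single token sequence.

```java
import java.util.ArrayList;
import java.util.List;

final class MultiValueFieldModel {
    // Hypothetical stand-in for an analyzer: lowercase, then split on whitespace.
    static List<String> tokenize(String value) {
        List<String> tokens = new ArrayList<>();
        for (String t : value.trim().toLowerCase().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Models several values added under the same field name: each value is
    // tokenized in order and its tokens are appended to one shared sequence.
    static List<String> indexValues(String... values) {
        List<String> all = new ArrayList<>();
        for (String v : values) {
            all.addAll(tokenize(v));
        }
        return all;
    }

    public static void main(String[] args) {
        List<String> twoFields = indexValues("quick brown fox", "lazy dog");
        List<String> oneField = indexValues("quick brown fox lazy dog");
        // Both routes produce the same token sequence in this model.
        System.out.println(twoFields.equals(oneField));
    }
}
```

In real Lucene the analyzer can insert a position gap between the values of a multi-valued field (see `Analyzer.getPositionIncrementGap`), which matters for phrase queries across the boundary; the model above ignores positions entirely.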

As for eliminating duplicate terms, this is not really necessary. Repeated tokens only raise the term frequency of that term within the document, which Lucene uses to score the document higher. If you don't want that, you can define your own Similarity that implements tf as a constant 1.
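A small plain-Java sketch of why a constant tf makes duplicates irrelevant. This is a model, not the Lucene Similarity API: `termFrequencies`, `sqrtTf` and `constantTf` are names made up for illustration, and the square root is used as the frequency-sensitive factor because that is, as far as I know, what Lucene's classic TF-IDF similarity applies.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ConstantTfModel {
    // Counts how often each token occurs in one document's token sequence.
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> freq = new HashMap<>();
        for (String t : tokens) {
            freq.merge(t, 1, Integer::sum);
        }
        return freq;
    }

    // Frequency-sensitive factor: grows with the number of occurrences.
    static double sqrtTf(int freq) {
        return Math.sqrt(freq);
    }

    // Constant factor: a term that occurs at all contributes exactly 1.
    static double constantTf(int freq) {
        return freq > 0 ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq =
                termFrequencies(Arrays.asList("lucene", "search", "lucene", "lucene", "lucene"));
        int repeated = freq.get("lucene"); // occurs 4 times
        int single = freq.get("search");   // occurs once
        // With sqrtTf the repeated term weighs more than the single one.
        System.out.println(sqrtTf(repeated) + " vs " + sqrtTf(single));
        // With constantTf both terms weigh the same, as if duplicates were removed.
        System.out.println(constantTf(repeated) + " vs " + constantTf(single));
    }
}
```

With the constant factor, a document containing a term four times scores the same on that factor as one containing it once, which is the effect a duplicate-removing filter would have on scoring.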

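Or, if you need to disable term frequency for a single field only, omit term frequencies on that Field through its index options (IndexOptions.DOCS_ONLY in older releases, IndexOptions.DOCS in newer ones). Note that instantiating the Field with Field.TermVector.NO only turns off term vectors; it does not change the term frequency used for scoring.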
