
Difference in scoring between multivalued field and tokenized field

For example, I have several tags per document. I can:

  • index them as a single text string, splitting on spaces with WhitespaceTokenizer (example: "tag1 tag2 tag3")
  • add them separately to a single field name multiple times using KeywordAnalyzer (example: doc.addField("tags", "tag1"); doc.addField("tags", "tag2"); doc.addField("tags", "tag3");)

Both approaches will work. The question is: how will scoring differ between these two types of indexing (i.e. field normalization factor, tf/idf counts, field length calculation, slope factor, etc.)?

Lucene will concatenate all the values of a multivalued field behind the scenes anyway, so it won't be much different from your first case, if at all. If you use tags only as filters (give me all docs tagged with tag2), then you definitely won't see any difference.
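To make the concatenation point a bit more concrete, here is a toy model in plain Java (not Lucene code) of the terms each approach would produce, together with the classic TF-IDF length norm of 1/sqrt(number of terms). The class and method names are invented for illustration, and the norm formula assumes Lucene's classic similarity; other similarities normalize length differently, although they also start from the field's token count.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TagIndexingSketch {

    // Approach 1: one string split on whitespace
    // (a toy stand-in for a whitespace tokenizer).
    public static List<String> whitespaceTokens(String value) {
        return Arrays.asList(value.trim().split("\\s+"));
    }

    // Approach 2: every value is added separately and kept whole
    // (a toy stand-in for a keyword analyzer on a multivalued field).
    public static List<String> keywordTokens(List<String> values) {
        return new ArrayList<>(values);
    }

    // Classic TF-IDF style length norm (assumption: classic similarity).
    public static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        List<String> single = whitespaceTokens("tag1 tag2 tag3");
        List<String> multi = keywordTokens(Arrays.asList("tag1", "tag2", "tag3"));

        // Same terms in the same order for single-word tags.
        System.out.println(single.equals(multi));
        // Same number of terms, hence the same length norm.
        System.out.println(lengthNorm(single.size()) == lengthNorm(multi.size()));
    }
}
```

With single-word tags both lists hold the same three terms, so term frequency, document frequency and field length come out the same in this model. That stops being true once a tag contains a space, because approach 1 then splits it into several terms.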

I would think the multi-value would be more accurate.

Imagine a tokenized string "spider web developer"

vs

a multi-value field with the values "spider" and "web developer".

A search for "web developer" would match both fields, but the match against the multi-value field could be seen as more accurate.
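A rough sketch of why the multi-value match can be seen as more accurate: in the toy Java model below (plain Java, not Lucene code), a phrase matches the tokenized string whenever its words appear consecutively, even across what used to be a tag boundary, while the keyword-style multi-value field only matches a whole value. The class and method names are hypothetical, and the model assumes the multi-value field is queried with the exact tag text.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PhraseBoundarySketch {

    // Tokenized single string: the phrase matches if its words
    // appear consecutively anywhere in the token list.
    public static boolean phraseMatchesTokenized(String fieldText, String phrase) {
        List<String> tokens = Arrays.asList(fieldText.trim().split("\\s+"));
        List<String> words = Arrays.asList(phrase.trim().split("\\s+"));
        return Collections.indexOfSubList(tokens, words) >= 0;
    }

    // Multi-value keyword field: only an exact whole-value match counts.
    public static boolean matchesKeywordValues(List<String> values, String query) {
        return values.contains(query);
    }

    public static void main(String[] args) {
        String tokenized = "spider web developer";
        List<String> multiValue = Arrays.asList("spider", "web developer");

        // The real tag is found by both approaches.
        System.out.println(phraseMatchesTokenized(tokenized, "web developer")); // true
        System.out.println(matchesKeywordValues(multiValue, "web developer"));  // true

        // "spider web" was never a tag, yet the tokenized string matches it.
        System.out.println(phraseMatchesTokenized(tokenized, "spider web"));    // true
        System.out.println(matchesKeywordValues(multiValue, "spider web"));     // false
    }
}
```

The last two lines are the interesting ones: the tokenized string reports a hit for "spider web", which is a false positive if tags are meant to be matched as whole units.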
