
Hybrid search and indexing: words and token metadata in Solr

Original post: 2014-07-04 23:48:32 | 8 1 | solr, metadata, token

I am building a set of plugins for Solr to enable a "hybrid" search that matches either words or token-level (not document-level!) metadata, namely specific ID numbers. The same word may have different ID numbers in different contexts; the IDs are generated at indexing time by an external application. For example, "run" may have ID 12345 in one case and 54321 in another, depending on the context. The ID numbers should carry more weight in the search. (They will be supplied in the query at search time by the same external application.)

I read about custom fields for documents and wondered whether we could store a blob with these IDs there, but I am not sure how to include it in the search.

Or should I just pretend these IDs are "synonyms" (maybe surrounding them with some kind of unique marking, like [:12345:]) and use the synonym factory tokenizers?

I am new to Solr, but I have read the relevant documentation, so I think I understand how it all works conceptually. Performance does not matter at this stage; this is a PoC. This looks somewhat similar to "Search different tokens on different fields in Solr", but not exactly. Oh, and I want to tokenise the text myself, too, but that's not an issue.

EDIT: [removed the bit about payloads; it is irrelevant here. Sorry about the confusion]

1 Answer

Unless I've misunderstood, since you've already generated the magic tokens, the only requirement is to check whether the magic token value is present in a field and, if it is, to score it higher.

Index the magic token values into one field and the textual values into another. Use boosting to prioritise matches in the magic token field over matches in the textual values field. From your description, the magic token field can probably be an integer field based on tint.
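As a rough illustration of the two-field layout, here is a minimal Python sketch that splits pre-tokenised (word, ID) pairs into one document with a text field and an ID field. The field names id, text and token, the helper build_solr_doc, and the assumption that token is a multi-valued integer field are all hypothetical choices for this example, not something the question specifies.

```python
import json


def build_solr_doc(doc_id, annotated_tokens):
    """Split (word, token_id) pairs into a text field and a token-ID field.

    Assumption: the schema has fields "id", "text" and "token", where
    "token" is a multi-valued integer field. These names are illustrative.
    """
    words = [word for word, _ in annotated_tokens]
    # Words without an ID (None) contribute only to the text field.
    ids = [token_id for _, token_id in annotated_tokens if token_id is not None]
    return {"id": doc_id, "text": " ".join(words), "token": ids}


doc = build_solr_doc("doc-1", [("run", 12345), ("fast", 32145), ("today", None)])

# A JSON array of documents is the shape Solr's JSON update handler accepts.
payload = json.dumps([doc])
print(payload)
```

The payload could then be posted to the core's update handler; the URL and commit settings depend on your installation.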

When searching, you can generate the search string as:

q=(token:12345^5 OR text:run) AND (token:32145^5 OR text:fast)

The ^5 boost should weight a match in the token field five times as heavily as a match in the text field. If you don't care whether 12345 also matches in the text field, and the DisMax or Extended DisMax query parser is selected (for example with defType=edismax, since the default parser ignores qf), you can use:

q=12345 run 32145 fast&qf=text token^5

You might have to tweak mm (the minimum-should-match parameter) to get the required number of hits, depending on what your application needs.
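Since the external application already knows each word and its ID at search time, both query forms above could be generated from the same list of pairs. The sketch below is one hedged way to do that in Python; the function names, the mm value of "50%" and the use of defType=edismax are assumptions for illustration, and the words are assumed to be plain alphanumerics that need no query-syntax escaping.

```python
from urllib.parse import urlencode


def boolean_query(pairs, boost=5):
    """Explicit form: (token:ID^boost OR text:word) AND ..."""
    clauses = [
        f"(token:{token_id}^{boost} OR text:{word})" for word, token_id in pairs
    ]
    return " AND ".join(clauses)


def edismax_params(pairs, boost=5, mm="50%"):
    """DisMax-style form: bare terms in q, per-field boosts in qf.

    Assumption: the eDisMax parser is chosen via defType, and mm="50%"
    is only a placeholder to tune for the application.
    """
    terms = []
    for word, token_id in pairs:
        terms += [str(token_id), word]
    return {
        "defType": "edismax",
        "q": " ".join(terms),
        "qf": f"text token^{boost}",
        "mm": mm,
    }


pairs = [("run", 12345), ("fast", 32145)]
print(boolean_query(pairs))
# → (token:12345^5 OR text:run) AND (token:32145^5 OR text:fast)
print(urlencode(edismax_params(pairs)))  # URL-encoded query string for the request
```

The explicit form keeps each ID tied to its own word, while the DisMax-style form lets any term match either field, which is the trade-off the answer describes.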