
Using a common query convention for multiple search fields

Imagine that I am building a hashtag search. My main indexed type is called Post, which has a list of Hashtag items marked as IndexedEmbedded. Separately, every post has a list of Comment objects, each of which, again, contains a list of Hashtag objects.

On the search side, I am using a MultiFieldQueryParser, to which I pass a long list of possible search fields, including some nested fields like:

hashTags.value and comments.hashTags.value

Now, the interesting thing happens when I want to search for something, say #architecture. I figure out where the hashtags are, so the simplest logical thing to do would be to convert a query of the form #architecture into one of the form hashTags.value:architecture OR comments.hashTags.value:architecture. Although possible, this is very inflexible. What if I come up with yet another field that contains hashtags? I'd have to include that too.
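To make the rewriting concrete, here is a minimal sketch in plain Java with no Lucene dependency. The class name HashtagQueryRewriter, the rewrite method, and the HASHTAG_FIELDS list are illustrative assumptions, not part of Hibernate Search or Lucene; the point is only that every new hashtag field has to be added to that list by hand.

```java
import java.util.List;
import java.util.stream.Collectors;

final class HashtagQueryRewriter {
    // Hypothetical central list of every index field that holds hashtags.
    static final List<String> HASHTAG_FIELDS =
            List.of("hashTags.value", "comments.hashTags.value");

    // Turns "#architecture" into a query string that ORs the tag across all given fields.
    static String rewrite(String userTerm, List<String> fields) {
        if (!userTerm.startsWith("#") || userTerm.length() < 2) {
            return userTerm; // not a hashtag: leave it for the normal parser
        }
        String tag = userTerm.substring(1);
        return fields.stream()
                .map(field -> field + ":" + tag)
                .collect(Collectors.joining(" OR ", "(", ")"));
    }

    public static void main(String[] args) {
        System.out.println(rewrite("#architecture", HASHTAG_FIELDS));
    }
}
```

The resulting string could then be handed to whatever query parser is already in use.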

Is there a general way to do this?

PS: Please keep in mind that the root type I am searching for is Post, because this is the kind of result I'd like to get back.

Hashtags are keywords, and you should let Lucene handle the text analysis to extract the hashtags from your main text and store them in a custom field.

You can do this very easily with Hibernate Search by indexing your text as two different @Field definitions (using the @Fields annotation). You could have one field named comments and another named commentsHashtags.

You then apply a custom Analyzer to commentsHashtags which does some standard tokenization and discards any term not starting with #; you can define one easily by taking the standard tokenizer and applying a custom filter.
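As a rough illustration of what such an analysis chain would emit, here is a plain-Java sketch that mimics the behaviour without using any Lucene classes. HashtagOnlyFilter and its hashtags method are hypothetical names, and whitespace splitting stands in for the real tokenizer. One thing worth verifying in a real setup: if the chosen tokenizer strips punctuation such as # before the filter runs, the filter will have nothing to match on, so a whitespace-based tokenizer may be the safer choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class HashtagOnlyFilter {
    // Returns the lower-cased hashtags found in the text, without the leading '#'.
    static List<String> hashtags(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            // Strip trailing punctuation such as "," or "." from the token.
            String cleaned = token.replaceAll("[^\\p{L}\\p{N}_#]+$", "");
            // Discard every term that does not start with '#'.
            if (cleaned.length() > 1 && cleaned.startsWith("#")) {
                kept.add(cleaned.substring(1).toLowerCase(Locale.ROOT));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(hashtags("Loving the new #Architecture, and #design. No tags here"));
    }
}
```

Only the terms produced this way would end up in commentsHashtags, while the comments field keeps the full text.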

When you run a query, you don't have to write custom code to look for hashtags in the query input. Let the input be processed by the same Analyzer (which is the default anyway) and target both fields. You can even boost the hashtags field more if that makes sense.

With this solution you

  • take advantage of the high efficiency of Search's text analysis
  • avoid entities and tables in the database containing the hashtags, which are useless overhead
  • avoid messing with free text extraction

It gets you another strong win: you can then open a raw IndexReader and load the term vector from commentsHashtags to get both a list of all used tags and metrics about them. That is handy for some data mining, or just for visualizing a tag cloud.
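The kind of metric meant here can be sketched in plain Java as a per-tag document count. This does not read a Lucene index; TagCloud, its counts method, and the input shape (one list of tags per document) are assumptions made for illustration only.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TagCloud {
    // Counts how often each tag occurs across all documents, sorted by tag name.
    static Map<String, Integer> counts(List<List<String>> tagsPerDocument) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> tags : tagsPerDocument) {
            for (String tag : tags) {
                counts.merge(tag, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(counts(List.of(
                List.of("architecture", "design"),
                List.of("architecture"))));
    }
}
```

A tag cloud would then scale each tag by its count.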

Instead of treating the fields as different and the top-level document as Post, why not store both Posts and Comments as Lucene documents? That way, you can just have a single field called "hashtags" that you search. You should also have a field called "type" or something similar to differentiate between comments and posts.

Search results may be either comments or posts. You can filter by type if users want to search only posts or only comments. Or you can show them differently in your UI.

If you want to add another concept that also uses hashtags (like ... I dunno ... splanks or whatever silly name we all give to Internet communications in the future), then you can add it alongside the existing Post and Comment documents simply by indexing your new type with a "hashtags" field. You'll have to do plenty of work to add the splanks anyway, so adding a handler for that new type of search result shouldn't be too much of an inconvenience.
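A small in-memory sketch of this layout, in plain Java rather than Lucene: every document carries a "type" and a "hashtags" field, and a search looks at the single hashtags field with an optional type filter. UnifiedIndex, Doc, and search are hypothetical names; a real index would express the same thing as a term query on hashtags plus a filter on type.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class UnifiedIndex {
    // One flat document per post, comment, or any future type.
    static final class Doc {
        final String type;
        final long id;
        final Set<String> hashtags;

        Doc(String type, long id, Set<String> hashtags) {
            this.type = type;
            this.id = id;
            this.hashtags = hashtags;
        }
    }

    // Matches on the single "hashtags" field; typeOrNull == null means "any type".
    static List<Doc> search(List<Doc> index, String tag, String typeOrNull) {
        return index.stream()
                .filter(d -> d.hashtags.contains(tag))
                .filter(d -> typeOrNull == null || d.type.equals(typeOrNull))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Doc> index = List.of(
                new Doc("post", 1, Set.of("architecture")),
                new Doc("comment", 2, Set.of("architecture", "design")),
                new Doc("splank", 3, Set.of("design")));
        System.out.println(search(index, "architecture", null).size());
        System.out.println(search(index, "design", "splank").size());
    }
}
```

Note that the "splank" documents needed no change to the search method, only a new value in the type field. The trade-off is that results are no longer always Posts, so a comment hit has to be mapped back to its post if that is what the UI needs.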
