
Auto-generate Tags with ElasticSearch (or Thinking Sphinx / pg-search)

I've thought about this a bit (and looked at every "auto-generate tags for content" type post on StackOverflow).

I have an Article (body:string) with multiple Tags (joined through Taggings).

Right now in the app, in order to suggest tags for an Article, pg-search searches the body text of other Articles for the text in this Article's body (stemming the words first) and suggests tags based on those related articles' tags. Of course, this only works if similar articles have already been tagged, and as more articles are tagged in the database, there may be better tags to use.
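The suggestion step described above amounts to counting tags across the related articles. A minimal sketch in Python, assuming the related articles have already been retrieved by the search; `suggest_tags` and the sample data are hypothetical names used only for illustration, not code from the app:

```python
from collections import Counter


def suggest_tags(related_articles, limit=5):
    """Rank tags by how many related articles carry them.

    related_articles: list of tag lists, one per related article
    (hypothetical input; in the app this would come from the search results).
    """
    counts = Counter()
    for tags in related_articles:
        # Count each tag once per article, even if it is listed twice.
        counts.update(set(tags))
    # Sort by descending count, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [tag for tag, _ in ranked[:limit]]


related = [["ruby", "rails"], ["rails", "search"], ["rails", "ruby"]]
print(suggest_tags(related, limit=2))
```

This ranking can only ever return tags that already exist on related articles, which is the limitation the question is about.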

Is there a smarter way, using say ElasticSearch, to automatically find the popular words in other Articles' body text (unique and stemmed) and auto-generate a list of tags from them?

If I were to do this myself, are there any examples to follow for doing this efficiently?

You can use the more-like-this query to find similar articles, and a terms facet to find the popular tags among them:

curl -XGET 'http://127.0.0.1:9200/myindex/article/_search?pretty=1'  -d '
{
   "query" : {
      "more_like_this_field" : {
         "body" : {
            "min_doc_freq" : 1,
            "like_text" : "BODY OF THE NEW ARTICLE",
            "min_term_freq" : 1,
            "percent_terms_to_match" : 0.2
         }
      }
   },
   "facets" : {
      "tags" : {
         "terms" : {
            "field" : "tags"
         }
      }
   }
}
'

Depending on the size of your corpus, you may need to play around with the parameters to more_like_this_field to get the best matches.
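The request above uses the syntax of early Elasticsearch releases. As far as I know, `more_like_this_field`, `percent_terms_to_match`, and `facets` were removed in Elasticsearch 2.0 in favour of the `more_like_this` query, `minimum_should_match`, and aggregations. A rough Python sketch of an equivalent request body for newer versions follows. It assumes `tags` is indexed as an untokenized (keyword) field, and the helper names `build_tag_suggestion_query` and `extract_tags` are hypothetical. It only builds and parses JSON and does not contact a cluster.

```python
import json


def build_tag_suggestion_query(body_text, max_tags=10):
    """Build a search body: similar articles plus their most common tags."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits themselves
        "query": {
            "more_like_this": {
                "fields": ["body"],
                "like": body_text,
                "min_term_freq": 1,
                "min_doc_freq": 1,
                # Rough counterpart of percent_terms_to_match: 0.2
                "minimum_should_match": "20%",
            }
        },
        "aggs": {
            "tags": {"terms": {"field": "tags", "size": max_tags}}
        },
    }


def extract_tags(response):
    """Return (tag, doc_count) pairs from a terms aggregation response."""
    buckets = response.get("aggregations", {}).get("tags", {}).get("buckets", [])
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]


# Print the JSON that would be sent as the search request body.
print(json.dumps(build_tag_suggestion_query("BODY OF THE NEW ARTICLE"), indent=2))
```

The buckets come back ordered by document count, so the first few entries are the tags most common among the similar articles.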

The best way to do this is to use the Elasticsearch Percolator API. Check out this answer:

Elasticsearch - use a "tags" index to discover all tags in a given string
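The idea behind percolation is to invert the usual search: each known tag is stored as a query, and the new article is submitted as a document to see which stored queries match. A rough Python sketch of the request bodies, assuming a recent Elasticsearch version where this is done with a `percolator` field type and the `percolate` query. The index layout, the field names (`query`, `tag`, `body`), and the helper names are all hypothetical, and the linked answer may structure things differently.

```python
def tags_index_mapping():
    """Mapping for a hypothetical index that stores one query per tag."""
    return {
        "mappings": {
            "properties": {
                "query": {"type": "percolator"},  # the stored query
                "tag": {"type": "keyword"},       # the tag it represents
                "body": {"type": "text"},         # field the stored queries target
            }
        }
    }


def tag_query_document(tag):
    """Document registering a tag: matches articles whose body contains it."""
    return {"tag": tag, "query": {"match_phrase": {"body": tag}}}


def percolate_request(article_body):
    """Search body asking which stored tag queries match the new article."""
    return {
        "query": {
            "percolate": {
                "field": "query",
                "document": {"body": article_body},
            }
        }
    }


print(tag_query_document("full text search"))
print(percolate_request("BODY OF THE NEW ARTICLE"))
```

Each hit returned for the percolate request would be one of the stored tag documents, so the suggested tags are the `tag` values of the hits. Unlike the more-like-this approach, this only finds tags whose text appears in the article body.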
