Elasticsearch: Learning from clicks (Search result ranking)

Original post 2014-11-03 12:28:43 1 3 java/ search/ elasticsearch/ machine-learning/ relevance

I have read the chapter "Learning from clicks" in the book Programming Collective Intelligence and liked the idea: the search engine there learns which results the user clicked on and uses this information to improve the ranking of results.

I think it would improve the quality of the search ranking a lot in my Java/Elasticsearch application if I could learn from the user clicks.

In the book, they build a multilayer perceptron (MLP) network so that the learned information can be used even for new search phrases. They use Python with a SQL database to calculate the search ranking.

Has anybody already implemented something like this with Elasticsearch, or does anyone know of an example project? It would be great if I could manage the click information directly in Elasticsearch without needing an extra SQL database.

3 answers

In the field of Information Retrieval (the general academic field of search and recommendations), this is more generally known as Learning to Rank. Whether it's clicks, conversions, or other ways of sussing out what's a "good" or "bad" result for a keyword search, learning to rank uses either a classifier or a regression process to learn which features of the query and document correlate with relevance.

Clicks?

For clicks specifically, there are reasons to be skeptical that optimizing for clicks is ideal. There's a paper from Microsoft Research I'm trying to dig up that claims that, in their case, clicks are only 45% correlated with relevance. Click+dwell is often a more useful general-purpose indicator of relevance.
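As a rough illustration of the click+dwell idea, here is a minimal Python sketch that turns a click log into a per-(query, document) "satisfied click" rate. The event fields (`query`, `doc_id`, `clicked`, `dwell_seconds`) and the 30-second threshold are assumptions made up for this example, not values from any paper or product; a sensible threshold depends on your content.

```python
from collections import defaultdict

# Hypothetical threshold: a click only counts as "satisfied" if the user
# stayed on the result for at least this many seconds.
MIN_DWELL_SECONDS = 30.0


def satisfied_click_rate(events, min_dwell=MIN_DWELL_SECONDS):
    """Return {(query, doc_id): share of impressions with a long-dwell click}.

    Each event is a dict with the keys "query", "doc_id", "clicked" (bool)
    and, for clicked results, "dwell_seconds" (float).
    """
    shown = defaultdict(int)
    satisfied = defaultdict(int)
    for event in events:
        key = (event["query"], event["doc_id"])
        shown[key] += 1
        if event["clicked"] and event.get("dwell_seconds", 0.0) >= min_dwell:
            satisfied[key] += 1
    return {key: satisfied[key] / shown[key] for key in shown}


# Illustrative log: doc "1" was shown three times but only one click "stuck".
events = [
    {"query": "rambo", "doc_id": "1", "clicked": True, "dwell_seconds": 45.0},
    {"query": "rambo", "doc_id": "1", "clicked": True, "dwell_seconds": 3.0},
    {"query": "rambo", "doc_id": "1", "clicked": False},
    {"query": "rambo", "doc_id": "2", "clicked": True, "dwell_seconds": 60.0},
]
rates = satisfied_click_rate(events)
```

A plain click count would rank doc "1" above doc "2" here (two clicks vs. one); counting only long-dwell clicks per impression reverses that.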

There's also the risk of self-reinforcing bias in search, as I talk about in this blog article. There's a chance that if you're already showing a user mediocre results, and they keep clicking on those mediocre results, you'll end up reinforcing search to keep showing users mediocre results.

Beyond clicks, there are often domain-specific considerations for what you should measure. For example, classically in e-commerce, conversions matter. Perhaps a search result click that led to a purchase should count more. Netflix famously tries to suss out what it means when you watch a movie for 5 minutes and go back to the menu vs. watching for 30 minutes and exiting. Some search use cases are informational: clicking may mean something different when you're researching and clicking many search results vs. when you're shopping for a single item.

So, sorry to say, it's not a silver bullet. I've heard of many successful and unsuccessful attempts at doing Learning to Rank, and it mostly boils down to how successful you are at measuring what your users consider relevant. The difficulty of this problem surprises a lot of people.

For Elasticsearch...

For Elasticsearch specifically, there's this plugin (disclaimer: I'm the author), which is documented here. Once you've figured out how to "grade" a document for a specific query (whether it's clicks or something more), you can train a model that can then be fed into Elasticsearch via this plugin for your ranking.
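To make the "grade, then train" step concrete: trainers such as RankLib and SVMrank read judgments as text lines of the form `grade qid:<query id> 1:<feature> 2:<feature> ... # <comment>`. The Python sketch below only formats such lines. Which trainer you use, and the feature values (here an invented text-match score and click rate), are assumptions for illustration and are not specified by the answer above.

```python
def to_letor_line(grade, query_id, features, doc_id):
    """Format one judgment as 'grade qid:<id> 1:<f1> 2:<f2> ... # <doc_id>'."""
    # Feature indices are 1-based in this format.
    feats = " ".join(
        f"{index}:{value:g}" for index, value in enumerate(features, start=1)
    )
    return f"{grade} qid:{query_id} {feats} # {doc_id}"


# Hypothetical judgments for one query: (grade, [text score, click rate], doc id).
judgments = [
    (3, [12.5, 0.0], "doc_42"),
    (0, [4.25, 0.5], "doc_7"),
]
lines = [to_letor_line(grade, 1, feats, doc) for grade, feats, doc in judgments]
training_file_text = "\n".join(lines)
```

All lines sharing the same `qid` are treated as one ranked list, so the grades only need to be comparable within a query.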

What you would need to do is store information about the clicks in a field inside the Elasticsearch index. Every click would result in an update of a document. Since an update action is actually a delete and insert (see the Update API), you need to make sure your document text is stored, not only indexed. You can then use a Function Score Query to build a score function reflecting the value stored in the index.
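A minimal sketch of the two request bodies this approach involves, built as plain Python dicts so they can be sent with any HTTP client. The field names `clicks` and `body` are made up for the example, every document is assumed to be indexed with `clicks: 0`, and the script syntax shown is the Painless form used by recent Elasticsearch versions; older versions differ, so check the documentation for your release.

```python
import json


def click_update_body(increment=1):
    """Update API body that bumps a numeric `clicks` field via a script.

    Assumes the document already has `clicks` set (for example to 0).
    """
    return {
        "script": {
            "source": "ctx._source.clicks += params.n",
            "lang": "painless",
            "params": {"n": increment},
        }
    }


def click_boosted_query(text, field="body"):
    """Search body: text match plus a dampened bonus from the click count."""
    return {
        "query": {
            "function_score": {
                "query": {"match": {field: text}},
                "field_value_factor": {
                    "field": "clicks",
                    "modifier": "log1p",  # log(1 + clicks): diminishing returns
                    "missing": 0,         # treat documents without the field as 0
                },
                # Add the bonus to the text score instead of multiplying,
                # so documents with zero clicks keep their text score.
                "boost_mode": "sum",
            }
        }
    }


update_json = json.dumps(click_update_body())
search_json = json.dumps(click_boosted_query("rambo"))
```

Note that a single `clicks` counter boosts documents by overall popularity regardless of the query; learning per-query preferences, as in the book, would need clicks keyed by query as well.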

Alternatively, you could store the information in a separate database and use a script function inside the score function to access the database. I wouldn't suggest this solution due to performance issues.

I get the point of your question. You want to build a learning-to-rank model within the Elasticsearch framework. The relevance of each doc to the query is computed online. You want to combine the query and the doc to compute the score, so a custom function to compute _score is needed. I am new to Elasticsearch, and I'm looking for a way to solve the problem.

Lucene is a more general search engine that lets you define your own scorer to compute relevance, and I have developed several applications on it before.

This article gives a brief overview of customizing a scorer. However, for Elasticsearch, I haven't found related articles. You're welcome to discuss your progress on Elasticsearch with me.