How does Elasticsearch search documents? How to customize the preprocessing pipeline and scoring functions in ES?

Original post: 2021-11-03 08:15:21 · Tags: python, elasticsearch

I want to use Elasticsearch on a custom corpus. I have installed Elasticsearch 7.5.1 and I do all my work in Python using the official client.

I have a few questions:

How do I customize the preprocessing pipeline? For example, I want to use a BertTokenizer to convert strings to tokens instead of n-grams.
How do I customize the scoring function of each document with respect to the query? For example, I want to compare the effects of TF-IDF and BM25, or even use neural models for scoring.

If there is a good tutorial in Python, please share it with me. Thanks in advance.

1 Answer

You can customize the similarity function when creating an index. See the Similarity Module section of the documentation. The OpenSource Connections site has a good article that compares classical TF-IDF with BM25.
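As a rough sketch of what that can look like from Python, the snippet below builds an index body with two text fields, one scored by BM25 with explicit `k1` and `b` and one scored by a TF-IDF variant. The TF-IDF script follows the scripted similarity example in the Similarity Module documentation; as far as I know, the old built-in `classic` similarity is not available for indices created on 7.x, which is why a scripted similarity is assumed here. The names `my_bm25`, `my_tfidf`, `text_bm25`, `text_tfidf` and `my_corpus` are made up for illustration.

```python
# TF-IDF expressed as a Painless script, adapted from the scripted
# similarity example in the Elasticsearch Similarity Module docs.
TFIDF_SCRIPT = (
    "double tf = Math.sqrt(doc.freq); "
    "double idf = Math.log((field.docCount + 1.0) / (term.docFreq + 1.0)) + 1.0; "
    "double norm = 1 / Math.sqrt(doc.length); "
    "return query.boost * tf * idf * norm;"
)


def build_index_body(k1=1.2, b=0.75):
    """Return an index body with a BM25 field and a scripted TF-IDF field."""
    return {
        "settings": {
            "index": {
                "similarity": {
                    # Hypothetical similarity names, referenced by the mappings below.
                    "my_bm25": {"type": "BM25", "k1": k1, "b": b},
                    "my_tfidf": {
                        "type": "scripted",
                        "script": {"source": TFIDF_SCRIPT},
                    },
                }
            }
        },
        "mappings": {
            "properties": {
                "text_bm25": {"type": "text", "similarity": "my_bm25"},
                "text_tfidf": {"type": "text", "similarity": "my_tfidf"},
            }
        },
    }


def create_index(es, name="my_corpus"):
    """Create the index; `es` is assumed to be an elasticsearch.Elasticsearch 7.x client."""
    return es.indices.create(index=name, body=build_index_body())
```

Indexing the same text into both fields and running the same `match` query against each one gives a side-by-side comparison of the two rankings.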

It sounds like you want to use vector fields for scoring. There is a good article on the Elastic blog that explains how to do that. Be aware that, as of now, Elasticsearch uses vector fields only for scoring, not for retrieval. If you want to use vector fields for retrieval, you have to use a plugin or the OpenSearch fork, or wait for version 8.
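A minimal sketch of vector-based scoring, assuming the documents have a `dense_vector` field (called `embedding` here, a made-up name) and that the query embedding is computed in Python beforehand. An ordinary `match` query retrieves the candidates and a `script_score` query rescores them with `cosineSimilarity`. The `doc['field']` form of the script is what I believe 7.5 expects; later 7.x releases take the field name as a plain string instead, so check the documentation for your version.

```python
def build_vector_scoring_query(query_text, query_vector,
                               text_field="text", vector_field="embedding"):
    """Build a search body: retrieve by text match, score by cosine similarity.

    `text_field` and `vector_field` are hypothetical field names; the vector
    field is assumed to be mapped as {"type": "dense_vector", "dims": N}.
    """
    return {
        "query": {
            "script_score": {
                # Retrieval still happens through the inverted index.
                "query": {"match": {text_field: query_text}},
                "script": {
                    # Adding 1.0 keeps the score non-negative, which
                    # Elasticsearch requires for script scores.
                    "source": (
                        "cosineSimilarity(params.query_vector, "
                        f"doc['{vector_field}']) + 1.0"
                    ),
                    "params": {"query_vector": list(query_vector)},
                },
            }
        }
    }
```

With the official client this body would be passed as `es.search(index=..., body=build_vector_scoring_query(...))`.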

In my opinion, using ANN in real time during search is too slow and expensive, and I have yet to see it improve relevancy for normal search requests.

I would preprocess your documents in your own Python environment before indexing and not use any Elasticsearch pipelines or plugins. It is easier to debug and iterate outside of Elasticsearch.
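One way this could look, as a sketch: tokenize each document in Python, store the tokens as a single space-joined string, and map that field with the built-in `whitespace` analyzer so Elasticsearch does not tokenize it again. `simple_tokenize` below is only a stand-in; with Hugging Face `transformers` you could pass a BertTokenizer's `tokenize` method instead. The field names `raw` and `tokens` are assumptions, and queries would need to go through the same tokenizer before being sent.

```python
def simple_tokenize(text):
    """Stand-in tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def to_bulk_actions(docs, index_name, tokenize=simple_tokenize):
    """Yield bulk-index actions for (doc_id, text) pairs.

    The `tokens` field holds pre-tokenized text joined by spaces and is
    assumed to be mapped as {"type": "text", "analyzer": "whitespace"}.
    """
    for doc_id, text in docs:
        yield {
            "_index": index_name,
            "_id": doc_id,
            "_source": {
                "raw": text,  # original text, kept for display
                "tokens": " ".join(tokenize(text)),
            },
        }
```

The generator can be handed to `elasticsearch.helpers.bulk(es, to_bulk_actions(docs, "my_corpus"))`, and the tokenizer can be tested and swapped without touching the cluster.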

You could also take a look at the Haystack project. It might already have a lot of the functionality you are looking for built in.