
ANN Performance Using Separate Namespaces

Original post: 2022-06-09 20:39:17 · Tag: vespa

I am trying to perform ANN (approximate nearest neighbor) search, but my data is split into partitions, or "tenants." Searches are always restricted to a single tenant, which represents a small percentage of the total documents.

I first tried implementing this using a filter on a tenant string attribute. However, I encountered this piece of documentation, which suggests the performance will be poor:

There is a small problem here however. If the eligibility list is small in relation to the number of items in the graph, skipping occurs with a high probability. This means that the algorithm needs to consider an exponentially increasing number of candidates, slowing down the search significantly. To solve this, Vespa.ai switches over to a brute-force search when this occurs. The result is an efficient ANN search when combined with filters.

What's the best way to solve my problem? Will partitioning my data into separate namespaces trigger the creation of a separate HNSW graph per namespace?

1 Answer

Performance will be fine; the query planner will simply choose not to use the ANN index for these queries. You'll find lots of details on this topic, including how to tune this, in this blog post: https://blog.vespa.ai/constrained-approximate-nearest-neighbor-search/
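As a rough illustration of what such a filtered query could look like, here is a minimal Python sketch that builds a query request body. The field names `tenant` and `embedding`, the query tensor name `q`, and the rank profile name `ann` are assumptions made for this example and do not come from the question. The `approximate: false` annotation is included only to show how exact search can be requested explicitly; by default the planner makes that choice itself.

```python
import json


def build_tenant_ann_query(tenant, query_vector, target_hits=10, approximate=True):
    """Build a query body: nearestNeighbor restricted to a single tenant.

    Field, tensor, and rank profile names are hypothetical placeholders.
    The tenant value is inserted as-is; real code should escape it.
    """
    annotations = f"targetHits: {target_hits}"
    if not approximate:
        # Ask for exact (brute-force) search instead of the HNSW index.
        annotations += ", approximate: false"
    yql = (
        "select * from sources * where "
        f"{{{annotations}}}nearestNeighbor(embedding, q) "
        f'and tenant contains "{tenant}"'
    )
    return {
        "yql": yql,
        "input.query(q)": list(query_vector),  # assumed query tensor name
        "ranking": "ann",  # assumed rank profile name
    }


body = build_tenant_ann_query("tenant-42", [0.1, 0.2, 0.3], approximate=False)
print(json.dumps(body, indent=2))
```

The resulting dictionary would be sent as the JSON body of a search request; the endpoint and schema depend on your application.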

If all your queries target a single tenant that holds a small percentage of the total documents, I don't think you necessarily need to create an HNSW index at all, but this depends on the absolute numbers and on how large the largest "small percentage" is.
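To see why an index may be unnecessary, consider what exact search over one tenant amounts to: filter first, then compute the distance to every remaining document. The sketch below is a conceptual illustration in plain Python, not Vespa's implementation; the document layout and names are made up for the example.

```python
import heapq
import math


def exact_nearest(docs, tenant, query, k=3):
    """Exact k-nearest-neighbor search over one tenant's documents only."""
    candidates = (
        (math.dist(doc["embedding"], query), doc["id"])
        for doc in docs
        if doc["tenant"] == tenant  # the filter shrinks the candidate set first
    )
    # Keep the k smallest (distance, id) pairs.
    return heapq.nsmallest(k, candidates)


docs = [
    {"id": "a", "tenant": "t1", "embedding": [0.0, 0.0]},
    {"id": "b", "tenant": "t1", "embedding": [1.0, 1.0]},
    {"id": "c", "tenant": "t2", "embedding": [0.1, 0.1]},
    {"id": "d", "tenant": "t1", "embedding": [3.0, 4.0]},
]
result = exact_nearest(docs, "t1", [0.0, 0.0], k=2)
print(result)
```

The cost grows linearly with the number of documents in the tenant, not with the total corpus, which is why the absolute size of the largest tenant is the number that matters.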

(Namespaces are not relevant here - their only purpose is to safely add a string to ids so that you can have multiple sources of ids and still be guaranteed global uniqueness.)
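A small sketch of that role, assuming the usual Vespa document id layout `id:<namespace>:<document-type>::<user-specified>`; the document type `item` and the source names are hypothetical.

```python
def make_document_id(namespace, doc_type, local_id):
    """Compose a document id; the namespace only disambiguates id sources."""
    return f"id:{namespace}:{doc_type}::{local_id}"


# Two upstream sources reuse the same local id "123".
id_from_crm = make_document_id("crm", "item", "123")
id_from_shop = make_document_id("shop", "item", "123")
print(id_from_crm)
print(id_from_shop)
```

Both documents have the same type and are stored and indexed together; only their ids differ.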