Query about Elasticsearch

Original post: 2013-12-12 20:46:06 4 1 database/ elasticsearch/ scalability

I am writing a service that will create and manage user records, more than 100 million of them. For each new user, the service will generate a unique user ID and write it to the database. The database is sharded on the generated unique user ID.

Each user record has several fields. One requirement is that the service be able to check whether a user with a matching field value exists, so those fields are declared as indexes in the database schema.

However, since the database is sharded on the primary key (the unique user ID), I will need to search all shards to find a user record that matches a particular column.
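To illustrate the problem described above, here is a minimal Python sketch of a store sharded by user ID. Everything in it (the shard count, the modulo routing, the field names) is a hypothetical stand-in, not the poster's actual schema. A lookup by ID touches one shard, while a lookup by any other column has to visit every shard.

```python
from typing import Dict, List, Optional, Tuple

NUM_SHARDS = 4  # hypothetical shard count, for illustration only

# Each dict stands in for one database shard: user_id -> record.
shards: List[Dict[int, dict]] = [{} for _ in range(NUM_SHARDS)]


def shard_for(user_id: int) -> int:
    """Route a user ID to a shard (simple modulo routing, an assumption)."""
    return user_id % NUM_SHARDS


def insert_user(user_id: int, record: dict) -> None:
    shards[shard_for(user_id)][user_id] = record


def get_by_id(user_id: int) -> Tuple[Optional[dict], int]:
    """Primary-key lookup: returns (record, number of shards visited)."""
    return shards[shard_for(user_id)].get(user_id), 1


def find_by_field(field: str, value: str) -> Tuple[List[int], int]:
    """Non-key lookup: every shard must be scanned (scatter-gather)."""
    matches: List[int] = []
    visited = 0
    for shard in shards:
        visited += 1
        matches.extend(uid for uid, rec in shard.items() if rec.get(field) == value)
    return sorted(matches), visited


# Hypothetical sample data.
insert_user(1, {"email": "a@example.com"})
insert_user(2, {"email": "b@example.com"})
insert_user(7, {"email": "c@example.com"})

record, visited_by_id = get_by_id(2)
ids, visited_by_field = find_by_field("email", "c@example.com")
```

In this toy setup the ID lookup visits 1 shard and the email lookup visits all 4, which is the cost a separate search index is meant to avoid.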

To make that lookup fast, one thing I am thinking of doing is setting up an ElasticSearch cluster. The service will write to the ES cluster every time it creates a new user record, and the ES cluster will index the user record on the relevant fields.
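As a rough sketch of what the service might send to the cluster, the Python below builds three JSON request bodies using only the standard library: an index mapping, a bulk-index payload, and an exact-match query. The index name `users` and the fields `email` and `phone` are hypothetical. The mapping shape and `keyword` type assume a recent Elasticsearch release (7.x or later); versions from around the time of this post used a different mapping syntax. Nothing here contacts a cluster.

```python
import json
from typing import List

INDEX_NAME = "users"                    # hypothetical index name
SEARCHABLE_FIELDS = ["email", "phone"]  # hypothetical searchable columns


def mapping_body() -> dict:
    """Create-index body: map each searchable field as an exact-match keyword."""
    return {
        "mappings": {
            "properties": {f: {"type": "keyword"} for f in SEARCHABLE_FIELDS}
        }
    }


def bulk_index_body(users: List[dict]) -> str:
    """Newline-delimited JSON for the bulk API: an action line, then a source line."""
    lines = []
    for user in users:
        # Reuse the database's user ID as the document ID.
        lines.append(json.dumps({"index": {"_index": INDEX_NAME, "_id": str(user["user_id"])}}))
        # Store only the searchable fields; the full record stays in the database.
        lines.append(json.dumps({f: user[f] for f in SEARCHABLE_FIELDS}))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


def exists_query(field: str, value: str) -> dict:
    """Search body: exact match on one field, return at most one hit, IDs only."""
    return {"query": {"term": {field: value}}, "size": 1, "_source": False}


payload = bulk_index_body([{"user_id": 42, "email": "a@example.com", "phone": "555-0100"}])
query = exists_query("email", "a@example.com")
```

The hit's `_id` is the user ID, so the service can go straight to the owning database shard for the full record instead of querying all of them.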

My questions are:

-- What kind of performance can I expect from ES here, assuming I have 100+ million user records where 5 columns of each record need to be indexed? I know it depends on the hardware configuration as well, but please assume well-tuned hardware.

-- Here I am trying to use ES as a memcache alternative that supports lookups by multiple keys. So I want the entire dataset to be in memory, and it does not need to be durable. Is ES the right tool for that?

Any comments or recommendations based on experience with ElasticSearch for large datasets are very much appreciated.

1 Answer

ES is not explicitly designed to run completely in memory; you normally wouldn't want to do that with large unbounded datasets in a Java application (though you can, using off-heap memory). Rather, it will cache what it can and rely on the OS's disk cache for the rest.

100+ million records shouldn't be an issue at all, even on a single machine. I run an index consisting of 15 million records of ~100 small fields (no large text fields), amounting to 65 GB of data on disk on a single machine. Fairly complex queries that just return id/score execute in less than 500 ms; queries that require loading the documents return in 1-1.5 seconds on a warmed-up VM against a single SSD. I tend to give the JVM 12-16 GB of memory; any more and I find it's better to scale via a cluster than to run a single huge VM.
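For a very rough sense of scale, the figures above can be extrapolated to the question's workload. This is a back-of-envelope sketch only: it assumes on-disk size grows linearly with documents × fields and that the question's fields are about the same size as the ones in this index, neither of which is guaranteed.

```python
# Figures quoted in the answer above.
answer_docs = 15_000_000
answer_fields_per_doc = 100
answer_disk_bytes = 65e9  # 65 GB, taken as decimal gigabytes (an assumption)

# Workload from the question.
question_docs = 100_000_000
question_fields_per_doc = 5

# Average on-disk cost per field, assuming size scales linearly.
bytes_per_field = answer_disk_bytes / (answer_docs * answer_fields_per_doc)

# Extrapolated index size for the question's workload, in GB.
estimated_gb = bytes_per_field * question_docs * question_fields_per_doc / 1e9

print(f"~{bytes_per_field:.0f} bytes per field, ~{estimated_gb:.1f} GB estimated")
```

Under those assumptions the index comes out smaller than the 65 GB one described here, which fits the claim that a single machine could handle it; real sizes depend on mappings, analyzers, and the actual data.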