
ElasticSearch overhead over Lucene + custom clustering solution

I have experience working on a project where full-text search was sped up by replacing ElasticSearch with Lucene + Hazelcast.

What might be the reasons for ElasticSearch's overhead over Lucene + Hazelcast? Which ElasticSearch settings could cause a significant slowdown on the same resources?

Arguments given for Lucene + Hazelcast

  1. ElasticSearch has significant overhead compared to Lucene
  2. Lucene is more flexible for indexing than ElasticSearch

My considerations

  1. Which overheads? As far as I know, you can make ElasticSearch communicate over its internal TCP transport API instead of REST (see the sketch after this list). Any other overheads? Are they only about replication (you can turn off replication during the initial load)? Or about automatic index merging? Maybe ElasticSearch merged indexes automatically and made them so big that they no longer fit in the FS cache?
  2. Why is the Lucene API more flexible? AFAIK, ElasticSearch provides all the same indexing capabilities plus additional features like parent-child or nested objects, which are not needed for this project. (See the indexing/querying schema below.)
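
To illustrate the first point, here is a minimal sketch of talking to ElasticSearch over the internal transport protocol (port 9300) instead of REST, assuming the 6.x TransportClient (deprecated since 7.0 and removed in 8.0); the cluster name, host, index and field names are placeholders:

```java
import java.net.InetAddress;

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.TransportAddress;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.transport.client.PreBuiltTransportClient;

public class TransportSearch {
    public static void main(String[] args) throws Exception {
        Settings settings = Settings.builder()
                .put("cluster.name", "my-cluster")   // placeholder cluster name
                .build();

        // Connects to the node's transport port (9300), bypassing the HTTP/REST layer
        try (TransportClient client = new PreBuiltTransportClient(settings)
                .addTransportAddress(new TransportAddress(InetAddress.getByName("es-node-1"), 9300))) {

            SearchResponse response = client.prepareSearch("my-index")       // placeholder index
                    .setQuery(QueryBuilders.termQuery("content", "needle"))  // placeholder field/value
                    .get();

            System.out.println("hits: " + response.getHits().getTotalHits());
        }
    }
}
```

Whether this removes a measurable overhead depends on the workload; HTTP/JSON serialization is usually a small part of query latency compared with the work Lucene itself does per shard.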

Lucene + Hazelcast indexing/querying schema:

  1. You have 100-10,000 huge string files compressed as AVRO in HDFS (in total gigabytes or even terabytes of data). They should be indexed in such a way that you can find all files containing a specific string.
  2. Submit an index task with Hazelcast to each cluster node.
  3. Each index task uses an IndexWriter to write a separate index on each node, working only with the local file system. This means each AVRO file forms one index per node. Each file row is a separate StringField.
  4. After indexing is finished on all nodes, the indexes are never changed, so there is no further write load. The number of indexes equals the number of files. The files are pretty big and their count is not that high, so indexes are not merged.
  5. Search with a simple Term query, specifying the paths to all indexes where the data may be present (see the sketch after this list).
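
A minimal sketch of this schema, assuming Lucene's FSDirectory/IndexWriter/StringField API and Hazelcast's IExecutorService; the index path, the "row" field name, the readAvroRows helper and the IndexFileTask class are hypothetical:

```java
import java.io.Serializable;
import java.nio.file.Paths;
import java.util.List;

import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IExecutorService;

// Step 2: fan the indexing work out to every cluster node.
class Driver {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IExecutorService exec = hz.getExecutorService("index-tasks");
        exec.executeOnAllMembers(new IndexFileTask("part-00001.avro")); // hypothetical file name
    }
}

// Step 3: on each node, write one local Lucene index per AVRO file,
// with every file row stored as a single un-analyzed StringField.
class IndexFileTask implements Runnable, Serializable {
    private final String fileName;

    IndexFileTask(String fileName) { this.fileName = fileName; }

    @Override
    public void run() {
        try (FSDirectory dir = FSDirectory.open(Paths.get("/local/index/" + fileName));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new KeywordAnalyzer()))) {
            for (String row : readAvroRows(fileName)) {               // hypothetical AVRO reader
                Document doc = new Document();
                doc.add(new StringField("row", row, Field.Store.NO)); // indexed as one exact token
                writer.addDocument(doc);
            }
            // Step 4: the index is written once and never modified again.
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    private List<String> readAvroRows(String fileName) {
        throw new UnsupportedOperationException("read rows from the local AVRO copy here");
    }
}

// Step 5: query each relevant per-file index with a plain Term query.
class Searcher {
    static int countMatches(String indexPath, String needle) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(indexPath)))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new TermQuery(new Term("row", needle)), 10);
            return hits.scoreDocs.length;   // number of hits returned (capped at 10 here)
        }
    }
}
```

Note that a StringField is indexed as a single un-analyzed token, so a Term query only matches whole row values exactly; if substring or token-level matches are needed, the rows would have to be tokenized before indexing.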

My reasons for using ES in this case would be:

  • Future needs of the project to explore the data in more ways

  • Feature-rich Aggregations API (see the sketch below)

  • Support for indexing with Spark / Hive etc. - very easy to do, and data pre-processing can be used efficiently

  • Auto scaling / adjusting the number of replicas based on demand

And of course, not having to maintain a codebase to do all of this. This thread will be a good discussion if you can add some expectations about flexibility from your end.
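
To make the Aggregations point concrete, here is a minimal sketch of a terms aggregation over the matching documents, assuming the 7.x Java high-level REST client; the host, index name and the content/file_name fields are placeholders:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class AggregationExample {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Term query for the string, plus one bucket per file it appears in
            SearchSourceBuilder source = new SearchSourceBuilder()
                    .query(QueryBuilders.termQuery("content", "needle"))
                    .aggregation(AggregationBuilders.terms("by_file").field("file_name"));

            SearchResponse response = client.search(
                    new SearchRequest("my-index").source(source), RequestOptions.DEFAULT);

            Terms byFile = response.getAggregations().get("by_file");
            byFile.getBuckets().forEach(b ->
                    System.out.println(b.getKeyAsString() + " -> " + b.getDocCount()));
        }
    }
}
```

Getting the same per-file counts out of plain per-file Lucene indexes would mean writing and maintaining that grouping logic yourself, which is exactly the "not maintaining a codebase" point above.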
