

Index a large amount of data into Elasticsearch

I've got more than 6 billion social media records in HBase (including content/time/author and other possible fields), spread over 4,100 regions on 48 servers, and I now need to flush these data into Elasticsearch.

I'm clear about the bulk API of ES, but using bulk from Java with MapReduce would still cost many days (at least a week or so). I could use Spark instead, but I don't think it would help a lot.

I'm wondering whether there are any other tricks to write this much data into Elasticsearch, like manually writing ES index files and using some kind of recovery to load the files from the local file system?

I'd appreciate any possible advice, thanks.

==============

Some details about my cluster environment:

Spark 1.3.1 standalone (I can change it to run on YARN in order to use Spark 1.6.2 or 1.6.3)

Hadoop 2.7.1 (HDP 2.4.2.258)

Elasticsearch 2.3.3

AFAIK, Spark is the better option for indexing out of the two options below. Along with that, these are the approaches I'd offer:

Divide (by input scan criteria) and conquer the 6 billion social media records:

I'd recommend creating multiple Spark/MapReduce jobs with different scan criteria (dividing the 6 billion records into, say, 6 pieces based on category or something else) and triggering them in parallel, for example based on the data-capture time range (scan.setTimeRange(t1, t2)) or with some fuzzy row logic (FuzzyRowFilter); this should definitely speed things up. A sketch of such a time-range split follows.
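A minimal sketch of the time-range split, assuming the standard HBase client and TableMapReduceUtil APIs; the table name "social_media", the epoch-millisecond boundaries, and the EsIndexMapper class (the mapper you would write to emit bulk requests to ES) are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

public class SliceSubmitter {
    // Submit one MapReduce job per time slice so the slices index in parallel.
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Placeholder epoch-ms boundaries; pick them so each slice holds a manageable share.
        long[] bounds = {1420070400000L, 1451606400000L, 1483228800000L};
        for (int i = 0; i < bounds.length - 1; i++) {
            Scan scan = new Scan();
            scan.setCaching(500);            // larger scanner caching for full scans
            scan.setCacheBlocks(false);      // don't pollute the region server block cache
            scan.setTimeRange(bounds[i], bounds[i + 1]);

            Job job = Job.getInstance(conf, "hbase-to-es-slice-" + i);
            TableMapReduceUtil.initTableMapperJob(
                    "social_media", scan, EsIndexMapper.class, null, null, job);
            job.submit();                    // fire and forget; monitor the slices in parallel
        }
    }
}
```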

OR

A kind of streaming approach:

You can also consider creating the indexes at the same time as you are inserting the data through Spark or MapReduce.

For example, in the case of Solr, Cloudera has the NRT HBase Lily indexer: as the HBase table is populated, it simultaneously creates Solr indexes based on the WAL (write-ahead log) entries. Check whether anything like that exists for Elasticsearch.

Even if nothing like that exists for ES, don't worry: while ingesting the data with a Spark/MapReduce program, you can create the indexes yourself.

Option 1:

I'd suggest that if you are okay with Spark, it is a good solution: Spark has native integration with ES through elasticsearch-hadoop (since elasticsearch-hadoop 2.1). See:

elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD (Resilient Distributed Dataset) (or Pair RDD to be precise) that can read data from Elasticsearch. The RDD is offered in two flavors: one for Scala (which returns the data as Tuple2 with Scala collections) and one for Java (which returns the data as Tuple2 containing java.util collections).
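A minimal write-side sketch of that integration from Java, assuming the elasticsearch-hadoop (Spark) artifact is on the classpath; the ES node address, the "social/media" index/type, and the in-memory stand-in for the HBase RDD are placeholders:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class HBaseToEs {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("hbase-to-es")
                .set("es.nodes", "es-node1:9200")        // placeholder ES coordinates
                .set("es.batch.size.entries", "3000")    // tune bulk size per task
                .set("es.batch.write.refresh", "false"); // skip the refresh after each bulk
        JavaSparkContext sc = new JavaSparkContext(conf);

        // In the real job this RDD would come from an HBase scan (newAPIHadoopRDD);
        // here a single in-memory document stands in for it.
        Map<String, Object> doc = new HashMap<>();
        doc.put("author", "someone");
        doc.put("content", "hello");
        doc.put("time", 1466000000000L);
        JavaRDD<Map<String, Object>> docs = sc.parallelize(Collections.singletonList(doc));

        // Each partition issues its own bulk requests against the target index/type.
        JavaEsSpark.saveToEs(docs, "social/media");
        sc.stop();
    }
}
```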

Option 2: as you are aware, a bit slower than Spark

Writing data to Elasticsearch: with elasticsearch-hadoop, Map/Reduce jobs can write data to Elasticsearch, making it searchable through indexes. elasticsearch-hadoop supports both the (so-called) old and new Hadoop APIs.
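A corresponding sketch of the Map/Reduce side, assuming the new Hadoop API and elasticsearch-hadoop's EsOutputFormat; the node address, the "social/media" index/type, and the commented-out mapper class are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.elasticsearch.hadoop.mr.EsOutputFormat;

public class EsBulkJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("es.nodes", "es-node1:9200");      // placeholder ES coordinates
        conf.set("es.resource", "social/media");    // target index/type
        conf.set("es.input.json", "yes");           // mapper emits JSON strings as values
        conf.setBoolean("mapreduce.map.speculative", false);    // avoid duplicate bulk writes
        conf.setBoolean("mapreduce.reduce.speculative", false);

        Job job = Job.getInstance(conf, "hbase-to-es-mr");
        job.setOutputFormatClass(EsOutputFormat.class);
        job.setOutputKeyClass(NullWritable.class);  // EsOutputFormat ignores the key
        job.setOutputValueClass(Text.class);        // JSON documents
        job.setNumReduceTasks(0);                   // map-only job, write straight to ES
        // job.setMapperClass(YourHBaseToJsonMapper.class);  // placeholder mapper
        job.waitForCompletion(true);
    }
}
```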

I found a practical trick myself that improves bulk indexing performance.

I can calculate the hash routing in my client and make sure that each bulk request contains only index requests with the same routing. Based on the routing result and the shard info with IPs, I send the bulk request directly to the corresponding shard node. This trick avoids the bulk reroute cost and cuts down the occupation of the bulk request thread pool, which may otherwise cause EsRejectedExecutionException.

For example, I have 48 nodes on different machines. Assuming that I send a bulk request containing 3,000 index requests to an arbitrary node, these index requests will be rerouted to other nodes (usually all the nodes) according to their routing, and the client thread has to wait for the whole process to finish, including processing the local bulk and waiting for the other nodes' bulk responses. Without the reroute phase, however, those network costs are gone (except for forwarding to the replica nodes) and the client just needs to wait less time. Meanwhile, assuming that I have only 1 replica, the total occupation of bulk threads is only 2 (client -> primary shard, and primary shard -> replica shard).

Routing hash:

shard_num = murmur3_hash(_routing) % num_primary_shards

Try taking a look at: org.elasticsearch.cluster.routing.Murmur3HashFunction
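A rough sketch of the grouping step, assuming an ES 2.3 Java client per node. The hashShard helper is a placeholder: it must reproduce Murmur3HashFunction's result for your ES version (the method signature differs between versions). The "social"/"media" index/type and the shard-to-client map (built from the cat APIs mentioned below) are also placeholders:

```java
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.client.Client;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RoutedBulkSender {
    // clientsByShard: one Client per node, keyed by the shard whose primary lives there
    // (built from the _cat/shards output). numPrimaryShards matches the index settings.
    private final Map<Integer, Client> clientsByShard;
    private final int numPrimaryShards;

    public RoutedBulkSender(Map<Integer, Client> clientsByShard, int numPrimaryShards) {
        this.clientsByShard = clientsByShard;
        this.numPrimaryShards = numPrimaryShards;
    }

    // Placeholder: must return the same value as
    // org.elasticsearch.cluster.routing.Murmur3HashFunction for your ES version.
    private int hashShard(String routing) {
        return Math.floorMod(routing.hashCode(), numPrimaryShards); // replace with murmur3
    }

    public void send(List<Doc> docs) {
        // Group documents by the shard their routing resolves to.
        Map<Integer, List<Doc>> byShard = new HashMap<>();
        for (Doc d : docs) {
            byShard.computeIfAbsent(hashShard(d.routing), k -> new ArrayList<>()).add(d);
        }
        // One bulk per shard, sent straight to the node that holds that primary.
        for (Map.Entry<Integer, List<Doc>> e : byShard.entrySet()) {
            Client client = clientsByShard.get(e.getKey());
            BulkRequestBuilder bulk = client.prepareBulk();
            for (Doc d : e.getValue()) {
                bulk.add(client.prepareIndex("social", "media", d.id)
                        .setRouting(d.routing)
                        .setSource(d.json));
            }
            bulk.get(); // check the response for per-item failures in real code
        }
    }

    public static class Doc {
        public String id, routing, json;
    }
}
```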

The client can get the shards and the index aliases by querying the cat APIs.

shard info URL: cat shards

aliases mapping URL: cat aliases
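For instance, a tiny sketch of fetching the shard-to-node mapping over HTTP from plain Java (the node address is a placeholder):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class CatShards {
    public static void main(String[] args) throws Exception {
        // Lists every shard with its index, shard number, primary/replica flag, IP and node,
        // which is exactly what the client needs to route each bulk to the primary's node.
        URL url = new URL("http://es-node1:9200/_cat/shards?v");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // parse the index / shard / prirep / ip / node columns
            }
        }
    }
}
```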

A few caveats:

  1. ES may change the default hash function between versions, which means the client code may not be version-compatible.
  2. This trick is based on the assumption that the hash results are basically balanced.
  3. The client should think about fault tolerance, such as a connection timeout to the corresponding shard node.
