Kafka Elasticsearch Connector for bulk operations

Original post: 2020-12-15 20:16:56. Tags: elasticsearch, apache-kafka, apache-kafka-connect

I am using the Elasticsearch Sink Connector for operations (index, update, delete) on single records.

Elasticsearch also has a /_bulk endpoint, which can be used to create, update, index, or delete multiple records at once. Documentation here.

Does the Elasticsearch Sink Connector support these types of bulk operations? If so, what configuration do I need, or is there any sample code I can review?
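For reference, the /_bulk endpoint mentioned above takes a newline-delimited JSON body: one action line per operation, followed by a source line for every action except delete. The sketch below only builds such a body and sends nothing. The helper name `build_bulk_body`, the index name `orders`, and the document fields are hypothetical, chosen purely for illustration.

```python
import json


def build_bulk_body(actions):
    """Serialize (action, metadata, source) tuples into an NDJSON _bulk body.

    `source` is None for actions that carry no document (delete).
    """
    lines = []
    for action, meta, source in actions:
        # Action line, e.g. {"index": {"_index": "orders", "_id": "1"}}
        lines.append(json.dumps({action: meta}))
        if source is not None:
            # Source line: the document, or {"doc": ...} for a partial update
            lines.append(json.dumps(source))
    # The bulk body must end with a newline character.
    return "\n".join(lines) + "\n"


# Hypothetical index name and documents, for illustration only.
body = build_bulk_body([
    ("index", {"_index": "orders", "_id": "1"}, {"status": "new"}),
    ("update", {"_index": "orders", "_id": "2"}, {"doc": {"status": "paid"}}),
    ("delete", {"_index": "orders", "_id": "3"}, None),
])
print(body, end="")
```

A body like this would be sent with the `Content-Type: application/x-ndjson` header. With the sink connector you do not build it yourself; the question is whether the connector does the equivalent internally.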

1 Answer

Internally, the Elasticsearch sink connector creates a bulk processor that sends records in batches. To control this processor, configure the following properties:

batch.size: The number of records to process as a batch when writing to Elasticsearch.
max.in.flight.requests: The maximum number of indexing requests that can be in flight to Elasticsearch before further requests are blocked.
max.buffered.records: The maximum number of records each task will buffer before it stops accepting more records. This setting can be used to limit the memory usage of each task.
linger.ms: Records that arrive between request transmissions are batched into a single bulk indexing request, based on the batch.size configuration. Normally this only happens under load, when records arrive faster than they can be sent out. However, it may be desirable to reduce the number of requests even under light load and still benefit from bulk indexing. This setting helps with that: when a pending batch is not full, the task does not send it immediately but waits up to the given delay so that other records can be added and batched into a single request.
flush.timeout.ms: The timeout in milliseconds to use for periodic flushing, and when waiting for buffer space to be made available by completed requests as records are added. If this timeout is exceeded, the task will fail.
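As a rough sketch of how the properties above fit into a connector definition, the snippet below assembles a JSON payload of the shape the Kafka Connect REST API accepts. The helper name `bulk_tuned_connector`, the connector name, topic, URL, and all numeric values are illustrative assumptions, not recommended settings; tune them for your own workload.

```python
import json


def bulk_tuned_connector(name, topics, es_url):
    """Build a connector definition with the batching properties set.

    All numeric values are illustrative assumptions, not recommendations.
    """
    return {
        "name": name,
        "config": {
            "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
            "topics": topics,
            "connection.url": es_url,
            # Records per bulk request
            "batch.size": "500",
            # Concurrent bulk requests allowed before blocking
            "max.in.flight.requests": "5",
            # Per-task buffer cap; bounds memory usage
            "max.buffered.records": "20000",
            # Wait up to 1 s for a non-full batch to fill
            "linger.ms": "1000",
            # The task fails if a flush takes longer than this
            "flush.timeout.ms": "10000",
        },
    }


# Hypothetical connector name, topic, and Elasticsearch URL.
payload = bulk_tuned_connector("es-sink-orders", "orders", "http://localhost:9200")
print(json.dumps(payload, indent=2))
```

The printed JSON could then be submitted to a Connect worker with a POST to its /connectors endpoint. Raising linger.ms trades a little latency for fuller batches, which is the main lever for getting bulk behavior under light load.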