Configuring connectors for multiple topics in Kafka Connect distributed mode

Question

We have producers that send the following to Kafka:

topic=syslog, ~25,000 events per day
topic=nginx, ~5,000 events per day
topic=zeek.xxx.log, ~100,000 events per day (total). In this last case there are 20 distinct zeek topics, such as zeek.conn.log and zeek.http.log.

kafka-connect-elasticsearch instances act as consumers that ship data from Kafka to Elasticsearch. A hello-world sink configuration for kafka-connect-elasticsearch might look like this:

# elasticsearch.properties
name=elasticsearch-sink
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=24
topics=syslog,nginx,zeek.broker.log,zeek.capture_loss.log,zeek.conn.log,zeek.dhcp.log,zeek.dns.log,zeek.files.log,zeek.http.log,zeek.known_services.log,zeek.loaded_scripts.log,zeek.notice.log,zeek.ntp.log,zeek.packet_filtering.log,zeek.software.log,zeek.ssh.log,zeek.ssl.log,zeek.status.log,zeek.stderr.log,zeek.stdout.log,zeek.weird.log,zeek.x509.log
topic.creation.enable=true
key.ignore=true
schema.ignore=true
...

It can be invoked with bin/connect-standalone.sh. I realized that running, or attempting to run, tasks.max=24 when all the work is performed in a single process is not ideal. I know that distributed mode would be a better alternative, but I am unclear on the performance-optimal way to submit connectors in distributed mode. Namely:

In distributed mode, would I still want to submit just a single elasticsearch.properties through a single API call? Or would it be best to break it up into multiple .properties configs + connectors (e.g. one for syslog, one for nginx, one for zeek.**) and submit them separately?
I understand that the number of tasks should equal the number of topics x the number of partitions, but what dictates the number of workers?
Is there anywhere in the documentation that walks through best practices for a situation such as this, where there is a noticeable imbalance of throughput between topics?
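
For reference, the "break it up" option from the first question could be sketched roughly as below. This is only an illustration: the connector names (es-sink-syslog, es-sink-nginx, es-sink-zeek) and the per-connector tasks.max values are hypothetical and not taken from the question. The sketch relies on the standard sink connector rule that a config carries either topics or topics.regex, not both.

```python
def split_connector(base_config, groups):
    """Build one connector config per group from a shared base config.

    base_config: settings shared by all connectors (dict of str -> str).
    groups: mapping of connector name -> overrides for that connector.
    """
    configs = {}
    for name, overrides in groups.items():
        # Drop per-connector keys so each group sets exactly one of
        # topics / topics.regex through its overrides.
        config = {
            key: value
            for key, value in base_config.items()
            if key not in ("name", "topics", "topics.regex")
        }
        config.update(overrides)
        configs[name] = config
    return configs


base = {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "key.ignore": "true",
    "schema.ignore": "true",
}

# Hypothetical grouping: names and tasks.max values are placeholders.
groups = {
    "es-sink-syslog": {"topics": "syslog", "tasks.max": "1"},
    "es-sink-nginx": {"topics": "nginx", "tasks.max": "1"},
    "es-sink-zeek": {"topics.regex": r"zeek\..*\.log", "tasks.max": "4"},
}

configs = split_connector(base, groups)
```

Splitting this way lets each group be paused, restarted, or given a different tasks.max independently, at the cost of more configs to manage.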

Answer 1

In distributed mode, would I still want to submit just a single elasticsearch.properties through a single API call?

It'd be a JSON file, but yes.
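
As a minimal sketch of what "a JSON file" means here: the Kafka Connect REST API accepts a body of the form {"name": ..., "config": {...}} on POST /connectors. The helper names below (properties_to_payload, submit_connector) are made up for illustration, and the base URL assumes a worker listening on localhost:8083, which may not match your setup. The sample properties are a shortened version of the config in the question.

```python
import json
import urllib.request


def properties_to_payload(properties_text):
    """Convert .properties-style text into a POST /connectors JSON body."""
    config = {}
    for line in properties_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        config[key.strip()] = value.strip()
    # The connector name moves to the top level of the request body.
    name = config.pop("name")
    return {"name": name, "config": config}


def submit_connector(payload, base_url="http://localhost:8083"):
    """POST the payload to a Connect worker (base_url is an assumption)."""
    request = urllib.request.Request(
        f"{base_url}/connectors",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


EXAMPLE_PROPERTIES = """
# elasticsearch.properties (shortened; remaining settings elided)
name=elasticsearch-sink
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=24
topics=syslog,nginx
key.ignore=true
schema.ignore=true
"""

payload = properties_to_payload(EXAMPLE_PROPERTIES)
body = json.dumps(payload, indent=2)
```

Calling submit_connector(payload) against a running distributed worker would register the connector; the same body can also be saved to a file and sent with any HTTP client.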

what dictates the number of workers?

Up to you. JVM usage is one factor that you can monitor and scale on.
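
JVM metrics themselves are outside the scope of a short example, but a related signal is how a connector's tasks are spread across workers, which the Connect REST API exposes at GET /connectors/{name}/status. The sketch below counts tasks per worker from such a response. The helper names, the localhost:8083 base URL, and the sample response (including its worker IDs) are assumptions for illustration only.

```python
import json
import urllib.request
from collections import Counter


def fetch_status(connector_name, base_url="http://localhost:8083"):
    """Fetch a connector's status from a Connect worker (URL is assumed)."""
    url = f"{base_url}/connectors/{connector_name}/status"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def tasks_per_worker(status):
    """Count how many of the connector's tasks each worker is running."""
    return Counter(task["worker_id"] for task in status.get("tasks", []))


# Hypothetical response, shaped like the status endpoint's output.
sample_status = {
    "name": "elasticsearch-sink",
    "connector": {"state": "RUNNING", "worker_id": "10.0.0.1:8083"},
    "tasks": [
        {"id": 0, "state": "RUNNING", "worker_id": "10.0.0.1:8083"},
        {"id": 1, "state": "RUNNING", "worker_id": "10.0.0.2:8083"},
        {"id": 2, "state": "RUNNING", "worker_id": "10.0.0.1:8083"},
    ],
}

counts = tasks_per_worker(sample_status)
```

If one worker carries most of the tasks and its JVM is under pressure, adding a worker to the cluster gives Connect somewhere to rebalance them.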

Not really any documentation that I am aware of.

Answer 1 posted 2021-11-08 20:02:45
