kafka-connect returning 409 in distributed mode

I'm running a kafka-connect distributed setup.

I was testing with a single machine/process setup (still in distributed mode), which worked fine. Now I'm working with 3 nodes (and 3 connect processes). The logs contain no errors, but when I submit an s3-connector request through the REST API, it returns: {"error_code":409,"message":"Cannot complete request because of a conflicting operation (eg worker rebalance)"}

When I stop the kafka-connect process on one of the nodes, I can actually submit the job and everything runs fine.

I have 3 brokers in my cluster, and the topic has 32 partitions.

This is the connector I'm trying to launch:

{
    "name": "s3-sink-new-2",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "tasks.max": "32",
        "topics": "rawEventsWithoutAttribution5",
        "s3.region": "us-east-1",
        "s3.bucket.name": "dy-raw-collection",
        "s3.part.size": "64000000",
        "flush.size": "10000",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
        "schema.generator.class": "io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator",
        "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
        "partition.duration.ms": "60000",
        "path.format": "\'year\'=YYYY/\'month\'=MM/\'day\'=dd/\'hour\'=HH",
        "locale": "US",
        "timezone": "GMT",
        "timestamp.extractor": "RecordField",
        "timestamp.field": "procTimestamp",
        "name": "s3-sink-new-2"
    }
}
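
For context, a config like this is submitted to the Connect REST API along the following lines (localhost:8083 and the file name are assumptions; any worker in the cluster can receive the request):

# Submit the connector config to any worker's REST endpoint
curl -X POST -H "Content-Type: application/json" \
  --data @s3-sink-new-2.json \
  http://localhost:8083/connectors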

Nothing in the logs indicates a problem, and I'm really lost here.

I had the same problem with my setup on Kubernetes. The issue was that I had CONNECT_REST_ADVERTISED_HOST_NAME set to the same value on each of the 16 nodes, which caused a constant rebalancing issue. Give each node a unique value and you should be fine.

The solution for K8S, which works for me:

- env:
  # Advertise the pod's own IP so each worker has a unique, resolvable address
  - name: CONNECT_REST_ADVERTISED_HOST_NAME
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
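
To confirm the fix, you can call one worker's root REST endpoint from another pod; each worker should answer on its advertised address. A minimal check, assuming the default REST port 8083 (the IP below is a placeholder for a pod IP):

# Run from any other pod; 10.0.0.12 stands in for a worker pod's IP
curl http://10.0.0.12:8083/
# A reachable worker replies with its version info, e.g. {"version":"...","commit":"..."}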

Same as for @OmriManor, in my case it was an issue with one of the nodes causing a rebalance loop. What I did was to pause the connector, then stop all nodes except for one; I was then able to delete the connector, since the single remaining node did not cause the rebalance loop.

As Wojciech Sznapka has said, CONNECT_REST_ADVERTISED_HOST_NAME (rest.advertised.host.name if you're not using Docker) is the issue here. It needs to be set not just to a unique value, but to the correct hostname of the worker, one that can be resolved from the other workers.

rest.advertised.host.name is used by Kafka Connect to determine how to contact the other workers, for example when a worker that is not the leader needs to forward a REST request on to the leader. If this config is not set correctly, problems ensue.
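
For a non-Docker deployment, the equivalent is a per-worker entry in the worker properties file. A minimal sketch, assuming each worker has its own hostname that the other workers can resolve (the hostnames below are placeholders):

# connect-distributed.properties on worker 1 (hostname is a placeholder)
rest.advertised.host.name=worker1.example.internal
# worker 2 would advertise its own name, e.g. worker2.example.internal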

If you have a cluster of workers and you shut all but one down and suddenly things work, that's because by shutting the others down you've guaranteed that the remaining worker is the leader and thus won't have to forward the request on.

For more details see https://rmoff.net/2019/11/22/common-mistakes-made-when-configuring-multiple-kafka-connect-workers/

In my case, the 409 error disappeared when I deleted the old internal topics used by Kafka Connect!
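
The internal topics in question are the ones named by the worker's config.storage.topic, offset.storage.topic, and status.storage.topic settings. A sketch of removing them with the stock CLI, assuming the common default names and a local broker (note that deleting the offsets topic discards committed source-connector offsets; a distributed worker recreates missing internal topics on startup):

# Topic names below are assumptions; check your worker config first
kafka-topics --bootstrap-server localhost:9092 --list | grep connect
kafka-topics --bootstrap-server localhost:9092 --delete --topic connect-configs
kafka-topics --bootstrap-server localhost:9092 --delete --topic connect-offsets
kafka-topics --bootstrap-server localhost:9092 --delete --topic connect-status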
