
Kafka connector sink for topic not known in advance

Generic explanation: my application consumes messages from a topic and then splits them into separate topics according to their ID, so the topics are named like topic_name_id. My goal is to connect those new topics to a certain sink (S3 or Snowflake, I haven't decided yet) so that the messages published to those topics end up there. However, I've only found ways to do this with a configuration file, where you connect the sink to a topic that already exists and whose name you know. Here, though, the goal is to connect the sink to a topic created during the process. Is there a way to achieve this?

If the above is not possible, is there a way to connect the sink to the common topic containing all the messages, but create different tables (in Snowflake) or S3 directories according to the message ID? Also, in the case of S3, the messages are added as individual JSON files, right? Is there no way to combine them into one file?

Thanks

The outgoing IDs are known, right?

Kafka Connect exposes a REST API: you can generate a JSON request body from those IDs and the finalized topic names, then use requests, for example, to create and start connectors for those topics. You can do that directly from your process right before starting the producer, or you can send a request with the ID/topic name to a Lambda job instead, which communicates with the Connect API.
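A minimal sketch of that idea, assuming a Connect worker at localhost:8083 and Confluent's S3 sink connector (the bucket name, region, and connector name pattern below are placeholders to adapt); the HTTP call uses Python's stdlib urllib here, though requests works the same way:

```python
import json
import urllib.request

CONNECT_URL = "http://localhost:8083"  # assumption: your Connect worker's REST endpoint

def build_s3_sink_config(topic: str) -> dict:
    """Build the connector-creation payload for one dynamically named topic."""
    return {
        "name": f"s3-sink-{topic}",
        "config": {
            "connector.class": "io.confluent.connect.s3.S3SinkConnector",
            "topics": topic,
            "s3.bucket.name": "my-bucket",   # assumption: your bucket
            "s3.region": "us-east-1",
            "storage.class": "io.confluent.connect.s3.storage.S3Storage",
            "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
            "flush.size": "1000",
            "tasks.max": "1",
        },
    }

def register_connector(topic: str) -> None:
    """POST the config to the Connect REST API to create and start the sink."""
    payload = json.dumps(build_s3_sink_config(topic)).encode("utf-8")
    req = urllib.request.Request(
        f"{CONNECT_URL}/connectors",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status)  # Connect returns 201 Created on success
```

You would call register_connector("topic_name_42") from the process right before starting the producer for that ID, or have a Lambda receive the topic name and make the same call.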

When using different topics with the S3 sink connector, there will be separate S3 paths and separate files, based on the number of partitions in the topic and the other partitioner settings defined in your connector properties. Most S3 consumers can read across full S3 prefixes, though, so I don't expect that to be an issue.
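On the fallback question: if you keep a single common topic instead, the connector's partitioner can still split records into per-ID S3 prefixes, and flush.size already combines multiple records into one S3 object, so messages are not necessarily written one file each. A sketch of the relevant connector properties, assuming Confluent's S3 sink and that each record carries a field named id:

```properties
# Route records from one shared topic into per-ID S3 prefixes
partitioner.class=io.confluent.connect.storage.partitioner.FieldPartitioner
partition.field.name=id

# Combine up to 1000 records into a single S3 object per partition
flush.size=1000
format.class=io.confluent.connect.s3.format.json.JsonFormat
```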

I don't have experience with the Snowflake connector, so I can't say how it handles different topic names.

