
KTable-KTable foreign-key join not producing all messages when topics have more than one partition

See the update below for a potential workaround.

Our application consumes 2 topics as KTables, performs a left join, and outputs the result to a topic. During testing, we found that this works as expected when our output topic has only 1 partition. When we increase the number of partitions, we notice that the number of messages that get produced to the output topic decreases.

We tested this theory with multiple partition configurations prior to starting the app. With 1 partition, we see 100% of the messages. With 2, we see some messages (less than 50%). With 10, we see barely any (less than 10%).

Because we are left joining, every single message that is consumed from Topic 1 should get written to our output topic, but we're finding that this is not happening. It seems like messages are getting stuck in the "intermediate" topics created by the foreign-key join of the KTables, but there are no error messages.

Any help would be greatly appreciated!

Service.java

@Bean
public BiFunction<KTable<MyKey, MyValue>, KTable<MyOtherKey, MyOtherValue>, KStream<MyKey, MyEnrichedValue>> process() {

    return (topicOne, topicTwo) ->
            topicOne
                    .leftJoin(topicTwo,
                            value -> MyOtherKey.newBuilder()
                                    .setFieldA(value.getFieldA())
                                    .setFieldB(value.getFieldB())
                                    .build(),
                            this::enrich)
                    .toStream();
}

build.gradle

plugins {
    id 'org.springframework.boot' version '2.3.1.RELEASE'
    id 'io.spring.dependency-management' version '1.0.9.RELEASE'
    id 'com.commercehub.gradle.plugin.avro' version '0.9.1'
}

...

ext {
    set('springCloudVersion', "Hoxton.SR6")
}

...

implementation 'org.springframework.cloud:spring-cloud-stream-binder-kafka-streams'
implementation 'io.confluent:kafka-streams-avro-serde:5.5.1'

Note: We are excluding the org.apache.kafka dependencies due to a bug in the versions included in spring-cloud-stream.
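
For illustration, such an exclusion might look like the following in build.gradle; the explicit Kafka version pin is an assumption for the sketch (kafka-streams-avro-serde 5.5.1 tracks the Kafka 2.5.x line), not something stated in the question:

implementation('org.springframework.cloud:spring-cloud-stream-binder-kafka-streams') {
    // exclude the transitive Kafka artifacts pulled in by spring-cloud-stream
    exclude group: 'org.apache.kafka'
}
// pin the Kafka artifacts explicitly (versions are assumptions for illustration)
implementation 'org.apache.kafka:kafka-clients:2.5.1'
implementation 'org.apache.kafka:kafka-streams:2.5.1'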

application.yml

spring:
  application:
    name: app-name
  cloud:
    stream:
      bindings:
        process-in-0:
          destination: topic1
          group: ${spring.application.name}
        process-in-1:
          destination: topic2
          group: ${spring.application.name}
        process-out-0:
          destination: outputTopic
      kafka:
        streams:
          binder:
            applicationId: ${spring.application.name}
            brokers: ${KAFKA_BROKERS}
            configuration:
              commit.interval.ms: 1000
              producer:
                acks: all
                retries: 20
              default:
                key:
                  serde: io.confluent.kafka.streams.serdes.avro.SpecificAvroSerde
                value:
                  serde: io.confluent.kafka.streams.serdes.avro.SpecificAvroSerde
            min-partition-count: 2

Test Scenario:

To provide a concrete example, if I publish the following 4 messages to Topic 1:

{"fieldA": 1, "fieldB": 1},,{"fieldA": 1, "fieldB": 1}
{"fieldA": 2, "fieldB": 2},,{"fieldA": 2, "fieldB": 2}
{"fieldA": 3, "fieldB": 3},,{"fieldA": 3, "fieldB": 3}
{"fieldA": 4, "fieldB": 4},,{"fieldA": 4, "fieldB": 4}

The output topic will only receive 2 messages:

{"fieldA": 2, "fieldB": 2},,{"fieldA": 2, "fieldB": 2}
{"fieldA": 3, "fieldB": 3},,{"fieldA": 3, "fieldB": 3}

What happened to the other 2? It seems certain key/value pairs are just unable to get written to the output topic. Retrying these "lost" messages does not work either.

Update:

I was able to get this functioning properly by consuming Topic 1 as a KStream instead of a KTable and calling toTable() before going on to do the KTable-KTable join. I am still not sure why my original solution does not work, but hopefully this workaround can shed some light on the actual issue.

@Bean
public BiFunction<KStream<MyKey, MyValue>, KTable<MyOtherKey, MyOtherValue>, KStream<MyKey, MyEnrichedValue>> process() {

    return (topicOne, topicTwo) ->
            topicOne
                    .map(...)
                    .toTable()
                    .leftJoin(topicTwo,
                            value -> MyOtherKey.newBuilder()
                                    .setFieldA(value.getFieldA())
                                    .setFieldB(value.getFieldB())
                                    .build(),
                            this::enrich)
                    .toStream();
}

Given the description of the problem, it seems that the data in the (left) KTable input topic is not correctly partitioned by its key. For a single-partition topic, well, there is only one partition: all data goes to this one partition and the join result is complete.

However, for a multi-partitioned input topic, you need to ensure that the data is partitioned by key; otherwise, two records with the same key might end up in different partitions and thus the join fails (as the join is done on a per-partition basis).

Note that even if a foreign-key join does not require that both input topics be co-partitioned, it is still required that each input topic itself is partitioned by its key!

If you use map().toTable(), you basically trigger an internal repartitioning of the data that ensures the data gets partitioned by key, and this fixes the problem.
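
For illustration, a minimal plain Kafka Streams sketch of making that repartitioning explicit (MyKey/MyValue and the topic name are taken from the question; default serdes are assumed to be configured):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

// Sketch: force a repartition-by-key before materializing the KTable.
// selectKey() is a key-changing operation, so it flags the stream for
// repartitioning even though the key value is unchanged; toTable() then
// routes the data through an internal, key-partitioned topic.
StreamsBuilder builder = new StreamsBuilder();
KStream<MyKey, MyValue> stream = builder.stream("topic1"); // default serdes assumed
KTable<MyKey, MyValue> repartitioned = stream
        .selectKey((key, value) -> key) // marks the stream as repartition-required
        .toTable();                     // materializes via an internal repartition topic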

I had a similar issue. I have two incoming KStreams, which I converted to KTables, and then performed a KTable-KTable FK join. Kafka Streams produced absolutely no records; the join was never performed.

Repartitioning the KStreams didn't work for me. Instead, I had to manually set the partition size to 1.

Here's a stripped-down example of what doesn't work:

Note: I'm using Kotlin, with some extension helper functions.

fun enrichUsersData(
  userDataStream: KStream<UserId, UserData>,
  environmentDataStream: KStream<RealmId, EnvironmentMetaData>,
) {

  // aggregate all users on a server into an aggregating DTO
  val userDataTable: KTable<ServerId, AggregatedUserData> =
    userDataStream
      .groupBy { _: UserId, userData: UserData -> userData.serverId }
      .aggregate({ AggregatedUserData() }) { serverId: ServerId, userData: UserData, usersAggregate: AggregatedUserData ->
        usersAggregate
          .addUserData(userData)
          .setServerId(serverId)
        return@aggregate usersAggregate
      }

  // convert all incoming environment data into a KTable
  val environmentDataTable: KTable<RealmId, EnvironmentMetaData> =
    environmentDataStream
      .toTable()

  // Now, try to enrich the user's data with the environment data
  // the KTable-KTable FK join is correctly configured, but...
  val enrichedUsersData: KTable<ServerId, AggregatedUserData> =
    userDataTable.join(
      other = environmentDataTable,
      tableJoined = tableJoined("enrich-user-data.join"),
      materialized = materializedAs(
        "enriched-user-data.store",
        jsonMapper.serde(),
        jsonMapper.serde(),
      ),
      foreignKeyExtractor = { usersData: AggregatedUserData -> usersData.realmId },
    ) { usersData: AggregatedUserData, environmentData: EnvironmentMetaData ->
      usersData.enrichUserData(environmentData)
      // this join is never called!!
      return@join usersData
    }
}

If I manually set the partition size to 1, then it works:

fun enrichUsersData(
  userDataStream: KStream<UserId, UserData>,
  environmentDataStream: KStream<RealmId, EnvironmentMetaData>,
) {

  // manually set the partition size to 1 *before* creating the table
  val userDataTable: KTable<ServerId, AggregatedUserData> =
    userDataStream
      .repartition(
        repartitionedAs(
          "user-data.pre-table-repartition",
          jsonMapper.serde(),
          jsonMapper.serde(),
          numberOfPartitions = 1,
        )
      )
      .groupBy { _: UserId, userData: UserData -> userData.serverId }
      .aggregate({ AggregatedUserData() }) { serverId: ServerId, userData: UserData, usersAggregate: AggregatedUserData ->
        usersAggregate
          .addUserData(userData)
          .setServerId(serverId)
        return@aggregate usersAggregate
      }

  // again, manually set the partition size to 1 *before* creating the table
  val environmentDataTable: KTable<RealmId, EnvironmentMetaData> =
    environmentDataStream
      .repartition(
        repartitionedAs(
          "environment-metadata.pre-table-repartition",
          jsonMapper.serde(),
          jsonMapper.serde(),
          numberOfPartitions = 1,
        )
      )
      .toTable()

  // this join now works as expected!
  val enrichedUsersData: KTable<ServerId, AggregatedUserData> =
    userDataTable.join(
      other = environmentDataTable,
      tableJoined = tableJoined("enrich-user-data.join"),
      materialized = materializedAs(
        "enriched-user-data.store",
        jsonMapper.serde(),
        jsonMapper.serde(),
      ),
      foreignKeyExtractor = { usersData: AggregatedUserData -> usersData.realmId },
    ) { usersData: AggregatedUserData, environmentData: EnvironmentMetaData ->
      usersData.enrichUserData(environmentData)
      return@join usersData
    }
}

Selecting the key on the joined topic might help. The partition configuration of both topics should be the same.

return (topicOne, topicTwo) ->
        topicOne
            .leftJoin(topicTwo,
                value -> MyOtherKey.newBuilder()
                    .setFieldA(value.getFieldA())
                    .setFieldB(value.getFieldB())
                    .build(),
                this::enrich)
            .toStream().selectKey((key, value) -> key);

This is a strange issue; I have never heard of the number of output topic partitions controlling the data write frequency. However, I do know that toStream() writes the data downstream only when the cache is full, so try setting cache.max.bytes.buffering = 0. Also, a KTable keeps only the latest record for each key, so if you have multiple values for the same key, only the latest value will stay and be written downstream.
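
For reference, a sketch of setting that config in a plain Kafka Streams Properties object (in the Spring Cloud Stream setup above, the same keys would go under the binder's configuration block):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

// Sketch: disable record caching so every KTable update is forwarded
// downstream immediately rather than buffered until a cache flush or commit.
Properties props = new Properties();
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000); // matches the question's commit.interval.ms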
