
Flink KeyBy Performance

I'm performance benchmarking my Flink application that reads data from Kafka, transforms it and dumps it into another Kafka topic.我正在对我的 Flink 应用程序进行性能基准测试,该应用程序从 Kafka 读取数据,对其进行转换并将其转储到另一个 Kafka 主题中。 I need to keep the context so messages with same order-id are not treated as brand new orders.我需要保留上下文,这样具有相同订单 ID 的消息不会被视为全新订单。 I'm extending RichFlatMapFunction class with ValueState to achieve that.我正在使用 ValueState 扩展 RichFlatMapFunction class 来实现这一点。 As I understand, I'll need to use KeyStream before I can call flatMap:据我了解,在调用 flatMap 之前我需要使用 KeyStream:

env.addSource(source()).keyBy(Order::getId).flatMap(new OrderMapper()).addSink(sink());
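For context, here is a minimal sketch of the dedup logic an OrderMapper like the one described might implement. This is illustrative only: in Flink the per-key state would be a ValueState (which keyBy scopes to the current key automatically), so a plain HashMap stands in for it here just to make the core logic visible outside a Flink runtime.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in: in the real OrderMapper (a RichFlatMapFunction),
// `seen` would be a ValueState<Boolean> obtained from getRuntimeContext(),
// and Flink would scope it per key after keyBy(Order::getId).
public class OrderDedup {
    private final Map<String, Boolean> seen = new HashMap<>();

    /** Returns true only the first time an order id is observed. */
    public boolean isNewOrder(String orderId) {
        // ValueState.value() == null plays the same role as a missing map entry
        return seen.putIfAbsent(orderId, Boolean.TRUE) == null;
    }
}
```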

The problem is that keyBy takes a very long time from my perspective (80 to 200 ms). I say keyBy is the cause because if I remove keyBy and replace flatMap with a map function, the 90th-percentile latency is about 1 ms. Is there a way to use state/context without keyBy, or to somehow make keyBy fast?

The keyBy is expensive because it requires a network shuffle -- every record is serialized, sent to the downstream instance responsible for that key, and then deserialized.
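To make the shuffle concrete, here is a simplified sketch of how keyBy decides which parallel subtask must receive a record. Flink's actual implementation (KeyGroupRangeAssignment) applies a murmur hash on top of key.hashCode(), but the key-group structure below is the same: every record's key determines a key group, every key group maps to one subtask, so records generally have to cross the network to reach it.

```java
// Simplified sketch of keyBy routing. Real Flink additionally murmur-hashes
// key.hashCode() before the modulo; the two-step key-group scheme is the same.
public class KeyRouting {
    /** Which key group a key belongs to (0 .. maxParallelism-1). */
    public static int keyGroupFor(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    /** Which parallel subtask owns a given key group. */
    public static int operatorIndexFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    /** End-to-end: the subtask that all records with this key are shuffled to. */
    public static int targetSubtask(Object key, int maxParallelism, int parallelism) {
        return operatorIndexFor(keyGroupFor(key, maxParallelism), maxParallelism, parallelism);
    }
}
```

The upside of this determinism is exactly what the question needs: all messages for one order id land on the same subtask, so its ValueState sees every message for that order.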

For the pipeline you've described, this is unavoidable. But your choice of serializer can make a big difference.
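One concrete lever: if Order satisfies Flink's POJO rules (a public class with a public no-arg constructor, whose fields are public or exposed via getters/setters), Flink serializes it with its fast built-in POJO serializer instead of falling back to the much slower Kryo path. A hedged sketch, with a hypothetical Order shape and a rough reflective check that mirrors two of those rules (this is not Flink's actual analyzer):

```java
import java.lang.reflect.Modifier;

// Hypothetical Order shape that would qualify for Flink's POJO serializer.
public class Order {
    public String id;    // public field (or provide a getter/setter pair)
    public long amount;

    public Order() {}    // public no-arg constructor required by the POJO rules

    /** Rough, illustrative check for two of the POJO rules via reflection. */
    public static boolean looksLikePojo(Class<?> c) {
        try {
            return Modifier.isPublic(c.getModifiers())
                && Modifier.isPublic(c.getDeclaredConstructor().getModifiers());
        } catch (NoSuchMethodException e) {
            return false; // no no-arg constructor -> Kryo fallback
        }
    }
}
```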

For more ideas about how to reduce latency, see Flink optimal configuration for minimum latency.

As for the choice of key, if you need to deduplicate by orderId, then you'll have to key by the orderId.
