
kafka -> flink - performance issues

I'm looking at some Kafka topics that generate ~30K messages/second. I have a Flink topology set up to read one of these, aggregate a bit (5-second window), and then (eventually) write to a DB.

When I run my topology and remove everything but the read -> aggregate steps I can only get ~30K messages per minute. There isn't anywhere for backpressure to occur.

What am I doing wrong?


Edit:

  1. I can't change anything about the topic space. Each topic has a single partition and there are hundreds of them (one implication of this is sketched after this list).
  2. Each message is a compressed Thrift object averaging 2-3 KB.
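One implication of the single-partition layout: a source cannot read one topic with more than one consumer, but several single-partition topics can be read by separate sources and unioned. An illustrative fragment, with env and props as in the job setup and example topic names (newer versions of the connector also accept a list of topics):

// Each source reads one single-partition topic at full partition speed;
// union merges them into one stream for downstream operators.
DataStream<byte[]> s1 = env.addSource(new FlinkKafkaConsumer081<>("data_1", new RawSchema(), props));
DataStream<byte[]> s2 = env.addSource(new FlinkKafkaConsumer081<>("data_2", new RawSchema(), props));
DataStream<byte[]> all = s1.union(s2);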

It appears that I'm only able to get ~1.5 MB/s. Not very close to the 100 MB/s mentioned.
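That figure is consistent with the message sizes above:

30,000 msgs/min × 2-3 KB/msg ≈ 60-90 MB/min ≈ 1.0-1.5 MB/s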

The current code path:

// single source task reading "data_4", fanned out to four mapper instances
DataStream<byte[]> dataStream4 = env.addSource(new FlinkKafkaConsumer081<>("data_4", new RawSchema(), parameterTool.getProperties())).setParallelism(1);
DataStream<Tuple4<Long, Long, Integer, String>> ds4 = dataStream4.rebalance().flatMap(new mapper2("data_4")).setParallelism(4);

public class mapper2 implements FlatMapFunction<byte[], Tuple4<Long, Long, Integer, String>> {
    private String mapId;

    public mapper2(String mapId) {
        this.mapId = mapId;
    }

    @Override
    public void flatMap(byte[] bytes, Collector<Tuple4<Long, Long, Integer, String>> collector) throws Exception {
        // decode the raw Kafka payload into a Thrift TimeData object
        TimeData timeData = (TimeData) ts_thriftDecoder.fromBytes(bytes);
        Tuple4<Long, Long, Integer, String> tuple4 = new Tuple4<>();
        tuple4.f0 = timeData.getId();
        tuple4.f1 = timeData.getOtherId();
        tuple4.f2 = timeData.getSections().size();
        tuple4.f3 = mapId;

        collector.collect(tuple4);
    }
}
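For reference, the 5-second window and DB write that were stripped for this test would attach to ds4 roughly as follows. This is an illustrative sketch only: the key field, the aggregate, and DbSink are placeholders, not the actual job.

// Sketch only: key, aggregate, and sink are assumptions.
// Time is org.apache.flink.streaming.api.windowing.time.Time.
ds4.keyBy(0)                       // hypothetical key: the id field
   .timeWindow(Time.seconds(5))    // the 5-second window mentioned above
   .sum(2)                         // placeholder aggregation
   .addSink(new DbSink());         // hypothetical DB sink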

From the code, I see two potential components which could cause the performance issues:

  • The FlinkKafkaConsumer
  • The Thrift deserializer (a sketch for timing this step in isolation appears at the end of this answer)

In order to understand where the bottleneck is, I would first measure the raw read performance of Flink reading from the Kafka topic.

Therefore, can you run the following code on your cluster?

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer081;
import org.apache.flink.util.Collector;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// RawSchema is the byte[] deserialization schema from the question.
public class RawKafka {

    private static final Logger LOG = LoggerFactory.getLogger(RawKafka.class);

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        ParameterTool parameterTool = ParameterTool.fromArgs(args);
        DataStream<byte[]> dataStream4 = env.addSource(
                new FlinkKafkaConsumer081<>("data_4", new RawSchema(), parameterTool.getProperties()))
                .setParallelism(1);

        dataStream4.flatMap(new FlatMapFunction<byte[], Integer>() {
            long received = 0;
            long logfreq = 50000;   // log every 50k elements
            long lastLog = -1;
            long lastElements = 0;

            @Override
            public void flatMap(byte[] element, Collector<Integer> collector) throws Exception {
                received++;
                if (received % logfreq == 0) {
                    long now = System.currentTimeMillis();

                    if (lastLog == -1) {
                        // first batch: just initialize the counters
                        lastLog = now;
                        lastElements = received;
                    } else {
                        // throughput for the last "logfreq" elements
                        long timeDiff = now - lastLog;
                        long elementDiff = received - lastElements;
                        double ex = (1000 / (double) timeDiff);
                        // "2500" assumes an average message size of ~2.5 KB
                        LOG.info("During the last {} ms, we received {} elements. That's {} elements/second/core. GB received {}",
                                timeDiff, elementDiff, elementDiff * ex, (received * 2500) / 1024 / 1024 / 1024);
                        // re-initialize for the next measurement window
                        lastLog = now;
                        lastElements = received;
                    }
                }
            }
        });

        env.execute("Raw kafka throughput");
    }
}
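To run it, package the class into a jar and submit it, passing the Kafka properties as --key value arguments (ParameterTool.fromArgs parses them into the Properties handed to the consumer). The exact properties depend on your setup; the 0.8 connector generally expects bootstrap.servers, zookeeper.connect, and group.id, so the invocation would look something like:

bin/flink run -c RawKafka raw-kafka-test.jar --bootstrap.servers broker:9092 --zookeeper.connect zk:2181 --group.id test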

This code measures the time between every 50k elements read from Kafka and logs the number of elements read so far. On my local machine I got a throughput of ~330k elements/core/second:

16:09:34,028 INFO  RawKafka                                                      - During the last 88 ms, we received 30000 elements. That's 340909.0909090909 elements/second/core. GB received 0
16:09:34,028 INFO  RawKafka                                                      - During the last 86 ms, we received 30000 elements. That's 348837.20930232556 elements/second/core. GB received 0
16:09:34,028 INFO  RawKafka                                                      - During the last 85 ms, we received 30000 elements. That's 352941.17647058825 elements/second/core. GB received 0
16:09:34,028 INFO  RawKafka                                                      - During the last 88 ms, we received 30000 elements. That's 340909.0909090909 elements/second/core. GB received 0
16:09:34,030 INFO  RawKafka                                                      - During the last 90 ms, we received 30000 elements. That's 333333.3333333333 elements/second/core. GB received 0
16:09:34,030 INFO  RawKafka                                                      - During the last 91 ms, we received 30000 elements. That's 329670.3296703297 elements/second/core. GB received 0
16:09:34,030 INFO  RawKafka                                                      - During the last 85 ms, we received 30000 elements. That's 352941.17647058825 elements/second/core. GB received 0

I'm really interested to see what throughput you are achieving reading from Kafka.
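If the raw read throughput turns out to be high, the remaining suspect is the Thrift deserializer. A quick way to check it in isolation is to time ts_thriftDecoder.fromBytes on a sample of raw payloads outside Flink. This is only a sketch: the harness and the loadSamplePayloads helper are hypothetical, while ts_thriftDecoder and TimeData come from your code.

import java.util.List;

public class DecodeBench {
    public static void main(String[] args) throws Exception {
        // Hypothetical helper: returns a sample of raw message byte arrays
        // dumped from the topic beforehand.
        List<byte[]> payloads = loadSamplePayloads();

        long decoded = 0;
        long sections = 0;
        long start = System.nanoTime();
        for (int round = 0; round < 100; round++) {
            for (byte[] bytes : payloads) {
                TimeData timeData = (TimeData) ts_thriftDecoder.fromBytes(bytes);
                sections += timeData.getSections().size(); // use the result so it isn't optimized away
                decoded++;
            }
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("decoded %d records in %.2f s (%.0f records/s, %d sections)%n",
                decoded, seconds, decoded / seconds, sections);
    }
}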

I've never used Flink or its KafkaConsumer, but I have experience with Kafka in a Storm environment. Here are some thoughts that I have. There are a lot of variables at play in how Kafka speed is determined. Here are some things to think about and investigate; add more details to your question when you have them.

  • Adding more partitions should increase your throughput. So yes, adding more partitions and consumers should give a roughly linear jump in performance.
  • Kafka throughput is relative to message size, so if you have big messages the throughput will suffer accordingly.
  • Do you have any evidence to support your expectation that the Kafka consumer should be faster? While I would agree that 30K msg/min is really slow, do you have evidence to back up your expectation? Like a general speed test using the FlinkKafkaConsumer (something like this), or using the plain Kafka consumer to see what the consumption speed is and then comparing that to Flink's consumer? A sketch of such a plain-consumer test follows this list.
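On that last point, a stand-alone consumer throughput test could look roughly like the sketch below. Note it uses the newer org.apache.kafka.clients.consumer.KafkaConsumer API rather than the 0.8-era consumer this question targets, and the broker address, group id, and topic name are assumptions, so treat it as a baseline sketch only.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class PlainConsumerBench {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092"); // assumption: point at your cluster
        props.put("group.id", "plain-bench");
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("data_4"));
            long count = 0;
            long bytes = 0;
            long lastLog = System.currentTimeMillis();
            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    count++;
                    bytes += record.value().length;
                    if (count % 50_000 == 0) {
                        long now = System.currentTimeMillis();
                        System.out.printf("last 50k records took %d ms; %.1f MB consumed so far%n",
                                now - lastLog, bytes / 1024.0 / 1024.0);
                        lastLog = now;
                    }
                }
            }
        }
    }
}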

There could be a lot of reasons why it's consuming slowly; I've tried to highlight some of the general Kafka-related stuff. I'm sure there are things you can do in Flink to speed up consuming that I don't know about, because I've never used it.
