What are the best practices to improve Kafka Streams?

Question

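I am using Kafka Streams to produce data from one topic, A, to another, B, but it is very slow. Topic A holds about 130 million records.

We filter messages with specific dates and produce them to topic B. Is there a way to speed this up?

Here is the configuration I am using: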

    streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "test");

    // Where to find Kafka broker(s).
    streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    // Where to find the schema registry instance(s).
    streamsConfiguration.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryUrl);
    // streamsConfiguration.put(StreamsConfig.APPLICATION_SERVER_CONFIG, "localhost:" + port);
    // streamsConfiguration.put(StreamsConfig.APPLICATION_SERVER_CONFIG, "localhost:8088");
    streamsConfiguration.put(StreamsConfig.RETRIES_CONFIG, 10);
    streamsConfiguration.put(StreamsConfig.RETRY_BACKOFF_MS_CONFIG, (10 * 1000L));
    streamsConfiguration.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG, DefaultBugsnagExceptionHandler.getInstance().getClass());

    // streamsConfiguration.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG, LogAndContinueExceptionHandler);

    // Specify (de)serializers for record keys and for record values.
    streamsConfiguration.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    streamsConfiguration.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, SpecificAvroSerde.class);

    streamsConfiguration.put(StreamsConfig.STATE_DIR_CONFIG, stateDir);
    streamsConfiguration.put(StreamsConfig.producerPrefix(ProducerConfig.ACKS_CONFIG), "all");
    streamsConfiguration.put(StreamsConfig.producerPrefix(ProducerConfig.LINGER_MS_CONFIG), "10000");

    streamsConfiguration.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    // Commit interval in milliseconds (500 ms here).
    // Messages will be forwarded either when the cache is full or when the commit interval is reached.
    streamsConfiguration.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 500);
    streamsConfiguration.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
    streamsConfiguration.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, true);

    StreamsConfig config = new StreamsConfig(streamsConfiguration);

    StreamsBuilder builder = new StreamsBuilder();

    String start_date = "2018-05-10";
    String end_date = "2018-05-16";
    // DateFormat format = new SimpleDateFormat("yyyy-MM-dd");
    // LocalDate dateTime;
    // builder.stream("topicA").to("topicB");
    KStream<String, avroschems> source = builder.stream("topicA");
    source
        // Keep only records whose day lies strictly between start_date and end_date.
        .filter((k, value) ->
            LocalDate.parse(value.getDay()).isAfter(LocalDate.parse(start_date))
                && LocalDate.parse(value.getDay()).isBefore(LocalDate.parse(end_date)))
        .to("bugSnagIntegration_mobileCrashError_filtered");
    System.out.println("Starting Kafka Stream");
    return new KafkaStreams(builder.build(), config);

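I am trying to copy the messages that fall within a date range to topicB. I am not sure whether this is what causes the slowness.

How can I achieve concurrency?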

Answer 1

"Extremely slow" is not a very specific term. You should share some concrete throughput numbers.
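One way to get such a number, sketched here under assumptions: count records as they pass through the topology and divide by the elapsed time. `ThroughputMeter` is a hypothetical helper, not part of Kafka Streams. It takes timestamps as arguments so that the arithmetic is easy to verify.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper (not a Kafka Streams class): counts records and
// reports an average rate since a given start time.
final class ThroughputMeter {
    private final AtomicLong count = new AtomicLong();
    private final long startNanos;

    // startNanos would typically be System.nanoTime() taken at startup.
    ThroughputMeter(long startNanos) {
        this.startNanos = startNanos;
    }

    // Call once per processed record; safe to call from several stream threads.
    void record() {
        count.incrementAndGet();
    }

    long total() {
        return count.get();
    }

    // Average records per second between startNanos and nowNanos.
    double recordsPerSecond(long nowNanos) {
        long elapsedNanos = nowNanos - startNanos;
        if (elapsedNanos <= 0) {
            return 0.0; // no time has passed yet, so no meaningful rate
        }
        return count.get() * 1_000_000_000.0 / elapsedNanos;
    }
}
```

In the topology from the question this could be called from `peek`, for example `source.peek((k, v) -> meter.record())`, with `meter.recordsPerSecond(System.nanoTime())` logged periodically.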

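Regarding multi-threading: increasing StreamsConfig.NUM_STREAM_THREADS_CONFIG is the right approach. However, this only helps if the CPU is the bottleneck. If the network is the bottleneck, you need to start multiple application instances on different machines (i.e., deploy the same application multiple times). In that case, all instances form a consumer group and share the load. I recommend reading the docs for more details: https://docs.confluent.io/current/streams/architecture.html#parallelism-model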
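A minimal sketch of setting the thread count, assuming plain string keys so that it compiles without the Kafka libraries: `"num.stream.threads"` is the key behind `StreamsConfig.NUM_STREAM_THREADS_CONFIG`. The `usefulThreads` helper is hypothetical. It reflects the parallelism model in the linked docs, where a topology like this one gets one task per input partition, so threads beyond the partition count would sit idle.

```java
import java.util.Properties;

// Illustrative only: class and method names are assumptions, not Kafka APIs.
final class StreamThreadsSketch {
    // Key behind StreamsConfig.NUM_STREAM_THREADS_CONFIG.
    static final String NUM_STREAM_THREADS = "num.stream.threads";

    // Cap the thread count by both the input partition count and the
    // available cores; always return at least one thread.
    static int usefulThreads(int inputPartitions, int availableCores) {
        return Math.max(1, Math.min(inputPartitions, availableCores));
    }

    // Return a copy of the base configuration with the thread count set.
    static Properties withThreads(Properties base, int threads) {
        Properties copy = new Properties();
        copy.putAll(base);
        copy.put(NUM_STREAM_THREADS, threads);
        return copy;
    }
}
```

With the question's setup this would be roughly `streamsConfiguration.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, threads)`. Across several machines the same cap applies to the total number of threads over all instances.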

Furthermore, you can also configure the consumer and producer clients that are used internally. This may also help to increase throughput. cf. https://docs.confluent.io/current/streams/developer-guide/config-streams.html#kafka-consumers-producer-and-admin-client-configuration-parameters
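As a rough illustration of what such client overrides look like: Kafka Streams passes settings prefixed with `producer.` or `consumer.` to the respective internal client. The two helpers below are local stand-ins that mirror `StreamsConfig.producerPrefix` and `StreamsConfig.consumerPrefix`, so the sketch compiles without the Kafka libraries. The values are assumptions chosen for illustration, not tuned recommendations. Any real change should be checked against measured throughput.

```java
import java.util.Properties;

// Illustrative only: class name and values are assumptions.
final class ClientTuningSketch {
    // Local stand-ins for StreamsConfig.producerPrefix / consumerPrefix.
    static String producerPrefix(String key) {
        return "producer." + key;
    }

    static String consumerPrefix(String key) {
        return "consumer." + key;
    }

    static Properties clientOverrides() {
        Properties p = new Properties();
        // Producer side: larger, compressed batches with a short linger.
        p.put(producerPrefix("linger.ms"), "100");
        p.put(producerPrefix("batch.size"), "65536");
        p.put(producerPrefix("compression.type"), "lz4");
        // Consumer side: fetch and hand over more records per poll.
        p.put(consumerPrefix("max.poll.records"), "1000");
        p.put(consumerPrefix("fetch.min.bytes"), "65536");
        return p;
    }
}
```

These entries would be merged into `streamsConfiguration` the same way the question already sets `acks` and `linger.ms` through `StreamsConfig.producerPrefix(...)`.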

Solution 1
0 2018-05-16 16:49:58
