
What are the best practices to improve Kafka Streams?

I am producing data from one topic A to another topic B using Kafka Streams, but it is extremely slow. Topic A has ~130M records.

We are filtering messages by a specific date range and producing them to topic B. Is there a way to speed this up?

Below are the configs I am using:

    streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "test");

    // Where to find Kafka broker(s).
    streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    // Where to find the schema registry instance(s).
    streamsConfiguration.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryUrl);
    // streamsConfiguration.put(StreamsConfig.APPLICATION_SERVER_CONFIG, "localhost:" + port);
    // streamsConfiguration.put(StreamsConfig.APPLICATION_SERVER_CONFIG, "localhost:8088");
    streamsConfiguration.put(StreamsConfig.RETRIES_CONFIG, 10);
    streamsConfiguration.put(StreamsConfig.RETRY_BACKOFF_MS_CONFIG, (10 * 1000L));
    streamsConfiguration.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG, DefaultBugsnagExceptionHandler.getInstance().getClass());
    // streamsConfiguration.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG, LogAndContinueExceptionHandler.class);

    // Specify (de)serializers for record keys and for record values.
    streamsConfiguration.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    streamsConfiguration.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, SpecificAvroSerde.class);

    streamsConfiguration.put(StreamsConfig.STATE_DIR_CONFIG, stateDir);
    streamsConfiguration.put(StreamsConfig.producerPrefix(ProducerConfig.ACKS_CONFIG), "all");
    streamsConfiguration.put(StreamsConfig.producerPrefix(ProducerConfig.LINGER_MS_CONFIG), "10000");

    streamsConfiguration.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    // Commit every 500 ms; records are forwarded either when the cache is full
    // or when the commit interval is reached (the cache is disabled below).
    streamsConfiguration.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 500);
    streamsConfiguration.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
    streamsConfiguration.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, true);

    StreamsConfig config = new StreamsConfig(streamsConfiguration);

    StreamsBuilder builder = new StreamsBuilder();

    String start_date = "2018-05-10";
    String end_date = "2018-05-16";
    // DateFormat format = new SimpleDateFormat("yyyy-MM-dd");
    // LocalDate dateTime;
    // builder.stream("topicA").to("topicB");
    KStream<String, avroschems> source = builder.stream("topicA");
    source
        .filter((k, value) -> LocalDate.parse(value.getDay()).isAfter(LocalDate.parse(start_date))
                && LocalDate.parse(value.getDay()).isBefore(LocalDate.parse(end_date)))
        .to("bugSnagIntegration_mobileCrashError_filtered");
    System.out.println("Starting Kafka Stream");
    return new KafkaStreams(builder.build(), config);

I am trying to copy the messages that fall within a certain date range to topicB. I am not sure whether the filtering itself is causing the slowness.

How to achieve concurrency?

"Extremely slow" is not a very specific term. You should share some concrete throughput numbers.

About multi-threading: increasing StreamsConfig.NUM_STREAM_THREADS_CONFIG is correct. However, this only helps if CPU is the bottleneck. If network is the bottleneck, you need to start multiple application instances on different machines (i.e., deploy the exact same application multiple times); in this case, all instances will also form a consumer group and share the load. I would recommend reading the docs for more details: https://docs.confluent.io/current/streams/architecture.html#parallelism-model
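For example, a minimal sketch of bumping the thread count (the value 4 is only an illustrative assumption; the useful upper bound is the number of partitions of the input topic):

    // Run multiple processing threads inside one Streams instance.
    // Effective parallelism is still capped by the partition count of topicA.
    streamsConfiguration.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);

Starting a second copy of the same application (same application.id) on another machine works similarly: the instances join the same consumer group and split the input partitions among themselves.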

Additionally, you can configure the internally used consumer and producer clients, which might also help to increase throughput. Cf. https://docs.confluent.io/current/streams/developer-guide/config-streams.html#kafka-consumers-producer-and-admin-client-configuration-parameters
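A sketch of what such client tuning could look like; the specific parameters and values below are assumptions for illustration, not measured recommendations:

    // Producer side: larger batches plus compression typically increase throughput.
    streamsConfiguration.put(StreamsConfig.producerPrefix(ProducerConfig.BATCH_SIZE_CONFIG), 262144);
    streamsConfiguration.put(StreamsConfig.producerPrefix(ProducerConfig.COMPRESSION_TYPE_CONFIG), "lz4");
    // Consumer side: fetch more data per request/poll.
    streamsConfiguration.put(StreamsConfig.consumerPrefix(ConsumerConfig.FETCH_MIN_BYTES_CONFIG), 1048576);
    streamsConfiguration.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_RECORDS_CONFIG), 2000);

Measure throughput before and after each change, since the best values depend on record size, network, and broker configuration.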
