
Storm KafkaSpout failed tuples duplicated

I am using storm-kafka-1.1.1-plus with Storm 1.1.1. My topology has one KafkaSpout and two bolts, bolt-A and bolt-B, both extending BaseRichBolt. Tuples are anchored in bolt-A; once bolt-B acks a tuple, it is considered successfully processed and its offset is committed. The problem is that, for some reason, some failed messages get duplicated by the KafkaSpout.

For Example

The KafkaSpout emitted 1000 tuples, and while they were being processed, about 20 of them failed (at bolt-B). Those 20 tuples were replayed continuously. At some point the worker was killed and the supervisor restarted it; the 20 tuples were replayed again and this time processed successfully, but they were processed multiple times (duplicated).


But I want those tuples to be processed only once (successfully). I have set topology.enable.message.timeouts to false. My other question is: where does Storm store the details of those failed Kafka offsets? I didn't find them in ZooKeeper; it only has the following:

{"topology":{"id":"test_Topology-12-1508938595","name":"test_Topology"},"offset":505,"partition":2,"broker":{"host":"127.0.0.1","port":9092},"topic":"test_topic_1"}

Disabling message timeouts can cause message loss; if you need all messages to be processed, you may want to reconsider disabling them.

Storm provides an at-least-once processing guarantee when acking is enabled. You might want to look at whether you can make your bolts idempotent, so replays don't cause problems. Alternatively, you can look at https://storm.apache.org/releases/1.1.1/Trident-tutorial.html , which offers exactly-once state updates.
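To illustrate what "idempotent" means here, a minimal sketch (not Storm API; the class and method names are made up for illustration): keep a record of which message IDs have already been applied, so a replayed tuple changes state at most once.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an idempotent update step. A replayed message
// (same messageId) is detected and skipped, so the total is only
// incremented once per unique message even under at-least-once delivery.
public class IdempotentCounter {
    private final Map<String, Integer> applied = new HashMap<>();
    private int total = 0;

    // Returns true only the first time a given messageId is applied.
    public boolean process(String messageId, int amount) {
        if (applied.containsKey(messageId)) {
            return false; // replayed tuple: skip the state update
        }
        applied.put(messageId, amount);
        total += amount;
        return true;
    }

    public int total() {
        return total;
    }
}
```

In a real bolt you would key the deduplication on something unique in the tuple (e.g. the Kafka topic/partition/offset) and keep the "applied" record in the same store as the state itself, so the check and the update are atomic.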

Edit: You might need to rethink your problem. As far as I'm aware no stream processing system offers exactly-once processing in the sense it sounds like you want.

The exactly-once semantics offered by Trident mean that Trident helps you make state updates idempotent, so from the point of view of your data store it "looks like" each message was processed only once. Processing itself is still at-least-once. See the section "Transactional spouts" (and probably the rest of the page) at https://storm.apache.org/releases/2.0.0-SNAPSHOT/Trident-state.html for the intuition behind how this works. The basic idea is to store, in the data store, information about which messages have already been written, so that if they are repeated, the state-update code can ignore them.
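The transactional-state idea above can be sketched as follows (a simplified illustration of the technique, not Trident's actual API; the class and method names are assumptions): the store keeps the last transaction ID next to each value, so a replayed batch with the same txid is detected and its update is skipped.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of Trident-style transactional state: each key's
// value is stored together with the txid of the batch that last updated
// it. Replaying a batch (same txid) leaves the value unchanged.
public class TransactionalStore {
    private static final class Entry {
        final long txid;
        final long value;
        Entry(long txid, long value) {
            this.txid = txid;
            this.value = value;
        }
    }

    private final Map<String, Entry> store = new HashMap<>();

    // Applies the increment only if this txid has not already been
    // applied for the key.
    public void increment(String key, long txid, long delta) {
        Entry current = store.get(key);
        if (current != null && current.txid == txid) {
            return; // replayed batch: update was already applied
        }
        long base = (current == null) ? 0 : current.value;
        store.put(key, new Entry(txid, base + delta));
    }

    public long get(String key) {
        Entry e = store.get(key);
        return (e == null) ? 0 : e.value;
    }
}
```

This only works if batches are replayed with the same txid and the same contents, which is exactly the guarantee a transactional spout provides.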

You might also want to read https://streaml.io/blog/exactly-once . I want to say that Flink implements something like the distributed snapshot algorithm described there, which is a different way to simulate exactly-once in an at-least-once system.
