Below is a description of the issue I face in Storm. A heads up: I know it is a long post, and any comment or indication is more than welcome. Here goes the description:
I have installed Storm 0.9.4 and ZooKeeper 3.4.6 on a single server (2 sockets with Intel Xeon 8-core chips, 96 GB RAM, running CentOS) and I have set up a pseudo-distributed, single-node Storm runtime. My configuration consists of 1 ZooKeeper server, 1 Nimbus process, 1 Supervisor process, and 1 worker process (when topologies are submitted), all running on the same machine. The purpose of my experiment is to see Storm's behavior in a single-node setting when input load is dynamically distributed among executor threads.
For the purpose of my experiment, I have input tuples that consist of 1 long and 1 integer value. The input data come from two spouts that read tuples from disk files, and I control the input rates to follow this pattern:

- 200 tuples/sec for the first 24 seconds (time 0 - 24 seconds)
- 800 tuples/sec for the next 12 seconds (time 24 - 36 seconds)
- 200 tuples/sec for 6 more seconds (time 36 - 42 seconds)

Turning to my topology, I have two types of bolts: (a) a Dispatcher bolt that receives input from the two spouts, and (b) a Consumer bolt that performs an operation on the tuples and maintains some tuples as state. The parallelism hint for the Dispatcher is one (1 executor/thread), since I have verified that it never reaches even 10% of its capacity. For the Consumer bolt I have a parallelism hint of two (2 executors/threads). The input rates mentioned above are picked so that I observe end-to-end latency below 10 msec with the appropriate number of executors on the Consumer bolt. In detail, I have run the same topology with one Consumer executor, and it can handle an input rate of 200 tuples/sec with end-to-end latency < 10 msec. Similarly, if I add one more Consumer executor (2 executors in total), the topology can consume 800 tuples/sec with < 10 msec end-to-end latency. For reference, if I use 1 Consumer executor for 800 tuples/sec, the end-to-end latency reaches up to 2 seconds. Finally, I should mention that I measure end-to-end latency using the ack() callback of my spouts: I record how much time passes between emitting a tuple into the topology and its tuple tree being fully acknowledged.
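The latency measurement described above can be sketched as follows. This is a minimal, self-contained illustration (LatencyTracker is a hypothetical helper, not Storm API) of recording the emit time per message id and computing end-to-end latency when the tuple tree is fully acked:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: record the emit time per message id, and compute
// end-to-end latency when Storm reports the tuple tree fully acked
// (i.e. when the spout's ack(msgId) is invoked).
public class LatencyTracker {
    private final Map<Object, Long> emitTimesNanos = new ConcurrentHashMap<>();

    // Call right before emitting a tuple with the given message id.
    public void onEmit(Object msgId) {
        emitTimesNanos.put(msgId, System.nanoTime());
    }

    // Call from the spout's ack(msgId); returns latency in milliseconds,
    // or -1.0 for an unknown id (e.g. already failed or timed out).
    public double onAck(Object msgId) {
        Long start = emitTimesNanos.remove(msgId);
        if (start == null) {
            return -1.0;
        }
        return (System.nanoTime() - start) / 1_000_000.0;
    }

    public static void main(String[] args) throws InterruptedException {
        LatencyTracker tracker = new LatencyTracker();
        tracker.onEmit("tuple-1");
        Thread.sleep(5); // stand-in for processing time inside the topology
        System.out.printf("end-to-end latency: %.2f msec%n", tracker.onAck("tuple-1"));
    }
}
```

In a real spout, onEmit() would be called just before collector.emit(values, msgId), and onAck() from the spout's ack(Object msgId).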
As you realize by now, the goal is to see if I can maintain end-to-end latency < 10 msec during the input spike by simulating the addition of another Consumer executor. In order to simulate the addition of processing resources for the input spike, I use direct grouping, and before the spike I send tuples to only one of the two Consumer executors. When the spike is detected at the Dispatcher, it starts sending tuples to the other Consumer as well, so that the input load is balanced between the two threads. Hence, I expect that when I start sending tuples to the additional Consumer thread, the end-to-end latency will drop back to its acceptable value. However, this does not happen.
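The routing policy described above can be sketched as plain Java. SpikeRouter and all its names are illustrative, not Storm API; in a real Dispatcher bolt the chosen task id would be passed to collector.emitDirect(...) on a stream consumed with directGrouping:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the Dispatcher's routing decision: below the spike
// threshold, every tuple goes to a single Consumer task; once the observed
// input rate crosses the threshold, tuples are round-robined across both
// Consumer tasks to balance the load.
public class SpikeRouter {
    private final List<Integer> consumerTaskIds;   // task ids of the Consumer executors
    private final int spikeThresholdTuplesPerSec;  // e.g. 200 tuples/sec
    private int rrIndex = 0;

    public SpikeRouter(List<Integer> consumerTaskIds, int spikeThresholdTuplesPerSec) {
        this.consumerTaskIds = consumerTaskIds;
        this.spikeThresholdTuplesPerSec = spikeThresholdTuplesPerSec;
    }

    // Pick the target task id for the next tuple, given the current input rate.
    public int chooseTask(int observedTuplesPerSec) {
        if (observedTuplesPerSec <= spikeThresholdTuplesPerSec) {
            return consumerTaskIds.get(0);           // steady state: one consumer
        }
        rrIndex = (rrIndex + 1) % consumerTaskIds.size();
        return consumerTaskIds.get(rrIndex);         // spike: balance across consumers
    }

    public static void main(String[] args) {
        SpikeRouter router = new SpikeRouter(Arrays.asList(7, 8), 200);
        System.out.println(router.chooseTask(200)); // 7: at threshold, single consumer
        System.out.println(router.chooseTask(800)); // 8: spike, round-robin starts
        System.out.println(router.chooseTask(800)); // 7: alternates back
    }
}
```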
In order to verify my hypothesis that two Consumer executors are able to maintain < 10 msec latency during a spike, I execute the same experiment, but this time I send tuples to both executors (threads) for the whole lifetime of the experiment. In this case, the end-to-end latency remains stable and at acceptable levels. So, I do not know what really happens in my simulation. I cannot figure out what causes the deterioration of the end-to-end latency when the input load is redirected to the additional Consumer executor.
To learn more about the mechanics of Storm, I did the same setup on a smaller machine and did some profiling. I saw that most of the time is spent in the BlockingWaitStrategy of the LMAX Disruptor, and that it dominates the CPU profile. My actual processing function (in the Consumer bolt) takes only a fraction of the time spent in the BlockingWaitStrategy. Hence, I think that it is an I/O issue between queues and not something that has to do with the processing of tuples in the Consumer.
Any idea what goes wrong and why I get this baffling behavior?
Thank you.
First, thanks for the detailed and well-formulated question! There are multiple comments from my side (not sure if this is already an answer...):
Hope this helps. If you have more questions and/or information I can refine my answer later on.