
Apache Storm issue with dynamic redirection of tuples (baffling impact on end-to-end latency)

Below I include text explaining the issue I face in Storm. Anyway, I know it is a long post (just a heads up), and any comment/indication is more than welcome. Here goes the description:

I have installed Storm 0.9.4 and ZooKeeper 3.4.6 on a single server (2 sockets with Intel Xeon 8-core chips and 96 GB of RAM, running CentOS) and I have set up a pseudo-distributed, single-node Storm runtime. My configuration consists of 1 ZooKeeper server, 1 nimbus process, 1 supervisor process, and 1 worker process (when topologies are submitted), all running on the same machine. The purpose of my experiment is to see Storm's behavior in a single-node setting when the input load is dynamically distributed among executor threads.

For the purpose of my experiment, I have input tuples that consist of 1 long and 1 integer value. The input data come from two spouts that read tuples from disk files, and I control the input rate to follow this pattern:

  * 200 tuples/second for the first 24 seconds (time 0 - 24 seconds)
  * 800 tuples/second for the next 12 seconds (time 24 - 36 seconds)
  * 200 tuples/second for 6 more seconds (time 36 - 42 seconds)

Turning to my topology, I have two types of bolts: (a) a Dispatcher bolt that receives input from the two spouts, and (b) a Consumer bolt that performs an operation on the tuples and maintains some tuples as state. The parallelism hint for the Dispatcher is one (1 executor/thread), since I have verified that it never reaches even 10% of its capacity. For the Consumer bolt I have a parallelism hint of two (2 executors/threads for that bolt). The input rates mentioned above are picked so that I can monitor end-to-end latency below 10 msec using the appropriate number of executors on the Consumer bolt. In detail, I have run the same topology with one Consumer executor, and it can handle an input rate of 200 tuples/sec with end-to-end latency < 10 msec. Similarly, if I add one more Consumer executor (2 executors in total), the topology can consume 800 tuples/sec with < 10 msec end-to-end latency. At this point, I have to say that if I use 1 Consumer executor for 800 tuples/sec, the end-to-end latency reaches up to 2 seconds. By the way, I should mention that I measure end-to-end latency using the ack() function of my bolts, checking how much time passes between sending a tuple into the topology and having its tuple tree fully acknowledged.
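For reference, a minimal sketch of this wiring on Storm 0.9.x could look as follows. FileSpout, DispatcherBolt, and ConsumerBolt are placeholder names for the components described above (they are not taken from the question); the single worker process matches the setup described earlier.

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class SpikeExperimentTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Two spouts replaying (long, int) tuples from disk files at a
        // controlled rate; FileSpout is a placeholder for the file-reading spout.
        builder.setSpout("spout-a", new FileSpout("input-a.dat"), 1);
        builder.setSpout("spout-b", new FileSpout("input-b.dat"), 1);

        // One Dispatcher executor (it stays well below 10% capacity).
        builder.setBolt("dispatcher", new DispatcherBolt(), 1)
               .shuffleGrouping("spout-a")
               .shuffleGrouping("spout-b");

        // Two Consumer executors; direct grouping lets the Dispatcher choose
        // the target task per tuple instead of Storm hashing or shuffling.
        builder.setBolt("consumer", new ConsumerBolt(), 2)
               .directGrouping("dispatcher");

        Config conf = new Config();
        conf.setNumWorkers(1); // single worker process, as in the question
        StormSubmitter.submitTopology("spike-experiment", conf, builder.createTopology());
    }
}
```

The directGrouping on the Consumer is what allows the per-tuple redirection described next.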

As you realize by now, the goal is to see if I can maintain end-to-end latency < 10 msec during the input spike by simulating the addition of another Consumer executor. In order to simulate the addition of processing resources for the input spike, I use direct grouping, and before the spike I send tuples to only one of the two Consumer executors. When the spike is detected on the Dispatcher, it starts sending tuples to the other Consumer as well, so that the input load is balanced between the two threads. Hence, I expect that when I start sending tuples to the additional Consumer thread, the end-to-end latency will drop back to its acceptable value. However, this does not happen.
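A sketch of how such a Dispatcher could be written, assuming the component id "consumer" from the wiring sketch above and a simple one-second rate window for spike detection; the threshold and the detector are illustrative, not taken from the question:

```java
import java.util.List;
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class DispatcherBolt extends BaseRichBolt {
    private static final long SPIKE_THRESHOLD = 400; // tuples/sec; illustrative only

    private OutputCollector collector;
    private List<Integer> consumerTasks; // task ids of the two Consumer executors
    private boolean spike = false;
    private long windowStart;
    private long windowCount = 0;
    private long sent = 0;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.consumerTasks = context.getComponentTasks("consumer");
        this.windowStart = System.currentTimeMillis();
    }

    @Override
    public void execute(Tuple input) {
        // Close a one-second window and flag a spike if the rate was too high.
        long now = System.currentTimeMillis();
        windowCount++;
        if (now - windowStart >= 1000) {
            spike = windowCount > SPIKE_THRESHOLD;
            windowCount = 0;
            windowStart = now;
        }

        // Before the spike everything goes to the first Consumer task;
        // during the spike, alternate between the two tasks.
        int target = spike
                ? consumerTasks.get((int) (sent++ % consumerTasks.size()))
                : consumerTasks.get(0);
        collector.emitDirect(target, input, new Values(input.getLong(0), input.getInteger(1)));
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // The stream must be declared "direct" for emitDirect to be valid.
        declarer.declare(true, new Fields("longVal", "intVal"));
    }
}
```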

In order to verify my hypothesis that two Consumer executors are able to maintain < 10 msec latency during a spike, I execute the same experiment, but this time I send tuples to both executors (threads) for the whole lifetime of the experiment. In this case, the end-to-end latency remains stable and at acceptable levels. So, I do not know what really happens in my simulation. I cannot figure out what causes the deterioration of the end-to-end latency in the case where the input load is redirected to the additional Consumer executor.

In order to figure out more about the mechanics of Storm, I did the same setup on a smaller machine and did some profiling. I saw that most of the time is spent in the BlockingWaitStrategy of the LMAX Disruptor, and it dominates the CPU. My actual processing function (in the Consumer bolt) takes only a fraction of the time spent in the LMAX BlockingWaitStrategy. Hence, I think that it is an I/O issue between queues and not something that has to do with the processing of tuples in the Consumer.
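For context, the disruptor-backed queues mentioned here sit behind a handful of Storm 0.9.x configuration keys; a sketch of how they could be resized on the topology config (the values are illustrative, not recommendations):

```java
import backtype.storm.Config;

public class QueueTuning {
    // Storm 0.9.x knobs for the disruptor-backed internal queues.
    // Queue sizes must be powers of two.
    static Config tunedConf() {
        Config conf = new Config();
        conf.put(Config.TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE, 16384); // per-executor incoming queue
        conf.put(Config.TOPOLOGY_EXECUTOR_SEND_BUFFER_SIZE, 16384);    // per-executor outgoing queue
        conf.put(Config.TOPOLOGY_RECEIVER_BUFFER_SIZE, 8);             // batch size of the worker receive thread
        conf.put(Config.TOPOLOGY_TRANSFER_BUFFER_SIZE, 32);            // worker-level outbound transfer queue
        return conf;
    }
}
```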

Any idea what goes wrong and why I get this radical/baffling behavior?

Thank you.

First, thanks for the detailed and well-formulated question! There are multiple comments from my side (not sure if this is already an answer...):

  1. Your experiment is rather short (time ranges below 1 minute), which I think might not reveal reliable numbers.
  2. How do you detect the spike?
  3. Are you aware of the internal buffer mechanisms in Storm? (Have a look here: http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/ )
  4. How many ackers did you configure?
  5. I assume that during your spike period, before you detect the spike, the buffers are filled up and it takes some time to empty them. Thus the latency does not drop immediately (maybe extending your last period resolves this).
  6. Using the ack mechanism is done by many people; however, it is rather imprecise. First, it shows an average value (a median, quantile, or max would be much better to use). Furthermore, the measured value is not what should be considered the latency after all. For example, if you hold a tuple in an internal state for some time and do not ack it until the tuple is removed from the state, Storm's "latency" value would increase in a way that does not make sense for a latency measurement. The usual definition of latency would be to take the output timestamp of a result tuple and subtract the emit timestamp of the source tuple (if there are multiple source tuples, you use the youngest, ie, maximum, timestamp over all source tuples). The tricky part is to figure out the corresponding source tuples for each output tuple... As an alternative, some people inject dummy tuples that carry their emit timestamp as data. These dummy tuples are forwarded by each operator immediately, and the sink operator can easily compute a latency value as it has access to the emit timestamp that is carried around. This is a quite good approximation of the actual latency as described before. A sketch of this dummy-tuple approach follows right after this list.
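A minimal sketch of that dummy-tuple idea, assuming a dedicated stream id "latency" and a placeholder bolt class; none of these names come from the question or the answer:

```java
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class LatencyAwareBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        if ("latency".equals(input.getSourceStreamId())) {
            long emitTs = input.getLong(0);
            long latencyMs = System.currentTimeMillis() - emitTs;
            // At the true sink this is the end-to-end latency; intermediate
            // bolts just record/ignore it and forward the marker immediately.
            System.out.println("marker latency so far (ms): " + latencyMs);
            collector.emit("latency", input, new Values(emitTs));
            collector.ack(input);
            return;
        }
        // ... regular per-tuple processing would go here ...
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("longVal", "intVal"));       // regular data stream
        declarer.declareStream("latency", new Fields("emitTs")); // marker stream
    }
}
```

A spout would periodically emit `new Values(System.currentTimeMillis())` on the "latency" stream, and every downstream bolt must also subscribe to that stream for the marker to travel end to end.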

Hope this helps. If you have more questions and/or information I can refine my answer later on.
