
How does Storm handle nextTuple in the Bolt

I am a newbie to Storm and have created a program that reads incrementing numbers for a certain amount of time. I have used a counter in the Spout, and in the nextTuple() method the counter is emitted and incremented:

_collector.emit(new Values(new Integer(currentNumber++))); 
/* how is this method being called continuously? */
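
(For reference, the full Spout is not shown in the question; a minimal sketch of what such a NumberSpout might look like is below. The class name and field names follow the code in the question, everything else is an assumption, and the imports assume a Storm release where the classes live under org.apache.storm; older releases use backtype.storm.)

import java.util.Map;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class NumberSpout extends BaseRichSpout {
    private SpoutOutputCollector _collector;
    private int currentNumber = 1;

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        _collector = collector;                     // Storm hands over the collector once, at startup
    }

    public void nextTuple() {
        // Storm calls this method over and over from the spout's executor thread
        _collector.emit(new Values(currentNumber++));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("number"));     // single field, read by the bolt via getInteger(0)
    }
}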

and the execute(Tuple) method of my Bolt has

public void execute(Tuple input) {
    int number = input.getInteger(0);
    logger.info("This number is (" + number + ")");
    _outputCollector.ack(input);
}
/* this part I am clear about, since the Bolt receives its input from the Spout */
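
(The _outputCollector used above is presumably stored in the Bolt's prepare method, roughly as in the sketch below; only the class name PrimeNumberBolt comes from the topology code further down, the rest is an assumption.)

import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class PrimeNumberBolt extends BaseRichBolt {
    private OutputCollector _outputCollector;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _outputCollector = collector;   // Storm provides the collector before execute() is ever called
    }

    public void execute(Tuple input) {
        // body as shown in the question: read the integer, log it, then ack
        _outputCollector.ack(input);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt: no output stream to declare
    }
}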

In my Main class I have the following code:

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("NumberSpout", new NumberSpout());
builder.setBolt("NumberBolt", new PrimeNumberBolt())
            .shuffleGrouping("NumberSpout");
Config config = new Config();
LocalCluster localCluster = new LocalCluster();
localCluster.submitTopology("NumberTest", config, builder.createTopology());
Utils.sleep(10000);
localCluster.killTopology("NumberTest");
localCluster.shutdown();

The program works perfectly fine. What I am currently trying to understand is how the Storm framework internally calls the nextTuple() method continuously. I am sure my understanding is missing something here, and because of this gap I am unable to connect to the internal logic of this framework.

Can any of you help me understand this part clearly? It would be a great help, as I will have to apply this concept in my project. If I am conceptually clear here, I can make significant progress. I would appreciate it if anyone could assist me quickly. Awaiting responses...

how does the Storm framework internally call the nextTuple() method continuously

I believe this actually involves a very detailed discussion of the entire life cycle of a Storm topology, as well as a clear understanding of the different entities involved, such as workers, executors, tasks, etc. The actual submission of a topology is carried out by the StormSubmitter class with its submitTopology method.

The very first thing it does is start uploading the jar using Nimbus's Thrift interface, and then it calls submitTopology, which eventually submits the topology to Nimbus.
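
(In the question the topology is run with LocalCluster, an in-process cluster that is handy for testing. Submitting to a real cluster goes through StormSubmitter instead; a minimal sketch, reusing the builder and config from the question, with a made-up main class name and assuming the org.apache.storm package layout:)

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class NumberTopologyMain {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("NumberSpout", new NumberSpout());
        builder.setBolt("NumberBolt", new PrimeNumberBolt()).shuffleGrouping("NumberSpout");

        Config config = new Config();
        // Uploads the topology jar via Nimbus's Thrift interface and hands the topology to Nimbus
        StormSubmitter.submitTopology("NumberTest", config, builder.createTopology());
    }
}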

Nimbus then starts by normalizing the topology (from the doc: The main purpose of normalization is to ensure that every single task will have the same serialization registrations, which is critical for getting serialization working correctly), followed by serialization, ZooKeeper handshaking, supervisor and worker process startup, and so on. It is too broad to discuss here, but if you really want to dig deeper you can go through the life cycle of a storm topology, which nicely explains the step-by-step actions performed during the entire process.
(a quick note from the documentation)

First, a couple of important notes about topologies:

The actual topology that runs is different than the topology the user specifies. The actual topology has implicit streams and an implicit "acker" bolt added to manage the acking framework (used to guarantee data processing).

The implicit topology is created via the system-topology! function. system-topology! is used in two places:
- when Nimbus is creating tasks for the topology
- in the worker, so it knows where it needs to route messages to

Now here are a few clues I can try to share...
Spouts and Bolts are actually the components which do the real processing (the logic). In Storm terminology, they are run as a number of tasks across the cluster.
From the doc page: Each task corresponds to one thread of execution.

Now, among many other things, one typical responsibility of a worker process (read here) in Storm is to monitor whether a topology is active or not and to store that particular state in a variable named storm-active-atom. This variable is used by the tasks to determine whether or not to call the nextTuple method. So as long as your topology is live (you haven't posted your spout code, but I am assuming so) and until your timer expires (as you said, for a certain time), it will keep calling the nextTuple method. You can dig even further into Storm's acking framework implementation to understand how it recognizes and acknowledges a tuple once it has been successfully processed, and into Guaranteeing Message Processing.
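
(To make that concrete, here is a rough conceptual sketch of that loop, written in Java purely for illustration; Storm's real executor lives inside the framework and all the names below are made up.)

import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.storm.spout.ISpout;
import org.apache.storm.utils.Utils;

// Illustrative only: mimics the idea of the executor's spout loop consulting a
// "storm-active-atom"-style flag before every nextTuple() call.
class SpoutLoopSketch {
    private final AtomicBoolean stormActive = new AtomicBoolean(true); // flipped when the topology is (de)activated
    private volatile boolean shutdown = false;

    void run(ISpout spout) {
        while (!shutdown) {
            if (stormActive.get()) {
                spout.nextTuple();   // your spout code runs here, again and again
            } else {
                Utils.sleep(100);    // topology deactivated: just idle
            }
        }
    }
}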

I am sure that my understanding is missing something here and due to this gap I am unable to connect to the internal logic of this framework

Having said this, I think it is more important at this early stage to get a clear understanding of how to work with Storm rather than how Storm works internally. For example, instead of learning Storm's internal mechanisms, it is important to realize that if we set up a spout to read a file line by line, then it keeps emitting each line using the _collector.emit method until it reaches EOF, and the bolt connected to it receives the same lines in its execute(Tuple input) method.
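
(A minimal sketch of such a line-reading spout, with a made-up class name, file path, and field name, might look like this:)

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Map;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class FileLineSpout extends BaseRichSpout {
    private SpoutOutputCollector _collector;
    private BufferedReader reader;

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        _collector = collector;
        try {
            reader = new BufferedReader(new FileReader("/tmp/input.txt")); // hypothetical path
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public void nextTuple() {
        try {
            String line = reader.readLine();
            if (line != null) {
                _collector.emit(new Values(line)); // one tuple per line, until EOF
            }
            // after EOF there is simply nothing left to emit, even though nextTuple keeps being called
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }
}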

Hope this helps you, and that you will share more with us in the future.

Ordinary Spouts

There is a loop in Storm's executor daemon that repeatedly calls nextTuple (as well as ack and fail when appropriate) on the corresponding spout instance.

There is no waiting for tuples to be processed. The spout simply receives fail for tuples that did not manage to be processed within the given timeout. This can easily be simulated with a topology of a fast spout and a slow processing bolt: the spout will receive a lot of fail calls.
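
(For example, a deliberately slow bolt like the hypothetical sketch below, paired with a spout that emits tuples with a message ID so they are tracked by the acking framework, will let tuples exceed topology.message.timeout.secs, which is 30 seconds by default, and the spout's fail method will then be called for them.)

import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.utils.Utils;

public class SlowBolt extends BaseRichBolt {
    private OutputCollector collector;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    public void execute(Tuple input) {
        Utils.sleep(60_000);    // hold each tuple longer than the message timeout
        collector.ack(input);   // by now the tuple has usually already been failed back to the spout
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt: nothing to declare
    }
}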

See also the ISpout javadoc:

nextTuple, ack, and fail are all called in a tight loop in a single thread in the spout task. When there are no tuples to emit, it is courteous to have nextTuple sleep for a short amount of time (like a single millisecond) so as not to waste too much CPU.
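
(For illustration, a nextTuple written to follow that advice could look like the fragment below; queue and _collector are assumed fields of the surrounding spout, and Utils is org.apache.storm.utils.Utils.)

public void nextTuple() {
    Integer next = queue.poll();        // hypothetical in-memory source of values
    if (next == null) {
        Utils.sleep(1);                 // nothing to emit: back off for about a millisecond
        return;
    }
    _collector.emit(new Values(next));
}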


Trident Spouts

The situation is completely different for Trident spouts:

By default, Trident processes a single batch at a time, waiting for the batch to succeed or fail before trying another batch. You can get significantly higher throughput, and lower latency of processing for each batch, by pipelining the batches. You configure the maximum number of batches to be processed simultaneously with the topology.max.spout.pending property.

Even while processing multiple batches simultaneously, Trident will order any state updates taking place in the topology among batches.
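
(For reference, that property can be set on the topology's Config; a small sketch fragment, with a made-up value of 10:)

import org.apache.storm.Config;

Config config = new Config();
// Same as setting the topology.max.spout.pending property directly:
// at most 10 batches (Trident) or pending tuple trees (ordinary spouts) in flight at once.
config.setMaxSpoutPending(10);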
