Processed batch vs RDD in Spark Streaming

I saw several answers on SO (e.g. here) suggesting that the records in a batch become a single RDD. I doubt it, because if the batchInterval is 1 minute, would a single RDD really contain all the data from the last minute?

NOTE: I'm not directly comparing a batch to an RDD, but rather asking about the batch as Spark processes it internally.

Let me quote the Spark Streaming guide:

Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from source, or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, which is Spark's abstraction of an immutable, distributed dataset (see Spark Programming Guide for more details). Each RDD in a DStream contains data from a certain interval, as shown in the following figure.

[Figure: a DStream as a continuous series of RDDs, where each RDD holds the data of one batch interval]

As you can see - single batch = single RDD. This is why adjusting the batch interval to your data flow is crucial for the stability of your application.
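
To make that concrete, here is a minimal Scala sketch (the socket source, the host/port, and the 60-second interval are illustrative assumptions, not anything from the question). foreachRDD is invoked once per batch interval, and the single RDD it receives holds every record collected during that interval:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object BatchAsSingleRdd {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("BatchAsSingleRdd").setMaster("local[2]")

        // The second argument is the batch interval: every 60 seconds the
        // records received so far are grouped into exactly one new RDD.
        val ssc = new StreamingContext(conf, Seconds(60))

        // Hypothetical source: a text stream on localhost:9999.
        val lines = ssc.socketTextStream("localhost", 9999)

        // foreachRDD fires once per batch; `rdd` is the single RDD holding
        // all records received during that one-minute interval.
        lines.foreachRDD { (rdd, time) =>
          println(s"Batch at $time is one RDD with ${rdd.count()} records " +
            s"across ${rdd.getNumPartitions} partitions")
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }

Note that "one RDD per batch" does not mean the whole minute's data sits in one place: the RDD is still partitioned and distributed across the executors, so a large batch is spread over many partitions and processed in parallel.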
