
Spark Streaming DStream element vs RDD

I'm using Spark's Streaming API, and I just wanted to get a better understanding of how best to design the code.

I'm currently using the Kafka consumer (in pyspark) from pyspark.streaming.kafka.createDirectStream.

According to http://spark.apache.org/docs/latest/streaming-programming-guide.html:

Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.

Essentially, I want to apply a set of functions to each of the elements in the DStream. Currently, I'm using the "map" function of pyspark.streaming.DStream. According to the documentation, my approach seems correct: http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#pyspark.streaming.DStream

map(f, preservesPartitioning=False) — Return a new DStream by applying a function to each element of DStream.

Should I be using map, or would the right approach be to apply functions/transformations to the RDDs (since a DStream uses RDDs)?

foreachRDD(func) — Apply a function to each RDD in this DStream.

More docs: http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html

DirectStream.map is the correct choice here. The following map:

stream.map(f)

is equivalent to:

stream.transform(lambda rdd: rdd.map(f))
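Since a DStream is internally a sequence of RDDs, this equivalence can be illustrated with a minimal pure-Python model (an illustrative sketch, not the real pyspark API — the function names `dstream_map` and `dstream_transform` are made up for the example): each micro-batch stands in for an RDD as a plain list, and the stream is a list of batches.

```python
# Toy model: an "RDD" is a list, a "DStream" is a sequence of RDDs (micro-batches).
# This is a sketch of the semantics only, not the pyspark API.

def dstream_map(batches, f):
    """Element-wise map, like DStream.map(f)."""
    return [[f(x) for x in batch] for batch in batches]

def dstream_transform(batches, rdd_func):
    """Per-RDD transform, like DStream.transform(func)."""
    return [rdd_func(batch) for batch in batches]

batches = [[1, 2], [3], [4, 5, 6]]   # three micro-batches
f = lambda x: x * 10

# stream.map(f) == stream.transform(lambda rdd: rdd.map(f))
assert dstream_map(batches, f) == dstream_transform(
    batches, lambda rdd: [f(x) for x in rdd]
)
print(dstream_map(batches, f))  # [[10, 20], [30], [40, 50, 60]]
```

Both paths produce a new stream of transformed batches; map is just the per-element convenience over the per-RDD transform.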

DirectStream.foreachRDD, on the other hand, is an output action and creates an output DStream. The function you use with foreachRDD is not expected to return anything, and neither does the method itself, which is obvious when you take a look at the Scala signature:

def foreachRDD(foreachFunc: RDD[T] => Unit): Unit
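In the same toy model, foreachRDD is purely side-effecting: the function receives each batch (RDD) and its return value is ignored, and the call itself also returns nothing (again a sketch with a made-up name `dstream_foreach_rdd`, not the pyspark API):

```python
# Toy model of foreachRDD: an output action that returns None.
# Sketch only; names are hypothetical, not the pyspark API.

def dstream_foreach_rdd(batches, foreach_func):
    """Like DStream.foreachRDD: call foreach_func on each RDD for its
    side effects; return values are discarded, as in RDD[T] => Unit."""
    for batch in batches:
        foreach_func(batch)  # return value ignored

collected = []
result = dstream_foreach_rdd([[1, 2], [3, 4]],
                             lambda rdd: collected.extend(rdd))

assert result is None            # foreachRDD itself returns nothing
assert collected == [1, 2, 3, 4] # the work happens via side effects
```

So for element-wise transformations that feed further processing, use map (or transform); reserve foreachRDD for output actions such as writing each batch to an external system.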

Note: the technical posts on this site are licensed under CC BY-SA 4.0; if you repost, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.
