
Spark Streaming DStream element vs RDD

I'm using Spark's Streaming API and want to get a better understanding of how best to design the code.

I'm currently using a Kafka consumer (in pyspark) created with pyspark.streaming.kafka.KafkaUtils.createDirectStream

According to http://spark.apache.org/docs/latest/streaming-programming-guide.html

Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.

Essentially, I want to apply a set of functions to each of the elements in the DStream. Currently, I'm using the "map" function of pyspark.streaming.DStream. According to the documentation, my approach seems correct: http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#pyspark.streaming.DStream

map(f, preservesPartitioning=False) Return a new DStream by applying a function to each element of DStream.

Should I be using map, or would the right approach be to apply functions/transformations to the RDDs (since a DStream is made up of RDDs)?

foreachRDD(func) Apply a function to each RDD in this DStream.

More Docs: http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html

DirectStream.map is a correct choice here. The following map:

stream.map(f)

is equivalent to:

stream.transform(lambda rdd: rdd.map(f))
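To see why the two forms agree without starting Spark, here is a minimal pure-Python sketch. ToyRDD and ToyDStream are hypothetical stand-ins, not part of pyspark: a ToyDStream is simply a list of batches, and each batch plays the role of one RDD. It only models the semantics described above, not how Spark schedules or distributes work.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: one batch of elements."""

    def __init__(self, items):
        self.items = list(items)

    def map(self, f):
        # Apply f to every element of this batch.
        return ToyRDD(f(x) for x in self.items)


class ToyDStream:
    """Hypothetical stand-in for a DStream: a sequence of ToyRDD batches."""

    def __init__(self, rdds):
        self.rdds = list(rdds)

    def map(self, f):
        # Element-wise: apply f to each element of every batch.
        return ToyDStream(ToyRDD(f(x) for x in rdd.items) for rdd in self.rdds)

    def transform(self, func):
        # Batch-wise: apply an RDD -> RDD function to every batch.
        return ToyDStream(func(rdd) for rdd in self.rdds)

    def collect(self):
        return [rdd.items for rdd in self.rdds]


def double(x):
    return x * 2


stream = ToyDStream([ToyRDD([1, 2]), ToyRDD([3])])

via_map = stream.map(double).collect()
via_transform = stream.transform(lambda rdd: rdd.map(double)).collect()

print(via_map == via_transform)
```

In this model both calls produce the same batches, which is the equivalence stated above: map is the convenient form, and transform is only needed when the per-batch function has to do something that is not available directly on the DStream.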

DirectStream.foreachRDD, on the other hand, is an output operation and creates an output DStream. The function you pass to foreachRDD is not expected to return anything, and neither is the method itself. This is obvious when you look at the Scala signature:

def foreachRDD(foreachFunc: RDD[T] => Unit): Unit
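The practical consequence can be sketched in plain Python. The helper foreach_rdd below is a hypothetical toy, not the pyspark method: it models a stream as a list of batches and shows that the callback is run only for its side effects, so nothing comes back that could be chained into further transformations.

```python
def foreach_rdd(batches, func):
    """Toy model of an output operation: call func on each batch, return nothing."""
    for batch in batches:
        func(batch)  # any return value of func is discarded


batches = [[1, 2], [3]]

# Stand-in for an external system (database, file, message queue, ...).
sink = []


def save_batch(batch):
    # Side effect only: push the doubled elements to the sink.
    sink.extend(x * 2 for x in batch)


result = foreach_rdd(batches, save_batch)

print(result)
print(sink)
```

So in this picture map is for turning one stream into another, while foreachRDD is the final step where each batch is pushed out to somewhere else.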
