I'm using Spark's Streaming API and want to get a better understanding of how best to design the code.
I'm currently consuming from Kafka (in pyspark) via pyspark.streaming.kafka.KafkaUtils.createDirectStream.
According to http://spark.apache.org/docs/latest/streaming-programming-guide.html
Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
Essentially, I want to apply a set of functions to each of the elements in the DStream. Currently, I'm using the map function of pyspark.streaming.DStream, and according to the documentation my approach seems correct: http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#pyspark.streaming.DStream
map(f, preservesPartitioning=False) Return a new DStream by applying a function to each element of DStream.
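To make this concrete, here is a minimal sketch of what I mean by "a set of functions applied to each element". The topic, broker address, and function names (parse, enrich, process) are my own illustrations, not anything from the docs:

```python
# Sketch: composing per-element steps into one function passed to DStream.map.
import json

def parse(record):
    # Kafka records arrive as (key, value) tuples; assume the value is JSON.
    _, value = record
    return json.loads(value)

def enrich(event):
    # Add a field without mutating the original dict.
    event = dict(event)
    event["processed"] = True
    return event

def process(record):
    # The composed per-element function handed to map().
    return enrich(parse(record))

# In the streaming job (requires a live StreamingContext, shown for context only):
#   from pyspark.streaming.kafka import KafkaUtils
#   stream = KafkaUtils.createDirectStream(ssc, ["topic"], {"metadata.broker.list": "host:9092"})
#   stream.map(process).pprint()

# The per-element functions are plain Python, so they can be exercised locally:
result = process((None, '{"user": "alice"}'))
```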
Should I be using map, or would the right approach be to apply functions/transformations to the underlying RDDs (since a DStream is represented as a sequence of RDDs)?
foreachRDD(func) Apply a function to each RDD in this DStream.
More Docs: http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html
DirectStream.map is a correct choice here. The following map:

stream.map(f)

is equivalent to:

stream.transform(lambda rdd: rdd.map(f))
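A small sketch of that equivalence. The stream variable is assumed; since each micro-batch is an RDD of elements, a plain list stands in for one batch locally:

```python
# DStream.map applies f element-wise within every batch RDD, which is
# exactly what transform(lambda rdd: rdd.map(f)) spells out explicitly.
def f(x):
    return x * 2

# On a live stream (context only, needs a StreamingContext):
#   mapped      = stream.map(f)
#   transformed = stream.transform(lambda rdd: rdd.map(f))
# Both yield a DStream whose batches contain f applied to every element.

# Locally, one micro-batch behaves like a collection of elements:
batch = [1, 2, 3]
mapped_batch = [f(x) for x in batch]
```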
DirectStream.foreachRDD, on the other hand, is an output action and creates an output DStream. The function you use with foreachRDD is not expected to return anything, same as the method itself. It is obvious when you take a look at the Scala signature:
def foreachRDD(foreachFunc: RDD[T] => Unit): Unit
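In other words, foreachRDD is for side effects such as writing each batch to a sink, not for building a transformed stream. A minimal sketch, with the sink and helper names being my own illustrations and a list standing in for the RDD so it runs locally:

```python
# Sketch: foreachRDD-style output action; the function returns nothing and
# only performs a side effect (here, appending to an in-memory sink).
sink = []

def save_partition(records):
    # Hypothetical sink writer: push each record somewhere durable.
    for r in records:
        sink.append(r)

def save_rdd(rdd):
    # In a real job this would be:
    #   rdd.foreachPartition(save_partition)
    # and registered with:
    #   stream.foreachRDD(save_rdd)
    save_partition(rdd)  # stand-in: treat rdd as a plain iterable here

save_rdd(["a", "b"])  # note: no return value, only the side effect
```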