
Spark Streaming DStream element vs RDD

I'm using Spark's Streaming API, and I just wanted to get a better understanding of how best to design the code.

I'm currently using the Kafka consumer (in pyspark) from pyspark.streaming.kafka.createDirectStream.

According to http://spark.apache.org/docs/latest/streaming-programming-guide.html:

Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.

Essentially, I want to apply a set of functions to each of the elements in the DStream. Currently, I'm using the "map" function of pyspark.streaming.DStream. According to the documentation, my approach seems correct: http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#pyspark.streaming.DStream

map(f, preservesPartitioning=False) — Return a new DStream by applying a function to each element of DStream.

Should I be using map, or would the right approach be to apply functions/transformations to the RDDs (since a DStream uses RDDs)?

foreachRDD(func) — Apply a function to each RDD in this DStream.

More docs: http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html

DirectStream.map is the correct choice here. The following map:

stream.map(f)

is equivalent to:

stream.transform(lambda rdd: rdd.map(f))
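Since a DStream is internally a sequence of RDDs, this equivalence can be illustrated with a minimal pure-Python model (an illustrative sketch, not the real pyspark API — the function names `dstream_map` and `dstream_transform` are made up for the example): each micro-batch stands in for an RDD as a plain list, and the stream is a list of batches.

```python
# Toy model: an "RDD" is a list, a "DStream" is a sequence of RDDs (micro-batches).
# This is a sketch of the semantics only, not the pyspark API.

def dstream_map(batches, f):
    """Element-wise map, like DStream.map(f)."""
    return [[f(x) for x in batch] for batch in batches]

def dstream_transform(batches, rdd_func):
    """Per-RDD transform, like DStream.transform(func)."""
    return [rdd_func(batch) for batch in batches]

batches = [[1, 2], [3], [4, 5, 6]]   # three micro-batches
f = lambda x: x * 10

# stream.map(f) == stream.transform(lambda rdd: rdd.map(f))
assert dstream_map(batches, f) == dstream_transform(
    batches, lambda rdd: [f(x) for x in rdd]
)
print(dstream_map(batches, f))  # [[10, 20], [30], [40, 50, 60]]
```

Both paths produce a new stream of transformed batches; map is just the per-element convenience over the per-RDD transform.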

DirectStream.foreachRDD, on the other hand, is an output action and creates an output DStream. The function you use with foreachRDD is not expected to return anything, and neither does the method itself, which is obvious when you take a look at the Scala signature:

def foreachRDD(foreachFunc: RDD[T] => Unit): Unit
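In the same toy model, foreachRDD is purely side-effecting: the function receives each batch (RDD) and its return value is ignored, and the call itself also returns nothing (again a sketch with a made-up name `dstream_foreach_rdd`, not the pyspark API):

```python
# Toy model of foreachRDD: an output action that returns None.
# Sketch only; names are hypothetical, not the pyspark API.

def dstream_foreach_rdd(batches, foreach_func):
    """Like DStream.foreachRDD: call foreach_func on each RDD for its
    side effects; return values are discarded, as in RDD[T] => Unit."""
    for batch in batches:
        foreach_func(batch)  # return value ignored

collected = []
result = dstream_foreach_rdd([[1, 2], [3, 4]],
                             lambda rdd: collected.extend(rdd))

assert result is None            # foreachRDD itself returns nothing
assert collected == [1, 2, 3, 4] # the work happens via side effects
```

So for element-wise transformations that feed further processing, use map (or transform); reserve foreachRDD for output actions such as writing each batch to an external system.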

Note: the technical posts on this site are licensed under CC BY-SA 4.0; if you repost, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.
