
Spark Streaming DStream map vs foreachRDD: which is more efficient for transformation?

For transformation alone, map and foreachRDD can achieve the same goal, but which one is more efficient, and why?

For example, for a DStream[Int]:

val newDs1=Ds.map(x=> x+1)
val newDs2=Ds.foreachRDD (rdd=>rdd.map(x=> x+1))

I know foreachRDD operates on the RDD directly, whereas map seems to transform the DStream into an RDD first (not sure), so foreachRDD seems more efficient than map. However, map is a transformation while foreachRDD is an output operation, so map should be more efficient than foreachRDD for transformation. Does anybody know which one is right, and why? Thanks for any reply.

Add one more comparison:

val newDS3=Ds.transform (rdd=>rdd.map(x=> x+1))

Which is more efficient for transformation?

You could answer this question yourself if you checked the types. foreachRDD returns Unit, so what you have is:

 val newDs2: Unit = Ds.foreachRDD (rdd=>rdd.map(x=> x+1))

Not only do you not get a DStream[_], but the inner map is never executed: it is lazy, and no action is ever called on the RDD it returns.

The following two:
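To make the laziness point concrete, here is a minimal sketch rather than a definitive implementation. It assumes a local Spark Streaming setup; the object name `ForeachRddSketch`, the in-memory `queueStream` input, the one-second batch interval and the three-second timeout are illustrative choices that are not in the original post. The first `foreachRDD` call mirrors the question and does no work. The second adds an action (`collect`), which is what forces the inner `map` to run.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ForeachRddSketch {
  // Runs a tiny local stream and returns the values gathered on the driver.
  def run(): List[Int] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("foreachRDD-sketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Assumed test input: a single RDD fed through an in-memory queue.
    val queue = mutable.Queue[RDD[Int]](ssc.sparkContext.parallelize(Seq(1, 2, 3)))
    val ds = ssc.queueStream(queue)

    // As in the question: the result type is Unit, and the RDD built by
    // rdd.map is discarded without any action, so x + 1 is never evaluated.
    val nothing: Unit = ds.foreachRDD(rdd => rdd.map(x => x + 1))

    // With an action inside foreachRDD the inner map does run for each batch.
    val seen = mutable.ArrayBuffer.empty[Int]
    ds.foreachRDD { rdd =>
      val xs = rdd.map(x => x + 1).collect() // collect() is an action
      seen.synchronized { seen ++= xs }
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(3000) // let a few batches run
    ssc.stop(stopSparkContext = true, stopGracefully = false)
    seen.synchronized(seen.toList)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Even in the second form, foreachRDD still returns Unit, so it is suited to side effects such as writing results out, not to producing a new DStream.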

Ds.map(x=> x+1)
Ds.transform (rdd=>rdd.map(x=> x+1))

are identical in terms of execution, so there is no reason to use the latter, which is unnecessarily verbose.
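The equivalence can be checked with a small sketch, under the same assumptions as any local experiment: the object name `MapVsTransformSketch`, the `queueStream` test input and the timings below are hypothetical and not part of the original answer. Both derived streams are built from the same input and their per-batch results are gathered on the driver for comparison.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MapVsTransformSketch {
  // Returns (values produced via map, values produced via transform).
  def run(): (List[Int], List[Int]) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("map-vs-transform-sketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Assumed test input: a single RDD fed through an in-memory queue.
    val queue = mutable.Queue[RDD[Int]](ssc.sparkContext.parallelize(Seq(1, 2, 3)))
    val ds = ssc.queueStream(queue)

    // Both are DStream transformations that map each batch RDD element-wise.
    val viaMap = ds.map(x => x + 1)
    val viaTransform = ds.transform(rdd => rdd.map(x => x + 1))

    val fromMap = mutable.ArrayBuffer.empty[Int]
    val fromTransform = mutable.ArrayBuffer.empty[Int]

    // Output operations: without them neither stream would be computed.
    viaMap.foreachRDD { rdd =>
      val xs = rdd.collect()
      fromMap.synchronized { fromMap ++= xs }
    }
    viaTransform.foreachRDD { rdd =>
      val xs = rdd.collect()
      fromTransform.synchronized { fromTransform ++= xs }
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(3000) // let a few batches run
    ssc.stop(stopSparkContext = true, stopGracefully = false)
    (fromMap.synchronized(fromMap.toList), fromTransform.synchronized(fromTransform.toList))
  }

  def main(args: Array[String]): Unit = println(run())
}
```

transform is mainly worth reaching for when an RDD operation is not exposed on DStream itself. For a plain element-wise map it adds nothing.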
