
How to map an RDD to another RDD with Scala, in Spark?

I have an RDD:

RDD1 = (big,data), (apache,spark), (scala,language) ... 

and I need to combine each element with the timestamp held in a second RDD:

RDD2 = ('2015-01-01 13.00.00')

so that I get

RDD3 = (big, data, 2015-01-01 13.00.00), (apache, spark, 2015-01-01 13.00.00), (scala, language, 2015-01-01 13.00.00)

I wrote a simple map function for this:

RDD3 = RDD1.map(rdd => (rdd, RDD2))

but it does not work, and I suspect it is not the right approach. How should I do this? I am new to Scala and Spark. Thank you.

You can use zip:

val rdd1 = sc.parallelize(("big","data") :: ("apache","spark") :: ("scala","language") :: Nil)
// RDD[(String, String)]
val rdd2 = sc.parallelize(List.fill(3)(new java.util.Date().toString))
// RDD[String]

rdd1.zip(rdd2).map{ case ((a,b),c) => (a,b,c) }.collect()
// Array((big,data,Fri Jul 24 22:25:01 CEST 2015), (apache,spark,Fri Jul 24 22:25:01 CEST 2015), (scala,language,Fri Jul 24 22:25:01 CEST 2015))
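One caveat worth spelling out: zip assumes both RDDs have the same number of partitions and the same number of elements in each partition. The following is a minimal sketch, not a definitive implementation, that makes this explicit by passing the same numSlices to both parallelize calls. The object name ZipExample, the helper zipWithStamp, the local[2] master, and the fixed timestamp string are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ZipExample {
  // Hypothetical helper: pairs each (word, word) tuple with a timestamp via zip.
  def zipWithStamp(sc: SparkContext, stamp: String): Array[(String, String, String)] = {
    val pairs = Seq(("big", "data"), ("apache", "spark"), ("scala", "language"))

    // Same element count and same numSlices, so the partitions line up for zip.
    val rdd1 = sc.parallelize(pairs, numSlices = 2)
    val rdd2 = sc.parallelize(Seq.fill(pairs.size)(stamp), numSlices = 2)

    rdd1.zip(rdd2).map { case ((a, b), c) => (a, b, c) }.collect()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local Spark context is enough for this demonstration.
    val sc = new SparkContext(new SparkConf().setAppName("zip-example").setMaster("local[2]"))
    try zipWithStamp(sc, "2015-01-01 13.00.00").foreach(println)
    finally sc.stop()
  }
}
```

If the two RDDs come from different sources and their partitioning cannot be guaranteed to match, zip is likely to fail at runtime, so the constant-value approach is usually the safer choice.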

If you want the same timestamp for every element of rdd1:

val now = new java.util.Date().toString
rdd1.map{ case (a,b) => (a,b,now) }.collect()
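In the question, the timestamp already lives in a one-element RDD (RDD2). As far as I know, Spark does not support using one RDD inside a transformation of another, which is probably why RDD1.map(rdd => (rdd, RDD2)) fails. A hedged sketch of two ways around this follows: fetch the single value to the driver with first(), or pair every element with it using cartesian. The object name AttachTimestamp, both helper names, and the local[2] master are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object AttachTimestamp {
  // Option 1: bring the single timestamp to the driver, then close over the plain String.
  def viaFirst(rdd1: RDD[(String, String)], rdd2: RDD[String]): RDD[(String, String, String)] = {
    val ts = rdd2.first() // runs a small job; throws if rdd2 is empty
    rdd1.map { case (a, b) => (a, b, ts) }
  }

  // Option 2: stay distributed; cartesian pairs every element of rdd1 with every element of rdd2.
  def viaCartesian(rdd1: RDD[(String, String)], rdd2: RDD[String]): RDD[(String, String, String)] =
    rdd1.cartesian(rdd2).map { case ((a, b), ts) => (a, b, ts) }

  def main(args: Array[String]): Unit = {
    // Assumption: a local Spark context is enough for this demonstration.
    val sc = new SparkContext(new SparkConf().setAppName("attach-timestamp").setMaster("local[2]"))
    try {
      val rdd1 = sc.parallelize(Seq(("big", "data"), ("apache", "spark"), ("scala", "language")))
      val rdd2 = sc.parallelize(Seq("2015-01-01 13.00.00"))
      viaFirst(rdd1, rdd2).collect().foreach(println)
      viaCartesian(rdd1, rdd2).collect().foreach(println)
    } finally sc.stop()
  }
}
```

With a single timestamp, first() is the simpler option. cartesian only gives the desired result when rdd2 holds exactly one element, because otherwise every pair is repeated once per timestamp.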
