
Spark RDD tuple transformation

I'm trying to transform an RDD of tuples of Strings in this format:

(("abc","xyz","123","2016-02-26T18:31:56"),"15")

to

(("abc","xyz","123"),"2016-02-26T18:31:56","15")

Basically, I want to separate out the timestamp string as its own tuple element. I tried the following, but it's still not clean and correct.

val result = rdd.map(r => (r._1.toString.split(",").toVector.dropRight(1).toString, r._1.toString.split(",").toList.last.toString, r._2))

However, it results in

(Vector(("abc", "xyz", "123"),"2016-02-26T18:31:56"),"15")

The expected output I'm looking for is

(("abc", "xyz", "123"),"2016-02-26T18:31:56","15")

This way I can access the elements using r._1, r._2 (the timestamp string), and r._3 in a separate map operation.

Any hints/pointers will be greatly appreciated.

Vector.toString will include the string "Vector" in its result. Instead, use Vector.mkString(",").

Example:

scala> val xs = Vector(1,2,3)
xs: scala.collection.immutable.Vector[Int] = Vector(1, 2, 3)

scala> xs.toString
res25: String = Vector(1, 2, 3)

scala> xs.mkString
res26: String = 123

scala> xs.mkString(",")
res27: String = 1,2,3
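Applied to the original attempt, that fix might look like the sketch below (plain Scala, no SparkContext, so the reshaping logic can be checked directly). Note that with this string-based approach the first element stays a single comma-joined String, not a tuple:

```scala
// Sketch: split the tuple's toString form, then rejoin the leading
// fields with mkString(",") instead of toString (which would prepend
// "Vector(" to the result).
val raw = (("abc", "xyz", "123", "2016-02-26T18:31:56"), "15")

// Tuple4.toString yields "(abc,xyz,123,2016-02-26T18:31:56)";
// strip the parentheses before splitting on commas.
val parts = raw._1.toString.stripPrefix("(").stripSuffix(")").split(",")

// ("abc,xyz,123", "2016-02-26T18:31:56", "15")
val res = (parts.dropRight(1).mkString(","), parts.last, raw._2)
println(res)
```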

However, if you want to be able to access (abc,xyz,123) as a tuple and not as a string, you could also do the following:

val res = rdd.map{
  case ((a:String,b:String,c:String,ts:String),d:String) => ((a,b,c),ts,d)
}
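The pattern match itself can be verified on a plain Scala collection; on a real RDD the map call would be identical:

```scala
// Sketch: the same pattern match on a Seq instead of an RDD.
// Destructuring pulls the timestamp out of the inner tuple and
// re-packs the remaining three fields as a Tuple3's first element.
val data = Seq((("abc", "xyz", "123", "2016-02-26T18:31:56"), "15"))

val res = data.map {
  case ((a, b, c, ts), d) => ((a, b, c), ts, d)
}
println(res.head)
```

Here the elements are reachable as r._1 (the inner tuple), r._2 (the timestamp), and r._3, as requested.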
