
Convert RDD Array[Any] = Array(List([String], ListBuffer([String]))) to RDD[(String, Seq[String])]

I have an RDD of type Any, for example:

Array(List(Mathematical Sciences, ListBuffer(applications, asymptotic, largest, enable, stochastic)))

I want to convert it to an RDD of type RDD[(String, Seq[String])].

I tried:

val rdd = sc.makeRDD(strList)
case class X(titleId: String, terms: List[String])

val df = rdd.map { case Array(s0, s1) => X(s0, s1) }.toDF()

I spent a long time trying, without success.

You can use:

val result: RDD[(String, Seq[String])] = 
  rdd.map { case List(s0: String, s1: ListBuffer[String]) =>  (s0, s1) }

But note that any record in the input RDD[Any] that doesn't match these types (which can't be checked at compile time) will throw a scala.MatchError.
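To illustrate the MatchError caveat, here is a minimal sketch on a plain Scala collection rather than an RDD, so it runs without Spark. The names MatchErrorSketch, records and toPair are hypothetical. It matches the second element as scala.collection.Seq[_] instead of ListBuffer[String], which is an assumption made here so that both mutable and immutable sequences are accepted. It shows map failing on a malformed record, while collect with the same partial function skips it.

```scala
import scala.collection.mutable.ListBuffer
import scala.util.Try

object MatchErrorSketch {
  // Mixed input standing in for the contents of an RDD[Any] (hypothetical data);
  // the second record is deliberately malformed.
  val records: Seq[Any] = Seq(
    List("Mathematical Sciences", ListBuffer("applications", "asymptotic")),
    "not a list"
  )

  // Partial function: defined only for a two-element List of (String, some Seq).
  // The element type of the Seq is erased at runtime, so it is matched as Seq[_]
  // and each element is converted with toString.
  val toPair: PartialFunction[Any, (String, Seq[String])] = {
    case List(s0: String, s1: scala.collection.Seq[_]) =>
      (s0, s1.map(_.toString).toList)
  }

  def main(args: Array[String]): Unit = {
    // map applies the function to every record and throws scala.MatchError on the bad one.
    println(Try(records.map(toPair)).isFailure)

    // collect keeps only the records for which the partial function is defined.
    println(records.collect(toPair))
  }
}
```

Spark's RDD also has a collect overload that takes a PartialFunction, so the same toPair could be passed as rdd.collect(toPair) if silently dropping non-matching records is acceptable for your data.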

As mentioned in the question, if you have

val strList = Array(List("Mathematical Sciences", ListBuffer("applications", "asymptotic", "largest", "enable", "stochastic")))
val rdd = sc.makeRDD(strList)

which has the following type:

rdd: org.apache.spark.rdd.RDD[List[java.io.Serializable]]

you can convert it to the required type

res0: org.apache.spark.rdd.RDD[(String, Seq[String])]

by simply using map and casting the elements:

rdd.map(x => (x(0).toString, x(1).asInstanceOf[ListBuffer[String]].toSeq))
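One thing to keep in mind with the cast: asInstanceOf[ListBuffer[String]] succeeds only if the second element really is a ListBuffer. The sketch below applies the same conversion to plain List[Any] values, without Spark. The names CastSketch, viaListBuffer and viaSeq are hypothetical, and toList is used instead of toSeq only to get an immutable result on any Scala version. It shows the cast failing for an immutable List, and a more lenient variant that casts to scala.collection.Seq.

```scala
import scala.collection.mutable.ListBuffer
import scala.util.Try

object CastSketch {
  // Same conversion as in the answer, applied to one record.
  def viaListBuffer(x: List[Any]): (String, Seq[String]) =
    (x(0).toString, x(1).asInstanceOf[ListBuffer[String]].toList)

  // Variant that accepts any scala.collection.Seq, mutable or immutable.
  def viaSeq(x: List[Any]): (String, Seq[String]) =
    (x(0).toString, x(1).asInstanceOf[scala.collection.Seq[String]].toList)

  def main(args: Array[String]): Unit = {
    val withBuffer = List("Mathematical Sciences", ListBuffer("applications", "asymptotic"))
    val withList   = List("Mathematical Sciences", List("applications", "asymptotic"))

    println(viaListBuffer(withBuffer))
    // An immutable List is not a ListBuffer, so this cast throws ClassCastException.
    println(Try(viaListBuffer(withList)).isFailure)
    println(viaSeq(withList))
  }
}
```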

I hope the answer is helpful.

Finally, it worked. I get a warning, but it works:

val rdd = sc.makeRDD(strList)

val result = rdd.map { case List(s0: String, s1: Seq[String]) => (s0, s1) }

<console>:32: warning: non-variable type argument String in type pattern Seq[String] (the underlying of Seq[String]) is unchecked since it is eliminated by erasure
val result = rdd.map { case List(s0: String, s1: Seq[String]) => (s0, s1) }
                                                 ^
result: org.apache.spark.rdd.RDD[(String, Seq[String])] = MapPartitionsRDD[1051] at map at <console>:32
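The warning means that at runtime only "is this a Seq?" is tested; the String element type is erased and not verified. A small sketch of what that implies, in plain Scala without Spark: the names ErasureSketch, unchecked and checked are hypothetical, and scala.collection.Seq is imported as CSeq so that mutable buffers also match on newer Scala versions. The first function reproduces the unchecked match (with the warning silenced), the second checks the elements explicitly.

```scala
import scala.collection.{Seq => CSeq}
import scala.util.Try

object ErasureSketch {
  // Only the container type is tested at runtime; @unchecked silences the erasure warning.
  def unchecked(x: Any): Option[CSeq[String]] = x match {
    case s: CSeq[String @unchecked] => Some(s)
    case _                          => None
  }

  // Tests every element, so a Seq holding non-strings is rejected.
  def checked(x: Any): Option[CSeq[String]] = x match {
    case s: CSeq[_] if s.forall(_.isInstanceOf[String]) => Some(s.map(_.toString))
    case _                                              => None
  }

  def main(args: Array[String]): Unit = {
    val ints: Any = List(1, 2, 3)

    // The unchecked match accepts a List[Int] as if it were a Seq[String].
    val wrong = unchecked(ints)
    println(wrong.isDefined)

    // The error only surfaces later, when an element is actually used as a String.
    println(Try(wrong.get.head.length).isFailure)

    // The checked variant rejects the integers and accepts real strings.
    println(checked(ints))
    println(checked(List("a", "b")))
  }
}
```

If the input is known to contain only strings, the warning is harmless; the explicit check is mainly useful when the data source is not trusted.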

Thank you.
