How to convert RDD[(String, Any)] to Array(Row)?
I've got an unstructured RDD with keys and values. The values are of type RDD[Any] and mainly contain Maps; the keys are currently Strings (RDD[String]).
I would like to convert them to type Row so I can eventually make a DataFrame.
Here is my RDD:
(removed)
Most of the RDD follows a pattern, except for the last 4 keys. How should these be dealt with? Perhaps split them into their own RDD, especially for reverseDeltas?
Thanks
Edit
This is what I've tried so far, based on the first answer below.
case class MyData(`type`: List[String], libVersion: Double, id: BigInt)
object MyDataBuilder {
  def apply(s: Any): MyData = {
    // read the input data and convert that to the case class
    s match {
      case Array(x: List[String], y: Double, z: BigInt) => MyData(x, y, z)
      case Array(a: BigInt, Array(x: List[String], y: Double, z: BigInt)) => MyData(x, y, z)
      case _ => null
    }
  }
}
val parsedRdd: RDD[MyData] = rdd.map(x => MyDataBuilder(x))
However, it doesn't seem to match any of those cases. How can I match on a Map in Scala? I keep getting nulls back when printing out parsedRdd.
To convert the RDD to a DataFrame, you need to have a fixed schema. Once you define the schema for the RDD, the rest is simple.
Something like:
// getParsedRow is a function you write that turns one record into its column values
val rdd2: RDD[Array[String]] = rdd.map(x => getParsedRow(x))
val rddFinal: RDD[Row] = rdd2.map(x => Row.fromSeq(x))
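The fixed schema mentioned above can be made explicit with a StructType and passed to createDataFrame together with the RDD[Row]. This is a minimal sketch, not the answerer's own code: the object name RowSchemaSketch and the column names "key" and "value" are hypothetical, and it assumes every parsed record yields exactly two string fields.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object RowSchemaSketch {
  // Hypothetical two-column schema; every Row must carry these fields in this order.
  val schema: StructType = StructType(Seq(
    StructField("key", StringType, nullable = false),
    StructField("value", StringType, nullable = true)
  ))

  // Pairs an RDD[Row] with the fixed schema to build a DataFrame.
  def toDataFrame(spark: SparkSession, rows: RDD[Row]): DataFrame =
    spark.createDataFrame(rows, schema)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("row-schema-sketch")
      .getOrCreate()

    // Stand-in for the output of a parsing step such as getParsedRow.
    val parsed: RDD[Array[String]] =
      spark.sparkContext.parallelize(Seq(Array("a", "1"), Array("b", "2")))
    val rows: RDD[Row] = parsed.map(fields => Row.fromSeq(fields.toSeq))

    val df = toDataFrame(spark, rows)
    df.printSchema()
    df.show()
    spark.stop()
  }
}
```

If an Array[Row] is what you actually need, as in the question title, calling collect() on the RDD[Row] returns one, at the cost of pulling all rows to the driver.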
Alternatively:
case class MyData(....) // all the fields of the Schema I want
object MyDataBuilder {
  def apply(s: Any): MyData = {
    // read the input data and convert that to the case class
  }
}
val rddFinal: RDD[MyData] = rdd.map(x => MyDataBuilder(x))
import spark.implicits._
val myDF = rddFinal.toDF
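For the body of the builder above, here is a minimal sketch of matching on a Map rather than an Array. The original data was removed from the question, so the shape is an assumption: each value is taken to be a Map whose keys "type", "libVersion" and "id" mirror the case class fields from the question's edit. Returning an Option instead of null is a design choice of this sketch, not part of the original answer.

```scala
// Field names follow the case class shown in the question's edit.
case class MyData(`type`: List[String], libVersion: Double, id: BigInt)

object MyDataBuilder {
  // Returns None when the record does not have the assumed shape.
  def apply(s: Any): Option[MyData] = s match {
    // Type arguments are erased at runtime, so match on Map[_, _]
    // and then inspect the individual entries.
    case m: Map[_, _] =>
      val fields = m.asInstanceOf[Map[String, Any]]
      (fields.get("type"), fields.get("libVersion"), fields.get("id")) match {
        case (Some(t: List[_]), Some(v: Double), Some(i: BigInt)) =>
          Some(MyData(t.map(_.toString), v, i))
        case _ => None
      }
    case _ => None
  }
}
```

Note that mapping over an RDD[(String, Any)] hands the function a (key, value) tuple, so passing the whole element to the builder can never match a Map or an Array pattern. Something like `rdd.flatMap { case (_, v) => MyDataBuilder(v).toList }` passes only the value and drops the records that do not fit.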
There is a method for converting an RDD to a DataFrame. Use it as shown below:
import spark.implicits._ // brings toDF into scope; assumes a SparkSession named spark
val rdd = sc.textFile("/pathtologfile/logfile.txt")
val df = rdd.toDF()
Now that you have a DataFrame, you can do whatever you want with it using SQL-style queries, like below:
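import org.apache.spark.sql.functions.col
import spark.implicits._ // assumes a SparkSession named spark
val textFile = sc.textFile("hdfs://...")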
// Creates a DataFrame having a single column named "line"
val df = textFile.toDF("line")
val errors = df.filter(col("line").like("%ERROR%"))
// Counts all the errors
errors.count()
// Counts errors mentioning MySQL
errors.filter(col("line").like("%MySQL%")).count()
// Fetches the MySQL errors as an array of strings
errors.filter(col("line").like("%MySQL%")).collect()
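The calls above use the DataFrame API. The same filter can also be written as a literal SQL query by registering the DataFrame as a temporary view. This is a minimal sketch: the object name SqlOnLinesSketch, the function countMysqlErrors and the view name "logs" are hypothetical, and an in-memory Seq stands in for the text file.

```scala
import org.apache.spark.sql.SparkSession

object SqlOnLinesSketch {
  // Counts lines containing both ERROR and MySQL via a SQL query on a temporary view.
  def countMysqlErrors(spark: SparkSession, lines: Seq[String]): Long = {
    import spark.implicits._
    val df = lines.toDF("line")
    df.createOrReplaceTempView("logs") // "logs" is a hypothetical view name
    spark
      .sql("SELECT line FROM logs WHERE line LIKE '%ERROR%' AND line LIKE '%MySQL%'")
      .count()
  }
}
```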