Convert CSV to JSON to Pair RDD in Scala Spark
I have CSV data. I want to first convert it to JSON, and then convert that into a pair RDD.
I am able to do both, but I am not sure whether my approach is efficient, and the keys are not in the expected format.
import scala.util.parsing.json.JSON

val df = ??? // somehow read the CSV data
val dataset = df.toJSON // this gives the expected JSON
// Map.get returns an Option, so the key here is Option[String]
val pairRDD = dataset.rdd.map(record =>
  (JSON.parseFull(record).get.asInstanceOf[Map[String, String]].get("hashKey"), record))
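For the first line, one common way to load the CSV is shown below (a sketch; the header option and the data.csv path are assumptions about the input file):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("csv-to-pair-rdd")
  .master("local[*]") // assumption: local run
  .getOrCreate()

// "data.csv" is a hypothetical path; header = true assumes the file has a header row
val df = spark.read.option("header", "true").csv("data.csv")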
Assume my schema is:
root
|-- hashKey: string (nullable = true)
|-- sortKey: string (nullable = true)
 |-- score: double (nullable = true)
|-- payload: string (nullable = true)
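For reference, that schema can be declared explicitly when reading the CSV instead of relying on inference (a sketch; the field types are assumptions based on the sample data):

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("hashKey", StringType, nullable = true),
  StructField("sortKey", StringType, nullable = true),
  StructField("score", DoubleType, nullable = true),
  StructField("payload", StringType, nullable = true)
))

val df = spark.read.option("header", "true").schema(schema).csv("data.csv")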
In JSON:
{
"hashKey" : "h1",
"sortKey" : "s1",
"score" : 1.0,
"payload" : "data"
}
{
"hashKey" : "h2",
"sortKey" : "s2",
"score" : 1.0,
"payload" : "data"
}
EXPECTED result should be
[h1, {"hashKey" : "h1", "sortKey" : "s1", "score" : 1.0, "payload" : "data"}]
[h2, {"hashKey" : "h2", "sortKey" : "s2", "score" : 1.0, "payload" : "data"}]
ACTUAL result I am getting
[Some(h1), {"hashKey" : "h1", "sortKey" : "s1", "score" : 1.0, "payload" : "data"}]
[Some(h2), {"hashKey" : "h2", "sortKey" : "s2", "score" : 1.0, "payload" : "data"}]
How can I fix this?
This happens because of get("hashKey"), which returns an Option. Change it to getOrElse("hashKey", "{defaultKey}"), where the default key can be "" or a constant declared earlier.
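Applied to the original mapping, that fix looks like this (a sketch; the empty-string default is just illustrative):

import scala.util.parsing.json.JSON

val pairRDD = dataset.rdd.map { record =>
  val fields = JSON.parseFull(record).get.asInstanceOf[Map[String, String]]
  // getOrElse returns the raw value instead of an Option, so the key prints as h1, not Some(h1)
  (fields.getOrElse("hashKey", ""), record)
}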
Update: for a safer Scala approach (instead of using asInstanceOf), it is better to change your JSON parsing to:
val pairRDD = dataset.rdd.flatMap { record =>
  JSON.parseFull(record).map {
    // @unchecked silences the unavoidable type-erasure warning on the Map pattern
    case json: Map[String, String] @unchecked => (json.getOrElse("hashKey", ""), record)
    case _ => ("", "")
  }.filter { case (key, rec) => key != "" && rec != "" }
}

Using flatMap (rather than map) unwraps the Option returned by parseFull, so malformed records are dropped and the result is a proper pair RDD rather than an RDD of Options.
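As a quick check (assuming the two sample records above), collecting the result now prints plain keys instead of Some(...):

pairRDD.collect().foreach { case (key, json) => println(s"[$key, $json]") }
// [h1, {"hashKey" : "h1", "sortKey" : "s1", "score" : 1.0, "payload" : "data"}]
// [h2, {"hashKey" : "h2", "sortKey" : "s2", "score" : 1.0, "payload" : "data"}]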