
Convert CSV to JSON to Pair RDD in Scala Spark

I have CSV data. I want to first convert it to JSON, and then convert that into a pair RDD.

I am able to do both steps, but I am not sure whether my approach is efficient, and the keys are not in the expected format.


    val df = ??? // somehow read the CSV data into a DataFrame
    val dataset = df.toJSON // this gives the expected JSON
    val pairRDD = dataset.rdd.map(record =>
      (JSON.parseFull(record).get.asInstanceOf[Map[String, String]].get("hashKey"), record))
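
For reference, a minimal sketch of the CSV-reading step using Spark's built-in CSV reader ("data.csv" and the header/inferSchema options are assumptions; adjust them to your input):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("csv-to-pair-rdd").getOrCreate()
    // read the CSV with a header row and let Spark infer the column types;
    // "data.csv" is a placeholder path
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data.csv")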

Suppose my schema is:


    root
     |-- hashKey: string (nullable = true)
     |-- sortKey: string (nullable = true)
     |-- score: double (nullable = true)
     |-- payload: string (nullable = true)


    In JSON:
    {
    "hashKey" : "h1",
    "sortKey" : "s1",
    "score" : 1.0,
    "payload" : "data"
    }
    {
    "hashKey" : "h2",
    "sortKey" : "s2",
    "score" : 1.0,
    "payload" : "data"
    }

    EXPECTED result should be
    [1, {"hashKey" : "1", "sortKey" : "2", "score" : 1.0, "payload" : "data"} ]
    [2, {"hashKey" : "h2", "sortKey" : "s2", "score" : 1.0, "payload" : "data"}]


    ACTUAL result I am getting
    [Some(1), {"hashKey" : "1", "sortKey" : "2", "score" : 1.0, "payload" : "data"} ]
    [Some(2), {"hashKey" : "h2", "sortKey" : "s2", "score" : 1.0, "payload" : "data"}]

How can I fix this?

This happens because get("hashKey") returns an Option, which prints as Some(...). Change it to getOrElse("hashKey", "{defaultKey}"), where the default key can be "" or a constant declared earlier.
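
To see the difference (a minimal sketch with a hypothetical one-entry map):

    val json = Map("hashKey" -> "h1")
    json.get("hashKey")            // Some(h1) -- Map.get wraps the value in an Option
    json.getOrElse("hashKey", "")  // h1 -- returns the value directly, or "" when the key is missing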

Update: for a safer Scala approach (instead of using asInstanceOf),

it is better to change your JSON parsing to:

    dataset.rdd.map { record =>
      JSON.parseFull(record) match {
        // the Map type check is unchecked due to erasure; values are assumed to be strings
        case Some(json: Map[String, String] @unchecked) => (json.getOrElse("hashKey", ""), record)
        case _ => ("", "")
      }
    }.filter { case (key, record) => key != "" && record != "" }
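
As a side note, the double parse (DataFrame to JSON string, then JSON string back to a map) can be avoided entirely. A sketch, assuming df has the schema shown above, that keeps the hashKey column and serializes each row to JSON once:

    import org.apache.spark.sql.functions.{col, struct, to_json}

    // pair each row's hashKey with the row serialized as a JSON string
    val pairRDD = df
      .select(col("hashKey"), to_json(struct(df.columns.map(col): _*)).as("json"))
      .rdd
      .map(row => (row.getAs[String]("hashKey"), row.getAs[String]("json")))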
