Spark: Create DataFrame from a JSON string and a string (Scala)
I have a JSON string and a separate string that I'd like to turn into a DataFrame.
val body = """{
| "time": "2020-07-01T17:17:15.0495314Z",
| "ver": "4.0",
| "name": "samplename",
| "iKey": "o:something",
| "random": {
| "stuff": {
| "eventFlags": 258,
| "num5": "DHM",
| "num2": "something",
| "flags": 415236612,
| "num1": "4004825",
| "seq": 44
| },
| "banana": {
| "id": "someid",
| "ver": "someversion",
| "asId": 123
| },
| "something": {
| "example": "somethinghere"
| },
| "apple": {
| "time": "2020-07-01T17:17:37.874Z",
| "flag": "something",
| "userAgent": "someUserAgent",
| "auth": 12,
| "quality": 0
| },
| "loc": {
| "country": "US"
| }
| },
| "EventEnqueuedUtcTime": "2020-07-01T17:17:59.804Z"
|}
|""".stripMargin
val offset = "10"
I tried:
val data = Seq(body, offset)
val columns = Seq("body","offset")
import sparkSession.sqlContext.implicits._
val df = data.toDF(columns:_*)
As well as:
val data = Seq(body, offset)
val rdd = sparkSession.sparkContext.parallelize((data))
val dfFromRdd = rdd.toDF("body", "offset")
dfFromRdd.show(20, false)
but for both I get this error: "value toDF is not a member of org.apache.spark.RDD[String]"
Is there a different way I can create a DataFrame that has one column with my JSON body data and another column with my offset string value?
Edit: I've also tried the following:
val offset = "1000"
val data = Seq(body, offset)
val rdd = sparkSession.sparkContext.parallelize((data))
val dfFromRdd = rdd.toDF("body", "offset")
dfFromRdd.show(20, false)
and get a column-mismatch error: "The number of columns doesn't match. Old column names (1): value New column names (2): body, offset"
I don't understand why my data has the column name "value".
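For context on the "value" column: `Seq(body, offset)` is a `Seq[String]` with two elements, and `toDF` maps each element to a row, producing a single column that Spark names `value` by default. Renaming one column to two names is exactly the mismatch the error reports. A minimal sketch (assuming a local SparkSession; the strings are placeholders):

```scala
import org.apache.spark.sql.SparkSession

object SingleColumnDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("single-column-demo")
      .getOrCreate()
    import spark.implicits._

    // A Seq[String] maps each element to a ROW, not a column,
    // and the lone column is named "value" by default.
    val df = Seq("json-body-here", "10").toDF()
    df.printSchema()
    // root
    //  |-- value: string (nullable = true)

    // df.toDF("body", "offset") would fail here: 1 column vs 2 new names.

    spark.stop()
  }
}
```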
I guess the issue is with your `Seq` syntax; the elements should be tuples. The code below has worked for me:
val data = Seq((body, offset)) // <--- Check this line
val columns = Seq("body","offset")
import sparkSession.sqlContext.implicits._
data.toDF(columns:_*).printSchema()
/*
root
 |-- body: string (nullable = true)
 |-- offset: string (nullable = true)
*/
data.toDF(columns:_*).show()
/*
+--------------------+------+
|                body|offset|
+--------------------+------+
|{
  "time": "2020...|    10|
+--------------------+------+
*/
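If the goal is to eventually query fields inside the JSON body, one option is to keep the tuple-based DataFrame and parse the `body` column with `from_json`, inferring the schema from a sample of the JSON via `spark.read.json` on a `Dataset[String]`. This is a sketch under that assumption, with a shortened `body` for brevity:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}

object ParseBodyDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("parse-body-demo")
      .getOrCreate()
    import spark.implicits._

    val body = """{"time": "2020-07-01T17:17:15.0495314Z", "ver": "4.0", "name": "samplename"}"""
    val offset = "10"

    // Tuple element -> one row with two columns, as in the fix above.
    val df = Seq((body, offset)).toDF("body", "offset")

    // Infer a schema from the sample JSON, then parse the string column.
    val schema = spark.read.json(Seq(body).toDS()).schema
    val parsed = df.withColumn("parsed", from_json(col("body"), schema))
    parsed.select(col("offset"), col("parsed.name"), col("parsed.time")).show(false)

    spark.stop()
  }
}
```

Inferring the schema from one sample is convenient for exploration; for production you would normally declare an explicit `StructType` so missing fields are caught rather than silently nulled.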