Spark: Create DataFrame from a JSON string and a string (Scala)
I have a JSON string and a separate string that I'd like to turn into a DataFrame.
val body = """{
| "time": "2020-07-01T17:17:15.0495314Z",
| "ver": "4.0",
| "name": "samplename",
| "iKey": "o:something",
| "random": {
| "stuff": {
| "eventFlags": 258,
| "num5": "DHM",
| "num2": "something",
| "flags": 415236612,
| "num1": "4004825",
| "seq": 44
| },
| "banana": {
| "id": "someid",
| "ver": "someversion",
| "asId": 123
| },
| "something": {
| "example": "somethinghere"
| },
| "apple": {
| "time": "2020-07-01T17:17:37.874Z",
| "flag": "something",
| "userAgent": "someUserAgent",
| "auth": 12,
| "quality": 0
| },
| "loc": {
| "country": "US"
| }
| },
| "EventEnqueuedUtcTime": "2020-07-01T17:17:59.804Z"
|}
|""".stripMargin
val offset = "10"
I tried:
val data = Seq(body, offset)
val columns = Seq("body","offset")
import sparkSession.sqlContext.implicits._
val df = data.toDF(columns:_*)
As well as:
val data = Seq(body, offset)
val rdd = sparkSession.sparkContext.parallelize((data))
val dfFromRdd = rdd.toDF("body", "offset")
dfFromRdd.show(20, false)
but for both I get this error: "value toDF is not a member of org.apache.spark.RDD[String]"
Is there a different way I can create a DataFrame that has one column with my JSON body data and another column with my offset string value?
Edit: I've also tried the following:
val offset = "1000"
val data = Seq(body, offset)
val rdd = sparkSession.sparkContext.parallelize((data))
val dfFromRdd = rdd.toDF("body", "offset")
dfFromRdd.show(20, false)
and get a column-mismatch error: "The number of columns doesn't match. Old column names (1): value New column names (2): body, offset"
I don't understand why my data has the column name "value".
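For context on the "value" column: `Seq(body, offset)` is a `Seq[String]` with two elements, and `toDF` maps each element to a row, producing a single column that Spark names `value` by default. Renaming one column to two names is exactly the mismatch the error reports. A minimal sketch (assuming a local SparkSession; the strings are placeholders):

```scala
import org.apache.spark.sql.SparkSession

object SingleColumnDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("single-column-demo")
      .getOrCreate()
    import spark.implicits._

    // A Seq[String] maps each element to a ROW, not a column,
    // and the lone column is named "value" by default.
    val df = Seq("json-body-here", "10").toDF()
    df.printSchema()
    // root
    //  |-- value: string (nullable = true)

    // df.toDF("body", "offset") would fail here: 1 column vs 2 new names.

    spark.stop()
  }
}
```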
I guess the issue is with your `Seq` syntax; the elements should be tuples. The code below has worked for me:
val data = Seq((body, offset)) // <--- Check this line
val columns = Seq("body","offset")
import sparkSession.sqlContext.implicits._
data.toDF(columns:_*).printSchema()
/*
root
 |-- body: string (nullable = true)
 |-- offset: string (nullable = true)
*/
data.toDF(columns:_*).show()
/*
+--------------------+------+
|                body|offset|
+--------------------+------+
|{
  "time": "2020...|    10|
+--------------------+------+
*/
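If the goal is to eventually query fields inside the JSON body, one option is to keep the tuple-based DataFrame and parse the `body` column with `from_json`, inferring the schema from a sample of the JSON via `spark.read.json` on a `Dataset[String]`. This is a sketch under that assumption, with a shortened `body` for brevity:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}

object ParseBodyDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("parse-body-demo")
      .getOrCreate()
    import spark.implicits._

    val body = """{"time": "2020-07-01T17:17:15.0495314Z", "ver": "4.0", "name": "samplename"}"""
    val offset = "10"

    // Tuple element -> one row with two columns, as in the fix above.
    val df = Seq((body, offset)).toDF("body", "offset")

    // Infer a schema from the sample JSON, then parse the string column.
    val schema = spark.read.json(Seq(body).toDS()).schema
    val parsed = df.withColumn("parsed", from_json(col("body"), schema))
    parsed.select(col("offset"), col("parsed.name"), col("parsed.time")).show(false)

    spark.stop()
  }
}
```

Inferring the schema from one sample is convenient for exploration; for production you would normally declare an explicit `StructType` so missing fields are caught rather than silently nulled.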