Convert JSON to a DataFrame using Apache Spark with Scala
I'm picking Apache Spark back up after a long time away from it, and I'm trying to convert this MongoDB document string:
{
  "_id": {
    "$oid": "601de7179acebcfb50c8f347"
  },
  "timestamp": {
    "$numberLong": "1612572439411"
  },
  "newsdata": {
    "test1": ["n1", "n2"],
    "test2": ["n3", "n4"]
  }
}
using:
package sparkanalysis

import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val mongoString = "{\"_id\":{\"$oid\":\"601de7179acebcfb50c8f347\"}," +
      "\"timestamp\":{\"$numberLong\":\"1612572439411\"}," +
      "\"newsdata\":{" +
      "\"test1\" : [\"n1\",\"n2\"]" +
      ",\"test2\" : [\"n3\",\"n4\"]}}"
    print(mongoString)

    val spark = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config("spark.some.config.option", "some-value")
      .config("spark.master", "local[*]")
      .getOrCreate()

    val df = spark.read.json(mongoString)
    println(df)
  }
}
But I get an exception:
WARNING: All illegal access operations will be denied in a future release
Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: {"_id":%7B%22$oid%22:%22601de7179acebcfb50c8f347%22%7D,%22timestamp%22:%7B%22$numberLong%22:%221612572439411%22%7D,%22newsdata%22:%7B%22test1%22%20:%20%5B%22n1%22,%22n2%22%5D,%22test2%22%20:%20%5B%22n3%22,%22n4%22%5D%7D%7D
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)
at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:546)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:389)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:241)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:238)
at scala.collection.immutable.List.flatMap(List.scala:352)
at org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:545)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:391)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:325)
at sparkanalysis.WordCount$.main(WordCount.scala:24)
at sparkanalysis.WordCount.main(WordCount.scala)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: {"_id":%7B%22$oid%22:%22601de7179acebcfb50c8f347%22%7D,%22timestamp%22:%7B%22$numberLong%22:%221612572439411%22%7D,%22newsdata%22:%7B%22test1%22%20:%20%5B%22n1%22,%22n2%22%5D,%22test2%22%20:%20%5B%22n3%22,%22n4%22%5D%7D%7D
at java.base/java.net.URI.checkPath(URI.java:1940)
at java.base/java.net.URI.<init>(URI.java:757)
at org.apache.hadoop.fs.Path.initialize(Path.java:203)
... 15 more
Process finished with exit code 1
I have verified that the JSON is valid using https://jsonlint.com/. Do I need to define a case class in order to convert this to a DataFrame correctly?
You can't pass a JSON string directly to .read.json — it accepts either the path of a JSON file as a String, or the data itself as a Dataset[String]. That's why Spark is treating your string as a file path and failing with URISyntaxException. You can load the string into a DataFrame like this:
import spark.implicits._ // required to bring .toDS into scope

val ds = Seq(mongoString).toDS
val df = spark.read.json(ds)
scala> df.show()
+--------------------+--------------------+---------------+
| _id| newsdata| timestamp|
+--------------------+--------------------+---------------+
|[601de7179acebcfb...|[[n1, n2], [n3, n4]]|[1612572439411]|
+--------------------+--------------------+---------------+
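Putting the fix into the original program, a minimal end-to-end sketch might look like the following (the JSON is kept on a single line because read.json on a Dataset[String] expects one JSON document per element; printSchema is added here only to show the inferred structure):

```scala
package sparkanalysis

import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // Triple-quoted string avoids escaping every quote in the JSON
    val mongoString =
      """{"_id":{"$oid":"601de7179acebcfb50c8f347"},"timestamp":{"$numberLong":"1612572439411"},"newsdata":{"test1":["n1","n2"],"test2":["n3","n4"]}}"""

    val spark = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .master("local[*]")
      .getOrCreate()

    // The session's implicits provide the .toDS extension method
    import spark.implicits._

    // Wrap the string in a Dataset[String] so read.json parses it as data,
    // not as a file path
    val df = spark.read.json(Seq(mongoString).toDS)

    df.printSchema() // shows the inferred nested struct and array columns
    df.show()

    spark.stop()
  }
}
```

As for the last part of the question: no case class is needed — read.json infers the nested schema automatically. A case class only becomes useful if you want a strongly typed Dataset afterwards, via .as[YourCaseClass] (where YourCaseClass is a type you would define to match the inferred schema).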