[英]Unable to fetch Json Column using sparkDataframe:org.apache.spark.sql.AnalysisException: cannot resolve 'explode;
[英]Reading Nested JSON via Spark SQL - [AnalysisException] cannot resolve Column
我有这样的JSON数据:
{
"parent":[
{
"prop1":1.0,
"prop2":"C",
"children":[
{
"child_prop1":[
"3026"
]
}
]
}
]
}
从Spark读取数据后,我得到以下架构:
val df = spark.read.json("test.json")
df.printSchema
root
|-- parent: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- children: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- child_prop1: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
| | |-- prop1: double (nullable = true)
| | |-- prop2: string (nullable = true)
现在,我想从df
选择child_prop1
。 但是当我尝试选择它时,我得到了org.apache.spark.sql.AnalysisException
。 像这样的东西:
df.select("parent.children.child_prop1")
org.apache.spark.sql.AnalysisException: cannot resolve '`parent`.`children`['child_prop1']' due to data type mismatch: argument 2 requires integral type, however, ''child_prop1'' is of string type.;;
'Project [parent#60.children[child_prop1] AS child_prop1#63]
+- Relation[parent#60] json
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:82)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:282)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:292)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:296)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:296)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$7.apply(QueryPlan.scala:301)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:301)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2822)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1121)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1139)
... 48 elided
虽然,当我从df
选择只有children
,它工作正常。
df.select("parent.children").show(false)
+------------------------------------+
|children |
+------------------------------------+
|[WrappedArray([WrappedArray(3026)])]|
+------------------------------------+
即使列存在于数据帧中,我也无法理解为什么它会给出异常。
任何帮助表示赞赏!
你的Json是一个有效的json,我认为你不需要改变你的输入数据。
使用explode获取数据
import org.apache.spark.sql.functions.explode
val data = spark.read.json("src/test/java/data.json")
val child = data.select(explode(data("parent.children"))).toDF("children")
child.select(explode(child("children.child_prop1"))).toDF("child_prop1").show()
如果您可以更改输入数据,则可以关注@ramesh建议
如果你看一下架构, child_prop1
就是在根数组parent
nested array
里面。 所以我们需要能够定义child_prop1
的position
,这就是错误建议你定义的内容。
转换你的json
格式应该可以解决问题。
将json
更改为
{"parent":{"prop1":1.0,"prop2":"C","children":{"child_prop1":["3026"]}}}
并应用
df.select("parent.children.child_prop1").show(false)
将输出作为
+-----------+
|child_prop1|
+-----------+
|[3026] |
+-----------+
和
将json
更改为
{"parent":{"prop1":1.0,"prop2":"C","children":[{"child_prop1":["3026"]}]}}
并应用
df.select("parent.children.child_prop1").show(false)
将导致
+--------------------+
|child_prop1 |
+--------------------+
|[WrappedArray(3026)]|
+--------------------+
我希望答案有所帮助
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.