How to convert a DataFrame to a Dataset when one class holds a reference to a parent class as a composed field?
I am trying to convert a DataFrame to a Dataset, and the Java class structure is as follows:
Class A:
public class A {
private int a;
public int getA() {
return a;
}
public void setA(int a) {
this.a = a;
}
}
Class B:
public class B extends A {
private int b;
public int getB() {
return b;
}
public void setB(int b) {
this.b = b;
}
}
and class C:
public class C {
private A a;
public A getA() {
return a;
}
public void setA(A a) {
this.a = a;
}
}
and the data in the DataFrame is as follows:
+-----+
| a |
+-----+
|[1,2]|
+-----+
When I apply Encoders.bean[C](classOf[C]) to the DataFrame, the object held in field a of class C, which should be an instance of B, does not pass the .isInstanceOf[B] check: it returns false instead of true. The output of the Dataset is as follows:
+-----+
| a |
+-----+
|[1,2]|
+-----+
How do we get all the fields of A and B under the C object while iterating over the Dataset in foreach?
Code:
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.types.{ArrayType, IntegerType, StructType}

object TestApp extends App {
  implicit val sparkSession: SparkSession = SparkSession.builder()
    .appName("Test-App")
    .config("spark.sql.codegen.wholeStage", value = false)
    .master("local[1]")
    .getOrCreate()

  // Column "a" is read as an array of struct<a:int, b:int>
  val schema = new StructType()
    .add("a", ArrayType(new StructType().add("a", IntegerType, true).add("b", IntegerType, true), true))
  val dd = sparkSession.read.schema(schema).json("Test.txt")
  val ff = dd.as(Encoders.bean[C](classOf[C]))
  ff.show(truncate = false)
  ff.foreach(f => {
    println(f.getA.get(0).isInstanceOf[A]) // true
    println(f.getA.get(0).isInstanceOf[B]) // false
  })
}
Content of the file Test.txt: {"a":[{"a":1,"b":2}]}
Spark Catalyst uses Google reflection to derive a schema from Java beans. Take a look at JavaTypeInference.scala#inferDataType. This class uses the getters to collect each field name, and the return type of each getter to compute the corresponding Spark type.
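To make the getter-driven inference concrete, here is a minimal sketch in plain Java. It is not Spark's code: it uses the JDK's java.beans.Introspector, and the class SchemaSketch and the helper describe are hypothetical names introduced only for illustration. It assumes every property is either an int or another bean, and it builds a schema string purely from declared getter return types.

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SchemaSketch {
    public static class A {
        private int a;
        public int getA() { return a; }
        public void setA(int a) { this.a = a; }
    }

    public static class B extends A {
        private int b;
        public int getB() { return b; }
        public void setB(int b) { this.b = b; }
    }

    public static class C {
        private A a;
        public A getA() { return a; }
        public void setA(A a) { this.a = a; }
    }

    // Hypothetical helper (not a Spark API): describes a bean type using only
    // the declared return types of its getters.
    public static String describe(Class<?> type) {
        if (type == int.class || type == Integer.class) {
            return "int";
        }
        try {
            List<String> fields = new ArrayList<>();
            // Stop at Object so the synthetic "class" property is excluded.
            for (PropertyDescriptor p
                    : Introspector.getBeanInfo(type, Object.class).getPropertyDescriptors()) {
                if (p.getReadMethod() == null) {
                    continue; // no getter, so nothing to infer from
                }
                fields.add(p.getName() + ":" + describe(p.getReadMethod().getReturnType()));
            }
            Collections.sort(fields); // deterministic field order for the sketch
            return "struct<" + String.join(",", fields) + ">";
        } catch (IntrospectionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(C.class)); // struct<a:struct<a:int>>
        System.out.println(describe(B.class)); // struct<a:int,b:int>
    }
}
```

Because the walk starts from the declared type A of C.getA(), the property b of the subclass B is never visited. It only appears when B itself is the declared type.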
Since class C has a getter named getA() with return type A, and A in turn has a getter getA() with return type int, the schema is created as struct<a:struct<a:int>>, where struct<a:int> is derived from the getA of class A. Inference looks only at the declared return type A, so the field b defined in subclass B never becomes part of the schema, and the object placed in C is built as an A. That is why the isInstanceOf[B] check returns false.
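The decoding side can be illustrated the same way. The following is a rough sketch, not Spark's actual deserializer: DecodeSketch, populate, and demo are hypothetical names, and a nested java.util.Map stands in for a row. It assumes a decoder that instantiates each property's declared type and then calls the setters, which is enough to show why the nested object comes back as an A and the value of b is dropped.

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.HashMap;
import java.util.Map;

public class DecodeSketch {
    public static class A {
        private int a;
        public int getA() { return a; }
        public void setA(int a) { this.a = a; }
    }

    public static class B extends A {
        private int b;
        public int getB() { return b; }
        public void setB(int b) { this.b = b; }
    }

    public static class C {
        private A a;
        public A getA() { return a; }
        public void setA(A a) { this.a = a; }
    }

    // Hypothetical helper: fills a bean from a nested map, always instantiating
    // the declared property type (an assumption made for this sketch).
    public static <T> T populate(Class<T> type, Map<String, Object> row) {
        try {
            T bean = type.getDeclaredConstructor().newInstance();
            for (PropertyDescriptor p
                    : Introspector.getBeanInfo(type, Object.class).getPropertyDescriptors()) {
                if (p.getWriteMethod() == null || !row.containsKey(p.getName())) {
                    continue; // no setter, or no value for this property
                }
                Object value = row.get(p.getName());
                if (value instanceof Map) {
                    @SuppressWarnings("unchecked")
                    Map<String, Object> nested = (Map<String, Object>) value;
                    // The declared type (A for C.a) decides what gets created.
                    value = populate(p.getPropertyType(), nested);
                }
                p.getWriteMethod().invoke(bean, value);
            }
            return bean;
        } catch (ReflectiveOperationException | IntrospectionException e) {
            throw new IllegalStateException(e);
        }
    }

    // Builds the row {"a": {"a": 1, "b": 2}} and decodes it into a C.
    public static C demo() {
        Map<String, Object> inner = new HashMap<>();
        inner.put("a", 1);
        inner.put("b", 2);
        Map<String, Object> row = new HashMap<>();
        row.put("a", inner);
        return populate(C.class, row);
    }

    public static void main(String[] args) {
        C c = demo();
        System.out.println(c.getA() instanceof A); // true
        System.out.println(c.getA() instanceof B); // false
    }
}
```

In this sketch the value 2 for b has nowhere to go, because A exposes no setB. Changing the declared type of the field in C to B is what lets both properties be populated.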
The solution to this problem that I can think of is:
// Modify class C to reference the concrete class B rather than its supertype A
public class C {
private B a;
public B getA() {
return a;
}
public void setA(B a) {
this.a = a;
}
}
Output:
root
|-- a: struct (nullable = true)
| |-- a: integer (nullable = false)
| |-- b: integer (nullable = false)
+------+
|a |
+------+
|[1, 2]|
+------+