
How to convert a Dataframe to a Dataset, having an object reference of the parent class as a composition inside another class?

I am trying to convert a Dataframe to a Dataset, and the Java class structure is as follows:

class A:

public class A {

    private int a;

    public int getA() {
        return a;
    }

    public void setA(int a) {
        this.a = a;
    }
}

class B:

public class B extends A {

    private int b;

    public int getB() {
        return b;
    }

    public void setB(int b) {
        this.b = b;
    }
}

and class C:

public class C {

    private A a;

    public A getA() {
        return a;
    }

    public void setA(A a) {
        this.a = a;
    }
}

and the data in the dataframe is as follows:

+-----+
|  a  |
+-----+
|[1,2]|
+-----+

When I apply Encoders.bean[C](classOf[C]) to the dataframe, the object reference of type A inside class C, which is actually an instance of B, does not test as that subtype: checking .isInstanceOf[B] returns false. The output of the Dataset is as follows:

+-----+
|  a  |
+-----+
|[1,2]|
+-----+

How do we get all the fields of A and B under the C object while iterating over it in foreach?

Code:

import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.types.{ArrayType, IntegerType, StructType}

object TestApp extends App {

  implicit val sparkSession = SparkSession.builder()
    .appName("Test-App")
    .config("spark.sql.codegen.wholeStage", value = false)
    .master("local[1]")
    .getOrCreate()

  // Schema for the JSON input: an array of structs with fields a and b
  val schema = new StructType()
    .add("a", new ArrayType(new StructType().add("a", IntegerType, true).add("b", IntegerType, true), true))

  val dd = sparkSession.read.schema(schema).json("Test.txt")

  val ff = dd.as(Encoders.bean[C](classOf[C]))
  ff.show(truncate = false)

  ff.foreach(f => {
    println(f.getA.get(0).isInstanceOf[A]) // ---true
    println(f.getA.get(0).isInstanceOf[B]) // ---false
  })
}

Content of File: {"a":[{"a":1,"b":2}]}

Spark Catalyst uses Google reflection to derive the schema from Java beans. Please take a look at JavaTypeInference.scala#inferDataType. This class uses the getters to collect the field names, and the return types of those getters to compute the Spark types.

Since class C has a getter named getA() whose return type is A, and A in turn has a getter getA() whose return type is int, the schema is inferred as struct<a:struct<a:int>>, where the inner struct<a:int> is derived from getA of class A. Because only the declared type A is inspected, the extra field of subclass B never makes it into the schema.
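To make this visible, here is a small sketch (my own illustration, not part of the original answer) that prints the schema each bean encoder infers; Encoder.schema is part of the public Spark API:

import org.apache.spark.sql.Encoders

object InferredSchemas extends App {
  // Only the statically declared getter return types are inspected,
  // so the encoder for C never sees B's extra field.
  println(Encoders.bean(classOf[A]).schema.treeString) // a: integer
  println(Encoders.bean(classOf[B]).schema.treeString) // a: integer, b: integer
  println(Encoders.bean(classOf[C]).schema.treeString) // a: struct<a: integer> -- no b
}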

The solution to this problem that I can think of is:

// Modify your class C to hold a reference of the real class rather than its supertype
public class C {

    private B a;

    public B getA() {
        return a;
    }

    public void setA(B a) {
        this.a = a;
    }
}

Output:

root
 |-- a: struct (nullable = true)
 |    |-- a: integer (nullable = false)
 |    |-- b: integer (nullable = false)

+------+
|a     |
+------+
|[1, 2]|
+------+
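For completeness, a minimal verification sketch (my addition, assuming the input file holds a single struct {"a":{"a":1,"b":2}} rather than the array from the question, to match the printed schema above):

val schemaB = new StructType()
  .add("a", new StructType().add("a", IntegerType).add("b", IntegerType))

val ds = sparkSession.read.schema(schemaB).json("Test.txt")
  .as(Encoders.bean(classOf[C]))

ds.foreach(c => {
  println(c.getA.isInstanceOf[B]) // true -- the declared type now carries b
  println(c.getA.getB)            // 2
})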
