
How to convert a Dataframe to a Dataset, having an object reference of the parent class as a composition inside another class?

I am trying to convert a Dataframe to a Dataset, and the Java class structure is as follows:

class A:

public class A {

    private int a;

    public int getA() {
        return a;
    }

    public void setA(int a) {
        this.a = a;
    }
}

class B:

public class B extends A {

    private int b;

    public int getB() {
        return b;
    }

    public void setB(int b) {
        this.b = b;
    }
}

and class C:

public class C {

    private A a;

    public A getA() {
        return a;
    }

    public void setA(A a) {
        this.a = a;
    }
}

and the data in the dataframe is as follows:

+-----+
|  a  |
+-----+
|[1,2]|
+-----+

When I apply Encoders.bean[C](classOf[C]) to the dataframe, the object reference of type A inside class C, which is actually an instance of B, does not test as that subtype: checking .isInstanceOf[B] returns false. The output of the Dataset is as follows:

+-----+
|  a  |
+-----+
|[1,2]|
+-----+

How do we get all the fields of A and B under the C object while iterating over it in foreach?

Code:

import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.types.{ArrayType, IntegerType, StructType}

object TestApp extends App {

  implicit val sparkSession = SparkSession.builder()
    .appName("Test-App")
    .config("spark.sql.codegen.wholeStage", value = false)
    .master("local[1]")
    .getOrCreate()

  // Schema for the JSON input: an array of structs with fields a and b
  val schema = new StructType()
    .add("a", new ArrayType(new StructType().add("a", IntegerType, true).add("b", IntegerType, true), true))

  val dd = sparkSession.read.schema(schema).json("Test.txt")

  val ff = dd.as(Encoders.bean[C](classOf[C]))
  ff.show(truncate = false)

  ff.foreach(f => {
    println(f.getA.get(0).isInstanceOf[A]) // ---true
    println(f.getA.get(0).isInstanceOf[B]) // ---false
  })
}

Content of File: {"a":[{"a":1,"b":2}]}

Spark Catalyst uses Google reflection to derive the schema from Java beans. Please take a look at JavaTypeInference.scala#inferDataType. This class uses the getters to collect the field names, and the return types of those getters to compute the Spark types.

Since class C has a getter named getA() whose return type is A, and A in turn has a getter getA() whose return type is int, the schema is inferred as struct<a:struct<a:int>>, where the inner struct<a:int> is derived from getA of class A. Because only the declared type A is inspected, the extra field of subclass B never makes it into the schema.
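To make this visible, here is a small sketch (my own illustration, not part of the original answer) that prints the schema each bean encoder infers; Encoder.schema is part of the public Spark API:

import org.apache.spark.sql.Encoders

object InferredSchemas extends App {
  // Only the statically declared getter return types are inspected,
  // so the encoder for C never sees B's extra field.
  println(Encoders.bean(classOf[A]).schema.treeString) // a: integer
  println(Encoders.bean(classOf[B]).schema.treeString) // a: integer, b: integer
  println(Encoders.bean(classOf[C]).schema.treeString) // a: struct<a: integer> -- no b
}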

The solution to this problem that I can think of is:

// Modify your class C to hold a reference of the real class rather than its supertype
public class C {

    private B a;

    public B getA() {
        return a;
    }

    public void setA(B a) {
        this.a = a;
    }
}

Output:

root
 |-- a: struct (nullable = true)
 |    |-- a: integer (nullable = false)
 |    |-- b: integer (nullable = false)

+------+
|a     |
+------+
|[1, 2]|
+------+
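For completeness, a minimal verification sketch (my addition, assuming the input file holds a single struct {"a":{"a":1,"b":2}} rather than the array from the question, to match the printed schema above):

val schemaB = new StructType()
  .add("a", new StructType().add("a", IntegerType).add("b", IntegerType))

val ds = sparkSession.read.schema(schemaB).json("Test.txt")
  .as(Encoders.bean(classOf[C]))

ds.foreach(c => {
  println(c.getA.isInstanceOf[B]) // true -- the declared type now carries b
  println(c.getA.getB)            // 2
})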
