How to convert my Java complex object into a Spark DataFrame
I am using Java Spark; below is my code:
JavaRDD<MyComplexEntity> myObjectJavaRDD = resultJavaRDD.flatMap(result -> result.getMyObjects());
DataFrame df = sqlContext.createDataFrame(myObjectJavaRDD, MyComplexEntity.class);
df.saveAsParquetFile("s3a://mybucket/test.parquet");
MyComplexEntity.java
public class MyComplexEntity implements Serializable {
private Identifier identifier;
private boolean isSwitch1True;
private String note;
private java.util.ArrayList<Identifier> secodaryIds;
......
}
Identifier.java
public class Identifier implements Serializable {
private int id;
private String uuid;
......
}
The problem is that step 2 fails when creating the DataFrame from myObjectJavaRDD. How can I convert a list of complex Java objects into a DataFrame? Thanks.
Is there any way you can convert this to Scala?
Scala supports this kind of structure with case classes.
In your case the challenge is that you have a Seq/Array of the inner
case class, corresponding to => private java.util.ArrayList<Identifier> secodaryIds;
So it can be done as follows:
// inner case class Identifier
case class Identifier(Id : Integer , uuid : String)
val innerVal = Seq(Identifier(1,"gsgsg"),Identifier(2,"dvggwgwg"))
// Outer case class MyComplexEntity
case class MyComplexEntity(notes : String, identifierArray : Seq[Identifier])
val outerVal = MyComplexEntity("Hello", innerVal)
Note that =>
outerVal: MyComplexEntity contains a list of Identifier objects, as shown below:
outerVal: MyComplexEntity = MyComplexEntity(Hello,List(Identifier(1,gsgsg), Identifier(2,dvggwgwg)))
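Since both types are plain case classes, such a value can also be turned into a typed Dataset directly; a minimal sketch, assuming a SparkSession named spark is in scope:

```scala
import spark.implicits._

// The encoder for MyComplexEntity (and its nested Identifier) is derived
// automatically from the case class definitions, so no manual schema is needed.
val directDs = Seq(outerVal).toDS
directDs.printSchema() // notes: string, identifierArray: array<struct<Id:int,uuid:string>>
```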
Now the actual Spark way, using a Dataset:
import spark.implicits._
// Convert our input data into the same structure as your MyComplexEntity.
// The only trick is to map a Seq[(Int,String)] => Seq[Identifier].
// Hence we do two mappings: once for the outer case class (MyComplexEntity) and once for the inner Seq of Identifier.
// If we just take this input data and convert it to a Dataset (without any explicit mapping),
// this is how it looks:
val inputData = Seq(("Some DAY",Seq((210,"wert67"),(310,"bill123"))),
("I WILL BE", Seq((420,"henry678"),(1000,"baba123"))),
("Saturday Night",Seq((1000,"Roger123"),(2000,"God345")))
)
val unMappedDs = inputData.toDS
which gives us =>
// See how it is inferred
// unMappedDs: org.apache.spark.sql.Dataset[(String, Seq[(Int, String)])] = [_1: string, _2: array<struct<_1:int,_2:string>>]
But if we map it "properly" =>
// the second element is a Seq[(Int,String)], and we map it into a Seq[Identifier] via x._2.map(y => Identifier(y._1,y._2))
like below:
val resultDs = inputData.toDS.map(x =>MyComplexEntity(x._1,x._2.map(y => Identifier(y._1,y._2))))
resultDs.show(20,false)
we get a structure like =>
resultDs: org.apache.spark.sql.Dataset[MyComplexEntity] = [notes: string, identifierArray: array<struct<Id:int,uuid:string>>]
and the data:
+--------------+--------------------------------+
|notes |identifierArray |
+--------------+--------------------------------+
|Some DAY |[[210,wert67], [310,bill123]] |
|I WILL BE |[[420,henry678], [1000,baba123]]|
|Saturday Night|[[1000,Roger123], [2000,God345]]|
+--------------+--------------------------------+
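To come back to the last step of the original question, the typed Dataset can then be written out to Parquet; a sketch, reusing the s3a path from the question (saveAsParquetFile is the old Spark 1.x API; write.parquet is the current equivalent):

```scala
// The nested array<struct<Id:int,uuid:string>> column round-trips through Parquet unchanged.
resultDs.write.parquet("s3a://mybucket/test.parquet")
```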
With Scala it is easy. Thanks.