
How to flatten nested struct in array?

This is my current schema:

 |-- _id: string (nullable = true)
 |-- person: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- adr1: struct (nullable = true)
 |    |    |    |-- resid: string (nullable = true)

And this is what I want to obtain:

 |-- _id: string (nullable = true)
 |-- person: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- resid: string (nullable = true)

I am using the Java API.

You can use a map transformation:

import java.util.stream.Collectors;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;

Encoder<PeopleFlatten> peopleFlattenEncoder = Encoders.bean(PeopleFlatten.class);

people
  .map((MapFunction<People, PeopleFlatten>) person -> new PeopleFlatten( // cast picks the Java map overload
      person.get_id(),
      person.getPerson().stream().map(p ->
        new PersonFlatten(
          p.getName(),
          p.getAdr1().getResid() // pull resid up out of the nested adr1 struct
        )
      ).collect(Collectors.toList())
    ),
    peopleFlattenEncoder
  );

where PeopleFlatten and PersonFlatten are POJOs corresponding to the expected schema in the question.

import java.io.Serializable;
import java.util.List;

public class PeopleFlatten implements Serializable {
   private String _id;
   private List<PersonFlatten> person;
   // constructors, getters and setters
}

public class PersonFlatten implements Serializable {
   private String name;
   private String resid;
   // constructors, getters and setters
}
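
For context, here is a minimal sketch of the input side that the map call above assumes: bean classes mirroring the original nested schema and a typed Dataset read from JSON. The class names People, Person, Adr1 and the people.json path are assumptions for illustration, not from the original question.

import java.io.Serializable;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

// Bean classes mirroring the original (nested) schema -- names are assumed
class Adr1 implements Serializable {
   private String resid;
   // constructors, getters and setters
}

class Person implements Serializable {
   private String name;
   private Adr1 adr1;
   // constructors, getters and setters
}

class People implements Serializable {
   private String _id;
   private List<Person> person;
   // constructors, getters and setters
}

// Read the source data as a typed Dataset<People> for the map transformation
SparkSession spark = SparkSession.builder().getOrCreate();
Dataset<People> people = spark.read()
    .json("people.json") // hypothetical input path
    .as(Encoders.bean(People.class));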

If it were Scala, I'd do the following, but since the OP asked about Java, I'm offering it as guidance only.

Solution 1 - Memory-Heavy

import spark.implicits._ // already in scope in spark-shell; needed for toDF and the product encoders

case class Address(resid: String)
case class Person(name: String, adr1: Address)

val people = Seq(
  ("one", Array(Person("hello", Address("1")), Person("world", Address("2"))))
).toDF("_id", "persons")

people.as[(String, Array[Person])].map { case (_id, arr) =>
  (_id, arr.map { case Person(name, Address(resid)) => (name, resid) })
}

This approach, however, is quite memory-expensive, as the internal binary rows are all copied into JVM objects, which can push the environment into OutOfMemoryErrors.

Solution 2 - Expensive but Language-Independent

The other query, with worse performance (but a lower memory requirement), uses the explode operator to destructure the array first, which gives easy access to the internal structs.

import org.apache.spark.sql.functions._ // explode, struct, collect_list

val solution = people.
  select($"_id", explode($"persons") as "exploded"). // <-- that's expensive
  select("_id", "exploded.*"). // <-- this is the trick to access the struct's fields
  select($"_id", $"name", $"adr1.resid").
  select($"_id", struct("name", "resid") as "person").
  groupBy("_id"). // <-- that's expensive
  agg(collect_list("person") as "persons")

scala> solution.printSchema
root
 |-- _id: string (nullable = true)
 |-- persons: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- resid: string (nullable = true)

The nice thing about this solution is that it involves almost nothing specific to Scala or Java (so you can use it right away regardless of the language of your choice).
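
Since the OP asked about Java, here is a rough Java rendering of the same query. It is a sketch only, assuming people is an untyped Dataset<Row> with the original schema; I have not run it against the original data.

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> solution = people
    .select(col("_id"), explode(col("persons")).as("exploded")) // <-- that's expensive
    .select("_id", "exploded.*") // <-- the trick to access the struct's fields
    .select(col("_id"), col("name"), col("adr1.resid"))
    .select(col("_id"), struct(col("name"), col("resid")).as("person"))
    .groupBy("_id") // <-- that's expensive
    .agg(collect_list("person").as("persons"));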
