
Create nested data after a join in Spark Scala

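My goal is to prepare a DataFrame in Spark / Hadoop so that it can be indexed in Elasticsearch.

I have 2 ORC tables: client and person. The relationship is one-to-many:

1 client can have multiple persons.

I will be using Spark / Spark SQL, so let's talk in terms of DataFrames:

Client DataFrame schema: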

root 
|-- client_id: string (nullable = true) 
|-- c1: string (nullable = true) 
|-- c2: string (nullable = true) 
|-- c3: string (nullable = true) 

Person DataFrame schema:

root 
|-- person_id: string (nullable = true) 
|-- p1: string (nullable = true) 
|-- p2: string (nullable = true) 
|-- p3: string (nullable = true) 
|-- client_id: string (nullable = true) 

My goal is to generate a DataFrame with the following schema:

root 
|-- client_id: string (nullable = true) 
|-- c1: string (nullable = true) 
|-- c2: string (nullable = true) 
|-- c3: string (nullable = true) 
|-- persons: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- person_id: string (nullable = true)
|    |    |-- p1: string (nullable = true)
|    |    |-- p2: string (nullable = true)
|    |    |-- p3: string (nullable = true)

How can I achieve this?

Thanks in advance for your help.

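You can group the person DataFrame by client_id, collect all of its other columns into a list of structs, and then join the result with the client DataFrame, as shown below: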

// Assumes a SparkSession named `spark` is in scope (as in spark-shell)
import org.apache.spark.sql.functions.{collect_list, struct}
import spark.implicits._ // enables .toDF on a Seq

// client data
val client = Seq(
  ("1", "a", "b", "c"),
  ("2", "a", "b", "c"),
  ("3", "a", "b", "c")
).toDF("client_id", "c1", "c2", "c3")

// person data
val person = Seq(
  ("p1", "a", "b", "c", "1"),
  ("p2", "a", "b", "c", "1"),
  ("p1", "a", "b", "c", "2")
).toDF("person_id", "p1", "p2", "p3", "client_id")

// Group the person data by client_id and collect the remaining columns into a list of structs
val groupedPerson = person.groupBy("client_id")
  .agg(collect_list(struct("person_id", "p1", "p2", "p3")).as("persons"))

// Left join so that clients without any person are kept
val resultDF = client.join(groupedPerson, Seq("client_id"), "left")

resultDF.printSchema()
resultDF.show(false)

Schema:

root
 |-- client_id: string (nullable = true)
 |-- c1: string (nullable = true)
 |-- c2: string (nullable = true)
 |-- c3: string (nullable = true)
 |-- persons: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- person_id: string (nullable = true)
 |    |    |-- p1: string (nullable = true)
 |    |    |-- p2: string (nullable = true)
 |    |    |-- p3: string (nullable = true)

Output:

+---------+---+---+---+------------------------+
|client_id|c1 |c2 |c3 |persons                 |
+---------+---+---+---+------------------------+
|1        |a  |b  |c  |[[p1,a,b,c], [p2,a,b,c]]|
|2        |a  |b  |c  |[[p1,a,b,c]]            |
|3        |a  |b  |c  |null                    |
+---------+---+---+---+------------------------+
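
Note that client 3 has no persons, so the left join leaves its persons column null. If an empty array is preferable for indexing, one possible variation (a sketch, not part of the original answer) is to join first and aggregate afterwards, relying on the fact that collect_list skips null values. The local SparkSession and the name withEmptyArrays are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, struct, when}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("nested-persons-empty-array")
  .getOrCreate()
import spark.implicits._

val client = Seq(
  ("1", "a", "b", "c"),
  ("2", "a", "b", "c"),
  ("3", "a", "b", "c")
).toDF("client_id", "c1", "c2", "c3")

val person = Seq(
  ("p1", "a", "b", "c", "1"),
  ("p2", "a", "b", "c", "1"),
  ("p1", "a", "b", "c", "2")
).toDF("person_id", "p1", "p2", "p3", "client_id")

// Left join first: clients without persons get one row with null person columns.
val joined = client.join(person, Seq("client_id"), "left")

// `when` without `otherwise` yields null for unmatched clients,
// and collect_list drops nulls, so those clients end up with an empty array.
val withEmptyArrays = joined
  .groupBy("client_id", "c1", "c2", "c3")
  .agg(
    collect_list(
      when(col("person_id").isNotNull, struct("person_id", "p1", "p2", "p3"))
    ).as("persons")
  )

withEmptyArrays.orderBy("client_id").show(false)
```

The trade-off is that this groups by every client column after the join, which can shuffle more data than grouping the person table alone, so the original approach may still be the better fit when a null persons value is acceptable.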

Hope this helps!
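
Since the question also mentions Spark SQL, the same group-then-left-join idea can be sketched as a query. This is only an illustration: the temp view names client and person, the alias g, and the local SparkSession are assumptions, and the sample rows stand in for the real ORC tables.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("nested-persons-sql")
  .getOrCreate()
import spark.implicits._

val client = Seq(
  ("1", "a", "b", "c"),
  ("2", "a", "b", "c"),
  ("3", "a", "b", "c")
).toDF("client_id", "c1", "c2", "c3")

val person = Seq(
  ("p1", "a", "b", "c", "1"),
  ("p2", "a", "b", "c", "1"),
  ("p1", "a", "b", "c", "2")
).toDF("person_id", "p1", "p2", "p3", "client_id")

// Register both DataFrames so they can be referenced from SQL.
client.createOrReplaceTempView("client")
person.createOrReplaceTempView("person")

// The subquery nests persons per client; the LEFT JOIN keeps clients with no persons.
val resultSql = spark.sql(
  """
    |SELECT c.client_id, c.c1, c.c2, c.c3, g.persons
    |FROM client c
    |LEFT JOIN (
    |  SELECT client_id,
    |         collect_list(struct(person_id, p1, p2, p3)) AS persons
    |  FROM person
    |  GROUP BY client_id
    |) g ON c.client_id = g.client_id
  """.stripMargin)

resultSql.printSchema()
resultSql.orderBy("client_id").show(false)
```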


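Notice: The technical posts on this site are published under the CC BY-SA 4.0 license. If you repost them, please cite this site's URL or the original source address. For any questions, contact: yoyou2525@163.com.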

 