My goal is to prepare a DataFrame in Spark/Hadoop that I will then index in Elasticsearch.
I have two ORC tables: client and person. The relation is one-to-many: one client can have multiple persons.
I will be using Spark / Spark SQL, so let's talk in terms of DataFrames.
The client DataFrame schema:
root
|-- client_id: string (nullable = true)
|-- c1: string (nullable = true)
|-- c2: string (nullable = true)
|-- c3: string (nullable = true)
The person DataFrame schema:
root
|-- person_id: string (nullable = true)
|-- p1: string (nullable = true)
|-- p2: string (nullable = true)
|-- p3: string (nullable = true)
|-- client_id: string (nullable = true)
My goal is to generate a DataFrame with this schema:
root
|-- client_id: string (nullable = true)
|-- c1: string (nullable = true)
|-- c2: string (nullable = true)
|-- c3: string (nullable = true)
|-- persons: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- person_id: string (nullable = true)
|    |    |-- p1: string (nullable = true)
|    |    |-- p2: string (nullable = true)
|    |    |-- p3: string (nullable = true)
How can I achieve this?
Thanks in advance for your help.
You can group the person DataFrame by client_id, collect all the other columns into a list of structs, and join the result with the client DataFrame, as below:
// Needed outside spark-shell (the shell already provides these imports)
import org.apache.spark.sql.functions.{collect_list, struct}
import spark.implicits._
//client data
val client = Seq(
  ("1", "a", "b", "c"),
  ("2", "a", "b", "c"),
  ("3", "a", "b", "c")
).toDF("client_id", "c1", "c2", "c3")
//person data
val person = Seq(
  ("p1", "a", "b", "c", "1"),
  ("p2", "a", "b", "c", "1"),
  ("p1", "a", "b", "c", "2")
).toDF("person_id", "p1", "p2", "p3", "client_id")
//Group the person data by client_id and create a list of remaining columns
val groupedPerson = person.groupBy("client_id")
  .agg(collect_list(struct("person_id", "p1", "p2", "p3")).as("persons"))
//Join the client and groupedPerson data; the left join keeps clients without persons
val resultDF = client.join(groupedPerson, Seq("client_id"), "left")
resultDF.printSchema()
resultDF.show(false)
Schema:
root
|-- client_id: string (nullable = true)
|-- c1: string (nullable = true)
|-- c2: string (nullable = true)
|-- c3: string (nullable = true)
|-- persons: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- person_id: string (nullable = true)
|    |    |-- p1: string (nullable = true)
|    |    |-- p2: string (nullable = true)
|    |    |-- p3: string (nullable = true)
Output:
+---------+---+---+---+------------------------+
|client_id|c1 |c2 |c3 |persons                 |
+---------+---+---+---+------------------------+
|1        |a  |b  |c  |[[p1,a,b,c], [p2,a,b,c]]|
|2        |a  |b  |c  |[[p1,a,b,c]]            |
|3        |a  |b  |c  |null                    |
+---------+---+---+---+------------------------+
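To apply the same idea to the real tables rather than the sample data, something like the following might work. It is only a sketch: the object name NestPersons and the helper nestPersons are made up for illustration, and it assumes the two ORC tables are registered in the Hive metastore under the names client and person (if they are plain ORC files, spark.read.orc with the file path could be used instead).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{collect_list, struct}

object NestPersons {
  // Hypothetical helper: nests each client's persons into an array column.
  def nestPersons(client: DataFrame, person: DataFrame): DataFrame = {
    val groupedPerson = person
      .groupBy("client_id")
      .agg(collect_list(struct("person_id", "p1", "p2", "p3")).as("persons"))
    // Left join keeps clients that have no persons (their persons column is null)
    client.join(groupedPerson, Seq("client_id"), "left")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("nest-persons")
      .enableHiveSupport() // assumption: the tables live in the Hive metastore
      .getOrCreate()

    // Assumed table names, taken from the question
    val result = nestPersons(spark.table("client"), spark.table("person"))
    result.printSchema()
    // Render a few rows as JSON to preview the documents to be indexed
    result.toJSON.show(5, truncate = false)
    spark.stop()
  }
}
```

If the elasticsearch-hadoop connector is on the classpath, the resulting DataFrame could then probably be written with its saveToEs method (from org.elasticsearch.spark.sql._), using the es.mapping.id setting to make client_id the document id. I have not tested that part against a cluster.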
Hope this helps!