I need to filter a list column using another column in the same DataFrame.
Below is my DataFrame. I want to filter the col3 list against Col1 so that each parent keeps only its active children.
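Df.show(10,false):
+----+------+-------------+--------+
|Col1|Col2  |col3         |flag    |
+----+------+-------------+--------+
|P1  |Parent|[c1,c2,c3,c4]|Active  |
|c1  |Child |[]           |InActive|
|c2  |Child |[]           |Active  |
|c3  |Child |[]           |Active  |
+----+------+-------------+--------+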
Expected output:
Df.show(10,false):
+----+------+-------+------+
|Col1|Col2  |col3   |flag  |
+----+------+-------+------+
|P1  |Parent|[c2,c3]|Active|
|c2  |Child |[]     |Active|
|c3  |Child |[]     |Active|
+----+------+-------+------+
Can someone help me get the above result?
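I generated your DataFrame like this (spark-shell normally has both imports in scope already; add them elsewhere):
import org.apache.spark.sql.functions._ // explode, first, collect_list
import spark.implicits._ // toDF and the 'column syntax
val df = Seq(("p1", "Parent", Seq("c1", "c2", "c3", "c4"), "Active"),
    ("c1", "Child", Seq(), "Inactive"),
    ("c2", "Child", Seq(), "Active"),
    ("c3", "Child", Seq(), "Active"))
  .toDF("Col1", "Col2", "col3", "flag")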
Then I keep only the active children in one DataFrame, which is one part of your output:
val active_children = df.where('flag === "Active").where('Col2 === "Child")
I also generate a flattened DataFrame of parent/child relationships with explode, which emits one row per array element and no row for an empty array:
val rels = df.withColumn("child", explode('col3))
.select("Col1", "Col2", "flag", "child")
scala> rels.show
+----+------+------+-----+
|Col1| Col2| flag|child|
+----+------+------+-----+
| p1|Parent|Active| c1|
| p1|Parent|Active| c2|
| p1|Parent|Active| c3|
| p1|Parent|Active| c4|
+----+------+------+-----+
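The explode step can be modelled on ordinary Scala collections, which may make it clearer why only the four p1 rows survive. This is a sketch, not Spark code: the Rec case class and the ExplodeSketch object are hypothetical names used only for illustration, and flatMap stands in for explode.

```scala
// Plain-Scala model of explode; Rec and ExplodeSketch are illustrative names, not Spark API.
final case class Rec(col1: String, col2: String, col3: Seq[String], flag: String)

object ExplodeSketch {
  val rows: Seq[Rec] = Seq(
    Rec("p1", "Parent", Seq("c1", "c2", "c3", "c4"), "Active"),
    Rec("c1", "Child", Seq.empty, "Inactive"),
    Rec("c2", "Child", Seq.empty, "Active"),
    Rec("c3", "Child", Seq.empty, "Active")
  )

  // One output tuple per element of col3. A row whose col3 is empty yields nothing,
  // which mirrors how explode drops the child rows.
  val rels: Seq[(String, String, String, String)] =
    rows.flatMap(r => r.col3.map(child => (r.col1, r.col2, r.flag, child)))
}
```

Running ExplodeSketch.rels.foreach(println) lists one tuple per parent/child pair. If rows with empty arrays had to be kept in Spark, explode_outer would be the function to look at instead.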
Next, I build a DataFrame with a single column holding the ids of the active children:
val child_filter = active_children.select('Col1 as "child")
Then I use this child_filter DataFrame to filter the relationships (with an inner join), keeping only the pairs whose child is active, and a groupBy to aggregate the rows back into your output format:
val parents = rels
.join(child_filter, "child")
.groupBy("Col1")
.agg(first('Col2) as "Col2",
collect_list('child) as "col3",
first('flag) as "flag")
scala> parents.show
+----+------+--------+------+
|Col1| Col2| col3| flag|
+----+------+--------+------+
| p1|Parent|[c2, c3]|Active|
+----+------+--------+------+
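The join and groupBy steps can also be modelled on plain Scala collections. The sketch below is only an illustration under that assumption: JoinGroupSketch and its member names are hypothetical, a Set lookup stands in for the inner join, and groupBy plus map stands in for groupBy with collect_list.

```scala
// Plain-Scala model of the join + groupBy step; all names here are illustrative.
object JoinGroupSketch {
  // (parent, child) pairs, as produced by the explode step.
  val rels: Seq[(String, String)] =
    Seq("p1" -> "c1", "p1" -> "c2", "p1" -> "c3", "p1" -> "c4")

  // Ids of the active children, playing the role of child_filter.
  val activeChildren: Set[String] = Set("c2", "c3")

  // Inner join: keep only the pairs whose child is active.
  val joined: Seq[(String, String)] =
    rels.filter { case (_, child) => activeChildren.contains(child) }

  // groupBy + collect_list: gather the surviving children per parent.
  val parents: Map[String, Seq[String]] =
    joined.groupBy(_._1).map { case (parent, pairs) => parent -> pairs.map(_._2) }
}
```

Two caveats carry over to the Spark version. A parent whose children are all inactive has no surviving pair, so it is absent from the result, just as the inner join drops it. And while the plain collections keep the children in order, collect_list in Spark does not guarantee the order of the collected values after a shuffle.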
Finally, a union (which matches columns by position, so both DataFrames must list them in the same order) yields the expected output:
scala> parents.union(active_children).show
+----+------+--------+------+
|Col1| Col2| col3| flag|
+----+------+--------+------+
| p1|Parent|[c2, c3]|Active|
| c2| Child| []|Active|
| c3| Child| []|Active|
+----+------+--------+------+
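As a possible alternative to the explode/join/groupBy/union pipeline, the array can be filtered in place with array_intersect. This is a minimal sketch rather than the method used above, and it rests on several assumptions: Spark 2.4 or later (where array_intersect is available), a set of active child ids small enough to collect into one array, and a requirement to drop every inactive row. ActiveChildrenSketch, keepActiveChildren and the temporary column active_ids are hypothetical names.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array_intersect, col, collect_set}

object ActiveChildrenSketch {
  // Hypothetical helper: expects the columns Col1, Col2, col3 (array<string>) and flag.
  def keepActiveChildren(df: DataFrame): DataFrame = {
    // Single-row DataFrame whose only column lists the ids of all active children.
    val activeIds = df
      .where(col("Col2") === "Child" && col("flag") === "Active")
      .agg(collect_set(col("Col1")).as("active_ids"))

    df.where(col("flag") === "Active") // assumption: every inactive row is dropped
      .crossJoin(activeIds)            // attach the id list to each remaining row
      .withColumn("col3", array_intersect(col("col3"), col("active_ids")))
      .drop("active_ids")
  }
}
```

With the df defined earlier, ActiveChildrenSketch.keepActiveChildren(df).show(false) would display the result. The behaviour differs from the pipeline above in a few ways: a parent with no active children is kept with an empty col3 instead of disappearing, an inactive parent is dropped even if it has active children, and array_intersect removes duplicate entries from the array.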