How to filter a list in Spark with another column of the same DataFrame (version 2.2)

I need to filter a list column using another column of the same DataFrame.

Below is my DataFrame. I want to filter the col3 list against Col1 and keep only the active children for each parent.

    Df.show(10,false):
=============================

    Col1   Col2     col3            flag 
    P1     Parent   [c1,c2,c3,c4]   Active
    c1     Child    []              InActive
    c2     Child    []              Active
    c3     Child    []              Active

Expected Output :
===================

     Df.show(10,false):
    Col1   Col2     col3            flag 
    P1     Parent   [c2,c3]         Active
    c2     Child    []              Active
    c3     Child    []              Active

Can someone help me get the above result?

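I generated your DataFrame like this (in spark-shell, where spark.implicits._ and org.apache.spark.sql.functions._ are already imported):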

val df = Seq(("p1", "Parent", Seq("c1", "c2", "c3", "c4"), "Active"), 
             ("c1", "Child", Seq(), "Inactive"), 
             ("c2", "Child", Seq(), "Active"), 
             ("c3", "Child", Seq(), "Active"))
        .toDF("Col1", "Col2", "col3", "flag")

Then I keep only the active children in a separate DataFrame, which is already one part of your output:

val active_children = df.where('flag === "Active").where('Col2 === "Child")

I also build a flattened DataFrame of parent/child relationships with explode:

val rels = df.withColumn("child", explode('col3))
    .select("Col1", "Col2", "flag", "child")

scala> rels.show
+----+------+------+-----+
|Col1|  Col2|  flag|child|
+----+------+------+-----+
|  p1|Parent|Active|   c1|
|  p1|Parent|Active|   c2|
|  p1|Parent|Active|   c3|
|  p1|Parent|Active|   c4|
+----+------+------+-----+

Next, I build a single-column DataFrame that holds the IDs of the active children:

val child_filter = active_children.select('Col1 as "child")

Then I use this child_filter DataFrame in an inner join to keep only the parent/child pairs whose child is active, and a groupBy to aggregate the rows back into your output format:

val parents = rels
    .join(child_filter, "child")
    .groupBy("Col1")
    .agg(first('Col2) as "Col2", 
         collect_list('child) as "col3", 
         first('flag) as "flag")
scala> parents.show
+----+------+--------+------+
|Col1|  Col2|    col3|  flag|
+----+------+--------+------+
|  p1|Parent|[c2, c3]|Active|
+----+------+--------+------+

Finally, a union yields the expected output:

scala> parents.union(active_children).show
+----+------+--------+------+
|Col1|  Col2|    col3|  flag|
+----+------+--------+------+
|  p1|Parent|[c2, c3]|Active|
|  c2| Child|      []|Active|
|  c3| Child|      []|Active|    
+----+------+--------+------+
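
The steps above can also be assembled into a standalone program, which may help if you are not working in spark-shell. This is only a sketch under a few assumptions: the object name ActiveChildren, the helper keepActiveChildren, and the local[*] master are hypothetical choices of mine, and sort_array is an addition that is not in the snippets above. I added it because collect_list does not guarantee the order of the collected elements.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, explode, first, sort_array}

object ActiveChildren {

  // Hypothetical helper: returns the active child rows plus each parent
  // with col3 trimmed to its active children.
  def keepActiveChildren(df: DataFrame): DataFrame = {
    // Rows of active children; they go into the result unchanged.
    val activeChildren =
      df.where(col("flag") === "Active" && col("Col2") === "Child")

    // Single-column DataFrame with the IDs of the active children.
    val childFilter = activeChildren.select(col("Col1").as("child"))

    // One row per (parent, child) pair.
    val rels = df
      .withColumn("child", explode(col("col3")))
      .select("Col1", "Col2", "flag", "child")

    // The inner join drops pairs whose child is not active;
    // the groupBy then rebuilds one row per parent.
    val parents = rels
      .join(childFilter, "child")
      .groupBy("Col1")
      .agg(
        first(col("Col2")).as("Col2"),
        // sort_array (my addition) makes the list order deterministic.
        sort_array(collect_list(col("child"))).as("col3"),
        first(col("flag")).as("flag"))

    // union matches columns by position: Col1, Col2, col3, flag on both sides.
    parents.union(activeChildren)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration; spark-shell already provides `spark`.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("active-children")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("p1", "Parent", Seq("c1", "c2", "c3", "c4"), "Active"),
      ("c1", "Child", Seq.empty[String], "Inactive"),
      ("c2", "Child", Seq.empty[String], "Active"),
      ("c3", "Child", Seq.empty[String], "Active")
    ).toDF("Col1", "Col2", "col3", "flag")

    keepActiveChildren(df).show(10, false)

    spark.stop()
  }
}
```

Two things to keep in mind with this approach: the row order of the final DataFrame is not guaranteed, and because of the inner join, a parent with no active children does not appear in the result at all.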
