
drop duplicate rows with a groupBy function

Hello, I have the following dataframe:

NAME  ID    VER
A.    650.  true
A.    230.  false
B.    430.  false
B.    230.  false
C.    125.  true
C.    230.  false

So the target here is to drop duplicate rows and keep only one per NAME. In this example I want to remove the second line because it has the same NAME as the first one, but with VER equal to false.

The same goes for the last rows whose NAME is C: we only keep the one having a true VER. For the B rows, we keep both of them because neither has VER equal to true.

The expected result would be:

NAME  ID    VER
A.    650.  true
B.    430.  false
B.    230.  false
C.    125.  true

So I thought about a window function partitioned by NAME, and then filtering on VER to only keep the NAMEs having a true VER.

Any idea how to implement this with Spark SQL?
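
For reference, here is a minimal PySpark sketch of how this sample data can be built and registered as a temporary view named df (the view name is just an assumption so that the SQL below can query a table called df):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question (trailing dots dropped for readability)
df = spark.createDataFrame(
    [("A", 650, True), ("A", 230, False),
     ("B", 430, False), ("B", 230, False),
     ("C", 125, True), ("C", 230, False)],
    ["NAME", "ID", "VER"],
)

# Register the dataframe as a temp view so spark.sql can refer to it as "df"
df.createOrReplaceTempView("df")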

You can add a flag column based on whether there are any true values for a given NAME:

spark.sql("""
    select NAME, ID, VER 
    from (
        select *, 
            max(VER) over(partition by NAME) == false or VER as flag 
        from df
    ) 
    where flag
""").show()
+----+----+-----+
|NAME|  ID|  VER|
+----+----+-----+
|  A.|650.| true|
|  C.|125.| true|
|  B.|430.|false|
|  B.|230.|false|
+----+----+-----+
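
If you prefer the DataFrame API over a SQL string, the same window trick can be sketched like this (assuming df is the original dataframe, with the column names from the question):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# has_true is true for every row whose NAME group contains at least one true VER
w = Window.partitionBy("NAME")
result = (
    df.withColumn("has_true", F.max("VER").over(w))
      .filter(F.col("VER") | ~F.col("has_true"))  # keep true rows, or all rows of all-false groups
      .drop("has_true")
)
result.show()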

Here is a way to do it using a join.

spark.sql("""
    SELECT a.NAME, a.ID, a.VER
      FROM df a
      LEFT JOIN (
           SELECT DISTINCT NAME
             FROM df
            WHERE VER = true) b
        ON a.NAME = b.NAME
     WHERE (b.NAME IS NOT NULL AND a.VER = true)
        or (b.NAME IS NULL AND a.VER = false)
""")

+----+---+-----+
|NAME|ID |VER  |
+----+---+-----+
|A   |650|true |
|B   |430|false|
|B   |230|false|
|C   |125|true |
+----+---+-----+
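
The same join idea can also be expressed with the DataFrame API; the left_semi / left_anti joins below are my own choice, not part of the answer above:

from pyspark.sql import functions as F

# Names that have at least one row with VER = true
names_with_true = df.filter(F.col("VER")).select("NAME").distinct()

# For those names keep only the true rows; for all other names keep every row
kept_true = df.join(names_with_true, "NAME", "left_semi").filter(F.col("VER"))
kept_false = df.join(names_with_true, "NAME", "left_anti")

kept_true.unionByName(kept_false).show(truncate=False)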
