
drop duplicate rows with a groupBy function

Hello, I have the following dataframe:

NAME  ID    VER
A.    650.  true
A.    230.  false
B.    430.  false
B.    230.  false
C.    125.  true
C.    230.  false

So the target here is to drop duplicate rows and keep only one per NAME. In this example I want to remove the second line because it has the same NAME as the first one, but with VER equal to false.

The same goes for the last rows whose NAME is C: we only keep the one having a true VER. For the B rows, we keep both of them because neither has VER equal to true.

The expected result would be:

NAME  ID    VER
A.    650.  true
B.    430.  false
B.    230.  false
C.    125.  true

So I thought about a window function partitioned by NAME, and then filtering on VER to only keep the NAMEs having a true VER.

Any idea how to implement this with Spark SQL?
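
For reference, here is a minimal PySpark sketch of how this sample data can be built and registered as a temporary view named df (the view name is just an assumption so that the SQL below can query a table called df):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question (trailing dots dropped for readability)
df = spark.createDataFrame(
    [("A", 650, True), ("A", 230, False),
     ("B", 430, False), ("B", 230, False),
     ("C", 125, True), ("C", 230, False)],
    ["NAME", "ID", "VER"],
)

# Register the dataframe as a temp view so spark.sql can refer to it as "df"
df.createOrReplaceTempView("df")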

You can add a flag column based on whether there are any true values for a given NAME:

spark.sql("""
    select NAME, ID, VER 
    from (
        select *, 
            max(VER) over(partition by NAME) == false or VER as flag 
        from df
    ) 
    where flag
""").show()
+----+----+-----+
|NAME|  ID|  VER|
+----+----+-----+
|  A.|650.| true|
|  C.|125.| true|
|  B.|430.|false|
|  B.|230.|false|
+----+----+-----+
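
If you prefer the DataFrame API over a SQL string, the same window trick can be sketched like this (assuming df is the original dataframe, with the column names from the question):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# has_true is true for every row whose NAME group contains at least one true VER
w = Window.partitionBy("NAME")
result = (
    df.withColumn("has_true", F.max("VER").over(w))
      .filter(F.col("VER") | ~F.col("has_true"))  # keep true rows, or all rows of all-false groups
      .drop("has_true")
)
result.show()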

Here is a way to do it using a join.

spark.sql("""
    SELECT a.NAME, a.ID, a.VER
      FROM df a
      LEFT JOIN (
           SELECT DISTINCT NAME
             FROM df
            WHERE VER = true) b
        ON a.NAME = b.NAME
     WHERE (b.NAME IS NOT NULL AND a.VER = true)
        or (b.NAME IS NULL AND a.VER = false)
""")

+----+---+-----+
|NAME|ID |VER  |
+----+---+-----+
|A   |650|true |
|B   |430|false|
|B   |230|false|
|C   |125|true |
+----+---+-----+
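
The same join idea can also be expressed with the DataFrame API; the left_semi / left_anti joins below are my own choice, not part of the answer above:

from pyspark.sql import functions as F

# Names that have at least one row with VER = true
names_with_true = df.filter(F.col("VER")).select("NAME").distinct()

# For those names keep only the true rows; for all other names keep every row
kept_true = df.join(names_with_true, "NAME", "left_semi").filter(F.col("VER"))
kept_false = df.join(names_with_true, "NAME", "left_anti")

kept_true.unionByName(kept_false).show(truncate=False)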
