Hello, I have the following DataFrame:
NAME ID VER
A. 650. true
A. 230. false
B. 430. false
B. 230. false
C. 125. true
C. 230. false
The goal is to drop duplicate rows per NAME and keep only one. I want to remove the second row because it has the same NAME as the first one but a VER equal to false.
The same applies to the last row, whose NAME is C: we only keep the row with a true VER. For the B rows, we keep both of them because neither has a VER equal to true.
The expected result would be:
NAME ID VER
A. 650. true
B. 430. false
B. 230. false
C. 125. true
I thought about a window function partitioned by NAME, followed by a filter on VER, so that names having a true VER keep only that row.
Any idea how to implement this with Spark SQL?
You can add a flag column based on whether there are any True's for a given name:
spark.sql("""
    select NAME, ID, VER
    from (
        select *,
               -- keep a row if its NAME has no true VER at all, or if the row itself is true
               (max(VER) over (partition by NAME) == false) or VER as flag
        from df
    )
    where flag
""").show()
+----+----+-----+
|NAME|  ID|  VER|
+----+----+-----+
|  A.|650.| true|
|  C.|125.| true|
|  B.|430.|false|
|  B.|230.|false|
+----+----+-----+
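To make the flag logic concrete, here is a minimal plain-Python sketch of the same rule, independent of Spark. It assumes the sample rows from the question, written with plain names and integer IDs; the variable names `rows`, `has_true`, and `kept` are only illustrative.

```python
from collections import defaultdict

# Sample rows from the question: (NAME, ID, VER)
rows = [
    ("A", 650, True),
    ("A", 230, False),
    ("B", 430, False),
    ("B", 230, False),
    ("C", 125, True),
    ("C", 230, False),
]

# Counterpart of max(VER) over (partition by NAME):
# True if at least one row with that NAME has VER = True.
has_true = defaultdict(bool)
for name, _id, ver in rows:
    has_true[name] = has_true[name] or ver

# Counterpart of the flag: keep a row if its NAME has no True at all,
# or if the row itself is True.
kept = [(name, id_, ver) for name, id_, ver in rows if not has_true[name] or ver]

for row in kept:
    print(row)
```

The window function plays the role of `has_true` here: it attaches the per-NAME aggregate to every row, so the filter can be applied row by row without a separate join.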
Alternatively, here is a way to do it with a join:
spark.sql("""
    SELECT a.NAME, a.ID, a.VER
    FROM df a
    LEFT JOIN (
        -- names that have at least one true VER
        SELECT DISTINCT NAME
        FROM df
        WHERE VER = true) b
    ON a.NAME = b.NAME
    WHERE (b.NAME IS NOT NULL AND a.VER = true)
       OR (b.NAME IS NULL AND a.VER = false)
""").show(truncate=False)
+----+---+-----+
|NAME|ID |VER |
+----+---+-----+
|A |650|true |
|B |430|false|
|B |230|false|
|C |125|true |
+----+---+-----+
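If no Spark session is at hand, the join logic can be sanity-checked with Python's built-in sqlite3 module. This is only a sketch under assumptions: SQLite stands in for Spark, VER is stored as 1/0 instead of a boolean, and an ORDER BY is added so the row order is deterministic (the original query does not guarantee any order).

```python
import sqlite3

# In-memory SQLite table standing in for the Spark view "df" (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (NAME TEXT, ID INTEGER, VER INTEGER)")
conn.executemany(
    "INSERT INTO df VALUES (?, ?, ?)",
    [
        ("A", 650, 1),
        ("A", 230, 0),
        ("B", 430, 0),
        ("B", 230, 0),
        ("C", 125, 1),
        ("C", 230, 0),
    ],
)

# Same join as above, with true/false written as 1/0 for SQLite.
query = """
SELECT a.NAME, a.ID, a.VER
FROM df a
LEFT JOIN (
    SELECT DISTINCT NAME
    FROM df
    WHERE VER = 1
) b
ON a.NAME = b.NAME
WHERE (b.NAME IS NOT NULL AND a.VER = 1)
   OR (b.NAME IS NULL AND a.VER = 0)
ORDER BY a.NAME, a.ID DESC
"""

result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The left join marks each row with whether its NAME appears in the set of names that have a true VER; the WHERE clause then keeps true rows for those names and all rows for the others.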