Unable to filter DataFrame using Window function in Spark

I'm trying to use a logical expression based on a window function to detect duplicate records:

df
.where(count("*").over(Window.partitionBy($"col1",$"col2"))>lit(1))
.show

In Spark 2.1.1 this throws:

java.lang.ClassCastException: org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to org.apache.spark.sql.catalyst.plans.logical.Aggregate

On the other hand, it works if I assign the result of the window function to a new column and then filter on that column:

df
.withColumn("count", count("*").over(Window.partitionBy($"col1", $"col2")))
.where($"count" > lit(1)).drop($"count")
.show

I wonder how I can write this without using a temporary column?

I guess window functions cannot be used directly inside a filter. You have to compute the value as a column first and filter on that.

What you can do, however, is move the window function into the select, so the column is created and filtered in a single statement.

val window = Window.partitionBy($"col1").orderBy($"col2")

df.select(col("col1"), col("col2"), lag(col("col2"), 1).over(window).alias("col2_lag"))
  .filter(col("col2_lag") === col("col2"))

Then you have it in one statement.
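The same pattern also works for the duplicate-count expression from the question itself. A minimal sketch, assuming the question's `df` with columns `col1` and `col2` (the alias `cnt` is my own choice):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count}

// Compute the windowed count inside the select, filter on the aliased
// column, then drop it - no separate withColumn step is needed.
df.select(
    col("col1"),
    col("col2"),
    count("*").over(Window.partitionBy(col("col1"), col("col2"))).alias("cnt"))
  .filter(col("cnt") > 1)
  .drop("cnt")
  .show()
```

Logically this is the same plan as the `withColumn`/`where`/`drop` version; it just avoids naming a column on the DataFrame that you immediately remove again.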
