Unable to filter DataFrame using Window function in Spark

I'm trying to use a logical expression based on a window function to detect duplicate records:

df
.where(count("*").over(Window.partitionBy($"col1",$"col2"))>lit(1))
.show

In Spark 2.1.1 this throws:

java.lang.ClassCastException: org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to org.apache.spark.sql.catalyst.plans.logical.Aggregate

On the other hand, it works if I assign the result of the window function to a new column and then filter on that column:

df
.withColumn("count", count("*").over(Window.partitionBy($"col1", $"col2")))
.where($"count" > lit(1)).drop($"count")
.show

I wonder how I can write this without using a temporary column?

I guess window functions cannot be used directly inside a filter. You have to compute the value as a column first and filter on that.

What you can do, however, is move the window function into the select, so the column is created and filtered in a single statement.

val window = Window.partitionBy($"col1").orderBy($"col2")

df.select(col("col1"), col("col2"), lag(col("col2"), 1).over(window).alias("col2_lag"))
  .filter(col("col2_lag") === col("col2"))

Then you have it in one statement.
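The same pattern also works for the duplicate-count expression from the question itself. A minimal sketch, assuming the question's `df` with columns `col1` and `col2` (the alias `cnt` is my own choice):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count}

// Compute the windowed count inside the select, filter on the aliased
// column, then drop it - no separate withColumn step is needed.
df.select(
    col("col1"),
    col("col2"),
    count("*").over(Window.partitionBy(col("col1"), col("col2"))).alias("cnt"))
  .filter(col("cnt") > 1)
  .drop("cnt")
  .show()
```

Logically this is the same plan as the `withColumn`/`where`/`drop` version; it just avoids naming a column on the DataFrame that you immediately remove again.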
