I have a dataframe like this:
df =
--------------
|col1 | col2 |
--------------
| A | 1 |
| A | 5 |
| B | 0 |
| A | 2 |
| B | 6 |
| B | 8 |
--------------
I want to partition by col1, find the median of col2 in each partition, and append the result to form a new column. The result should look like this:
result =
---------------------
|col1 | col2 | col3 |
---------------------
| A | 1 | 2 |
| A | 5 | 2 |
| B | 0 | 6 |
| A | 2 | 2 |
| B | 6 | 6 |
| B | 8 | 6 |
---------------------
For now, I'm using this code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df2 = df
  .withColumn("tmp", percent_rank().over(Window.partitionBy('col1).orderBy('col2)))
  .where("tmp <= 0.5")
  .groupBy("col1").agg(max('col2) as "col3")
val result = df.join(df2, df("col1") === df2("col1")).drop(df2("col1"))
But this takes too much time and memory to run when the dataframe is big. Please help me find a way to do the above more efficiently! Any help is much appreciated!
With the data you have, you can do a Spark DataFrame groupBy with percentile_approx to perform the calculation.
// Creating the `df` dataset
import spark.implicits._  // for .toDF on a local Seq
val df = Seq(("A", 1), ("A", 5), ("B", 0), ("A", 2), ("B", 6), ("B", 8)).toDF("col1", "col2")
df.createOrReplaceTempView("df")
Use percentile_approx with groupBy to perform the median calculation:
val df2 = spark.sql("select col1, percentile_approx(col2, 0.5) as median from df group by col1 order by col1")
df2.show()
with the output of df2 being:
+----+------+
|col1|median|
+----+------+
| A| 2.0|
| B| 6.0|
+----+------+
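If you are on Spark 3.1 or later, percentile_approx is also exposed as a DataFrame function, so the same aggregation can be written without registering a temp view. A minimal sketch of that variant (the lit(10000) accuracy argument is just the function's default made explicit):

import org.apache.spark.sql.functions.{col, lit, percentile_approx}

// Same per-group approximate median via the DataFrame API
val df2Api = df
  .groupBy("col1")
  .agg(percentile_approx(col("col2"), lit(0.5), lit(10000)) as "median")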
And now run the join, using Seq("col1") as the join key so the key column is not duplicated, to produce the final result:
val result = df.join(df2, Seq("col1"))
result.show()
// output
+----+----+------+
|col1|col2|median|
+----+----+------+
|   A|   1|   2.0|
|   A|   5|   2.0|
|   B|   0|   6.0|
|   A|   2|   2.0|
|   B|   6|   6.0|
|   B|   8|   6.0|
+----+----+------+
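Since the concern is resource usage on a big dataframe, note also that percentile_approx is an aggregate function, so it can be applied over an unordered window, which avoids the groupBy-plus-join round trip entirely. A sketch (not benchmarked; whether it beats the join depends on your data distribution and skew):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.expr

// Every row in a col1 partition receives that partition's
// approximate median directly, with no separate join step
val resultNoJoin = df.withColumn(
  "median",
  expr("percentile_approx(col2, 0.5)").over(Window.partitionBy("col1"))
)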