
Aggregate multiple columns in Spark dataframe

I am having trouble solving the following problem. Basically, I want to find the dates on which a particular item (item_code) was sold at its maximum and minimum volume.

Input DataFrame

item_code, sold_date, price, volume
101,      10-12-2017, 20,    500
101,      11-12-2017, 20,    400
201,      10-12-2017, 50,    200
201,      13-12-2017, 51,    300

Expected output

Find the max and min volume along with the sold date. I want a solution that does not use any lambda operations.


df.groupBy("item_code").agg(min("volume"), max("volume"))

The above gives me the max and min volume, but I want them along with the respective dates.

I tried my best with a UDF but could not crack it. Any help is highly appreciated.

The final output you desire needs a fairly involved process. You can do it as follows.

Given the input dataframe as

+---------+----------+-----+------+
|item_code|sold_date |price|volume|
+---------+----------+-----+------+
|101      |10-12-2017|20   |500   |
|101      |11-12-2017|20   |400   |
|201      |10-12-2017|50   |200   |
|201      |13-12-2017|51   |300   |
+---------+----------+-----+------+

You can use the following code

import org.apache.spark.sql.functions._
import spark.implicits._  // for the $"..." column syntax; assumes a SparkSession named spark

// Per-item min and max volume.
val tempDF = df.groupBy("item_code").agg(min("volume").as("min"), max("volume").as("max"))

// Join back to the original rows to recover the sold_date of the min volume...
tempDF.as("t2").join(df.as("t1"), col("t1.item_code") === col("t2.item_code") && col("t1.volume") === col("t2.min"), "left")
  .select($"t2.item_code", $"t2.max", concat_ws(",", $"t2.item_code", $"t2.min", $"t1.sold_date").as("min"))
  // ...and a second join to recover the sold_date of the max volume.
  .join(df.as("t3"), col("t3.item_code") === col("t2.item_code") && col("t3.volume") === col("t2.max"), "left")
  .select($"min", concat_ws(",", $"t3.item_code", $"t2.max", $"t3.sold_date").as("max"))
  .show(false)

which gives you the dataframe you want:

+------------------+------------------+
|min               |max               |
+------------------+------------------+
|101,400,11-12-2017|101,500,10-12-2017|
|201,200,10-12-2017|201,300,13-12-2017|
+------------------+------------------+
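
As a side note, a join-free alternative (not part of this answer) is to aggregate a struct whose first field is volume: min and max compare struct fields from left to right, so the matching sold_date rides along with the extreme volume. A rough sketch:

import org.apache.spark.sql.functions._

// min/max on a struct compare fields left to right, so putting volume
// first carries the matching sold_date along with the extreme value.
df.groupBy("item_code")
  .agg(
    min(struct(col("volume"), col("sold_date"))).as("mn"),
    max(struct(col("volume"), col("sold_date"))).as("mx"))
  .select(
    concat_ws(",", col("item_code"), col("mn.volume"), col("mn.sold_date")).as("min"),
    concat_ws(",", col("item_code"), col("mx.volume"), col("mx.sold_date")).as("max"))
  .show(false)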

The best approach here is to add a new index column to the DataFrame by concatenating the columns required for sorting. Build this string index so that it still sorts numerically (for example, by zero-padding the volume), and you carry the date, and anything else you need to retrieve, along with the query result.

That way there is no need for JOINs.
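
A minimal sketch of this idea (hypothetical, not from the original answer), assuming volume is non-negative and fits in 10 digits so that zero-padding makes string order match numeric order:

import org.apache.spark.sql.functions._

// Build a sortable string index: zero-padded volume + sold_date.
val indexed = df.withColumn("idx",
  concat_ws(",", lpad(col("volume").cast("string"), 10, "0"), col("sold_date")))

// The string min/max of idx now correspond to the numeric min/max volume,
// with the matching date carried along.
indexed.groupBy("item_code")
  .agg(min("idx").as("min_idx"), max("idx").as("max_idx"))
  .show(false)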
