
How to aggregate on two columns in Spark SQL

Now I have a table with the following task:

  1. Group by DepartmentID and EmployeeID
  2. Within each group, I need to order the rows by (ArrivalDate, ArrivalTime) and pick the first one. So if two dates are different, pick the newer date. If two dates are the same, pick the newer time.

I am trying this kind of approach:

input.select("DepartmenId","EmolyeeID", "ArrivalDate", "ArrivalTime", "Word")
  .agg(here will be the function that handles logic from 2)
  .show()

What is the syntax to aggregate here?

Thank you in advance.

// +-----------+---------+-----------+-----------+--------+
// |DepartmenId|EmolyeeID|ArrivalDate|ArrivalTime|  Word  |
// +-----------+---------+-----------+-----------+--------+
// |     D1    |    E1   |  20170101 |    0730   | "YES"  |
// |     D1    |    E1   |  20170102 |    1530   | "NO"   |
// |     D1    |    E2   |  20170101 |    0730   | "ZOO"  |
// |     D1    |    E2   |  20170102 |    0330   | "BOO"  |
// |     D2    |    E1   |  20170101 |    0730   | "LOL"  |
// |     D2    |    E1   |  20170101 |    1830   | "ATT"  |
// |     D2    |    E2   |  20170105 |    1430   | "UNI"  |
// +-----------+---------+-----------+-----------+--------+
//
// output should be
//
// +-----------+---------+-----------+-----------+--------+
// |DepartmenId|EmolyeeID|ArrivalDate|ArrivalTime|  Word  |
// +-----------+---------+-----------+-----------+--------+
// |     D1    |    E1   |  20170102 |    1530   | "NO"   |
// |     D1    |    E2   |  20170102 |    0330   | "BOO"  |
// |     D2    |    E1   |  20170101 |    1830   | "ATT"  |
// |     D2    |    E2   |  20170105 |    1430   | "UNI"  |
// +-----------+---------+-----------+-----------+--------+

One approach would be to use a Spark window function:

import spark.implicits._   // needed for toDF and the $"col" syntax below

val df = Seq(
  ("D1", "E1", "20170101", "0730", "YES"),
  ("D1", "E1", "20170102", "1530", "NO"),
  ("D1", "E2", "20170101", "0730", "ZOO"),
  ("D1", "E2", "20170102", "0330", "BOO"),
  ("D2", "E1", "20170101", "0730", "LOL"),
  ("D2", "E1", "20170101", "1830", "ATT"),
  ("D2", "E2", "20170105", "1430", "UNI")
).toDF(
  "DepartmenId", "EmolyeeID", "ArrivalDate", "ArrivalTime", "Word"
)

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Rank rows within each (DepartmenId, EmolyeeID) group by newest (ArrivalDate, ArrivalTime)
// and keep only the top-ranked row.
val df2 = df.withColumn("rowNum", row_number().over(
    Window.partitionBy("DepartmenId", "EmolyeeID").
      orderBy($"ArrivalDate".desc, $"ArrivalTime".desc)
  )).
  where($"rowNum" === 1).
  select("DepartmenId", "EmolyeeID", "ArrivalDate", "ArrivalTime", "Word").
  orderBy("DepartmenId", "EmolyeeID")

df2.show
+-----------+---------+-----------+-----------+----+
|DepartmenId|EmolyeeID|ArrivalDate|ArrivalTime|Word|
+-----------+---------+-----------+-----------+----+
|         D1|       E1|   20170102|       1530|  NO|
|         D1|       E2|   20170102|       0330| BOO|
|         D2|       E1|   20170101|       1830| ATT|
|         D2|       E2|   20170105|       1430| UNI|
+-----------+---------+-----------+-----------+----+
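Since the title mentions Spark SQL, the same window logic can also be expressed as a SQL query against a temporary view. This is a minimal sketch, assuming a SparkSession in scope named spark and the df defined above; the view name arrivals is just an illustrative choice:

// Register the sample DataFrame as a temporary view (the view name is arbitrary)
df.createOrReplaceTempView("arrivals")

val latest = spark.sql("""
  SELECT DepartmenId, EmolyeeID, ArrivalDate, ArrivalTime, Word
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (
             PARTITION BY DepartmenId, EmolyeeID
             ORDER BY ArrivalDate DESC, ArrivalTime DESC
           ) AS rowNum
    FROM arrivals
  ) ranked
  WHERE rowNum = 1
  ORDER BY DepartmenId, EmolyeeID
""")

latest.show()  // should print the same four rows as df2 above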

You can use max on a new struct column that contains all the non-grouping columns, with ArrivalDate first and ArrivalTime second: structs are compared field by field from left to right, so the ordering of that new column matches your requirement (latest date first; among equal dates, latest time first), and taking the maximum produces the record you're after.

Then, you can use a select operation to "split" the struct back into separate columns.

import spark.implicits._
import org.apache.spark.sql.functions._

df.groupBy($"DepartmentID", $"EmployeeID")
  .agg(max(struct("ArrivalDate", "ArrivalTime", "Word")) as "struct")
  .select($"DepartmentID", $"EmployeeID",
    $"struct.ArrivalDate" as "ArrivalDate",
    $"struct.ArrivalTime" as "ArrivalTime",
    $"struct.Word" as "Word"
  )
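Note that max compares struct fields left to right, so ArrivalDate must come before ArrivalTime inside struct(...); Word comes last, so it only affects the result if date and time tie exactly. As a sanity check, running the expression above against the sample df (with an extra orderBy so the row order matches) should reproduce the same four rows as the window-function answer; a minimal sketch:

df.groupBy($"DepartmenId", $"EmolyeeID")
  .agg(max(struct("ArrivalDate", "ArrivalTime", "Word")) as "struct")
  .select($"DepartmenId", $"EmolyeeID",
    $"struct.ArrivalDate" as "ArrivalDate",
    $"struct.ArrivalTime" as "ArrivalTime",
    $"struct.Word" as "Word")
  .orderBy("DepartmenId", "EmolyeeID")
  .show()

// Expected (same rows as the window-function output above):
// +-----------+---------+-----------+-----------+----+
// |DepartmenId|EmolyeeID|ArrivalDate|ArrivalTime|Word|
// +-----------+---------+-----------+-----------+----+
// |         D1|       E1|   20170102|       1530|  NO|
// |         D1|       E2|   20170102|       0330| BOO|
// |         D2|       E1|   20170101|       1830| ATT|
// |         D2|       E2|   20170105|       1430| UNI|
// +-----------+---------+-----------+-----------+----+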
