SQL - Filtering within first_value (or window function in general)

I have some log data tracking invoice processing, like the example below:

Invoice  Activity              Date
--------------------------------------
A        Creation              12-Mar
A        Quantity change       13-Mar
A        Quantity change       14-Mar
A        Payment               17-Mar
B        Creation              20-Apr
B        Payment               24-Apr
B        Payment               29-Apr

I need to show, for every invoice, when the first and last of each activity occurred. For example, for invoice A there were two quantity changes, and I am interested in the date of the first one. I need to display everything in an aggregated table with 1 row per invoice, as shown below:

Invoice    Creation date    First quantity change     Last payment
---------------------------------------------------------------------
A          12-Mar           13-Mar                    17-Mar
B          20-Apr           NULL                      29-Apr

I have explored a couple of different options, but nothing has worked so far. The most obvious one is to join the table to itself, using the invoice ID as the join key. However, this is not feasible for performance reasons: the tables are very large and it would require too many joins.

Another option is to use the first_value and last_value functions, but I am not able to set them up in a way that gives me the results I need, because I can't find a way to put a filter in them.

I have tried this, which doesn't work, but kind of shows what I'm trying to do:

SELECT
Invoice
, first_value(CASE WHEN  Activity = 'Quantity Change' THEN Activity ELSE NULL END)
  OVER (PARTITION BY Invoice ORDER BY Date)

FROM
Data

Does anyone have a suggestion on how to do this? I am running these transformations in Google BigQuery.

Many thanks,

Alessandro

Using a PIVOT query:

SELECT * FROM (
  SELECT Invoice, Activity,
         FORMAT_DATE('%d-%b', MIN(date0) OVER (PARTITION BY Invoice, Activity)) first,
         FORMAT_DATE('%d-%b', MAX(date0) OVER (PARTITION BY Invoice, Activity)) last,
    FROM sample_table, UNNEST([PARSE_DATE('%d-%b', Date)]) date0
)  PIVOT (ANY_VALUE(first) first, ANY_VALUE(last) last FOR REPLACE(Activity, ' ','_') IN ('Creation', 'Payment', 'Quantity_change'));

You get the following result:

[image: query result]

You can make the above query more general using dynamic SQL, but I don't think you want a table with 100,000 columns.

So I think the query below, with its long (one row per invoice and activity) table schema, is more practical than a pivoted table.

SELECT DISTINCT Invoice, Activity,
       FORMAT_DATE('%d-%b', MIN(date0) OVER (PARTITION BY Invoice, Activity)) first,
       FORMAT_DATE('%d-%b', MAX(date0) OVER (PARTITION BY Invoice, Activity)) last,
  FROM sample_table, UNNEST([PARSE_DATE('%d-%b', Date)]) date0;

[image: query result]
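As a side note, the same long-format result can probably be written with a plain GROUP BY instead of window functions plus DISTINCT. This is only a sketch: the CTE repeats the sample data so the query runs on its own, and the aliases first_date and last_date are my own naming, not part of the original answer. I have not measured whether it is cheaper on a large table.

```sql
-- Sketch: sample data inlined as a CTE so the query is self-contained.
WITH sample_table AS (
  SELECT 'A' AS Invoice, 'Creation' AS Activity, '12-Mar' AS Date UNION ALL
  SELECT 'A', 'Quantity change', '13-Mar' UNION ALL
  SELECT 'A', 'Quantity change', '14-Mar' UNION ALL
  SELECT 'A', 'Payment', '17-Mar' UNION ALL
  SELECT 'B', 'Creation', '20-Apr' UNION ALL
  SELECT 'B', 'Payment', '24-Apr' UNION ALL
  SELECT 'B', 'Payment', '29-Apr'
)
SELECT
  Invoice,
  Activity,
  -- Parse the text date once, aggregate as DATE, then format back to text.
  FORMAT_DATE('%d-%b', MIN(date0)) AS first_date,  -- earliest occurrence of this activity
  FORMAT_DATE('%d-%b', MAX(date0)) AS last_date    -- latest occurrence of this activity
FROM sample_table, UNNEST([PARSE_DATE('%d-%b', Date)]) AS date0
GROUP BY Invoice, Activity
ORDER BY Invoice, Activity;
```

Each (Invoice, Activity) pair yields one row, so no DISTINCT is needed.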

The sample table used in the above queries:

create temp table sample_table as
select 'A' Invoice, 'Creation' Activity, '12-Mar' Date union all
select 'A', 'Quantity change', '13-Mar' union all
select 'A', 'Quantity change', '14-Mar' union all
select 'A', 'Payment', '17-Mar' union all
select 'B', 'Creation', '20-Apr' union all
select 'B', 'Payment', '24-Apr' union all
select 'B', 'Payment', '29-Apr';

You can achieve this by using the MIN and MAX aggregate functions.

WITH inv AS  
(
  SELECT "A" AS Invoice, 'Creation' as Activity, DATE '2022-03-12' as Date UNION ALL
  SELECT "A" AS Invoice, 'Quantity change' as Activity, DATE '2022-03-13' as Date UNION ALL
  SELECT "A" AS Invoice, 'Quantity change' as Activity, DATE '2022-03-14' as Date UNION ALL
  SELECT "A" AS Invoice, 'Payment' as Activity, DATE '2022-03-17' as Date UNION ALL
  SELECT "B" AS Invoice, 'Creation' as Activity, DATE '2022-04-20' as Date UNION ALL
  SELECT "B" AS Invoice, 'Payment' as Activity, DATE '2022-04-24' as Date UNION ALL
  SELECT "B" AS Invoice, 'Payment' as Activity, DATE '2022-04-29' as Date
)
SELECT 
  Invoice,
  MIN(IF(inv.Activity = 'Creation', Date, NULL)) as CreationDate,
  MIN(IF(inv.Activity = 'Quantity change', Date, NULL)) as FirstQuantityChange,
  MAX(IF(inv.Activity = 'Payment', Date, NULL)) as LastPayment
FROM inv
GROUP BY Invoice
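
If you specifically want the window-function form the question asks about, one possible sketch is FIRST_VALUE / LAST_VALUE with IGNORE NULLS, which BigQuery accepts for these navigation functions. The explicit frame matters: with only ORDER BY, the default frame ends at the current row, so rows before the first quantity change would still see NULL. The CTE inv repeats the sample data so the query runs on its own, and the window name w is just an illustrative choice.

```sql
-- Sketch: conditional FIRST_VALUE / LAST_VALUE instead of GROUP BY.
WITH inv AS (
  SELECT 'A' AS Invoice, 'Creation' AS Activity, DATE '2022-03-12' AS Date UNION ALL
  SELECT 'A', 'Quantity change', DATE '2022-03-13' UNION ALL
  SELECT 'A', 'Quantity change', DATE '2022-03-14' UNION ALL
  SELECT 'A', 'Payment', DATE '2022-03-17' UNION ALL
  SELECT 'B', 'Creation', DATE '2022-04-20' UNION ALL
  SELECT 'B', 'Payment', DATE '2022-04-24' UNION ALL
  SELECT 'B', 'Payment', DATE '2022-04-29'
)
SELECT DISTINCT  -- every row of an invoice carries the same values, so collapse them
  Invoice,
  -- The IF() acts as the filter: non-matching rows become NULL and are skipped.
  FIRST_VALUE(IF(Activity = 'Creation', Date, NULL) IGNORE NULLS)
    OVER w AS CreationDate,
  FIRST_VALUE(IF(Activity = 'Quantity change', Date, NULL) IGNORE NULLS)
    OVER w AS FirstQuantityChange,
  LAST_VALUE(IF(Activity = 'Payment', Date, NULL) IGNORE NULLS)
    OVER w AS LastPayment
FROM inv
WINDOW w AS (
  PARTITION BY Invoice
  ORDER BY Date
  -- Look at the whole partition, not just the rows up to the current one.
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
);
```

On the sample data this should give 2022-03-12, 2022-03-13, 2022-03-17 for invoice A and 2022-04-20, NULL, 2022-04-29 for invoice B, the same as the GROUP BY query. For one row per invoice, the GROUP BY version is usually the simpler choice.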

This is the solution with the dynamic columns.

BEGIN
DECLARE columns STRING;

CREATE TEMP TABLE inv
AS
SELECT "A" AS Invoice, 'Creation' as Activity, DATE '2022-03-12' as Date UNION ALL
SELECT "A" AS Invoice, 'Quantity change' as Activity, DATE '2022-03-13' as Date UNION ALL
SELECT "A" AS Invoice, 'Quantity change' as Activity, DATE '2022-03-14' as Date UNION ALL
SELECT "A" AS Invoice, 'Payment' as Activity, DATE '2022-03-17' as Date UNION ALL
SELECT "B" AS Invoice, 'Creation' as Activity, DATE '2022-04-20' as Date UNION ALL
SELECT "B" AS Invoice, 'Payment' as Activity, DATE '2022-04-24' as Date UNION ALL
SELECT "B" AS Invoice, 'Payment' as Activity, DATE '2022-04-29' as Date
;

SET columns = (
  SELECT STRING_AGG(
    CASE Activity WHEN 'Payment' THEN
    CONCAT("MAX(IF(inv.Activity = '", Activity ,"', Date, NULL)) as Last",REPLACE(Activity,' ','') , " ")
    ELSE
    CONCAT("MIN(IF(inv.Activity = '", Activity ,"', Date, NULL)) as First",REPLACE(Activity,' ','') , " ")
    END
  )
  FROM (SELECT DISTINCT Activity FROM inv)
);

SELECT columns;

EXECUTE IMMEDIATE format("""SELECT 
  Invoice,%s
  FROM inv
  GROUP BY Invoice
  """,columns);

END;

You need the CASE expression to decide which activities get MAX and which get MIN as the aggregate function, and EXECUTE IMMEDIATE to build and run the final statement.

Anyway, if you really have 100,000 distinct values for Activity, then you should use another table schema for your results, as @Jaytiger suggested.
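
To make the dynamic step less abstract, here is roughly what the generated statement looks like for the three sample activities. This is a hand-expanded sketch, not captured output: the CTE stands in for the temp table inv so the query runs on its own, and the column order may differ, because STRING_AGG without an ORDER BY does not guarantee one.

```sql
-- Sketch: hand-expanded form of the statement built by EXECUTE IMMEDIATE.
WITH inv AS (
  SELECT 'A' AS Invoice, 'Creation' AS Activity, DATE '2022-03-12' AS Date UNION ALL
  SELECT 'A', 'Quantity change', DATE '2022-03-13' UNION ALL
  SELECT 'A', 'Quantity change', DATE '2022-03-14' UNION ALL
  SELECT 'A', 'Payment', DATE '2022-03-17' UNION ALL
  SELECT 'B', 'Creation', DATE '2022-04-20' UNION ALL
  SELECT 'B', 'Payment', DATE '2022-04-24' UNION ALL
  SELECT 'B', 'Payment', DATE '2022-04-29'
)
SELECT
  Invoice,
  -- ELSE branch of the CASE: every activity other than 'Payment' gets MIN / "First".
  MIN(IF(inv.Activity = 'Creation', Date, NULL)) AS FirstCreation,
  MIN(IF(inv.Activity = 'Quantity change', Date, NULL)) AS FirstQuantitychange,
  -- WHEN 'Payment' branch: MAX / "Last".
  MAX(IF(inv.Activity = 'Payment', Date, NULL)) AS LastPayment
FROM inv
GROUP BY Invoice;
```

Note that the alias comes out as FirstQuantitychange, because REPLACE only removes the space and leaves the case of the activity name as it is.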
