SQL - Filtering within first_value (or window function in general)

I have some log data tracking invoice processing, like the example below:

Invoice  Activity              Date
--------------------------------------
A        Creation              12-Mar
A        Quantity change       13-Mar
A        Quantity change       14-Mar
A        Payment               17-Mar
B        Creation              20-Apr
B        Payment               24-Apr
B        Payment               29-Apr

I need to show, for every invoice, when the first or last occurrence of each activity took place. For example, invoice A had two quantity changes, and I am interested in the date of the first one. Everything should be displayed in an aggregated table with one row per invoice, as shown below:

Invoice    Creation date    First quantity change     Last payment
---------------------------------------------------------------------
A          12-Mar           13-Mar                    17-Mar
B          20-Apr           NULL                      29-Apr

I have explored a couple of options, but nothing has worked so far. The most obvious one is to join the table to itself, using the invoice id as the join key. However, this is not feasible for performance reasons: the tables are very large and this would require too many joins.

Another option is to use the first_value and last_value functions, but I cannot set them up to give the results I need, because I can't find a way to put a filter inside them.

I have tried this, which doesn't work, but kind of shows what I'm trying to do:

SELECT
Invoice
, first_value(CASE WHEN  Activity = 'Quantity Change' THEN Activity ELSE NULL END)
  OVER (PARTITION BY Invoice ORDER BY Date)

FROM
Data

Does anyone have a suggestion on how to do this? I am running these transformations in Google BigQuery.

Many thanks,

Alessandro

Using a PIVOT query:

SELECT * FROM (
  SELECT Invoice, Activity,
         FORMAT_DATE('%d-%b', MIN(date0) OVER (PARTITION BY Invoice, Activity)) first,
         FORMAT_DATE('%d-%b', MAX(date0) OVER (PARTITION BY Invoice, Activity)) last,
    FROM sample_table, UNNEST([PARSE_DATE('%d-%b', Date)]) date0
)  PIVOT (ANY_VALUE(first) first, ANY_VALUE(last) last FOR REPLACE(Activity, ' ','_') IN ('Creation', 'Payment', 'Quantity_change'));

This gives the following results:

[Image: query result]

You could make the above query more general using dynamic SQL, but I don't think you want a table with 100,000 columns.

So I think the query and table schema below are more practical than a pivoted table.

SELECT DISTINCT Invoice, Activity,
       FORMAT_DATE('%d-%b', MIN(date0) OVER (PARTITION BY Invoice, Activity)) first,
       FORMAT_DATE('%d-%b', MAX(date0) OVER (PARTITION BY Invoice, Activity)) last,
  FROM sample_table, UNNEST([PARSE_DATE('%d-%b', Date)]) date0;

[Image: query result]
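As a hedged variation on the long-format query above: since it only needs one row per (Invoice, Activity) pair, the same shape can probably be produced with a plain GROUP BY instead of window functions plus DISTINCT. This is a minimal sketch, not the answer's own query. The column aliases first_date and last_date are names chosen here for illustration, and the inline CTE only repeats the sample rows so that the block runs on its own.

```sql
-- Sketch: first/last date per invoice and activity via GROUP BY.
-- The CTE mirrors the sample data; in practice, query the real table instead.
WITH sample_table AS (
  SELECT 'A' AS Invoice, 'Creation' AS Activity, '12-Mar' AS Date UNION ALL
  SELECT 'A', 'Quantity change', '13-Mar' UNION ALL
  SELECT 'A', 'Quantity change', '14-Mar' UNION ALL
  SELECT 'A', 'Payment', '17-Mar' UNION ALL
  SELECT 'B', 'Creation', '20-Apr' UNION ALL
  SELECT 'B', 'Payment', '24-Apr' UNION ALL
  SELECT 'B', 'Payment', '29-Apr'
)
SELECT
  Invoice,
  Activity,
  -- Parse the 'dd-Mon' strings to DATE so MIN/MAX compare chronologically,
  -- then format back to the original display style.
  FORMAT_DATE('%d-%b', MIN(date0)) AS first_date,  -- alias is illustrative
  FORMAT_DATE('%d-%b', MAX(date0)) AS last_date    -- alias is illustrative
FROM sample_table, UNNEST([PARSE_DATE('%d-%b', Date)]) AS date0
GROUP BY Invoice, Activity
ORDER BY Invoice, Activity;
```

Grouping collapses the rows directly, so no DISTINCT pass is needed afterwards. Whether this is actually faster on a very large table would have to be measured.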

The sample table used in the above queries:

create temp table sample_table as
select 'A' Invoice, 'Creation' Activity, '12-Mar' Date union all
select 'A', 'Quantity change', '13-Mar' union all
select 'A', 'Quantity change', '14-Mar' union all
select 'A', 'Payment', '17-Mar' union all
select 'B', 'Creation', '20-Apr' union all
select 'B', 'Payment', '24-Apr' union all
select 'B', 'Payment', '29-Apr';

You can achieve this with the MIN and MAX aggregate functions, applying the filter inside each aggregate with IF:

WITH inv AS  
(
  SELECT "A" AS Invoice, 'Creation' as Activity, DATE '2022-03-12' as Date UNION ALL
  SELECT "A" AS Invoice, 'Quantity change' as Activity, DATE '2022-03-13' as Date UNION ALL
  SELECT "A" AS Invoice, 'Quantity change' as Activity, DATE '2022-03-14' as Date UNION ALL
  SELECT "A" AS Invoice, 'Payment' as Activity, DATE '2022-03-17' as Date UNION ALL
  SELECT "B" AS Invoice, 'Creation' as Activity, DATE '2022-04-20' as Date UNION ALL
  SELECT "B" AS Invoice, 'Payment' as Activity, DATE '2022-04-24' as Date UNION ALL
  SELECT "B" AS Invoice, 'Payment' as Activity, DATE '2022-04-29' as Date
)
SELECT 
  Invoice,
  MIN(IF(inv.Activity = 'Creation', Date, NULL)) as CreationDate,
  MIN(IF(inv.Activity = 'Quantity change', Date, NULL)) as FirstQuantityChange,
  MAX(IF(inv.Activity = 'Payment', Date, NULL)) as LastPayment
FROM inv
GROUP BY Invoice

This is the solution with dynamic columns:

BEGIN
DECLARE columns STRING;

CREATE TEMP TABLE inv
AS
SELECT "A" AS Invoice, 'Creation' as Activity, DATE '2022-03-12' as Date UNION ALL
SELECT "A" AS Invoice, 'Quantity change' as Activity, DATE '2022-03-13' as Date UNION ALL
SELECT "A" AS Invoice, 'Quantity change' as Activity, DATE '2022-03-14' as Date UNION ALL
SELECT "A" AS Invoice, 'Payment' as Activity, DATE '2022-03-17' as Date UNION ALL
SELECT "B" AS Invoice, 'Creation' as Activity, DATE '2022-04-20' as Date UNION ALL
SELECT "B" AS Invoice, 'Payment' as Activity, DATE '2022-04-24' as Date UNION ALL
SELECT "B" AS Invoice, 'Payment' as Activity, DATE '2022-04-29' as Date
;

SET columns = (
  SELECT STRING_AGG(
    CASE Activity WHEN 'Payment' THEN
    CONCAT("MAX(IF(inv.Activity = '", Activity ,"', Date, NULL)) as Last",REPLACE(Activity,' ','') , " ")
    ELSE
    CONCAT("MIN(IF(inv.Activity = '", Activity ,"', Date, NULL)) as First",REPLACE(Activity,' ','') , " ")
    END
  )
  FROM (SELECT DISTINCT Activity FROM inv)
);

SELECT columns;

EXECUTE IMMEDIATE format("""SELECT 
  Invoice,%s
  FROM inv
  GROUP BY Invoice
  """,columns);

END;

The CASE expression decides which activities get MAX and which get MIN as the aggregate function, and EXECUTE IMMEDIATE runs the final statement built from the generated column list.

That said, if you really have 100,000 distinct values for Activity, you should use a different table schema for your results, as @Jaytiger suggested.
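For completeness, the window-function form from the question can likely be made to work too. The attempt there returns Activity rather than Date, and without IGNORE NULLS first_value simply returns the value from the first row of the window, which is NULL whenever that row is a different activity. Below is a minimal sketch, assuming BigQuery's IGNORE NULLS modifier and an explicit full-partition frame. The inline CTE repeats the sample data so the block is self-contained, and the output column names are illustrative.

```sql
-- Sketch: filtering inside FIRST_VALUE / LAST_VALUE with IGNORE NULLS.
WITH inv AS (
  SELECT 'A' AS Invoice, 'Creation' AS Activity, DATE '2022-03-12' AS Date UNION ALL
  SELECT 'A', 'Quantity change', DATE '2022-03-13' UNION ALL
  SELECT 'A', 'Quantity change', DATE '2022-03-14' UNION ALL
  SELECT 'A', 'Payment', DATE '2022-03-17' UNION ALL
  SELECT 'B', 'Creation', DATE '2022-04-20' UNION ALL
  SELECT 'B', 'Payment', DATE '2022-04-24' UNION ALL
  SELECT 'B', 'Payment', DATE '2022-04-29'
)
SELECT DISTINCT  -- every row of an invoice carries the same values, so DISTINCT leaves one row each
  Invoice,
  -- Non-matching rows become NULL and are skipped by IGNORE NULLS.
  -- The frame must span the whole partition; the default frame ends at the
  -- current row, so rows before the first match would otherwise see NULL.
  FIRST_VALUE(IF(Activity = 'Creation', Date, NULL) IGNORE NULLS)
    OVER (PARTITION BY Invoice ORDER BY Date
          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS CreationDate,
  FIRST_VALUE(IF(Activity = 'Quantity change', Date, NULL) IGNORE NULLS)
    OVER (PARTITION BY Invoice ORDER BY Date
          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS FirstQuantityChange,
  LAST_VALUE(IF(Activity = 'Payment', Date, NULL) IGNORE NULLS)
    OVER (PARTITION BY Invoice ORDER BY Date
          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS LastPayment
FROM inv;
```

This computes a value for every input row before DISTINCT collapses them, so the MIN/MAX with GROUP BY version is probably the simpler choice when only one row per invoice is needed.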
