
Aggregate consecutive values in BigQuery

I need to aggregate consecutive values in a table with BigQuery, as shown in the example below. segment can only be 'A' or 'B'; value is a string.
Basically, for each id I need to consider only rows with segment = 'A', taking the gaps into account. Rows should be ordered by date_column ASC.

Example

id, segment, value, date_column
1, A, 3, daytime
1, A, 2, daytime
1, A, x, daytime
1, B, 3, daytime
1, B, 3, daytime
1, B, 3, daytime
1, A, 7, daytime
1, A, 3, daytime
1, B, 3, daytime
1, A, 9, daytime
1, A, 9, daytime
2, A, 3, daytime
2, B, 3, daytime
2, A, 3, daytime
2, A, m, daytime

Expected result

id, agg_values_A_segment
1, ['32x', '73', '99']
2, ['3', '3m']

How can I achieve this result? I'm struggling with the 'gaps' between the segments.

Below are two options for BigQuery Standard SQL.

Option 1 - using analytic (window) functions

#standardSQL
SELECT id, ARRAY_AGG(values_in_group ORDER BY grp) agg_values_A_segment
FROM (
  SELECT id, grp, STRING_AGG(value, '' ORDER BY date_column) values_in_group
  FROM (
    SELECT id, segment, value, date_column, flag, 
      COUNTIF(flag) OVER(PARTITION BY id ORDER BY date_column) grp
    FROM (
      SELECT *, IFNULL(LAG(segment) OVER(PARTITION BY id ORDER BY date_column), segment) != segment flag
      FROM `project.dataset.table`
    )
  )
  WHERE segment = 'A'
  GROUP BY id, grp
)
GROUP BY id   
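
The grouping idea behind Option 1 can be sketched in plain Python (using the sample data from the question; the function name `aggregate` is just for illustration): a row whose segment differs from the previous row's segment starts a new run, which is exactly what the running `COUNTIF(flag)` computes. Here `itertools.groupby` collapses the consecutive runs directly:

```python
# Sketch of Option 1's logic: group consecutive rows per id by segment,
# then keep only the 'A' runs and concatenate their values.
from itertools import groupby

rows = [  # (id, segment, value), already ordered by date_column
    (1, 'A', '3'), (1, 'A', '2'), (1, 'A', 'x'),
    (1, 'B', '3'), (1, 'B', '3'), (1, 'B', '3'),
    (1, 'A', '7'), (1, 'A', '3'), (1, 'B', '3'),
    (1, 'A', '9'), (1, 'A', '9'),
    (2, 'A', '3'), (2, 'B', '3'), (2, 'A', '3'), (2, 'A', 'm'),
]

def aggregate(rows):
    result = {}
    for rid, id_rows in groupby(rows, key=lambda r: r[0]):
        runs = []
        # consecutive rows with the same segment form one run,
        # mirroring the COUNTIF(flag) OVER (...) group number
        for segment, run in groupby(id_rows, key=lambda r: r[1]):
            if segment == 'A':
                runs.append(''.join(value for _, _, value in run))
        result[rid] = runs
    return result

print(aggregate(rows))  # {1: ['32x', '73', '99'], 2: ['3', '3m']}
```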

You can test and play with the above using the sample data from your question, as in the example below:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id, 'A' segment, '3' value, DATETIME '2019-01-07T18:46:21' date_column UNION ALL
  SELECT 1, 'A', '2', '2019-01-07T18:46:22' UNION ALL
  SELECT 1, 'A', 'x', '2019-01-07T18:46:23' UNION ALL
  SELECT 1, 'B', '3', '2019-01-07T18:46:24' UNION ALL
  SELECT 1, 'B', '3', '2019-01-07T18:46:25' UNION ALL
  SELECT 1, 'B', '3', '2019-01-07T18:46:26' UNION ALL
  SELECT 1, 'A', '7', '2019-01-07T18:46:27' UNION ALL
  SELECT 1, 'A', '3', '2019-01-07T18:46:28' UNION ALL
  SELECT 1, 'B', '3', '2019-01-07T18:46:29' UNION ALL
  SELECT 1, 'A', '9', '2019-01-07T18:46:30' UNION ALL
  SELECT 1, 'A', '9', '2019-01-07T18:46:31' UNION ALL
  SELECT 2, 'A', '3', '2019-01-07T18:46:32' UNION ALL
  SELECT 2, 'B', '3', '2019-01-07T18:46:33' UNION ALL
  SELECT 2, 'A', '3', '2019-01-07T18:46:34' UNION ALL
  SELECT 2, 'A', 'm', '2019-01-07T18:46:35' 
)
SELECT id, ARRAY_AGG(values_in_group ORDER BY grp) agg_values_A_segment
FROM (
  SELECT id, grp, STRING_AGG(value, '' ORDER BY date_column) values_in_group
  FROM (
    SELECT id, segment, value, date_column, flag, 
      COUNTIF(flag) OVER(PARTITION BY id ORDER BY date_column) grp
    FROM (
      SELECT *, IFNULL(LAG(segment) OVER(PARTITION BY id ORDER BY date_column), segment) != segment flag
      FROM `project.dataset.table`
    )
  )
  WHERE segment = 'A'
  GROUP BY id, grp
)
GROUP BY id
-- ORDER BY id   

with the result

Row id  agg_values_A_segment     
1   1   32x  
        73   
        99   
2   2   3    
        3m     

Option 2 - the option above should work for big volumes of rows per id, but it looks a little heavy. This second option is simpler, but it assumes you have some character or sequence of characters that you are sure cannot result from combining your values - for example a pipe character or a tab. In the example below I chose the word 'delimiter', assuming it will not appear as a result of concatenation.

#standardSQL
SELECT id,
  ARRAY(SELECT part FROM UNNEST(parts) part WHERE part != '') agg_values_A_segment 
FROM (
  SELECT id, 
    SPLIT(STRING_AGG(IF(segment = 'A', value, 'delimiter'), ''), 'delimiter') parts
  FROM `project.dataset.table`
  GROUP BY id
)
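
The delimiter trick can also be sketched in Python (same sample data; the name `aggregate_with_delimiter` is illustrative): concatenate every value per id, substituting the sentinel string for non-'A' rows, then split on the sentinel and drop empty parts - the same steps as `SPLIT(STRING_AGG(...))` plus the empty-part filter:

```python
# Sketch of Option 2's logic: one concatenated string per id, with a
# sentinel marking each non-'A' row, then split and drop empties.
rows = [  # (id, segment, value), already ordered by date_column
    (1, 'A', '3'), (1, 'A', '2'), (1, 'A', 'x'),
    (1, 'B', '3'), (1, 'B', '3'), (1, 'B', '3'),
    (1, 'A', '7'), (1, 'A', '3'), (1, 'B', '3'),
    (1, 'A', '9'), (1, 'A', '9'),
    (2, 'A', '3'), (2, 'B', '3'), (2, 'A', '3'), (2, 'A', 'm'),
]

def aggregate_with_delimiter(rows, sentinel='delimiter'):
    # STRING_AGG(IF(segment = 'A', value, 'delimiter'), '') per id
    concat = {}
    for rid, segment, value in rows:
        concat[rid] = concat.get(rid, '') + (value if segment == 'A' else sentinel)
    # SPLIT(..., 'delimiter') and keep only non-empty parts
    return {rid: [part for part in s.split(sentinel) if part != '']
            for rid, s in concat.items()}

print(aggregate_with_delimiter(rows))  # {1: ['32x', '73', '99'], 2: ['3', '3m']}
```

Note how consecutive non-'A' rows produce adjacent sentinels, which split into empty strings - that is why the outer filter on empty parts is needed in the SQL version too.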

You can test and play with the above using the same sample data:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id, 'A' segment, '3' value, DATETIME '2019-01-07T18:46:21' date_column UNION ALL
  SELECT 1, 'A', '2', '2019-01-07T18:46:22' UNION ALL
  SELECT 1, 'A', 'x', '2019-01-07T18:46:23' UNION ALL
  SELECT 1, 'B', '3', '2019-01-07T18:46:24' UNION ALL
  SELECT 1, 'B', '3', '2019-01-07T18:46:25' UNION ALL
  SELECT 1, 'B', '3', '2019-01-07T18:46:26' UNION ALL
  SELECT 1, 'A', '7', '2019-01-07T18:46:27' UNION ALL
  SELECT 1, 'A', '3', '2019-01-07T18:46:28' UNION ALL
  SELECT 1, 'B', '3', '2019-01-07T18:46:29' UNION ALL
  SELECT 1, 'A', '9', '2019-01-07T18:46:30' UNION ALL
  SELECT 1, 'A', '9', '2019-01-07T18:46:31' UNION ALL
  SELECT 2, 'A', '3', '2019-01-07T18:46:32' UNION ALL
  SELECT 2, 'B', '3', '2019-01-07T18:46:33' UNION ALL
  SELECT 2, 'A', '3', '2019-01-07T18:46:34' UNION ALL
  SELECT 2, 'A', 'm', '2019-01-07T18:46:35' 
)
SELECT id,
  ARRAY(SELECT part FROM UNNEST(parts) part WHERE part != '') agg_values_A_segment 
FROM (
  SELECT id, 
    SPLIT(STRING_AGG(IF(segment = 'A', value, 'delimiter'), ''), 'delimiter') parts
  FROM `project.dataset.table`
  GROUP BY id
)
-- ORDER BY id   

obviously with the same result

Row id  agg_values_A_segment     
1   1   32x  
        73   
        99   
2   2   3    
        3m      

Note: the second option can fail with a "resources exceeded" error when you have too many rows per id - you just need to try it on your real data.

SQL tables represent unordered sets. This is particularly true in a parallel, columnar database such as BigQuery. The rest of this answer assumes you have a column that specifies the ordering of the rows.

This is a gaps-and-islands problem. You can use the difference of row_number() values to identify the adjacent groups, and then aggregate:

select id, array_agg(vals order by min_ordercol)
from (select id, segment, string_agg(value, '' order by date_column) as vals,
             min(date_column) as min_ordercol
      from (select t.*,
                   row_number() over (partition by id order by date_column) as seqnum,
                   row_number() over (partition by id, segment order by date_column) as seqnum_2
            from t
           ) t
      group by id, segment, (seqnum - seqnum_2)
     ) x
group by id;
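
The row_number() difference trick above can be sketched in Python as well (same sample data; `gaps_and_islands` is an illustrative name): `seqnum` counts rows per id, `seqnum_2` counts rows per (id, segment), and their difference stays constant within each consecutive run of one segment, so it works as a group key:

```python
# Sketch of the gaps-and-islands trick: seqnum - seqnum_2 is constant
# within each consecutive run of a single segment.
from collections import defaultdict

rows = [  # (id, segment, value), already ordered by date_column
    (1, 'A', '3'), (1, 'A', '2'), (1, 'A', 'x'),
    (1, 'B', '3'), (1, 'B', '3'), (1, 'B', '3'),
    (1, 'A', '7'), (1, 'A', '3'), (1, 'B', '3'),
    (1, 'A', '9'), (1, 'A', '9'),
    (2, 'A', '3'), (2, 'B', '3'), (2, 'A', '3'), (2, 'A', 'm'),
]

def gaps_and_islands(rows):
    seqnum = defaultdict(int)    # row_number() over (partition by id)
    seqnum_2 = defaultdict(int)  # row_number() over (partition by id, segment)
    islands = {}                 # (id, segment, diff) -> concatenated values
    for rid, segment, value in rows:
        seqnum[rid] += 1
        seqnum_2[(rid, segment)] += 1
        key = (rid, segment, seqnum[rid] - seqnum_2[(rid, segment)])
        islands[key] = islands.get(key, '') + value
    # dict insertion order matches first appearance, i.e. min(date_column)
    result = defaultdict(list)
    for (rid, segment, _), vals in islands.items():
        if segment == 'A':
            result[rid].append(vals)
    return dict(result)

print(gaps_and_islands(rows))  # {1: ['32x', '73', '99'], 2: ['3', '3m']}
```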
