简体   繁体   中英

Group rows based on column values in SQL / BigQuery

Is it possible to "group" rows within BigQuery/SQL depending on column values? Let's say I want to assign a string/id for all rows between stream_start_init and stream_start and then do the same for the rows between stream_resume and the last stream_ad.

The amount of stream_ad event can differ hence I can't use a RANK() or ROW() to group them be based on those values.

|id, timestamp, event|
|1 |  1231231 | first_visit|
|2 |  1231232 | login|
|3 |  1231233 | page_view|
|4 |  1231234 | page_view| 
|5 |  1231235 | stream_start_init|
|6 |  1231236 | stream_ad|
|7 |  1231237 | stream_ad| 
|8 |  1231238 | stream_ad| 
|9 |  1231239 | stream_start|
|6 |  1231216 | stream_resume|
|6 |  1231236 | stream_ad|
|7 |  1231217 | stream_ad| 
|8 |  1231258 | stream_ad| 
|10|  1231240 | page_view|

How I wish the table to be

|id, timestamp, event, group_id|
|1 |  1231231 | first_visit, null|
|2 |  1231232 | login, null|
|3 |  1231233 | page_view, null|
|4 |  1231234 | page_view, null| 
|5 |  1231235 | stream_start_init, group_1|
|6 |  1231236 | stream_ad, group_1|
|7 |  1231237 | stream_ad, group_1| 
|8 |  1231238 | stream_ad, group_1| 
|9 |  1231239 | stream_start, group_1|
|6 |  1231216 | stream_resume, group_2|
|6 |  1231236 | stream_ad, group_2|
|7 |  1231217 | stream_ad, group_2| 
|8 |  1231258 | stream_ad, group_2| 
|10|  1231240 | page_view, null|

I wouldn't assign a string. I would assign a number. This appears to be a cumulative sum. I think a sum of the number of "stream_start_init" and "stream_resume" does what you want:

select t.*,
       countif(event in ('stream_start_init', 'stream_resume')) over (order by timestamp) as group_id
from t;

Note that this produces 0 for the first group -- which seems like a good thing. You can convert that to a NULL using NULLIF() .

If you really want strings, you can use CONCAT() .

Below is for BigQuery Standard SQL

#standardSQL
SELECT *,
  IF(event IN ('stream_start_init', 'stream_start', 'stream_resume', 'stream_ad'),
    COUNTIF(event IN ('stream_start_init', 'stream_resume')) OVER(ORDER BY timestamp),
    NULL
  ) AS group_id
FROM `project.dataset.table`

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM