
How to make sure only rows with max timestamp values are selected in BigQuery?

My table looks something like this:

datetime | field_a | field_b | field_c | field_d | field_e | field_f | updated_at

Actually, there are more fields than that, closer to 20; the a-f lettering is just for brevity.

This table is updated on a regular basis, and the same rows can appear more than once but with more recent values of updated_at.

What I want to achieve is to select only the rows with the most recent updated_at so as to avoid duplicates (rows A and B are duplicates if the only difference between them is the value of updated_at).

My initial attempt is something like this:

WITH temp AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY datetime, field_a, field_b, ... field_f ORDER BY updated_at DESC) AS rnk
    FROM some_table
)

SELECT * FROM temp WHERE rnk = 1

At first, I had thought that using datetime in the PARTITION BY clause might be enough, but it seems that I have to include all the fields so that the desired deduplication can happen.

Does this approach make sense? Am I correct in that all fields should be included in the window function? Is there a more elegant way to achieve what I want?

Sample input:

datetime | field_a | field_b | field_c | field_d | field_e | field_f | updated_at 

2022-04-05 | a | b | c | d | e | f | 2022-04-05T20:11:42.864086

2022-04-05 | a | b | c | d | e | f | 2022-04-05T20:22:42.864086

2022-04-04 | a | b | c | d | e | f | 2022-04-05T19:11:42.864086

2022-04-04 | a | b | c | d | e | f | 2022-04-05T19:22:42.864086

The query should return:

2022-04-05 | a | b | c | d | e | f | 2022-04-05T20:22:42.864086

2022-04-04 | a | b | c | d | e | f | 2022-04-05T19:22:42.864086

That is, among rows where all fields are the same (except for updated_at), keep the one whose updated_at is the largest. In other words, the most recent row for each unique combination of (datetime, field_a, field_b, field_c, field_d, field_e, field_f).

Consider the approach below:

select * from your_table t
qualify 1 = row_number() over win
window win as (partition by to_json_string((select as struct * except(updated_at) from unnest([t]))) order by updated_at desc)    

If applied to the sample data in your question, the output is:

[image: query output]
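
To try this without an existing table, here is a minimal sketch that inlines the four sample rows from the question as a CTE. The CTE is named your_table to match the query above; the column types (DATE for datetime, DATETIME for updated_at) are assumptions, since the question does not state them. The partition key is the JSON serialization of every column except updated_at, so rows that differ only in updated_at land in the same partition, and no column list has to be maintained by hand.

```sql
-- Sample rows from the question, inlined so the query is self-contained.
-- Column types are assumed: DATE for `datetime`, DATETIME for updated_at.
with your_table as (
  select date '2022-04-05' as `datetime`,
         'a' as field_a, 'b' as field_b, 'c' as field_c,
         'd' as field_d, 'e' as field_e, 'f' as field_f,
         datetime '2022-04-05T20:11:42.864086' as updated_at
  union all
  select date '2022-04-05', 'a', 'b', 'c', 'd', 'e', 'f',
         datetime '2022-04-05T20:22:42.864086'
  union all
  select date '2022-04-04', 'a', 'b', 'c', 'd', 'e', 'f',
         datetime '2022-04-05T19:11:42.864086'
  union all
  select date '2022-04-04', 'a', 'b', 'c', 'd', 'e', 'f',
         datetime '2022-04-05T19:22:42.864086'
)
select *
from your_table t
-- keep only the first row of each partition, i.e. the latest updated_at
qualify 1 = row_number() over win
window win as (
  -- one partition per distinct combination of all columns except updated_at
  partition by to_json_string((select as struct * except(updated_at) from unnest([t])))
  order by updated_at desc
)
```

With these rows it should return one row per datetime value: the 20:22:42.864086 row for 2022-04-05 and the 19:22:42.864086 row for 2022-04-04, which is the result the question asks for.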
