LAST_VALUE with an IF expression inside is not backfilling its partition, so the last value is lost when selecting the first row of each partition (BigQuery/SQL)

I am having trouble with window functions. For a data set containing events tied to users, I want to pick the FIRST_VALUE for some columns and the LAST_VALUE for others, and condense the result into one row per user.

With a FIRST_VALUE/LAST_VALUE approach, partitioning by user and ordering by date/timestamp, FIRST_VALUE gives a satisfactory result: the value from the first row populates the whole column. In the LAST_VALUE clause I wrap the argument in an IF expression to create a column holding the time of account deletion, and that column is not backfilled in the same way. Any suggestions for a way to fix this?

Including a minimal example table below, and an expected output further down.

WITH dataset_table AS (
  SELECT DATE '2020-01-01' date , 1 user, 'german' user_language, 'created_account' event UNION ALL
  SELECT '2020-01-02', 1, 'german', 'successful_login' UNION ALL
  SELECT '2020-01-03', 1, 'english', 'screen_view' UNION ALL
  SELECT '2020-01-04', 1, 'english', 'deleted_account' UNION ALL
  SELECT '2020-01-01', 2, 'english', 'login' UNION ALL
  SELECT '2020-01-02', 2, 'english', 'settings' UNION ALL
  SELECT '2020-01-03', 2, 'english', 'NULL' UNION ALL
  SELECT '2020-01-04', 2, 'french', 'screen_view'
),

user_info AS (
    SELECT
        `date`,
        user,
        -- record first value for language = signup demographics
        FIRST_VALUE(user_language IGNORE NULLS) OVER time_order user_language,
        -- record last value for app removal - want to know if the user deleted their account and didn't return
        LAST_VALUE(IF(event = 'deleted_account', `date`, NULL)) OVER time_order deleted_account,
        ROW_NUMBER() OVER time_order row_idx
    FROM dataset_table
    WINDOW time_order AS (PARTITION BY user ORDER BY date)
)

SELECT
  *
FROM user_info
WHERE row_idx = 1
-- I select the first row here, but deleted_account has not been populated with the last value for user 1.
-- The same approach with FIRST_VALUE does populate the whole column with 'german'.
-- Using row_idx = 4 would give the correct answer for this example, but in reality each user has a
-- different number of events, so I want to use row_idx = 1 to pick out the row.

Expected output:

date         user  user_language  deleted_account row_idx 
2020-01-01   1     german         2020-01-04      1
2020-01-01   2     english        null            1

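When a window has an ORDER BY and no explicit frame, the default frame runs from the start of the partition to the current row. On the first row of each partition, LAST_VALUE therefore only sees that one row, whereas FIRST_VALUE still works because the first row is always inside the frame. Taking MAX over the whole partition (no ORDER BY) avoids the problem. I think you want: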

with dataset_table AS (...),
user_info AS (
    SELECT
        `date`,
        user,
        FIRST_VALUE(user_language IGNORE NULLS) OVER (PARTITION BY user ORDER BY date) user_language,
        MAX(IF(event = 'deleted_account', `date`, NULL)) OVER (PARTITION BY user) deleted_account,
        ROW_NUMBER() OVER (PARTITION BY user ORDER BY date) row_idx
    FROM dataset_table
)

SELECT *
FROM user_info
WHERE row_idx = 1 
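
If you would rather keep LAST_VALUE, a possible alternative is to spell out the window frame so it covers the whole partition. The following is only a sketch against the sample data from the question; the window name full_partition is made up for illustration. Note that it is not equivalent to the MAX version: MAX returns the latest deletion date even if the user came back afterwards, while LAST_VALUE without IGNORE NULLS only returns a date when the user's final event is 'deleted_account', which may be closer to the "deleted their account and didn't return" requirement.

```sql
WITH dataset_table AS (
  SELECT DATE '2020-01-01' AS `date`, 1 AS user, 'german' AS user_language, 'created_account' AS event UNION ALL
  SELECT DATE '2020-01-02', 1, 'german', 'successful_login' UNION ALL
  SELECT DATE '2020-01-03', 1, 'english', 'screen_view' UNION ALL
  SELECT DATE '2020-01-04', 1, 'english', 'deleted_account' UNION ALL
  SELECT DATE '2020-01-01', 2, 'english', 'login' UNION ALL
  SELECT DATE '2020-01-02', 2, 'english', 'settings' UNION ALL
  SELECT DATE '2020-01-03', 2, 'english', 'NULL' UNION ALL
  SELECT DATE '2020-01-04', 2, 'french', 'screen_view'
),

user_info AS (
  SELECT
    `date`,
    user,
    -- first non-NULL language anywhere in the user's partition
    FIRST_VALUE(user_language IGNORE NULLS) OVER full_partition AS user_language,
    -- deletion date only if the user's last event is the deletion, otherwise NULL
    LAST_VALUE(IF(event = 'deleted_account', `date`, NULL)) OVER full_partition AS deleted_account,
    ROW_NUMBER() OVER (PARTITION BY user ORDER BY `date`) AS row_idx
  FROM dataset_table
  -- full_partition is a hypothetical name; the explicit frame replaces the
  -- default "start of partition up to the current row" frame
  WINDOW full_partition AS (
    PARTITION BY user
    ORDER BY `date`
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  )
)

SELECT *
FROM user_info
WHERE row_idx = 1
```

Because the frame now spans every row of the partition, each row (including the first) sees the same first and last values, so filtering on row_idx = 1 should keep them.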

