
BIG QUERY SQL: Tagging dates on a sliding time window with a moving reference

WHAT IS THIS POST FOR?

As a team, we wish to tag each date as TRUE or FALSE based on the following:

1) If the visit_date is the visitor's first visit, then TRUE

2) For all visit_dates after the first visit date:

a) Compared to the last visit_date that was TRUE (hereafter LAST TRUE), if the visit_date is within the 30-day forward-looking window of the LAST TRUE, then FALSE; otherwise TRUE, and this visit_date becomes the new LAST TRUE

[Images: Visitor A, Visitor B]

WHAT DATA AM I USING?

Website data with visitor and visit_date.

A visit_date is recorded for every new session a visitor creates. We reduce the data to one record per date per visitor.

ISSUE:

The simple LAG and LEAD functions do not help because:

1) The reference position of LAST TRUE keeps changing after a new TRUE tag is given

a) The number of records we have to check through for LAST TRUE changes by visitor. The next TRUE can be after 2 rows or 15 rows; this depends upon the visitor.

Effectively, we were not able to create a loop within BigQuery to:

  1. Check if the visit_date meets the criteria

  2. If it does then tag it as TRUE, also change the LAST TRUE reference to this visit_date

  3. Repeat the process till last record for a visitor
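
The three steps above amount to a single pass over each visitor's sorted dates while carrying the end of the current window along. Below is a minimal JavaScript sketch of that loop, run outside BigQuery. The function name `tagVisits` and the plain-array input are assumptions made for illustration; dates are day numbers, as in the rest of this post.

```javascript
// Hypothetical helper: tags each visit day as 1 (TRUE) or 0 (FALSE).
// Input: array of integer day numbers for ONE visitor.
function tagVisits(days) {
  // Sort numerically on a copy so the caller's array is left untouched.
  var sorted = days.slice().sort(function (a, b) { return a - b; });
  var windowEnd = null; // LAST TRUE + 30; null until the first visit is tagged
  return sorted.map(function (day) {
    // Step 1: the first visit, or any visit past the window, meets the criteria.
    var isTrue = windowEnd === null || day > windowEnd;
    if (isTrue) {
      // Step 2: move the LAST TRUE reference to this visit.
      windowEnd = day + 30;
    }
    // Step 3: map() repeats this until the visitor's last record.
    return { visit_date: day, true_false: isTrue ? 1 : 0 };
  });
}

// Visitor A from the sample data in this post
console.log(tagVisits([1, 38, 41, 65, 76]));
```

For visitor A this tags days 1, 38 and 76 as TRUE and days 41 and 65 as FALSE.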

QUERIES USED:

The first table (table_id) just holds the following calculated columns:

visitor: Unique identifier

visit_date: Day of visit

plus_30_days: Day of visit + 30 days

lag_value: Value of visit_date one row above

lead_value: Value of visit_date one row below

For ease of use, all values were converted with DAYOFYEAR().

Based on the above table, I used the query below:

 SELECT    *
 , CASE WHEN lag_value IS NULL THEN 1 
        WHEN visit_date - lag_value > 30 THEN 1 
        WHEN lead_value IS NULL THEN 1 
        ELSE 0 END AS true_false
 FROM [project_id:dataset.table_id] 
 ORDER BY visitor, visit_date

OUTPUT:

The GREEN box should be TRUE in the output, because:

[Image: Visitor C]
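
To make the discrepancy concrete, here is a small JavaScript sketch that replays both rules on visitor C's dates from the reference table in this post. `lagLeadTags` mirrors the CASE expression in the query and `slidingTags` applies the LAST TRUE rule; both function names are made up for this illustration.

```javascript
var visitorC = [123, 134, 163, 183, 217];

// Mirrors the CASE expression: each date is compared only with the row above.
function lagLeadTags(days) {
  return days.map(function (day, i) {
    var lag = i > 0 ? days[i - 1] : null;
    var lead = i < days.length - 1 ? days[i + 1] : null;
    if (lag === null) return 1;      // WHEN lag_value IS NULL THEN 1
    if (day - lag > 30) return 1;    // WHEN visit_date - lag_value > 30 THEN 1
    if (lead === null) return 1;     // WHEN lead_value IS NULL THEN 1
    return 0;
  });
}

// Intended rule: compare each date with the window opened by the LAST TRUE.
function slidingTags(days) {
  var windowEnd = null;
  return days.map(function (day) {
    if (windowEnd !== null && day <= windowEnd) return 0;
    windowEnd = day + 30;
    return 1;
  });
}

console.log(lagLeadTags(visitorC).join(',')); // 1,0,0,0,1
console.log(slidingTags(visitorC).join(',')); // 1,0,1,0,1
```

Day 163 is 40 days after the last TRUE (day 123) but only 29 days after the row above it (day 134), so the row-above comparison tags it 0 while the sliding rule tags it 1.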

ACTIONS TAKEN:

1) Tried the LEAD and LAG functions

2) Searched the web for standard SQL constructs that translate to BigQuery >> could not find a loop construct

3) Asked my team lead >> SAME ISSUE

4) Up to 3 hours of experimenting >> SAME OUTPUT

Table for your reference:

visitor_id  file_date  plus_30_days  lag_value  lead_value  true_false
A           1          31            null       38          1
A           38         68            1          41          1
A           41         71            38         65          0
A           65         95            41         76          0
A           76         106           65         null        1
B           90         120           null       122         1
B           122        152           90         null        1
C           123        153           null       134         1
C           134        164           123        163         0
C           163        193           134        183         0
C           183        213           163        217         0
C           217        247           183        null        1
D           245        275           null       256         1
D           256        286           245        262         0
D           262        292           256        275         0
D           275        305           262        279         0
D           279        309           275        null        1
E           279        309           null       310         1
E           310        340           279        341         1
E           341        5             310        null        1

Try below.

SELECT visitor_id, file_date, true_false FROM JS(    // input table
( SELECT visitor_id, GROUP_CONCAT(STRING(100000 + file_date), ';') AS visits FROM
    (SELECT 'A' AS visitor_id, 1 AS file_date), (SELECT 'A' AS visitor_id, 38 AS file_date), (SELECT 'A' AS visitor_id, 41 AS file_date), (SELECT 'A' AS visitor_id, 65 AS file_date),
    (SELECT 'A' AS visitor_id, 76 AS file_date), (SELECT 'B' AS visitor_id, 90 AS file_date), (SELECT 'B' AS visitor_id, 122 AS file_date), (SELECT 'C' AS visitor_id, 123 AS file_date),
    (SELECT 'C' AS visitor_id, 134 AS file_date), (SELECT 'C' AS visitor_id, 163 AS file_date), (SELECT 'C' AS visitor_id, 183 AS file_date), (SELECT 'C' AS visitor_id, 217 AS file_date),
    (SELECT 'D' AS visitor_id, 245 AS file_date), (SELECT 'D' AS visitor_id, 256 AS file_date), (SELECT 'D' AS visitor_id, 262 AS file_date), (SELECT 'D' AS visitor_id, 275 AS file_date),
    (SELECT 'D' AS visitor_id, 279 AS file_date), (SELECT 'E' AS visitor_id, 279 AS file_date), (SELECT 'E' AS visitor_id, 310 AS file_date), (SELECT 'E' AS visitor_id, 341 AS file_date)  
  GROUP BY visitor_id
) ,
// input columns
visitor_id, visits,
// output schema
"[{name: 'visitor_id', type: 'string'},
{name: 'file_date', type: 'integer'},
{name: 'true_false', type: 'integer'}]",
// function
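"function(r, emit){
  var visits = r.visits.split(';');
  // Lexicographic sort is safe: the 100000 offset makes every value six digits long
  visits.sort();
  var plus_30_days = 0;  // end of the window opened by the LAST TRUE
  for (var i = 0; i < visits.length; i++) {
    var file_date = parseInt(visits[i], 10) - 100000;
    var true_false = 0;
    if (file_date > plus_30_days) {
      // Outside the window: tag TRUE and move the reference
      plus_30_days = file_date + 30;
      true_false = 1;
    }
    emit({
      visitor_id: r.visitor_id,
      file_date: file_date,
      true_false: true_false
    });
  }
}")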

Hope this gives you a good start.
Please note: the above example uses the undocumented inline version of a JavaScript UDF, which is great for quickly demonstrating, sharing, or trying code, but it is undocumented and thus not supported.
To use the above example in production, you had better modify it to comply with the BigQuery User-Defined Functions documentation (a few very simple adjustments).
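
One possible shape of that adjustment, offered as a sketch rather than a tested production setup: in the documented legacy SQL UDF format, the logic lives in a named JavaScript function that is registered with `bigquery.defineFunction`. The names `tagVisits` and `tag_visits` are assumptions chosen for this example, and the `typeof` guard is only there so the file can also be run outside BigQuery for a quick check.

```javascript
// Same logic as the inline version, written as a named function.
function tagVisits(r, emit) {
  var visits = r.visits.split(';');
  visits.sort(); // lexicographic sort works because of the 100000 offset
  var plus30Days = 0; // end of the window opened by the LAST TRUE
  for (var i = 0; i < visits.length; i++) {
    var fileDate = parseInt(visits[i], 10) - 100000;
    var trueFalse = 0;
    if (fileDate > plus30Days) {
      plus30Days = fileDate + 30;
      trueFalse = 1;
    }
    emit({ visitor_id: r.visitor_id, file_date: fileDate, true_false: trueFalse });
  }
}

// Register only when running inside BigQuery's legacy UDF environment.
if (typeof bigquery !== 'undefined') {
  bigquery.defineFunction(
    'tag_visits',               // name used in the query (assumed for this sketch)
    ['visitor_id', 'visits'],   // input columns
    [{ name: 'visitor_id', type: 'string' },
     { name: 'file_date', type: 'integer' },
     { name: 'true_false', type: 'integer' }],  // output schema
    tagVisits                   // function reference
  );
}
```

With the function registered, the query should be able to call `tag_visits(...)` over the same grouped subselect in place of `JS(...)`; check the UDF documentation for the exact calling form.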
