As a team, we want to tag each visit_date as TRUE or FALSE based on the following rules:
1) If the visit_date is the visitor's 1st visit, then TRUE.
2) For every visit_date after the 1st visit:
a) Compare it to the last visit_date that was tagged TRUE (hereafter LAST TRUE). If the visit_date falls within the 30-day forward-looking window of the LAST TRUE, then FALSE.
b) Otherwise TRUE, and this visit_date becomes the new LAST TRUE.
The source is website data with visitor and visit_date columns.
A visit_date is recorded for every new session a visitor creates. We reduce the data to one record per date per visitor.
Simple LAG and LEAD functions do not help because:
1) The reference position of LAST TRUE keeps changing each time a new TRUE tag is given.
a) The number of records we have to look back through to find LAST TRUE varies by visitor. The next TRUE can come after 2 rows or after 15 rows, depending on the visitor.
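To make the difference concrete, here is a small sketch in plain JavaScript (not BigQuery code). The helper names tagByLag and tagByAnchor are made up for illustration, and I am assuming that a visit exactly 30 days after LAST TRUE still counts as inside the window. It compares a row-to-row gap check with the LAST TRUE rule for a visitor whose day-of-year values are 123, 134, 163, 183, 217.

```javascript
// Hypothetical helper: TRUE (1) when the gap to the previous row exceeds 30 days.
// This is what a plain LAG comparison can express.
function tagByLag(days) {
  return days.map(function (day, i) {
    return (i === 0 || day - days[i - 1] > 30) ? 1 : 0;
  });
}

// Hypothetical helper: TRUE (1) when the visit falls after the 30-day window
// of the LAST TRUE visit. The window end moves every time a TRUE is given.
function tagByAnchor(days) {
  var windowEnd = null; // LAST TRUE + 30; unknown before the first visit
  return days.map(function (day) {
    if (windowEnd === null || day > windowEnd) {
      windowEnd = day + 30;
      return 1;
    }
    return 0;
  });
}

var days = [123, 134, 163, 183, 217];
console.log(tagByLag(days));    // [ 1, 0, 0, 0, 1 ]
console.log(tagByAnchor(days)); // [ 1, 0, 1, 0, 1 ]
```

Day 163 is only 29 days after the previous row, so the gap check says FALSE, but it is 40 days after the LAST TRUE (123), so the rule says TRUE.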
Effectively, we were not able to create a loop within BigQuery that would:
1) Check whether the visit_date meets the criteria.
2) If it does, tag it as TRUE and move the LAST TRUE reference to this visit_date.
3) Repeat the process until the last record for the visitor.
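Outside BigQuery, the loop described above could look roughly like the following JavaScript sketch. The function name tagRows and the input shape (objects with visitor and visit_date as a plain day number) are assumptions for illustration, not something BigQuery provides.

```javascript
// Sketch of the per-visitor loop. Assumes one record per visitor per day
// and that visit_date is a plain day number (hypothetical input shape).
function tagRows(rows) {
  // Sort by visitor, then by visit_date, so each visitor is scanned in order.
  var sorted = rows.slice().sort(function (a, b) {
    if (a.visitor !== b.visitor) return a.visitor < b.visitor ? -1 : 1;
    return a.visit_date - b.visit_date;
  });
  var currentVisitor = null;
  var lastTrue = null; // visit_date of the LAST TRUE for the current visitor
  return sorted.map(function (row) {
    if (row.visitor !== currentVisitor) {
      currentVisitor = row.visitor;
      lastTrue = null; // a new visitor starts with no LAST TRUE
    }
    // 1st visit, or outside the 30-day window of LAST TRUE
    var isTrue = lastTrue === null || row.visit_date > lastTrue + 30;
    if (isTrue) lastTrue = row.visit_date; // move the reference
    return { visitor: row.visitor, visit_date: row.visit_date, true_false: isTrue };
  });
}

var sample = [
  { visitor: 'A', visit_date: 1 },  { visitor: 'A', visit_date: 38 },
  { visitor: 'A', visit_date: 41 }, { visitor: 'A', visit_date: 65 },
  { visitor: 'A', visit_date: 76 }, { visitor: 'B', visit_date: 90 },
  { visitor: 'B', visit_date: 122 }
];
tagRows(sample).forEach(function (row) {
  console.log(row.visitor, row.visit_date, row.true_false);
});
// A 1 true, A 38 true, A 41 false, A 65 false, A 76 true, B 90 true, B 122 true
```

The only state carried from row to row is lastTrue, which is exactly the piece a fixed-offset LAG or LEAD cannot provide.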
The first table (table_id) only holds the following calculated columns:
visitor: unique identifier
visit_date: day of visit
plus_30_days: day of visit + 30 days
lag_value: visit_date of the row above
lead_value: visit_date of the row below
For ease of use, all date values were converted with DAYOFYEAR().
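One caveat with day-of-year values is that they wrap at the year boundary (in the output table further down, day 341 plus 30 days shows up as 5), so comparisons and sorting can go wrong for visits around New Year. A possible alternative, sketched here in JavaScript with a hypothetical helper dayNumber and made-up example dates, is to use a day count that never wraps, such as whole days since 1970-01-01.

```javascript
var MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical helper: whole days since 1970-01-01 (UTC) for a date string
// in YYYY-MM-DD form.
function dayNumber(isoDate) {
  return Math.floor(Date.parse(isoDate + 'T00:00:00Z') / MS_PER_DAY);
}

// Made-up dates: a LAST TRUE on day-of-year 341 and a visit early in January.
var lastTrue = dayNumber('2015-12-07');
var nextVisit = dayNumber('2016-01-03');

console.log(nextVisit - lastTrue);      // 27
console.log(nextVisit > lastTrue + 30); // false, so the January visit is inside the window
```

With day-of-year values the January visit (day 3) would sort before the December visit (day 341) and be treated as the visitor's first visit.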
Based on the above table, I used the query below:
SELECT *
  , CASE WHEN lag_value IS NULL THEN 1
         WHEN visit_date - lag_value > 30 THEN 1
         WHEN lead_value IS NULL THEN 1
         ELSE 0 END AS true_false
FROM [project_id:dataset.table_id]
ORDER BY visitor, visit_date
I am asking here because:
1) I have tried the LEAD and LAG functions.
2) I checked the web for standard SQL functions that can be translated into BigQuery, but could not find a loop construct.
3) I asked my team lead, who ran into the same issue.
4) After up to 3 hours of experimenting, I still get the same output.
Current output of the query above:
visitor_id  file_date  plus_30_days  lag_value  lead_value  true_false
A           1          31            null       38          1
A           38         68            1          41          1
A           41         71            38         65          0
A           65         95            41         76          0
A           76         106           65         null        1
B           90         120           null       122         1
B           122        152           90         null        1
C           123        153           null       134         1
C           134        164           123        163         0
C           163        193           134        183         0
C           183        213           163        217         0
C           217        247           183        null        1
D           245        275           null       256         1
D           256        286           245        262         0
D           262        292           256        275         0
D           275        305           262        279         0
D           279        309           275        null        1
E           279        309           null       310         1
E           310        340           279        341         1
E           341        5             310        null        1
Try the query below.
SELECT visitor_id, file_date, true_false FROM JS( // input table
( SELECT visitor_id, GROUP_CONCAT(STRING(100000 + file_date), ';') AS visits FROM
(SELECT 'A' AS visitor_id, 1 AS file_date), (SELECT 'A' AS visitor_id, 38 AS file_date), (SELECT 'A' AS visitor_id, 41 AS file_date), (SELECT 'A' AS visitor_id, 65 AS file_date),
(SELECT 'A' AS visitor_id, 76 AS file_date), (SELECT 'B' AS visitor_id, 90 AS file_date), (SELECT 'B' AS visitor_id, 122 AS file_date), (SELECT 'C' AS visitor_id, 123 AS file_date),
(SELECT 'C' AS visitor_id, 134 AS file_date), (SELECT 'C' AS visitor_id, 163 AS file_date), (SELECT 'C' AS visitor_id, 183 AS file_date), (SELECT 'C' AS visitor_id, 217 AS file_date),
(SELECT 'D' AS visitor_id, 245 AS file_date), (SELECT 'D' AS visitor_id, 256 AS file_date), (SELECT 'D' AS visitor_id, 262 AS file_date), (SELECT 'D' AS visitor_id, 275 AS file_date),
(SELECT 'D' AS visitor_id, 279 AS file_date), (SELECT 'E' AS visitor_id, 279 AS file_date), (SELECT 'E' AS visitor_id, 310 AS file_date), (SELECT 'E' AS visitor_id, 341 AS file_date)
GROUP BY visitor_id
) ,
// input columns
visitor_id, visits,
// output schema
"[{name: 'visitor_id', type: 'string'},
{name: 'file_date', type: 'integer'},
{name: 'true_false', type: 'integer'}]",
// function
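"function(r, emit){
  /* visits are equal-length strings (100000 + file_date), so a string sort is chronological */
  var visits = r.visits.split(';');
  visits.sort();
  var plus_30_days = 0;  /* end of the 30-day window of the LAST TRUE visit */
  for (var i = 0; i < visits.length; i++) {
    var file_date = parseInt(visits[i], 10) - 100000;
    var true_false = 0;
    if (file_date > plus_30_days) {
      /* outside the window: tag TRUE and start a new window from this visit */
      plus_30_days = file_date + 30;
      true_false = 1;
    }
    emit({
      visitor_id: r.visitor_id,
      file_date: file_date,
      true_false: true_false
    });
  }
}")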
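A note on the 100000 + file_date part: the visits arrive in the function as one concatenated string, and JavaScript's default sort() compares strings character by character. Adding 100000 makes every value the same length, so string order matches numeric order. A minimal sketch of the idea (sortPadded is a made-up name for illustration):

```javascript
// Default sort on unpadded strings is lexicographic, not numeric.
var unpadded = ['100', '38', '9'];
console.log(unpadded.slice().sort()); // [ '100', '38', '9' ]

// Hypothetical helper: pad by adding 100000, sort as strings, then undo the offset.
function sortPadded(days) {
  var padded = days.map(function (d) { return String(100000 + d); });
  padded.sort();
  return padded.map(function (s) { return parseInt(s, 10) - 100000; });
}

console.log(sortPadded([100, 38, 9])); // [ 9, 38, 100 ]
```

This works as long as every file_date is a non-negative integer below 900000, which holds for day-of-year values.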
Hope this gives you a good start.
Please note: the example above uses the undocumented inline version of JavaScript UDFs, which is great for quickly demonstrating, sharing, or trying code, but it is undocumented and thus not supported.
To use the example in production, you should modify it to comply with the BigQuery User-Defined Functions documentation (a few very simple adjustments).
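As a rough sketch of what those adjustments might look like, the same logic can be written as a named function and registered with bigquery.defineFunction, which is how I understand the documented legacy SQL UDF form. The function name tagVisits is my own choice, and the registration call should be checked against the current documentation before relying on it. The guard around it only exists so the file can also be loaded in Node.js for a local check.

```javascript
// Same logic as the inline version, as a named function.
function tagVisits(r, emit) {
  var visits = r.visits.split(';');
  visits.sort(); // equal-length strings, so string order is chronological
  var windowEnd = 0; // end of the 30-day window of the LAST TRUE visit
  for (var i = 0; i < visits.length; i++) {
    var fileDate = parseInt(visits[i], 10) - 100000;
    var trueFalse = 0;
    if (fileDate > windowEnd) {
      windowEnd = fileDate + 30;
      trueFalse = 1;
    }
    emit({ visitor_id: r.visitor_id, file_date: fileDate, true_false: trueFalse });
  }
}

// Register only where BigQuery provides the `bigquery` object (assumption:
// this mirrors the documented legacy SQL registration call).
if (typeof bigquery !== 'undefined') {
  bigquery.defineFunction(
    'tagVisits',                 // name used in the query (my choice)
    ['visitor_id', 'visits'],    // input columns
    [{name: 'visitor_id', type: 'string'},
     {name: 'file_date', type: 'integer'},
     {name: 'true_false', type: 'integer'}],
    tagVisits
  );
}

// Local check with a fake emit, using visitor C from the sample data.
var emitted = [];
tagVisits({ visitor_id: 'C', visits: '100123;100134;100163;100183;100217' },
          function (row) { emitted.push(row); });
console.log(emitted.map(function (row) { return row.true_false; })); // [ 1, 0, 1, 0, 1 ]
```

In the query, the JS(...) wrapper would then be replaced by a call to tagVisits over the same grouped subquery, with the input columns and output schema no longer passed inline.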