I am looking to identify return visitors to a website within a 7-day window. A data sample and my attempt at a solution are included below. The columns are:
visitor_id (integer)
session_id (integer)
event_sequence (integer)
d_date (date)
Sample raw data:
+------------+------------+----------------+------------+
| visitor_id | session_id | event_sequence | d_date     |
+------------+------------+----------------+------------+
| 1          | 1          | 1              | 2017-01-01 |
| 1          | 1          | 2              | 2017-01-01 |
| 1          | 1          | 3              | 2017-01-01 |
| 1          | 2          | 1              | 2017-01-05 |
| 1          | 2          | 2              | 2017-01-05 |
| 1          | 3          | 1              | 2017-01-20 |
| 1          | 4          | 1              | 2017-01-25 |
| 2          | 1          | 1              | 2017-01-02 |
| 2          | 1          | 2              | 2017-01-02 |
| 2          | 2          | 1              | 2017-01-02 |
| 2          | 2          | 2              | 2017-01-02 |
| 2          | 2          | 3              | 2017-01-02 |
| 2          | 3          | 1              | 2017-01-18 |
+------------+------------+----------------+------------+
I would like to know, for each visitor-session, whether the visitor returns (has another session) within the 7 days following the visit date. Ultimately the table would be unique at the visitor_id, session_id level and include a flag for whether the visitor returned in the subsequent 7 days.
The following is how I would expect my output to look:
+------------+------------+-----------------+------------+
| visitor_id | session_id | returned_7_days | d_date     |
+------------+------------+-----------------+------------+
| 1          | 1          | 1               | 2017-01-01 |
| 1          | 2          | 0               | 2017-01-05 |
| 1          | 3          | 1               | 2017-01-20 |
| 1          | 4          | 0               | 2017-01-25 |
| 2          | 1          | 1               | 2017-01-02 |
| 2          | 2          | 0               | 2017-01-02 |
| 2          | 3          | 0               | 2017-01-18 |
+------------+------------+-----------------+------------+
One way to solve this is to join every visitor_id-session_id combination to all rows for the same visitor_id, like so:
SELECT t2.visitor_id, t2.session_id, t2.d_date, t1.start_date
FROM "table" t2
INNER JOIN (
    -- one row per session, with its first date
    SELECT visitor_id, session_id, min(d_date) AS start_date
    FROM "table"
    GROUP BY visitor_id, session_id
) t1
    ON t1.visitor_id = t2.visitor_id
This returns, for each visitor_id-session_id combination, the dates of every session belonging to that visitor_id. From there, I can compare whether d_date is within 7 days of start_date. However, this does not seem like an efficient way to solve the problem, especially when there are millions of unique visitor_id values, each crossed with dozens of session_id-event_sequence combinations.
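For illustration, the comparison step described above could be finished roughly as follows. This is only a sketch: it runs the SQL through Python's sqlite3 module against an in-memory table called visits (a hypothetical name standing in for the real table), uses SQLite's julianday() in place of date subtraction, and assumes that session_id increases over time within a visitor and that "within 7 days" means a gap of 0 to 7 days inclusive.

```python
import sqlite3

# Sample rows from the question: (visitor_id, session_id, event_sequence, d_date).
ROWS = [
    (1, 1, 1, "2017-01-01"), (1, 1, 2, "2017-01-01"), (1, 1, 3, "2017-01-01"),
    (1, 2, 1, "2017-01-05"), (1, 2, 2, "2017-01-05"),
    (1, 3, 1, "2017-01-20"), (1, 4, 1, "2017-01-25"),
    (2, 1, 1, "2017-01-02"), (2, 1, 2, "2017-01-02"),
    (2, 2, 1, "2017-01-02"), (2, 2, 2, "2017-01-02"), (2, 2, 3, "2017-01-02"),
    (2, 3, 1, "2017-01-18"),
]

# Self-join at the session level rather than the event level.
# Assumption: a "later" session is one with a larger session_id.
# Assumption: a gap of exactly 7 days still counts as a return.
SELF_JOIN_SQL = """
WITH sessions AS (
    SELECT visitor_id, session_id, MIN(d_date) AS start_date
    FROM visits
    GROUP BY visitor_id, session_id
)
SELECT s.visitor_id,
       s.session_id,
       MAX(CASE WHEN n.session_id IS NOT NULL THEN 1 ELSE 0 END) AS returned_7_days,
       s.start_date AS d_date
FROM sessions s
LEFT JOIN sessions n
       ON n.visitor_id = s.visitor_id
      AND n.session_id > s.session_id
      AND julianday(n.start_date) - julianday(s.start_date) BETWEEN 0 AND 7
GROUP BY s.visitor_id, s.session_id, s.start_date
ORDER BY s.visitor_id, s.session_id
"""


def returned_flags_self_join(rows):
    """Load rows into an in-memory table and return one flag row per session."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE visits ("
        "visitor_id INTEGER, session_id INTEGER, "
        "event_sequence INTEGER, d_date TEXT)"
    )
    conn.executemany("INSERT INTO visits VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(SELF_JOIN_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in returned_flags_self_join(ROWS):
        print(row)
```

Collapsing to one row per session before joining keeps the join from multiplying by event_sequence, but each session is still compared with every later session of the same visitor, so the cost concern remains.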
Is there a better way to solve this problem I am not thinking of?
First, I remove event_sequence with a DISTINCT (assuming that all events of a session are on the same day), then I use the window function lead and compare with the date of the next visit:
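SELECT visitor_id,
       session_id,
       -- days until the visitor's next session; 10 stands in for "no next session"
       -- (use <= 7 if a return exactly 7 days later should count)
       COALESCE(
          lead(d_date) OVER w - d_date,
          10
       ) < 7 AS revisited,
       d_date
FROM (SELECT DISTINCT visitor_id,
                      session_id,
                      d_date
      FROM "table"
     ) t
WINDOW w AS (PARTITION BY visitor_id
             -- session_id breaks ties between sessions on the same day
             ORDER BY d_date, session_id
            )
ORDER BY visitor_id, session_id;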
┌────────────┬────────────┬───────────┬────────────┐
│ visitor_id │ session_id │ revisited │ d_date     │
├────────────┼────────────┼───────────┼────────────┤
│ 1          │ 1          │ t         │ 2017-01-01 │
│ 1          │ 2          │ f         │ 2017-01-05 │
│ 1          │ 3          │ t         │ 2017-01-20 │
│ 1          │ 4          │ f         │ 2017-01-25 │
│ 2          │ 1          │ t         │ 2017-01-02 │
│ 2          │ 2          │ f         │ 2017-01-02 │
│ 2          │ 3          │ f         │ 2017-01-18 │
└────────────┴────────────┴───────────┴────────────┘
(7 rows)
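To try the lead approach without a PostgreSQL instance, something like the following sketch may help. It is not the query above verbatim: it runs through Python's sqlite3 module (window functions need SQLite 3.25 or newer), uses an in-memory table called visits as a hypothetical stand-in for the real table, replaces date subtraction with julianday(), and returns a 1/0 flag named returned_7_days as in the requested output. It keeps the strict < 7 comparison from the query above and orders each visitor's sessions by d_date, session_id so that same-day sessions have a deterministic order.

```python
import sqlite3

# Sample rows from the question: (visitor_id, session_id, event_sequence, d_date).
ROWS = [
    (1, 1, 1, "2017-01-01"), (1, 1, 2, "2017-01-01"), (1, 1, 3, "2017-01-01"),
    (1, 2, 1, "2017-01-05"), (1, 2, 2, "2017-01-05"),
    (1, 3, 1, "2017-01-20"), (1, 4, 1, "2017-01-25"),
    (2, 1, 1, "2017-01-02"), (2, 1, 2, "2017-01-02"),
    (2, 2, 1, "2017-01-02"), (2, 2, 2, "2017-01-02"), (2, 2, 3, "2017-01-02"),
    (2, 3, 1, "2017-01-18"),
]

# LEAD() fetches the date of the visitor's next session.
# When there is no next session the difference is NULL, the comparison
# is not true, and the CASE falls through to 0.
# Assumption: all events of a session share one d_date, so DISTINCT
# leaves exactly one row per session.
LEAD_SQL = """
SELECT visitor_id,
       session_id,
       CASE
           WHEN julianday(LEAD(d_date) OVER w) - julianday(d_date) < 7 THEN 1
           ELSE 0
       END AS returned_7_days,
       d_date
FROM (SELECT DISTINCT visitor_id, session_id, d_date FROM visits) AS t
WINDOW w AS (PARTITION BY visitor_id ORDER BY d_date, session_id)
ORDER BY visitor_id, session_id
"""


def returned_flags_lead(rows):
    """Load rows into an in-memory table and return one flag row per session."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE visits ("
        "visitor_id INTEGER, session_id INTEGER, "
        "event_sequence INTEGER, d_date TEXT)"
    )
    conn.executemany("INSERT INTO visits VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(LEAD_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in returned_flags_lead(ROWS):
        print(row)
```

Changing < 7 to <= 7 would also count a return exactly 7 days later; the sample data contains no such gap, so both variants give the same flags here.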