PostgreSQL: Identifying return visitors based on date - joins or window functions?

I want to identify return visitors to a website within a 7-day window. A data sample and my attempt at a solution are included below. The table has the following columns:

visitor_id (integer)
session_id (integer)
event_sequence (integer)
d_date (date)

Sample raw data:

+-----------+-------------+----------------+-------------+
| visitor_id| session_id  | event_sequence |   d_date    |
+-----------+-------------+----------------+-------------+
|      1    |     1       |      1         | 2017-01-01  |
|      1    |     1       |      2         | 2017-01-01  |
|      1    |     1       |      3         | 2017-01-01  |
|      1    |     2       |      1         | 2017-01-05  |
|      1    |     2       |      2         | 2017-01-05  |
|      1    |     3       |      1         | 2017-01-20  |
|      1    |     4       |      1         | 2017-01-25  |
|      2    |     1       |      1         | 2017-01-02  |
|      2    |     1       |      2         | 2017-01-02  |
|      2    |     2       |      1         | 2017-01-02  |
|      2    |     2       |      2         | 2017-01-02  |
|      2    |     2       |      3         | 2017-01-02  |
|      2    |     3       |      1         | 2017-01-18  |
+-----------+-------------+----------------+-------------+
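
For anyone who wants to reproduce the sample, a minimal setup sketch follows. It assumes the table is literally named "table" (quoted, because TABLE is a reserved word in PostgreSQL) and that the columns have the types listed in the question; no keys or constraints are stated in the question, so none are declared here.

```sql
-- Sketch only: recreates the sample rows from the question.
CREATE TABLE "table" (
    visitor_id     integer,
    session_id     integer,
    event_sequence integer,
    d_date         date
);

INSERT INTO "table" (visitor_id, session_id, event_sequence, d_date) VALUES
    (1, 1, 1, '2017-01-01'),
    (1, 1, 2, '2017-01-01'),
    (1, 1, 3, '2017-01-01'),
    (1, 2, 1, '2017-01-05'),
    (1, 2, 2, '2017-01-05'),
    (1, 3, 1, '2017-01-20'),
    (1, 4, 1, '2017-01-25'),
    (2, 1, 1, '2017-01-02'),
    (2, 1, 2, '2017-01-02'),
    (2, 2, 1, '2017-01-02'),
    (2, 2, 2, '2017-01-02'),
    (2, 2, 3, '2017-01-02'),
    (2, 3, 1, '2017-01-18');
```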

I would like to know, for each visitor-session, whether the visitor returns (has another session) within the 7 days following the visit date. Ultimately, the result should be unique at the visitor_id, session_id level and include a flag indicating whether the visitor returned within the subsequent 7 days.

The following is how I would expect my output to look:

+-----------+-------------+-----------------+-------------+
| visitor_id| session_id  | returned_7_days |   d_date    |
+-----------+-------------+-----------------+-------------+
|      1    |     1       |      1          | 2017-01-01  |
|      1    |     2       |      0          | 2017-01-05  |
|      1    |     3       |      1          | 2017-01-20  |
|      1    |     4       |      0          | 2017-01-25  |
|      2    |     1       |      1          | 2017-01-02  |
|      2    |     2       |      0          | 2017-01-02  |
|      2    |     3       |      0          | 2017-01-18  |
+-----------+-------------+-----------------+-------------+

One way to solve this involves joining each visitor_id, session_id combination to all rows for the same visitor_id, like so:

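-- "table" must be quoted because TABLE is a reserved word in PostgreSQL
SELECT t2.visitor_id, t2.session_id, t2.d_date, t1.start_date
FROM "table" t2
INNER JOIN (
  -- one row per session with its first date
  SELECT visitor_id, session_id, min(d_date) AS start_date
  FROM "table"
  GROUP BY visitor_id, session_id
) t1
ON t1.visitor_id = t2.visitor_id;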

This returns, for each visitor_id, session_id combination, the dates of all sessions belonging to that visitor_id. From there, I can check whether d_date is within 7 days of start_date. However, this does not seem an efficient way to solve the problem, especially when there are millions of unique visitor_id values, each with dozens of session_id, event_sequence combinations.
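
For comparison, the join idea can be carried through to the final flag roughly as follows. This is only a sketch under a few assumptions that the question does not state outright: the table is named "table", session_id increases over time within a visitor (as it does in the sample), and "within 7 days" includes the seventh day. It first collapses the events to one row per session, so the self-comparison runs over sessions rather than over every event.

```sql
-- Sketch: one row per session, then look for any later session
-- of the same visitor that starts at most 7 days afterwards.
WITH sessions AS (
    SELECT visitor_id, session_id, min(d_date) AS start_date
    FROM "table"
    GROUP BY visitor_id, session_id
)
SELECT s.visitor_id,
       s.session_id,
       CASE WHEN EXISTS (
                SELECT 1
                FROM sessions n
                WHERE n.visitor_id = s.visitor_id
                  AND n.session_id > s.session_id   -- assumption: ids are chronological
                  AND n.start_date - s.start_date BETWEEN 0 AND 7  -- date - date is an integer number of days
            ) THEN 1 ELSE 0 END AS returned_7_days,
       s.start_date AS d_date
FROM sessions s
ORDER BY s.visitor_id, s.session_id;
```

On the sample data this produces the flags shown in the expected output. It still compares each session with the visitor's other sessions, so it mainly illustrates the logic rather than addressing the efficiency concern.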

Is there a better way to solve this problem I am not thinking of?

First, I remove event_sequence with a DISTINCT (assuming that all events of a session fall on the same day); then I use the window function lead to compare each session's date with the date of the same visitor's next session:

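SELECT visitor_id,
       session_id,
       -- days until the next session; 10 stands in for "no later session"
       COALESCE(
          lead(d_date) OVER w - d_date,
          10
       ) < 7 AS revisited,
       d_date
FROM (SELECT DISTINCT visitor_id,
                      session_id,
                      d_date
      FROM "table"
     ) t
-- session_id breaks ties between sessions on the same day;
-- lead() ignores the frame clause, so it has no effect on the result
WINDOW w AS (PARTITION BY visitor_id
             ORDER BY d_date, session_id
             ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
            )
ORDER BY visitor_id, session_id;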

┌────────────┬────────────┬───────────┬────────────┐
│ visitor_id │ session_id │ revisited │   d_date   │
├────────────┼────────────┼───────────┼────────────┤
│          1 │          1 │ t         │ 2017-01-01 │
│          1 │          2 │ f         │ 2017-01-05 │
│          1 │          3 │ t         │ 2017-01-20 │
│          1 │          4 │ f         │ 2017-01-25 │
│          2 │          1 │ t         │ 2017-01-02 │
│          2 │          2 │ f         │ 2017-01-02 │
│          2 │          3 │ f         │ 2017-01-18 │
└────────────┴────────────┴───────────┴────────────┘
(7 rows)
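
If the 0/1 flag from the question's expected output is preferred over a boolean, a variant along these lines might work. It is a sketch, not the answer's own query: it assumes the same "table" name, keeps the "fewer than 7 days" boundary used above, and adds session_id as a tie-breaker for sessions on the same day. A missing next session makes the comparison NULL, which falls through to ELSE, so no sentinel value is needed.

```sql
-- Sketch: integer flag instead of boolean, no COALESCE sentinel.
SELECT visitor_id,
       session_id,
       CASE WHEN lead(d_date) OVER w - d_date < 7  -- NULL when there is no later session
            THEN 1
            ELSE 0
       END AS returned_7_days,
       d_date
FROM (SELECT DISTINCT visitor_id,
                      session_id,
                      d_date
      FROM "table"
     ) t
WINDOW w AS (PARTITION BY visitor_id
             ORDER BY d_date, session_id)
ORDER BY visitor_id, session_id;
```

Looking only at the immediately following session is sufficient here, because if the nearest later session is not within the window, no later one can be.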

