
SQL: Calculate day-1 retention rate from user registration table and event log

I need to calculate the day-1 retention by user registration date. Day-1 retention is defined as the number of users who return 1 day after the registration date divided by the number of users who registered on the registration date.

Here's the user table

CREATE TABLE registration (
  user_id SERIAL PRIMARY KEY,
  user_name VARCHAR(255) NOT NULL,
  registrationDate TIMESTAMP NOT NULL
);

INSERT INTO registration (user_id, user_name, registrationDate)
VALUES
  (0, 'John', '2018-01-01 00:01:00'),
  (1, 'David', '2018-01-01 00:04:30'),
  (2, 'Cassy', '2018-01-02 10:00:00'),
  (3, 'Winka', '2018-01-02 14:30:00')
;

CREATE TABLE log (
  user_id INTEGER,
  eventDate TIMESTAMP
);

INSERT INTO log (user_id, eventDate)
VALUES
  (0, '2018-01-01 01:00:00'),
  (0, '2018-01-02 04:00:00'),
  (0, '2018-01-04 06:00:00'),
  (1, '2018-01-01 00:30:00'),
  (3, '2018-01-02 14:40:00'),
  (3, '2018-01-04 12:20:00'),
  (3, '2018-01-06 13:30:00'),
  (2, '2018-01-12 10:10:00'),
  (2, '2018-01-13 09:00:00')
;

I tried joining the registration table to the log table so that I can compare the dates.

select registration.user_id, registrationDate, log.eventDate, 
(log.eventDate - registration.registrationDate) as datediff 
from log left join registration ON log.user_id = registration.user_id

I think I somehow need to perform the tasks below.

  1. Select the users with datediff = 1 and count them.
    • I added a where clause (where datediff = 1), but I get an error saying "datediff does not exist".
  2. Group by registrationDate.
    • This also gave me an error: "ERROR: column "registration.user_id" must appear in the GROUP BY clause or be used in an aggregate function"

I am new to SQL and learning it as I solve this problem. Any help or advice will be appreciated.

The expected outcome should return a table with two columns (registrationDate and retention) with rows for each date any user registered.

I am not quite sure whether this is your expected result: for registrationdate = 2018-01-01, both users logged an event within one day of registering, so the result is 1. For registrationdate = 2018-01-02, only one of the two users logged an event within that range, so the result is 0.5.


Step-by-step demo: db<>fiddle

SELECT
    registrationdate,
    COUNT(*) FILTER (WHERE is_in_one_day) / daily_regs::decimal                  -- 6
FROM (
    SELECT DISTINCT ON (l.user_id)                                               -- 4
        l.user_id,
        eventdate::date AS eventdate,
        registrationdate::date AS registrationdate,
        daily_regs,
        eventdate - registrationdate < interval '1 day' AS is_in_one_day         -- 3
    FROM log l
    JOIN (                                                                       -- 2
        SELECT
            *,
            COUNT(user_id) OVER (PARTITION BY registrationdate::date) AS daily_regs  -- 1
        FROM registration
    ) r ON l.user_id = r.user_id
    ORDER BY l.user_id, eventdate
) s
GROUP BY registrationdate, daily_regs                                            -- 5

  1. Count the total number of registrations per registration date. This can be done using a partitioned window function, which adds a column with the count.
  2. Join both tables (with the one extra column on registration) on their user_id.
  3. Calculate the difference between the current eventdate and the registrationdate, and check whether it is less than one day.
  4. Do not count one user twice. This does not happen in your example data, but a user can be logged twice within this range and should still be counted only once.
  5. Group by the date of registration.
  6. Count all records with a difference under one day (using the FILTER clause) and divide by the total number of registrations calculated in (1).
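
As a side note on the two errors from the question, here is a minimal sketch, assuming PostgreSQL (which the ::date casts already imply). A column alias such as datediff cannot be referenced in the WHERE clause of the same SELECT, so the sketch computes the expression in a subquery first. Every non-aggregated column in the select list has to appear in GROUP BY, so only the registration date is selected next to the aggregate. The names registration_date, day_diff and returning_users are illustrative and not from the original post. Subtracting two dates gives an integer number of days, whereas subtracting two timestamps gives an interval, which cannot be compared to 1 directly.

```sql
-- Sketch only: counts users with an event on the calendar day after registration.
-- registration_date, day_diff and returning_users are hypothetical names.
SELECT
    registration_date,
    count(DISTINCT user_id) AS returning_users
FROM (
    SELECT
        r.user_id,
        r.registrationDate::date AS registration_date,
        -- date - date yields an integer number of days in PostgreSQL
        l.eventDate::date - r.registrationDate::date AS day_diff
    FROM registration r
    JOIN log l ON l.user_id = r.user_id
) d
WHERE day_diff = 1          -- the alias is visible here because it comes from the subquery
GROUP BY registration_date; -- only the grouped column and an aggregate are selected
```

With the sample data this should count one returning user for 2018-01-01. Dates without any returning user do not appear, because the inner join and the WHERE filter drop them.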

Day-1 retention is defined as the number of users who return 1 day after the registration date divided by the number of users who registered on the registration date.

This interprets the definition as being based on calendar days. I would express this as:

What ratio of users come back on the day after they register?

I think this is the simplest method:

select count(distinct l.user_id) * 1.0 / count(distinct r.user_id)
from registration r left join
     log l
     on l.user_id = r.user_id and
        l.eventDate::date = r.registrationDate::date + interval '1 day';

The count(distinct) is only needed if multiple events can happen on a single day.

Here is a db<>fiddle.
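
The question asks for one row per registration date, so the same calendar-day idea can also be grouped by the registration date. This is a sketch under that assumption rather than part of the original query, and the aliases registration_date and retention are illustrative.

```sql
-- Sketch only: day-1 retention per registration date, calendar-day definition.
SELECT
    r.registrationDate::date AS registration_date,
    -- multiply by 1.0 to avoid integer division
    count(DISTINCT l.user_id) * 1.0 / count(DISTINCT r.user_id) AS retention
FROM registration r
LEFT JOIN log l
       ON l.user_id = r.user_id
      AND l.eventDate::date = r.registrationDate::date + interval '1 day'
GROUP BY r.registrationDate::date
ORDER BY registration_date;
```

On the sample data this gives 0.5 for 2018-01-01 (only John has an event on 2018-01-02) and 0 for 2018-01-02. The LEFT JOIN keeps registration dates on which nobody returned, so they show up with a retention of 0 instead of disappearing.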

I'm not sure the definition is 100% useful. If you have another definition in mind, I would suggest that you ask a new question, with appropriate sample data and desired results.

