简体   繁体   中英

Calculate retention for daily users with PostgreSQL

I am trying to calculate the daily retention of users login sessions using pgadminIII / postgresql.

table1 has user_id , first_login_date , last_login_date

CREATE TABLE table1 (user_id numeric, first_login_date date, last_login_date date);

  INSERT INTO table1 (user_id, first_login_date, last_login_date) VALUES (12346, '2010-06-01', '2010-06-02'),
(67890, '2010-03-01', '2010-03-10'),
(67890, '2010-03-01', '2010-03-10'),
(90123, '2010-08-01', '2010-08-15'),
(45678, '2010-08-01', '2010-08-20'),
(76543, '2010-07-01', '2010-07-01');

table2 has user_id , session_id , login_date

CREATE TABLE table2 (user_id numeric, session_id numeric, login_date date);

INSERT INTO table2 (user_id, session_id, login_date) VALUES
(12346, '8764', '2010-06-02'),
(67890, '4657', '2010-03-05'),
(90123, '3945', '2010-08-09'),
(45678, '20845', '2010-08-02'),
(67890, '29384', '2010-03-07'),
(90123, '3424', '2010-08-12'),
(45678, '349284', '2010-08-10');

There are some duplicates in table1 . Because of this, I am unsure if my query to calculate the users that have 2-day retention and 5-day retention is correct.

The query that I am using for the 2-day is:

SELECT table1.user_id, first_login_date, table2.login_date,
(table2.login_date - table1.first_login_date) as datediff, FROM table1
JOIN table2 ON table2.user_id = table2.user_id WHERE
(table2.login_date - table1.first_login_date) = 1;

which gives the result of 7 users with 2-day retention

However, if I add the distinct clause, such as:

SELECT distinct table1.user_id, first_login_date, table2.login_date,
    (table2.login_date - table1.first_login_date) as datediff FROM table1
    JOIN table2 ON table2.user_id = table2.user_id WHERE
    (table2.login_date - table1.first_login_date) = 1;

I get the result of 3 users with 2-day retention.

I have consulted HERE , HERE and HERE regarding the calculation of daily retention, and I am unsure if my technique is giving me the correct result. For instance, to calculate the DAU, would a self-join be more appropriate.

Given the data in table1 and table2 are my 2-day retention results accurate using my defined query? Is there an optimized way to calculate this retention?

You are joining table2 to itself:

ON table2.user_id = table2.user_id

Do the distinct in a subquery:

select distinct on (t2.login_date)
    user_id,
    first_login_date,
    t2.login_date,
    t2.login_date - t1.first_login_date as datediff
from
    (
        select distinct *
        from t1
    ) t1
    inner join
    t2 using (user_id)
where t2.login_date - t1.first_login_date = 1
 user_id | first_login_date | login_date | datediff 
---------+------------------+------------+----------
   12346 | 2010-06-01       | 2010-06-02 |        1
   45678 | 2010-08-01       | 2010-08-02 |        1

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM