
How to calculate retention month over month using SQL

I'm trying to build a basic table that shows retention from one month to the next: if someone buys something in one month and buys again the following month, that gets counted.

month, num_transactions, repeat_transactions, retention
2012-02, 5, 2, 40%
2012-03, 10, 3, 30%
2012-04, 15, 8, 53%

So if everyone that bought last month bought again the following month you have 100%.

So far I can only calculate this manually, one pair of months at a time. This query counts the buyers seen in both months:

select count(*) as num_repeat_buyers from 

(select distinct
  to_char(transaction.timestamp, 'YYYY-MM') as month,
  auth_user.email
from
  auth_user,
  transaction
where
  auth_user.id = transaction.buyer_id and
  to_char(transaction.timestamp, 'YYYY-MM') = '2012-03'
) as table1,


(select distinct
  to_char(transaction.timestamp, 'YYYY-MM') as month,
  auth_user.email
from
  auth_user,
  transaction
where
  auth_user.id = transaction.buyer_id and
  to_char(transaction.timestamp, 'YYYY-MM') = '2012-04'
) as table2
where table1.email = table2.email

This is not right, but I feel like I could use some of Postgres' window functions. Keep in mind that window functions don't let you specify WHERE clauses. You mostly have access to the preceding and following rows:

select month, count(*) as num_transactions, count(*) over (PARTITION BY month ORDER BY month)
from 
    (select distinct
      to_char(transaction.timestamp, 'YYYY-MM') as month,
      auth_user.email
    from
      auth_user,
      transaction
    where
      auth_user.id = transaction.buyer_id
    order by
      month
    ) as transactions_by_month
group by
    month

Given the following test table (which you should have provided):

CREATE TEMP TABLE transaction (buyer_id int, tstamp timestamp);
INSERT INTO transaction VALUES 
 (1,'2012-01-03 20:00')
,(1,'2012-01-05 20:00')
,(1,'2012-01-07 20:00')  -- multiple transactions this month
,(1,'2012-02-03 20:00')  -- next month
,(1,'2012-03-05 20:00')  -- next month
,(2,'2012-01-07 20:00')
,(2,'2012-03-07 20:00')  -- not next month
,(3,'2012-01-07 20:00')  -- just once
,(4,'2012-02-07 20:00'); -- just once

The table auth_user is not relevant to the problem.
I use tstamp as the column name because I avoid base type names such as timestamp as identifiers.

I am going to use the window function lag() to identify repeated buyers. To keep it short I combine aggregate and window functions in one query level. Bear in mind that window functions are applied after aggregate functions.

WITH t AS (
   SELECT buyer_id
         ,date_trunc('month', tstamp) AS month
         ,count(*) AS item_transactions
         ,lag(date_trunc('month', tstamp)) OVER (PARTITION BY  buyer_id
                                           ORDER BY date_trunc('month', tstamp)) 
          = date_trunc('month', tstamp) - interval '1 month'
            OR NULL AS repeat_transaction
   FROM   transaction
   WHERE  tstamp >= '2012-01-01'::date
   AND    tstamp <  '2012-05-01'::date -- time range of interest.
   GROUP  BY 1, 2
   )
SELECT month
      ,sum(item_transactions) AS num_trans
      ,count(*) AS num_buyers
      ,count(repeat_transaction) AS repeat_buyers
      ,round(
          CASE WHEN sum(item_transactions) > 0
             THEN count(repeat_transaction) / sum(item_transactions) * 100
             ELSE 0
          END, 2) AS buyer_retention_pct
FROM   t
GROUP  BY 1
ORDER  BY 1;

Result:

  month  | num_trans | num_buyers | repeat_buyers | buyer_retention_pct
---------+-----------+------------+---------------+--------------------
 2012-01 |         5 |          3 |             0 |               0.00
 2012-02 |         2 |          2 |             1 |              50.00
 2012-03 |         2 |          2 |             1 |              50.00

I extended your question to distinguish between the number of transactions and the number of buyers.
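
To see what the final SELECT aggregates, it may help to look at the per-buyer rows first. The following is only a sketch against the test table above, not part of the original query: it is the CTE t without the comparison and without the time range filter, and the column name prev_month is made up for illustration.

```sql
-- Sketch: one row per buyer and month, plus the previous month in which
-- the same buyer had any transaction (NULL if there was none).
-- lag() is evaluated after GROUP BY, so it sees the grouped rows,
-- not the raw transactions.
SELECT buyer_id
      ,date_trunc('month', tstamp) AS month
      ,count(*) AS item_transactions
      ,lag(date_trunc('month', tstamp)) OVER (PARTITION BY buyer_id
                                              ORDER BY date_trunc('month', tstamp)) AS prev_month
FROM   transaction
GROUP  BY 1, 2
ORDER  BY 1, 2;
```

With the test data, buyer 1 gets prev_month 2012-01-01 for February and 2012-02-01 for March, so both rows count as repeats. Buyer 2's March row has prev_month 2012-01-01, which is not the month directly before, so it is not counted.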

The OR NULL for repeat_transaction serves to convert FALSE to NULL, so those values do not get counted by count() in the next step.
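
A minimal, self-contained illustration of the OR NULL trick follows. The inline VALUES list and the column names are made up for the example. The FILTER variant is an alternative that, as far as I know, needs PostgreSQL 9.4 or later.

```sql
-- count(expr) only counts rows where expr is not NULL.
SELECT count(x)                  AS non_null_values  -- 2: counts TRUE and FALSE
      ,count(x OR NULL)          AS true_values      -- 1: FALSE OR NULL yields NULL
      ,count(*) FILTER (WHERE x) AS true_values_alt  -- 1: same result via FILTER
FROM  (VALUES (true), (false), (NULL::boolean)) AS v(x);
```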


This uses CASE and EXISTS to get repeated transactions:

SELECT
    *,
    CASE
        WHEN num_transactions = 0
        THEN 0
        ELSE round(100.0 * repeat_transactions / num_transactions, 2)
    END AS retention
FROM
    (
        SELECT
            to_char(timestamp, 'YYYY-MM') AS month,
            count(*) AS num_transactions,
            sum(CASE
                WHEN EXISTS (
                    SELECT 1
                    FROM transaction AS t
                    JOIN auth_user AS u
                    ON t.buyer_id = u.id
                    WHERE
                        date_trunc('month', transaction.timestamp)
                            + interval '1 month'
                            = date_trunc('month', t.timestamp)
                        AND auth_user.email = u.email
                )
                THEN 1
                ELSE 0
            END) AS repeat_transactions
        FROM
            transaction
            JOIN auth_user
            ON transaction.buyer_id = auth_user.id
        GROUP BY 1
    ) AS summary
ORDER BY 1;

EDIT: Changed from minus 1 month to plus 1 month after reading the question again. My understanding now is that if someone buys something in 2012-02 and then buys something again in 2012-03, their transactions in 2012-02 are counted as retention for that month.
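
To check the month comparison used in the EXISTS condition in isolation, here is a small sketch with literal timestamps. The sample values and column aliases are made up for the example.

```sql
-- date_trunc('month', ...) maps every timestamp to the first instant of its
-- month, so adding one month and comparing tests for "the following month".
SELECT date_trunc('month', timestamp '2012-02-15 10:00') + interval '1 month'
         = date_trunc('month', timestamp '2012-03-07 20:00') AS consecutive      -- true
      ,date_trunc('month', timestamp '2012-01-07 20:00') + interval '1 month'
         = date_trunc('month', timestamp '2012-03-07 20:00') AS skipped_a_month  -- false
      ,date_trunc('month', timestamp '2012-12-20 09:00') + interval '1 month'
         = date_trunc('month', timestamp '2013-01-02 09:00') AS across_new_year; -- true
```

Comparing truncated timestamps rather than 'YYYY-MM' strings keeps the year rollover from December to January correct without extra logic.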
