简体   繁体   中英

Oracle SQL challenging subquery (aggregation, group by, count)

Have a table given, holding the status history and creation of a user:

ID USERID STATUS USER_CREATED TIMESTAMP
1 A active 01.02.2020 01.02.2020, 08:00:00
2 A inactive 01.02.2020 02.02.2020, 11:00:00
3 A active 01.02.2020 03.02.2020, 12:00:00
4 B active 02.02.2020 02.02.2020, 13:00:00
5 B inactive 02.02.2020 05.02.2020, 14:00:00
6 B active 02.02.2020 07.02.2020, 16:00:00
7 C active 02.02.2020 02.02.2020, 09:00:00
8 C inactive 03.02.2020 03.02.2020, 11:00:00
9 D active 03.02.2020 03.02.2020, 10:00:00
10 E active 04.02.2020 04.02.2020, 10:00:00

I am working on a statistics overview for user creation, which works on a provided min/max date.

Desired result for create date: 01.02.2020 to 03.02.2020:

Create Date Users created Still active today Total active users at create date
01.02.2020 1 1 1
02.02.2020 2 1 2
03.02.2020 1 1 3

The first two columns, Users created and Still active today are straight forward, I simply use a group by and select max(id) to get the latest status of every userid.

The challenge however comes with the last column, Total active users at create date .

SELECT USER_CREATED as "Create Date",
count(*) as "Users created",
sum(case when STATUS = 'active' then 1 else 0 end) as "Still active today",
??? as "Total active users at create date"
FROM myTable t1
WHERE t1.USER_CREATED >= to_date('01.02.2020', 'DD.MM.YYYY') and t1.USER_CREATED <= to_date('03.02.2020', 'DD.MM.YYYY')
AND and t1.ID = (select max(ID) from myTable where USERID = t1.USERID)
GROUP BY t1.USER_CREATED;

For the column Total active users at create date , I am a bit clueless. The point is I have to make sure that the timestamp is below the user_created field in my counting. Ie on create date 02.02.01, only timestamps below 02.02.01 must be considered. More precisely, the timestamp CLOSEST to the create date tells the actual status of a user.

Current approach (not working):

SELECT USER_CREATED as "Create Date",
count(*) as "Users created",
sum(case when STATUS = 'active' then 1 else 0 end) as "Still active today",
(select count(*) from (select USERID, max(STATUS) keep (dense_rank first order by USER_CREATED desc) as STATUS
      from myTable t2
      where t2.timestamp <= t1.timestamp
      group by USERID) where STATUS = 'active')
as "Total active users at create date"
FROM myTable t1
WHERE t1.USER_CREATED >= to_date('01.02.2020', 'DD.MM.YYYY') and t1.USER_CREATED <= to_date('03.02.2020', 'DD.MM.YYYY')
AND and t1.ID = (select max(ID) from myTable where USERID = t1.USERID)
GROUP BY t1.USER_CREATED;

I need to Any help is appreciated

I'd use CROSS APPLY rather than a correlated sub-query, it's just a bit easier to read.

Then, the sub-query would need to get every users' last event preceding the end of the day in question (for which I chose to use ROW_NUMBER() ) . Then count the number of users who's last event is an active event.

WITH
  dates AS
(
  SELECT
    User_Created,
    COUNT(*)                                             AS count_created,
    SUM(CASE WHEN STATUS = 'active' THEN 1 ELSE 0 END)   AS count_still_active
  FROM
  (
    SELECT
      myTable.*,
      ROW_NUMBER() OVER (PARTITION BY UserID ORDER BY Timestamp DESC)  AS event_id
    FROM
      myTable
  )
    sorted
  WHERE
        User_Created BETWEEN DATE'2020-02-01' AND DATE'2020-02-03'
    AND event_id = 1
  GROUP BY
    User_Created
)
SELECT
  *
FROM
  dates
OUTER APPLY
(
  SELECT
    COUNT(*) AS active_users_at_end_of_day
  FROM
  (
    SELECT
      myTable.*,
      ROW_NUMBER() OVER (PARTITION BY UserID ORDER BY Timestamp DESC)  AS event_id
    FROM
      myTable
    WHERE
      myTable.Timestamp < dates.User_Created+1
  )
    filtered_sorted
  WHERE
        event_id = 1
    AND status = 'active'
)
  daily_stats
ORDER BY
  User_Created

Demo...

You can use:

WITH expected_range (min_dt, max_dt) AS (
  SELECT DATE '2020-02-01',
         DATE '2020-02-04'
  FROM   DUAL
),
calendar (dt, max_dt) AS (
  SELECT MIN(LEAST(TRUNC(timestamp), min_dt)),
         MAX(GREATEST(TRUNC(timestamp), max_dt))
  FROM   myTable
         CROSS JOIN expected_range
UNION ALL
  SELECT dt + INTERVAL '1' DAY, max_dt
  FROM   calendar
  WHERE  dt + INTERVAL '1' DAY <= max_dt
),
activity (dt, userid, status, rn, day_rn, status_now, day_status) AS (
  SELECT c.dt,
         a.userid,
         status,
         ROW_NUMBER() OVER (PARTITION BY userid ORDER BY timestamp NULLS LAST),
         ROW_NUMBER() OVER (PARTITION BY userid, c.dt ORDER BY timestamp DESC),
         MAX(status)
           KEEP (DENSE_RANK LAST ORDER BY timestamp NULLS FIRST)
           OVER (PARTITION BY userid),
         COALESCE(
           LAST_VALUE(status)
             IGNORE NULLS
             OVER (PARTITION BY a.userid ORDER BY c.dt, a.timestamp),
           'inactive'
         )
  FROM   calendar c
         LEFT OUTER JOIN mytable a
         PARTITION BY (a.userid)
         ON (c.dt <= a.timestamp AND a.timestamp < c.dt + INTERVAL '1' DAY)
)
SELECT a.dt AS create_date,
       COUNT(CASE WHEN rn = 1 THEN 1 END) AS users_created,
       COUNT(CASE WHEN rn = 1 AND status_now = 'active' THEN 1 END)
         AS still_active,
       COUNT(CASE WHEN day_rn = 1 AND day_status = 'active' THEN 1 END)
         AS total_active_users_today
FROM   activity a
       INNER JOIN expected_range e
       ON (a.dt BETWEEN e.min_dt AND e.max_dt)
GROUP BY a.dt
ORDER BY a.dt

Which, for the sample data (with additional test cases):

CREATE TABLE myTable (ID, USERID, STATUS, TIMESTAMP) AS
SELECT  1, 'A', 'active',   TO_DATE('01.02.2020 08:00:00', 'DD.MM.YYYY HH24:MI:SS') FROM DUAL UNION ALL
SELECT  2, 'A', 'inactive', TO_DATE('02.02.2020 11:00:00', 'DD.MM.YYYY HH24:MI:SS') FROM DUAL UNION ALL
SELECT  3, 'A', 'active',   TO_DATE('03.02.2020 12:00:00', 'DD.MM.YYYY HH24:MI:SS') FROM DUAL UNION ALL
SELECT  4, 'B', 'active',   TO_DATE('02.02.2020 13:00:00', 'DD.MM.YYYY HH24:MI:SS') FROM DUAL UNION ALL
SELECT  5, 'B', 'inactive', TO_DATE('05.02.2020 14:00:00', 'DD.MM.YYYY HH24:MI:SS') FROM DUAL UNION ALL
SELECT  6, 'B', 'active',   TO_DATE('07.02.2020 16:00:00', 'DD.MM.YYYY HH24:MI:SS') FROM DUAL UNION ALL
SELECT  7, 'C', 'active',   TO_DATE('02.02.2020 09:00:00', 'DD.MM.YYYY HH24:MI:SS') FROM DUAL UNION ALL
SELECT  8, 'C', 'inactive', TO_DATE('03.02.2020 11:00:00', 'DD.MM.YYYY HH24:MI:SS') FROM DUAL UNION ALL
SELECT  9, 'D', 'active',   TO_DATE('03.02.2020 10:00:00', 'DD.MM.YYYY HH24:MI:SS') FROM DUAL UNION ALL
SELECT 10, 'E', 'active',   TO_DATE('04.02.2020 10:00:00', 'DD.MM.YYYY HH24:MI:SS') FROM DUAL UNION ALL
SELECT 11, 'F', 'active',   TO_DATE('31.01.2020 00:00:00', 'DD.MM.YYYY HH24:MI:SS') FROM DUAL UNION ALL
SELECT 12, 'G', 'active',   TO_DATE('04.02.2020 00:00:00', 'DD.MM.YYYY HH24:MI:SS') FROM DUAL UNION ALL
SELECT 13, 'G', 'inactive', TO_DATE('04.02.2020 23:59:59', 'DD.MM.YYYY HH24:MI:SS') FROM DUAL;

Outputs:

CREATE_DATE USERS_CREATED STILL_ACTIVE TOTAL_ACTIVE_USERS_TODAY
2020-02-01 00:00:00 1 1 2
2020-02-02 00:00:00 2 1 3
2020-02-03 00:00:00 1 1 4
2020-02-04 00:00:00 2 1 5

db<>fiddle here

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM