SQL query to find the min and max values for a user who has at least one day with a row count > threshold

I have records for a user base, and I am trying to identify the kind of user who has at least 100 records on a single day, and then determine that user's life span by finding the user's max and min timestamps. I have not been able to do that in a single query. Here's how I identify users who meet the threshold:

SELECT COUNT(*) count, userid, recorddate::date 
FROM data 
WHERE datatype = 0 
GROUP BY userid, recorddate::date 
HAVING COUNT(userid) > 100

However, this only returns data for days where the count was > 100. I am interested in the max and min date for a user who had at least one day with the count > 100. Is there a way to modify this query above to get what I want or must I use a second query?

Join the result to the original table to get the lifespan of those users who have more than 100 entries in a day at least once.

select d.userid 
,max(d.recorddate::date) - min(d.recorddate::date) as user_lifespan_in_days
from data d
join (SELECT COUNT(*) count, userid, recorddate::date 
      FROM data 
      WHERE datatype = 0 
      GROUP BY userid, recorddate::date 
      HAVING COUNT(*) > 100) t
on t.userid = d.userid
group by d.userid
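
As a minimal sketch of this filter-then-join idea, the Python script below runs a query of the same shape against an in-memory SQLite database. The setup is entirely assumed for illustration: the sample rows, a threshold of 2 instead of 100, SQLite's date(...) in place of the Postgres ::date cast, and an IN subquery in place of the join so that a user with several qualifying days is matched only once.

```python
import sqlite3

THRESHOLD = 2  # assumed small stand-in for the question's 100

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (userid INTEGER, recorddate TEXT, datatype INTEGER)"
)
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [
        # user 1: one busy day (3 rows on 2024-01-10) plus older and newer rows
        (1, "2023-12-01 09:00:00", 0),
        (1, "2024-01-10 08:00:00", 0),
        (1, "2024-01-10 09:00:00", 0),
        (1, "2024-01-10 10:00:00", 0),
        (1, "2024-02-01 09:00:00", 0),
        # user 2: never more than one row per day
        (2, "2024-01-01 09:00:00", 0),
        (2, "2024-01-05 09:00:00", 0),
    ],
)

# Step 1 (subquery): users with at least one day above the threshold.
# Step 2 (outer query): min/max over ALL rows of those users.
query = """
SELECT d.userid,
       MIN(date(d.recorddate)) AS first_day,
       MAX(date(d.recorddate)) AS last_day
FROM data AS d
WHERE d.userid IN (
    SELECT userid
    FROM data
    WHERE datatype = 0
    GROUP BY userid, date(recorddate)
    HAVING COUNT(*) > ?
)
GROUP BY d.userid
ORDER BY d.userid
"""
lifespans = conn.execute(query, (THRESHOLD,)).fetchall()
print(lifespans)
```

Only user 1 is returned, and the reported range covers all of that user's rows rather than just the busy day.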

Note: this is a comparison of two of the answers. The first section is written for SQL Server; I also tried the window functions in Postgres, and that code is below as well. The bottom line is that the result the question asks for takes two steps. Step 1: find the UserIds that meet the criteria. Step 2: join that back to the table and get the max and min from the entire data set.

I truly wish it could be done in one step, but the results are clear: window functions combined with GROUP BY calculate their results from the result set of the GROUP BY, not from the entire table.
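
To make that evaluation order concrete, here is a small sketch in Python using an in-memory SQLite database (window functions need SQLite 3.25 or newer). The sample rows, the threshold of 2, and the derived table named busy_days are assumptions for illustration. The derived table spells out the GROUP BY and HAVING step explicitly, and the window functions are applied on top of it, which is logically what happens when both appear in one SELECT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (userid INTEGER, recorddate TEXT)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [
        (1, "2023-12-01 09:00:00"),  # earliest row, not on a busy day
        (1, "2024-01-10 08:00:00"),
        (1, "2024-01-10 09:00:00"),
        (1, "2024-01-10 10:00:00"),
        (1, "2024-02-01 09:00:00"),  # latest row, not on a busy day
    ],
)

# The window functions only see rows that survived GROUP BY and HAVING.
windowed = conn.execute(
    """
    SELECT userid, day, cnt,
           MIN(day) OVER (PARTITION BY userid) AS min_day,
           MAX(day) OVER (PARTITION BY userid) AS max_day
    FROM (
        SELECT userid, date(recorddate) AS day, COUNT(*) AS cnt
        FROM data
        GROUP BY userid, date(recorddate)
        HAVING COUNT(*) > 2
    ) AS busy_days
    """
).fetchall()

# The range the question actually wants, taken from the whole table.
true_range = conn.execute(
    "SELECT MIN(date(recorddate)), MAX(date(recorddate)) "
    "FROM data WHERE userid = 1"
).fetchone()

print(windowed)
print(true_range)
```

The windowed min and max both collapse to the single busy day, while the full-table range runs from the first record to the last.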

Here is some test data so that we can see the actual results:

DECLARE @Data AS TABLE (UserId INT, RecordDate DATETIME)

INSERT INTO @Data (UserId, RecordDate)
VALUES (2,DATEADD(YEAR,-3,GETDATE())), (2,DATEADD(YEAR,3,GETDATE())), (4,DATEADD(YEAR,-6,GETDATE())), (4,DATEADD(YEAR,6,GETDATE()))

DECLARE @U INT = 1

WHILE @U < 5
BEGIN
    DECLARE @I INT = 1

    WHILE @I < 12
    BEGIN
       IF (@U IN (1,3) AND @I > 6)
       BEGIN
          BREAK
       END

       INSERT INTO @Data (UserId, RecordDate) VALUES (@U, DATEADD(MINUTE,-1,GETDATE()))

       SET @I += 1
    END

    SET @U += 1
END

Here is @Gordon Linoff's suggestion

SELECT
    UserId, RecordDate, COUNT(*) AS [count]
    ,MIN(RecordDate) OVER (PARTITION BY UserId) AS min_recorddate
    ,MAX(RecordDate) OVER (PARTITION BY UserId) AS max_recorddate 
FROM
    @Data
GROUP BY
    UserId, RecordDate
HAVING
    COUNT(UserId) > 9

And here is @vkp's suggestion

SELECT
    t.UserId
    ,COUNT(*) AS [count]
    ,MIN(d.RecordDate) as min_recorddate
    ,MAX(d.RecordDate) as max_recorddate
FROM
    @Data d
    INNER JOIN 
    (
       SELECT
          UserId
          ,RecordDate
          ,[count] = COUNT(*)
       FROM
          @Data
       GROUP BY
          UserId
          ,RecordDate
       HAVING
          COUNT(*) > 9
    ) t
    ON d.UserId = t.UserId
GROUP BY
    t.UserId

@Gordon's results:

[image]

@vkp's results:

[image]

UserId 2 from the test data I generated:

[image]

Adding a Postgres test case with @Gordon's suggestion:

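-- ON COMMIT DELETE ROWS empties the table at every commit, so run this
-- whole script inside a single transaction (BEGIN; ... COMMIT;).
CREATE TEMPORARY TABLE DATA (USERID INT, RECORDDATE TIMESTAMP)
ON COMMIT DELETE ROWS;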

INSERT INTO DATA (USERID, RECORDDATE) VALUES (2,NOW() + INTERVAL '3 YEAR');
INSERT INTO DATA (USERID, RECORDDATE) VALUES (2,NOW() + INTERVAL '-3 YEAR');
INSERT INTO DATA (USERID, RECORDDATE) VALUES (4,NOW() + INTERVAL '6 YEAR');
INSERT INTO DATA (USERID, RECORDDATE) VALUES (4,NOW() + INTERVAL '-6 YEAR');

DO $$
    DECLARE
        i integer;
        u integer;
    BEGIN
        u := 1;
        WHILE (u < 5) LOOP
            i := 1;

            WHILE (i < 11) LOOP

                IF (u IN (1,3) AND i > 6) THEN
                    EXIT;
                END IF;

                INSERT INTO DATA (USERID, RECORDDATE) VALUES (u,NOW() + INTERVAL '-1 MINUTE');

                i = i + 1;

            END LOOP;

            u = u + 1;

        END LOOP;

    RAISE NOTICE 'value of i: %, and u: %', i, u;

END $$ ;


SELECT userid, recorddate::date, COUNT(*) as count,
       MIN(recorddate::date) OVER (PARTITION BY userid) as min_recorddate,
       MAX(recorddate::date) OVER (PARTITION BY userid) as max_recorddate
FROM data 
GROUP BY userid, recorddate::date 
HAVING COUNT(*) > 9;

Results:

[image]

You mean that on a given day, the user has at least 100 records. Here is one method:

SELECT userid, recorddate::date, COUNT(*) as count,
       MIN(recorddate::date) OVER (PARTITION BY userid) as min_recorddate,
       MAX(recorddate::date) OVER (PARTITION BY userid) as max_recorddate
FROM data 
WHERE datatype = 0 
GROUP BY userid, recorddate::date 
HAVING COUNT(*) > 100;

Now, this will produce multiple records for the same user if that user meets the criteria on multiple dates. One solution is to use a subquery to filter down to the user level. Another is to use DISTINCT ON:

SELECT DISTINCT ON (userid)
       userid, recorddate::date, COUNT(*) as count,
       MIN(recorddate::date) OVER (PARTITION BY userid) as min_recorddate,
       MAX(recorddate::date) OVER (PARTITION BY userid) as max_recorddate
FROM data 
WHERE datatype = 0 
GROUP BY userid, recorddate::date 
HAVING COUNT(userid) > 100
ORDER BY userid, COUNT(*) DESC;

Now that I think about it... I haven't used window functions with DISTINCT ON, so I only think this will work. A subquery or CTE definitely would work.
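
One way the CTE route could look is sketched below in Python against an in-memory SQLite database. This is an assumed formulation, not the query from any of the answers: the CTE name per_day, the sample rows, and the threshold of 2 are illustrative, and the datatype filter is left out for brevity. The CTE counts rows per user per day, and the outer query aggregates once more per user, keeping users whose busiest day exceeds the threshold.

```python
import sqlite3

THRESHOLD = 2  # assumed small stand-in for the question's 100

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (userid INTEGER, recorddate TEXT)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [
        # user 1: one busy day, with rows before and after it
        (1, "2023-12-01 09:00:00"),
        (1, "2024-01-10 08:00:00"),
        (1, "2024-01-10 09:00:00"),
        (1, "2024-01-10 10:00:00"),
        (1, "2024-02-01 09:00:00"),
        # user 2: no busy day
        (2, "2024-01-01 09:00:00"),
        (2, "2024-01-05 09:00:00"),
        # user 3: two busy days, should still yield a single row
        (3, "2024-03-01 08:00:00"),
        (3, "2024-03-01 09:00:00"),
        (3, "2024-03-01 10:00:00"),
        (3, "2024-03-02 08:00:00"),
        (3, "2024-03-02 09:00:00"),
        (3, "2024-03-02 10:00:00"),
        (3, "2024-04-01 08:00:00"),
    ],
)

query = """
WITH per_day AS (
    -- one row per user per calendar day, with that day's row count
    SELECT userid, date(recorddate) AS day, COUNT(*) AS cnt
    FROM data
    GROUP BY userid, date(recorddate)
)
SELECT userid,
       MIN(day) AS first_day,
       MAX(day) AS last_day,
       MAX(cnt) AS busiest_day_count
FROM per_day
GROUP BY userid
HAVING MAX(cnt) > ?
ORDER BY userid
"""
result = conn.execute(query, (THRESHOLD,)).fetchall()
print(result)
```

Because the second aggregation is per user, a user with several qualifying days appears once, and the min and max cover every day on which that user has rows.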
