SQL query to find the min and max value for a user who has at least one day with a row count > threshold

I have records for a user base, and I am trying to identify the kind of user who has more than 100 records on at least one day, and then determine that user's life span by finding the user's max and min timestamp. I have not been able to do that in a single query. Here's how I identify users who meet the threshold:

SELECT COUNT(*) count, userid, recorddate::date 
FROM data 
WHERE datatype = 0 
GROUP BY userid, recorddate::date 
HAVING COUNT(userid) > 100

However, this only returns data for the days where the count was > 100. I am interested in the max and min date for a user who had at least one day with a count > 100. Is there a way to modify the query above to get what I want, or must I use a second query?

Join the result to the original table to get the lifespan of those users who have more than 100 entries in a day at least once.

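-- Column names follow the question (userid, recorddate).
-- The subquery lists users with at least one day above the threshold;
-- joining back to data lets MIN/MAX see all of their rows.
select d.userid
,max(d.recorddate::date) - min(d.recorddate::date) as user_lifespan_in_days
from data d
join (SELECT DISTINCT userid
      FROM data 
      WHERE datatype = 0 
      GROUP BY userid, recorddate::date 
      HAVING COUNT(*) > 100) t
on t.userid = d.userid
group by d.userid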
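
To try the join approach without a Postgres instance, here is a minimal sketch that runs the same idea through Python's built-in `sqlite3` module. The tiny data set, the threshold of 2 (standing in for 100), and the use of SQLite's `date()` in place of `::date` are all assumptions made for illustration; they are not part of the original answer.

```python
import sqlite3

# Hypothetical miniature of the `data` table; SQLite stands in for Postgres.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (userid INTEGER, recorddate TEXT, datatype INTEGER)")

rows = []
# User 1: three records on 2024-01-10 (the busy day) plus one early and one late record.
rows += [(1, f"2024-01-10 09:0{m}:00", 0) for m in range(3)]
rows += [(1, "2023-06-01 08:00:00", 0), (1, "2024-12-31 23:00:00", 0)]
# User 2: never more than one record on any day.
rows += [(2, "2022-01-01 00:00:00", 0), (2, "2025-01-01 00:00:00", 0)]
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)

THRESHOLD = 2  # stands in for the 100 used in the question

query = """
SELECT d.userid,
       MIN(date(d.recorddate)) AS first_day,
       MAX(date(d.recorddate)) AS last_day
FROM data AS d
JOIN (SELECT DISTINCT userid            -- step 1: users with a busy day
      FROM data
      WHERE datatype = 0
      GROUP BY userid, date(recorddate)
      HAVING COUNT(*) > ?) AS t
  ON t.userid = d.userid                -- step 2: back to all of their rows
GROUP BY d.userid
"""
result = conn.execute(query, (THRESHOLD,)).fetchall()
print(result)
```

Only user 1 is returned, and the first and last day come from records outside the busy day, which is the point of joining back to the full table.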

Note: this is a comparison of two of the answers. While the first part is written for SQL Server, I also tried the window functions in Postgres; that code is below as well. The bottom line is that getting the desired result takes a two-step query. Step 1: find the UserIds that meet the criteria. Step 2: join that back to the table and get the max and min from the entire dataset.

I truly wish it could be done in one step, but the results make it clear that window functions combined with GROUP BY are calculated over the result set of the GROUP BY (after HAVING), not over the entire table.
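
The effect can be reproduced on a handful of rows. The sketch below uses Python's `sqlite3` module rather than SQL Server or Postgres, which is an assumption made for convenience (it needs SQLite 3.25 or later for window functions); the three-row threshold and the dates are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (userid INTEGER, recorddate TEXT)")

# Hypothetical data: two outlying days with one record each,
# and a single day with three records.
rows = [(2, "2021-03-01"), (2, "2027-03-01")]
rows += [(2, "2024-03-01")] * 3
conn.executemany("INSERT INTO data VALUES (?, ?)", rows)

# Window functions are evaluated on the rows that survive GROUP BY and HAVING.
windowed = conn.execute("""
    SELECT userid, recorddate, COUNT(*) AS cnt,
           MIN(recorddate) OVER (PARTITION BY userid) AS min_recorddate,
           MAX(recorddate) OVER (PARTITION BY userid) AS max_recorddate
    FROM data
    GROUP BY userid, recorddate
    HAVING COUNT(*) > 2
""").fetchall()

# The span the question actually asks for, taken over every row of the user.
true_span = conn.execute(
    "SELECT MIN(recorddate), MAX(recorddate) FROM data WHERE userid = 2"
).fetchone()

print(windowed)
print(true_span)
```

The windowed query reports 2024-03-01 as both the minimum and the maximum, because the 2021 and 2027 groups were already removed by HAVING before the window was computed.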

Here is some test data so that we can see the actual results:

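DECLARE @Data AS TABLE (UserId INT, RecordDate DATETIME)

-- Capture the time once so every "burst" row gets exactly the same
-- DATETIME and therefore lands in the same GROUP BY bucket.
DECLARE @Now DATETIME = GETDATE()

-- Outlying rows for users 2 and 4, years away from the burst.
INSERT INTO @Data (UserId, RecordDate)
VALUES (2,DATEADD(YEAR,-3,@Now)), (2,DATEADD(YEAR,3,@Now)), (4,DATEADD(YEAR,-6,@Now)), (4,DATEADD(YEAR,6,@Now))

DECLARE @U INT = 1

-- Users 2 and 4 get 11 burst rows each; users 1 and 3 stop after 6.
WHILE @U < 5
BEGIN
    DECLARE @I INT = 1

    WHILE @I < 12
    BEGIN
       IF (@U IN (1,3) AND @I > 6)
       BEGIN
          BREAK
       END

       INSERT INTO @Data (UserId, RecordDate) VALUES (@U, DATEADD(MINUTE,-1,@Now))

       SET @I += 1
    END

    SET @U += 1
END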

Here is @Gordon Linoff's suggestion:

SELECT
    UserId, RecordDate, COUNT(*) AS [count]
    ,MIN(RecordDate) OVER (PARTITION BY UserId) AS min_recorddate
    ,MAX(RecordDate) OVER (PARTITION BY UserId) AS max_recorddate 
FROM
    @Data
GROUP BY
    UserId, RecordDate
HAVING
    COUNT(UserId) > 9

And here is @vkp's suggestion:

SELECT
    t.UserId
    ,COUNT(*) AS [count]
    ,MIN(d.RecordDate) as min_recorddate
    ,MAX(d.RecordDate) as max_recorddate
FROM
    @Data d
    INNER JOIN 
    (
       SELECT
          UserId
          ,RecordDate
          ,[count] = COUNT(*)
       FROM
          @Data
       GROUP BY
          UserId
          ,RecordDate
       HAVING
          COUNT(*) > 9
    ) t
    ON d.UserId = t.UserId
GROUP BY
    t.UserId

Note @Gordon's results:

(image not available)

@vkp's results:

(image not available)

Image of UserId 2 from the test data I generated:

(image not available)

Adding a Postgres test case with @Gordon's suggestion:

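-- ON COMMIT DELETE ROWS empties the table at every commit, so run this
-- whole script inside a single transaction (BEGIN; ... COMMIT;).
CREATE TEMPORARY TABLE DATA (USERID INT, RECORDDATE TIMESTAMP)
ON COMMIT DELETE ROWS;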

INSERT INTO DATA (USERID, RECORDDATE) VALUES (2,NOW() + INTERVAL '3 YEAR');
INSERT INTO DATA (USERID, RECORDDATE) VALUES (2,NOW() + INTERVAL '-3 YEAR');
INSERT INTO DATA (USERID, RECORDDATE) VALUES (4,NOW() + INTERVAL '6 YEAR');
INSERT INTO DATA (USERID, RECORDDATE) VALUES (4,NOW() + INTERVAL '-6 YEAR');

DO $$
    DECLARE
        i integer;
        u integer;
    BEGIN
        u := 1;
        WHILE (u < 5) LOOP
            i := 1;

            WHILE (i < 11) LOOP

                IF (u IN (1,3) AND i > 6) THEN
                    EXIT;
                END IF;

                INSERT INTO DATA (USERID, RECORDDATE) VALUES (u,NOW() + INTERVAL '-1 MINUTE');

            i = i + 1;

            END LOOP;

            u = u + 1;

        END LOOP;

    RAISE NOTICE 'value of i: %, and u: %', i, u;

END $$ ;


SELECT userid, recorddate::date, COUNT(*) as count,
       MIN(recorddate::date) OVER (PARTITION BY userid) as min_recorddate,
       MAX(recorddate::date) OVER (PARTITION BY userid) as max_recorddate
FROM data 
GROUP BY userid, recorddate::date 
HAVING COUNT(*) > 9;

Results:

(image not available)

You mean that on a given day, the user has at least 100 records. Here is one method:

SELECT userid, recorddate::date, COUNT(*) as count,
       MIN(recorddate::date) OVER (PARTITION BY userid) as min_recorddate,
       MAX(recorddate::date) OVER (PARTITION BY userid) as max_recorddate
FROM data 
WHERE datatype = 0 
GROUP BY userid, recorddate::date 
HAVING COUNT(*) > 100;

Now, this will produce multiple records for the same user if the user meets the criteria on multiple dates. One solution is to use a subquery to filter down to the user level. Another is to use DISTINCT ON:

SELECT DISTINCT ON (userid)
       userid, recorddate::date, COUNT(*) as count,
       MIN(recorddate::date) OVER (PARTITION BY userid) as min_recorddate,
       MAX(recorddate::date) OVER (PARTITION BY userid) as max_recorddate
FROM data 
WHERE datatype = 0 
GROUP BY userid, recorddate::date 
HAVING COUNT(userid) > 100  -- HAVING must come before ORDER BY
ORDER BY userid, COUNT(*) DESC;

Now that I think about it . . . I haven't used window functions with DISTINCT ON. So I think this will work. A subquery or CTE definitely would work.
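
As one possible shape of the CTE route, the sketch below aggregates per user and day first, then per user, and keeps the users whose largest daily count clears the threshold. It is run through Python's `sqlite3` module instead of Postgres, and the data, the threshold of 3, the CTE name `daily`, and its column names are all hypothetical choices for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (userid INTEGER, recorddate TEXT, datatype INTEGER)")

# Hypothetical data: each user has one busy day of n records
# plus one very early and one very late record.
rows = []
for userid, busy_day, n in [(1, "2024-02-01", 4), (2, "2024-02-01", 2), (3, "2024-05-05", 5)]:
    rows += [(userid, f"{busy_day} 12:00:0{s}", 0) for s in range(n)]
    rows += [(userid, "2020-01-01 00:00:00", 0), (userid, "2030-01-01 00:00:00", 0)]
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)

query = """
WITH daily AS (
    -- One row per user and day, carrying that day's count and time range.
    SELECT userid,
           date(recorddate) AS day,
           COUNT(*)         AS cnt,
           MIN(recorddate)  AS day_min,
           MAX(recorddate)  AS day_max
    FROM data
    WHERE datatype = 0      -- min/max therefore cover datatype 0 rows only
    GROUP BY userid, date(recorddate)
)
SELECT userid, MIN(day_min) AS min_recorddate, MAX(day_max) AS max_recorddate
FROM daily
GROUP BY userid
HAVING MAX(cnt) > :threshold   -- at least one day above the threshold
ORDER BY userid
"""
result = conn.execute(query, {"threshold": 3}).fetchall()
print(result)
```

Users 1 and 3 are returned with their full time range, and user 2 is dropped because its busiest day has only two records. Because the `datatype = 0` filter sits inside the CTE, the range here is limited to those rows; moving the filter is a design choice that depends on which records should count toward the life span.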
