
Postgres: select query with group by clause on a range of dates

My table contains answers from repeatable questionnaires that can be filled in within a 30-day window and are scheduled every 60 days. Therefore, the answers from a single instance of a questionnaire are spread over a date range that is always shorter than 30 days, and the first answer to the next questionnaire comes at least 31 days after the last answer of the previous one. How do I create a view that calculates a score (basically the sum of the answers of a single questionnaire) over the values whose dates fall within 30 days of the start date (min date)?

Table raw_data
------------------------------------------------
user_name | question_id | answer | answer_date |
------------------------------------------------
user001   |      1      |   2    | 2019-02-04  |
user001   |      2      |   1    | 2019-02-04  |
user001   |      3      |   2    | 2019-02-05  |
user001   |      4      |   2    | 2019-02-05  |
user001   |      5      |   2    | 2019-02-09  |
user002   |      1      |   2    | 2019-01-09  |
user002   |      2      |   2    | 2019-01-10  |
user002   |      3      |   1    | 2019-02-01  |
user002   |      4      |   2    | 2019-02-01  |
user002   |      5      |   1    | 2019-02-01  |
user002   |      1      |   2    | 2019-03-11  |
user002   |      2      |   2    | 2019-03-11  |
user002   |      3      |   1    | 2019-03-12  |
user002   |      4      |   1    | 2019-03-13  |
user002   |      5      |   1    | 2019-03-14  |


Expected result
------------------------------
user_name | sum | start_date |
------------------------------
user001   |  9  | 2019-02-04 | 
user002   |  8  | 2019-01-09 |
user002   |  7  | 2019-03-11 |
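
To try the queries in this thread, the sample table can be recreated roughly as follows. The column types are assumptions (the question does not state them); in particular, the cast `answer::int` in the attempt below suggests that `answer` may actually be stored as text.

```sql
-- Sketch of the sample table; the column types are assumed, not taken from the question.
CREATE TABLE raw_data (
    user_name   text    NOT NULL,
    question_id integer NOT NULL,
    answer      integer NOT NULL,  -- assumption: may be text in the real table
    answer_date date    NOT NULL
);

INSERT INTO raw_data (user_name, question_id, answer, answer_date) VALUES
    ('user001', 1, 2, '2019-02-04'),
    ('user001', 2, 1, '2019-02-04'),
    ('user001', 3, 2, '2019-02-05'),
    ('user001', 4, 2, '2019-02-05'),
    ('user001', 5, 2, '2019-02-09'),
    ('user002', 1, 2, '2019-01-09'),
    ('user002', 2, 2, '2019-01-10'),
    ('user002', 3, 1, '2019-02-01'),
    ('user002', 4, 2, '2019-02-01'),
    ('user002', 5, 1, '2019-02-01'),
    ('user002', 1, 2, '2019-03-11'),
    ('user002', 2, 2, '2019-03-11'),
    ('user002', 3, 1, '2019-03-12'),
    ('user002', 4, 1, '2019-03-13'),
    ('user002', 5, 1, '2019-03-14');
```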

The solution I tried works for the first group only:

SELECT user_name, SUM(answer::int),
       CASE
         WHEN answer_date - MIN(answer_date) OVER (PARTITION BY user_name ORDER BY user_name ASC, answer_date ASC) < 30
         THEN MIN(answer_date) OVER (PARTITION BY user_name ORDER BY user_name ASC, answer_date ASC)
         ELSE answer_date
       END AS start_date
FROM public.raw_data
GROUP BY user_name, answer_date

Use lag() to find the gaps. Then use a cumulative sum to assign a "question period", and then summarize:

select user_name, min(answer_date) as start_date, sum(answer) as sum
from (select rd.*,
             -- running count of period starts: first row, or more than 30 days after the previous answer
             count(*) filter (where prev_ad is null or prev_ad < answer_date - interval '30 day')
                 over (partition by user_name order by answer_date) as period
      from (select rd.*,
                   lag(answer_date) over (partition by user_name order by answer_date) as prev_ad
            from raw_data rd
           ) rd
     ) rd
group by user_name, period;
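
To see how the two window steps fit together, it can help to look at the intermediate columns before aggregating. The sketch below assumes `answer_date` is of type `date` (so subtracting two dates yields an integer number of days); `gap_days` and `is_start` are illustrative column names, not part of the answer above.

```sql
-- Sketch: expose the gap and the running period number for a single user.
-- Assumes answer_date is a date column; gap_days and is_start are illustrative names.
select user_name,
       answer_date,
       prev_ad,
       answer_date - prev_ad as gap_days,                                -- NULL on the user's first row
       (prev_ad is null or answer_date - prev_ad > 30) as is_start,      -- true where a new period begins
       count(*) filter (where prev_ad is null or answer_date - prev_ad > 30)
           over (partition by user_name order by answer_date) as period  -- running count of starts
from (select user_name, answer_date,
             lag(answer_date) over (partition by user_name order by answer_date) as prev_ad
      from raw_data
     ) s
where user_name = 'user002'
order by answer_date;
```

With the sample data, one of the 2019-03-11 rows of user002 has 2019-02-01 as its previous date, a gap of 38 days, so it is flagged as a start and the March rows get period 2 while the January and February rows stay in period 1.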

Thanks to @Gordon and to this answer I eventually found the missing step to determine my groups on a date range basis.

I will use the following query to create a view and SUM answers grouping by grp2:

WITH query AS (
SELECT r.*,
SUM(CASE WHEN answer_date < prev_date + 30 THEN 0 ELSE 1 END) OVER (PARTITION BY user_name ORDER BY user_name ASC, answer_date ASC) AS grp
  FROM (SELECT r.*,
    LAG(answer_date) OVER (PARTITION BY user_name ORDER BY user_name ASC, answer_date ASC) AS prev_date
    FROM raw_data r 
  ) r
)
SELECT user_name, question_id, answer_date, answer, DENSE_RANK() OVER (ORDER BY user_name, grp) AS grp2
FROM query
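
The last step described above (wrapping the query in a view and summing per `grp2`) might look like the sketch below. The view name `questionnaire_groups` is hypothetical, and the code assumes `answer` is an integer and `answer_date` a date; with a text column, `SUM(answer::int)` would be needed instead.

```sql
-- Sketch: questionnaire_groups is a hypothetical view name.
-- Assumes answer is integer and answer_date is date.
CREATE VIEW questionnaire_groups AS
WITH query AS (
  SELECT r.*,
         -- a row opens a new group unless it is less than 30 days after the previous one
         SUM(CASE WHEN answer_date < prev_date + 30 THEN 0 ELSE 1 END)
             OVER (PARTITION BY user_name ORDER BY answer_date) AS grp
  FROM (SELECT r.*,
               LAG(answer_date) OVER (PARTITION BY user_name ORDER BY answer_date) AS prev_date
        FROM raw_data r
       ) r
)
SELECT user_name, question_id, answer_date, answer,
       DENSE_RANK() OVER (ORDER BY user_name, grp) AS grp2
FROM query;

-- One row per questionnaire instance: score and start date.
SELECT user_name, SUM(answer) AS sum, MIN(answer_date) AS start_date
FROM questionnaire_groups
GROUP BY user_name, grp2
ORDER BY user_name, start_date;
```

On the sample data from the question this should return the three rows of the expected result (9, 8 and 7).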

It's a classic problem. You'll find a lot under the tag I added.

An optimized query for your case could look like:

SELECT user_name
     , sum(answer)
     , min(answer_date) AS start_date 
FROM  (
   SELECT user_name, answer, answer_date
        , count(*) FILTER (WHERE step) OVER (PARTITION BY user_name ORDER BY answer_date) AS grp
   FROM  (
      SELECT user_name, answer, answer_date
           , lag(answer_date) OVER (PARTITION BY user_name ORDER BY answer_date) < answer_date - 30 AS step
      FROM   raw_data
      ) sub1
   ) sub2
GROUP  BY user_name, grp
ORDER  BY user_name, start_date;  -- ORDER BY optional
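
The `step` expression decides where a new group starts, so its boundary behaviour is worth checking in isolation. The following sketch uses made-up dates (not from the question) and assumes plain `date` values, for which `date - integer` subtracts whole days.

```sql
-- Sketch with illustrative dates: previous answer on 2019-01-01,
-- next answer either 30 or 31 days later.
SELECT d AS answer_date,
       date '2019-01-01' < d - 30 AS step
FROM  (VALUES (date '2019-01-31'), (date '2019-02-01')) AS v(d);
-- 2019-01-31 | f   (30-day gap: same group)
-- 2019-02-01 | t   (31-day gap: new group)
```

A gap of exactly 30 days therefore stays in the same group, and only a gap of 31 days or more starts a new one, which matches the "at least 31 days" rule stated in the question.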

db<>fiddle here

Closely related, with more explanation:

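You can use a query with the row_number() window function, as below. Note that it numbers rows per question_id rather than looking at date gaps, so it relies on every questionnaire instance containing exactly one answer for each question: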

with raw_data( user_name, question_id, answer, answer_date ) as
(
 select  'user001',1,2, '2019-02-04' union all
 select  'user001',2,1, '2019-02-04' union all
 select  'user001',3,2, '2019-02-05' union all
 select  'user001',4,2, '2019-02-05' union all
 select  'user001',5,2, '2019-02-09' union all
 select  'user002',1,2, '2019-01-09' union all
 select  'user002',2,2, '2019-01-10' union all
 select  'user002',3,1, '2019-02-01' union all
 select  'user002',4,2, '2019-02-01' union all
 select  'user002',5,1, '2019-02-01' union all
 select  'user002',1,2, '2019-03-11' union all
 select  'user002',2,2, '2019-03-11' union all
 select  'user002',3,1, '2019-03-12' union all
 select  'user002',4,1, '2019-03-13' union all
 select  'user002',5,1, '2019-03-14'
)    
select user_name, sum(answer) as sum, min(answer_date) as start_date
  from 
  (
   select row_number() over (partition by question_id order by user_name, answer_date) as rn,
          t.*
     from raw_data t
   ) t
  group by user_name, rn
  order by rn;

user_name   sum   start_date
---------   ---   ----------
user001     9     2019-02-04
user002     8     2019-01-09
user002     7     2019-03-11

Demo
