Count occurrences of combinations of columns

I have daily time series (actually business days) for different companies and I work with PostgreSQL. There is also an indicator variable (called flag) that takes the value 0 most of the time and 1 on some rare event days. If the indicator variable takes the value 1 for a company, I want to further investigate the entries from two days before to one day after that event for the corresponding company. Let me refer to that as the [-2,1] window, with the event day being day 0.

I am using the following query:

CREATE TABLE test AS
WITH cte AS (
   SELECT *
        , MAX(flag) OVER(PARTITION BY company ORDER BY day
                         ROWS BETWEEN 1 preceding AND 2 following) Lead1
   FROM mytable)
SELECT *
FROM cte
WHERE Lead1 = 1 
ORDER BY day,company

For the company experiencing the event, the query takes the entries ranging from two days before the event to one day after it. It does that for all events.
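The frame reads "1 preceding, 2 following" because a row belongs to the [-2,1] window when the event sits at most two rows after it or one row before it. Here is a minimal Python sketch of that frame for a single company. The flag values are made up for illustration and not taken from mytable, and `in_window` is a hypothetical helper name.

```python
def in_window(flags):
    """Emulate MAX(flag) OVER (ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING)."""
    marks = []
    for i in range(len(flags)):
        frame = flags[max(0, i - 1): i + 3]   # rows i-1 .. i+2, clipped at the edges
        marks.append(max(frame))
    return marks

# One company, ordered by day; the event is at index 3.
flags = [0, 0, 0, 1, 0, 0]
print(in_window(flags))
```

This prints [0, 1, 1, 1, 1, 0]: the rows at offsets -2, -1, 0 and +1 around the event are marked, which is what `WHERE Lead1 = 1` keeps.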

This is a small section of the resulting table.

day              company    flag     
2012-01-23       A          0        
2012-01-24       A          0         
2012-01-25       A          1         
2012-01-25       B          0         
2012-01-26       A          0         
2012-01-26       B          0        
2012-01-27       B          1        
2012-01-30       B          0        
2013-01-10       A          0        
2013-01-11       A          0              
2013-01-14       A          1              

Now I want to do further calculations for every [-2,1] window separately, so I need a variable that identifies each [-2,1] window. The idea is to number the windows of every company in the variable "occur", so that further calculations can use the clause

    GROUP BY company, occur

Therefore my desired output looks like this:

day              company    flag     occur
2012-01-23       A          0        1
2012-01-24       A          0        1 
2012-01-25       A          1        1 
2012-01-25       B          0        1 
2012-01-26       A          0        1 
2012-01-26       B          0        1
2012-01-27       B          1        1
2012-01-30       B          0        1
2013-01-10       A          0        2
2013-01-11       A          0        2
2013-01-14       A          1        2 

In the example, company B occurs only once (occur = 1), but company A occurs twice: first from 2012-01-23 to 2012-01-26, and a second time from 2013-01-10 to 2013-01-14. The second time range of company A does not contain all four days surrounding the event day (-2,-1,0,1), since the company leaves the dataset before the end of that time range.

As I said, I am working with business days. I don't care about holidays; I have data from Monday to Friday. Earlier I wrote the following function:

CREATE OR REPLACE FUNCTION addbusinessdays(date, integer)
  RETURNS date AS
$BODY$ 
WITH alldates AS (
    SELECT i,
    $1 + (i * CASE WHEN $2 < 0 THEN -1 ELSE 1 END) AS date
    FROM generate_series(0,(ABS($2) + 5)*2) i
),
days AS (
    SELECT i, date, EXTRACT('dow' FROM date) AS dow
    FROM alldates
),
businessdays AS (
    SELECT i, date, d.dow FROM days d
    WHERE d.dow BETWEEN 1 AND 5
    ORDER BY i
)

-- adding business days to a date --
SELECT date FROM businessdays WHERE
        CASE WHEN $2 > 0 THEN date >=$1 WHEN $2 < 0
             THEN date <=$1 ELSE date =$1 END
    LIMIT 1
    offset ABS($2)
$BODY$
  LANGUAGE 'sql' VOLATILE;

It adds or subtracts business days from a given date. For example:

    select * from addbusinessdays('2013-01-14',-2)

This returns 2013-01-10. So in Jakub's approach we can change the second- and third-to-last lines to

      t1.day BETWEEN addbusinessdays(w.day, -2) AND addbusinessdays(w.day, 1)

and can deal with business days.
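To cross-check the function's logic outside the database, here is a rough Python port. It is a sketch under the assumption that only Saturdays and Sundays are skipped (no holidays), and `add_business_days` is a hypothetical name, not part of any library.

```python
from datetime import date, timedelta
from typing import Optional

def add_business_days(start: date, n: int) -> Optional[date]:
    """Rough port of the SQL function: generate days, drop weekends, take row |n|."""
    step = -1 if n < 0 else 1
    # Calendar days walking away from start, with the same generous padding as the SQL.
    days = [start + timedelta(days=i * step) for i in range((abs(n) + 5) * 2 + 1)]
    business = [d for d in days if d.isoweekday() <= 5]   # Monday..Friday only
    if n == 0:
        # The SQL keeps only date = $1 here, so a weekend start yields no row (NULL).
        business = [d for d in business if d == start]
    return business[abs(n)] if len(business) > abs(n) else None

print(add_business_days(date(2013, 1, 14), -2))   # the example call from above
```

For the example call this gives 2013-01-10, the same date the SQL function returns: 2013-01-14 is a Monday, so two business days back is the preceding Thursday.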

Basically, the strategy is to first enumerate the flag days and then join the other rows to them:

WITH windows AS(
SELECT t1.day
       ,t1.company
       ,rank() OVER (PARTITION BY company ORDER BY day) as rank
FROM table1 t1
WHERE flag =1)

SELECT t1.day
      ,t1.company
      ,t1.flag
      ,w.rank
FROM table1 AS t1
JOIN windows AS w
ON
  t1.company = w.company
  AND
  t1.day BETWEEN      -- row falls in the [-2,1] window of flag day w.day
 w.day - interval '2 day' AND w.day + interval '1 day'
ORDER BY t1.day, t1.company;

Fiddle.

However, business days remain a problem, since the term can mean anything (do holidays count?).
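The enumerate-then-join strategy can be sketched in plain Python to check it against the desired output from the question. This is an illustration only: it assumes business days are simply Monday to Friday, and the helper names `shift_business_days` and `label_windows` are hypothetical.

```python
from datetime import date, timedelta

def shift_business_days(d, n):
    """Move d by n business days (Mon-Fri), ignoring holidays."""
    step = 1 if n > 0 else -1
    while n != 0:
        d += timedelta(days=step)
        if d.isoweekday() <= 5:
            n -= step
    return d

def label_windows(rows):
    """rows: (day, company, flag) tuples -> (day, company, flag, occur) tuples."""
    # Step 1: enumerate the flag days per company (the role of rank() in the CTE).
    events = {}
    for day, company, flag in sorted(rows):
        if flag == 1:
            events.setdefault(company, []).append(day)
    # Step 2: join every row to the event whose [-2,1] window contains it.
    out = []
    for day, company, flag in sorted(rows):
        for occur, event_day in enumerate(events.get(company, []), start=1):
            lo = shift_business_days(event_day, -2)
            hi = shift_business_days(event_day, 1)
            if lo <= day <= hi:
                out.append((day, company, flag, occur))
    return out

# Sample rows copied from the question's result table.
rows = [
    (date(2012, 1, 23), "A", 0), (date(2012, 1, 24), "A", 0),
    (date(2012, 1, 25), "A", 1), (date(2012, 1, 25), "B", 0),
    (date(2012, 1, 26), "A", 0), (date(2012, 1, 26), "B", 0),
    (date(2012, 1, 27), "B", 1), (date(2012, 1, 30), "B", 0),
    (date(2013, 1, 10), "A", 0), (date(2013, 1, 11), "A", 0),
    (date(2013, 1, 14), "A", 1),
]
for row in label_windows(rows):
    print(*row)
```

Because the window bounds are computed from the flag day, the row for 2012-01-30 (the Monday after a Friday event) lands in the window of company B.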

Function

While using the function addbusinessdays(), consider this instead:

CREATE OR REPLACE FUNCTION addbusinessdays(date, integer)
  RETURNS date AS
$func$ 
SELECT day
FROM  (
    SELECT i, $1 + i * sign($2)::int AS day
    FROM   generate_series(0, ((abs($2) * 7) / 5) + 3) i
    ) sub
WHERE  EXTRACT(ISODOW FROM day) < 6  -- truncate weekend
ORDER  BY i
OFFSET abs($2)
LIMIT  1
$func$  LANGUAGE sql IMMUTABLE;

Major points

  • Never quote the language name sql. It's an identifier, not a string.

  • Why was the function VOLATILE? Make it IMMUTABLE for better performance in repeated use and more options (like using it in a functional index).

  • (ABS($2) + 5)*2 is way too much padding. Replace with ((abs($2) * 7) / 5) + 3.

  • Multiple levels of CTEs were useless cruft.

  • ORDER BY in the last CTE was useless, too.

  • As mentioned in my previous answer, extract(ISODOW FROM ...) is more convenient to truncate weekends.
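Whether the tighter padding always leaves enough rows for OFFSET abs($2) can be checked by brute force. The Python sketch below is not part of the answer: `enough_padding` is a hypothetical helper and the range of offsets tested is an arbitrary choice. n = 0 is skipped, since sign(0) = 0 makes the function look only at the start date itself.

```python
from datetime import date, timedelta

def enough_padding(start: date, n: int) -> bool:
    """True if days 0..pad away from start contain at least |n| + 1 weekdays."""
    pad = (abs(n) * 7) // 5 + 3          # integer division, as in the SQL
    step = -1 if n < 0 else 1
    weekdays = sum(
        1 for i in range(pad + 1)
        if (start + timedelta(days=i * step)).isoweekday() < 6
    )
    return weekdays > abs(n)             # OFFSET |n| LIMIT 1 needs |n| + 1 rows

# Try every day of the week as the start and offsets in both directions.
starts = [date(2013, 1, 7) + timedelta(days=k) for k in range(7)]
ok = all(enough_padding(s, n) for s in starts for n in range(-250, 251) if n != 0)
print(ok)
```

The check prints True for the tested range. The tightest case is a start on Saturday (or on Sunday when counting backwards), where two weekend days are consumed before the first business day.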

Query

That said, I wouldn't use the above function for this query at all. Build a complete grid of relevant days once instead of calculating the range of days for every single row.

Based on this assertion in a comment (should be in the question, really!):

two subsequent windows of the same firm can never overlap.

WITH range AS (              -- only with flag
   SELECT company
        , min(day) - 4 AS r_start   -- 2 business days back can span a weekend
        , max(day) + 3 AS r_stop    -- 1 business day ahead can span a weekend
   FROM   tbl t 
   WHERE  flag <> 0
   GROUP  BY 1
   )
, grid AS (
   SELECT company, day::date
   FROM   range r
         ,generate_series(r.r_start, r.r_stop, interval '1d') d(day)
   WHERE  extract('ISODOW' FROM d.day) < 6
   )
SELECT *, sum(flag) OVER(PARTITION BY company ORDER BY day
                         ROWS BETWEEN UNBOUNDED PRECEDING
                         AND 2 following) AS window_nr
FROM  (
   SELECT t.*, max(t.flag) OVER(PARTITION BY g.company ORDER BY g.day
                           ROWS BETWEEN 1 preceding
                           AND 2 following) in_window
   FROM   grid     g
   LEFT   JOIN tbl t USING (company, day)
   ) sub
WHERE  in_window > 0      -- only rows in [-2,1] window
AND    day IS NOT NULL    -- exclude missing days in [-2,1] window
ORDER  BY company, day;

How?

  • Build a grid of all business days: CTE grid.

  • To keep the grid to its smallest possible size, extract minimum and maximum (plus buffer) day per company: CTE range.

  • LEFT JOIN actual rows to it. Now the frames for the ensuing window functions work with static numbers.

  • To get distinct numbers per flag and company (window_nr), just count flags from the start of the grid (taking buffers into account).

  • Only keep days inside your [-2,1] windows (in_window > 0).

  • Only keep days with actual rows in the table.
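The steps above can be walked through in plain Python on the sample rows from the question. This is a sketch, not a translation of the query: it computes both window values on the full grid before filtering, it assumes calendar buffers of 4 days before and 3 days after so that a weekend cannot cut off a window, and `window_numbers` is a hypothetical name.

```python
from datetime import date, timedelta

def window_numbers(rows):
    """rows: (day, company, flag) -> (company, day, flag, window_nr), sorted."""
    flags = {(c, d): f for d, c, f in rows}
    out = []
    for company in sorted({c for _, c, _ in rows}):
        event_days = [d for d, c, f in rows if c == company and f != 0]
        if not event_days:
            continue
        # Range step: first and last flag day plus buffers (assumed widths, see above).
        start = min(event_days) - timedelta(days=4)
        stop = max(event_days) + timedelta(days=3)
        # Grid step: every Monday..Friday between start and stop.
        grid = [start + timedelta(days=i) for i in range((stop - start).days + 1)]
        grid = [d for d in grid if d.isoweekday() < 6]
        # LEFT JOIN step: days without a table row get None.
        joined = [flags.get((company, d)) for d in grid]
        filled = [f or 0 for f in joined]
        for i, day in enumerate(grid):
            in_window = max(filled[max(0, i - 1): i + 3])   # 1 preceding .. 2 following
            window_nr = sum(filled[: i + 3])                # grid start .. 2 following
            if in_window > 0 and joined[i] is not None:
                out.append((company, day, joined[i], window_nr))
    return out

# Sample rows copied from the question's result table.
rows = [
    (date(2012, 1, 23), "A", 0), (date(2012, 1, 24), "A", 0),
    (date(2012, 1, 25), "A", 1), (date(2012, 1, 25), "B", 0),
    (date(2012, 1, 26), "A", 0), (date(2012, 1, 26), "B", 0),
    (date(2012, 1, 27), "B", 1), (date(2012, 1, 30), "B", 0),
    (date(2013, 1, 10), "A", 0), (date(2013, 1, 11), "A", 0),
    (date(2013, 1, 14), "A", 1),
]
for row in window_numbers(rows):
    print(*row)
```

On this data company A gets window_nr 1 for its four rows in January 2012 and 2 for its three rows in January 2013, while company B gets 1 for all four of its rows, matching the "occur" column the question asks for.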

Voilà.

SQL Fiddle.
