Count occurrences of combinations of columns
I have daily time series (business days, to be precise) for different companies, and I work with PostgreSQL. There is also an indicator variable (called flag) that takes the value 0 most of the time and 1 on rare event days. If the indicator takes the value 1 for a company, I want to further investigate the entries from two days before to one day after that event for the corresponding company. Let me refer to that as the [-2,1] window, with the event day being day 0.
I am using the following query:
CREATE TABLE test AS
WITH cte AS (
SELECT *
, MAX(flag) OVER(PARTITION BY company ORDER BY day
ROWS BETWEEN 1 preceding AND 2 following) Lead1
FROM mytable)
SELECT *
FROM cte
WHERE Lead1 = 1
ORDER BY day,company
For the company experiencing the event, the query takes the entries ranging from two days before the event to one day after it. It does that for all events.
This is a small section of the resulting table:
day company flag
2012-01-23 A 0
2012-01-24 A 0
2012-01-25 A 1
2012-01-25 B 0
2012-01-26 A 0
2012-01-26 B 0
2012-01-27 B 1
2012-01-30 B 0
2013-01-10 A 0
2013-01-11 A 0
2013-01-14 A 1
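As a sanity check on the frame, here is a minimal sketch that runs the same window function over inline sample rows instead of mytable. The rows for 2012-01-20 and 2012-01-27 are made up for illustration and are not part of the excerpt above; the alias lead1 follows the query above.

```sql
-- Sketch only: inline rows stand in for mytable (company A, first event).
-- The rows for 2012-01-20 and 2012-01-27 are hypothetical padding.
WITH mytable(day, company, flag) AS (
   VALUES (date '2012-01-20', 'A', 0)
        , (date '2012-01-23', 'A', 0)   -- day -2
        , (date '2012-01-24', 'A', 0)   -- day -1
        , (date '2012-01-25', 'A', 1)   -- event day (day 0)
        , (date '2012-01-26', 'A', 0)   -- day +1
        , (date '2012-01-27', 'A', 0)
   )
SELECT *
     , max(flag) OVER (PARTITION BY company ORDER BY day
                       ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING) AS lead1
FROM   mytable
ORDER  BY company, day;
-- lead1 is 1 for 2012-01-23 through 2012-01-26 and 0 for the two padding rows.
```

The frame looks one row back and two rows ahead from each row, which is the mirror image of the [-2,1] window seen from the event day. It counts rows, not calendar days, so it only matches the intended window when no business day is missing for the company.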
Now I want to do further calculations for every [-2,1] window separately, so I need a variable that identifies each [-2,1] window. The idea is to number the windows of every company with a variable "occur", so that in further calculations I can use the clause
GROUP BY company, occur
Therefore my desired output looks like this:
day company flag occur
2012-01-23 A 0 1
2012-01-24 A 0 1
2012-01-25 A 1 1
2012-01-25 B 0 1
2012-01-26 A 0 1
2012-01-26 B 0 1
2012-01-27 B 1 1
2012-01-30 B 0 1
2013-01-10 A 0 2
2013-01-11 A 0 2
2013-01-14 A 1 2
In the example, company B occurs only once (occur = 1), but company A occurs twice: the first time from 2012-01-23 to 2012-01-26, and the second time from 2013-01-10 to 2013-01-14. The second time range of company A does not contain all four days surrounding the event day (-2,-1,0,1) because the company leaves the dataset before the end of that time range.
As I said, I am working with business days. I don't care about holidays; I have data from Monday to Friday. Earlier I wrote the following function:
CREATE OR REPLACE FUNCTION addbusinessdays(date, integer)
RETURNS date AS
$BODY$
WITH alldates AS (
SELECT i,
$1 + (i * CASE WHEN $2 < 0 THEN -1 ELSE 1 END) AS date
FROM generate_series(0,(ABS($2) + 5)*2) i
),
days AS (
SELECT i, date, EXTRACT('dow' FROM date) AS dow
FROM alldates
),
businessdays AS (
SELECT i, date, d.dow FROM days d
WHERE d.dow BETWEEN 1 AND 5
ORDER BY i
)
-- adding business days to a date --
SELECT date FROM businessdays WHERE
CASE WHEN $2 > 0 THEN date >=$1 WHEN $2 < 0
THEN date <=$1 ELSE date =$1 END
LIMIT 1
offset ABS($2)
$BODY$
LANGUAGE 'sql' VOLATILE;
It can add/subtract business days from a given date and works like this:
select * from addbusinessdays('2013-01-14',-2)
delivers the result 2013-01-10. So in Jakub's approach we can change the second- and third-to-last lines to
t1.day BETWEEN addbusinessdays(w.day, -2) AND addbusinessdays(w.day, 1)
and can deal with business days.
Basically, the strategy is to first enumerate the flag days and then join the other rows to them:
WITH windows AS(
SELECT t1.day
,t1.company
,rank() OVER (PARTITION BY company ORDER BY day) as rank
FROM table1 t1
WHERE flag =1)
SELECT t1.day
,t1.company
,t1.flag
,w.rank
FROM table1 AS t1
JOIN windows AS w
ON
t1.company = w.company
AND
t1.day BETWEEN   -- the row's day must fall in the [-2,1] window around the flag day
w.day - interval '2 day' AND w.day + interval '1 day'
ORDER BY t1.day, t1.company;
However, there is a problem with work days, as the term can mean anything (do holidays count?).
While using the function addbusinessdays(), consider this instead:
CREATE OR REPLACE FUNCTION addbusinessdays(date, integer)
RETURNS date AS
$func$
SELECT day
FROM (
SELECT i, $1 + i * sign($2)::int AS day
FROM generate_series(0, ((abs($2) * 7) / 5) + 3) i
) sub
WHERE EXTRACT(ISODOW FROM day) < 6 -- truncate weekend
ORDER BY i
OFFSET abs($2)
LIMIT 1
$func$ LANGUAGE sql IMMUTABLE;
Never quote the language name sql. It's an identifier, not a string.
Why was the function VOLATILE? Make it IMMUTABLE for better performance in repeated use and more options (like using it in a functional index).
(ABS($2) + 5)*2 is way too much padding. Replace with ((abs($2) * 7) / 5) + 3.
Multiple levels of CTEs were useless cruft.
ORDER BY in the last CTE was useless, too.
As mentioned in my previous answer, extract(ISODOW FROM ...) is more convenient to truncate weekends.
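To illustrate the last point, here is a small sketch that lists both numbering schemes for one calendar week. The date range is an arbitrary choice for illustration (2013-01-07 is a Monday); the column aliases are made up.

```sql
-- Sketch: one calendar week, Monday 2013-01-07 through Sunday 2013-01-13.
SELECT d::date                        AS day
     , extract(dow    FROM d)         AS dow      -- Sunday = 0 ... Saturday = 6
     , extract(isodow FROM d)         AS isodow   -- Monday = 1 ... Sunday = 7
     , extract(isodow FROM d) < 6     AS is_business_day
FROM   generate_series(timestamp '2013-01-07'
                     , timestamp '2013-01-13'
                     , interval '1 day') d;
```

With ISODOW both weekend days sit at the top of the range (6 and 7), so a single comparison removes them. With DOW, Sunday is 0 and Saturday is 6, which needs a range check like the BETWEEN 1 AND 5 in the original function.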
That said, I wouldn't use the above function for this query at all. Build a complete grid of relevant days once instead of calculating the range of days for every single row.
Based on this assertion in a comment (should be in the question, really!):
two subsequent windows of the same firm can never overlap.
WITH range AS ( -- only with flag
SELECT company
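, min(day) - 4 AS r_start   -- 2 business days back can span up to 4 calendar days
, max(day) + 3 AS r_stop    -- 1 business day ahead can span up to 3 calendar days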
FROM tbl t
WHERE flag <> 0
GROUP BY 1
)
, grid AS (
SELECT company, day::date
FROM range r
,generate_series(r.r_start, r.r_stop, interval '1d') d(day)
WHERE extract('ISODOW' FROM d.day) < 6
)
SELECT *, sum(flag) OVER(PARTITION BY company ORDER BY day
ROWS BETWEEN UNBOUNDED PRECEDING
AND 2 following) AS window_nr
FROM (
SELECT t.*, max(t.flag) OVER(PARTITION BY g.company ORDER BY g.day
ROWS BETWEEN 1 preceding
AND 2 following) in_window
FROM grid g
LEFT JOIN tbl t USING (company, day)
) sub
WHERE in_window > 0 -- only rows in [-2,1] window
AND day IS NOT NULL -- exclude missing days in [-2,1] window
ORDER BY company, day;
Build a grid of all business days: CTE grid.
To keep the grid to its smallest possible size, extract the minimum and maximum (plus buffer) day per company: CTE range.
LEFT JOIN actual rows to it. Now the frames for the ensuing window functions work with static numbers.
To get distinct numbers per flag and company (window_nr), just count flags from the start of the grid (taking buffers into account).
Only keep days inside your [-2,1] windows (in_window > 0).
Only keep days with actual rows in the table.
Voilà.
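The numbering step can be tried in isolation. The following sketch applies it to the eleven rows shown in the question, supplied inline under the name test (taken from the question's CREATE TABLE), and then groups by company and window number. It assumes, as stated above, that windows of the same company never overlap; the CTE name numbered and the output column names are made up for illustration.

```sql
-- Sketch: the question's sample rows, already reduced to [-2,1] windows.
WITH test(day, company, flag) AS (
   VALUES (date '2012-01-23', 'A', 0)
        , (date '2012-01-24', 'A', 0)
        , (date '2012-01-25', 'A', 1)
        , (date '2012-01-25', 'B', 0)
        , (date '2012-01-26', 'A', 0)
        , (date '2012-01-26', 'B', 0)
        , (date '2012-01-27', 'B', 1)
        , (date '2012-01-30', 'B', 0)
        , (date '2013-01-10', 'A', 0)
        , (date '2013-01-11', 'A', 0)
        , (date '2013-01-14', 'A', 1)
   )
, numbered AS (
   -- Running count of flags per company. Looking 2 rows ahead lets the
   -- two rows before an event already count that event.
   SELECT *
        , sum(flag) OVER (PARTITION BY company ORDER BY day
                          ROWS BETWEEN UNBOUNDED PRECEDING
                                   AND 2 FOLLOWING) AS window_nr
   FROM   test
   )
SELECT company, window_nr
     , min(day) AS first_day
     , max(day) AS last_day
     , count(*) AS n_rows
FROM   numbered
GROUP  BY company, window_nr
ORDER  BY company, window_nr;
-- Three groups: (A, 1) with 4 rows, (A, 2) with 3 rows, (B, 1) with 4 rows.
```

Here window_nr plays the role of the "occur" column from the question. The look-ahead of two rows only lines up with the windows if the rows before each event are present, so treat the sketch as a check on sample data rather than a general solution.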