
Speeding up SQL Server cross apply to get aggregated data

In SQL Server, I am trying to put together a single query which grabs a row and includes aggregated data from the two-hour window before that row, as well as aggregated data from the one-hour window after it. How can I make this run faster?

The rows have timestamps with millisecond precision and are not evenly spaced. I have over 50 million rows in this table, and the query does not seem to complete. There are indexes in many places, but they don't seem to help. I was also thinking about using a window function, but I am not sure that it's possible to have a sliding window over unevenly distributed rows. Also, for the future one-hour window, I am not sure how that would be done with a SQL window.

Box is a string and has 10 unique values. Process is a string and has 30 unique values. The average duration_ms is 200 ms. Errors account for less than 0.1% of the data. The 50 million rows describe a year's worth of data.

select 
c1.start_time,
c1.end_time,
c1.box,
c1.process,
datediff(ms,c1.start_time,c1.end_time) as duration_ms,
datepart(dw,c1.start_time) as day_of_week,
datepart(hour,c1.start_time) as hour_of_day,
c3.*,
c5.*
from metrics_table c1
cross apply
(select 
    avg(cast(datediff(ms,c2.start_time,c2.end_time) as numeric)) as avg_ms,
    count(1) as num_process_total,
    count(distinct process) as num_process_unique,
    count(distinct box) as num_box_unique
    from metrics_table c2
    where datediff(minute,c2.start_time,c1.start_time) <= 120
    and c1.start_time> c2.start_time
    and c2.error_code = 0
) c3
cross apply
(select
    avg(case when datediff(ms,c4.start_time,c4.end_time)>1000 then 1.0 else 0.0 end) as percent_over_thresh
    from metrics_table c4
    where datediff(hour,c1.start_time,c4.start_time) <= 1
    and c4.start_time> c1.start_time
    and c4.error_code= 0
) c5
where
c1.error_code= 0

Edit

Version: SQL Azure 12.0

Adding execution plan: (image not included)

The following should be a step in the right direction. Note: c2.start_time and c4.start_time are no longer wrapped in DATEDIFF functions, which makes the predicates SARGable.

SELECT
    c1.start_time,
    c1.end_time,
    c1.box,
    c1.process,
    DATEDIFF(ms, c1.start_time, c1.end_time) AS duration_ms,
    DATEPART(dw, c1.start_time) AS day_of_week,
    DATEPART(HOUR, c1.start_time) AS hour_of_day,
    c3.*,
    c5.*
FROM
    dbo.metrics_table c1
    CROSS APPLY (
                SELECT
                    AVG(CAST(DATEDIFF(ms, c2.start_time, c2.end_time) AS NUMERIC)) AS avg_ms,
                    COUNT(1) AS num_process_total,
                    COUNT(DISTINCT process) AS num_process_unique,
                    COUNT(DISTINCT box) AS num_box_unique
                FROM
                    dbo.metrics_table c2
                WHERE
                    -- was: DATEDIFF(minute, c2.start_time, c1.start_time) <= 120
                    -- Keep both bounds: the two hours before c1, excluding c1 itself
                    c2.start_time >= DATEADD(MINUTE, -120, c1.start_time)
                    AND c2.start_time < c1.start_time
                    AND c2.error_code = 0
                ) c3
    CROSS APPLY (
                SELECT
                    AVG(CASE WHEN DATEDIFF(ms, c4.start_time, c4.end_time) > 1000 THEN 1.0 ELSE 0.0 END
                    ) AS percent_over_thresh
                FROM
                    dbo.metrics_table c4
                WHERE
                    -- was: DATEDIFF(HOUR, c1.start_time, c4.start_time) <= 1
                    -- Keep both bounds: the hour after c1, excluding c1 itself
                    c4.start_time > c1.start_time
                    AND c4.start_time <= DATEADD(HOUR, 1, c1.start_time)
                    AND c4.error_code = 0
                ) c5
WHERE
    c1.error_code = 0;
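
One thing to keep in mind when swapping DATEDIFF for DATEADD: DATEDIFF counts the datepart boundaries crossed between two values, not the elapsed time, so a DATEADD-based range is not exactly the same window as a DATEDIFF-based filter. A small illustration, using made-up timestamps rather than rows from metrics_table:

```sql
-- Illustrative values only; these timestamps are assumptions, not table data.
DECLARE @a DATETIME = '2017-01-01T10:59:00';
DECLARE @b DATETIME = '2017-01-01T11:01:00';
DECLARE @c DATETIME = '2017-01-01T10:01:00';
DECLARE @d DATETIME = '2017-01-01T11:59:00';

SELECT
    -- 2 minutes apart, one hour boundary (11:00) crossed: returns 1
    DATEDIFF(HOUR, @a, @b) AS two_minutes_apart,
    -- 118 minutes apart, still only one hour boundary crossed: returns 1
    DATEDIFF(HOUR, @c, @d) AS almost_two_hours_apart,
    -- Elapsed-time bound: exactly 60 minutes after @c
    DATEADD(HOUR, 1, @c)   AS one_hour_after_c;
```

Both DATEDIFF calls return 1, so a filter of the form DATEDIFF(hour, ...) <= 1 can admit rows almost two hours ahead, while the DATEADD form stops at exactly 60 minutes. Which of the two windows is wanted is a decision for the query's author.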

Of course, making a query SARGable doesn't do any good unless an appropriate index is available. The following should work for all 3 metrics_table references (check which indexes already exist first; you may not need to create a new one):

CREATE NONCLUSTERED INDEX ixf_metricstable_errorcode_starttime ON dbo.metrics_table (
    error_code,
    start_time
    )
INCLUDE (
    end_time,
    box,
    process
    )
WHERE 
    error_code = 0;
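
To see which indexes already exist before creating a new one, the catalog views can be queried. This is a minimal sketch, assuming the table is dbo.metrics_table as in the queries above; the column aliases are just illustrative names:

```sql
-- List every index on dbo.metrics_table with its key and included columns.
SELECT
    i.name              AS index_name,
    i.type_desc         AS index_type,
    i.has_filter,
    i.filter_definition,
    c.name              AS column_name,
    ic.key_ordinal,           -- 0 when the column is not a key column
    ic.is_included_column
FROM sys.indexes i
    JOIN sys.index_columns ic
        ON ic.object_id = i.object_id
       AND ic.index_id  = i.index_id
    JOIN sys.columns c
        ON c.object_id = ic.object_id
       AND c.column_id = ic.column_id
WHERE i.object_id = OBJECT_ID('dbo.metrics_table')
ORDER BY i.index_id, ic.is_included_column, ic.key_ordinal;
```

If an existing index already leads on start_time (or on error_code, start_time) and covers end_time, box and process, it may serve the same purpose as the filtered index.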

I used BETWEEN and got good performance in my simple test rig. I've also used columnstore, as 50 million records is data-warehouse volume:

CREATE TABLE dbo.metrics_table (
    rowId       INT IDENTITY,
    start_time  DATETIME NOT NULL,
    end_time    DATETIME NOT NULL,
    box         VARCHAR(10) NOT NULL,
    process     VARCHAR(10) NOT NULL,
    error_code  INT NOT NULL
);


-- Add records
;WITH cte AS (
SELECT TOP 3334 ROW_NUMBER() OVER ( ORDER BY ( SELECT 1 ) ) rn
FROM sys.columns c1
    CROSS JOIN sys.columns c2
    CROSS JOIN sys.columns c3
)
INSERT INTO dbo.metrics_table ( start_time, end_time, box, process, error_code )
SELECT
    DATEADD( ms, rn, DATEADD( day, rn % 365, '1 Jan 2017' ) ) AS start_time,
    DATEADD( ms, rn % 409, DATEADD( ms, rn, DATEADD( day, rn % 365, '1 Jan 2017' ) ) ) AS end_time,
    'box' + CAST( boxes.box AS VARCHAR(10) ) box,
    'process' + CAST( processes.process AS VARCHAR(10) ) process,
    ABS( CAST( rn % 3000 AS BIT ) -1 ) error_code
FROM cte c
    CROSS JOIN ( SELECT TOP 10 rn FROM cte ) AS boxes(box)
    CROSS JOIN ( SELECT TOP 30 rn FROM cte ) AS processes(process);


-- Create normal clustered index to order the data
CREATE CLUSTERED INDEX cci_metrics_table ON dbo.metrics_table ( start_time, end_time, box, process );
--CREATE CLUSTERED INDEX cci_metrics_table ON dbo.metrics_table ( box, process, start_time, end_time );

-- Convert to columnstore
CREATE CLUSTERED COLUMNSTORE INDEX cci_metrics_table ON dbo.metrics_table WITH ( MAXDOP = 1, DROP_EXISTING = ON );



IF OBJECT_ID('tempdb..#tmp1' ) IS NOT NULL DROP TABLE #tmp1

-- two hour window before, 1 hour window after
SELECT
    c1.start_time,
    c1.end_time,
    c1.box,
    c1.process,
    DATEDIFF( ms, c1.start_time, c1.end_time ) AS duration_ms,
    DATEPART( dw, c1.start_time ) AS day_of_week,
    DATEPART( hour, c1.start_time ) AS hour_of_day,
    c2.xavg,
    c2.num_process_total,
    c2.num_process_unique,
    c2.num_box_unique,
    c3.percent_over_thresh

INTO #tmp1

FROM dbo.metrics_table c1
    CROSS APPLY
        (
        SELECT
            COUNT(1) AS num_process_total, 
            AVG( CAST( DATEDIFF( ms, start_time, end_time ) AS NUMERIC ) ) xavg,
            COUNT( DISTINCT process ) num_process_unique,
            COUNT( DISTINCT box ) num_box_unique
        FROM dbo.metrics_table c2
        WHERE c2.error_code = 0
          AND c2.start_time Between DATEADD( minute, -120, c1.start_time ) And c1.start_time
          AND c1.start_time > c2.start_time
        ) c2

    CROSS APPLY
        (
        SELECT
            AVG( CASE WHEN DATEDIFF( ms, c4.start_time, c4.end_time ) > 1000 THEN 1.0 ELSE 0.0 END ) percent_over_thresh
        FROM dbo.metrics_table c4
        WHERE c4.error_code = 0
          AND c4.start_time Between c1.start_time And DATEADD( minute, 60, c1.start_time )
          AND c4.start_time > c1.start_time
        ) c3

WHERE c1.error_code = 0;
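
To judge whether a rewrite like this helps on your own data, one option is to spot-check a row of the result with SQL Server's timing and I/O statistics switched on. A rough sketch, assuming the #tmp1 table produced by the query above still exists in the current session; the variable @t is a hypothetical helper:

```sql
-- Report CPU/elapsed time and logical reads for the statements that follow.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- Pick one timestamp from the result set to spot-check.
DECLARE @t DATETIME = ( SELECT MAX( start_time ) FROM #tmp1 );

-- Recompute the two-hour look-back count directly and compare it with #tmp1.
SELECT
    ( SELECT MAX( num_process_total ) FROM #tmp1 WHERE start_time = @t ) AS from_tmp1,
    ( SELECT COUNT(1)
      FROM dbo.metrics_table
      WHERE error_code = 0
        AND start_time >= DATEADD( minute, -120, @t )
        AND start_time <  @t ) AS recomputed;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```

The two columns should match; the statistics output appears in the Messages tab alongside the result.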
