T-SQL Query Optimization

I'm upgrading an internal web analytics system we provide for our clients (in the absence of a preferred vendor or Google Analytics), and I'm working on the following query:

select 
    path as EntryPage, 
    count(Path) as [Count] 
from 
    (
        /* Sub-query 1 */
        select 
            pv2.path
        from 
            pageviews pv2 
                inner join
                    (
                        /* Sub-query 2 */
                        select
                            pv1.sessionid,
                            min(pv1.created) as created
                        from
                            pageviews pv1 
                                inner join Sessions s1 on pv1.SessionID = s1.SessionID
                                inner join Visitors v1 on s1.VisitorID = v1.VisitorID
                        where
                            pv1.Domain = isnull(@Domain, pv1.Domain) and
                            v1.Campaign = @Campaign
                        group by
                            pv1.sessionid
                    ) t1 on pv2.sessionid = t1.sessionid and pv2.created = t1.created
    ) t2
group by 
    Path;

I've tested this query with 2 million rows in the PageViews table and it takes about 20 seconds to run. The execution plan shows two clustered index scans, both on the PageViews table. There is a clustered index on the Created column in that table.

The problem is that in both cases it appears to iterate over all 2 million rows, which I believe is the performance bottleneck. Is there anything I can do to prevent this, or am I pretty much maxed out as far as optimization goes?

For reference, the purpose of the query is to find the first page view for each session.

EDIT: After much frustration, and despite the help received here, I could not make this query work. Therefore, I decided to simply store a reference to the entry page (and now exit page) in the sessions table, which allows me to do the following:

select
    pv.Path,
    count(*)
from
    PageViews pv
        inner join Sessions s on pv.SessionID = s.SessionID
            and pv.PageViewID = s.ExitPage
        inner join Visitors v on s.VisitorID = v.VisitorID
where
    (
        @Domain is null or 
        pv.Domain = @Domain
    ) and
    v.Campaign = @Campaign
group by pv.Path;

This query runs in 3 seconds or less. Now I either have to update the entry/exit page in real time as the page views are recorded (the optimal solution) or run a batch update at some interval. Either way, it solves the problem, but not in the way I'd intended.

Edit Edit: Adding a missing index (after cleaning up from last night) reduced the query to mere milliseconds. Woo hoo!

For starters,

    where pv1.Domain = isnull(@Domain, pv1.Domain) 

won't SARG. As I remember, you can't optimize a match on a function, so an index on Domain can't be used for a seek here.

To continue from doofledorf.

Try this:

where
   (@Domain is null or pv1.Domain = @Domain) and
   v1.Campaign = @Campaign
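
One thing worth checking before swapping the predicates: the two forms are not strictly equivalent if Domain allows NULLs (an assumption here, since the schema isn't shown). A minimal sketch using a throwaway table variable with made-up rows:

```sql
-- Hypothetical sample data; the real PageViews table is not touched.
DECLARE @pv TABLE (PageViewID int, Domain varchar(50) NULL);
INSERT INTO @pv (PageViewID, Domain)
VALUES (1, 'a.example'), (2, 'b.example'), (3, NULL);

DECLARE @Domain varchar(50) = NULL;

-- ISNULL form: the row with a NULL Domain drops out,
-- because NULL = NULL does not evaluate to true.
SELECT COUNT(*) AS IsNullForm
FROM @pv
WHERE Domain = ISNULL(@Domain, Domain);          -- 2

-- OR form: every row qualifies when @Domain is NULL.
SELECT COUNT(*) AS OrForm
FROM @pv
WHERE (@Domain IS NULL OR Domain = @Domain);     -- 3
```

If Domain is declared NOT NULL, both forms return the same rows and only the plan differs.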

Ok, I have a couple of suggestions:

  1. Create this covering index:

      create index idx2 on [PageViews]([SessionID], Domain, Created, Path) 
  2. If you can amend the Sessions table so that it stores the entry page, e.g. EntryPageViewID, you will be able to heavily optimise this.
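
As a rough sketch of the second suggestion, the column could be added and backfilled in one batch job. The `int` type for EntryPageViewID is an assumption (it should match PageViews.PageViewID), and ties on Created are broken by PageViewID here, which is also a guess at the intended behaviour:

```sql
-- Assumed column type; adjust to match PageViews.PageViewID.
ALTER TABLE Sessions ADD EntryPageViewID int NULL;
GO

-- Number each session's page views by time and keep the first one.
WITH FirstViews AS (
    SELECT
        SessionID,
        PageViewID,
        ROW_NUMBER() OVER (PARTITION BY SessionID
                           ORDER BY Created, PageViewID) AS rn
    FROM PageViews
)
UPDATE s
SET s.EntryPageViewID = fv.PageViewID
FROM Sessions s
    INNER JOIN FirstViews fv
        ON fv.SessionID = s.SessionID
       AND fv.rn = 1
WHERE s.EntryPageViewID IS NULL;   -- only fill sessions not yet populated
```

`GO` is the batch separator used by SSMS and sqlcmd; it is needed so the new column exists before the UPDATE is compiled.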

Your inner query (pv1) will require a nonclustered index on (Domain).

The second query (pv2) can already find the rows it needs thanks to the clustered index on Created, but pv1 might be returning so many rows that SQL Server decides a table scan is quicker than all the locks it would need to take. As pv1 groups on SessionID (and hence has to order by SessionID), a nonclustered index on (SessionID, Created) that includes Path should permit a MERGE join to occur. If not, you can force a merge join with "SELECT .. FROM pageviews pv2 INNER MERGE JOIN ..."

The two indexes listed above would be:

CREATE NONCLUSTERED INDEX ncixcampaigndomain ON PageViews (Domain)

CREATE NONCLUSTERED INDEX ncixsessionidcreated ON PageViews (SessionID, Created) INCLUDE (path)
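
If the optimizer does not pick a merge join on its own, the hinted form mentioned above might look roughly like this. The variable types and sample values are assumptions. Note that a local join hint also makes SQL Server enforce the written join order, so it is worth comparing plans before keeping it:

```sql
DECLARE @Domain varchar(255) = NULL;             -- assumed type
DECLARE @Campaign varchar(50) = 'spring-sale';   -- assumed type and value

select
    pv2.path as EntryPage,
    count(pv2.path) as [Count]
from
    pageviews pv2
        inner merge join   -- join hint: force a merge join for this join
            (
                -- first page view time per session
                select
                    pv1.sessionid,
                    min(pv1.created) as created
                from
                    pageviews pv1
                        inner join Sessions s1 on pv1.SessionID = s1.SessionID
                        inner join Visitors v1 on s1.VisitorID = v1.VisitorID
                where
                    (@Domain is null or pv1.Domain = @Domain) and
                    v1.Campaign = @Campaign
                group by
                    pv1.sessionid
            ) t1 on pv2.sessionid = t1.sessionid and pv2.created = t1.created
group by
    pv2.path;
```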

I'm back. To answer your first question, you could probably just do a union on the two conditions, since they are obviously disjoint.

Actually, you're trying to cover both the case where you provide a domain and the case where you don't. You want two queries. They may optimize entirely differently.
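
A minimal sketch of the two-query idea, shown only for the inner "first page view per session" step. The variable types and sample values are assumptions:

```sql
DECLARE @Domain varchar(255) = NULL;             -- assumed type
DECLARE @Campaign varchar(50) = 'spring-sale';   -- assumed type and value

IF @Domain IS NULL
BEGIN
    -- No domain supplied: no predicate on Domain at all.
    select
        pv1.sessionid,
        min(pv1.created) as created
    from
        pageviews pv1
            inner join Sessions s1 on pv1.SessionID = s1.SessionID
            inner join Visitors v1 on s1.VisitorID = v1.VisitorID
    where
        v1.Campaign = @Campaign
    group by
        pv1.sessionid;
END
ELSE
BEGIN
    -- Domain supplied: a plain equality that an index on Domain can serve.
    select
        pv1.sessionid,
        min(pv1.created) as created
    from
        pageviews pv1
            inner join Sessions s1 on pv1.SessionID = s1.SessionID
            inner join Visitors v1 on s1.VisitorID = v1.VisitorID
    where
        pv1.Domain = @Domain and
        v1.Campaign = @Campaign
    group by
        pv1.sessionid;
END
```

Each branch gets its own plan, so the no-domain case is not tied to a plan built around a Domain predicate.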

What's the nature of the data in these tables? Do you find most of the data is inserted/deleted regularly?

Is that the full schema for the tables? The query plan shows different indexing. Edit: Sorry, just read the last line of text. If the tables are routinely cleared and re-inserted, you could think about ditching the clustered index and using the tables as heap tables. Just a thought.

You definitely should put non-clustered index(es) on Campaign and Domain, as John suggested.

SELECT  
    pv.sessionid,  
    MIN(pv.created) AS created  
FROM  
    pageviews pv  
JOIN  
    sessions s ON pv.sessionid = s.sessionid  
JOIN  
    visitors v ON s.visitorid = v.visitorid  
WHERE  
    v.campaign = @Campaign  
GROUP BY  
    pv.sessionid  

So that gives you the sessions for a campaign. Now let's see what you're doing with that.

OK, this gets rid of your grouping:

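SELECT  
    v.campaign,  
    pv.sessionid,   
    pv.path  
FROM  
    pageviews pv  
JOIN  
    sessions s ON pv.sessionid = s.sessionid  
JOIN  
    visitors v ON s.visitorid = v.visitorid  
WHERE  
    v.campaign = @Campaign  
    AND NOT EXISTS (  
        -- keep a row only if no earlier page view exists in its session  
        SELECT 1 FROM pageviews earlier  
        WHERE earlier.sessionid = pv.sessionid  
        AND earlier.created < pv.created  
    )  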
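
To get back to the per-page counts the question asks for, the NOT EXISTS form could be grouped by path, roughly as below. This is a sketch that joins through Sessions as the question's query does. It leaves out the optional Domain filter, and the variable type and sample value are assumptions:

```sql
DECLARE @Campaign varchar(50) = 'spring-sale';   -- assumed type and value

SELECT
    pv.Path AS EntryPage,
    COUNT(*) AS [Count]
FROM
    PageViews pv
        INNER JOIN Sessions s ON pv.SessionID = s.SessionID
        INNER JOIN Visitors v ON s.VisitorID = v.VisitorID
WHERE
    v.Campaign = @Campaign
    AND NOT EXISTS (
        -- anti-join: no earlier page view in the same session
        SELECT 1
        FROM PageViews earlier
        WHERE earlier.SessionID = pv.SessionID
          AND earlier.Created < pv.Created
    )
GROUP BY
    pv.Path;
```

If two page views in a session share the same earliest Created value, both are counted, which matches the behaviour of the original join on min(created). An index on (SessionID, Created) is what would let the NOT EXISTS probe run as a seek.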
