Slow performing SQL query with triple self-join

I have a legacy database with the following table (note: no primary key).

It defines one record for each accommodation "unit" and date, along with the price for that date.

CREATE TABLE [single_date_availability](
    [accommodation_id] [int],
    [accommodation_unit_id] [int],
    [arrival_date] [datetime],
    [price] [decimal](18, 0),
    [offer_discount] [decimal](18, 0),
    [num_pax] [int],
    [rooms_remaining] [int],
    [eta_available] [int],
    [date_correct] [datetime],
    [max_occupancy] [int],
    [max_adults] [int],
    [min_stay_nights] [int],
    [max_stay_nights] [int],
    [nights_remaining_count] [numeric](2, 0)
) ON [PRIMARY]

The table contains roughly 16,500 records.

But I need to multiply out the data into a completely different format, like so:

  • Accommodation
  • Date
  • Duration
  • Total price

Up to a maximum duration for each arrival date.

I'm using the following query to achieve this:

SELECT
    MIN(units.MaxAccommodationAvailabilityPax) AS MaxAccommodationAvailabilityPax,
    MIN(units.MaxAccommodationAvailabilityAdults) AS MaxAccommodationAvailabilityAdults,
    StartDate AS DepartureDate,
    EndDate AS ReturnDate,
    DATEDIFF(DAY, StartDate, EndDate) AS Duration,
    MIN(units.accommodation_id) AS AccommodationID, 
    x.accommodation_unit_id AS AccommodationUnitID,
    SUM(Price) AS Price,
    MAX(num_pax) AS Occupancy,
    SUM(offer_discount) AS OfferSaving,
    MIN(date_correct) AS DateTimeCorrect,
    MIN(rooms_remaining) AS RoomsRemaining,
    MIN(CONVERT(int, dbo.IsGreaterThan(ISNULL(eta_available, 0)+ISNULL(nights_remaining_count, 0), 0))) AS EtaAvailable
FROM single_date_availability fp
INNER JOIN (
    /* This gets max availability for the whole accommodation on the arrival date */
    SELECT accommodation_id, arrival_date,
        CASE EtaAvailable WHEN 1 THEN 99 ELSE MaxAccommodationAvailabilityPax END AS MaxAccommodationAvailabilityPax,
        CASE EtaAvailable WHEN 1 THEN 99 ELSE MaxAccommodationAvailabilityAdults END AS MaxAccommodationAvailabilityAdults
    FROM (SELECT accommodation_id, arrival_date, SUM(MaximumOccupancy) MaxAccommodationAvailabilityPax, SUM(MaximumAdults) MaxAccommodationAvailabilityAdults,
            CONVERT(int, WebData.dbo.IsGreaterThan(SUM(EtaAvailable), -1)) AS EtaAvailable                 
            FROM (SELECT accommodation_id, arrival_date, MIN(rooms_remaining*max_occupancy) as MaximumOccupancy,
                    MIN(rooms_remaining*max_adults) as MaximumAdults, MIN(ISNULL(eta_available, 0) + ISNULL(nights_remaining_count, 0) - 1) as EtaAvailable
                    FROM single_date_availability
                    GROUP BY accommodation_id, accommodation_unit_id, arrival_date) a 
            GROUP BY accommodation_id, arrival_date) b
) units ON fp.accommodation_id = units.accommodation_id AND fp.arrival_date = units.arrival_date
INNER JOIN (
    /* This gets every combination of StartDate and EndDate for each Unit/Occupancy */
    SELECT DISTINCT a.accommodation_unit_id, StartDate = a.arrival_date,
        EndDate = b.arrival_date+1, Duration = DATEDIFF(DAY, a.arrival_date, b.arrival_date)+1
        FROM single_date_availability AS a
        INNER JOIN (SELECT accommodation_unit_id, arrival_date FROM single_date_availability) AS b
        ON a.accommodation_unit_id = b.accommodation_unit_id
            AND DATEDIFF(DAY, a.arrival_date, b.arrival_date)+1 >= a.min_stay_nights
            AND DATEDIFF(DAY, a.arrival_date, b.arrival_date)+1 <= (CASE a.max_stay_nights WHEN 0 THEN 28 ELSE a.max_stay_nights END)
) x ON fp.accommodation_unit_id = x.accommodation_unit_id AND fp.arrival_date >= x.StartDate AND fp.arrival_date < x.EndDate
GROUP BY x.accommodation_unit_id, StartDate, EndDate
/* This ensures that all dates between StartDate and EndDate are actually available */
HAVING COUNT(*) = DATEDIFF(DAY, StartDate, EndDate)

This works and gives me about 413,000 records. I'm using the results of this query to update another table.

But the query performs quite badly, as you might expect with so many self-joins. It takes about 15 seconds to run locally, over 1 minute 30 seconds on our test server, and over 30 seconds on our live SQL server; in all cases it maxes out the CPU while it's performing the larger of the joins.

No other processes are accessing the table at the same time, and that can be assumed.

I don't really mind how long the query takes so much as the demand on the CPU, which can cause problems for other queries trying to access other databases / tables at the same time.

I have run the query through the query optimizer and followed all the recommendations for indexes and statistics.

Any help on making this query faster, or at least less CPU intensive, would be much appreciated. If it needs to be broken down into different stages, that's acceptable.

To be honest, speed is not so important, as it's a bulk operation performed on a table that's not being touched by other processes.

I'm not particularly looking for comments on how terrible and un-normalized this structure is... that, I already know :-)

This site is for professional programmers, right?

It is stultifying to try and operate on a "table" without a primary key. Fine, it is a workspace, not a real table (but it is large, and you are trying to perform relational table operations on it). Fine, you know it is unnormalised. Actually the database is unnormalised, and this "table" is a product of it: an exponential unnormalised product.

"This works and gives me about 413,000 records. The results of this query I'm using to update another table."

That is even more crazy. All this (a) temp worktables and (b) temp worktables for the temp worktables business are classic symptoms of an unnormalised database, OR of an inability to understand the data as it is and how to get the data out, so that unnecessary worktables get created to supply your need. I am not trying to get you to change that, which would be the first option, and which would eliminate the need for this entire mess.

The second option would be to see if you can produce the final result from the original tables, either:
- using no worktables
- using one worktable
instead of the two worktables (16,500 and 413,000 "records"; that's two levels of exponential unnormalisation).

The third option is to improve the mess you have ... but first you need to understand where the performance hogs are ...

"But the query performs quite badly, as you might expect with so many self-joins"

Nonsense, joins and self-joins cost nothing. The cost is in:

  • you are operating on a Heap

  • without a PK

    • those two items alone mean performance has not been considered and cannot be expected
  • using operators and functions (rather than pure "=") in joins means the server cannot make reasonable decisions on the search values, so you are table scanning all the time

  • table size (maybe different on Dev/Test/Prod)

  • valid, usable indices (or not)

  • the cost is in those four items, the heaps being brutishly slow in every aspect, and the operators not identifying anything to narrow the searches; not in the fact that there is or is not a join operation.
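
As an illustration of the heap and index points above, here is a minimal sketch of what giving the table a clustered index could look like. The key columns are an assumption based only on how the query joins and groups; the real natural key of the table is not stated in the question, so the clustered index is deliberately declared non-unique, and the index names are made up. If the indexes already created from the optimizer's recommendations cover the same columns, this adds nothing.

```sql
-- Assumption: most access is by unit and arrival date, as in the posted query.
-- Declared non-unique because the question does not say whether
-- (accommodation_unit_id, arrival_date) identifies a single row.
CREATE CLUSTERED INDEX CIX_single_date_availability_unit_date
    ON single_date_availability (accommodation_unit_id, arrival_date);

-- Supports the per-accommodation aggregate (the derived table aliased "units").
-- The INCLUDE list holds the columns that aggregate reads, so it can be
-- answered from the index alone.
CREATE NONCLUSTERED INDEX IX_single_date_availability_accommodation_date
    ON single_date_availability (accommodation_id, arrival_date)
    INCLUDE (accommodation_unit_id, rooms_remaining, max_occupancy,
             max_adults, eta_available, nights_remaining_count);
```

Whether the optimizer actually uses either index depends on the join predicates; range predicates wrapped in DATEDIFF will still tend to scan.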

The next series of issues is the way you are doing it.

  • Do you NOT realise that the "joins" are materialised tables? You are not "joining"; you are materialising TABLES on the fly. Nothing is free: materialisation has an enormous cost. You are so focused on materialising, without any idea of the cost, that you think the joins are the problem. Why is that?

  • Before you can make any reasonable coding decisions, you need to SET SHOWPLAN and STATISTICS IO ON. Do this while you are developing (it is nowhere near ready for "testing"). That will give you an idea of the tables; the joins (what you expect vs what it determined, from the mess); the worktables (materialised). The high CPU usage is nothing; wait until you see the insane I/O your code uses. If you want to argue about the cost of materialising on the fly, be my guest, but post the SHOWPLAN first.

  • Note that the materialised tables have no indices, so the server table scans them every time, for the, um, "joins".

  • The select, as is, is doing tens of times (maybe hundreds of times) more work than it needs to. Since the table is there, and it has not moved, materialising another version of it is a very silly thing to do. So, the true question is:

How Come My SQL Query with One Table and Six Materialised Versions of Itself is Slow?

In case you are not sure, that means: eliminate the six materialised tables and replace them with pure joins to the main table.
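
The SHOWPLAN and STATISTICS IO advice above can be applied in SQL Server roughly as follows. This is a sketch only: the short SELECT is a placeholder for the real query, and GO is the batch separator understood by tools such as SSMS and sqlcmd, not a T-SQL statement. The plan output is also where you can check whether each derived table really is spooled into a worktable or is folded into the surrounding joins.

```sql
-- Estimated plan only: statements are compiled but not executed.
-- SET SHOWPLAN_TEXT must be the only statement in its batch.
SET SHOWPLAN_TEXT ON;
GO
SELECT accommodation_unit_id, arrival_date, price
    FROM single_date_availability
    WHERE accommodation_unit_id = 1;   -- placeholder for the query under test
GO
SET SHOWPLAN_TEXT OFF;
GO

-- Actual execution, reporting logical/physical reads per table (including
-- any worktables) plus CPU and elapsed time.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
GO
SELECT accommodation_unit_id, arrival_date, price
    FROM single_date_availability
    WHERE accommodation_unit_id = 1;   -- placeholder for the query under test
GO
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
GO
```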

  • If you can accept breaking it up, then do so. Create and load the temp tables that this query is going to use FIRST (that means 3 temp tables, for aggregates only). Make sure you place indices on the correct columns.

  • So the 6 materialised tables will be replaced with 3 joins to the main table and 3 joins to temp aggregate tables.

  • Somewhere along the line, you have identified that you have cartesian products and duplicates; instead of fixing the cause (developing code that produces the set you need) you have avoided all that, left it full of dupes, and pulled out the DISTINCT rows. That causes an additional worktable. Fix that. You have to get each of the temp tables (worktables, materialised tables, whatever) correct FIRST, before the select that uses them can be reasonably expected to be correct.

  • THEN try the select.

  • I presume this is all running in WebData. If not, place IsGreaterThan() in this db.
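
A staged version along the lines of the bullets above might look like the following sketch. It is not a tested rewrite: the temp table names (#units, #ranges) and index names are invented here, the number of stages differs from the three aggregate tables suggested above, and WebData.dbo.IsGreaterThan is the asker's own function, used exactly as in the posted query because its definition is not shown. DISTINCT is kept in the second stage only to preserve the original behaviour until the cause of the duplicates is known.

```sql
-- Stage 1: availability per accommodation and arrival date
-- (same logic as the derived table aliased "units" in the posted query).
SELECT  accommodation_id,
        arrival_date,
        CASE EtaAvailable WHEN 1 THEN 99 ELSE MaxAccommodationAvailabilityPax END
            AS MaxAccommodationAvailabilityPax,
        CASE EtaAvailable WHEN 1 THEN 99 ELSE MaxAccommodationAvailabilityAdults END
            AS MaxAccommodationAvailabilityAdults
    INTO #units
    FROM (
        SELECT  accommodation_id,
                arrival_date,
                SUM(MaximumOccupancy) AS MaxAccommodationAvailabilityPax,
                SUM(MaximumAdults)    AS MaxAccommodationAvailabilityAdults,
                CONVERT(int, WebData.dbo.IsGreaterThan(SUM(EtaAvailable), -1))
                    AS EtaAvailable
            FROM (
                SELECT  accommodation_id,
                        arrival_date,
                        MIN(rooms_remaining * max_occupancy) AS MaximumOccupancy,
                        MIN(rooms_remaining * max_adults)    AS MaximumAdults,
                        MIN(ISNULL(eta_available, 0) + ISNULL(nights_remaining_count, 0) - 1)
                            AS EtaAvailable
                    FROM single_date_availability
                    GROUP BY accommodation_id, accommodation_unit_id, arrival_date
                ) AS a
            GROUP BY accommodation_id, arrival_date
        ) AS b;

CREATE CLUSTERED INDEX CIX_units ON #units (accommodation_id, arrival_date);

-- Stage 2: every StartDate/EndDate combination per unit
-- (same logic as the derived table aliased "x").
SELECT DISTINCT
        a.accommodation_unit_id,
        a.arrival_date                   AS StartDate,
        DATEADD(DAY, 1, b.arrival_date)  AS EndDate
    INTO #ranges
    FROM single_date_availability AS a
    INNER JOIN single_date_availability AS b
        ON  a.accommodation_unit_id = b.accommodation_unit_id
        AND DATEDIFF(DAY, a.arrival_date, b.arrival_date) + 1 >= a.min_stay_nights
        AND DATEDIFF(DAY, a.arrival_date, b.arrival_date) + 1 <=
            CASE a.max_stay_nights WHEN 0 THEN 28 ELSE a.max_stay_nights END;

CREATE CLUSTERED INDEX CIX_ranges ON #ranges (accommodation_unit_id, StartDate, EndDate);

-- Stage 3: the final select keeps its column list, GROUP BY and HAVING,
-- but its FROM clause becomes:
--   FROM single_date_availability AS fp
--   INNER JOIN #units  AS units ON fp.accommodation_id = units.accommodation_id
--                              AND fp.arrival_date     = units.arrival_date
--   INNER JOIN #ranges AS x     ON fp.accommodation_unit_id = x.accommodation_unit_id
--                              AND fp.arrival_date >= x.StartDate
--                              AND fp.arrival_date <  x.EndDate
```

Running each stage separately with STATISTICS IO on should show which of the three steps is responsible for most of the reads and CPU.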


  1. Please provide the DDL for the UDF IsGreaterThan. If it is using tables, we need to know about it.

  2. Please provide the alleged indices along with the CREATE TABLE statement. They could be incorrect or, worse, doubled up and not required.

  3. Forget Identity or forced values: what is the actual, real, natural, logical PK for this heap of a worktable?

  4. Ensure you have no datatype mismatches on the join columns.

  5. Personally, I would be too ashamed to post code such as you have. It is completely unreadable. All I did, in order to identify the problems here, was format it and make it readable. There are reasons for making code readable; for one, it allows you to spot problems quickly. It doesn't matter what formatting you use, but you have to format, and you have to do it consistently. Please clean it up before you post again, along with ALL related DDL.

It is no wonder that you have not been getting answers. You need to do some basic work first (showplan, etc.) and prepare the code so that human beings can read it, so that they can provide answers.

SELECT
        MIN(units.MaxAccommodationAvailabilityPax) AS MaxAccommodationAvailabilityPax,
        MIN(units.MaxAccommodationAvailabilityAdults) AS MaxAccommodationAvailabilityAdults,
        StartDate AS DepartureDate,
        EndDate AS ReturnDate,
        DATEDIFF(DAY, StartDate, EndDate) AS Duration,
        MIN(units.accommodation_id) AS AccommodationID, 
        x.accommodation_unit_id AS AccommodationUnitID,
        SUM(Price) AS Price,
        MAX(num_pax) AS Occupancy,
        SUM(offer_discount) AS OfferSaving,
        MIN(date_correct) AS DateTimeCorrect,
        MIN(rooms_remaining) AS RoomsRemaining,
        MIN(CONVERT(int, dbo.IsGreaterThan(ISNULL(eta_available, 0)+ISNULL(nights_remaining_count, 0), 0))) 
            AS EtaAvailable
    FROM single_date_availability fp INNER JOIN (
        -- This gets max availability for the whole accommodation on the arrival date
        SELECT  accommodation_id, arrival_date,
                CASE EtaAvailable 
                    WHEN 1 THEN 99
                    ELSE MaxAccommodationAvailabilityPax 
                    END AS MaxAccommodationAvailabilityPax,
                CASE EtaAvailable
                    WHEN 1 THEN 99
                    ELSE MaxAccommodationAvailabilityAdults
                    END AS MaxAccommodationAvailabilityAdults
            FROM ( 
                SELECT  accommodation_id, arrival_date,
                        SUM(MaximumOccupancy) AS MaxAccommodationAvailabilityPax,
                        SUM(MaximumAdults) AS MaxAccommodationAvailabilityAdults,
                        CONVERT(int, WebData.dbo.IsGreaterThan(SUM(EtaAvailable), -1))
                            AS EtaAvailable                 
                    FROM ( 
                        SELECT  accommodation_id,
                                arrival_date,
                                MIN(rooms_remaining*max_occupancy) as MaximumOccupancy,
                                MIN(rooms_remaining*max_adults) as MaximumAdults, 
                                MIN(ISNULL(eta_available, 0) + ISNULL(nights_remaining_count, 0) - 1)
                                    as EtaAvailable
                            FROM single_date_availability
                            GROUP BY accommodation_id, accommodation_unit_id, arrival_date
                            ) a 
                    GROUP BY accommodation_id, arrival_date
                    ) b
            ) units 
        ON fp.accommodation_id = units.accommodation_id 
        AND fp.arrival_date = units.arrival_date INNER JOIN (
            -- This gets every combination of StartDate and EndDate for each Unit/Occupancy
            SELECT  DISTINCT a.accommodation_unit_id,   -- DISTINCT here forces an extra worktable
                    StartDate = a.arrival_date,
                    EndDate = b.arrival_date+1,
                    Duration = DATEDIFF(DAY, a.arrival_date, b.arrival_date)+1
                FROM single_date_availability AS a INNER JOIN ( 
                    SELECT  accommodation_unit_id,
                            arrival_date 
                        FROM single_date_availability
                        ) AS b
                ON a.accommodation_unit_id = b.accommodation_unit_id
                AND DATEDIFF(DAY, a.arrival_date, b.arrival_date)+1 >= a.min_stay_nights
                AND DATEDIFF(DAY, a.arrival_date, b.arrival_date)+1 <= (
                    CASE a.max_stay_nights 
                        WHEN 0 THEN 28 
                        ELSE a.max_stay_nights 
                        END
                )
        ) x ON fp.accommodation_unit_id = x.accommodation_unit_id 
        AND fp.arrival_date >= x.StartDate 
        AND fp.arrival_date < x.EndDate
    GROUP BY x.accommodation_unit_id, StartDate, EndDate
    -- This ensures that all dates between StartDate and EndDate are actually available
    HAVING COUNT(*) = DATEDIFF(DAY, StartDate, EndDate)

This most likely won't fix all of your issues, but try switching

AND DATEDIFF(DAY , a.arrival_date , b.arrival_date) + 1 >= a.min_stay_nights
AND DATEDIFF(DAY , a.arrival_date , b.arrival_date) + 1 <= (CASE a.max_stay_nights WHEN 0 THEN 28 ELSE a.max_stay_nights END)

to

AND a.min_stay_nights <= DATEDIFF(DAY , a.arrival_date , b.arrival_date) + 1
AND (CASE a.max_stay_nights WHEN 0 THEN 28 ELSE a.max_stay_nights END) >= DATEDIFF(DAY , a.arrival_date , b.arrival_date) + 1

The reason is that, as far as I can recall, SQL Server doesn't like functions on the left side of the = sign in WHERE clauses.
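
Swapping the two sides of the comparison may not change much on its own, because b.arrival_date is still wrapped in DATEDIFF either way. A variation on the same idea is to leave b.arrival_date bare and move the arithmetic onto the other side, which gives the optimizer a plain range on the column to work with. The sketch below does that for the date-combination subquery. It assumes every arrival_date is stored at midnight (no time portion); if that does not hold, the range form is not equivalent to the DATEDIFF form.

```sql
SELECT DISTINCT
        a.accommodation_unit_id,
        StartDate = a.arrival_date,
        EndDate   = DATEADD(DAY, 1, b.arrival_date),
        Duration  = DATEDIFF(DAY, a.arrival_date, b.arrival_date) + 1
    FROM single_date_availability AS a
    INNER JOIN single_date_availability AS b
        ON  b.accommodation_unit_id = a.accommodation_unit_id
        -- Equivalent to DATEDIFF(DAY, a, b) + 1 >= a.min_stay_nights
        -- under the midnight assumption.
        AND b.arrival_date >= DATEADD(DAY, a.min_stay_nights - 1, a.arrival_date)
        -- Equivalent to DATEDIFF(DAY, a, b) + 1 <= max nights, where a
        -- max_stay_nights of 0 means 28, as in the original query.
        AND b.arrival_date <  DATEADD(DAY,
                CASE a.max_stay_nights WHEN 0 THEN 28 ELSE a.max_stay_nights END,
                a.arrival_date);
```

This only pays off if there is an index that leads with accommodation_unit_id, arrival_date; without one the join still has to scan.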

Since you said you have already run the query optimizer, I can only assume all your indexes are correct. My next approach would be to do the join in the application. What do I mean by that? Instead of having the DB join hundreds of thousands of rows, fetch all of them once in your application and then use loops and logic to do what you would have done in SQL.

The reason for this is that many applications, such as Facebook, Yahoo and AOL, frown upon joins. Joins are not the best thing to do unless you know they will be fast. In this case, you would want to do the join in the application and then cache the result for future needs.
