Can the performance of this overlapping bookings query be improved?

I maintain an online bookings system that occasionally contains duplicate overlapping bookings as a result of one or more bugs we are trying to locate. While we do so, I've been given a query that lists the overlapping bookings for the past two months so we can address them manually.

My problem is that this query takes forever (5+ minutes) to run, and the bookings system grinds to a halt while it does so, to the detriment of our users. So I'd like to improve its performance.

The relevant schema is pseudo-coded below. There are two key tables and their respective columns.

Bookings                        Accounts
ID : int                        ID : int
Status : bool                   Status : bool
StartTime : datetime            Name : varchar
EndTime : datetime
RoomID : int
MemberID : int
AccountID : int

PK: ID                          PK: ID
Index: StartTime, EndTime, 
       MemberID, AccountID,
       RoomID, Status

The keys are all simple keys (i.e. no compound keys). Bookings.AccountID is a foreign key into Accounts.ID.

The query is roughly:

SELECT b1.AccountID, a.Name, b1.ID, b2.ID, b1.StartTime, b1.EndTime, b1.RoomID
FROM Bookings b1
LEFT JOIN Bookings b2
ON b1.MemberID = b2.MemberID
   AND b1.RoomID = b2.RoomID
   AND b2.StartTime > SUBDATE(NOW(), INTERVAL 2 MONTH)
LEFT JOIN Accounts a
ON b1.AccountId = a.ID 
WHERE b1.ID != b2.ID
AND b1.Status = 1
AND b2.Status = 1
AND b1.StartTime > SUBDATE(NOW(), INTERVAL 2 MONTH)
AND (
  (b1.StartTime >= b2.StartTime AND b2.EndTime <= b1.EndTime AND b1.StartTime < b2.EndTime) OR
  (b1.StartTime <= b2.StartTime AND b2.EndTime >= b1.EndTime AND b2.StartTime < b1.EndTime) OR
  (b2.StartTime <= b1.StartTime AND b2.EndTime >= b1.EndTime)
)

As far as I can tell, the query essentially joins the bookings table to itself (for the past two months) and attempts to eliminate distinct bookings. That is, it looks for valid (Status = 1) bookings belonging to the same member for the same room where the durations of the bookings overlap.

The last three clauses look for (a) a booking starting during the other and finishing after; (b) a booking starting before the other and finishing during; and (c) a booking wholly contained within the other. To my mind this appears to omit a booking that wholly surrounds the other (although I'm not sure why).
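One way to check which overlap cases the three clauses actually cover is to brute-force small integer intervals and compare the result against the usual overlap test (`start1 < end2 AND start2 < end1`). The sketch below is plain Python and not part of the original query; the function names are made up for illustration, and it assumes every booking has StartTime < EndTime.

```python
from itertools import product

def three_clause(s1, e1, s2, e2):
    """The OR'ed condition from the query; (s1, e1) is b1 and (s2, e2) is b2."""
    return ((s1 >= s2 and e2 <= e1 and s1 < e2) or
            (s1 <= s2 and e2 >= e1 and s2 < e1) or
            (s2 <= s1 and e2 >= e1))

def overlaps(s1, e1, s2, e2):
    """Usual test: the two intervals share more than a single endpoint."""
    return s1 < e2 and s2 < e1

# All intervals with integer endpoints in 0..5 and start < end.
intervals = [(s, e) for s, e in product(range(6), repeat=2) if s < e]

# Overlapping pairs that the three clauses fail to report.
missed = [(b1, b2) for b1, b2 in product(intervals, repeat=2)
          if overlaps(*b1, *b2) and not three_clause(*b1, *b2)]

# Pairs the three clauses report although they do not overlap.
false_hits = [(b1, b2) for b1, b2 in product(intervals, repeat=2)
              if three_clause(*b1, *b2) and not overlaps(*b1, *b2)]

# True if every missed pair has b1 strictly enclosing b2.
only_enclosing = all(b1[0] < b2[0] and b2[1] < b1[1] for b1, b2 in missed)
```

On this grid the clauses never report a non-overlapping pair, and the only overlaps they miss are those where b1 strictly encloses b2. Since the self-join produces every pair in both orders, such a pair is still reported once with the roles swapped (as case (c)), which may be why the enclosing case was left out.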

The bookings table is very large (~2m rows) as it has years of bookings data in it. Can the performance of this query be improved (or the query replaced with a better one)? Any suggestions welcome.

I would rewrite the query like this:

SELECT sub.*, a.Name, a.ID
FROM (

    -- The two booking IDs need distinct aliases inside a derived table.
    SELECT b1.AccountID, b1.ID AS ID1, b2.ID AS ID2, b1.StartTime, b1.EndTime, b1.RoomID
    -- CROSS JOIN rather than a comma, so const.subDate is visible in the ON clause.
    FROM (SELECT SUBDATE(NOW(), INTERVAL 2 MONTH) AS subDate) const
    CROSS JOIN Bookings b1
    LEFT JOIN Bookings b2
    ON b1.MemberID = b2.MemberID
       AND b1.RoomID = b2.RoomID
       AND b2.StartTime > const.subDate
       AND b1.ID != b2.ID
       AND b2.Status = 1
    WHERE
    b1.Status = 1
    AND b1.StartTime > const.subDate
    AND (
      (b1.StartTime >= b2.StartTime AND b2.EndTime <= b1.EndTime AND b1.StartTime < b2.EndTime) OR
      (b1.StartTime <= b2.StartTime AND b2.EndTime >= b1.EndTime AND b2.StartTime < b1.EndTime) OR
      (b2.StartTime <= b1.StartTime AND b2.EndTime >= b1.EndTime)
    )

) sub
LEFT JOIN Accounts a ON
  sub.AccountID = a.ID

UPDATE: Also check whether there are indexes on the columns MemberID, RoomID and StartTime. If there are none, add them.

You didn't say whether this is something like an e-commerce site for hotel/rental bookings, or something like an intranet site for booking conference rooms, lecture halls, etc. within an organization. I'm going to assume it's the former, since 5 minutes of downtime for that kind of site would be significant, whereas for the latter it is probably not as big a deal.

So here's a heuristic you can use: it's unlikely (but not impossible) that a user would book the same room more than once within a two-month period. If you select all the room IDs and user IDs within the timeframe, duplicate rows within the results could be a double-booking, or maybe just someone who goes on vacation a lot.

This is one way duplicate row detection could be done:

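-- Bookings whose (RoomID, MemberID) pair occurs more than once.
-- Add a StartTime filter at both levels to restrict this to the timeframe.
SELECT b.ID, b.StartTime, b.EndTime, b.RoomID, b.MemberID
FROM Bookings b
JOIN (
    SELECT RoomID, MemberID
    FROM Bookings
    GROUP BY RoomID, MemberID
    HAVING COUNT(*) > 1
) AS t ON t.RoomID = b.RoomID AND t.MemberID = b.MemberID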

You could also use a stored procedure, something like this (a rough sketch):

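DELIMITER //
CREATE PROCEDURE find_repeat_bookings()  -- procedure name is illustrative
BEGIN
  DECLARE done INT DEFAULT 0;
  -- v_ prefix keeps the variables from shadowing the column names
  DECLARE v_id, v_rid, v_mid INT;
  DECLARE old_rid, old_mid INT DEFAULT 0;
  DECLARE cur CURSOR FOR SELECT ID, RoomID, MemberID FROM Bookings ORDER BY RoomID, MemberID;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
  OPEN cur;
  read_loop: LOOP
    FETCH cur INTO v_id, v_rid, v_mid;
    IF done THEN LEAVE read_loop; END IF;  -- break once the cursor is exhausted
    -- same room and member as the previous row: a candidate repeat booking
    IF v_rid = old_rid AND v_mid = old_mid THEN
      INSERT INTO temp_table VALUES (v_id);  -- temp_table(ID INT) must already exist
    END IF;
    SET old_rid = v_rid;
    SET old_mid = v_mid;
  END LOOP;
  CLOSE cur;
END //
DELIMITER ;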

Then you'd run a query like your original one, with the StartTime/EndTime comparison, on the result.

Essentially you were searching for all unique bookings. It is much faster to search for all the duplicates, since that list should be shorter:

DROP TABLE IF EXISTS duplicate_bookings;

CREATE TEMPORARY TABLE duplicate_bookings AS SELECT MAX(b1.ID) as last_bookings_id, b1.AccountID, b1.StartTime, b1.EndTime, b1.RoomID
FROM Bookings b1 
GROUP BY b1.AccountID, b1.StartTime, b1.EndTime, b1.RoomID
HAVING COUNT(*)>1;

This query selects all bookings that are exact duplicates (same AccountID, StartTime, EndTime and RoomID). My assumption is that you want to delete the latest booking of each group (MAX(b1.ID)).

Delete those bookings with:

DELETE FROM Bookings WHERE ID IN (SELECT last_bookings_id FROM duplicate_bookings);

Benefit: you can repeat this in a loop (executing all the SQL in a single database session, including the drop of the table duplicate_bookings) if you have triplicates, quadruplicates, etc.
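The delete-and-repeat loop can be tried on a scratch copy before it goes anywhere near production. The sketch below uses Python's built-in sqlite3 with an in-memory table as a stand-in for MySQL; the sample rows and the `remove_exact_duplicates` helper are invented for illustration, and only the columns used by the duplicate check are modelled.

```python
import sqlite3

def remove_exact_duplicates(conn):
    """Repeat the find-and-delete pass until no exact duplicates remain.

    Returns the total number of rows deleted.
    """
    deleted = 0
    while True:
        conn.execute("DROP TABLE IF EXISTS duplicate_bookings")
        conn.execute("""
            CREATE TEMPORARY TABLE duplicate_bookings AS
            SELECT MAX(ID) AS last_bookings_id
            FROM Bookings
            GROUP BY AccountID, StartTime, EndTime, RoomID
            HAVING COUNT(*) > 1
        """)
        cur = conn.execute("""
            DELETE FROM Bookings
            WHERE ID IN (SELECT last_bookings_id FROM duplicate_bookings)
        """)
        if cur.rowcount == 0:  # nothing left to delete: stop looping
            return deleted
        deleted += cur.rowcount

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Bookings (
    ID INTEGER PRIMARY KEY, AccountID INT, RoomID INT,
    StartTime TEXT, EndTime TEXT)""")
conn.executemany(
    "INSERT INTO Bookings VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, 5, "2024-01-01 09:00", "2024-01-01 10:00"),
        (2, 10, 5, "2024-01-01 09:00", "2024-01-01 10:00"),  # duplicate of 1
        (3, 10, 5, "2024-01-01 09:00", "2024-01-01 10:00"),  # triplicate of 1
        (4, 10, 5, "2024-01-01 09:30", "2024-01-01 10:30"),  # overlaps, not identical
    ],
)
deleted = remove_exact_duplicates(conn)
remaining = [row[0] for row in conn.execute("SELECT ID FROM Bookings ORDER BY ID")]
```

Each pass removes one copy per duplicate group, so the triplicate needs two passes. Booking 4 overlaps booking 1 without being identical to it, so this approach leaves it in place; partial overlaps still need a start/end comparison.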

To prevent new duplicates and find your bug quickly, and assuming you are using InnoDB, add a unique index:

CREATE UNIQUE INDEX idx_nn_1 ON Bookings(AccountID, StartTime, EndTime,RoomID);

You can only add this index after removing your duplicates. New duplicate inserts will fail from that point on.
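The ordering constraint (clean up first, then add the index) can be seen in a small experiment. This is a sketch using Python's built-in sqlite3 as a stand-in, with an invented sample row; in MySQL the same situations surface as duplicate-key errors rather than Python exceptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Bookings (
    ID INTEGER PRIMARY KEY, AccountID INT, RoomID INT,
    StartTime TEXT, EndTime TEXT)""")
row = (10, 5, "2024-01-01 09:00", "2024-01-01 10:00")
insert = ("INSERT INTO Bookings (AccountID, RoomID, StartTime, EndTime) "
          "VALUES (?, ?, ?, ?)")
create_index = ("CREATE UNIQUE INDEX idx_nn_1 "
                "ON Bookings(AccountID, StartTime, EndTime, RoomID)")

conn.execute(insert, row)  # gets ID 1
conn.execute(insert, row)  # gets ID 2, an existing duplicate

# 1. The unique index cannot be created while duplicates exist.
try:
    conn.execute(create_index)
    index_created_with_duplicates = True
except sqlite3.DatabaseError:
    index_created_with_duplicates = False

# 2. Remove the later copy; now the index can be created.
conn.execute("DELETE FROM Bookings WHERE ID = 2")
conn.execute(create_index)

# 3. From here on, inserting the same booking again is rejected.
try:
    conn.execute(insert, row)
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM Bookings").fetchone()[0]
```

With the index in place, the buggy code path fails loudly at insert time, which should make it easier to trace where the duplicates come from.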

A temporary index that might also help with the deletion is the non-unique index:

CREATE INDEX idx_nn_2 ON Bookings(AccountID, StartTime, EndTime,RoomID);

This compound index

INDEX(MemberID, RoomID, StartTime)

should speed up the first JOIN.

This should speed up the SELECT:

INDEX(Status, StartTime)

(No, it is not the same to have individual INDEXes on the fields.)

For overlapping time ranges, consider this compact form:

WHERE a.start < b.end AND a.end > b.start 
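Applied to the bookings self-join, the compact form might look like the sketch below. It runs against an in-memory SQLite table through Python's built-in sqlite3 as a stand-in for MySQL; the sample rows are invented, the two-month date filter is left out for brevity, and the `b1.ID < b2.ID` condition (so that each pair is listed only once) is an addition that is not in the original query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Bookings (
    ID INTEGER PRIMARY KEY, Status INT, StartTime TEXT, EndTime TEXT,
    RoomID INT, MemberID INT)""")
conn.executemany(
    "INSERT INTO Bookings VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, "2024-01-01 09:00", "2024-01-01 12:00", 5, 7),
        (2, 1, "2024-01-01 10:00", "2024-01-01 11:00", 5, 7),  # inside booking 1
        (3, 1, "2024-01-01 12:00", "2024-01-01 13:00", 5, 7),  # starts as 1 ends
        (4, 0, "2024-01-01 09:30", "2024-01-01 10:30", 5, 7),  # not active
        (5, 1, "2024-01-01 09:30", "2024-01-01 10:30", 6, 7),  # different room
    ],
)

overlapping_pairs = conn.execute("""
    SELECT b1.ID, b2.ID
    FROM Bookings b1
    JOIN Bookings b2
      ON b1.MemberID = b2.MemberID
     AND b1.RoomID = b2.RoomID
     AND b1.ID < b2.ID                   -- each pair once, never a row with itself
     AND b1.StartTime < b2.EndTime       -- compact overlap test
     AND b1.EndTime > b2.StartTime
    WHERE b1.Status = 1 AND b2.Status = 1
    ORDER BY b1.ID, b2.ID
""").fetchall()
```

Because both comparisons are strict, back-to-back bookings (one starting exactly when the other ends) are not reported, while a booking that wholly surrounds another is.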

What is the meaning of Status = 1? What percentage of the table has Status = 1?
