
Efficiency of SELECT queries in SQL Server

My friend and I have built an advanced model for soccer pools betting. Because of limitations in Excel, we found SQL to be a better choice going forward. We've managed to achieve what we were aiming for, but the process takes around 20-30 minutes each time, which I think is due to inefficiency in the process. I'll try to explain what we are doing, and hopefully you smart people can point me in the right direction towards a more efficient approach.

Let's start with what the SQL database looks like. We have one main table, Rows, which contains all possible combinations of game outcomes for 13 soccer matches:

CREATE TABLE [dbo].[Rows](
[RowID] [int] NULL,
[Match_1] [int] NULL,
[Match_2] [int] NULL,
[Match_3] [int] NULL,
[Match_4] [int] NULL,
[Match_5] [int] NULL,
[Match_6] [int] NULL,
[Match_7] [int] NULL,
[Match_8] [int] NULL,
[Match_9] [int] NULL,
[Match_10] [int] NULL,
[Match_11] [int] NULL,
[Match_12] [int] NULL,
[Match_13] [int] NULL
);

INSERT INTO Rows
    (RowID, Match_1, Match_2, Match_3, Match_4, Match_5, Match_6, Match_7, Match_8, Match_9, Match_10, Match_11, Match_12, Match_13) 
VALUES 
(1,1,1,1,1,1,1,1,1,1,1,1,1,1),
(2,1,1,1,1,1,1,1,1,1,1,1,1,3),
(3,1,1,1,1,1,1,1,1,1,1,1,1,2),
(4,1,1,1,1,1,1,1,1,1,1,1,3,1),
(5,1,1,1,1,1,1,1,1,1,1,1,3,3),
(6,1,1,1,1,1,1,1,1,1,1,1,3,2),
(7,1,1,1,1,1,1,1,1,1,1,1,2,1),
(8,1,1,1,1,1,1,1,1,1,1,1,2,3),
(9,1,1,1,1,1,1,1,1,1,1,1,2,2),
(10,1,1,1,1,1,1,1,1,1,1,3,1,1);

This amounts to around 1.6 million rows (3^13 = 1,594,323). The values stand for 1 = home victory, 3 = draw, 2 = away victory.
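For reference, a table like this does not have to be typed in or imported row by row. The following is only a sketch of how every combination could be generated in one set-based statement. It assumes dbo.Rows exists as defined above and is empty, and the RowID numbering it produces is arbitrary, so it will not necessarily match the numbering in the sample rows.

```sql
-- Sketch: build all 3^13 = 1,594,323 outcome combinations with cross joins.
-- Assumes dbo.Rows exists and is empty. RowID order is arbitrary.
WITH Outcome (v) AS (
    SELECT v FROM (VALUES (1), (3), (2)) AS o (v)   -- 1 = home, 3 = draw, 2 = away
)
INSERT INTO dbo.Rows
    (RowID, Match_1, Match_2, Match_3, Match_4, Match_5, Match_6, Match_7,
     Match_8, Match_9, Match_10, Match_11, Match_12, Match_13)
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
       m1.v, m2.v, m3.v, m4.v, m5.v, m6.v, m7.v,
       m8.v, m9.v, m10.v, m11.v, m12.v, m13.v
FROM Outcome AS m1
CROSS JOIN Outcome AS m2
CROSS JOIN Outcome AS m3
CROSS JOIN Outcome AS m4
CROSS JOIN Outcome AS m5
CROSS JOIN Outcome AS m6
CROSS JOIN Outcome AS m7
CROSS JOIN Outcome AS m8
CROSS JOIN Outcome AS m9
CROSS JOIN Outcome AS m10
CROSS JOIN Outcome AS m11
CROSS JOIN Outcome AS m12
CROSS JOIN Outcome AS m13;
```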

Now we need to select which rows are likely to be the outcome for the current week. We keep our data in Excel and use it to track which conditions to apply. Excel populates the SQL queries with the numbers adjusted for each gameweek, so we just copy and paste them into SQL Server Management Studio. We have around 1300 conditions that we test, and we apply approximately 600-700 of them every week. To add the data for these 13 specific games we use computed columns, so we add around 700 computed columns to the table. Example:

ALTER TABLE dbo.Rows ADD Group1 AS (
    (CASE WHEN [Match_1] = 1 THEN 0 WHEN [Match_1] = 3 THEN 0 WHEN [Match_1] = 2 THEN 0 END)
  + (CASE WHEN [Match_2] = 1 THEN 1 WHEN [Match_2] = 3 THEN 0 WHEN [Match_2] = 2 THEN 0 END)
  + (CASE WHEN [Match_3] = 1 THEN 0 WHEN [Match_3] = 3 THEN 0 WHEN [Match_3] = 2 THEN 0 END)
  + (CASE WHEN [Match_4] = 1 THEN 0 WHEN [Match_4] = 3 THEN 0 WHEN [Match_4] = 2 THEN 0 END)
  + (CASE WHEN [Match_5] = 1 THEN 0 WHEN [Match_5] = 3 THEN 0 WHEN [Match_5] = 2 THEN 0 END)
  + (CASE WHEN [Match_6] = 1 THEN 0 WHEN [Match_6] = 3 THEN 0 WHEN [Match_6] = 2 THEN 0 END)
  + (CASE WHEN [Match_7] = 1 THEN 0 WHEN [Match_7] = 3 THEN 0 WHEN [Match_7] = 2 THEN 0 END)
  + (CASE WHEN [Match_8] = 1 THEN 0 WHEN [Match_8] = 3 THEN 0 WHEN [Match_8] = 2 THEN 0 END)
  + (CASE WHEN [Match_9] = 1 THEN 0 WHEN [Match_9] = 3 THEN 0 WHEN [Match_9] = 2 THEN 0 END)
  + (CASE WHEN [Match_10] = 1 THEN 0 WHEN [Match_10] = 3 THEN 0 WHEN [Match_10] = 2 THEN 0 END)
  + (CASE WHEN [Match_11] = 1 THEN 0 WHEN [Match_11] = 3 THEN 0 WHEN [Match_11] = 2 THEN 0 END)
  + (CASE WHEN [Match_12] = 1 THEN 0 WHEN [Match_12] = 3 THEN 0 WHEN [Match_12] = 2 THEN 0 END)
  + (CASE WHEN [Match_13] = 1 THEN 0 WHEN [Match_13] = 3 THEN 0 WHEN [Match_13] = 2 THEN 0 END)
);

Basically, with the help of the CASE expressions, this calculates how many of the desired outcomes each row has. For Group1 we want exactly one matching game, as you will see in the SELECT query below.
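As an aside, every term in the Group1 expression above is 0 except a home win in Match_2, so the same count could be written much more compactly. This is only a sketch, and the column name Group1_Simple is hypothetical (chosen to avoid clashing with Group1). One difference to be aware of: the original expression evaluates to NULL if any match column is NULL or holds a value other than 1, 2 or 3, whereas this version returns 0 or 1 based on Match_2 alone. On a table that holds only complete rows, the two give the same result.

```sql
-- Sketch: only terms with a non-zero weight need to appear in the expression.
-- Group1_Simple is a hypothetical name used for illustration.
ALTER TABLE dbo.Rows ADD Group1_Simple AS
    (CASE WHEN [Match_2] = 1 THEN 1 ELSE 0 END);
```

Generating shorter expressions from Excel also means less text for the server to parse when around 700 of these columns are added.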

The last step (the one that takes the majority of the time) is to SELECT all rows that fulfil all our requirements. As mentioned, it uses approximately 700 different conditions, so we have to split it into several queries using WITH clauses.

WITH Step1 AS (
SELECT [Rows].[RowID],[Rows].[Match_1],[Rows].[Match_2],[Rows].[Match_3],[Rows].[Match_4],[Rows].[Match_5],[Rows].[Match_6],[Rows].[Match_7],[Rows].[Match_8],[Rows].[Match_9],[Rows].[Match_10],[Rows].[Match_11],[Rows].[Match_12],[Rows].[Match_13]
FROM Rows
WHERE [Rows].[Group1] >= 1 AND [Rows].[Group1] <= 1
)
SELECT [Step1].[RowID],[Step1].[Match_1],[Step1].[Match_2],[Step1].[Match_3],[Step1].[Match_4],[Step1].[Match_5],[Step1].[Match_6],[Step1].[Match_7],[Step1].[Match_8],[Step1].[Match_9],[Step1].[Match_10],[Step1].[Match_11],[Step1].[Match_12],[Step1].[Match_13]
INTO FinalRows
FROM Step1
;

Where should I look to simplify this and gain efficiency? Do you have any suggestions for me going forward? The ideal would be to get from start to finish in 5-10 minutes at most.

TL;DR

Use PERSISTED computed columns, and check the execution plan for hints on where you can optimise your query or data structures.


There are far more efficient ways to do this. The first change you should consider is storing ALL the results in the SQL database; it looks like you have only modelled the outcomes over the entire pool for a single round.

I and other commenters have questioned the validity of such a model, especially when additional metadata like team identity, current position on the ladder, and previous win/loss ratio for the season or against the same team are not taken into account. By storing the data such that you record only the results of each match, additional insights can be gained and used if and when they are needed in your betting application.

In Australia we simply call this "Tipping"; everyone has a go at designing something like this at some point. You will find that from the raw results of each round alone you can write efficient set-based queries to come up with all sorts of derivative models. You might not even go back to Excel for data management, and instead use it only for data visualization.
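To make the idea of storing raw results concrete, here is a minimal sketch. The table name dbo.MatchResults and all of its columns are hypothetical, since the question does not show how the historic data is laid out in Excel.

```sql
-- Hypothetical table: one row per played match, raw results only.
CREATE TABLE dbo.MatchResults (
    MatchID   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Season    smallint     NOT NULL,
    RoundNo   tinyint      NOT NULL,
    HomeTeam  nvarchar(50) NOT NULL,
    AwayTeam  nvarchar(50) NOT NULL,
    HomeGoals tinyint      NOT NULL,
    AwayGoals tinyint      NOT NULL
);

-- Example of a derivative, set-based query: home record per team and season.
SELECT Season,
       HomeTeam,
       COUNT(*) AS HomeGames,
       SUM(CASE WHEN HomeGoals > AwayGoals THEN 1 ELSE 0 END) AS HomeWins,
       SUM(CASE WHEN HomeGoals = AwayGoals THEN 1 ELSE 0 END) AS HomeDraws
FROM dbo.MatchResults
GROUP BY Season, HomeTeam
ORDER BY Season, HomeTeam;
```

Metadata such as ladder position or head-to-head record can then be derived with further queries instead of being maintained by hand.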

The first inefficiency is that the data is not all in the database, or if it is, it is not stored in an efficient manner. With simple indexes you should be able to get very efficient queries over many millions of records. 1.6 million rows with only 14 columns is not a very big record set. It's not small, but you should still expect sub-second response times on standard hardware for simple queries. 700 columns is a lot of data, but if the columns are defined as TINYINT then storage is not a big deal and you should still get very fast results.
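As a sketch of what that could look like for the Rows table in the question, which is currently a heap with nullable int columns: the constraint name PK_Rows is hypothetical, and the statements assume that RowID contains no NULLs or duplicates.

```sql
-- Run this before any computed columns are added: SQL Server will not
-- alter a column that a computed column depends on.
ALTER TABLE dbo.Rows ALTER COLUMN RowID int NOT NULL;
ALTER TABLE dbo.Rows ADD CONSTRAINT PK_Rows PRIMARY KEY CLUSTERED (RowID);

-- Outcomes are only 1, 2 or 3, so one byte per match is enough.
ALTER TABLE dbo.Rows ALTER COLUMN Match_1 tinyint NULL;
ALTER TABLE dbo.Rows ALTER COLUMN Match_2 tinyint NULL;
-- Repeat for Match_3 through Match_13.
```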

But it sounds like your Excel sheet is generating a batch of individual queries. There is a cost imposed simply for establishing the connection, parsing the query and returning the results. This is sometimes insignificant for a single query, but if you are issuing many queries then this handshaking and transmission cost can add up. You should try to turn the individual queries into more of a set-based query.

  • This again is much easier if ALL of the data you have in Excel is migrated into SQL Server, or at least the raw results for each match of each historic round.
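As a small illustration of folding several conditions into one statement, the following sketch counts how many rows survive a handful of conditions at once. Group2 and Group3 are hypothetical computed columns and their bounds are made up; only Group1 appears in the question.

```sql
-- Sketch: evaluate several conditions in a single pass over the table.
-- Group2, Group3 and their bounds are invented for illustration.
SELECT COUNT(*) AS SurvivingRows
FROM dbo.Rows
WHERE Group1 = 1
  AND Group2 BETWEEN 2 AND 4
  AND Group3 <= 3;
```

A single WHERE clause can hold many such predicates, so the list generated in Excel does not need to be split into one query per condition.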

The next obvious efficiency gain is PERSISTED computed columns. If you do not persist the values of a computed column, then the value has to be resolved for every query, and for every evaluation within that query. If your 700 columns are computed, that is a lot of work for the server, re-calculating the same values over and over.

Follow "Specify Computed Columns in a Table" for how to create PERSISTED computed columns:

ALTER TABLE dbo.Products ADD RetailValue AS (QtyAvailable * UnitPrice * 1.5) PERSISTED;

This may well be the first change you can make that has a significant impact. It offloads the computation to the INSERT command and adds a storage footprint, but you will experience much greater read throughput; one might even suggest at least a 700% increase.
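Applied to the table in the question, this could look roughly like the sketch below. It assumes Group1 already exists as a non-persisted computed column, and the index name IX_Rows_Group1 is hypothetical. Whether an index on each of roughly 700 columns is worthwhile is something to test rather than assume.

```sql
-- Sketch: persist an existing computed column in place.
-- The expression must be deterministic, which a CASE over plain columns is.
ALTER TABLE dbo.Rows ALTER COLUMN Group1 ADD PERSISTED;

-- Optional: index the persisted column so filters on it can seek instead of scan.
-- IX_Rows_Group1 is a hypothetical name.
CREATE NONCLUSTERED INDEX IX_Rows_Group1 ON dbo.Rows (Group1);
```

For new columns, appending PERSISTED to the generated ALTER TABLE ... ADD statement has the same effect.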

The last change is to your CTE query. You have not demonstrated any need for a CTE or nested query, and SQL Server will probably optimise it away anyway. Do not be fooled into thinking that CTEs are cached results of a sub-query; that is not how they are implemented under the hood. That is outside the scope of this post, so just use this to return your final results:

SELECT [Rows].[RowID],[Rows].[Match_1],[Rows].[Match_2],[Rows].[Match_3],[Rows].[Match_4],[Rows].[Match_5],[Rows].[Match_6],[Rows].[Match_7],[Rows].[Match_8],[Rows].[Match_9],[Rows].[Match_10],[Rows].[Match_11],[Rows].[Match_12],[Rows].[Match_13]
INTO FinalRows
FROM Rows
WHERE [Rows].[Group1] >= 1 AND [Rows].[Group1] <= 1
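Because this runs every week, a re-runnable variant may be convenient. The sketch below assumes SQL Server 2016 or later for DROP TABLE IF EXISTS, and it writes the range check as a plain equality, which selects the same rows.

```sql
-- SELECT ... INTO fails if the target table already exists, so drop it first.
-- DROP TABLE IF EXISTS requires SQL Server 2016 or later.
DROP TABLE IF EXISTS dbo.FinalRows;

SELECT RowID, Match_1, Match_2, Match_3, Match_4, Match_5, Match_6, Match_7,
       Match_8, Match_9, Match_10, Match_11, Match_12, Match_13
INTO dbo.FinalRows
FROM dbo.Rows
WHERE Group1 = 1;   -- equivalent to Group1 >= 1 AND Group1 <= 1
```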

A Note about Efficiency

When you want to troubleshoot query performance, the first place to check is the Actual Execution Plan. You can also review the Estimated Execution Plan, which might give you some pointers, but the Actual plan shows the real plan that was used to resolve the query. For long-running queries you will be able to visually identify the bottlenecks, and then you can post back to SO or https://dba.stackexchange.com/ for more specific advice.
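In SQL Server Management Studio the actual plan can be switched on with "Include Actual Execution Plan" (Ctrl+M) before running the query. A script-only alternative is sketched below; the query in the middle is just a stand-in for whichever statement is slow.

```sql
-- Sketch: collect timing, I/O and the actual plan for one statement.
SET STATISTICS TIME ON;   -- CPU and elapsed time, shown on the Messages tab
SET STATISTICS IO ON;     -- logical and physical reads per table
SET STATISTICS XML ON;    -- returns the actual execution plan as XML

SELECT COUNT(*) FROM dbo.Rows WHERE Group1 = 1;   -- stand-in for the slow query

SET STATISTICS XML OFF;
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

Comparing the reads and elapsed time before and after persisting the computed columns gives a concrete measure of how much each change helps.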
