How can I improve the speed of a SQL query searching for a collection of strings
I have a table called T_TICKET with a column CallId varchar(30).
Here is an example of my data:
CallId | RelatedData
===========================================
MXZ_SQzfGMCPzUA | 0000
MXyQq6wQ7gVhzUA | 0001
MXwZN_d5krgjzUA | 0002
MXw1YXo7JOeRzUA | 0000
...
I am attempting to find records that match a collection of CallIds. Something like this:
SELECT * FROM T_TICKET WHERE CALLID IN(N'MXZInrBl1DCnzUA', N'MXZ0TWkUhHprzUA', N'MXZ_SQzfGMCPzUA', ... ,N'MXyQq6wQ7gVhzUA')
I look up anywhere from 200 to 300 CallIds at a time using this query. The query takes around 35 seconds to run. Is there anything I can do to the table structure, the column type, the index, or the query itself to improve the performance of this query?
There are currently around 300,000 rows in T_TICKET. CallId is not unique, and RelatedData is not unique. I also have a non-clustered index on CallId.
I know the basics of SQL, but I'm not a pro. Some things I've thought of doing are:
- Changing the type of CallId from varchar to char.
- Shortening the length of CallId (its length is 30, but in reality, right now, I am using only 15 bytes).
I have not tried either of these yet because they require changes to live production data, and I am not sure they would make a significant improvement.
Would either of these options make a significant improvement? Or are there other things I could do to make this query run faster?
First, be sure that the types are the same: either VARCHAR() or NVARCHAR(). Then, add an index:
create index idx_t_ticket_callid on t_ticket(callid);
If the types are compatible, SQL Server should make use of the index.
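To check which type the column actually has before choosing the literal style, a query along these lines can be used. This is only a sketch; it assumes the table is visible through INFORMATION_SCHEMA in the current database.

```sql
-- Report the declared type and length of each column in T_TICKET.
SELECT COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'T_TICKET';
```

If DATA_TYPE comes back as varchar, compare against plain literals ('...'); if it comes back as nvarchar, use the N'...' form.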
Your table is what we call a heap (a table without a clustered index). This kind of table is only good for data loading and/or as a staging table. I would recommend converting your table to have a clustered key. A good clustering key should be unique, static, narrow, non-nullable, and ever-increasing (e.g. an int / bigint identity column).
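As a rough sketch of what that conversion could look like, assuming the table really is a heap with no existing primary key. The column name TicketId and the constraint name PK_T_TICKET are hypothetical; they do not appear in the question.

```sql
-- Add an ever-increasing surrogate key and make it the clustered primary key.
-- TicketId and PK_T_TICKET are illustrative names, not part of the original schema.
ALTER TABLE T_TICKET
    ADD TicketId int IDENTITY(1, 1) NOT NULL
        CONSTRAINT PK_T_TICKET PRIMARY KEY CLUSTERED;
```

This rewrites the table and rebuilds its non-clustered indexes, so on a live production table it is probably best tried on a copy first.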
Another downside of a heap is that when you run lots of UPDATE / DELETE statements against the table, your SELECT statements slow down because of forwarded records. Quoting Paul Randal on forwarded records:
If a forwarding record occurs in a heap, when the record locator points to that location, the Storage Engine gets there and says Oh, the record isn't really here – it's over there! And then it has to do another (potentially physical) I/O to get to the page with the forwarded record on. This can result in a heap being less efficient than an equivalent clustered index.
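To see whether forwarded records are actually present, the physical stats DMV can be queried. A minimal sketch, assuming the table lives in the dbo schema (the question does not say):

```sql
-- index_id 0 refers to the heap itself.
-- forwarded_record_count is only populated in SAMPLED or DETAILED mode.
SELECT index_type_desc, page_count, record_count, forwarded_record_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.T_TICKET'), 0, NULL, 'DETAILED');
```

DETAILED mode reads every page, so it may be worth running this outside peak hours.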
Lastly, make sure you list the columns you need explicitly in your SELECT. Avoid SELECT *. I'm guessing you are getting a table scan when you execute the query. What you can do is INCLUDE every column from your SELECT list in the index, like this:
CREATE INDEX [IX_T_TICKET_CallId_INCLUDE] ON [T_TICKET] ([CallId]) INCLUDE ([RelatedData]) WITH (DROP_EXISTING=ON)
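Note that DROP_EXISTING=ON expects an index with that name to exist already; leave the option out when creating the index for the first time. With the covering index in place, a query that names only the indexed and included columns can be answered from the index alone. A sketch, assuming CallId and RelatedData are the only columns needed (they are the only ones shown in the question):

```sql
-- Only columns present in the index are requested, so no lookup into the base table is needed.
SELECT CallId, RelatedData
FROM T_TICKET
WHERE CallId IN ('MXZ_SQzfGMCPzUA', 'MXyQq6wQ7gVhzUA');
```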
It turns out there is in fact a way to significantly optimize my query without changing any data types.
This query:
SELECT * FROM T_TICKET
WHERE CALLID IN(N'MXZInrBl1DCnzUA', N'MXZ0TWkUhHprzUA', N'MXZ_SQzfGMCPzUA', ... ,N'MXyQq6wQ7gVhzUA')
is using NVARCHAR types as the input params (N'MXZInrBl1DCnzUA', N'MXZ0TWkUhHprzUA'...). As I specified in my question, CallId is VARCHAR. SQL Server was converting CallId in every row of the table to an NVARCHAR type to do the comparison, which was taking a long time (even though I have an index on CallId).
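The conversion goes in that direction because NVARCHAR has higher data type precedence than VARCHAR, so the VARCHAR side of the comparison is the one that gets converted. A quick way to confirm how SQL Server types the two literal styles:

```sql
-- The N prefix makes the literal nvarchar; without it the literal is varchar.
SELECT SQL_VARIANT_PROPERTY(N'MXZ_SQzfGMCPzUA', 'BaseType') AS WithNPrefix,    -- nvarchar
       SQL_VARIANT_PROPERTY('MXZ_SQzfGMCPzUA', 'BaseType')  AS WithoutNPrefix; -- varchar
```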
I was able to optimize it by simply NOT changing the parameter types to NVARCHAR:
SELECT * FROM T_TICKET
WHERE CALLID IN('MXZInrBl1DCnzUA', 'MXZ0TWkUhHprzUA', 'MXZ_SQzfGMCPzUA', ... ,'MXyQq6wQ7gVhzUA')
Now, instead of taking over 30 seconds to run, it only takes around 0.03 seconds. Thanks for all the input.