
Fast search in a 10-million-record table with a unique index column in SQL Server 2008 R2 on Windows 7

I need to do a fast search on a column of floating-point numbers in a SQL Server 2008 R2 table on Windows 7.

The table has 10 million records.

For example:

  Id    value
  532   937598.32421
  873   501223.3452
  741   9797327.231

Id is the primary key. I need to search the "value" column for a given value and find the 5 rows in the table that are closest to it.

Closeness is defined as the absolute value of the difference between the given value and the column value.

The smaller the difference, the closer.

I would like to use binary search.

I want to put a unique index on the value column.

But I am not sure whether the table will be sorted every time I search the column for a given value.

Or is it sorted only once, because I have put a unique index on the value column?

Are there better ways to do this search?

Does a sort have to be done every time I search? I need to search the table many times. I know sorting takes O(n lg n) time. Does the index really do the sort for me? Or is the index backed by a sorted tree that holds the column values?

Once an index is set up, are the values already sorted, so that I do not need to sort them on every search?

Any help would be appreciated.

Thanks

Sorry for my initial response. No, I would not even create an index: the query would not be able to use it, because you are not searching for a given value but on the difference between that given value and the value column of the table. You could create a function-based index, but you would have to specify the number you are searching on, and that is not constant.

Given that, I would look at getting enough RAM to hold the whole table. That is, if the table is 10 GB, try to get 10 GB of RAM allocated for caching. And if possible, run it on a machine with an SSD, or get an SSD.

The SQL itself is not complicated; it is really just a question of performance.

select top 5 id, abs(99 - val) as diff
from tbl
order by 2

If you don't mind some trial and error, you could create an index on the value column and then search as follows:

select top 5 id, abs(99 - val) as diff
from tbl
where val between 99-30 and 99+30
order by 2

The above query WOULD use the index on the value column, because it searches a range of values in that column, not the differences between the values in that column and X (two very different things).

However, there is no guarantee that it returns 5 rows; it returns 5 rows only if at least 5 rows actually exist within 30 of 99 (69 to 129). If it returns 2, 3, etc. but not 5, you would have to run the query again with a wider range, and keep doing so until you have your top 5. Even so, these queries should run quite a bit faster than firing at the table blind with no index. So you could give it a shot. The index may take a while to create, though, so you might want to do that part overnight.
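
The expand-and-retry idea above can be sketched as a loop. This is a minimal, untested sketch that assumes the tbl(id, val) table from the queries above with an index on val. The variable names, the starting half-width of 30 and the doubling step are illustrative choices, not part of the original answer.

```sql
-- Hypothetical variable names; 99 and 30 are the example numbers used above.
DECLARE @target float, @range float, @found int;
SET @target = 99;
SET @range  = 30;   -- starting half-width of the search window (assumed)
SET @found  = 0;

-- Widen the window until it holds at least 5 rows.
-- Assumes the table has at least 5 rows; with fewer rows the loop would
-- keep doubling @range until the float overflows and the batch fails.
WHILE @found < 5
BEGIN
    SELECT @found = COUNT(*)
    FROM tbl
    WHERE val BETWEEN @target - @range AND @target + @range;

    IF @found < 5
        SET @range = @range * 2;
END;

-- Every row outside the window is farther away than every row inside it,
-- so the 5 closest rows overall are among the rows in the window.
SELECT TOP 5 id, val, ABS(@target - val) AS diff
FROM tbl
WHERE val BETWEEN @target - @range AND @target + @range
ORDER BY diff;
```

Doubling keeps the number of retries small even when the first guess is far too narrow, at the cost of a final window that may hold many more than 5 rows.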

You mention SQL Server and binary search. SQL Server does not work that way, but SQL Server (or another database) is a good solution for this problem.

To be concrete, I will assume:

create table mytable
(
  id int not null
, value float not null
, constraint mytable_pk primary key(id)
)

You also need an index on the value column.
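
A minimal sketch of such an index, assuming the mytable definition above; the index name ix_mytable_value is made up for illustration.

```sql
-- Nonclustered index on the searched column (index name is illustrative).
CREATE NONCLUSTERED INDEX ix_mytable_value
    ON mytable (value);
```

Because the primary key on id is clustered by default, each entry of this nonclustered index also carries id, so queries that read only id and value can be answered from the index alone. If duplicates in value are known to be impossible, CREATE UNIQUE NONCLUSTERED INDEX could be used instead.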

Now get up to ten rows, 5 at or above and 5 below the search value, with these two selects:

  -- 5 nearest rows at or above the search value
  SELECT TOP 5 id, value, abs(value - @searchval) as diff
      FROM mytable
      WHERE value >= @searchval
      ORDER BY value asc

  -- and

  -- 5 nearest rows below the search value
  SELECT TOP 5 id, value, abs(value - @searchval) as diff
      FROM mytable
      WHERE value < @searchval
      ORDER BY value desc

To combine the two selects into one result set you need:

SELECT *
  FROM (SELECT TOP 5 id, value, abs(value - @searchval) as diff
          FROM mytable
         WHERE value >= @searchval
      ORDER BY value asc) as bigger
UNION ALL
SELECT *
  FROM (SELECT TOP 5 id, value, abs(value - @searchval) as diff
          FROM mytable
         WHERE value < @searchval
      ORDER BY value desc) as smaller

But since you only want the 5 smallest differences, wrap it in one more layer:

SELECT TOP 5 * FROM
(
SELECT *
  FROM (SELECT TOP 5 id, value, abs(value - @searchval) as diff
          FROM mytable
         WHERE value >= @searchval
      ORDER BY value asc) as bigger
UNION ALL
SELECT *
  FROM (SELECT TOP 5 id, value, abs(value - @searchval) as diff
          FROM mytable
         WHERE value < @searchval
      ORDER BY value desc) as smaller
) as candidates  -- a derived table must have an alias
ORDER BY diff ASC

I have not tested any of this.
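
One way to try the approach is a small scratch script. This is a sketch, assuming a temporary table #mytable that mirrors the definition above; the first three rows come from the question, the remaining rows and the search value 937000 are invented for illustration.

```sql
-- Scratch copy of the table; #mytable and ix_value are illustrative names.
CREATE TABLE #mytable
(
    id    int   NOT NULL PRIMARY KEY,
    value float NOT NULL
);
CREATE INDEX ix_value ON #mytable (value);

INSERT INTO #mytable (id, value) VALUES
    (532, 937598.32421),   -- sample rows from the question
    (873, 501223.3452),
    (741, 9797327.231),
    (1,   500000.0),       -- invented rows
    (2,   600000.0),
    (3,   900000.0),
    (4,   1000000.0);

DECLARE @searchval float;
SET @searchval = 937000;

-- Up to 5 neighbours on each side, then the 5 closest of those candidates.
SELECT TOP 5 *
FROM (
    SELECT * FROM (SELECT TOP 5 id, value, ABS(value - @searchval) AS diff
                   FROM #mytable
                   WHERE value >= @searchval
                   ORDER BY value ASC) AS bigger
    UNION ALL
    SELECT * FROM (SELECT TOP 5 id, value, ABS(value - @searchval) AS diff
                   FROM #mytable
                   WHERE value < @searchval
                   ORDER BY value DESC) AS smaller
) AS candidates
ORDER BY diff ASC;

DROP TABLE #mytable;
```

Each inner SELECT can be satisfied by a short ordered scan of the index starting at @searchval, so at most 10 rows are read no matter how large the table is.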

Creating the table's clustered index on [value] will cause the values of [value] to be stored on disk in sorted order. The table's primary key (perhaps already defined on [Id]) might already be the table's clustered index, and there can be only one clustered index per table. If the primary key on [Id] is already clustered, it will need to be dropped, the clustered index on [value] created, and then the primary key on [Id] recreated (as a nonclustered primary key). A clustered index on [value] should improve the performance of this specific statement, but you must ultimately test the full variety of T-SQL that references this table before making the final choice of its most useful clustered index column(s).
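
The drop-and-recreate sequence described above might look like the following. This is a sketch only: the table name mytable, the constraint name mytable_pk and the index name cix_mytable_value are assumptions, and the real names would have to be substituted.

```sql
-- Assumed names: mytable, mytable_pk, cix_mytable_value.
BEGIN TRANSACTION;

-- 1. Drop the existing (clustered) primary key.
--    This fails if foreign keys reference it; those would have to be dropped first.
ALTER TABLE mytable DROP CONSTRAINT mytable_pk;

-- 2. Cluster the table on the searched column.
CREATE CLUSTERED INDEX cix_mytable_value ON mytable (value);

-- 3. Recreate the primary key as a nonclustered constraint.
ALTER TABLE mytable
    ADD CONSTRAINT mytable_pk PRIMARY KEY NONCLUSTERED (id);

COMMIT TRANSACTION;
```

On a 10-million-row table each step rewrites or scans the whole table, so this is best done in a maintenance window.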

Because the FLOAT data type is imprecise (subject to your system's FPU and its floating-point rounding and truncation errors, while still conforming to IEEE 754), it can be a fatal mistake to assume every [value] will be unique, even when the decimal numbers being inserted into FLOAT appear (in decimal) to be unique. A number whose expansion does not terminate must always be truncated and rounded. In decimal, pi is an example: it can be truncated and rounded to an imprecise value such as 3.142. Similarly, the decimal number 0.1 is a repeating fraction in binary, which means FLOAT will not store decimal 0.1 as a precise binary value. You might want to consider whether the domain of values offered by the NUMERIC data type can accommodate [value] (thus gaining more precise answers than with FLOAT).
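
The difference between the two types can be seen with a small comparison. This is an illustrative sketch; NUMERIC(10,4) is an arbitrary choice of precision and scale. With IEEE 754 doubles the FLOAT comparison is expected to report 'not equal', because 0.1, 0.2 and 0.3 are each rounded to the nearest binary value, while the NUMERIC comparison is exact decimal arithmetic.

```sql
SELECT
    -- Binary floating point: the rounded sum need not match the rounded 0.3.
    CASE WHEN CAST(0.1 AS float) + CAST(0.2 AS float) = CAST(0.3 AS float)
         THEN 'equal' ELSE 'not equal' END AS float_comparison,
    -- Exact decimal arithmetic: 0.1000 + 0.2000 is exactly 0.3000.
    CASE WHEN CAST(0.1 AS numeric(10,4)) + CAST(0.2 AS numeric(10,4))
              = CAST(0.3 AS numeric(10,4))
         THEN 'equal' ELSE 'not equal' END AS numeric_comparison;
```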

While a NUMERIC data type might require more storage space than FLOAT, the performance of a given query is often governed by the number of levels in the (perhaps clustered) index's B-tree (assuming the query can use an index seek, which is a safe assumption for your specific need). A NUMERIC data type with a precision greater than 28 requires 17 bytes of storage per value. The payload of SQL Server's 8 KB page is approximately 8000 bytes, so such a NUMERIC data type will fit approximately 470 values per page. Each index page can therefore point to roughly 470 pages in the level below it, so a B-tree with index_levels levels can address about 470^index_levels values. Setting 470^index_levels >= 10,000,000 and solving gives index_levels >= log(10,000,000) / log(470) = ~2.6, so about 3 levels (this is back-of-the-napkin math that ignores per-row overhead and the other columns stored at the leaf level, but it is close enough). Thus searching for a specific value in a 10,000,000-row table requires reading only about 3 * 8 KB = ~24 KB. If a clustered index is created, its leaf level will contain the other NUMERIC values that are "close" to the one being sought. Since that leaf-level page (and the index pages above it) are now cached in SQL Server's buffer pool and are "hot", the next search (for values that are "close" to the value being sought) is likely to be constrained by memory access speeds (as opposed to disk access speeds). This is why a clustered index can enhance performance for your desired statement.
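
A depth estimate of this kind can be sanity-checked in T-SQL and compared with what SQL Server reports. This is a rough sketch: the 470 entries per page is the approximation from the text, dbo.mytable is an assumed table name, and the first query contrasts a tree whose pages have about 470 children each with a plain binary tree.

```sql
DECLARE @rows float, @per_page float;
SET @rows     = 10000000;
SET @per_page = 470;   -- approximate entries per 8 KB page (8000 / 17)

-- LOG() is the natural logarithm; dividing two logs changes the base.
SELECT
    CEILING(LOG(@rows) / LOG(@per_page)) AS levels_with_470_entries_per_page,
    CEILING(LOG(@rows) / LOG(2))         AS levels_if_binary_tree;

-- Measured depth and page count of each index on the (assumed) table.
SELECT index_id, index_depth, page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('dbo.mytable'), NULL, NULL, 'LIMITED');
```

The second query reads index metadata, so it shows the real depth once the table and its indexes exist, including the effect of row overhead that the estimate ignores.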

If the values of [value] are not unique (perhaps due to floating-point truncation and rounding errors), and if [value] has been defined as the table's clustered index, SQL Server will (under the covers) add a 4-byte "uniqueifier" to the duplicated values. A uniqueifier adds overhead (per the math above, less overhead than might be thought when an index can be used). That overhead is another (albeit less important) reason to test. If the values can instead be stored as NUMERIC, and if NUMERIC would more reliably ensure that persisted decimal values are indeed unique (just as they look in decimal), that 4-byte overhead can be eliminated by also declaring the clustered index as unique (assuming value uniqueness is a business need). Using similar math, I am certain you will discover that the number of index levels for a FLOAT data type is not all that different from NUMERIC. An index B-tree's exponential behavior is "the great leveler" :). Choosing FLOAT because it needs less storage space than NUMERIC may not be as useful as it initially seems (even when much more storage space is needed for the table as a whole).
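
Before declaring the clustered index unique, it may be worth checking whether the stored values really are distinct. This is a minimal sketch, assuming a table named mytable with a value column; the index name in the comment is made up.

```sql
-- List every stored value that occurs more than once.
SELECT value, COUNT(*) AS occurrences
FROM mytable
GROUP BY value
HAVING COUNT(*) > 1;

-- If the query returns no rows, a unique clustered index is possible, e.g.:
-- CREATE UNIQUE CLUSTERED INDEX cix_mytable_value ON mytable (value);
```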

You should also consider/test whether a Columnstore index would enhance performance and suit your business needs (columnstore indexes were introduced in SQL Server 2012, so this would require upgrading from 2008 R2).

This is a common request from my clients.

It is better to transform your float column into two integer columns (one for each part of the floating-point number) and put an appropriate index on them for fast searching. For example, 12345.678 becomes two columns, 12345 and 678.
