

Fast way to select small sample from huge table

The table I have is huge, with over 100 million rows, and it is ordered by column 'A' by default. Many rows can share the same value of A, which increases from 0 up to a large number. I tried TABLESAMPLE, but it does not select a good number of rows for each value of A: it skips some values, or maybe I am not using it correctly. I would like to select the same number of rows for each value of A, and I would like the total number of selected rows to be a fixed number, say 10 million; let's call it B.

While it's not exactly clear to me what you need to achieve, when I have needed a large sample subset that is well distributed across parent and/or common attribute values, I have done it like this:

SELECT  *
FROM    YourTable
WHERE   (YourID % 10) = 3

This also has the advantage that you can get another, completely different sample just by changing the "3" to another digit. You can also change the sub-sample size by adjusting the "10".
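As a minimal sketch of that idea, the two numbers can be pulled out into variables. The names @Modulus and @Remainder are hypothetical and not part of the original answer; YourTable and YourID are the placeholders used above.

```sql
-- Hypothetical parameters for illustration only.
DECLARE @Modulus   int = 10;  -- keeps roughly 1 row in every @Modulus rows
DECLARE @Remainder int = 3;   -- any value from 0 to @Modulus - 1 picks a different sample

SELECT  *
FROM    YourTable
WHERE   (YourID % @Modulus) = @Remainder;
```

Assuming YourID is dense and evenly spread, a modulus of 10 on a table of about 100 million rows would return roughly 10 million rows.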

You can make use of NEWID():

SELECT TOP 100
  *
FROM
  YourTable
ORDER BY NEWID()
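If the goal is the same number of rows for each value of A, as the question asks, one possible approach is to combine NEWID() with ROW_NUMBER(). This is only a sketch: it assumes the grouping column is named A, and @RowsPerA is a hypothetical parameter.

```sql
-- Assumption: A is the grouping column from the question.
-- @RowsPerA is a hypothetical parameter; to target a total of B rows,
-- set it to about B divided by the number of distinct A values.
DECLARE @RowsPerA int = 100;

WITH Ranked AS
(
    SELECT  *,
            -- Number the rows within each A value in random order.
            ROW_NUMBER() OVER (PARTITION BY A ORDER BY NEWID()) AS rn
    FROM    YourTable
)
SELECT  *
FROM    Ranked
WHERE   rn <= @RowsPerA;  -- keep at most @RowsPerA rows per A value
```

Values of A with fewer than @RowsPerA rows return all of their rows, so the total can fall short of B. The query also has to read and sort the whole table, which may be slow on 100+ million rows.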

@RBarryYoung's solution is right and generic, and it works for any uniformly distributed column, such as ID sequences (or any auto-increment column). Sometimes, though, your distribution is not uniform, or you can run into performance issues (SQL Server has to scan all index entries to evaluate the WHERE clause).

If either of those affects your problem, consider the built-in T-SQL operator TOP, which may suit your needs:

SELECT TOP (30) PERCENT *
FROM YourTable;
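Note that without an ORDER BY clause, TOP returns whichever rows the engine happens to produce first, which is not guaranteed to be a random sample. If a random 30 percent is wanted, a sketch that borrows the NEWID() ordering from the earlier answer might look like this:

```sql
-- Sketch only: ordering by NEWID() randomizes which rows TOP keeps,
-- at the cost of sorting the entire table.
SELECT TOP (30) PERCENT *
FROM YourTable
ORDER BY NEWID();
```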
