SQL Optimization on SELECT random id (with WHERE clause)

I'm currently working on a multi-threaded program (in Java) that needs to select random rows in a database in order to update them. This works well, but I have started to run into performance issues with my SELECT request.

I tried multiple solutions before finding this website:

http://jan.kneschke.de/projects/mysql/order-by-rand/

I tried the following solution:

SELECT * FROM Table 
JOIN (SELECT FLOOR( COUNT(*) * RAND() ) AS Random FROM Table) 
AS R ON Table.ID > R.Random 
WHERE Table.FOREIGNKEY_ID IS NULL 
LIMIT 1;

It selects a single row whose id is greater than the randomly generated number. This works pretty well (an average of less than 100ms per request on 150k rows). But once my program has processed a row, its FOREIGNKEY_ID is no longer NULL (it is updated with some value).

The problem is that my SELECT will "forget" some rows that have an id below the randomly generated id, and I won't be able to process them.

So I tried to adapt my request like this:

SELECT * FROM Table 
JOIN (SELECT FLOOR( 
(SELECT COUNT(id) FROM Table WHERE FOREIGNKEY_ID IS NULL) * RAND() ) 
AS Random FROM Table) 
AS R ON Table.ID > R.Random
WHERE Table.FOREIGNKEY_ID IS NULL 
LIMIT 1;

With that request, rows are no longer skipped, but performance drops drastically (an average of 1s per request on 150k rows).

I could simply execute the fast query while I still have a lot of rows to process and switch to the slow one when only a few rows remain, but that would be a "dirty" fix in the code, and I would prefer an elegant SQL request that can do the work.

Thank you for your help. Please let me know if I'm not clear or if you need more details.

Your IDs are probably going to contain gaps. Anything that works with COUNT(*) is not going to be able to find all the IDs.

A table with records with IDs 1,2,3,10,11,12,13 has only 7 records. Doing a random with COUNT(*) will often result in a miss as records 4, 5 and 6 do not exist, and it will then pick the nearest ID, which is 3. This is not only unbalanced (it will pick 3 far too often) but it will also never pick records 10-13.

To get a fair, uniformly distributed random selection of records, I would suggest loading the IDs of the table first. Even for 150k rows, loading a set of integer IDs will not consume a lot of memory (<1 MB):

SELECT id FROM table;

You can then use a function like Collections.shuffle to randomize the order of the IDs. To get the rest of the data, you can select records one at a time or, for example, 10 at a time:

SELECT * FROM table WHERE id = :id

Or:

SELECT * FROM table WHERE id IN (:id1, :id2, :id3)

This should be fast if the id column has an index, and it will give you a proper random distribution.
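The load, shuffle and batch steps above can be sketched in Java as follows. This is only a minimal illustration: the class and method names (ShuffledIdBatches, shuffledBatches, inClauseSql) are hypothetical, the hard-coded id list stands in for the result of SELECT id FROM table, and the actual JDBC calls are left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class ShuffledIdBatches {

    // Shuffle a copy of the ids and cut it into batches of at most batchSize.
    public static List<List<Integer>> shuffledBatches(List<Integer> ids, int batchSize, Random rng) {
        List<Integer> shuffled = new ArrayList<>(ids);
        Collections.shuffle(shuffled, rng);
        List<List<Integer>> batches = new ArrayList<>();
        for (int start = 0; start < shuffled.size(); start += batchSize) {
            int end = Math.min(start + batchSize, shuffled.size());
            batches.add(new ArrayList<>(shuffled.subList(start, end)));
        }
        return batches;
    }

    // Build "SELECT * FROM table WHERE id IN (?, ?, ...)" with one placeholder per id.
    public static String inClauseSql(int idCount) {
        return "SELECT * FROM table WHERE id IN ("
                + String.join(", ", Collections.nCopies(idCount, "?")) + ")";
    }

    public static void main(String[] args) {
        // Stand-in for the ids returned by "SELECT id FROM table" (note the gap).
        List<Integer> ids = List.of(1, 2, 3, 10, 11, 12, 13);
        for (List<Integer> batch : shuffledBatches(ids, 3, new Random())) {
            System.out.println(inClauseSql(batch.size()) + "  <- bind " + batch);
        }
    }
}
```

Because every id appears exactly once in the shuffled list, each row is visited once and gaps in the id sequence have no effect on the selection.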

For your method to work more generally, you want max(id) rather than count(*):

SELECT t.*
FROM Table t JOIN
     (SELECT FLOOR(MAX(id) * RAND() ) AS Random FROM Table) r
     ON t.ID > r.Random
WHERE t.FOREIGNKEY_ID IS NULL 
ORDER BY t.ID
LIMIT 1;

The ORDER BY is usually added to be sure that the "next" id is returned. In theory, MySQL could always return the maximum id in the table.

The problem is gaps in the ids. It is easy to create distributions where you never get a particular id. Say that the four ids are 1, 2, 3, 1000. Your method will never get 1000. The above will almost always get it.
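To make the "almost always" concrete, here is a small Java sketch that enumerates every value FLOOR(MAX(id) * RAND()) can take for the ids 1, 2, 3, 1000 and counts which id the MAX(id) query would return for each. The class and method names are hypothetical, and the loop only imitates the join condition and ORDER BY in memory; it does not touch a database.

```java
import java.util.Map;
import java.util.TreeMap;

class GapDistribution {

    // For every possible threshold 0..bound-1, find the smallest id strictly
    // greater than it and count how often each id is the one returned.
    public static Map<Integer, Integer> hits(int[] sortedIds, int bound) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int threshold = 0; threshold < bound; threshold++) {
            for (int id : sortedIds) {
                if (id > threshold) { // mirrors: t.ID > r.Random ORDER BY t.ID LIMIT 1
                    counts.merge(id, 1, Integer::sum);
                    break;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] ids = {1, 2, 3, 1000};
        // bound = MAX(id) = 1000, so thresholds range over 0..999
        System.out.println(hits(ids, 1000)); // prints {1=1, 2=1, 3=1, 1000=997}
    }
}
```

Out of 1000 equally likely thresholds, 997 lead to id 1000, which is why gaps make this kind of selection far from uniform.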

Perhaps the simplest solution to your problem is to run the first query multiple times until it gets a valid row. The next suggestion would be an index on (FOREIGNKEY_ID, ID), which the subquery can use. That might speed up the query.

I tend to favor something more along these lines:

SELECT t.id
FROM Table t 
WHERE t.FOREIGNKEY_ID IS NULL AND
      RAND() < 1.0 / 1000
ORDER BY RAND()
LIMIT 1;

The purpose of the WHERE clause is to reduce the volume considerably, so the ORDER BY doesn't take much time.

Unfortunately, this will require scanning the table, so you probably won't get responses in the 100 ms range on a 150k table. You can reduce that to an index scan with an index on t(FOREIGNKEY_ID, ID).
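As a rough check on the 1/1000 filter, the Java sketch below computes how many rows are expected to survive RAND() < p and how likely it is that none survive, assuming each row is filtered independently. The class and method names are hypothetical and the row counts are only example values.

```java
class PrefilterOdds {

    // Expected number of rows that pass "RAND() < p" out of n candidate rows.
    public static double expectedSurvivors(long n, double p) {
        return n * p;
    }

    // Probability that no row passes the filter, i.e. the query returns nothing.
    public static double probabilityEmpty(long n, double p) {
        return Math.pow(1.0 - p, n);
    }

    public static void main(String[] args) {
        double p = 1.0 / 1000;
        for (long n : new long[] {150_000, 10_000, 1_000, 100}) {
            System.out.printf("n=%d  expected=%.1f  P(empty)=%.4f%n",
                    n, expectedSurvivors(n, p), probabilityEmpty(n, p));
        }
    }
}
```

With 150k candidate rows about 150 reach the ORDER BY, but with only 1,000 rows left the query comes back empty roughly 37% of the time, and with 100 rows left roughly 90% of the time. Under these assumptions, the caller would need to retry or raise the threshold as the number of unprocessed rows shrinks.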

EDIT:

If you want a reasonable chance of a uniform distribution and a query time that does not grow as the table gets larger, here is another idea, which, alas, requires a trigger.

Add a new column to the table called random, which is initialized with rand(). Build an index on random. Then run a query such as:

select t.*
from ((select t.*
       from t
       where random >= @random
       order by random
       limit 10 
      ) union all
      (select t.*
       from t
       where random < @random
       order by random desc
       limit 10 
      )
     ) t
order by rand()
limit 1;

The idea is that the subqueries can use the index to choose a set of 20 rows that are pretty arbitrary: 10 at or after the chosen point and 10 before it. The rows are then sorted (some overhead, which you can control with the limit number). These are randomized and returned.

The idea is that if you choose random numbers, there will be arbitrary gaps and these would make the chosen numbers not quite uniform. However, by taking a larger sample around the value, the probability of any one value being chosen should approach a uniform distribution. The uniformity would still have edge effects, but these should be minor on a large amount of data.
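The sampling logic can be imitated in memory to see how the two range scans combine. In this Java sketch a TreeMap keyed by the random value plays the role of the index on the random column; the class name, the pick method and the 1000-row table are all assumptions made for illustration, and duplicate random values are ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.TreeMap;

class RandomColumnSample {

    // Take up to `limit` rows at or above the pivot and up to `limit` rows below it,
    // like the two halves of the UNION ALL query, then pick one candidate at random.
    public static Integer pick(TreeMap<Double, Integer> byRandom, double pivot, int limit, Random rng) {
        List<Integer> candidates = new ArrayList<>();
        byRandom.tailMap(pivot, true).values().stream()
                .limit(limit).forEach(candidates::add);
        byRandom.headMap(pivot, false).descendingMap().values().stream()
                .limit(limit).forEach(candidates::add);
        if (candidates.isEmpty()) {
            return null; // empty table
        }
        return candidates.get(rng.nextInt(candidates.size()));
    }

    public static void main(String[] args) {
        Random rng = new Random();
        TreeMap<Double, Integer> byRandom = new TreeMap<>(); // stand-in for the index
        for (int id = 1; id <= 1000; id++) {
            byRandom.put(rng.nextDouble(), id); // random column, set once per row
        }
        System.out.println("picked id " + pick(byRandom, rng.nextDouble(), 10, rng));
    }
}
```

Both range lookups touch at most `limit` entries regardless of the table size, which is the property the index is meant to provide in the SQL version.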

If prepared statements can be used, then this should work:

SELECT @skip := Floor(Rand() * Count(*)) FROM Table WHERE FOREIGNKEY_ID IS NULL;
PREPARE STMT FROM 'SELECT * FROM Table WHERE FOREIGNKEY_ID IS NULL LIMIT ?, 1';
EXECUTE STMT USING @skip;

LIMIT in a SELECT statement can be used to skip rows.
