How to quickly SELECT 3 random records from a 30k MySQL table with a WHERE filter by a single query?

Well, this is a very old question that has never gotten a real solution. We want 3 random rows from a table with about 30k records. The table is not that big from MySQL's point of view, but if it represents the products of a store, it's a representative size. Random selection is useful, for example, when presenting 3 random products on a webpage. We would like a single SQL string solution that meets these conditions:

  1. In PHP, the recordset returned by PDO or MySQLi must have exactly 3 rows.
  2. They have to be obtained by a single MySQL query, without using a stored procedure.
  3. The solution must be quick: on a busy apache2 server, for example, the MySQL query is in many situations the bottleneck. So it has to avoid temporary table creation, etc.
  4. The 3 records must not be contiguous, i.e., they must not be adjacent to one another.

The table has the following fields:

CREATE TABLE Products (
  ID INT(8) NOT NULL AUTO_INCREMENT,
  Name VARCHAR(255) default NULL,
  HasImages INT default 0,
  ...
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

The WHERE constraint is Products.HasImages=1, which fetches only records that have images available to show on the webpage. About one-third of the records meet the condition HasImages=1.

Searching for perfection, we first set aside the existing solutions that have drawbacks:


I. The basic solution using ORDER BY RAND()

is too slow, but it guarantees 3 truly random records on each query:

SELECT ID, Name FROM Products WHERE HasImages=1 ORDER BY RAND() LIMIT 3;

*CPU about 0.10s, scanning 9690 rows because of the WHERE clause; Using where; Using temporary; Using filesort. On a Debian Squeeze dual-core Linux box this is not so bad, but it does not scale to a bigger table because a temporary table and a filesort are used, and the first query takes 8.52s on the test Windows7::MySQL system. With such poor performance, it is best avoided for a webpage, isn't it?


II. The bright solution of riedsio using JOIN ... RAND(),

from "MySQL select 10 random rows from 600K rows fast", adapted here, is only valid for a single random record, because the following query almost always returns contiguous records. In effect, it gets only a random run of 3 records that are consecutive in ID order:

SELECT Products.ID, Products.Name
FROM Products
INNER JOIN (SELECT (RAND() * (SELECT MAX(ID) FROM Products)) AS ID)
  AS t ON Products.ID >= t.ID
WHERE (Products.HasImages=1)
ORDER BY Products.ID ASC
LIMIT 3;

*CPU about 0.01 - 0.19s, scanning 3200, 9690, 12000 rows or so randomly, but mostly 9690 records; Using where.


III. The best solution seems to be the following, using WHERE ... RAND(),

seen on "MySQL select 10 random rows from 600K rows fast" and proposed by bernardo-siu:

SELECT Products.ID, Products.Name FROM Products
WHERE ((Products.Hasimages=1) AND RAND() < 16 * 3/30000) LIMIT 3;

*CPU about 0.01 - 0.03s, scanning 9690 rows; Using where.

Here 3 is the number of rows wanted, 30000 is the RecordCount of the table Products, and 16 is an experimental coefficient that enlarges the selection in order to ensure that 3 records are selected. I don't know on what basis the factor 16 is an acceptable approximation.

So in the majority of cases we get 3 random records, and it's very quick, but it's not guaranteed: sometimes the query returns only 2 rows, sometimes even no record at all.
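One way to reason about the coefficient is a simple binomial model. The sketch below is an estimate under stated assumptions, not a measurement: it treats every RAND() call as an independent uniform draw, takes the 9690 matching rows from the EXPLAIN figures above, and ignores the fact that LIMIT 3 stops the scan early. The function name prob_fewer_than is made up for illustration.

```python
from math import comb

def prob_fewer_than(wanted, matching_rows, p):
    """P(fewer than `wanted` rows pass the RAND() < p filter), binomial model."""
    return sum(
        comb(matching_rows, k) * p**k * (1 - p) ** (matching_rows - k)
        for k in range(wanted)
    )

MATCHING = 9690   # rows with HasImages=1, figure taken from the question
TOTAL = 30000     # RecordCount of Products, figure taken from the question

for coefficient in (1, 4, 16):
    p = coefficient * 3 / TOTAL
    # Prints: coefficient, expected number of candidate rows, P(fewer than 3 rows)
    print(coefficient, MATCHING * p, prob_fewer_than(3, MATCHING, p))
```

In this model the expected number of candidates grows linearly with the coefficient, and the chance of a short result drops quickly as the coefficient grows. Note also that LIMIT 3 keeps the first three candidates in scan order, so a large coefficient tends to favour rows that are scanned early.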

All three methods above scan every record of the table that meets the WHERE clause, here 9690 rows.

A better SQL string?

Ugly, but quick and random. It can become very ugly very fast, especially with the tuning described below, so make sure you really want it this way.

(SELECT Products.ID, Products.Name
FROM Products
    INNER JOIN (SELECT RAND()*(SELECT MAX(ID) FROM Products) AS ID) AS t ON Products.ID >= t.ID
WHERE Products.HasImages=1
ORDER BY Products.ID
LIMIT 1)

UNION ALL

(SELECT Products.ID, Products.Name
FROM Products
    INNER JOIN (SELECT RAND()*(SELECT MAX(ID) FROM Products) AS ID) AS t ON Products.ID >= t.ID
WHERE Products.HasImages=1
ORDER BY Products.ID
LIMIT 1)

UNION ALL

(SELECT Products.ID, Products.Name
FROM Products
    INNER JOIN (SELECT RAND()*(SELECT MAX(ID) FROM Products) AS ID) AS t ON Products.ID >= t.ID
WHERE Products.HasImages=1
ORDER BY Products.ID
LIMIT 1)

First row appears more often than it should

If you have big gaps between IDs in your table, rows right after such gaps have a bigger chance of being fetched by this query. In some cases, they will appear significantly more often than they should. This cannot be solved in general, but there's a fix for a common particular case: when there's a gap between 0 and the first existing ID in the table.

Instead of the subquery (SELECT RAND()*<max_id> AS ID), use something like (SELECT <min_id> + RAND()*(<max_id> - <min_id>) AS ID).
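The gap effect can be made concrete with a small model. The sketch below is an illustration under assumptions: the random value is taken as uniform between a lower bound and the largest eligible ID, the row fetched is the first eligible ID at or above it, and the four IDs are invented for the example. pick_probabilities is a hypothetical helper name.

```python
def pick_probabilities(ids, lower):
    """Chance that each ID is the first one >= r, for r uniform on [lower, max(ids))."""
    ids = sorted(ids)
    span = ids[-1] - lower
    probs = {}
    prev = lower
    for current in ids:
        # An ID "owns" the stretch of random values between its predecessor and itself.
        probs[current] = max(current - prev, 0) / span
        prev = max(prev, current)
    return probs

ids = [101, 102, 103, 200]                  # invented IDs: a gap before 101 and before 200
print(pick_probabilities(ids, lower=0))     # range starts at 0
print(pick_probabilities(ids, lower=101))   # range starts at min_id
print(pick_probabilities(ids, lower=100))   # range starts at min_id - 1
```

In this model each row is picked in proportion to the gap in front of it, which is why rows after large gaps dominate. It also suggests that starting the range at exactly <min_id> leaves the first row almost no chance of being picked, since the random value would have to equal <min_id>; starting at <min_id> - 1 gives it the same weight as a row that follows its predecessor directly.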

Remove duplicates

The query, if used as is, may return duplicate rows. It is possible to avoid that by using UNION instead of UNION ALL. This way duplicates will be merged, but the query is no longer guaranteed to return exactly 3 rows. You can work around that too, by fetching more rows than you need and limiting the outer result like this:

(SELECT ... LIMIT 1)
UNION (SELECT ... LIMIT 1)
UNION (SELECT ... LIMIT 1)
...
UNION (SELECT ... LIMIT 1)
LIMIT 3

There's still no guarantee that 3 rows will be fetched, though. It just makes it more likely.

SELECT Products.ID, Products.Name
FROM Products
INNER JOIN (SELECT (RAND() * (SELECT MAX(ID) FROM Products)) AS ID) AS t ON Products.ID >= t.ID
WHERE (Products.HasImages=1)
ORDER BY Products.ID ASC
LIMIT 3;

Of course the above gives "near" contiguous records: you are feeding it the same ID every time, without much regard to the seed of the rand function.

This should give more "randomness":

SELECT Products.ID, Products.Name
FROM Products
INNER JOIN (SELECT (ROUND((RAND() * (max-min))+min)) AS ID) AS t ON Products.ID >= t.ID
WHERE (Products.HasImages=1)
ORDER BY Products.ID ASC
LIMIT 3;

Here max and min are two values you choose, for example:

max = select max(id)
min = 225

This statement executes really fast (19 ms on a 30k-record table):

$db = new PDO('mysql:host=localhost;dbname=database;charset=utf8', 'username', 'password');
$stmt = $db->query("SELECT p.ID, p.Name, p.HasImages
                    FROM (SELECT @count := COUNT(*) + 1, @limit := 3 FROM Products WHERE HasImages = 1) vars
                    STRAIGHT_JOIN (SELECT t.*, @limit := @limit - 1 FROM Products t WHERE t.HasImages = 1 AND (@count := @count -1) AND RAND() < @limit / @count) p");
$products = $stmt->fetchAll(PDO::FETCH_ASSOC);

The idea is to "inject" a new column with randomized values, and then sort by this column. Generating and sorting by this injected column is way faster than the "ORDER BY RAND()" command.

There "might" be one caveat: you have to include the WHERE condition twice.
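Read row by row, the query above looks like sequential selection sampling: each row is kept with probability (rows still needed) / (rows not yet seen), which is what the @limit / @count comparison computes. The Python sketch below illustrates that reading only. It is an interpretation, not the answer's own explanation, and selection_sample is a hypothetical name.

```python
import random

def selection_sample(rows, k, rng=random):
    """Keep exactly k of the rows in one pass, each subset of size k equally likely."""
    remaining = len(rows)   # plays the role of @count
    needed = k              # plays the role of @limit
    chosen = []
    for row in rows:
        # Keep the row with probability needed / remaining.
        if rng.random() < needed / remaining:
            chosen.append(row)
            needed -= 1
        remaining -= 1
    return chosen

product_ids = list(range(1, 9691))   # stand-in for the rows with HasImages = 1
print(selection_sample(product_ids, 3))
```

Once the number of rows still needed equals the number of rows left, the probability becomes 1, so the pass always ends with exactly k rows. The SQL version additionally relies on MySQL evaluating the user-variable assignments in row order, which is an assumption worth testing on the server version in use.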

I've been testing the following bunch of SQL statements on a 10M-record, poorly designed database.

SELECT COUNT(ID)
INTO @count
FROM Products
WHERE HasImages = 1;

PREPARE random_records FROM
'(
    SELECT * FROM Products WHERE HasImages = 1 LIMIT ?, 1
) UNION (
    SELECT * FROM Products WHERE HasImages = 1 LIMIT ?, 1
) UNION (
    SELECT * FROM Products WHERE HasImages = 1 LIMIT ?, 1
)';

-- FLOOR keeps each offset within 0 .. @count - 1; ROUND could return @count itself
SET @l1 = FLOOR(RAND() * @count);
SET @l2 = FLOOR(RAND() * @count);
SET @l3 = FLOOR(RAND() * @count);

EXECUTE random_records USING @l1
    , @l2
    , @l3;
DEALLOCATE PREPARE random_records;

It took almost 7 minutes to get the three results. But I'm sure its performance will be much better in your case. Yet if you are looking for better performance, I suggest the following, as it took less than 30 seconds for me to get the job done (on the same database).

SELECT COUNT(ID)
INTO @count
FROM Products
WHERE HasImages = 1;

PREPARE random_records FROM
'SELECT * FROM Products WHERE HasImages = 1 LIMIT ?, 1';

-- FLOOR keeps each offset within 0 .. @count - 1; ROUND could return @count itself
SET @l1 = FLOOR(RAND() * @count);
SET @l2 = FLOOR(RAND() * @count);
SET @l3 = FLOOR(RAND() * @count);

EXECUTE random_records USING @l1;
EXECUTE random_records USING @l2;
EXECUTE random_records USING @l3;

DEALLOCATE PREPARE random_records;

Bear in mind that both of these command sets require the MySQLi driver in PHP if you want to execute them in one go. Their only difference is that the latter requires calling MySQLi's next_result method to retrieve all three results.

My personal belief is that this is the fastest way to do this.
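The offset handed to LIMIT ?, 1 has to fall in 0 .. @count - 1 for the query to return a row. The sketch below compares rounding and flooring RAND() * @count over an evenly spaced grid of values in [0, 1). It is a Python stand-in for the SQL arithmetic, with math.floor(x + 0.5) used as an approximation of MySQL's ROUND for non-negative numbers, and offset_histogram is a hypothetical name.

```python
import math

def offset_histogram(count, to_int, steps=1000):
    """Count how often each offset results from to_int(u * count), u spread over [0, 1)."""
    hist = {}
    for s in range(steps):
        u = (s + 0.5) / steps          # stand-in for one RAND() value
        offset = to_int(u * count)
        hist[offset] = hist.get(offset, 0) + 1
    return hist

count = 4  # pretend four rows match the WHERE clause
print("round:", offset_histogram(count, lambda x: math.floor(x + 0.5)))
print("floor:", offset_histogram(count, math.floor))
```

With rounding, the first offset is hit about half as often as the middle ones and the value count itself can come out, which points one past the last row. Flooring spreads the offsets evenly over the valid range.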

What about creating another table containing only the items that have images? This table will be much lighter, as it will contain only one-third of the items in the original table.

------------------------------------------
|ID     | Item ID (on the original table)|
------------------------------------------
|0      | 0                              |
------------------------------------------
|1      | 123                            |
------------------------------------------
            .
            .
            .
------------------------------------------
|10 000 | 30 000                         |
------------------------------------------

You can then generate three random IDs in the PHP part of the code and just fetch them from the database.
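A minimal sketch of that last step, with Python used only as a stand-in for the PHP side. It assumes the lookup table has gap-free IDs from 0 to table_size - 1; the names ImageItems, ItemID and random_lookup_ids are hypothetical and do not come from the answer.

```python
import random

def random_lookup_ids(table_size, wanted=3, rng=random):
    """Pick `wanted` distinct IDs from 0 .. table_size - 1 without repeats."""
    return rng.sample(range(table_size), wanted)

ids = random_lookup_ids(10000)
# One placeholder per ID, to be bound by the database driver.
placeholders = ", ".join(["%s"] * len(ids))
# ImageItems and ItemID are hypothetical names for the lookup table and its column.
query = f"SELECT ItemID FROM ImageItems WHERE ID IN ({placeholders})"
print(query, ids)
```

Because the IDs are drawn without replacement from a gap-free range, the query returns exactly three rows as long as the lookup table is kept in sync with Products. Nothing here prevents two of the picks from being neighbours, so the non-contiguity requirement would need an extra check.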

On the off-chance that you're willing to accept an "outside the box" type of answer, I'm going to repeat what I said in some of the comments.

The best way to approach your problem is to cache your data in advance (be that in an external JSON or XML file, or in a separate database table, possibly even an in-memory table).

This way you can schedule the performance hit on the products table for times when you know the server will be quiet, and worry less about creating a performance hit at "random" times when a visitor arrives at your site.

I'm not going to suggest an explicit solution, because there are far too many possible ways to build one. However, the answer suggested by @ahmed is not silly. If you don't want to create a join in your query, then simply load more of the data that you require into the new table instead.
