
SQL Optimization on SELECT random id (with WHERE clause)

I'm currently working on a multi-threaded program (in Java) that needs to select random rows in a database in order to update them. This works well, but I started to encounter some performance issues with my SELECT query.

I tried multiple solutions before finding this website:

http://jan.kneschke.de/projects/mysql/order-by-rand/

I tried the following solution:

SELECT * FROM Table 
JOIN (SELECT FLOOR( COUNT(*) * RAND() ) AS Random FROM Table) 
AS R ON Table.ID > R.Random 
WHERE Table.FOREIGNKEY_ID IS NULL 
LIMIT 1;

It selects only one row with an id just above the randomly generated number. This works pretty well (an average of less than 100 ms per request on 150k rows). But once my program has processed a row, its FOREIGNKEY_ID is no longer NULL (it gets updated with some value).

The problem is that my SELECT will "forget" some rows that have an id below the randomly generated one, and I won't be able to process them.

So I tried to adapt my query like this:

SELECT * FROM Table 
JOIN (SELECT FLOOR( 
(SELECT COUNT(id) FROM Table WHERE FOREIGNKEY_ID IS NULL) * RAND() ) 
AS Random FROM Table) 
AS R ON Table.ID > R.Random
WHERE Table.FOREIGNKEY_ID IS NULL 
LIMIT 1;

With that query there is no more skipping of rows, but performance decreases drastically (an average of 1 s per request on 150k rows).

I could simply run the fast one while I still have a lot of rows to process and switch to the slow one when only a few rows remain, but that would be a "dirty" fix in the code, and I would prefer an elegant SQL query that can do the job.

Thank you for your help; please let me know if I'm not clear or if you need more details.

Your IDs are probably going to contain gaps. Anything that works with COUNT(*) is not going to be able to reach all the IDs.

A table with records with IDs 1, 2, 3, 10, 11, 12, 13 has only 7 records. Doing a random pick based on COUNT(*) will often result in a miss, as records 4, 5 and 6 do not exist, and the query will then pick the nearest ID, which is 3. This is not only unbalanced (it will pick 3 far too often), but it will also never pick records 10-13.

To get a fair, uniformly distributed random selection of records, I would suggest loading the IDs of the table first. Even for 150k rows, loading a set of integer ids will not consume a lot of memory (< 1 MB):

SELECT id FROM table;

You can then use a function like Collections.shuffle to randomize the order of the IDs. To get the rest of the data, you can select records one at a time, or for example 10 at a time:

SELECT * FROM table WHERE id = :id

Or:

SELECT * FROM table WHERE id IN (:id1, :id2, :id3)

This should be fast if the id column has an index, and it will give you a proper random distribution.
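For illustration, here is a minimal JDBC sketch of this approach, fetching one row at a time and restricted to the unprocessed rows per the question (the connection URL is an assumption; table and column names come from the question):

import java.sql.*;
import java.util.*;

public class ShuffledIdPicker {
    public static void main(String[] args) throws SQLException {
        String jdbcUrl = "jdbc:mysql://localhost/mydb?user=me&password=secret"; // assumption
        try (Connection conn = DriverManager.getConnection(jdbcUrl)) {
            // Load all candidate ids once; cheap even for 150k rows.
            List<Long> ids = new ArrayList<>();
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT ID FROM `Table` WHERE FOREIGNKEY_ID IS NULL")) {
                while (rs.next()) {
                    ids.add(rs.getLong(1));
                }
            }
            Collections.shuffle(ids); // randomize the processing order once

            // Fetch the remaining data one row at a time, in shuffled order.
            try (PreparedStatement ps = conn.prepareStatement(
                         "SELECT * FROM `Table` WHERE ID = ?")) {
                for (long id : ids) {
                    ps.setLong(1, id);
                    try (ResultSet row = ps.executeQuery()) {
                        if (row.next()) {
                            // process and update the row here
                        }
                    }
                }
            }
        }
    }
}

Batching with IN (...) works the same way: build a placeholder list for each chunk of shuffled ids instead of a single ?.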

For your method to work more generally, you want MAX(id) rather than COUNT(*):

SELECT t.*
FROM Table t JOIN
     (SELECT FLOOR(MAX(id) * RAND() ) AS Random FROM Table) r
     ON t.ID > r.Random 
WHERE t.FOREIGNKEY_ID IS NULL 
ORDER BY t.ID
LIMIT 1;

The ORDER BY is added to be sure that the "next" id is returned; without it, MySQL could in theory always return the row with the maximum id in the table.

The problem is gaps in the ids, and it is easy to create distributions where some rows are essentially never picked. Say the four ids are 1, 2, 3, 1000000: your method will almost never pick 1000000, while the query above will almost always pick it.

Perhaps the simplest solution to your problem is to run the first query multiple times until it gets a valid row. The next suggestion would be an index on (FOREIGNKEY_ID, ID), which the subquery can use. That might speed up the query.
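For illustration, the retry loop could look like this in Java, using the MAX(id) query above (the connection URL, retry cap, and index name are assumptions):

import java.sql.*;

public class RetryRandomPick {
    // Suggested supporting index (name assumed):
    //   CREATE INDEX idx_fk_id ON `Table` (FOREIGNKEY_ID, ID);
    public static void main(String[] args) throws SQLException {
        String jdbcUrl = "jdbc:mysql://localhost/mydb?user=me&password=secret"; // assumption
        String sql = "SELECT t.* FROM `Table` t "
                + "JOIN (SELECT FLOOR(MAX(ID) * RAND()) AS Random FROM `Table`) r "
                + "ON t.ID > r.Random "
                + "WHERE t.FOREIGNKEY_ID IS NULL "
                + "ORDER BY t.ID LIMIT 1";
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             Statement st = conn.createStatement()) {
            for (int attempt = 0; attempt < 20; attempt++) { // arbitrary retry cap
                try (ResultSet rs = st.executeQuery(sql)) {
                    if (rs.next()) {
                        // got a valid row; process and update it, then stop retrying
                        break;
                    }
                    // empty result: the random value landed above every remaining id; retry
                }
            }
        }
    }
}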

I tend to favor something more along these lines:

SELECT t.id
FROM Table t 
WHERE t.FOREIGNKEY_ID IS NULL AND
      RAND() < 1.0 / 1000
ORDER BY RAND()
LIMIT 1;

The purpose of the WHERE clause is to reduce the volume considerably, so the ORDER BY doesn't take much time.

Unfortunately, this will require scanning the table, so you probably won't get responses in the 100 ms range on a 150k-row table. You can reduce that to an index scan with an index on t(FOREIGNKEY_ID, ID).

EDIT:

If you want a reasonable chance of a uniform distribution and performance that does not increase as the table gets larger, here is another idea, which -- alas -- requires a trigger.

Add a new column to the table called random, which is initialized with rand(). Build an index on random. Then run a query such as:

set @random := rand();

select t.*
from ((select t.*
       from t
       where random >= @random
       order by random
       limit 10 
      ) union all
      (select t.*
       from t
       where random < @random
       order by random desc
       limit 10 
      )
     ) t
order by rand()
limit 1;

The idea is that the subqueries can use the index to choose a set of 20 rows that are pretty arbitrary -- 10 before and 10 after the chosen point. The rows are then sorted (some overhead, which you can control with the LIMIT number), randomized, and one is returned.

If you just picked single random values, arbitrary gaps would make the chosen rows not quite uniform. However, by taking a larger sample around the value, the probability of any one row being chosen should approach a uniform distribution. The uniformity will still have edge effects, but these should be minor on a large amount of data.
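A rough sketch of the one-time setup this approach needs, executed over JDBC (the column, index, and trigger names are assumptions):

import java.sql.*;

public class RandomColumnSetup {
    public static void main(String[] args) throws SQLException {
        String jdbcUrl = "jdbc:mysql://localhost/mydb?user=me&password=secret"; // assumption
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             Statement st = conn.createStatement()) {
            // One-time setup: add and populate the random column, then index it.
            st.executeUpdate("ALTER TABLE `Table` ADD COLUMN random DOUBLE");
            st.executeUpdate("UPDATE `Table` SET random = RAND()");
            st.executeUpdate("CREATE INDEX idx_table_random ON `Table` (random)");
            // The trigger keeps the column populated for newly inserted rows.
            st.executeUpdate(
                "CREATE TRIGGER trg_table_random BEFORE INSERT ON `Table` "
                + "FOR EACH ROW SET NEW.random = RAND()");
        }
    }
}

After this setup, each lookup sets @random and runs the query above, which only touches the index around the chosen point.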

If prepared statements can be used, then this should work:

SELECT @skip := FLOOR(RAND() * COUNT(*)) FROM Table WHERE FOREIGNKEY_ID IS NULL;
PREPARE STMT FROM 'SELECT * FROM Table WHERE FOREIGNKEY_ID IS NULL LIMIT ?, 1';
EXECUTE STMT USING @skip;
DEALLOCATE PREPARE STMT;

The LIMIT clause in a SELECT statement can be used to skip a random number of rows before returning one.
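Since the question's program is in Java, the same trick maps directly onto a JDBC PreparedStatement. A minimal sketch (the connection URL is an assumption; table and column names come from the question):

import java.sql.*;

public class LimitOffsetPick {
    public static void main(String[] args) throws SQLException {
        String jdbcUrl = "jdbc:mysql://localhost/mydb?user=me&password=secret"; // assumption
        try (Connection conn = DriverManager.getConnection(jdbcUrl)) {
            // Count the remaining candidate rows.
            long count;
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT COUNT(*) FROM `Table` WHERE FOREIGNKEY_ID IS NULL")) {
                rs.next();
                count = rs.getLong(1);
            }
            // Same as FLOOR(RAND() * COUNT(*)) on the server side.
            long skip = (long) Math.floor(Math.random() * count);
            try (PreparedStatement ps = conn.prepareStatement(
                         "SELECT * FROM `Table` WHERE FOREIGNKEY_ID IS NULL LIMIT ?, 1")) {
                ps.setLong(1, skip);
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        // process the randomly chosen row
                    }
                }
            }
        }
    }
}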
