简体   繁体   中英

Select a random row per group in a large table

I have a very large table (more than 10M or even 100M records) with this schema:

id int primary key, rule int

and want to select a random entry per rule. I tried this query but this takes a long time (treenode is the name of the table):

SELECT tmp.id,tmp.rule FROM treenode
LEFT JOIN (SELECT * FROM treenode ORDER BY RAND()) tmp ON (treenode.rule = tmp.rule)
GROUP BY tmp.rule;

Keeping the data as a hashtable in the memory takes a huge memory. Another option is to fetch each group from database and select a random entry. Again as the number of groups are about 100k, sending these number of queries to the database takes a long time.

update: I may add that this table is only filled once and there will be no change on it. The id and rule have holes in them.

Maybe I am missing something but is not below query equivalent to your query ?

SELECT * FROM  ( SELECT * FROM treenode ORDER BY RAND()) x GROUP BY x.rule;

It will be faster as there is no join to do.

I found out that going through all the entries take less time than this query. So I added a column as rule*max(id)+id and created an index on it (Should I use a view?).

I run the following query:

SELECT id,rule,temp FROM treenode where temp>? ORDER BY temp LIMIT 0,100000;

At the client go through all returned entries and fill a buffer. Whenever the rule changes I select a random item from the buffer and clear it (put index=0). Then I run the query again with ? as the value of the last returned temp value.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM