Select a random row per group in a large table

Question

I have a very large table (more than 10M or even 100M records) with this schema:

id int primary key, rule int

and want to select a random entry per rule. I tried this query but this takes a long time (treenode is the name of the table):

SELECT tmp.id,tmp.rule FROM treenode
LEFT JOIN (SELECT * FROM treenode ORDER BY RAND()) tmp ON (treenode.rule = tmp.rule)
GROUP BY tmp.rule;

Keeping the data as a hashtable in the memory takes a huge memory. Another option is to fetch each group from database and select a random entry. Again as the number of groups are about 100k, sending these number of queries to the database takes a long time.

update: I may add that this table is only filled once and there will be no change on it. The id and rule have holes in them.

Answer 1

Maybe I am missing something but is not below query equivalent to your query ?

SELECT * FROM  ( SELECT * FROM treenode ORDER BY RAND()) x GROUP BY x.rule;

It will be faster as there is no join to do.

Answer 2

I found out that going through all the entries take less time than this query. So I added a column as rule*max(id)+id and created an index on it (Should I use a view?).

I run the following query:

SELECT id,rule,temp FROM treenode where temp>? ORDER BY temp LIMIT 0,100000;

At the client go through all returned entries and fill a buffer. Whenever the rule changes I select a random item from the buffer and clear it (put index=0). Then I run the query again with ? as the value of the last returned temp value.

Select a random row per group in a large table

Question

2 answers

solution1
2 2012-05-30 01:55:40

solution2
0 ACCPTED 2012-05-30 16:33:19

Select a random row per group in a large table

Question

2 answers

solution1 2 2012-05-30 01:55:40

solution2 0 ACCPTED 2012-05-30 16:33:19

solution1
2 2012-05-30 01:55:40

solution2
0 ACCPTED 2012-05-30 16:33:19