I have a very large table (more than 10M or even 100M records) with this schema:
id int primary key, rule int
and I want to select a random entry per rule. I tried this query, but it takes a long time (treenode is the name of the table):
SELECT tmp.id,tmp.rule FROM treenode
LEFT JOIN (SELECT * FROM treenode ORDER BY RAND()) tmp ON (treenode.rule = tmp.rule)
GROUP BY tmp.rule;
Keeping the data as a hashtable in memory requires a huge amount of memory. Another option is to fetch each group from the database and select a random entry from it, but since there are about 100k groups, sending that many queries to the database also takes a long time.
Update: I should add that this table is filled only once and never changes afterwards. Both id and rule have holes in them.
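For reference, the "one query per group" option mentioned above can be sketched as follows. This is a minimal illustration using SQLite and a tiny synthetic table (the original setup is presumably MySQL at far larger scale); the point is that it issues one round trip per rule, which is what makes it slow with ~100k groups:

```python
import sqlite3

# Illustrative setup: treenode(id, rule) with 100 rows across 5 rules.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE treenode (id INTEGER PRIMARY KEY, rule INTEGER)")
conn.executemany("INSERT INTO treenode VALUES (?, ?)",
                 [(i, i % 5) for i in range(1, 101)])

rules = [r for (r,) in conn.execute("SELECT DISTINCT rule FROM treenode")]
picks = {}
for rule in rules:
    # One round trip per rule: at the scale in the question this means
    # ~100k separate queries, hence the long total runtime.
    (picked_id,) = conn.execute(
        "SELECT id FROM treenode WHERE rule = ? ORDER BY RANDOM() LIMIT 1",
        (rule,)).fetchone()
    picks[rule] = picked_id
```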
Maybe I am missing something, but isn't the query below equivalent to yours?
SELECT * FROM ( SELECT * FROM treenode ORDER BY RAND()) x GROUP BY x.rule;
It should be faster, as there is no join to do.
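A quick way to see this query in action is to run it against SQLite on a small synthetic table. Note two assumptions in this sketch: SQLite spells the function RANDOM() rather than MySQL's RAND(), and, like MySQL with ONLY_FULL_GROUP_BY disabled, it tolerates the bare SELECT * with GROUP BY, returning one arbitrary row per group (strict SQL modes would reject it):

```python
import sqlite3

# Illustrative setup: treenode(id, rule) with 100 rows across 5 rules.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE treenode (id INTEGER PRIMARY KEY, rule INTEGER)")
conn.executemany("INSERT INTO treenode VALUES (?, ?)",
                 [(i, i % 5) for i in range(1, 101)])

# The derived-table query from the answer, with RAND() -> RANDOM():
rows = conn.execute(
    "SELECT * FROM (SELECT * FROM treenode ORDER BY RANDOM()) x "
    "GROUP BY x.rule").fetchall()
# One (id, rule) row per rule comes back.
```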
I found that iterating over all the entries takes less time than this query. So I added a column computed as rule * max(id) + id and created an index on it (should I use a view instead?).
I run the following query:
SELECT id,rule,temp FROM treenode where temp>? ORDER BY temp LIMIT 0,100000;
On the client side, I go through all returned entries and fill a buffer. Whenever the rule changes, I select a random item from the buffer and clear it (set its index to 0). Then I run the query again, with ? set to the last returned temp value.
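The scheme above can be sketched end to end as follows. Because temp = rule * max(id) + id and id never exceeds max(id), each rule's rows occupy one contiguous temp range, so scanning in temp order yields the groups one after another. This is a minimal sketch in Python with SQLite; names, the page size, and the tiny data set are illustrative only:

```python
import random
import sqlite3

# Illustrative setup: treenode(id, rule, temp) with temp = rule*max_id + id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE treenode "
             "(id INTEGER PRIMARY KEY, rule INTEGER, temp INTEGER)")
max_id = 100
conn.executemany("INSERT INTO treenode VALUES (?, ?, ?)",
                 [(i, i % 5, (i % 5) * max_id + i)
                  for i in range(1, max_id + 1)])
conn.execute("CREATE INDEX idx_temp ON treenode (temp)")

picks = {}                      # rule -> randomly chosen id
buffer, cur_rule = [], None
last_temp, page = -1, 10        # page size is 100000 in the post

while True:
    # Keyed pagination: resume from the last seen temp value.
    chunk = conn.execute(
        "SELECT id, rule, temp FROM treenode WHERE temp > ? "
        "ORDER BY temp LIMIT ?", (last_temp, page)).fetchall()
    if not chunk:
        break
    for id_, rule, temp in chunk:
        if rule != cur_rule:    # group boundary: pick from the buffer
            if buffer:
                picks[cur_rule] = random.choice(buffer)
            buffer, cur_rule = [], rule
        buffer.append(id_)
        last_temp = temp
if buffer:                      # flush the final group
    picks[cur_rule] = random.choice(buffer)
```

Resuming with WHERE temp > ? instead of OFFSET keeps each page an index range scan, so the cost stays flat as you move through the table.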