I need to get a stratified sample of my huge table. Specifically, I want to select 1/n of the rows from my table without bias, e.g. by selecting randomly, selecting every nth row, etc.
Before asking this question, I tried doing this. However, it didn't work for me because I am using the InfiniDB engine and, as I found out later, it doesn't support variables in sub-expressions, or something like that. Does anyone know a way to do this without user variables?
I was thinking about something like this: in my table, every row has a unique alphanumeric string id, which can look like "1234567890", or like "abcdef12345". I was thinking of somehow converting that string to a number, and then using the modulo function to select only 1/n of the rows from my table. However, I have no idea how to do the conversion, as this string is not hexadecimal.
Note: my table does not have an autoincremented column.
This is complicated, but you can do it. It requires a self-join and aggregation, implemented in this query using a correlated subquery. My guess is that this will not perform well, because you presumably have a large table. For a 10% sample, it would look like:
select ht.*,
       (select count(*)
        from hugetable ht2
        where ht2.col < ht.col or
              (ht2.col = ht.col and ht2.id <= ht.id)
       ) as rn
from hugetable ht
having rn % 10 = 1;
Note that the use of having in this context is specific to MySQL. It allows you to filter the rows without using a subquery.
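For comparison, here is a sketch of the same filter without the MySQL-specific having, computing rn in a derived table and filtering in the outer where. It uses the same placeholder names as the query above (hugetable, col, id). I have not checked whether InfiniDB accepts the correlated subquery, so treat it as an illustration rather than a tested solution:

```sql
-- Same 10% sample, but rn is filtered in an outer WHERE instead of HAVING.
select t.*
from (select ht.*,
             (select count(*)
              from hugetable ht2
              where ht2.col < ht.col or
                    (ht2.col = ht.col and ht2.id <= ht.id)
             ) as rn
      from hugetable ht
     ) t
where t.rn % 10 = 1;
```

The result should be the same as with having. The difference is only in where the filter on rn is written.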
EDIT:
Probably the only feasible approach -- if you can do it -- is to create another table with an auto-incremented id. Here is a stripped down version:
create table temp (
    id int auto_increment primary key,
    idstring varchar(255),
    col varchar(255)
);
insert into temp(idstring, col)
select idstring, col
from hugetable ht
order by col;
select *
from temp
where id % 10 = 1;
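If an extra table is not an option, the idea from the question (turn the string id into a number, then take a modulo) can be sketched with a hash function. This is a different technique from the numbering above. It assumes crc32(), which is a MySQL built-in, is also available in your engine; I have not verified that for InfiniDB. It keeps roughly, not exactly, one row in ten, and it ignores the ordering by col:

```sql
-- Assumption: crc32() is supported by the engine in use.
-- The hash maps each string id to an unsigned 32-bit integer, so the
-- modulo picks about 1 row in 10, and the same rows on every run.
select ht.*
from hugetable ht
where crc32(ht.id) % 10 = 0;
```

Because the selection depends only on the id, it needs no self-join and no user variables, but the sample is only as evenly spread as the hash values are.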