
Select stratified sample from table without variables

I need to get a stratified sample of my huge table. Specifically, I want to select 1/n of the rows from my table without bias, i.e. select randomly, select every nth row, etc.

Before asking this question, I tried an approach that relies on user variables. However, it didn't work for me because I am using the InfiniDB engine and, as I found out later, it doesn't support variables in sub-expressions, or something like that. Does anyone know a way to do this without user variables?

I was thinking about something like this: every row in my table has a unique alphanumeric string id, which can look like "1234567890" or like "abcdef12345". My idea was to somehow convert that string to a number and then use the modulo function to select only 1/n of the rows. However, I have no idea how to do the conversion, as the string is not hexadecimal.

Note: my table does not have an auto-incremented column.

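This is complicated, but you can do it. The idea is to number the rows by counting, for each row, how many rows sort at or before it, and then keep every 10th number. That requires a self-join with aggregation, written in this query as a correlated subquery (col stands for the column you are stratifying on, id for the unique string id). My guess is that this will not perform well, because you presumably have a large table and the subquery runs once per row. For a 10% sample, it would look like: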

select ht.*,
       (select count(*)
        from hugetable ht2
        where ht2.col < ht.col or
              (ht2.col = ht.col and ht2.id <= ht.id)
       ) as rn
from hugetable ht
having rn % 10 = 1;

Note that using having to filter on a select alias like this, without a group by, is specific to MySQL. It lets you filter on rn without wrapping the query in a subquery.
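If you want to sanity-check the row-numbering logic before running it on the real table, here is a minimal sketch in Python. It uses the built-in sqlite3 module purely as a stand-in (an assumption on my part; it says nothing about how InfiniDB behaves or performs), and because the having trick is MySQL-specific, the filter is moved into an outer query. The function name and the sample data are hypothetical.

```python
import sqlite3

def sample_every_nth(rows, n):
    """Rank rows by (col, id) with a correlated subquery and keep ranks 1, n+1, 2n+1, ...

    rows is a list of (id, col) pairs; n should be at least 2.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("create table hugetable (id text primary key, col text)")
    conn.executemany("insert into hugetable (id, col) values (?, ?)", rows)
    query = """
        select id, col, rn
        from (select ht.id, ht.col,
                     (select count(*)
                      from hugetable ht2
                      where ht2.col < ht.col
                         or (ht2.col = ht.col and ht2.id <= ht.id)) as rn
              from hugetable ht) ranked
        where rn % ? = 1
        order by rn
    """
    result = conn.execute(query, (n,)).fetchall()
    conn.close()
    return result

# Hypothetical data: three strata ("a", "b", "c") with ten rows each.
rows = [(f"{s}{i:02d}", s) for s in "abc" for i in range(10)]
sample = sample_every_nth(rows, 10)
print(sample)
```

With ten rows per stratum and n = 10, one row comes back from each stratum, which is the point of ordering by col before taking every nth row.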

EDIT:

Probably the only feasible approach, if you can do it, is to create another table with an auto-incremented id. Here is a stripped-down version:

create table temp (
    id int auto_increment primary key,  -- MySQL requires an auto_increment column to be a key
    idstring varchar(255),
    col varchar(255)
);

insert into temp(idstring, col)
    select idstring, col
    from hugetable ht
    order by col;

select *
from temp
where id % 10 = 1;
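The same idea can be tried end to end with a small sketch in Python. Again, sqlite3 is only a stand-in for the real engine (an assumption), the table is called temp_sample here because temp is a keyword in SQLite, and the function name and data are hypothetical. I also added idstring as a tie-breaker in the order by so the numbering is deterministic within each stratum; the version above orders by col alone.

```python
import sqlite3

def sample_via_temp_table(rows, n):
    """Copy rows into a table with an auto-incremented id, ordered by stratum,
    then keep ids 1, n+1, 2n+1, ...

    rows is a list of (idstring, col) pairs; n should be at least 2.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("create table hugetable (idstring text, col text)")
    conn.executemany("insert into hugetable (idstring, col) values (?, ?)", rows)
    conn.execute("""
        create table temp_sample (
            id integer primary key autoincrement,
            idstring text,
            col text
        )
    """)
    # Ids are assigned in the order the select produces the rows.
    conn.execute("""
        insert into temp_sample (idstring, col)
        select idstring, col
        from hugetable
        order by col, idstring
    """)
    picked = conn.execute(
        "select id, idstring, col from temp_sample where id % ? = 1 order by id",
        (n,),
    ).fetchall()
    conn.close()
    return picked

# Hypothetical data, deliberately inserted out of order.
rows = [(f"{s}{i:02d}", s) for s in "cab" for i in reversed(range(10))]
picked = sample_via_temp_table(rows, 10)
print(picked)
```

Unlike the correlated subquery, this touches each row once for the copy and once for the filter, which is why it is the more realistic option on a large table.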
