Select stratified sample from table without variables

Question

I need to get a stratified sample of my huge table. Specifically, I want to select 1/n of the rows from my table without bias, e.g. by selecting randomly or by selecting every nth row.

Before I asked this question, I tried doing this. However, it didn't work for me because I am using the InfiniDB engine which, as I found out later, doesn't support user variables in sub-expressions (or something along those lines). Does anyone know a way to do this without user variables?

I was thinking about something like this: in my table, every row has a unique alphanumeric string id, which can look like "1234567890" or like "abcdef12345". My idea was to somehow convert that string to a number and then use the modulo function to select only 1/n of the rows from my table. However, I have no idea how to do the conversion, as the string is not hexadecimal.

Note: my table does not have an autoincremented column.

Answer 1

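This is complicated, but you can do it. It requires a self-join and aggregation, implemented in this query as a correlated subquery that computes a row number for each row. Here col stands for the column you want to stratify by, and id breaks ties so that every row gets a distinct number. My guess is that this will not perform well, because the subquery is evaluated once per row and you presumably have a large table. For a 10% sample, it would look like: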

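select ht.*,
       -- rn = number of rows that sort at or before this one by (col, id)
       (select count(*)
        from hugetable ht2
        where ht2.col < ht.col or
              (ht2.col = ht.col and ht2.id <= ht.id)
       ) as rn
from hugetable ht
-- keep every 10th row of that ordering
having rn % 10 = 1;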

Note that the use of having in this context is specific to MySQL. It allows you to filter on the computed alias rn without wrapping the query in another subquery.

EDIT:

Probably the only feasible approach -- if you can do it -- is to create another table with an auto-incremented id. Here is a stripped-down version:

create table temp (
    -- MySQL requires an auto_increment column to be defined as a key
    id int auto_increment primary key,
    idstring varchar(255),
    col varchar(255)
);

insert into temp(idstring, col)
    select id, col
    from hugetable ht
    order by col;

select *
from temp
where id % 10 = 1;
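
The idea from the question, turning the string id into a number and taking it modulo n, can be sketched with a hash function instead of a literal conversion. This is only a sketch under assumptions: it assumes the engine exposes MySQL's CRC32() function (worth checking for InfiniDB before relying on it) and that the string id column of hugetable is called id, as in the first query. It needs no user variables, self-join, or extra table.

```sql
-- Sketch only. Assumptions: CRC32() is available in the engine, and
-- hugetable.id is the unique alphanumeric string id from the question.
-- CRC32() maps each string to an unsigned 32-bit integer, so the modulo
-- keeps roughly one row in ten.
select ht.*
from hugetable ht
where crc32(ht.id) % 10 = 1;
```

Unlike the row-number approaches above, this gives approximately 10% rather than exactly every 10th row, and it returns the same rows on every run. It also does not order by col, so each stratum is represented only in approximate proportion to its size.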

solution1
3 2014-07-21 13:39:38
