How to sample from different values in a column but only return records that are unique from another column?

I am struggling with a sampling problem in Teradata.

The data has the following format:

ID    Group     Rank
1     dog       1 
1     cat       1 
1     lion      1  
1     elephant  2 
2     dog       1 
2     cat       1 
2     lion      1 
2     elephant  1 
3     dog       1
3     cat       2 
3     lion      1 
3     elephant  1 
4     dog       2 
4     cat       1 
4     lion      1 
4     elephant  1 
... 

Ideally, I would like to return a set number of sampled rows for each value of Group, with every ID appearing at most once in the result.

Below is the query I currently have, but it returns duplicate IDs:

SELECT ID, Group FROM Table 
WHERE rank = 1 
SAMPLE 
 WHEN group = 'dog' then 10
 WHEN group = 'cat' then 10
 WHEN group = 'elephant' then 5
 WHEN group = 'lion' then 5
END

with cte as
 (
   SELECT ID, Group,
      random(1,10000) as rnd -- RANDOM can't be directly used in OLAP-functions
   FROM Table 
   WHERE rank = 1 
 )
SELECT ID, Group
FROM cte
QUALIFY 
   ROW_NUMBER() -- get one random row per ID
   OVER (PARTITION BY ID 
         ORDER BY rnd) = 1
SAMPLE 
 WHEN group = 'dog' then 10
 WHEN group = 'cat' then 10
 WHEN group = 'elephant' then 5
 WHEN group = 'lion' then 5
END

Assuming you have enough records, pick one random row per id and then take the appropriate number of rows per group from those:

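select t.*
from (select t.*,
             -- number the surviving rows within each group in random order
             row_number() over (partition by group order by rnd) as seqnum_g
      from (select t.*,
                   -- number the rows of each id in random order
                   row_number() over (partition by id order by rnd) as seqnum
            from (select t.*,
                         random(1, 1000000) as rnd -- computed first: RANDOM can't be used directly in OLAP functions
                  from t
                 ) t
           ) t
      where seqnum = 1 -- keep one random row per id
     ) t
where (group in ('dog', 'cat') and seqnum_g <= 10) or
      (group in ('elephant', 'lion') and seqnum_g <= 5) ;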

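This does not guarantee that every group reaches its target size in the result set. However, if you have enough data relative to the group sizes, it should work.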
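
To see the caveat concretely, here is a minimal sketch in Python (not Teradata) that imitates the two steps: one random row per ID, then up to a fixed number of rows per group. The function name sample_unique_ids, the quotas dictionary, and the 40-ID data set are all made up for illustration, and the rows are assumed to be filtered to rank = 1 already.

```python
import random
from collections import defaultdict


def sample_unique_ids(rows, quotas, seed=None):
    """Pick at most quotas[group] rows per group, using each ID at most once.

    rows   -- iterable of (id, group) pairs, assumed already filtered to rank = 1
    quotas -- dict mapping group name to the maximum number of rows wanted
    """
    rng = random.Random(seed)

    # Step 1: keep one random row per ID
    # (the ROW_NUMBER() OVER (PARTITION BY ID ORDER BY rnd) = 1 step).
    groups_by_id = defaultdict(list)
    for id_, group in rows:
        groups_by_id[id_].append(group)
    one_per_id = [(id_, rng.choice(groups)) for id_, groups in groups_by_id.items()]

    # Step 2: visit the surviving rows in random order and take each one
    # only while its group is still below its quota.
    rng.shuffle(one_per_id)
    taken = defaultdict(int)
    result = []
    for id_, group in one_per_id:
        if taken[group] < quotas.get(group, 0):
            taken[group] += 1
            result.append((id_, group))
    return result


# Hypothetical data: 40 IDs, each with a rank-1 row in every group.
group_names = ["dog", "cat", "elephant", "lion"]
rows = [(i, g) for i in range(1, 41) for g in group_names]
quotas = {"dog": 10, "cat": 10, "elephant": 5, "lion": 5}

sample = sample_unique_ids(rows, quotas, seed=0)
ids = [id_ for id_, _ in sample]
counts = defaultdict(int)
for _, group in sample:
    counts[group] += 1

print(len(ids) == len(set(ids)))  # True: no ID appears twice
print(dict(counts))               # rows returned per group, each at most its quota
```

With only 40 IDs, step 1 leaves about 10 rows per group on average, so dog or cat can easily end up with fewer than 10. With many more IDs than the total of the quotas, every group is much more likely to be filled.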
