
How to get weighted sample data in Hive SQL?

I know we can do random sampling in Hive using something like Row_Number() OVER (partition by deptid ORDER BY rand() desc), assuming we want to pick some rows at random within each partition.

However, I don't know how to sample rows based on the value of a column. For example, we want to sample data weighted by salary. How can we do that?

Sample data:

create table employee (empid int, deptid int, salary decimal(10,2));
insert into employee values (1, 10, 5500.00);
insert into employee values (2, 10, 4500.00);
insert into employee values (3, 20, 1900.00);
insert into employee values (4, 20, 4800.00);
insert into employee values (5, 40, 6500.00);
insert into employee values (6, 40, 14500.00);
insert into employee values (7, 40, 44500.00);
insert into employee values (8, 50, 6500.00);
insert into employee values (9, 50, 7500.00);

Is there a way to do this in Hive?

Here is a simple idea for picking sample records based on a partition of salaries. You can use:

  1. partition by buckets of salaries. You can bucket them using cast(salary/10000 as INT). You can change 10000 to any number to create buckets of your choosing. With 10000 and your example data, every salary below 10000 (1900, 4500, 4800, 5500, 6500, 7500) falls into bucket 0, 14500 falls into bucket 1, and 44500 falls into bucket 4.
  2. order by rand(). This gives a random order within each of the partitions above.

SQL -

select *
from (
  select e.*,
         row_number() over (partition by cast(salary/10000 as INT) order by rand()) as rn
  from employee e
) rs
where rn < 3 -- choose 2 records per partition
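
Before sampling, it can help to check which bucket each row lands in. The query below is a minimal sketch against the employee table from the question; the alias salary_bucket is only an illustrative name, and the divisor 10000 is the same arbitrary choice as above.

```sql
-- Show the bucket each employee falls into.
-- cast(... as int) drops the fractional part, so salaries below 10000 share one bucket.
select empid,
       salary,
       cast(salary / 10000 as int) as salary_bucket  -- hypothetical alias
from employee
order by salary_bucket, empid;
```

If one bucket holds most of the rows, as happens here for salaries below 10000, a smaller divisor spreads the rows over more buckets.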

Now, weighted sampling can be as simple as picking sample records per partition, as above, or as complex as picking records with a probability that depends on how large the salary is. For the second scenario, you need to tweak the order by.
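
One possible way to tweak the order by for the second scenario is sketched below. It is not spelled out in the answer itself; it swaps in a standard weighted-sampling trick (commonly attributed to Efraimidis and Spirakis): give every row the random key -ln(u) / salary, with u uniform in (0, 1], and keep the rows with the smallest keys. Rows with a larger salary tend to get smaller keys, so they are more likely to be picked. The names sample_key, keyed and ranked, and the sample size of 3, are assumptions made for illustration.

```sql
-- Sketch: draw 3 rows without replacement, with probability weighted by salary.
select empid, deptid, salary
from (
  select k.*,
         row_number() over (order by sample_key) as rn   -- smallest keys first
  from (
    select e.*,
           -- 1.0 - rand() lies in (0, 1], which avoids ln(0)
           -ln(1.0 - rand()) / salary as sample_key
    from employee e
    where salary > 0                                      -- weights must be positive
  ) k
) ranked
where rn <= 3;                                            -- hypothetical sample size
```

To draw a weighted sample inside each department instead of across the whole table, adding partition by deptid to the over clause should be enough. Because the key is random, the result changes on every run.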

