I know we can take a random sample in Hive with something like Row_Number() OVER (partition by deptid ORDER BY rand() desc), assuming we want a few random rows from each partition.
However, I don't know how to sample rows based on the value of one column. For example, we want to sample rows weighted by salary. How can that be done?
Sample data:
create table employee (empid int, deptid int, salary decimal(10,2));
insert into employee values(1,10,5500.00);
insert into employee values(2,10,4500.00);
insert into employee values(3,20,1900.00);
insert into employee values(4,20,4800.00);
insert into employee values(5,40,6500.00);
insert into employee values(6,40,14500.00);
insert into employee values(7,40,44500.00);
insert into employee values(8,50,6500.00);
insert into employee values(9,50,7500.00);
Is there a way to do this in Hive?
Here is a simple idea: pick sample records within partitions of salary. Using order by rand() in the window gives a random order inside each partition, and row_number() then lets you keep the first few rows. SQL:
select * from (
  select e.*,
         row_number() over (partition by cast(salary/10000 as INT) order by rand()) rn
  from employee e
) rs
where rn < 3; -- choose 2 records per partition
Weighted sampling can be as simple as picking sample records per partition, as above, or as complex as making the chance of picking a record depend on how large the salary is. For the second scenario, you need to tweak the order by so that it takes salary into account instead of using rand() alone.
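One possible way to tweak the order by for the second scenario is sketched below. It is not something the query above already does: it swaps in the exponential-key trick (often attributed to Efraimidis and Spirakis) for weighted sampling without replacement. Each row gets a sort key of -ln(u) / salary, with u drawn uniformly from (0, 1], and the rows with the smallest keys are kept. The sketch assumes that salary is always positive, that "weighted by salary" means a pick probability proportional to salary, and that one row per deptid is wanted; the partitioning and the rn cutoff are illustrative choices.

```sql
-- Sketch: salary-weighted sampling without replacement (assumes salary > 0).
-- rand() returns a value in [0, 1), so 1.0 - rand() lies in (0, 1] and ln() is defined.
-- A larger salary shrinks the key, so high-salary rows tend to rank first.
select empid, deptid, salary
from (
  select e.*,
         row_number() over (
           partition by deptid
           order by -ln(1.0 - rand()) / cast(salary as double)
         ) as rn
  from employee e
) rs
where rn <= 1;  -- keep one weighted pick per department
```

Dropping the partition by clause and raising the rn cutoff should give a weighted sample over the whole table instead. With a cutoff of 1, each row in a department is picked with probability salary / sum(salary) for that department, so in department 40 the 44500.00 row would be expected to come up far more often than the 6500.00 row. The result changes on every run because rand() is unseeded.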