
How to generate a large data set using Hive / Spark SQL?

For example, generate 1G (one billion) records, with sequential numbers between 1 and 1G.

Create partitioned seed table

create table seed (i int)
partitioned by (p int);

Populate the seed table with 1K records carrying sequential numbers between 0 and 999.
Each record is inserted into a different partition and is therefore stored in a different HDFS directory and, more importantly, in a different file.

P.S. The following settings are needed. The first two let a single insert create 1,000 dynamic partitions, and the last two stop Hive from combining the small files into shared splits, so that each file is later read by its own container:

set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=1000;
set hive.hadoop.supports.splittable.combineinputformat=false;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

-- space(999) returns a string of 999 spaces; splitting it on ' ' yields an
-- array of 1,000 empty strings, and posexplode emits their positions 0..999
insert into table seed partition (p)
select  i, i
from    (select 1) x
        lateral view posexplode(split(space(999), ' ')) e as i, x;
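
A quick sanity check, assuming the insert above succeeded: the seed table should hold 1K records, numbered 0 to 999, spread over 1K partitions.

select count(*), min(i), max(i), count(distinct p) from seed;
-- expected: 1000, 0, 999, 1000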

Generate a table with 1G records.
Each of the 1K records in the seed table sits in its own file and is read by a different container.
Each container generates 1M records, for 1K * 1M = 1G records in total.

-- each of the 1,000 seed records spawns 1,000,000 positions (0..999999);
-- s.i*1000000 + e.i + 1 therefore covers 1..1,000,000,000 exactly once
create table t1g
as
select  s.i*1000000 + e.i + 1  as n
from    seed s
        lateral view posexplode(split(space(1000000-1), ' ')) e as i, x;
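
Again, a quick sanity check of the result, assuming the CTAS above completed:

select count(*), min(n), max(n) from t1g;
-- expected: 1000000000, 1, 1000000000

Side note: in Spark SQL specifically, the seed-table trick is not required, since the built-in range table-valued function generates the sequence directly and Spark parallelizes it on its own; a minimal sketch:

-- range(N) produces a column named id with values 0..N-1
create table t1g as select id + 1 as n from range(1000000000);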
