
Hive: How to flatten an array?

I have this table:

CREATE TABLE `dum`(
  `val` map<string,array<string>>)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION
  'some/hdfs/dum'
TBLPROPERTIES (
  'transient_lastDdlTime'='1593230834')
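
For reference: val is a map<string,array<string>>, so individual entries can be read with the usual bracket syntax. A quick illustration (assuming the row inserted just below):

select val['A'] from dum;  -- ["1","2","3"]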

and I insert a simple value here:

insert into dum select map('A',array('1','2','3'),'B',array('4','5','6'));

and I see this data:

hive> select * from dum;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20200627042934_21c7038a-3499-4076-a67b-7260f0b57030
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1593212599278_0076, Tracking URL = http://ndb4zi3yi7-m:8088/proxy/application_1593212599278_0076/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1593212599278_0076
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-06-27 04:29:43,830 Stage-1 map = 0%,  reduce = 0%
2020-06-27 04:29:51,000 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.04 sec
MapReduce Total cumulative CPU time: 4 seconds 40 msec
Ended Job = job_1593212599278_0076
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 4.04 sec   HDFS Read: 237 HDFS Write: 124 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 40 msec
OK
{"A":["1","2","3"],"B":["4","5","6"]}

Now I want to collect all of the values in the above map column, as a single array.

When I do:

hive> select map_values(val) from dum;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20200627043423_63c2a284-ca5b-4c6a-b8e9-3226f81c8b0d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1593212599278_0077, Tracking URL = http://ndb4zi3yi7-m:8088/proxy/application_1593212599278_0077/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1593212599278_0077
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-06-27 04:34:31,658 Stage-1 map = 0%,  reduce = 0%
2020-06-27 04:34:38,817 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.9 sec
MapReduce Total cumulative CPU time: 4 seconds 900 msec
Ended Job = job_1593212599278_0077
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 4.9 sec   HDFS Read: 237 HDFS Write: 111 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 900 msec
OK
[["1","2","3"],["4","5","6"]]
Time taken: 16.82 seconds, Fetched: 1 row(s)
hive>
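
Note that map_values(val) has type array<array<string>>, which is why the result prints as nested arrays. A quick sanity check with the built-in size function shows there are two inner arrays rather than six elements:

select size(map_values(val)) from dum;  -- 2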

I want the result in a single array, like this:

["1","2","3","4","5","6"]

so basically I just want to flatten the array of arrays.

How can I achieve that? I tried collect_set, but I got the same result as above:

hive> select collect_set(v) from dum lateral view explode(map_values(val)) vl as v;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20200627043601_e3483ad4-96be-459d-9384-6d4d2852afa1
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1593212599278_0078, Tracking URL = http://ndb4zi3yi7-m:8088/proxy/application_1593212599278_0078/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1593212599278_0078
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-06-27 04:36:11,241 Stage-1 map = 0%,  reduce = 0%
2020-06-27 04:36:19,424 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.36 sec
2020-06-27 04:36:25,556 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 8.04 sec
MapReduce Total cumulative CPU time: 8 seconds 40 msec
Ended Job = job_1593212599278_0078
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 8.04 sec   HDFS Read: 237 HDFS Write: 111 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 40 msec
OK
[["1","2","3"],["4","5","6"]]

What am I doing wrong? Here is one way I found:

hive> select collect_set(vi) from dum lateral view explode(map_values(val)) x as v lateral view explode(v) y as vi;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20200627054452_d23a4e84-e68e-4087-8513-23b31d7ad1d1
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1593212599278_0090, Tracking URL = http://ndb4zi3yi7-m:8088/proxy/application_1593212599278_0090/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1593212599278_0090
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-06-27 05:44:58,435 Stage-1 map = 0%,  reduce = 0%
2020-06-27 05:45:06,604 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.95 sec
2020-06-27 05:45:12,725 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.5 sec
MapReduce Total cumulative CPU time: 7 seconds 500 msec
Ended Job = job_1593212599278_0090
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 7.5 sec   HDFS Read: 237 HDFS Write: 111 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 500 msec
OK
["1","2","3","4","5","6"]
Time taken: 21.69 seconds, Fetched: 1 row(s)
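
The double lateral view is what makes this work: explode(map_values(val)) emits one row per inner array (each still an array<string>), and the second explode unpacks those arrays into individual elements for collect_set to gather. The intermediate rows can be inspected like so:

select v from dum lateral view explode(map_values(val)) x as v;
-- ["1","2","3"]
-- ["4","5","6"]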

I don't know if this is the ideal approach.
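
One caveat: collect_set de-duplicates and does not guarantee element order, so if duplicates or order matter, collect_list is the safer aggregate here:

select collect_list(vi) from dum lateral view explode(map_values(val)) x as v lateral view explode(v) y as vi;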

Also, this does not work with multiple rows. Consider this:

insert into dum select map('A',array('7','8','9'),'B',array('10','11','12'));

So now we have:

hive> select * from dum;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20200627055921_69298f96-ebd4-4022-aa44-228e81203404
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1593212599278_0098, Tracking URL = http://ndb4zi3yi7-m:8088/proxy/application_1593212599278_0098/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1593212599278_0098
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-06-27 05:59:27,692 Stage-1 map = 0%,  reduce = 0%
2020-06-27 05:59:33,817 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.86 sec
MapReduce Total cumulative CPU time: 3 seconds 860 msec
Ended Job = job_1593212599278_0098
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 3.86 sec   HDFS Read: 336 HDFS Write: 164 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 860 msec
OK
{"A":["1","2","3"],"B":["4","5","6"]}
{"A":["7","8","9"],"B":["10","11","12"]}

but if I try the above solution:

select collect_set(vi) from dum lateral view explode(map_values(val)) x as v lateral view explode(v) y as vi;

I get:

["1","2","3","4","5","6","7","8","9","10","11","12"]

However, I want the values flattened per row:

["1","2","3","4","5","6"]
["7","8","9","10","11","12"]

You can also try to use concat_ws on top of collect_list / collect_set to flatten the collected data:

-- the inner concat_ws joins each array<string> into one comma-separated string,
-- collect_list gathers those strings (group by val keeps each distinct map,
-- i.e. each row here, in its own group), the outer concat_ws joins them,
-- and split turns the result back into a single flat array
select split(concat_ws(',', collect_list(concat_ws(',', pe.v1))), ',')
from dum
lateral view explode(val) pe as k1, v1
group by val;

Output:

["1","2","3","4","5","6"]
["7","8","9","10","11","12"]

I think I figured it out:

select k, collect_set(vi)
from dum
lateral view explode(val) x as k, v   -- one row per map entry: (key, array)
lateral view explode(v) y as vi       -- one row per element of each array
group by k;                           -- collect the elements per key

and I get:

B   ["4","5","6","10","11","12"]
A   ["1","2","3","7","8","9"]

Is this the correct way?

There's actually already a brickhouse UDF for this, array_flatten: https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/collect/ArrayFlattenUDF.java

hive (default)> select array_flatten(array(array('a', 'b'), array('b', 'c')));
OK
["a","b","b","c"]
Time taken: 0.302 seconds, Fetched: 1 row(s)
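
Applied to the question's table, this flattens each row's map values without any aggregation. A sketch, assuming the brickhouse jar has been added and the function created from the class linked above (the jar path here is hypothetical):

add jar /path/to/brickhouse.jar;
create temporary function array_flatten as 'brickhouse.udf.collect.ArrayFlattenUDF';

select array_flatten(map_values(val)) from dum;
-- ["1","2","3","4","5","6"]
-- ["7","8","9","10","11","12"]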

A bit janky, but you can also de-dupe using another brickhouse UDF, array_union, and just pass an additional empty array:

hive (default)> select array_union(array_flatten(array(array('a', 'b'), array('b', 'c'))), array());
OK
["a","b","c"]
Time taken: 0.245 seconds, Fetched: 1 row(s)
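
The same composition should also work per row on the table, flattening and de-duplicating in one pass (again a sketch, assuming array_union is registered the same way as array_flatten above):

select array_union(array_flatten(map_values(val)), array()) from dum;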
