Pig 10.0 - group the tuples and merge bags in a foreach
I'm using Pig 10.0. I want to merge bags in a foreach. Let's say I have the following visitors alias:
(a, b, {1, 2, 3, 4}),
(a, d, {1, 3, 6}),
(a, e, {7}),
(z, b, {1, 2, 3})
I want to group the tuples on the first field and merge the bags with set semantics to get the following tuples:
({1, 2, 3, 4, 6, 7}, a, 6)
({1, 2, 3}, z, 3)
The first field is the union of the bags with set semantics. The second field of the tuple is the group field. The third field is the number of items in the bag.
I tried several variations of the following code (replacing SetUnion with Group/Distinct etc.) but never got the behavior I wanted:
DEFINE SetUnion datafu.pig.bags.sets.SetUnion();
grouped = GROUP visitors by (FirstField);
merged = FOREACH grouped {
    VU = SetUnion(visitors.ThirdField);
    GENERATE
        VU as Vu,
        group as FirstField,
        COUNT(VU) as Cnt;
}
dump merged;
Can you explain where I'm wrong and how to implement the desired behavior?
I finally managed to get the behavior I wanted. A self-contained example of my solution follows:
Data file:
a b 1
a b 2
a b 3
a b 4
a d 1
a b 3
a b 6
a e 7
z b 1
z b 2
z b 3
Code:
-- Prepare data
in = LOAD 'data' USING PigStorage()
    AS (One:chararray, Two:chararray, Id:long);
grp = GROUP in by (One, Two);
cnt = FOREACH grp {
    ids = DISTINCT in.Id;
    GENERATE
        ids as Ids,
        group.One as One,
        group.Two as Two,
        COUNT(ids) as Count;
}
-- Interesting code follows
grp2 = GROUP cnt by One;
cnt2 = FOREACH grp2 {
    ids = FOREACH cnt.Ids generate FLATTEN($0);
    GENERATE
        ids as Ids,
        group as One,
        COUNT(ids) as Count;
}
describe cnt2;
dump grp2;
dump cnt2;
Describe:
cnt: {Ids: {(Id: long)},One: chararray,Two: chararray,Count: long}
grp2:
(a,{({(1),(2),(3),(4),(6)},a,b,5),({(1)},a,d,1),({(7)},a,e,1)})
(z,{({(1),(2),(3)},z,b,3)})
cnt2:
({(1),(2),(3),(4),(6),(1),(7)},a,7)
({(1),(2),(3)},z,3)
Since the code uses a FOREACH nested in a FOREACH, it requires Pig 10.0 or later.
I will leave the question unresolved for a few days since a cleaner solution probably exists.
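One caveat with the cnt2 output above: for group a the bag still contains (1) twice, so the count is 7 instead of the 6 asked for in the question. A possible fix, sketched here and not run against a cluster, is to apply DISTINCT to the flattened bag inside the same nested block. The alias names raw and flat are introduced only for this illustration, and 'data' is assumed to be the file shown above:

```pig
-- Sketch only: same input as above, tab-separated fields One, Two, Id.
raw = LOAD 'data' USING PigStorage()
    AS (One:chararray, Two:chararray, Id:long);

-- First pass: distinct Ids per (One, Two) pair, as in the original script.
grp = GROUP raw BY (One, Two);
cnt = FOREACH grp {
    ids = DISTINCT raw.Id;
    GENERATE ids AS Ids, group.One AS One, group.Two AS Two;
}

-- Second pass: merge the per-pair bags, then de-duplicate the merged bag.
grp2 = GROUP cnt BY One;
cnt2 = FOREACH grp2 {
    flat = FOREACH cnt.Ids GENERATE FLATTEN($0); -- one tuple per Id, duplicates kept
    ids = DISTINCT flat;                         -- drop Ids repeated across pairs
    GENERATE ids AS Ids, group AS One, COUNT(ids) AS Count;
}
DUMP cnt2;
```

The only change to the second pass is the extra DISTINCT, so COUNT is taken over the de-duplicated bag rather than the raw concatenation.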
Found a simpler solution for this.
current_input = load '/idn/home/ksing143/tuple_related_data/tough_grouping.txt' USING PigStorage() AS (col1:chararray, col2:chararray, col3:int);
/* But we do not need column 2. Hence eliminating to avoid confusion */
relevant_input = foreach current_input generate col1, col3;
relevant_distinct = DISTINCT relevant_input;
relevant_grouped = group relevant_distinct by col1;
/* This will give */
(a,{(a,1),(a,2),(a,3),(a,4),(a,6),(a,7)})
(z,{(z,1),(z,2),(z,3)})
relevant_grouped_advance = foreach relevant_grouped generate (relevant_distinct.col3) as col3, group, COUNT(relevant_distinct.col3) as count_val;
/* This will give desired result */
({(1),(2),(3),(4),(6),(7)},a,6)
({(1),(2),(3)},z,3)
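The separate DISTINCT pass can probably be folded into the grouping step with a nested DISTINCT, the same construct the first answer uses for its cnt relation. A minimal sketch, assuming the tab-separated data file from the first answer is stored as 'data' (the aliases raw, pairs, grouped and result are made up for this example, and the script has not been run here):

```pig
-- Sketch only: 'data' is an assumed path to the tab-separated input file.
raw = LOAD 'data' USING PigStorage()
    AS (col1:chararray, col2:chararray, col3:int);

-- Column 2 plays no role in the result, so project it away first.
pairs = FOREACH raw GENERATE col1, col3;

grouped = GROUP pairs BY col1;
result = FOREACH grouped {
    ids = DISTINCT pairs.col3; -- set semantics within each group
    GENERATE ids AS col3, group AS col1, COUNT(ids) AS count_val;
}
DUMP result;
```

Doing the DISTINCT inside the nested block keeps the de-duplication and the count in one statement, at the cost of carrying duplicate rows into the GROUP.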