Pig 10.0 - Group tuples together and merge bags in a foreach

Question

I'm using Pig 10.0. I want to merge bags in a foreach. Let's say I have the following visitors alias:

(a, b, {1, 2, 3, 4}),
(a, d, {1, 3, 6}),
(a, e, {7}),
(z, b, {1, 2, 3})

I want to group the tuples on the first field and merge the bags with set semantics to get the following tuples:

({1, 2, 3, 4, 6, 7}, a, 6) 
({1, 2, 3}, z, 3)

The first field is the union of the bags with set semantics. The second field of the tuple is the group field. The third field is the number of items in the bag.

I tried several variations of the following code (replacing SetUnion with Group/Distinct, etc.) but always failed to achieve the desired behavior:

DEFINE SetUnion        datafu.pig.bags.sets.SetUnion();

grouped = GROUP visitors by (FirstField);
merged = FOREACH grouped {
    VU = SetUnion(visitors.ThirdField);
    GENERATE 
        VU        as Vu,
        group     as FirstField,
        COUNT(VU) as Cnt;
    }
dump merged;

Can you explain where I'm wrong and how to implement the desired behavior?

Answer 1

I finally managed to achieve the desired behavior. A self-contained example of my solution follows:

Data file:

a       b       1
a       b       2
a       b       3
a       b       4
a       d       1
a       b       3
a       b       6
a       e       7
z       b       1
z       b       2
z       b       3

Code:

-- Prepare data
in = LOAD 'data' USING PigStorage() 
        AS (One:chararray, Two:chararray, Id:long);

grp = GROUP in by (One, Two);
cnt = FOREACH grp {
        ids = DISTINCT in.Id;
        GENERATE
                ids        as Ids,
                group.One  as One,
                group.Two  as Two,
                COUNT(ids) as Count;
}       

-- Interesting code follows
grp2 = GROUP cnt by One;
cnt2 = FOREACH grp2 {
        ids = FOREACH cnt.Ids generate FLATTEN($0);
        GENERATE
                ids  as Ids,
                group      as One,
                COUNT(ids) as Count;
}               

describe cnt2;
dump grp2;
dump cnt2;

Describe:

Cnt: {Ids: {(Id: long)},One: chararray,Two: chararray,Count: long}

grp2:

(a,{({(1),(2),(3),(4),(6)},a,b,5),({(1)},a,d,1),({(7)},a,e,1)})
(z,{({(1),(2),(3)},z,b,3)})

cnt2:

({(1),(2),(3),(4),(6),(1),(7)},a,7)
({(1),(2),(3)},z,3)

Since the code uses a FOREACH nested inside a FOREACH, it requires Pig > 10.0.

I will leave the question unresolved for a few days, since a cleaner solution probably exists.
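
One thing to note about the cnt2 output above: for group a it still lists (1) twice and reports a count of 7, because the nested FOREACH only flattens the per-(One, Two) bags and does not deduplicate across them. A possible refinement, given here as an untested sketch, is to apply a nested DISTINCT to the flattened bag before counting. The aliases flat and cnt3 are names introduced only for this illustration; grp2 and cnt are the aliases from the code above.

```pig
-- Sketch: same grouping as grp2 above, plus a nested DISTINCT
-- (flat and cnt3 are illustrative names, not part of the original script)
cnt3 = FOREACH grp2 {
        flat = FOREACH cnt.Ids GENERATE FLATTEN($0);  -- merge the inner bags
        ids  = DISTINCT flat;                         -- drop ids seen in several bags
        GENERATE
                ids        as Ids,
                group      as One,
                COUNT(ids) as Count;
}
```

The intent is that each id appears once per group, so the count reflects the size of the set union rather than the total number of flattened ids.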

Answer 2

Found a simpler solution for this.

current_input = load '/idn/home/ksing143/tuple_related_data/tough_grouping.txt' USING PigStorage() AS (col1:chararray, col2:chararray, col3:int);

/* But we do not need column 2. Hence eliminating to avoid confusion */
relevant_input = foreach current_input generate col1, col3;

relevant_distinct = DISTINCT relevant_input;

relevant_grouped = group relevant_distinct by col1;

/* This will give */
(a,{(a,1),(a,2),(a,3),(a,4),(a,6),(a,7)})
(z,{(z,1),(z,2),(z,3)})

relevant_grouped_advance = foreach relevant_grouped generate (relevant_distinct.col3) as col3, group, COUNT(relevant_distinct.col3) as count_val;

/* This will give desired result */
({(1),(2),(3),(4),(6),(7)},a,6)
({(1),(2),(3)},z,3)
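
This answer starts from a flat file with one id per row. If the data is already in the bag form shown in the question, the same idea could probably be applied by flattening the bags first. The following is a minimal, untested sketch: it assumes the visitors alias and the FirstField and ThirdField names from the question's code, and that ThirdField is a bag of single-field tuples. The aliases flat and uniq and the field name Id are introduced only for this illustration.

```pig
-- Assumed schema (not stated in the question):
-- visitors: (FirstField:chararray, SecondField:chararray, ThirdField:bag{(Id:long)})
-- flat, uniq and Id are illustrative names.
flat    = FOREACH visitors GENERATE FirstField, FLATTEN(ThirdField) AS Id;  -- one row per (group key, id)
uniq    = DISTINCT flat;                                                    -- set semantics
grouped = GROUP uniq BY FirstField;
merged  = FOREACH grouped GENERATE
              uniq.Id     AS Vu,
              group       AS FirstField,
              COUNT(uniq) AS Cnt;
dump merged;
```

Deduplicating before the GROUP keeps the final FOREACH flat, so this variant does not rely on a nested FOREACH.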

Answer 1: score 4, accepted, 2013-03-27 14:57:49

Answer 2: score 0, 2016-12-20 07:28:59
