
Pig 10.0 - group the tuples and merge bags in a foreach

I'm using Pig 10.0. I want to merge bags in a foreach. Let's say I have the following visitors alias:

(a, b, {1, 2, 3, 4}),
(a, d, {1, 3, 6}),
(a, e, {7}),
(z, b, {1, 2, 3})

I want to group the tuples on the first field and merge the bags with set semantics to get the following tuples:

({1, 2, 3, 4, 6, 7}, a, 6) 
({1, 2, 3}, z, 3) 

The first field is the union of the bags with set semantics. The second field of the tuple is the group field. The third field is the number of items in the bag.

I tried several variations of the following code (replacing SetUnion with Group/Distinct etc.) but never managed to get the wanted behavior:

DEFINE SetUnion        datafu.pig.bags.sets.SetUnion();

grouped = GROUP visitors by (FirstField);
merged = FOREACH grouped {
    VU = SetUnion(visitors.ThirdField);
    GENERATE 
        VU        as Vu,
        group     as FirstField,
        COUNT(VU) as Cnt;
    }
dump merged;

Can you explain where I'm wrong and how to implement the desired behavior?

I finally managed to achieve the wanted behavior. A self-contained example of my solution follows:

Data file:

a       b       1
a       b       2
a       b       3
a       b       4
a       d       1
a       b       3
a       b       6
a       e       7
z       b       1
z       b       2
z       b       3

Code:

-- Prepare data
in = LOAD 'data' USING PigStorage() 
        AS (One:chararray, Two:chararray, Id:long);

grp = GROUP in by (One, Two);
cnt = FOREACH grp {
        ids = DISTINCT in.Id;
        GENERATE
                ids        as Ids,
                group.One  as One,
                group.Two  as Two,
                COUNT(ids) as Count;
}       

-- Interesting code follows
grp2 = GROUP cnt by One;
cnt2 = FOREACH grp2 {
        ids = FOREACH cnt.Ids generate FLATTEN($0);
        GENERATE
                ids  as Ids,
                group      as One,
                COUNT(ids) as Count;
}               

describe cnt2;
dump grp2;
dump cnt2;

Describe:

Cnt: {Ids: {(Id: long)},One: chararray,Two: chararray,Count: long}

grp2:

(a,{({(1),(2),(3),(4),(6)},a,b,5),({(1)},a,d,1),({(7)},a,e,1)})
(z,{({(1),(2),(3)},z,b,3)})

cnt2:

({(1),(2),(3),(4),(6),(1),(7)},a,7)
({(1),(2),(3)},z,3)

Since the code uses a FOREACH nested in a FOREACH, it requires Pig > 10.0.

I will leave the question unresolved for a few days, since a cleaner solution probably exists.
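One candidate for a cleaner version, as a sketch only: the union of the per-(One, Two) bags with set semantics is the same thing as the distinct Ids per One, so grouping on One alone and applying a nested DISTINCT may be enough. The alias visits is my own placeholder (chosen instead of in, which may clash with the IN operator in later Pig releases); the file name and schema are the ones from the example above.

```pig
-- Sketch, assuming the same tab-separated 'data' file as above.
-- 'visits' is a hypothetical alias, not taken from the original script.
visits = LOAD 'data' USING PigStorage()
        AS (One:chararray, Two:chararray, Id:long);

-- Group on the first field only; the second field is not needed for the union.
grp = GROUP visits BY One;

merged = FOREACH grp {
        -- A nested DISTINCT removes duplicate ids inside each group.
        ids = DISTINCT visits.Id;
        GENERATE
                ids        AS Ids,
                group      AS One,
                COUNT(ids) AS Cnt;
}

dump merged;
```

With the data file above, this should leave six distinct ids in the bag for a and three for z; the order of the tuples inside each bag is not guaranteed.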

Found a simpler solution for this.

current_input = load '/idn/home/ksing143/tuple_related_data/tough_grouping.txt' USING PigStorage() AS (col1:chararray, col2:chararray, col3:int);

/* But we do not need column 2. Hence eliminating to avoid confusion */

relevant_input = foreach current_input generate col1, col3;

relevant_distinct = DISTINCT relevant_input;

relevant_grouped = group relevant_distinct by col1;

/* This will give */

(a,{(a,1),(a,2),(a,3),(a,4),(a,6),(a,7)})

(z,{(z,1),(z,2),(z,3)})

relevant_grouped_advance = foreach relevant_grouped generate (relevant_distinct.col3) as col3, group, COUNT(relevant_distinct.col3) as count_val;

/* This will give desired result */

({(1),(2),(3),(4),(6),(7)},a,6)

({(1),(2),(3)},z,3)
