How to build a super bag based on two sub-bags in Apache Pig on Hadoop

Suppose I have two bags, B1 and B2. How can I build a super bag that contains both of them? I want one super bag containing the two sub-bags so that I can call DataFu's SetDifference UDF, which seems to be called on a relation that contains two bags.

In my case, I already have the two bags, B1 and B2. I think I need to assemble the super bag called "input" in this sample:

http://datafu.incubator.apache.org/docs/datafu/guide/set-operations.html

differenced = FOREACH input {
  -- input bags must be sorted
  sorted_b1 = ORDER B1 by val;
  sorted_b2 = ORDER B2 by val;
  GENERATE SetDifference(sorted_b1,sorted_b2);
}

Update:

Here is my code and the related error message. If anyone has any good ideas, that would be great.

register datafu-1.2.0.jar;
define setDifference datafu.pig.sets.SetDifference();

-- input1.txt: {(3),(4),(1),(2),(7),(5),(6)}
-- input2.txt: {(1),(3),(5),(12)}
A = load 'input1.txt' AS (B1:bag{T:tuple(val:int)});
B = load 'input2.txt' AS (B1:bag{T:tuple(val:int)});

sorted_b1 = ORDER A by val;
sorted_b2 = ORDER B by val;
differenced = setDifference(sorted_b1,sorted_b2);

-- expected produces: ({(2),(4),(6),(7)})
DUMP differenced;

[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file TestDataFu3.pig, line 11> Cannot expand macro 'setDifference'. Reason: Macro must be defined before expansion.

Thanks in advance, Lin

Okay, I see what you are asking: your bags are in different files. You will need to load them and then join them so that they are in the same relation.

Script:

REGISTER /path/to/jars/datafu-1.2.0.jar;
DEFINE SetDifference datafu.pig.sets.SetDifference();

data1 = LOAD 'input1' AS (B1:bag{T1:tuple(val1:int)});
data2 = LOAD 'input2' AS (B2:bag{T2:tuple(val2:int)});
A = JOIN data1 BY 1, data2 BY 1;
diff = FOREACH A {
  S1 = ORDER B1 BY val1;
  S2 = ORDER B2 BY val2;
  GENERATE SetDifference(S1, S2);
};
DUMP diff;

Output:

({(2),(4),(6),(7)})
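
As for the ERROR 1200 in the question: as far as I can tell, a top-level statement of the form `alias = name(args);` is parsed by Pig as a macro invocation, so a UDF has to be called inside FOREACH ... GENERATE instead. The variant below is only a sketch of the same idea using CROSS rather than a join on a constant key. It assumes each input file holds exactly one row with one bag. The jar path and the alias `both` are placeholders, not anything from the question.

```pig
REGISTER /path/to/jars/datafu-1.2.0.jar;
DEFINE SetDifference datafu.pig.sets.SetDifference();

-- Assumption: each file contains a single row holding a single bag.
data1 = LOAD 'input1' AS (B1:bag{T1:tuple(val1:int)});
data2 = LOAD 'input2' AS (B2:bag{T2:tuple(val2:int)});

-- CROSS pairs every row of data1 with every row of data2.
-- With one row on each side, the result is one row holding both bags.
both = CROSS data1, data2;

diff = FOREACH both {
  -- SetDifference expects its input bags to be sorted.
  S1 = ORDER B1 BY val1;
  S2 = ORDER B2 BY val2;
  -- The UDF is called inside GENERATE, not as a top-level statement.
  GENERATE SetDifference(S1, S2);
};
DUMP diff;
```

If either file has more than one row, CROSS produces every pairing of rows, so this only fits the single-row case described in the question.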

Hope this helps.
