
Filter bag by parent value in Apache PIG

I have the following relation in Apache PIG.

TSERIES: {ORDERED: {(timestamp: long,contentHost: chararray)},ts1: long}

And I want to do the following:

F = foreach TSERIES {
    ts = filter ORDERED by timestamp > TSERIES.ts1;
    generate ts;
}

In short, I want to keep all elements of bag ORDERED with a timestamp higher than ts1, but Pig won't allow it, specifically this part: ts = filter ORDERED by timestamp > TSERIES.ts1;

Is this possible? I'm using version 0.9.2-cdh4.0.1 (Cloudera).

Did you try:

Test = filter TSERIES by (ORDERED.timestamp > ts1);

I'm not sure if there's a way to do this without a UDF. It seems like there should be, but I can't figure it out either. Anyway, you could write a UDF to do this directly: go through the bag, filter out some elements, and return a bag. Or you could write a UDF to generate UUIDs, then flatten the bag and re-group it, something like this:

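-- tag each row with a unique id (myudfs.GenerateUUID is a UDF you would write)
a = foreach TSERIES generate ORDERED, ts1, myudfs.GenerateUUID() as id;
-- one output row per bag element; flattening exposes the tuple's fields
b = foreach a generate FLATTEN(ORDERED) as (timestamp, contentHost), ts1, id;
c = filter b by timestamp > ts1;
-- rebuild one bag per original row
d = group c by id;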
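
The first option mentioned above, a UDF that filters the bag directly, boils down to very little logic. Here is a minimal sketch of that logic in plain Python. It only models what such a UDF would do; it is not a ready-to-register Pig UDF. The function name filter_bag_after, the sample host names, the representation of a bag as a list of tuples, and the choice to return an empty bag on null input are all assumptions made for illustration.

```python
def filter_bag_after(ordered, ts1):
    """Keep the tuples of `ordered` whose first field (timestamp) is > ts1.

    `ordered` models the ORDERED bag as a list of (timestamp, contentHost)
    tuples; `ts1` models the parent row's ts1 field. Both names are
    hypothetical stand-ins for the fields in the question.
    """
    # Assumption: a null bag or null ts1 yields an empty bag.
    if ordered is None or ts1 is None:
        return []
    # Skip tuples with a null timestamp, keep strictly greater ones.
    return [t for t in ordered if t[0] is not None and t[0] > ts1]


# One modelled TSERIES row: (ORDERED, ts1)
bag = [(100, "a.example.com"), (200, "b.example.com"), (300, "c.example.com")]
print(filter_bag_after(bag, 150))
# prints [(200, 'b.example.com'), (300, 'c.example.com')]
```

One difference between the two options: a direct filter like this can return an empty bag for a row, whereas with flatten, filter and re-group, a row whose bag has no element newer than ts1 leaves nothing behind to group and disappears from the result.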
