How to dynamically get the top N% of records within a group in Pig

Question

I have a problem that I am not sure how to solve in Pig. I have a dataset on Hadoop (approx. 4 million records) which contains product titles by product category. Each title has the number of times it showed up on the web page, and the number of times it was clicked on to go to a product details page. The number of titles within a product category can vary.

Sample Data -

I want to get the top N% of records within each product category, based on the 3rd column (appearances on the web page). However, the N% has to vary based on the weight/importance of the category. E.g., for Video Games, I want to get the top 15% of records; for Camera & Photo, I want to get the top 5%, etc. Is there a way to dynamically set the % or integer value in the LIMIT clause within a nested FOREACH block of code in Pig?

PRODUCT_DATA = LOAD '<PRODUCT FILE PATH>' USING PigStorage('|') AS (categ_name:chararray, product_titl:chararray, impression_cnt:long, click_through_cnt:long);

GRP_PROD_DATA = GROUP PRODUCT_DATA BY categ_name;

TOP_PROD_LIST = FOREACH GRP_PROD_DATA {

                  SORTED_TOP_PROD = ORDER PRODUCT_DATA BY impression_cnt DESC;
                  SAMPLED_DATA = LIMIT SORTED_TOP_PROD <CATEGORY % OR INTEGER VALUE>;
                  GENERATE flatten(SAMPLED_DATA);
                }

STORE TOP_PROD_LIST INTO '<SOME PATH>' USING PigStorage('|');

How can I dynamically (by category) set the % or integer value for the given group? I thought of using a macro, but macros cannot be called from within a nested FOREACH block. Can I write a UDF which takes the category name as a parameter and outputs the % or integer value, and have this UDF called from a LIMIT operation?

SAMPLED_DATA = LIMIT SORTED_TOP_PROD categLimitVal(categ_name);

Any suggestions? I am using version 0.10 of Pig.

Answer 1

Something like this may work. However, I've never had the need to look up variable keys in a Pig map, and this other SO question doesn't have an answer, so you'll need to do some trial and error to make it work:

--Load your dynamic percentages as a map
A = LOAD 'percentages' AS (categ_name:chararray, perc:float);
PERCENTAGES = FOREACH A GENERATE TOMAP(categ_name, perc);

PRODUCT_DATA = LOAD ...;
GRP_PROD_DATA = GROUP PRODUCT_DATA BY categ_name;

--Count the elements per group; needed to calculate percentages
C = FOREACH GRP_PROD_DATA GENERATE FLATTEN(group) AS categ_name, COUNT(PRODUCT_DATA) AS count;
C_MAP = FOREACH C GENERATE TOMAP(categ_name, count);

TOP_PROD_LIST = FOREACH GRP_PROD_DATA {
    SORTED_TOP_PROD = ORDER PRODUCT_DATA BY impression_cnt DESC;
    SAMPLED_DATA = LIMIT SORTED_TOP_PROD (C_MAP#group * PERCENTAGES#group);
    GENERATE flatten(SAMPLED_DATA);
}

You could also try using Pig's TOP function instead of ORDER + LIMIT.
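The idea in this answer (look up each group's size and fraction, multiply them, then take that many top rows) can be checked outside Pig on a small in-memory dataset. The following is only a Python sketch of that logic, not Pig code: the function name `top_percent_per_group`, the tuple layout, and the choice to truncate the limit to an integer are all assumptions made for illustration.

```python
import heapq
from collections import defaultdict


def top_percent_per_group(rows, percentages):
    """Return the top rows per category, ordered by impression count.

    rows: iterable of (categ_name, product_titl, impression_cnt) tuples.
    percentages: dict mapping categ_name to the fraction to keep (e.g. 0.25).
    """
    # Equivalent of GROUP ... BY categ_name
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)

    result = {}
    for categ, members in groups.items():
        # Per-group limit = group size * that group's fraction.
        # Truncating to an int is an assumption of this sketch.
        limit = int(len(members) * percentages.get(categ, 0.0))
        # nlargest plays the role of ORDER ... DESC followed by LIMIT.
        result[categ] = heapq.nlargest(limit, members, key=lambda r: r[2])
    return result
```

Categories with no entry in `percentages` get a limit of zero here, so they return an empty list rather than raising an error.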

Answer 2

I think I solved it using a slightly different approach. I am not sure how optimized it is; maybe there is a better way to organize/optimize the script. Basically, if I rank the product titles within each category by impression count (highest first) and keep the rows where RANK <= SAMPLE LIMIT of the category, then I can simulate the dynamic sampling. The SAMPLE LIMIT is nothing but the COUNT of titles per category * the PERCENT WEIGHT defined per category. To rank the tuples, I am leveraging LinkedIn's DataFu open source jar, which provides an Enumerate UDF.

Again, if anyone has suggestions on improving/better organizing the code, I am all ears :) Thanks for your input Cabad, it really helped!

Script:

REGISTER '/tmp/udf/datafu-1.0.0.jar';
define Enumerate datafu.pig.bags.Enumerate('1');
set default_parallel 10;

LKP_DATA = LOAD '/tmp/lkp.dat' USING PigStorage('|') AS (categ_name:chararray, perc:float);
PRODUCT_DATA = LOAD '/tmp/meta.dat' USING PigStorage('|') AS (categ_name:chararray, product_titl:chararray, impression_cnt:long, click_through_cnt:long);

GRP_PROD_DATA = GROUP PRODUCT_DATA BY categ_name;

-- Count the titles in each category
CATEG_COUNT = FOREACH GRP_PROD_DATA generate FLATTEN(group) AS categ_name, COUNT(PRODUCT_DATA) as cnt;

CATEG_JOINED = JOIN CATEG_COUNT BY categ_name, LKP_DATA BY categ_name USING 'replicated';

CATEG_PERCENT = FOREACH CATEG_JOINED GENERATE CATEG_COUNT::categ_name AS categ_name, CATEG_COUNT::cnt AS record_cnt, LKP_DATA::perc AS  percentage;

PRCNT_PROD_DATA = JOIN PRODUCT_DATA BY categ_name, CATEG_PERCENT BY categ_name;

-- Attach each category's sample size (record count * percentage) to every product row
PRCNT_PROD_DATA = FOREACH PRCNT_PROD_DATA GENERATE PRODUCT_DATA::categ_name AS categ_name, PRODUCT_DATA::product_titl AS product_titl, PRODUCT_DATA::impression_cnt AS impression_cnt, PRODUCT_DATA::click_through_cnt AS click_through_cnt, CATEG_PERCENT::record_cnt*CATEG_PERCENT::percentage AS sample_size;

GRP_PRCNT_PROD_DATA = GROUP PRCNT_PROD_DATA BY categ_name;

ORDRD_PROD_LIST = FOREACH GRP_PRCNT_PROD_DATA {
                             SORTED_TOP_PROD = ORDER PRCNT_PROD_DATA BY impression_cnt DESC;
                             GENERATE flatten(SORTED_TOP_PROD);
                          }

GRP_PROD_LIST = GROUP ORDRD_PROD_LIST BY categ_name;

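-- Number the tuples in each category's bag starting at 1; the number is used as the rank
GRP_PRCNT_PROD_DATA = FOREACH GRP_PROD_LIST GENERATE flatten(Enumerate(ORDRD_PROD_LIST)) AS (categ_name, product_titl, impression_cnt, click_through_cnt, sample_size, rnk);

-- Keep only the rows whose rank falls within the category's sample size
SAMPLED_DATA = FILTER GRP_PRCNT_PROD_DATA BY rnk <= sample_size;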

SAMPLED_DATA = FOREACH SAMPLED_DATA GENERATE categ_name, product_titl, impression_cnt, click_through_cnt, rnk;

DUMP SAMPLED_DATA;
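To sanity-check what the script above should produce, the same steps (count per category, join with the lookup, compute the sample size, sort, enumerate from 1, filter on rank) can be reproduced on a handful of rows. This is a Python sketch of the logic only, not a replacement for the Pig script. The function name `sample_by_rank` and the dict standing in for the lookup file are hypothetical, and the sketch assumes the lookup stores fractions such as 0.15 rather than whole percentages such as 15, since the script multiplies the count by the value directly.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter


def sample_by_rank(products, lookup):
    """Keep the top-ranked rows per category, where rank <= count * fraction.

    products: list of (categ_name, product_titl, impression_cnt, click_through_cnt).
    lookup: dict of categ_name -> fraction to keep (stand-in for the lookup file).
    """
    # Titles per category (the CATEG_COUNT step).
    counts = Counter(p[0] for p in products)
    # Inner-join semantics: categories missing from the lookup are dropped.
    sample_size = {c: n * lookup[c] for c, n in counts.items() if c in lookup}

    # Sort by category, then by impressions descending within each category.
    ordered = sorted(
        (p for p in products if p[0] in sample_size),
        key=lambda p: (p[0], -p[2]),
    )

    sampled = []
    for categ, members in groupby(ordered, key=itemgetter(0)):
        # Enumerate from 1, mirroring Enumerate('1') in the script.
        for rnk, p in enumerate(members, start=1):
            if rnk <= sample_size[categ]:
                sampled.append(p + (rnk,))
    return sampled
```

Because the sample size stays fractional and is compared with an integer rank, the filter effectively rounds down: a category with 5 titles at a fraction of 0.5 keeps 2 rows, and a category whose count times fraction is below 1 keeps none.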

Answer 1: 2013-10-03 22:25:09
Answer 2: 2013-10-04 22:35:27
