PIG Script How to

I am trying to clean up this employee volunteer data. There is no way to track whether an employee is already a registered volunteer, so he can sign up as a new volunteer and get a new VOLUNTEER_ID. I have a data feed that lets me tie each VOLUNTEER_ID to its EMP_ID. The volunteer data needs to be cleaned up so we can figure out how an employee moved from one volunteer level to another, and when.

The business logic is that, when there are overlapping dates, we give the employee the highest level for the time frame between START_DATE and END_DATE.

I posted a sample of the input data and what the output should be.

Is it possible to do this in a PIG script? Can someone please help me?

INPUT:

EMP_ID  VOLUNTEER_ID    V_LEVEL STATUS  START_DATE  END_DATE
10001   100               1      A       1/1/2006   12/31/2007
10001   200               1      A       5/1/2006   
10001   100               1      A       1/1/2008   
10001   300               3      P       3/1/2008   3/1/2008
10001   300               3      A       3/2/2008   12/1/2008
10001   1001              2      A       5/1/2008   6/30/2008
10001   1001              3      A       7/1/2008   
10001   300               2      A       12/2/2008  

OUTPUT NEEDED: (VOLUNTEER_ID is not needed in the output; it is included below only to show which IDs were selected for output and which were not.)

EMP_ID  VOLUNTEER_ID    V_LEVEL STATUS  START_DATE  END_DATE
10001   100              1       A       1/1/2006   12/31/2007
10001   300              3       P       3/1/2008   3/1/2008
10001   300              3       A       3/2/2008   12/1/2008
10001   1001             2       A       5/1/2008   6/30/2008
10001   1001             3       A       7/1/2008   

It seems like you want the row in your data with the earliest start date for each combination of EMP_ID, VOLUNTEER_ID, V_LEVEL, and STATUS.

First we add a Unix time column and then find the minimum of that column (this requires a recent version of Pig, so you may need to update your version).

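-- Assumes START_DATE is loaded as a chararray in M/d/yyyy form, as in the sample; unix_time is in seconds.
data_with_unix = foreach data generate EMP_ID, VOLUNTEER_ID, V_LEVEL, STATUS, START_DATE, END_DATE, ToUnixTime(ToDate(START_DATE, 'M/d/yyyy')) as unix_time;
grp = group data_with_unix by (EMP_ID, VOLUNTEER_ID, V_LEVEL, STATUS);
-- Earliest start time per group.
min_date = foreach grp generate group, MIN(data_with_unix.unix_time) as min_unix_time;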

Then join the start and end dates back into your dataset, since there doesn't currently appear to be a way to convert Unix time back to a date.
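The join-back step could look roughly like the sketch below. It reuses the data_with_unix and grp relations defined earlier; the relation names earliest, joined, and result, and the field name min_unix_time, are made up for illustration and are not part of the original answer. It has not been run against the sample data.

```pig
-- Sketch only: earliest, joined, result and min_unix_time are illustrative names.
-- Flatten the group key so its fields can be used as join keys next to the minimum start time.
earliest = foreach grp generate flatten(group) as (EMP_ID, VOLUNTEER_ID, V_LEVEL, STATUS), MIN(data_with_unix.unix_time) as min_unix_time;

-- Keep only the rows whose start time equals the earliest start time of their group.
joined = join data_with_unix by (EMP_ID, VOLUNTEER_ID, V_LEVEL, STATUS, unix_time), earliest by (EMP_ID, VOLUNTEER_ID, V_LEVEL, STATUS, min_unix_time);

-- Project the original columns, which brings START_DATE and END_DATE back as they were loaded.
result = foreach joined generate data_with_unix::EMP_ID as EMP_ID, data_with_unix::VOLUNTEER_ID as VOLUNTEER_ID, data_with_unix::V_LEVEL as V_LEVEL, data_with_unix::STATUS as STATUS, data_with_unix::START_DATE as START_DATE, data_with_unix::END_DATE as END_DATE;
```

Joining on the Unix time as well as the group key means the original START_DATE and END_DATE strings are carried through unchanged, so no conversion from Unix time back to a date is needed.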
