简体   繁体   中英

how do I mark one row and store its particular value in Hive using standard Query or using UDF?

I need to write a query in Hive or define a function that needs to do the followings:

The dataset:

Student || Time    || ComuputerPool
-------------------------------------
  A     ||  9:15AM ||  Pool1.Machine2
-------------------------------------
  A     ||  9:45AM ||  Pool1.Machine7
-------------------------------------
  A     ||  10:15AM||  Pool1.Machine9
-------------------------------------
  A     ||  11:00AM||  Pool2.Machine2
-------------------------------------
  A     ||  12:05  ||  Pool2.Machine3
-------------------------------------
  A     ||  12:40  ||  Pool3.Machine5
------------------------------------- 
  A     ||  13:10  ||  Pool1.Machine3
-------------------------------------
  A     ||  13:50  ||  Pool1.Machine10
-------------------------------------
  B   ..........................

so now the query should find out how long a particular student has spent in a particular computer pool by calculating the difference of when he first used a machine in a pool and when he first start using a machine in another pool. So this example the time he spent would be the difference of : 11:00AM - 9:15AM = 1Hour45Mins

My question here is how am I going to mark the first use in one store time value and use it later when I find the next pool data.

Hive supports row_number() , so you can use a trick for grouping sequential values. The difference between two row numbers defines a group, which can then be used for aggregation. The resulting query looks like:

select student, grp, min(time), max(time)
from (select t.*,
             (row_number() over (partition by student order by time) -
              row_number() over (partition by student, computerpool order by time)
             ) as grp
      from dataset t
     ) t
group by student, grp, computerpool;

EDIT:

Your question is about pools not machines . So, you only care about the part before the period, presumably. (By the way, you should store this information in separate columns).

select student, substr(computerpool, 1, 6), min(time), max(time)
from (select t.*,
             (row_number() over (partition by student order by time) -
              row_number() over (partition by student, substr(computerpool, 1, 6) order by time)
             ) as grp
      from dataset t
     ) t
group by student, grp, substr(computerpool, 1, 6);

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM