I am confused about my data model, i.e. whether I need a UDAF to solve this and how Hive splits the task across mappers and reducers.
Problem statement: I need to compute an average per user (a slightly different formula than the built-in AVG, so it needs to be custom), based on events from the last 30 days looking back in time. My data has the form:
userid date counts
user1 day30 34
user1 day30 23
user1 day4 22
user1 day1 21
user2 day30 23
user2 day23 12
usern ....
What I need as output is the following:
user1 avg: (34+23+22+21...)/30
user2 avg: (23+12...)/30
What is the most memory-efficient way to approach this problem? I probably need a UDF, but how does that work for a user with many rows? Does the UDAF handle this per user, or do I need to do something to force all rows belonging to one user into a single reducer? Thanks!
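To the question of how an aggregate handles many rows per user: Hive shuffles rows on the GROUP BY key, so all rows for one userid reach the same reducer, and a UDAF only keeps a small partial buffer (not all rows) in memory, merging map-side partials on the reduce side. A minimal sketch of that iterate/merge/terminate lifecycle in Python, assuming the custom formula is simply sum(counts) divided by a fixed 30-day window (the function names mirror the Hive UDAF interface but this is illustrative, not Hive code):

```python
# Sketch of UDAF-style partial aggregation for one user's rows.
# Only a running sum is held in memory, never the full row set.

def iterate(buffer, count):
    # Each mapper accumulates a partial sum for the rows it sees.
    buffer["sum"] += count
    return buffer

def merge(buf_a, buf_b):
    # The reducer combines partial buffers; all rows for one userid
    # arrive here because Hive shuffles on the GROUP BY key.
    return {"sum": buf_a["sum"] + buf_b["sum"]}

def terminate(buffer, window_days=30):
    # Final value: divide by the fixed look-back window, not the row count.
    return buffer["sum"] / window_days

# Two map-side partials for user1's four rows (34, 23 and 22, 21):
part1 = iterate(iterate({"sum": 0}, 34), 23)
part2 = iterate(iterate({"sum": 0}, 22), 21)
print(terminate(merge(part1, part2)))  # 100 / 30
```

The key point is that memory use per user is bounded by the buffer size, not the number of rows.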
I don't see what the challenge is here; I used the sample data below:
userid,date1,counts
user1,day30,34
user1,day30,23
user1,day4,22
user1,day1,21
user2,day30,23
user2,day23,12
and below is the table definition:
create external table table1 (
userid string,
date1 string,
counts int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/tmp/tempd';
and the query is:
select userid , sum(counts)/30
from table1
group by userid;
Output
user1 3.3333333333333335
user2 1.1666666666666667
Let me know if my assumption is wrong.