I am confused about my data model, i.e. whether I need a UDAF to solve this and how Hive splits the task across mappers and reducers.
Problem statement: I need to compute an average per user (a slightly different formula than the built-in AVG, so it needs to be custom), based on events from the last 30 days looking back in time. My data has the form:
userid date counts
user1 day30 34
user1 day30 23
user1 day4 22
user1 day1 21
user2 day30 23
user2 day23 12
usern ....
What I need as output is the following:
user1 avg: (34+23+22+21...)/30
user2 avg: (23+12...)/30
What is the most memory-efficient way to approach this problem? I probably need a UDF, but how does that work for a user with many rows? Does the UDAF handle this per user, or do I need to do something to force all rows belonging to one user into a single reducer? Thanks!
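To the question of how an aggregate handles many rows per user: Hive shuffles rows on the GROUP BY key, so all rows for one userid reach the same reducer, and a UDAF only keeps a small partial buffer (not all rows) in memory, merging map-side partials on the reduce side. A minimal sketch of that iterate/merge/terminate lifecycle in Python, assuming the custom formula is simply sum(counts) divided by a fixed 30-day window (the function names mirror the Hive UDAF interface but this is illustrative, not Hive code):

```python
# Sketch of UDAF-style partial aggregation for one user's rows.
# Only a running sum is held in memory, never the full row set.

def iterate(buffer, count):
    # Each mapper accumulates a partial sum for the rows it sees.
    buffer["sum"] += count
    return buffer

def merge(buf_a, buf_b):
    # The reducer combines partial buffers; all rows for one userid
    # arrive here because Hive shuffles on the GROUP BY key.
    return {"sum": buf_a["sum"] + buf_b["sum"]}

def terminate(buffer, window_days=30):
    # Final value: divide by the fixed look-back window, not the row count.
    return buffer["sum"] / window_days

# Two map-side partials for user1's four rows (34, 23 and 22, 21):
part1 = iterate(iterate({"sum": 0}, 34), 23)
part2 = iterate(iterate({"sum": 0}, 22), 21)
print(terminate(merge(part1, part2)))  # 100 / 30
```

The key point is that memory use per user is bounded by the buffer size, not the number of rows.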
I don't see what the challenge is here; I used the sample data below:
userid,date1,counts
user1,day30,34
user1,day30,23
user1,day4,22
user1,day1,21
user2,day30,23
user2,day23,12
and below is the table definition:
create external table table1 (
userid string,
date1 string,
counts int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/tmp/tempd';
and the query is:
select userid , sum(counts)/30
from table1
group by userid;
Output
user1 3.3333333333333335
user2 1.1666666666666667
Let me know if my assumption is wrong.