hive sql aggregate

Question

I have two tables in Hive, t1 and t2

>describe t1;
>date_id    string

>describe t2;
>messageid string,
 createddate string,
 userid int

> select * from t1 limit 3;        
> 2011-01-01 00:00:00 
  2011-01-02 00:00:00 
  2011-01-03 00:00:00 

> select * from t2 limit 3;
87211389    2011-01-03 23:57:01 13864753
87211656    2011-01-03 23:57:59 13864769
87211746    2011-01-03 23:58:25 13864785

For a given date, I want to count the distinct userid values over that date and the two days before it.
For example, for date 2011-01-03, I want to count distinct userid from 2011-01-01 to 2011-01-03.
For date 2011-01-04, I want to count distinct userid from 2011-01-02 to 2011-01-04.

I wrote the following query, but it does not return the three-day result. It returns the distinct userid count per day instead.

SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2 
ON (to_date(t2.createddate) = to_date(t1.date_id))  
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3) 
AND to_date(t2.createddate) <= to_date(t1.date_id) 
GROUP by to_date(t1.date_id);

`to_date()` and `date_sub()` are date functions in Hive.

In other words, the following part has no effect:

WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3) 
AND to_date(t2.createddate) <= to_date(t1.date_id)

EDIT: One solution can be (but it is super slow):

SELECT to_date(t3.date_id), count(distinct t3.userid) FROM
(
 SELECT * FROM t1  LEFT OUTER JOIN t2
 WHERE 
 (date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
  AND to_date(t2.createddate) <= to_date(t1.date_id)
 )
) t3 
GROUP by to_date(t3.date_id);

UPDATE: Thanks for all the answers. They are good.
But Hive is a bit different from standard SQL and, unfortunately, they cannot be used in Hive. My current solution is to use UNION ALL.

 SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = to_date(t2.createddate))
 UNION ALL
 SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 1))
 UNION ALL
 SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 2))

Then I group by date and count the distinct userid values, which gives me what I want.
Although it is not elegant, it is much more efficient than a cross join.
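
A minimal sketch of how the UNION ALL approach might look with the GROUP BY and count written out. The aliases u, date_id and users_3d are made up for illustration, and the UNION ALL is wrapped in a subquery because older Hive releases only accept it there.

```sql
-- Each branch matches t2 rows that are 0, 1 or 2 days before the t1 date,
-- using only equality joins.
SELECT u.date_id, COUNT(DISTINCT u.userid) AS users_3d
FROM (
  SELECT to_date(t1.date_id) AS date_id, t2.userid
  FROM t1 JOIN t2 ON (to_date(t1.date_id) = to_date(t2.createddate))
  UNION ALL
  SELECT to_date(t1.date_id) AS date_id, t2.userid
  FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 1))
  UNION ALL
  SELECT to_date(t1.date_id) AS date_id, t2.userid
  FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 2))
) u
GROUP BY u.date_id;
```

Selecting only the date and userid in each branch, rather than *, keeps the intermediate rows small before the distinct count.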

Answer 1

The following should work in standard SQL...

SELECT
  to_date(t1.date_id),
  count(distinct t2.userid)
FROM
  t1
LEFT JOIN
  t2
    ON  to_date(t2.createddate) >= date_sub(to_date(t1.date_id), 2)
    AND to_date(t2.createddate) <  date_add(to_date(t1.date_id), 1)
GROUP BY
  to_date(t1.date_id)

It will, however, be slow, because you are storing dates as strings and then using to_date() to convert them to dates. This means that indexes can't be used, and the SQL engine can't do anything clever to reduce the effort being expended.

As a result, every possible combination of rows needs to be compared. If you have 100 entries in T1 and 10,000 entries in T2, your SQL engine is processing a million combinations.

If you store these values as dates, you don't need to_date() . And if you index the dates, the SQL engine can quickly home in on the range of dates being specified.

NOTE: The format of the ON clause means that you do not need to round t2.createddate down to a daily value.
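
On the point about not calling to_date() over and over: one option in Hive, sketched here under assumptions, is to compute the day once and store it. The table name t2_daily and the column created_day are hypothetical, not part of the original schema.

```sql
-- Hypothetical helper table: one row per (day, user) that appears in t2.
CREATE TABLE t2_daily AS
SELECT DISTINCT
  to_date(createddate) AS created_day,
  userid
FROM t2;
```

Later queries could then compare created_day directly, and the helper table should be smaller than t2 whenever users post more than one message per day.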

EDIT Why your code didn't work...

SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2 
ON (to_date(t2.createddate) = to_date(t1.date_id))  
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3) 
AND to_date(t2.createddate) <= to_date(t1.date_id) 
GROUP by to_date(t1.date_id);

This joins t1 to t2 with an ON clause of (to_date(t2.createddate) = to_date(t1.date_id)). Because it is an inner join, every row that survives has a t2.createddate on the same day as t1.date_id. (With a LEFT OUTER JOIN, t2.createddate could also be NULL where nothing matched.)

The WHERE clause allows a much wider range (3 days), but the ON clause of the JOIN has already restricted your data down to a single day.

The example I gave above simply takes your WHERE clause and puts it in place of the old ON clause.

EDIT

Hive doesn't allow <= and >= in the ON clause? Are you really locked in to using Hive?

If you really are, what about BETWEEN?

SELECT
  to_date(t1.date_id),
  count(distinct t2.userid)
FROM
  t1
LEFT JOIN
  t2
    ON to_date(t2.createddate) BETWEEN date_sub(to_date(t1.date_id), 2) AND to_date(t1.date_id)
GROUP BY
  to_date(t1.date_id)

Alternatively, refactor your table of dates to enumerate the dates you want to include...

TABLE t1 (calendar_date, inclusive_date) =
{ 2011-01-03, 2011-01-01
  2011-01-03, 2011-01-02
  2011-01-03, 2011-01-03

  2011-01-04, 2011-01-02
  2011-01-04, 2011-01-03
  2011-01-04, 2011-01-04

  2011-01-05, 2011-01-03
  2011-01-05, 2011-01-04
  2011-01-05, 2011-01-05 }

SELECT
  to_date(t1.calendar_date),
  count(distinct t2.userid)
FROM
  t1
LEFT JOIN
  t2
    ON to_date(t2.createddate) = to_date(t1.inclusive_date)
GROUP BY
  to_date(t1.calendar_date)
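
The enumerated table does not have to be filled in by hand. As a sketch, assuming the original single-column t1 and a Hive version that supports LATERAL VIEW with explode(), the (calendar_date, inclusive_date) pairs could be derived like this. The aliases d and n are illustrative.

```sql
-- For every date in t1, emit that date plus the two days before it.
SELECT
  to_date(t1.date_id)                AS calendar_date,
  date_sub(to_date(t1.date_id), d.n) AS inclusive_date
FROM t1
LATERAL VIEW explode(array(0, 1, 2)) d AS n;
```

For 2011-01-03 this produces the pairs (2011-01-03, 2011-01-03), (2011-01-03, 2011-01-02) and (2011-01-03, 2011-01-01), matching the layout shown above.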

Answer 2

You need a subquery.

Try something like this (I cannot test it because I don't have Hive):

SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2 
ON (to_date(t2.createddate) = to_date(t1.date_id))  
WHERE t2.messageid in 
    (
    select t2.messageid from t2 where 
    date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3) 
    AND 
    to_date(t2.createddate) <= to_date(t1.date_id) 
   )
GROUP by to_date(t1.date_id);

The key is that with the subquery, the right records in t2 are selected FOR EACH date in t1.

EDIT:

To force the subquery into the FROM clause, you could try this:

SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN 

(select userid, createddate  from t2 where 

    date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3) 
    AND 
    to_date(t2.createddate) <= to_date(t1.date_id) 
) as t2

ON (to_date(t2.createddate) = to_date(t1.date_id))  

GROUP by to_date(t1.date_id);

But I don't know whether it will work.

Answer 3

I am assuming that t1 is used to define the 3-day period. I suspect the puzzling approach is due to Hive's shortcomings. This allows you to have an arbitrary number of 3-day periods. Try the following 2 queries:

SELECT substring(t1.date_id,1,10), count(distinct t2.userid) 
FROM t1 
JOIN t2 
ON substring(t2.createddate,1,10) >= date_sub(substring(t1.date_id,1,10), 2) 
AND substring(t2.createddate,1,10) <=  substring(t1.date_id,1,10) 
GROUP BY t1.date_id

--or--

SELECT substring(t1.date_id,1,10), count(distinct t2.userid) 
FROM t1 
JOIN t2 
ON t2.createddate like concat(substring(t1.date_id, 1, 10), '%')
OR t2.createddate like concat(substring(date_sub(t1.date_id, 1), 1, 10), '%')
OR t2.createddate like concat(substring(date_sub(t1.date_id, 2), 1, 10), '%')
GROUP BY t1.date_id

The latter minimizes the function calls on the t2 table. I am also assuming that t1 is the smaller of the two. substring should return the same result as to_date. According to the documentation, https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions , to_date returns a string data type. Support for date data types seems minimal, but I am not familiar with Hive.
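
A quick way to check the substring/to_date claim on your own data, as a sketch. The column aliases are made up for illustration.

```sql
-- Both expressions should give the same yyyy-MM-dd day for each row.
SELECT
  createddate,
  substring(createddate, 1, 10) AS day_by_substring,
  to_date(createddate)          AS day_by_to_date
FROM t2
LIMIT 3;
```

For the three sample rows in the question, both columns would read 2011-01-03.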

Answer 4

1. I am not familiar with Hive.

2. You could try using a subquery in the FROM clause:

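-- DISTINCT is needed because a user active on several days of the window
-- appears once per day in X.
SELECT  T1.date_id, COUNT(DISTINCT X.userid) AS UserCount
FROM    T1
LEFT OUTER JOIN
(
    -- one row per (day, user)
    SELECT  TO_DATE(createddate) AS date_id, userid
    FROM    T2
    GROUP BY TO_DATE(createddate), userid
) X ON DATE_SUB(TO_DATE(T1.date_id),2) <= X.date_id AND X.date_id <= TO_DATE(T1.date_id)
GROUP BY T1.date_id;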

4 answers

solution1
11 ACCEPTED 2011-10-05 22:07:12

solution2
3 2011-09-29 18:23:21

solution3
2 2011-10-05 22:33:00

solution4
1 2011-10-06 09:11:23
