I'm newbie to Hivesql. I have a raw table with 6 million records like this:
I want to count the number of IP_address access to each Modem_id everyweek. The result table I want will be like this:
I did it with left join, and it worked. But since using join will be time-consuming, I want do it with case when statement - but I can't write a correct statement. Do you have any ideas?
This is the join statement I used:
select a.modem_id,
a.Number_of_IP_in_Day_1,
b.Number_of_IP_in_Day_2
from
(select modem_id,
count(distinct ip_address) as Number_of_IP_in_Day_1
from F_ACS_DEVICE_INFORMATION_NEW
where day=1
group by modem_id) a
left join
(select modem_id,
count(distinct param_value) as Number_of_IP_in_Day_2
from F_ACS_DEVICE_INFORMATION_NEW
where day=2
group by modem_id) b
on a.modem_id= b.modem_id;
Based on your question and further comments, you would like
eg, result would be 5 columns
My answer here is based on knowledge of SQL - I haven't used Hive but it appears to support the things I use (eg, CTEs). You may need to tweak the answer a bit.
The first key step is to turn the day_number into a week_number. A straightforward way to do this is FLOOR((day_num-1)/7)+1
so days 1-7 become week 1, days 8-14 become week2, etc.
Note - it is up to you to make sure the day_nums are correct. I would guess you'd actually want info the the last 4 weeks, not the first four weeks of data - and as such you'd probably calculate the day_num as something like SELECT DATEDIFF(day, IP_access_date, CAST(getdate() AS date))
- whatever the equivalent is in Hive.
There are a few ways to do this - I think the clearest is to use a CTE to convert your dataset to what you need eg,
COUNT(DISTINCT ...)
- I assume this is what you want) - I'm doing this with SELECT DISTINCT (rather than grouping by all fields)From there, you could PIVOT the data to get it into your table, or just use SUM of CASE statements. I'll use SUM of CASE here as I think it's clearer to understand.
WITH IPs_per_week AS
(SELECT DISTINCT
modem_id,
ip_address,
FLOOR((day-1)/7)+1 AS week_num -- Note I've referred to it as day_num in text for clarity
FROM F_ACS_DEVICE_INFORMATION_NEW
)
SELECT modem_id,
SUM(CASE WHEN week_num = 1 THEN 1 ELSE 0 END) AS IPs_access_week1,
SUM(CASE WHEN week_num = 2 THEN 1 ELSE 0 END) AS IPs_access_week2,
SUM(CASE WHEN week_num = 3 THEN 1 ELSE 0 END) AS IPs_access_week3,
SUM(CASE WHEN week_num = 4 THEN 1 ELSE 0 END) AS IPs_access_week4
FROM IPs_per_week
GROUP BY modem_id;
You can express your logic using just aggregatoin:
select a.modem_id,
count(distinct case when date = 1 then ip_address end) as day_1,
count(distinct case when date = 2 then ip_address end) as day_2
from F_ACS_DEVICE_INFORMATION_NEW a
group by a.modem_id;
You can obviously extend this for more days.
Note: As your question and code are written, this assumes that your base table has data for only one week. Otherwise, I would expect some date filtering. Presumably, that is what the _NEW
suffix means on the table name.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.