
How to use COUNT with CASE WHEN

I'm new to Hive SQL. I have a raw table with 6 million records like this:

[image: data table]

I want to count the number of IP_address values that access each Modem_id every week. The result table I want looks like this:

[image: result table]

I did it with a left join, and it worked. But since the join is time-consuming, I want to do it with a CASE WHEN statement instead, and I can't write a correct one. Do you have any ideas?

This is the join statement I used:

select a.modem_id, 
       a.Number_of_IP_in_Day_1, 
       b.Number_of_IP_in_Day_2
from 
(select modem_id,
        count(distinct ip_address) as Number_of_IP_in_Day_1 
  from F_ACS_DEVICE_INFORMATION_NEW 
  where day=1
  group by modem_id) a 
left join 
(select modem_id,
        count(distinct ip_address) as Number_of_IP_in_Day_2 
  from F_ACS_DEVICE_INFORMATION_NEW 
  where day=2
  group by modem_id) b 
on a.modem_id= b.modem_id; 

Based on your question and further comments, you would like

  • The number of different IP addresses accessed by each modem
  • In counts by week (as columns) for 4 weeks

e.g., the result would have 5 columns:

  • modem_id
  • IPs_access_week1
  • IPs_access_week2
  • IPs_access_week3
  • IPs_access_week4

My answer here is based on knowledge of SQL - I haven't used Hive but it appears to support the things I use (eg, CTEs). You may need to tweak the answer a bit.

The first key step is to turn the day number (day_num) into a week number. A straightforward way to do this is FLOOR((day_num-1)/7)+1, so days 1-7 become week 1, days 8-14 become week 2, etc.
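
As a quick check of that arithmetic, here is a minimal Python sketch of the same mapping. The function name week_num is only an illustration, and Python's integer division `//` stands in for FLOOR(... / 7); it assumes day numbers start at 1.

```python
def week_num(day_num: int) -> int:
    """Map a 1-based day number to a 1-based week number."""
    # Subtracting 1 first makes days 1-7 fall into the same bucket (week 1).
    return (day_num - 1) // 7 + 1


# Print the boundaries between weeks to confirm the buckets.
for d in (1, 7, 8, 14, 15, 28):
    print(f"day {d} -> week {week_num(d)}")
```

Without the `- 1`, day 7 would land in week 2, which is the off-by-one the formula is there to avoid.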

Note - it is up to you to make sure the day_nums are correct. I would guess you'd actually want info for the last 4 weeks, not the first four weeks of data - and as such you'd probably calculate the day_num as something like SELECT DATEDIFF(day, IP_access_date, CAST(getdate() AS date)) - whatever the equivalent is in Hive.
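
To illustrate counting weeks back from today, here is a small Python sketch (not Hive). The names days_ago and weeks_back are hypothetical, and it assumes that a difference of 0 days (an access made today) should count as day 1, so 1 is added before applying the week formula.

```python
from datetime import date


def days_ago(access_date: date, today: date) -> int:
    """Whole days between access_date and today (0 means today)."""
    return (today - access_date).days


def weeks_back(access_date: date, today: date) -> int:
    """1 = the most recent 7 days, 2 = the 7 days before that, and so on."""
    # Assumption: shift by 1 so that today is day_num 1 rather than 0.
    day_num = days_ago(access_date, today) + 1
    return (day_num - 1) // 7 + 1


today = date(2024, 1, 29)
print(weeks_back(date(2024, 1, 29), today))  # accessed today
print(weeks_back(date(2024, 1, 22), today))  # accessed 7 days ago
```

If the date difference is used directly as day_num, an access made today gets day 0 and falls into "week 0", so it is worth checking which convention your data uses.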

There are a few ways to do this - I think the clearest is to use a CTE to convert your dataset to what you need, e.g.:

  • convert day_nums to weeknums
  • get rid of duplicates within the week (your code has COUNT(DISTINCT ...) - I assume this is what you want) - I'm doing this with SELECT DISTINCT (rather than grouping by all fields)

From there, you could PIVOT the data to get it into your table, or just use SUM of CASE statements. I'll use SUM of CASE here as I think it's clearer to understand.

WITH IPs_per_week AS
    (SELECT DISTINCT 
            modem_id,
            ip_address,
            FLOOR((day-1)/7)+1 AS week_num    -- Note I've referred to it as day_num in text for clarity
     FROM   F_ACS_DEVICE_INFORMATION_NEW
    )
SELECT modem_id,
       SUM(CASE WHEN week_num = 1 THEN 1 ELSE 0 END) AS IPs_access_week1,
       SUM(CASE WHEN week_num = 2 THEN 1 ELSE 0 END) AS IPs_access_week2,
       SUM(CASE WHEN week_num = 3 THEN 1 ELSE 0 END) AS IPs_access_week3,
       SUM(CASE WHEN week_num = 4 THEN 1 ELSE 0 END) AS IPs_access_week4
FROM   IPs_per_week
GROUP BY modem_id;
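
To try the query without a Hive cluster, here is a minimal sketch that runs it against an in-memory SQLite database from Python. SQLite is only a stand-in for Hive here, and the sample rows are made up. One deliberate change: SQLite's `/` on integers is already integer division, so the sketch writes (day - 1) / 7 + 1 without FLOOR, whereas in Hive `/` returns a double and the FLOOR is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE F_ACS_DEVICE_INFORMATION_NEW "
    "(modem_id TEXT, ip_address TEXT, day INTEGER)"
)
# Made-up sample data: (modem_id, ip_address, day)
rows = [
    ("m1", "10.0.0.1", 1),
    ("m1", "10.0.0.1", 2),   # same IP again in week 1, counted once
    ("m1", "10.0.0.2", 3),
    ("m1", "10.0.0.1", 8),   # week 2
    ("m2", "10.0.0.9", 15),  # week 3
    ("m2", "10.0.0.8", 22),  # week 4
    ("m2", "10.0.0.9", 23),  # week 4
]
conn.executemany(
    "INSERT INTO F_ACS_DEVICE_INFORMATION_NEW VALUES (?, ?, ?)", rows
)

query = """
WITH IPs_per_week AS
    (SELECT DISTINCT
            modem_id,
            ip_address,
            (day - 1) / 7 + 1 AS week_num   -- integer division in SQLite
     FROM   F_ACS_DEVICE_INFORMATION_NEW
    )
SELECT modem_id,
       SUM(CASE WHEN week_num = 1 THEN 1 ELSE 0 END) AS IPs_access_week1,
       SUM(CASE WHEN week_num = 2 THEN 1 ELSE 0 END) AS IPs_access_week2,
       SUM(CASE WHEN week_num = 3 THEN 1 ELSE 0 END) AS IPs_access_week3,
       SUM(CASE WHEN week_num = 4 THEN 1 ELSE 0 END) AS IPs_access_week4
FROM   IPs_per_week
GROUP BY modem_id
ORDER BY modem_id
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The SELECT DISTINCT in the CTE is what keeps the repeated (m1, 10.0.0.1) pair in week 1 from being counted twice by the SUM.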

You can express your logic using just aggregation:

select a.modem_id, 
       count(distinct case when day = 1 then ip_address end) as day_1,
       count(distinct case when day = 2 then ip_address end) as day_2
from F_ACS_DEVICE_INFORMATION_NEW a
group by a.modem_id;

You can obviously extend this for more days.

Note: As your question and code are written, this assumes that your base table has data for only one week. Otherwise, I would expect some date filtering. Presumably, that is what the _NEW suffix means on the table name.
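
For anyone who wants to compare the two approaches side by side, here is a minimal sketch using an in-memory SQLite database from Python as a stand-in for Hive. The sample rows are made up, and the sketch assumes the filter column is day (as in the question's join) and that both subqueries count ip_address.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE F_ACS_DEVICE_INFORMATION_NEW "
    "(modem_id TEXT, ip_address TEXT, day INTEGER)"
)
# Made-up sample data: modem m2 has no rows for day 1.
conn.executemany(
    "INSERT INTO F_ACS_DEVICE_INFORMATION_NEW VALUES (?, ?, ?)",
    [
        ("m1", "10.0.0.1", 1),
        ("m1", "10.0.0.1", 1),  # duplicate, counted once by DISTINCT
        ("m1", "10.0.0.2", 1),
        ("m1", "10.0.0.3", 2),
        ("m2", "10.0.0.9", 2),
    ],
)

# Conditional aggregation: a single pass over the table.
agg_query = """
select a.modem_id,
       count(distinct case when day = 1 then ip_address end) as day_1,
       count(distinct case when day = 2 then ip_address end) as day_2
from F_ACS_DEVICE_INFORMATION_NEW a
group by a.modem_id
order by a.modem_id
"""

# The join-based version from the question, for comparison.
join_query = """
select a.modem_id, a.day_1, b.day_2
from (select modem_id, count(distinct ip_address) as day_1
      from F_ACS_DEVICE_INFORMATION_NEW where day = 1 group by modem_id) a
left join
     (select modem_id, count(distinct ip_address) as day_2
      from F_ACS_DEVICE_INFORMATION_NEW where day = 2 group by modem_id) b
on a.modem_id = b.modem_id
order by a.modem_id
"""

agg_result = conn.execute(agg_query).fetchall()
join_result = conn.execute(join_query).fetchall()
print("aggregation:", agg_result)
print("left join:  ", join_result)
```

The two agree for modems that have rows on day 1. They differ at the edges: the aggregation version also returns a modem with no day-1 rows (with a count of 0), while the left join drops it because the day-1 subquery is the driving side.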

