How to use count with case when

I'm a newbie to Hive SQL. I have a raw table with 6 million records like this:

[Image: data table]

I want to count the number of IP_address accesses to each Modem_id every week. The result table I want looks like this:

[Image: result table]

I did it with a left join, and it worked. But since a join is time-consuming, I want to do it with a case when statement - but I can't write a correct one. Do you have any ideas?

This is the join statement I used:

select a.modem_id, 
       a.Number_of_IP_in_Day_1, 
       b.Number_of_IP_in_Day_2
from 
(select modem_id,
        count(distinct ip_address) as Number_of_IP_in_Day_1 
  from F_ACS_DEVICE_INFORMATION_NEW 
  where day=1
  group by modem_id) a 
left join 
(select modem_id,
        count(distinct param_value) as Number_of_IP_in_Day_2 
  from F_ACS_DEVICE_INFORMATION_NEW 
  where day=2
  group by modem_id) b 
on a.modem_id= b.modem_id; 

Based on your question and further comments, you would like

  • The number of different IP addresses accessed by each modem
  • In counts by week (as columns) for 4 weeks

e.g., the result would have 5 columns

  • modem_id
  • IPs_accessed_week1
  • IPs_accessed_week2
  • IPs_accessed_week3
  • IPs_accessed_week4

My answer here is based on knowledge of SQL - I haven't used Hive, but it appears to support the things I use (e.g., CTEs). You may need to tweak the answer a bit.

The first key step is to turn the day_number into a week_number. A straightforward way to do this is FLOOR((day_num-1)/7)+1, so days 1-7 become week 1, days 8-14 become week 2, etc.
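As a quick check of that formula outside the database, here is a minimal Python sketch. The function name `week_number` is only an illustration and is not part of the original query; it assumes day numbers start at 1.

```python
import math


def week_number(day_num: int) -> int:
    """Map a 1-based day number to a 1-based week number.

    Mirrors the SQL expression FLOOR((day_num - 1) / 7) + 1.
    """
    return math.floor((day_num - 1) / 7) + 1


# Print the mapping at the week boundaries.
for day_num in (1, 7, 8, 14, 15, 28):
    print(day_num, "->", week_number(day_num))
```

Subtracting 1 before dividing is what keeps day 7 in week 1 rather than pushing it into week 2.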

Note - it is up to you to make sure the day_nums are correct. I would guess you'd actually want info for the last 4 weeks, not the first four weeks of data - and as such you'd probably calculate the day_num as something like SELECT DATEDIFF(day, IP_access_date, CAST(getdate() AS date)) - or whatever the equivalent is in Hive.
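To make the "days back from today" idea concrete, here is a small Python sketch of the same arithmetic. The helper name `day_num` and the sample dates are assumptions for illustration only; in the real query the value would come from a date column such as the IP_access_date mentioned above.

```python
from datetime import date


def day_num(access_date: date, today: date) -> int:
    """Number of whole days between the access date and today."""
    return (today - access_date).days


# Hypothetical dates, chosen only to show the arithmetic.
today = date(2024, 1, 29)
print(day_num(date(2024, 1, 28), today))  # accessed yesterday
print(day_num(date(2024, 1, 1), today))   # accessed four weeks ago
```

With this definition an access made today has day_num 0, which FLOOR((day_num-1)/7)+1 maps to week 0, so you may want to add 1 to the difference before applying the week formula. Counting backwards also means week 1 is the most recent week.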

There are a few ways to do this - I think the clearest is to use a CTE to convert your dataset to what you need, e.g.,

  • convert day_nums to week_nums
  • get rid of duplicates within the week (your code has COUNT(DISTINCT ...) - I assume this is what you want) - I'm doing this with SELECT DISTINCT (rather than grouping by all fields)

From there, you could PIVOT the data to get it into your table, or just use SUM of CASE statements. I'll use SUM of CASE here as I think it's clearer to understand.

WITH IPs_per_week AS
    (SELECT DISTINCT 
            modem_id,
            ip_address,
            FLOOR((day-1)/7)+1 AS week_num    -- Note I've referred to it as day_num in text for clarity
     FROM   F_ACS_DEVICE_INFORMATION_NEW
    )
SELECT modem_id,
       SUM(CASE WHEN week_num = 1 THEN 1 ELSE 0 END) AS IPs_accessed_week1,
       SUM(CASE WHEN week_num = 2 THEN 1 ELSE 0 END) AS IPs_accessed_week2,
       SUM(CASE WHEN week_num = 3 THEN 1 ELSE 0 END) AS IPs_accessed_week3,
       SUM(CASE WHEN week_num = 4 THEN 1 ELSE 0 END) AS IPs_accessed_week4
FROM   IPs_per_week
GROUP BY modem_id;
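To sanity-check the shape of this query on a handful of rows, here is a sketch that runs the same pattern in an in-memory SQLite database from Python. This is a stand-in, not Hive: the table name `device_info`, the sample rows, and the function name are all made up for illustration. SQLite's integer division is used in place of FLOOR, since FLOOR is not guaranteed to be available in every SQLite build.

```python
import sqlite3


def weekly_ip_counts(rows):
    """Return (modem_id, week1, week2, week3, week4) distinct-IP counts."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical stand-in for F_ACS_DEVICE_INFORMATION_NEW.
    conn.execute(
        "CREATE TABLE device_info (modem_id TEXT, ip_address TEXT, day INTEGER)"
    )
    conn.executemany("INSERT INTO device_info VALUES (?, ?, ?)", rows)
    query = """
        WITH ips_per_week AS (
            -- DISTINCT removes repeat visits to the same IP within a week.
            -- (day - 1) / 7 is integer division in SQLite, standing in for FLOOR.
            SELECT DISTINCT modem_id, ip_address, (day - 1) / 7 + 1 AS week_num
            FROM device_info
        )
        SELECT modem_id,
               SUM(CASE WHEN week_num = 1 THEN 1 ELSE 0 END),
               SUM(CASE WHEN week_num = 2 THEN 1 ELSE 0 END),
               SUM(CASE WHEN week_num = 3 THEN 1 ELSE 0 END),
               SUM(CASE WHEN week_num = 4 THEN 1 ELSE 0 END)
        FROM ips_per_week
        GROUP BY modem_id
        ORDER BY modem_id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


sample_rows = [
    ("m1", "10.0.0.1", 1),
    ("m1", "10.0.0.1", 3),   # same IP in the same week: counted once
    ("m1", "10.0.0.2", 5),
    ("m1", "10.0.0.1", 9),   # week 2
    ("m2", "10.0.0.3", 22),  # week 4
]
print(weekly_ip_counts(sample_rows))
```

The table is scanned once and no join is needed, which is the point of the CASE-based approach.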

You can express your logic using just aggregation:

select a.modem_id, 
       count(distinct case when day = 1 then ip_address end) as day_1,
       count(distinct case when day = 2 then ip_address end) as day_2
from F_ACS_DEVICE_INFORMATION_NEW a
group by a.modem_id;

You can obviously extend this for more days.
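A minimal way to try the conditional COUNT(DISTINCT ...) pattern is to run it against an in-memory SQLite database from Python. This is only a sketch: the table name `device_info`, the sample rows, and the function name are illustrative assumptions, and it filters on the `day` column used in the question.

```python
import sqlite3


def daily_ip_counts(rows):
    """Return (modem_id, day_1, day_2) distinct-IP counts per modem."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical stand-in for F_ACS_DEVICE_INFORMATION_NEW.
    conn.execute(
        "CREATE TABLE device_info (modem_id TEXT, ip_address TEXT, day INTEGER)"
    )
    conn.executemany("INSERT INTO device_info VALUES (?, ?, ?)", rows)
    query = """
        SELECT modem_id,
               -- CASE yields NULL for other days; COUNT(DISTINCT ...) skips NULLs.
               COUNT(DISTINCT CASE WHEN day = 1 THEN ip_address END) AS day_1,
               COUNT(DISTINCT CASE WHEN day = 2 THEN ip_address END) AS day_2
        FROM device_info
        GROUP BY modem_id
        ORDER BY modem_id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


sample_rows = [
    ("m1", "10.0.0.1", 1),
    ("m1", "10.0.0.1", 1),  # duplicate visit: counted once
    ("m1", "10.0.0.2", 1),
    ("m1", "10.0.0.1", 2),
    ("m2", "10.0.0.3", 2),  # no day 1 rows for this modem
]
print(daily_ip_counts(sample_rows))
```

One difference from the left join in the question: a modem with no day 1 rows still appears here with a count of 0, whereas the join starts from the day 1 subquery and would drop it.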

Note: As your question and code are written, this assumes that your base table has data for only one week. Otherwise, I would expect some date filtering. Presumably, that is what the _NEW suffix means on the table name.
