Filtering a Query based on a Date and Window function in Snowflake

I was asked to pull information about three different types of clients in the last year (visited once, visited <10 times, and visited over 10 times) to see how their likelihood of returning compares across a few different factors.

For this reason, I created a pretty broad query. Currently I have a joined query of three tables: client information, visit information, and staff information. I created a calculated column in my select statement:

COUNT(DISTINCT visitno) OVER(PARTITION BY clientid) as totalvisits

Now I just need to group by totalvisits and filter by the date they visited.

I tried:

where visitdate> 01/01/2021
group by totalvisits
having total visits<10

But I get an error that visitno is not a valid group by expression.

What might I be doing wrong?

In Snowflake, you can use the QUALIFY clause to filter on the result of a window function after the window aggregation has been computed.

So, the query would look like this:

SELECT
  clientid,
  COUNT(DISTINCT visitno) OVER(PARTITION BY clientid) as totalvisits
FROM <your_table>
WHERE visitdate >= '2021-01-01'::date
  AND visitdate < '2022-01-01'::date
QUALIFY totalvisits < 10;

*Make sure that visitdate has a date type beforehand, though!*
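If visitdate turns out to be stored as text, one possible way to get a date out of it is TRY_TO_DATE, which returns NULL instead of failing on values it cannot parse. This is only a sketch: the raw_visits data, the visitdate_text column name and the 'YYYY-MM-DD' format are assumptions made up for illustration, not something from the question.

```sql
-- Hypothetical data: visit dates stored as text, one of them unparseable.
WITH raw_visits(clientid, visitno, visitdate_text) AS (
    SELECT * FROM VALUES
    (1, 101, '2021-03-05'),
    (1, 102, 'not a date')
)
SELECT
    clientid,
    visitno,
    -- TRY_TO_DATE yields NULL for text that does not match the format
    TRY_TO_DATE(visitdate_text, 'YYYY-MM-DD') AS visitdate
FROM raw_visits
-- rows whose text could not be parsed compare as NULL and are filtered out
WHERE TRY_TO_DATE(visitdate_text, 'YYYY-MM-DD') >= '2021-01-01'::date;
```

Rows that fail to parse drop out silently here, so it is worth counting them separately before trusting the totals.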

[Referring to the comment below]: If you wanted to see the total number of visits historically, plus the total number of visits in a given year, you can do the following:

SELECT
  clientid,
  YEAR(visitdate) as visit_date_year,
  COUNT(DISTINCT visitno) OVER (PARTITION BY clientid) as totalvisits,
  COUNT(DISTINCT visitno) OVER (PARTITION BY clientid, YEAR(visitdate)) as total_visits_by_year
FROM <your_table>
QUALIFY total_visits_by_year < 10;
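For comparison, QUALIFY does for window functions roughly what HAVING does for GROUP BY aggregates. The same filter as in the first query can be written with a sub-select and a plain WHERE, which may help if the query ever has to run somewhere without QUALIFY. The visits data below is invented for illustration; the column names follow the question.

```sql
-- Hypothetical stand-in for the joined client/visit data.
WITH visits(clientid, visitno, visitdate) AS (
    SELECT * FROM VALUES
    (1, 101, '2021-03-05'::date),
    (1, 102, '2021-06-20'::date),
    (2, 201, '2021-08-01'::date)
)
SELECT clientid, visitno, visitdate, totalvisits
FROM (
    SELECT
        clientid,
        visitno,
        visitdate,
        -- the window count is computed after the inner WHERE has filtered rows
        COUNT(DISTINCT visitno) OVER (PARTITION BY clientid) AS totalvisits
    FROM visits
    WHERE visitdate >= '2021-01-01'::date
      AND visitdate <  '2022-01-01'::date
) AS counted
-- the outer WHERE plays the role of QUALIFY
WHERE totalvisits < 10;
```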

OK, so let's make some fake data and do the count:

WITH fake_data(client_id, visit_date) as (
    SELECT * FROM VALUES
    -- this person has visited once
    (1, '2022-04-14'::date),
    -- this person has visited 3 times in the year
    (3, '2022-04-13'::date),
    (3, '2022-03-13'::date),
    (3, '2022-02-13'::date),
    -- this person is a heavy visitor, but 1 visit falls outside the last year
    (5, '2022-04-12'::date),
    (5, '2022-03-12'::date),
    (5, '2022-02-12'::date),
    (5, '2022-01-12'::date),
    (5, '2020-02-12'::date)
)
SELECT *,
    count(distinct visit_date) over (partition by client_id) as total_visits
FROM fake_data
WHERE visit_date >= dateadd('year', -1, '2022-04-14' /* CURRENT_DATE */)

Boom:

CLIENT_ID  VISIT_DATE  TOTAL_VISITS
1          2022-04-14  1
3          2022-04-13  3
3          2022-03-13  3
3          2022-02-13  3
5          2022-04-12  4
5          2022-03-12  4
5          2022-02-12  4

Now to sort those into the three groups/categories:

SELECT *,
    count(distinct visit_date) over (partition by client_id) as total_visits,
    case 
        when total_visits = 1 then 1
        when total_visits <= 3 then 2
        when total_visits > 3 then 3
    end as group_id
FROM fake_data
WHERE visit_date >= dateadd('year', -1, '2022-04-14' /* CURRENT_DATE */)
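The cut-offs 1, <= 3 and > 3 are sized for the fake data. With the buckets from the question, the CASE could look something like the sketch below. The visits data and the bucket labels are made up for illustration, and putting a client with exactly 10 visits into the top bucket is an assumption, since the question leaves that case open.

```sql
-- Hypothetical data: client 1 has one visit, client 2 has two.
WITH visits(clientid, visitno) AS (
    SELECT * FROM VALUES
    (1, 101),
    (2, 201),
    (2, 202)
)
SELECT
    clientid,
    COUNT(DISTINCT visitno) AS totalvisits,
    CASE
        WHEN COUNT(DISTINCT visitno) = 1  THEN 'visited once'
        WHEN COUNT(DISTINCT visitno) < 10 THEN 'visited <10 times'
        ELSE 'visited 10+ times'  -- assumption: exactly 10 visits lands here
    END AS visit_bucket
FROM visits
GROUP BY clientid;
```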

Now some math, which I will wrap in a sub-select (while also pushing a couple of things down into it):

WITH fake_data(client_id, visit_date) as (
    SELECT * FROM VALUES
    -- this person has visited once
    (1, '2022-04-14'::date),
    -- this person has visited 3 times in the year
    (3, '2022-04-13'::date),
    (3, '2022-04-11'::date),
    (3, '2022-04-09'::date),
    -- this person is a heavy visitor, but 1 visit falls outside the last year
    (5, '2022-04-12'::date),
    (5, '2022-03-12'::date),
    (5, '2022-02-12'::date),
    (5, '2022-01-12'::date),
    (5, '2020-02-12'::date)
)
SELECT group_id
    ,count(distinct client_id) as count_of_group_members
    ,sum(total_visits) as sum_of_group_visit
    ,avg(visit_gap_in_days) as avg_group_day_diff
    ,stddev(visit_gap_in_days) as stddev_group_day_diff
FROM (
SELECT *,
    count(distinct visit_date) over (partition by client_id) as total_visits,
    case 
        when total_visits = 1 then 1
        when total_visits <= 3 then 2
        when total_visits > 3 then 3
    end as group_id,
    lag(visit_date) over (partition by client_id order by visit_date) as prior_visit_date,
    datediff('day', prior_visit_date, visit_date) as visit_gap_in_days
FROM fake_data
WHERE visit_date >= dateadd('year', -1, '2022-04-14' /* CURRENT_DATE */)
)
GROUP BY 1
ORDER BY 1

GROUP_ID  COUNT_OF_GROUP_MEMBERS  SUM_OF_GROUP_VISIT  AVG_GROUP_DAY_DIFF  STDDEV_GROUP_DAY_DIFF
1         1                       1                   NULL                NULL
2         1                       9                   2                   0
3         1                       16                  30                  1.732050808

Wozers, that sum of visits is wrong: I have summed my sums.

So here, given the count(distinct visitno), I cannot sum it, because the per-client total is repeated on every one of that client's rows and the result becomes a sum of sums. And I cannot do a count(*) either, because we have just noticed there are duplicates (otherwise the distinct would not be needed). I also assume you have not stripped the rows down, as there are some "other details that you will want".
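One way around the sum of sums, sketched here under the assumption that one distinct visit_date per client counts as one visit (as in the fake data), is to collapse to one row per client first and only then aggregate per group. The per_client name is made up for this sketch, and any "other details" you need per visit would have to be joined back in separately.

```sql
WITH fake_data(client_id, visit_date) AS (
    SELECT * FROM VALUES
    (1, '2022-04-14'::date),
    (3, '2022-04-13'::date),
    (3, '2022-04-11'::date),
    (3, '2022-04-09'::date),
    (5, '2022-04-12'::date),
    (5, '2022-03-12'::date),
    (5, '2022-02-12'::date),
    (5, '2022-01-12'::date),
    (5, '2020-02-12'::date)
),
per_client AS (
    -- exactly one row per client, so the visit count appears only once
    SELECT
        client_id,
        count(distinct visit_date) AS total_visits
    FROM fake_data
    WHERE visit_date >= dateadd('year', -1, '2022-04-14'::date /* CURRENT_DATE */)
    GROUP BY client_id
)
SELECT
    case
        when total_visits = 1 then 1
        when total_visits <= 3 then 2
        else 3
    end AS group_id,
    count(*) AS count_of_group_members,
    sum(total_visits) AS sum_of_group_visit
FROM per_client
GROUP BY 1
ORDER BY 1;
```

With this fake data the visit sums should come out as 1, 3 and 4 for groups 1, 2 and 3, instead of 1, 9 and 16.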

But anyway, this is the great thing about SQL: you can answer anything, but you have to know the question, and know the data, so that you know which assumptions hold true for your data.
