(SQL) How to filter conditionally on a value computed with OVER

Question

I have a log of customers going through a workflow. I want to do two things, and I am struggling with both of them.

First: I want to filter out customers who didn't start by entering the first state of the workflow (enter state 0).

Second: for the remaining customers, I want to know how much time they spent in each step of the workflow.

Each record has:

CUSTOMER_ID (an integer)
STATE (an integer)
ACTION (enter or exit this state, a varchar)
UPDATE_DT (timestamp of the entry)

I tried a query that would give me the entry and exit timestamps grouped by customer and state, like so:

SELECT
    CUSTOMER_ID,
    STATE,
    MIN(UPDATE_DT) AS ENTRY_DATE,
    MAX(UPDATE_DT) AS EXIT_DATE
FROM LOG_DATA
GROUP BY CUSTOMER_ID, STATE
ORDER BY CUSTOMER_ID, STATE;

But I immediately ran into a few problems. The query runs just fine, but:

I haven't removed the customers who didn't start by entering state 0
Not all customers are guaranteed to have both an entry and an exit record for each state, so sometimes my MIN / MAX doesn't work out

I tried to focus on the first problem by adding another column to my SELECT:

MIN(STATE) OVER(PARTITION BY CUSTOMER_ID) AS EARLIEST_STATE

But then I ran into more problems. I can't use EARLIEST_STATE as a condition in the WHERE or in the HAVING clause: for the WHERE it does not exist yet, and the GROUP BY will not allow me to include EARLIEST_STATE.

As I thought this through, it got worse: MIN(STATE) can prove, at best, that a customer has a record with STATE = 0, but not that they have a record with ACTION = "enter" and STATE = 0. So this approach fails not only because I can't get it to run but also because it is logically incorrect.

I know I could nest SELECTs inside SELECTs, but this feels clunky and I want to learn the right way to do this. It also doesn't help that I am dealing with 10 million rows of data, so efficiency is important.

I am using Postgres 9.5, and I have no control over either the DB technology or the schema of the data in this instance.

It would be slow, but I could use something like Python to do this. Still, I would really like to know the correct way to do it in the DB.

Answer 1

If I understand correctly, you want at least one row with Action = 'Enter' and state = 0 for any customer that is in the result set. That suggests a window function:

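SELECT CUSTOMER_ID, STATE,
       MIN(UPDATE_DT) AS ENTRY_DATE,
       MAX(UPDATE_DT) AS EXIT_DATE
FROM (SELECT l.*,
             -- per customer, count the rows that enter state 0 (the literal must match the stored case)
             SUM(CASE WHEN ACTION = 'Enter' AND state = 0 THEN 1 ELSE 0 END) OVER (PARTITION BY CUSTOMER_ID) AS num_validenter
      FROM LOG_DATA l
     ) l
-- the window result is an ordinary column at this level, so WHERE can filter on it
WHERE num_validenter > 0
GROUP BY CUSTOMER_ID, STATE
ORDER BY CUSTOMER_ID, STATE;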
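
The question also asks for the time spent in each state and notes that MIN / MAX breaks down when a customer has only an entry or only an exit row for a state. One possible way to extend the query above is sketched below. It is not a tested solution for this data set and rests on assumptions that are not in the question: that ACTION holds exactly the literals 'Enter' and 'Exit' (adjust the case to match the stored values), and that a customer passes through a given state at most once. The TIME_IN_STATE column name is made up for illustration. The aggregate FILTER clause requires Postgres 9.4 or later, so it should be available on 9.5.

```sql
-- Sketch under the assumptions stated above; 'Enter' / 'Exit' and
-- TIME_IN_STATE are illustrative, not taken from the original schema.
SELECT CUSTOMER_ID,
       STATE,
       -- only 'Enter' rows contribute to the entry timestamp
       MIN(UPDATE_DT) FILTER (WHERE ACTION = 'Enter') AS ENTRY_DATE,
       -- only 'Exit' rows contribute to the exit timestamp
       MAX(UPDATE_DT) FILTER (WHERE ACTION = 'Exit')  AS EXIT_DATE,
       -- timestamp - timestamp yields an interval; the result is NULL
       -- when either the entry or the exit row is missing
       MAX(UPDATE_DT) FILTER (WHERE ACTION = 'Exit')
         - MIN(UPDATE_DT) FILTER (WHERE ACTION = 'Enter') AS TIME_IN_STATE
FROM (SELECT l.*,
             SUM(CASE WHEN ACTION = 'Enter' AND STATE = 0 THEN 1 ELSE 0 END)
                 OVER (PARTITION BY CUSTOMER_ID) AS num_validenter
      FROM LOG_DATA l
     ) l
WHERE num_validenter > 0
GROUP BY CUSTOMER_ID, STATE
ORDER BY CUSTOMER_ID, STATE;
```

With this shape, a state that has an entry but no exit shows a NULL EXIT_DATE and a NULL TIME_IN_STATE instead of a misleading zero-length interval, because an aggregate whose FILTER matches no rows returns NULL. If customers can re-enter a state, the entry and exit rows would need to be paired per visit instead, which this sketch does not attempt.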

Solution 1
0 2018-06-08 18:43:14
