(SQL) How to conditionally filter based on a value calculated using OVER

I have a log of customers going through a workflow. I want to do two things, and I am struggling with both of them.

First is: I wish to filter out customers who didn't start by entering the first state at the beginning of the workflow (enter state 0).

Second is: For remaining customers I want to know how much time they spent in each step of the workflow.

Each record has:

  • CUSTOMER_ID (an integer)
  • STATE (an integer)
  • ACTION (enter or exit this state, a varchar)
  • UPDATE_DT (timestamp of entry)

I tried to do a query that would allow me to get the timestamp of entry and exit grouped by customer and state like so:

SELECT
    CUSTOMER_ID,
    STATE,
    MIN(UPDATE_DT) AS ENTRY_DATE,
    MAX(UPDATE_DT) AS EXIT_DATE
FROM LOG_DATA
GROUP BY CUSTOMER_ID, STATE
ORDER BY CUSTOMER_ID, STATE;

But I immediately run into a few problems. The query will run just fine but:

  • I haven't removed the customers who didn't start by entering at state 0
  • Not all customers are guaranteed to have both an entry and exit date for each state so sometimes my MIN / MAX doesn't work out

I tried to focus on the first problem by introducing an additional attribute in my select thusly:

MIN(STATE) OVER(PARTITION BY CUSTOMER_ID) AS EARLIEST_STATE

But then ran into a few problems. I am unable to include EARLIEST_STATE as a condition of the WHERE or the GROUP BY HAVING because to the WHERE it does not exist, and the GROUP BY will not allow me to include EARLIEST_STATE.

As I thought this through, it gets worse: MIN(STATE) can prove, at best, that a customer has a record with STATE = 0, but not that they have a record with ACTION = 'enter' and STATE = 0. So this approach fails not only because I can't get it to run but also because it is logically incorrect.

I know I could nest multiple SELECTs within SELECTs, but this feels clunky and I want to learn the right way to do this. It also doesn't help that I am dealing with 10 million rows of data, so efficiency is important.

I am using Postgres 9.5, I have no control over either the DB technology or the schema of the data in this instance.

It would be slow, but I could use something like Python to do this. I would really like to know the correct way to do it in the DB, though.

If I understand correctly, you want at least one row with Action = 'Enter' and state = 0 for any customer that is in the result set. That suggests a window function:

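SELECT CUSTOMER_ID, STATE,
       MIN(UPDATE_DT) AS ENTRY_DATE,
       MAX(UPDATE_DT) AS EXIT_DATE
FROM (SELECT l.*,
             -- Per customer, count the rows that enter state 0.
             -- Adjust the 'Enter' literal to match how ACTION is stored.
             SUM(CASE WHEN ACTION = 'Enter' AND state = 0 THEN 1 ELSE 0 END) OVER (PARTITION BY CUSTOMER_ID) as num_validenter
      FROM LOG_DATA l
     ) l
-- The window result is computed in the subquery, so the outer WHERE can filter on it.
WHERE num_validenter > 0
GROUP BY CUSTOMER_ID, STATE
ORDER BY CUSTOMER_ID, STATE;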
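
For the second part of the question (time spent in each state, where an entry or exit row may be missing), one possible extension is to aggregate the two actions separately with the FILTER clause, which Postgres supports from 9.4 onward. This is only a sketch under some assumptions that the question does not confirm: that ACTION holds the literals 'Enter' and 'Exit' (adjust the case to match the data), and that a customer passes through a given state at most once. TIME_IN_STATE is a column name made up for illustration.

```sql
-- Sketch only. Assumes ACTION is stored as 'Enter' / 'Exit' and that
-- each customer visits a given state at most once.
SELECT CUSTOMER_ID,
       STATE,
       -- Earliest 'Enter' row for this customer and state (NULL if none).
       MIN(UPDATE_DT) FILTER (WHERE ACTION = 'Enter') AS ENTRY_DATE,
       -- Latest 'Exit' row for this customer and state (NULL if none).
       MAX(UPDATE_DT) FILTER (WHERE ACTION = 'Exit')  AS EXIT_DATE,
       -- Interval between the two; NULL when either side is missing.
       MAX(UPDATE_DT) FILTER (WHERE ACTION = 'Exit')
         - MIN(UPDATE_DT) FILTER (WHERE ACTION = 'Enter') AS TIME_IN_STATE
FROM (SELECT l.*,
             -- Same per-customer count of "entered state 0" rows as above.
             SUM(CASE WHEN ACTION = 'Enter' AND STATE = 0 THEN 1 ELSE 0 END)
                 OVER (PARTITION BY CUSTOMER_ID) AS num_validenter
      FROM LOG_DATA l
     ) l
WHERE num_validenter > 0
GROUP BY CUSTOMER_ID, STATE
ORDER BY CUSTOMER_ID, STATE;
```

Because the entry and exit timestamps come from separate filtered aggregates, a state with only an 'Enter' row yields a NULL EXIT_DATE instead of repeating the entry timestamp, so incomplete states are easy to spot or exclude.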
