
Filter rows in PostgreSQL based on values of consecutive rows in one column

I'm working with the following PostgreSQL table:

[Image: 10 rows from the PostgreSQL table]

For each business_id, I want to filter out those businesses whose review_count doesn't meet a specific review_count threshold for 2 consecutive months (or rows). The threshold depends on the city the business_id is in. For example, in the screenshot above, we can assume rows with city = Charlotte have a review_count threshold of >= 2, and those with city = Las Vegas have a review_count threshold of >= 3. If a business_id does not have at least one instance of consecutive months with review_counts that meet the specified threshold, I want to filter it out.

I want this query to return only the business_ids that meet this condition (as well as all the other columns in the table that go along with that business_id). The composite primary key on this table is (business_id, year, month).

Some months, as you may notice, are missing from the data (month 9 of the second business_id). If that is the case, I do NOT want to count 2 rows as 'consecutive months'. For example, for the business in Las Vegas, I do NOT want to consider month 8 to 10 as 'consecutive months', even though they appear in consecutive rows.

I've tried something like this, but have kind of run into a wall and don't think it's getting me far:

SELECT *
FROM us_business_monthly_review_growth
WHERE business_id IN (SELECT DISTINCT(business_id)
                      FROM us_business_monthly_review_growth
                      GROUP BY business_id, year, month
                      HAVING (city = 'Las Vegas' 
                             AND (CASE WHEN COUNT(review_count >= 2 * 2.21) >= 2))
                             OR (city = 'Charlotte' AND (CASE WHEN COUNT(review_count >= 2 * 1.95) >= 2))

I'm new to PostgreSQL and Stack Overflow, so if you have any feedback on the way I asked this question please don't hesitate to let me know! =)

UPDATE:

Thanks to some help from @Gordon Linoff, I found the following solution:

SELECT *
FROM us_businesses_monthly_growth_and_avg
WHERE business_id IN (SELECT DISTINCT business_id
                      FROM (SELECT *,
                                   lag(year) OVER (PARTITION BY business_id ORDER BY year, month) AS prev_year,
                                   lag(month) OVER (PARTITION BY business_id ORDER BY year, month) AS prev_month,
                                   lag(review_count) OVER (PARTITION BY business_id ORDER BY year, month) AS prev_review_count
                            FROM us_businesses_monthly_growth_and_avg
                           ) AS usga
                      WHERE (city = 'Charlotte' AND review_count >= 4 * 1.95 AND prev_review_count >= 4 * 1.95 AND (year * 12 + month) = (prev_year * 12 + prev_month) + 1)
                         OR (city = 'Las Vegas' AND review_count >= 4 * 3.31 AND prev_review_count >= 4 * 3.31 AND (year * 12 + month) = (prev_year * 12 + prev_month) + 1));

You can do this with lag() :

select distinct business_id
from (select t.*,
             lag(year) over (partition by business_id order by year, month) as prev_year,
             lag(month) over (partition by business_id order by year, month) as prev_month,
             lag(rating) over (partition by business_id order by year, month) as prev_rating
      from us_business_monthly_review_growth t
     ) t
where rating >= $threshold and prev_rating >= $threshold and
      (year * 12 + month) = (prev_year * 12 + prev_month) + 1;

The only trick is setting the threshold value. I have no idea how you plan on doing that.
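
One possible way to supply the threshold, sketched here under assumptions: keep one threshold per city in a small inline VALUES list (a real lookup table would work the same way) and join it to the lag() subquery. The temp table sample_reviews, its rows, the alias city_thresholds, and the threshold values (2 for Charlotte, 3 for Las Vegas, borrowed from the example in the question) are all illustrative, and the sketch assumes review_count is the column to test.

```sql
-- Hypothetical sample data; column names follow the question.
CREATE TEMP TABLE sample_reviews (
    business_id  text,
    city         text,
    year         int,
    month        int,
    review_count int,
    PRIMARY KEY (business_id, year, month)
);

INSERT INTO sample_reviews VALUES
    ('a', 'Charlotte', 2017, 12, 2),
    ('a', 'Charlotte', 2018,  1, 3),   -- Dec -> Jan is consecutive: 2018*12+1 = (2017*12+12) + 1
    ('b', 'Las Vegas', 2017,  8, 5),
    ('b', 'Las Vegas', 2017, 10, 4),   -- month 9 is missing, so 8 -> 10 must not count
    ('c', 'Las Vegas', 2017,  3, 3),
    ('c', 'Las Vegas', 2017,  4, 2);   -- consecutive, but April is below the threshold

SELECT DISTINCT t.business_id
FROM (SELECT r.*,
             lag(year)         OVER w AS prev_year,
             lag(month)        OVER w AS prev_month,
             lag(review_count) OVER w AS prev_review_count
      FROM sample_reviews r
      WINDOW w AS (PARTITION BY business_id ORDER BY year, month)
     ) t
-- Assumed per-city thresholds; replace with a real lookup table if one exists.
JOIN (VALUES ('Charlotte', 2), ('Las Vegas', 3)) AS city_thresholds (city, threshold)
  ON city_thresholds.city = t.city
WHERE t.review_count      >= city_thresholds.threshold
  AND t.prev_review_count >= city_thresholds.threshold
  AND (t.year * 12 + t.month) = (t.prev_year * 12 + t.prev_month) + 1;
```

With these sample rows only business 'a' comes back: 'b' fails the consecutive-month test and 'c' fails the threshold in its second month. Because the join is an inner join, any city without a threshold entry is dropped as well.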

Please try...

SELECT business_id
FROM
(
    SELECT business_id,
           LAG( business_id, -1 ) OVER ( ORDER BY business_id, year, month ) AS lag_in_business_id,
           city,
           -- Months between this row and the next one (the negative offset makes LAG() look ahead)
           ( LAG( year, -1 ) OVER ( ORDER BY business_id, year, month ) * 12
             + LAG( month, -1 ) OVER ( ORDER BY business_id, year, month ) )
           - ( year * 12 + month ) AS diffInDates,
           review_count,
           -- The next row's review_count must meet the threshold as well
           LAG( review_count, -1 ) OVER ( ORDER BY business_id, year, month ) AS next_review_count
    FROM us_business_monthly_review_growth
) tempTable
JOIN tblCityThresholds ON tblCityThresholds.city = tempTable.city
WHERE business_id = lag_in_business_id
  AND diffInDates = 1
  AND tblCityThresholds.threshold <= review_count
  AND tblCityThresholds.threshold <= next_review_count
GROUP BY business_id;

In formulating this answer I first used the following code to test that LAG() performed as hoped...

SELECT business_id,
       LAG( business_id, 1 ) OVER ( ORDER BY business_id, year, month ) AS lag_in_business_id,
       year,
       month,
       LAG( year, 1 ) OVER ( ORDER BY business_id, year, month ) * 12 + LAG( month, 1 ) OVER ( ORDER BY business_id, year, month ) AS diffInDates
FROM mytable
ORDER BY business_id,
         year,
         month;

Here I was trying to get LAG() to refer to values on the next row, but the output showed that it was referring to the previous row. I wanted to compare the current values with those of the next row, to see whether the next record had the same business_id, etc., so I changed the 1 in LAG() to -1, giving me...

SELECT business_id,
       LAG( business_id, -1 ) OVER ( ORDER BY business_id, year, month ) AS lag_in_business_id,
       year,
       month,
       LAG( year, -1 ) OVER ( ORDER BY business_id, year, month ) * 12 + LAG( month, -1 ) OVER ( ORDER BY business_id, year, month ) AS diffInDates
FROM mytable
ORDER BY business_id,
         year,
         month;
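
For reading the following row, PostgreSQL also documents LEAD(), which states the intent more directly than a negative LAG() offset. A minimal sketch, assuming a hypothetical temp table lead_demo with made-up rows; adding PARTITION BY business_id makes the next-row values NULL at the end of each business, so the separate business_id comparison is not needed in this form.

```sql
-- Hypothetical sample data for illustration only.
CREATE TEMP TABLE lead_demo (business_id text, year int, month int);

INSERT INTO lead_demo VALUES
    ('a', 2017, 11),
    ('a', 2017, 12),
    ('a', 2018,  1),
    ('b', 2017,  8),
    ('b', 2017, 10);

SELECT business_id,
       year,
       month,
       -- Months from this row to the next row of the same business (NULL on the last row)
       (LEAD(year) OVER w * 12 + LEAD(month) OVER w) - (year * 12 + month) AS months_to_next
FROM lead_demo
WINDOW w AS (PARTITION BY business_id ORDER BY year, month)
ORDER BY business_id, year, month;
```

On these rows months_to_next is 1 for each pair of consecutive months (including December to January), 2 across the missing month 9, and NULL on the last row of each business.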

As this gave me the desired results I added city, to allow a JOIN between the results and an assumed table holding the details of each city and its corresponding threshold. I chose the name tblCityThresholds as a suggestion since I am not sure what you have / would call it. This completed the inner SELECT statement.

I then joined the results of the inner SELECT statement to tblCityThresholds and refined the output as per your criteria. Note: It is assumed that the city field will always have a corresponding entry in tblCityThresholds.
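
That assumption can be checked before running the main query. A sketch, with tblCityThresholds created here only for illustration (the column names city and threshold are taken from the query above; the threshold values and the demo_rows table are made up): an anti-join lists any city that has no threshold and would therefore be dropped silently by the inner JOIN.

```sql
-- Assumed shape of the threshold table; values are illustrative.
CREATE TEMP TABLE tblCityThresholds (
    city      text PRIMARY KEY,
    threshold numeric NOT NULL
);
INSERT INTO tblCityThresholds VALUES ('Charlotte', 2), ('Las Vegas', 3);

-- Hypothetical stand-in for the review table, reduced to the relevant columns.
CREATE TEMP TABLE demo_rows (business_id text, city text);
INSERT INTO demo_rows VALUES ('a', 'Charlotte'), ('b', 'Las Vegas'), ('d', 'Phoenix');

-- Cities present in the data but absent from the threshold table
SELECT DISTINCT d.city
FROM demo_rows d
LEFT JOIN tblCityThresholds ct ON ct.city = d.city
WHERE ct.city IS NULL;
```

Here the check returns Phoenix, the one city without a threshold row. An empty result means the assumption holds.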

I then used GROUP BY to ensure no repetition of a business_id.

If you have any questions or comments, then please feel free to post a Comment accordingly.

Further Reading

https://www.postgresql.org/docs/8.4/static/functions-window.html (regarding LAG())
