Filter rows in PostgreSQL based on values of consecutive rows in one column
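I'm working with the following PostgreSQL table:
For each business_id, I want to filter out the businesses whose review_count is not above a specific review_count threshold for two consecutive months (that is, two consecutive rows). The threshold varies with the city the business_id is in. For example, in the screenshot above we can assume that rows with city = Charlotte have a review_count threshold of >= 2, and rows with city = Las Vegas have a review_count threshold of >= 3. If a business_id does not have at least one pair of consecutive months whose review_counts both meet the specified threshold, I want to filter it out.
I'd like this query to return only the business_ids that satisfy this condition (along with all the other columns in the table that go with that business_id). The composite primary key on this table is (business_id, year, month).
You may notice that some months are missing from the data (month 9 for the second business_id). In that case, I do not want to count the 2 rows as "consecutive months". For example, for the business in Las Vegas, I do not want months 8 and 10 to be treated as "consecutive months", even though they appear in consecutive rows.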
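To pin down the rule before turning it into SQL, here is a minimal Python sketch of the condition described above. The sample rows, the thresholds, and the function names are made up for illustration; the only idea taken from the table is that (year, month) can be collapsed into one integer so that calendar-adjacent months differ by exactly 1.

```python
from itertools import groupby
from operator import itemgetter


def month_index(year, month):
    """Map (year, month) to one integer so adjacent months differ by exactly 1."""
    return year * 12 + month


def businesses_with_consecutive_months(rows, thresholds):
    """Return business_ids with two calendar-adjacent months at or above the city threshold.

    rows:       iterable of (business_id, city, year, month, review_count) tuples.
    thresholds: dict mapping city -> minimum review_count (hypothetical values).
    """
    keep = set()
    ordered = sorted(rows, key=lambda r: (r[0], r[2], r[3]))
    for business_id, group in groupby(ordered, key=itemgetter(0)):
        group = list(group)
        # Compare each row with the row before it for the same business.
        for prev, cur in zip(group, group[1:]):
            limit = thresholds[cur[1]]
            adjacent = month_index(cur[2], cur[3]) == month_index(prev[2], prev[3]) + 1
            if adjacent and prev[4] >= limit and cur[4] >= limit:
                keep.add(business_id)
                break
    return keep


# Made-up sample data: business "b" skips month 9, so months 8 and 10 must not count.
rows = [
    ("a", "Charlotte", 2017, 1, 2),
    ("a", "Charlotte", 2017, 2, 3),
    ("b", "Las Vegas", 2017, 8, 5),
    ("b", "Las Vegas", 2017, 10, 4),
    ("c", "Las Vegas", 2016, 12, 3),
    ("c", "Las Vegas", 2017, 1, 3),
]
thresholds = {"Charlotte": 2, "Las Vegas": 3}
result = businesses_with_consecutive_months(rows, thresholds)
print(sorted(result))
```

Business "c" shows why the year has to be part of the index: December 2016 and January 2017 are adjacent even though the month numbers are 12 and 1.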
I've tried something like the following, but I've hit a wall and don't think it is getting me any further:
SELECT *
FROM us_business_monthly_review_growth
WHERE business_id IN (SELECT DISTINCT(business_id)
FROM us_business_monthly_review_growth
GROUP BY business_id, year, month
HAVING (city = 'Las Vegas'
AND (CASE WHEN COUNT(review_count >= 2 * 2.21) >= 2))
OR (city = 'Charlotte' AND (CASE WHEN COUNT(review_count >= 2 * 1.95) >= 2))
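I'm new to Postgres and StackOverflow, so if you have any feedback on the way I asked this question, please let me know! =)
Update:
Thanks to help from @Gordon Linoff, I found the following solution: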
SELECT *
FROM us_businesses_monthly_growth_and_avg
WHERE business_id IN (SELECT DISTINCT business_id
                      FROM (SELECT *,
                                   lag(year) OVER (PARTITION BY business_id ORDER BY year, month) AS prev_year,
                                   lag(month) OVER (PARTITION BY business_id ORDER BY year, month) AS prev_month,
                                   lag(review_count) OVER (PARTITION BY business_id ORDER BY year, month) AS prev_review_count
                            FROM us_businesses_monthly_growth_and_avg
                           ) AS usga
                      -- this month and the previous row must both meet the city's threshold,
                      -- and the two rows must be calendar-adjacent months
                      WHERE (city = 'Charlotte' AND review_count >= 4 * 1.95 AND prev_review_count >= 4 * 1.95 AND (year * 12 + month) = (prev_year * 12 + prev_month) + 1)
                         OR (city = 'Las Vegas' AND review_count >= 4 * 3.31 AND prev_review_count >= 4 * 3.31 AND (year * 12 + month) = (prev_year * 12 + prev_month) + 1));
You can do this using lag():
select distinct business_id
from (select t.*,
lag(year) over (partition by business_id order by year, month) as prev_year,
lag(month) over (partition by business_id order by year, month) as prev_month,
lag(rating) over (partition by business_id order by year, month) as prev_rating
from us_business_monthly_review_growth t
) t
where rating >= $threshhold and prev_rating >= $threshhold and
(year * 12 + month) = (prev_year * 12 + prev_month) + 1;
The only trick is setting the threshold. I don't know how you plan to do that.
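One possible way to supply the threshold is a small lookup table keyed by city, joined to the lagged rows. The sketch below runs that idea against an in-memory SQLite database from Python, purely so it is self-contained. SQLite is a stand-in here, not PostgreSQL, and it needs SQLite 3.25 or later for window functions. The table names monthly and city_thresholds, the sample rows, and the threshold values are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE monthly (
        business_id TEXT, city TEXT, year INT, month INT, review_count INT,
        PRIMARY KEY (business_id, year, month)
    );
    -- Hypothetical lookup table: one threshold per city.
    CREATE TABLE city_thresholds (city TEXT PRIMARY KEY, threshold REAL);
    INSERT INTO monthly VALUES
        ('a', 'Charlotte', 2017, 1, 2), ('a', 'Charlotte', 2017, 2, 3),
        ('b', 'Las Vegas', 2017, 8, 5), ('b', 'Las Vegas', 2017, 10, 4),
        ('c', 'Las Vegas', 2017, 3, 3), ('c', 'Las Vegas', 2017, 4, 1);
    INSERT INTO city_thresholds VALUES ('Charlotte', 2), ('Las Vegas', 3);
""")

query = """
    SELECT DISTINCT t.business_id
    FROM (SELECT m.*,
                 LAG(year)         OVER (PARTITION BY business_id ORDER BY year, month) AS prev_year,
                 LAG(month)        OVER (PARTITION BY business_id ORDER BY year, month) AS prev_month,
                 LAG(review_count) OVER (PARTITION BY business_id ORDER BY year, month) AS prev_review_count
          FROM monthly m
         ) t
    JOIN city_thresholds c ON c.city = t.city
    WHERE t.review_count >= c.threshold
      AND t.prev_review_count >= c.threshold
      AND (t.year * 12 + t.month) = (t.prev_year * 12 + t.prev_month) + 1
    ORDER BY t.business_id
"""
qualifying = [row[0] for row in conn.execute(query)]
print(qualifying)
```

With this data only business 'a' qualifies: 'b' has a gap between months 8 and 10, and 'c' drops below its threshold in the second month. Keeping thresholds in a table means a new city needs a new row rather than another OR branch in the WHERE clause.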
Please try the following...
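SELECT business_id
FROM
(
    SELECT business_id AS business_id,
           -- LAG( x, -1 ) reads the NEXT row; LEAD( x, 1 ) is the more usual spelling
           LAG( business_id, -1 ) OVER ( ORDER BY business_id, year, month ) AS lag_in_business_id,
           city,
           -- months between this row and the next one (1 = calendar-adjacent months)
           ( LAG( year, -1 ) OVER ( ORDER BY business_id, year, month ) * 12 + LAG( month, -1 ) OVER ( ORDER BY business_id, year, month ) ) - ( year * 12 + month ) AS diffInDates,
           review_count AS review_count,
           LAG( review_count, -1 ) OVER ( ORDER BY business_id, year, month ) AS next_review_count
    FROM us_business_monthly_review_growth
    ORDER BY business_id,
             year,
             month
) tempTable
JOIN tblCityThresholds ON tblCityThresholds.city = tempTable.city
WHERE business_id = lag_in_business_id
  AND diffInDates = 1
  -- both months of the pair must meet the city's threshold
  AND tblCityThresholds.threshold <= review_count
  AND tblCityThresholds.threshold <= next_review_count
GROUP BY business_id;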
While developing this answer, I first used the following code to test whether LAG() performed as expected...
SELECT business_id,
       LAG( business_id, 1 ) OVER ( ORDER BY business_id, year, month ) AS lag_in_business_id,
       year,
       month,
       -- months elapsed since the previous row
       ( year * 12 + month ) - ( LAG( year, 1 ) OVER ( ORDER BY business_id, year, month ) * 12 + LAG( month, 1 ) OVER ( ORDER BY business_id, year, month ) ) AS diffInDates
FROM mytable
ORDER BY business_id,
         year,
         month;
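Here I was trying to get LAG() to refer to the value in the next row, but the output showed that it was referring to the previous row in that comparison. Unfortunately, I wanted to compare the current value with the next record, to see whether the next record has the same business_id and so on. So I changed the 1 in LAG() to -1, which gave me...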
SELECT business_id,
       LAG( business_id, -1 ) OVER ( ORDER BY business_id, year, month ) AS lag_in_business_id,
       year,
       month,
       -- months until the next row
       ( LAG( year, -1 ) OVER ( ORDER BY business_id, year, month ) * 12 + LAG( month, -1 ) OVER ( ORDER BY business_id, year, month ) ) - ( year * 12 + month ) AS diffInDates
FROM mytable
ORDER BY business_id,
         year,
         month;
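Since this gave me the desired results, I added city to allow a JOIN between the results and an assumed table containing each city and its corresponding threshold. I chose the name tblCityThresholds as a suggestion, since I'm not sure what you have or what you would call it. This completed the inner SELECT statement.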
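The switch from an offset of 1 to -1 described above can be pictured with a small Python sketch. The lag and lead helpers here are hypothetical stand-ins that mimic the window functions over an already ordered list. They are not PostgreSQL's implementation, just an illustration of why a negative LAG offset looks ahead.

```python
def lag(values, offset=1, default=None):
    """Mimic the window function LAG over an already ordered list.

    A positive offset looks back at earlier rows; a negative offset looks
    ahead, which is what LEAD does with a positive offset.
    """
    out = []
    for i in range(len(values)):
        j = i - offset
        out.append(values[j] if 0 <= j < len(values) else default)
    return out


def lead(values, offset=1, default=None):
    """LEAD is LAG with the sign of the offset flipped."""
    return lag(values, -offset, default)


months = [7, 8, 10, 11]
print(lag(months, 1))    # previous row's value at each position
print(lag(months, -1))   # next row's value at each position
print(lead(months, 1))   # same as lag(months, -1)
```

Rows with no neighbour in the requested direction get the default (None here, NULL in SQL), so any comparison against them is never true and the first or last row of the ordering drops out of the filter.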
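I then joined the results of the inner SELECT statement to tblCityThresholds and refined the output according to your criteria. Note: this assumes that the city field will always have a corresponding entry in tblCityThresholds.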
I then used GROUP BY to ensure that there are no duplicate business_ids.
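The join and de-duplication steps above can be sketched in Python as follows. This is only an illustration under assumptions: the input shape (one tuple per pair of calendar-adjacent months), the function name, and the sample values are hypothetical, and the dict stands in for tblCityThresholds. It treats a pair as qualifying only when both months meet the threshold, as the question requires.

```python
def qualifying_business_ids(adjacent_pairs, thresholds):
    """Return distinct business_ids whose adjacent-month pair meets the city threshold.

    adjacent_pairs: (business_id, city, review_count, next_review_count) tuples,
                    one per pair of calendar-adjacent months (hypothetical shape).
    thresholds:     dict city -> threshold, standing in for tblCityThresholds.
    """
    hits = []
    for business_id, city, count, next_count in adjacent_pairs:
        if city not in thresholds:
            # An inner join silently drops rows whose city has no threshold entry.
            continue
        limit = thresholds[city]
        if count >= limit and next_count >= limit:
            hits.append(business_id)
    # dict.fromkeys collapses duplicates, the role GROUP BY business_id plays in the query.
    return list(dict.fromkeys(hits))


pairs = [
    ("a", "Charlotte", 2, 3),
    ("a", "Charlotte", 3, 4),   # second qualifying pair for the same business
    ("b", "Las Vegas", 5, 2),   # the following month falls below the threshold
    ("d", "Phoenix", 9, 9),     # city missing from the thresholds table
]
thresholds = {"Charlotte": 2, "Las Vegas": 3}
ids = qualifying_business_ids(pairs, thresholds)
print(ids)
```

Business "d" shows the caveat in the note above: with an inner join, a business in a city that has no threshold row disappears from the result without any error.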
If you have any questions or comments, please feel free to post a comment.
Further Reading
https://www.postgresql.org/docs/8.4/static/functions-window.html (on LAG())