
Filter rows in PostgreSQL based on values of consecutive rows in one column

So, I'm working with the following PostgreSQL table:

[Image: 10 rows from the PostgreSQL table]

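For each business_id, I want to filter out those businesses whose review_count does not reach a specific review_count threshold for 2 consecutive months (or rows). The threshold differs depending on the city the business_id is in (for example, in the screenshot above, we can assume rows with city = Charlotte have a review_count threshold of >= 2, and rows with city = Las Vegas have a review_count threshold of >= 3). If a business_id does not have at least one instance of consecutive months with review_counts at or above the specified threshold, I want to filter it out.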

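I want this query to return only the business_ids that meet this condition (along with all the other columns in the table that go with that business_id). The composite primary key on this table is (business_id, year, month).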

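You may notice that some months are missing from the data (month 9 for the second business_id). In that case, I do not want to count the 2 rows as "consecutive months". For example, for the business in Las Vegas, I do not want months 8 and 10 to be treated as "consecutive months", even though they appear in consecutive rows.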

I've tried something like the following, but I've hit a wall and don't think it gets me much further:

SELECT *
FROM us_business_monthly_review_growth
WHERE business_id IN (SELECT DISTINCT(business_id)
                      FROM us_business_monthly_review_growth
                      GROUP BY business_id, year, month
                      HAVING (city = 'Las Vegas' 
                             AND (CASE WHEN COUNT(review_count >= 2 * 2.21) >= 2))
                             OR (city = 'Charlotte' AND (CASE WHEN COUNT(review_count >= 2 * 1.95) >= 2))

I'm new to Postgres and StackOverflow, so if you have any feedback on the way I asked this question, please let me know! =)

UPDATE

Thanks to help from @Gordon Linoff, I found the following solution:

SELECT *
FROM us_businesses_monthly_growth_and_avg
WHERE business_id IN (SELECT DISTINCT business_id
                      FROM (SELECT *,
                                   lag(year) OVER (PARTITION BY business_id ORDER BY year, month) AS prev_year,
                                   lag(month) OVER (PARTITION BY business_id ORDER BY year, month) AS prev_month,
                                   lag(review_count) OVER (PARTITION BY business_id ORDER BY year, month) AS prev_review_count
                            FROM us_businesses_monthly_growth_and_avg
                           ) AS usga
                      WHERE (city = 'Charlotte' AND review_count >= 4 * 1.95 AND prev_review_count >= 4 * 1.95 AND (year * 12 + month) = (prev_year * 12 + prev_month) + 1)
                         OR (city = 'Las Vegas' AND review_count >= 4 * 3.31 AND prev_review_count >= 4 * 3.31 AND (year * 12 + month) = (prev_year * 12 + prev_month) + 1));
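
The month check in this solution works by turning each (year, month) pair into a single running month number, year * 12 + month, so that two rows are consecutive months exactly when the numbers differ by 1. The following is a small sketch of that arithmetic on made-up (year, month) pairs; the sample values and the column names month_index and gap are assumptions for illustration only:

```sql
-- Hypothetical sample rows; only the (year, month) pairs matter here.
SELECT year,
       month,
       year * 12 + month AS month_index,
       (year * 12 + month)
         - lag(year * 12 + month) OVER (ORDER BY year, month) AS gap
FROM (VALUES (2017, 8), (2017, 10), (2017, 11), (2017, 12), (2018, 1)) AS s(year, month)
ORDER BY year, month;
-- gap is NULL for the first row, 2 for 2017-10 (month 9 is missing),
-- and 1 for 2017-11, 2017-12 and 2018-01 (the year boundary is handled).
```

Only rows with gap = 1 count as following the previous month, which is why months 8 and 10 are not treated as consecutive.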

You can do this using lag():

select distinct business_id
from (select t.*,
             lag(year) over (partition by business_id order by year, month) as prev_year,
             lag(month) over (partition by business_id order by year, month) as prev_month,
             lag(review_count) over (partition by business_id order by year, month) as prev_review_count
      from us_business_monthly_review_growth t
     ) t
where review_count >= $threshold and prev_review_count >= $threshold and
      (year * 12 + month) = (prev_year * 12 + prev_month) + 1;

The only trick is setting the threshold. I don't know how you intend to do that.
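
One possible way to supply the threshold, sketched here under assumptions, is to list the per-city values inline and join them to the lagged rows. The values 2 and 3 are the example thresholds from the question, review_count is the column named in the question, and the CTE names thresholds and lagged are made up for this sketch:

```sql
WITH thresholds (city, threshold) AS (
    -- Assumed per-city thresholds, taken from the example in the question.
    VALUES ('Charlotte', 2), ('Las Vegas', 3)
),
lagged AS (
    SELECT t.*,
           lag(year)         OVER w AS prev_year,
           lag(month)        OVER w AS prev_month,
           lag(review_count) OVER w AS prev_review_count
    FROM us_business_monthly_review_growth t
    WINDOW w AS (PARTITION BY business_id ORDER BY year, month)
)
SELECT DISTINCT l.business_id
FROM lagged l
JOIN thresholds th ON th.city = l.city   -- cities without a threshold drop out
WHERE l.review_count      >= th.threshold
  AND l.prev_review_count >= th.threshold
  AND (l.year * 12 + l.month) = (l.prev_year * 12 + l.prev_month) + 1;
```

If the thresholds change often, the same join works against a real lookup table in place of the inline VALUES list.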

Please try...

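SELECT business_id
FROM
(
    SELECT business_id AS business_id,
           LAG( business_id, -1 ) OVER ( ORDER BY business_id, year, month ) AS lag_in_business_id,
           city,
           -- months between this row and the next one (1 = consecutive months)
           ( LAG( year, -1 ) OVER ( ORDER BY business_id, year, month ) * 12 + LAG( month, -1 ) OVER ( ORDER BY business_id, year, month ) ) - ( year * 12 + month ) AS diffInDates,
           review_count AS review_count,
           -- review_count of the next row, so that both months can be checked
           LAG( review_count, -1 ) OVER ( ORDER BY business_id, year, month ) AS next_review_count
    FROM us_business_monthly_review_growth
        order BY business_id,
                 year,
                 month
) tempTable
JOIN tblCityThresholds ON tblCityThresholds.city = tempTable.city
WHERE business_id = lag_in_business_id
  AND diffInDates = 1
  AND tblCityThresholds.threshold <= review_count
  AND tblCityThresholds.threshold <= next_review_count
GROUP BY business_id;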

In working up this answer, I first tested that LAG() performed as expected, using the following code...

SELECT business_id,
       LAG( business_id, 1 ) OVER ( ORDER BY business_id, year, month ) AS lag_in_business_id,
       year,
       month,
       ( year * 12 + month ) - ( LAG( year, 1 ) OVER ( ORDER BY business_id, year, month ) * 12 + LAG( month, 1 ) OVER ( ORDER BY business_id, year, month ) ) AS diffInDates
FROM mytable
ORDER BY business_id,
         year,
         month;

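Here I was trying to get LAG() to refer to the values in the next row, but the output showed that it referred to the previous row instead. Since I wanted to compare the current values with those of the next record (to see whether the next record has the same business_id, and so on), I changed the 1 in LAG() to -1, which gave me...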

SELECT business_id,
       LAG( business_id, -1 ) OVER ( ORDER BY business_id, year, month ) AS lag_in_business_id,
       year,
       month,
       ( LAG( year, -1 ) OVER ( ORDER BY business_id, year, month ) * 12 + LAG( month, -1 ) OVER ( ORDER BY business_id, year, month ) ) - ( year * 12 + month ) AS diffInDates
FROM mytable
ORDER BY business_id,
         year,
         month;

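Since this gave me the desired result, I added city to allow a JOIN between the results and a hypothetical table containing each city and its corresponding threshold. I chose the name tblCityThresholds as a suggestion, since I'm not sure what you have or would call it. This completed the inner SELECT statement.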

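I then joined the results of the inner SELECT statement to tblCityThresholds and refined the output according to your criteria. Note: this assumes that the city field will always have a corresponding entry in tblCityThresholds.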

I then used GROUP BY to ensure that there were no repeated business_ids.
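
As a side note, PostgreSQL also provides LEAD(), which reads the following row directly, and PARTITION BY business_id keeps each window inside one business so that the separate business_id comparison is not needed. The following is a rough sketch of the same idea written that way; it still relies on the hypothetical tblCityThresholds table (with city and threshold columns), and the aliases cur, ct and the next_* column names are made up for illustration:

```sql
SELECT cur.business_id
FROM (
    SELECT business_id, city, year, month, review_count,
           lead(year)         OVER w AS next_year,
           lead(month)        OVER w AS next_month,
           lead(review_count) OVER w AS next_review_count
    FROM us_business_monthly_review_growth
    WINDOW w AS (PARTITION BY business_id ORDER BY year, month)
) cur
JOIN tblCityThresholds ct ON ct.city = cur.city
-- the next row must be exactly one month later ...
WHERE (cur.next_year * 12 + cur.next_month) - (cur.year * 12 + cur.month) = 1
-- ... and both months must reach the city's threshold
  AND cur.review_count      >= ct.threshold
  AND cur.next_review_count >= ct.threshold
GROUP BY cur.business_id;
```

For the last row of each business the next_* columns are NULL, so the comparisons are not true and that row is dropped without any extra check.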

If you have any questions or comments, please feel free to post a comment.

Further reading

https://www.postgresql.org/docs/8.4/static/functions-window.html (on LAG())

