
PostgreSQL/Python - Get last N rows with no repeats

Is there any way I can do this?


E.g. if my table contains the following rows:

id | username | profile_photo
---+----------+----------------
 1 |     juan | urlphoto/juan
 2 |   nestor | urlphoto/nestor
 3 |    pablo | urlphoto/pablo
 4 |    pablo | urlphoto/pablo

And I want to get the last 2 (two) non-repeated rows, which should be:

id 2 -> nestor | urlphoto/nestor
id 3 -> pablo  | urlphoto/pablo

Thanks for your time.

SOLUTION:

The solution is to insert an item only if it is not already among the first n elements:

import psycopg2, psycopg2.extras
db = psycopg2.connect("")

# n and user_id are assumed to be defined elsewhere
cursor = db.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
cursor.execute("SELECT id FROM users ORDER BY id DESC LIMIT %s;", (n,))
recent_ids = [item['id'] for item in cursor.fetchall()]

if user_id not in recent_ids:
    cursor.execute("INSERT..")
    db.commit()
cursor.close()
db.close()

How about

SELECT id, username, profile_photo
FROM (SELECT min(id) AS id, username, profile_photo
      FROM my_table
      GROUP BY username, profile_photo) tmp
ORDER BY id DESC
LIMIT 2

If you don't care about the final row order, here you go:

SELECT min(id), username, profile_photo 
FROM oh_my_table
GROUP BY username, profile_photo
ORDER BY min(id) DESC 
LIMIT 2

You didn't describe what constitutes a duplicate row (in your example nothing is actually repeated, because the id column makes every row unique), but I'm assuming you want the rows to be distinct on all columns except id, and that you don't care which id of several possible duplicates you get.
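To pin down that assumption, here is a plain-Python sketch of the intended semantics, using the four sample rows from the question (which id survives for a duplicated pair is arbitrary; this sketch keeps the newest one):

```python
def last_n_distinct(rows, n):
    """Return the last n rows (newest first), treating rows with the
    same (username, profile_photo) pair as duplicates."""
    seen = set()
    out = []
    # Walk from the newest row (highest id) backwards.
    for row in sorted(rows, key=lambda r: r["id"], reverse=True):
        key = (row["username"], row["profile_photo"])
        if key not in seen:
            seen.add(key)
            out.append(row)
        if len(out) == n:
            break
    return out

rows = [
    {"id": 1, "username": "juan",   "profile_photo": "urlphoto/juan"},
    {"id": 2, "username": "nestor", "profile_photo": "urlphoto/nestor"},
    {"id": 3, "username": "pablo",  "profile_photo": "urlphoto/pablo"},
    {"id": 4, "username": "pablo",  "profile_photo": "urlphoto/pablo"},
]
print([(r["id"], r["username"]) for r in last_n_distinct(rows, 2)])
# [(4, 'pablo'), (2, 'nestor')]
```

Note that it returns the pablo row with id 4 rather than id 3: under the "any id will do" assumption both answers are acceptable.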

Let's start with some test data:

CREATE UNLOGGED TABLE profile_photos (id int, username text, profile_photo text);
Time: 417.014 ms

INSERT INTO profile_photos
SELECT g.id, r.username, 'urlphoto/' || r.username
FROM generate_series(1, 10000000) g (id)
CROSS JOIN substr(md5(g.id::text), 0, 8) r (username);
INSERT 0 10000000
Time: 24497.335 ms

I'll test two possible solutions; here are the two indexes, one for each:

CREATE INDEX id_btree ON profile_photos USING btree (id);
CREATE INDEX
Time: 8139.347 ms

CREATE INDEX username_profile_photo_id_btree ON profile_photos USING btree (username, profile_photo, id DESC);
CREATE INDEX
Time: 81667.411 ms

VACUUM ANALYZE profile_photos;
VACUUM
Time: 1338.034 ms

So the first solution is the one given by Sami and Clément (their queries are essentially the same):

SELECT min(id), username, profile_photo 
FROM profile_photos
GROUP BY username, profile_photo
ORDER BY min(id) DESC 
LIMIT 2;

   min    | username |  profile_photo   
----------+----------+------------------
 10000000 | d1ca3aa  | urlphoto/d1ca3aa
  9999999 | 283f427  | urlphoto/283f427
(2 rows)
Time: 5088.611 ms

The result looks right, but this query can yield undesired results if any of those users has posted a profile photo before. Let's emulate that:

UPDATE profile_photos
SET (username, profile_photo) = ('d1ca3aa', 'urlphoto/d1ca3aa')
WHERE id = 1;
UPDATE 1
Time: 1.313 ms

SELECT min(id), username, profile_photo 
FROM profile_photos
GROUP BY username, profile_photo
ORDER BY min(id) DESC 
LIMIT 2;

   min   | username |  profile_photo   
---------+----------+------------------
 9999999 | 283f427  | urlphoto/283f427
 9999998 | facf1f3  | urlphoto/facf1f3
(2 rows)
Time: 5032.213 ms

So the query is ignoring anything newer the user might have added. That doesn't look like what you want, so I suggest replacing min(id) with max(id):
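To see concretely why min(id) misbehaves, here's a tiny Python model of the grouping, with hypothetical users a, b, c (not the table above); user a posts first and then again last:

```python
from collections import defaultdict

# Hypothetical mini-table: user 'a' posts first (id 1) and again last (id 4).
rows = [(1, "a"), (2, "b"), (3, "c"), (4, "a")]

def top2(agg):
    """Group ids by user, reduce each group with agg (min or max),
    then take the two groups with the highest aggregated id."""
    groups = defaultdict(list)
    for rid, user in rows:
        groups[user].append(rid)
    ranked = sorted(((agg(ids), user) for user, ids in groups.items()), reverse=True)
    return [user for _, user in ranked[:2]]

print(top2(min))  # ['c', 'b'] -- 'a', who posted most recently, vanishes
print(top2(max))  # ['a', 'c'] -- max(id) ranks each user by latest activity
```

With min(id), user a's group is pinned to its oldest id (1), so the most recent poster drops out of the top two; max(id) ranks each group by its newest row.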

SELECT max(id), username, profile_photo 
FROM profile_photos
GROUP BY username, profile_photo
ORDER BY max(id) DESC 
LIMIT 2;

   max    | username |  profile_photo   
----------+----------+------------------
 10000000 | d1ca3aa  | urlphoto/d1ca3aa
  9999999 | 283f427  | urlphoto/283f427
(2 rows)
Time: 5068.507 ms

Right, but it looks slow. The query plan is:

                                                                                         QUERY PLAN                                                                                          
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=655369.97..655369.98 rows=2 width=29) (actual time=6215.284..6215.285 rows=2 loops=1)
   ->  Sort  (cost=655369.97..678809.36 rows=9375755 width=29) (actual time=6215.282..6215.282 rows=2 loops=1)
         Sort Key: (max(id))
         Sort Method: top-N heapsort  Memory: 25kB
         ->  GroupAggregate  (cost=0.56..561612.42 rows=9375755 width=29) (actual time=0.104..4945.534 rows=9816449 loops=1)
               ->  Index Only Scan using username_profile_photo_id_btree on profile_photos  (cost=0.56..392855.43 rows=9999925 width=29) (actual time=0.089..1849.036 rows=10000000 loops=1)
                     Heap Fetches: 0
 Total runtime: 6215.344 ms
(8 rows)

The thing to notice here is that there's no legitimate use of an aggregate that would warrant a GROUP BY: the GROUP BY in this case is only used to filter out duplicates, and the single aggregate is a work-around to pick any one row from each group. Postgres has an extension, DISTINCT ON, that lets you discard duplicates on a set of columns:
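As a rough illustration of what DISTINCT ON does, here is a Python sketch over the question's sample rows (keeping the first row of each key group after sorting; like DISTINCT ON without a tie-breaking ORDER BY, which row survives per group is essentially arbitrary):

```python
from itertools import groupby

def distinct_on(rows, key):
    """Rough emulation of SELECT DISTINCT ON (key): sort by the key,
    then keep the first row of each group. The stable sort means the
    survivor is the first occurrence in input order."""
    return [next(g) for _, g in groupby(sorted(rows, key=key), key=key)]

rows = [
    {"id": 1, "username": "juan",   "profile_photo": "urlphoto/juan"},
    {"id": 2, "username": "nestor", "profile_photo": "urlphoto/nestor"},
    {"id": 3, "username": "pablo",  "profile_photo": "urlphoto/pablo"},
    {"id": 4, "username": "pablo",  "profile_photo": "urlphoto/pablo"},
]
dedup = distinct_on(rows, key=lambda r: (r["username"], r["profile_photo"]))
# Outer query: ORDER BY id DESC LIMIT 2
last_two = sorted(dedup, key=lambda r: r["id"], reverse=True)[:2]
print([r["id"] for r in last_two])  # [3, 2]
```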

SELECT *
FROM (    
    SELECT DISTINCT ON (username, profile_photo) *
    FROM profile_photos
) X
ORDER BY id DESC
LIMIT 2;

    id    | username |  profile_photo   
----------+----------+------------------
 10000000 | d1ca3aa  | urlphoto/d1ca3aa
  9999999 | 283f427  | urlphoto/283f427
(2 rows)
Time: 3779.723 ms

That's a bit faster, and here's why:

                                                                                         QUERY PLAN                                                                                          
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=630370.16..630370.17 rows=2 width=29) (actual time=4921.031..4921.031 rows=2 loops=1)
   ->  Sort  (cost=630370.16..653809.55 rows=9375755 width=29) (actual time=4921.030..4921.030 rows=2 loops=1)
         Sort Key: profile_photos.id
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Unique  (cost=0.56..442855.06 rows=9375755 width=29) (actual time=0.114..4220.410 rows=9816449 loops=1)
               ->  Index Only Scan using username_profile_photo_id_btree on profile_photos  (cost=0.56..392855.43 rows=9999925 width=29) (actual time=0.111..2040.601 rows=10000000 loops=1)
                     Heap Fetches: 0
 Total runtime: 4921.081 ms
(8 rows)

What if we could fetch the last row with a simple ORDER BY id DESC LIMIT 1, and then look from the end of the table for another row that isn't a duplicate of it?

WITH first AS (
    SELECT *
    FROM profile_photos
    ORDER BY id DESC
    LIMIT 1
)
SELECT *
FROM first
UNION ALL
(SELECT *
FROM profile_photos p
WHERE EXISTS (
    SELECT 1
    FROM first
    WHERE (first.username, first.profile_photo) <> (p.username, p.profile_photo))
ORDER BY id DESC
LIMIT 1);

    id    | username |  profile_photo   
----------+----------+------------------
 10000000 | d1ca3aa  | urlphoto/d1ca3aa
  9999999 | 283f427  | urlphoto/283f427
(2 rows)
Time: 1.217 ms

This is very fast, but hand-tailored to yield only two rows. Let's replace it with something more "automatic":

WITH RECURSIVE last (id, username, profile_photo, a) AS (
    (SELECT id, username, profile_photo, ARRAY[ROW(username, profile_photo)] a
    FROM profile_photos
    ORDER BY id DESC
    LIMIT 1)
    UNION ALL
    (SELECT older.id, older.username, older.profile_photo, last.a || ROW(older.username, older.profile_photo)
    FROM last
    JOIN profile_photos older ON last.id > older.id AND NOT ROW(older.username, older.profile_photo) = ANY(last.a)
    WHERE array_length(a, 1) < 10
    ORDER BY id DESC
    LIMIT 1)
)
SELECT id, username, profile_photo
FROM last;

    id    | username |  profile_photo   
----------+----------+------------------
 10000000 | d1ca3aa  | urlphoto/d1ca3aa
  9999999 | 283f427  | urlphoto/283f427
  9999998 | facf1f3  | urlphoto/facf1f3
  9999997 | 305ebab  | urlphoto/305ebab
  9999996 | 74ab43a  | urlphoto/74ab43a
  9999995 | 23f2458  | urlphoto/23f2458
  9999994 | 6b465af  | urlphoto/6b465af
  9999993 | 33ee85a  | urlphoto/33ee85a
  9999992 | c0b9ef4  | urlphoto/c0b9ef4
  9999991 | b63d5bf  | urlphoto/b63d5bf
(10 rows)
Time: 2706.837 ms

This is faster than the previous queries, but as you can see in the query plan below, for each row it yields it has to scan the index on id.

                                                                                      QUERY PLAN                                                                                       
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CTE Scan on last  (cost=6.52..6.74 rows=11 width=68) (actual time=0.104..4439.807 rows=10 loops=1)
   CTE last
     ->  Recursive Union  (cost=0.43..6.52 rows=11 width=61) (actual time=0.098..4439.780 rows=10 loops=1)
           ->  Limit  (cost=0.43..0.47 rows=1 width=29) (actual time=0.095..0.095 rows=1 loops=1)
                 ->  Index Scan Backward using id_btree on profile_photos  (cost=0.43..333219.47 rows=9999869 width=29) (actual time=0.093..0.093 rows=1 loops=1)
           ->  Limit  (cost=0.43..0.58 rows=1 width=61) (actual time=443.965..443.966 rows=1 loops=10)
                 ->  Nested Loop  (cost=0.43..1406983.38 rows=9510977 width=61) (actual time=443.964..443.964 rows=1 loops=10)
                       Join Filter: ((last_1.id > older.id) AND (ROW(older.username, older.profile_photo) <> ALL (last_1.a)))
                       Rows Removed by Join Filter: 8
                       ->  Index Scan Backward using id_btree on profile_photos older  (cost=0.43..333219.47 rows=9999869 width=29) (actual time=0.008..167.755 rows=1000010 loops=10)
                       ->  WorkTable Scan on last last_1  (cost=0.00..0.25 rows=3 width=36) (actual time=0.000..0.000 rows=0 loops=10000102)
                             Filter: (array_length(a, 1) < 10)
                             Rows Removed by Filter: 1
 Total runtime: 4439.907 ms
(14 rows)

Since Postgres 9.3 there's a new join type available, the LATERAL join. It lets you make a join decision at the row level (i.e. it works "for each row"). We can use that to implement the following logic: "for as long as we don't have N rows, for each generated row see if there's a row older than the last one that isn't a duplicate, and if there is, add it to the result".
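The same row-at-a-time logic can be sketched in plain Python, with an in-memory id-descending list standing in for the backward index scan (illustrative only; in the real query each lookup is a cheap index probe):

```python
def last_n_distinct_lateral(rows_desc, n):
    """Mirror the recursive CTE + LATERAL logic: keep asking, one row
    at a time, for the newest row older than the last one found whose
    (username, profile_photo) pair hasn't been collected yet."""
    collected = []          # plays the role of the CTE's array column a
    result = []
    last_id = float("inf")
    while len(result) < n:
        # The LATERAL step: first matching row in id-descending order.
        older = next((r for r in rows_desc
                      if r["id"] < last_id
                      and (r["username"], r["profile_photo"]) not in collected),
                     None)
        if older is None:   # table exhausted before reaching n rows
            break
        collected.append((older["username"], older["profile_photo"]))
        result.append(older)
        last_id = older["id"]
    return result

rows_desc = sorted([
    {"id": 1, "username": "juan",   "profile_photo": "urlphoto/juan"},
    {"id": 2, "username": "nestor", "profile_photo": "urlphoto/nestor"},
    {"id": 3, "username": "pablo",  "profile_photo": "urlphoto/pablo"},
    {"id": 4, "username": "pablo",  "profile_photo": "urlphoto/pablo"},
], key=lambda r: r["id"], reverse=True)
print([r["id"] for r in last_n_distinct_lateral(rows_desc, 2)])  # [4, 2]
```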

WITH RECURSIVE last (id, username, profile_photo, a) AS (
    (SELECT id, username, profile_photo, ARRAY[ROW(username, profile_photo)] a
    FROM profile_photos
    ORDER BY id DESC
    LIMIT 1)
    UNION ALL
    (SELECT older.id, older.username, older.profile_photo, last.a || ROW(older.username, older.profile_photo)
    FROM last
    CROSS JOIN LATERAL (
        SELECT *
        FROM profile_photos older
        WHERE last.id > older.id AND NOT ROW(older.username, older.profile_photo) = ANY(last.a)
        ORDER BY id DESC
        LIMIT 1
    ) older
    WHERE array_length(a, 1) < 10
    ORDER BY id DESC
    LIMIT 1)
)
SELECT id, username, profile_photo
FROM last;

    id    | username |  profile_photo   
----------+----------+------------------
 10000000 | d1ca3aa  | urlphoto/d1ca3aa
  9999999 | 283f427  | urlphoto/283f427
  9999998 | facf1f3  | urlphoto/facf1f3
  9999997 | 305ebab  | urlphoto/305ebab
  9999996 | 74ab43a  | urlphoto/74ab43a
  9999995 | 23f2458  | urlphoto/23f2458
  9999994 | 6b465af  | urlphoto/6b465af
  9999993 | 33ee85a  | urlphoto/33ee85a
  9999992 | c0b9ef4  | urlphoto/c0b9ef4
  9999991 | b63d5bf  | urlphoto/b63d5bf
(10 rows)
Time: 1.966 ms

Now that's fast... until N gets too big.

                                                                                        QUERY PLAN                                                                                        
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CTE Scan on last  (cost=18.61..18.83 rows=11 width=68) (actual time=0.074..0.359 rows=10 loops=1)
   CTE last
     ->  Recursive Union  (cost=0.43..18.61 rows=11 width=61) (actual time=0.070..0.346 rows=10 loops=1)
           ->  Limit  (cost=0.43..0.47 rows=1 width=29) (actual time=0.067..0.068 rows=1 loops=1)
                 ->  Index Scan Backward using id_btree on profile_photos  (cost=0.43..333219.47 rows=9999869 width=29) (actual time=0.065..0.065 rows=1 loops=1)
           ->  Limit  (cost=1.79..1.79 rows=1 width=61) (actual time=0.026..0.026 rows=1 loops=10)
                 ->  Sort  (cost=1.79..1.80 rows=3 width=61) (actual time=0.025..0.025 rows=1 loops=10)
                       Sort Key: older.id
                       Sort Method: quicksort  Memory: 25kB
                       ->  Nested Loop  (cost=0.43..1.77 rows=3 width=61) (actual time=0.020..0.021 rows=1 loops=10)
                             ->  WorkTable Scan on last last_1  (cost=0.00..0.25 rows=3 width=36) (actual time=0.001..0.001 rows=1 loops=10)
                                   Filter: (array_length(a, 1) < 10)
                                   Rows Removed by Filter: 0
                             ->  Limit  (cost=0.43..0.49 rows=1 width=29) (actual time=0.017..0.017 rows=1 loops=9)
                                   ->  Index Scan Backward using id_btree on profile_photos older  (cost=0.43..161076.14 rows=3170326 width=29) (actual time=0.016..0.016 rows=1 loops=9)
                                         Index Cond: (last_1.id > id)
                                         Filter: (ROW(username, profile_photo) <> ALL (last_1.a))
                                         Rows Removed by Filter: 0
 Total runtime: 0.439 ms
(19 rows)
