
PostgreSQL concurrent transaction issues

I'm currently building a crawler. Multiple crawling workers access the same PostgreSQL database. Sadly, I'm encountering issues with the main transaction presented here:

BEGIN ISOLATION LEVEL SERIALIZABLE;
    UPDATE webpages
    SET locked = TRUE
    WHERE url IN 
        (
            SELECT DISTINCT ON (source) url
            FROM webpages
            WHERE
                (
                    last IS NULL
                    OR
                    last < refreshFrequency
                )
                AND
                locked = FALSE
            LIMIT limit
        )
    RETURNING *;
COMMIT;
  • url is a URL (String)
  • source is a domain name (String)
  • last is the last time a page was crawled (Date)
  • locked is a boolean set to indicate that a webpage is currently being crawled (Boolean)

I tried two different transaction isolation levels:

  • ISOLATION LEVEL SERIALIZABLE: I get errors like could not serialize access due to concurrent update
  • ISOLATION LEVEL READ COMMITTED: I get duplicate urls from concurrent transactions, because each transaction works on a snapshot of the data "frozen" at the time it started (I think)
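To illustrate the READ COMMITTED failure mode, here is a rough timeline of two workers interleaving (a simplified sketch using the table from the question; 'X' stands for a hypothetical URL):

```sql
-- Worker A                               -- Worker B
BEGIN;
-- subquery snapshot: sees 'X' with
-- locked = FALSE, so picks it
                                          BEGIN;
                                          -- takes its own snapshot before A
                                          -- commits: also sees 'X' unlocked
UPDATE webpages SET locked = TRUE
WHERE  url = 'X';
COMMIT;
                                          UPDATE webpages SET locked = TRUE
                                          WHERE  url = 'X';
                                          -- blocks until A commits; the outer
                                          -- WHERE (url only) still matches,
                                          -- so B claims 'X' a second time
                                          COMMIT;
```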

I'm fairly new to PostgreSQL and SQL in general, so I'm really not sure what I could do to fix this issue.

Update:
PostgreSQL version is 9.2.x.
webpages table definition:

CREATE TABLE webpages (
  last timestamp with time zone,
  locked boolean DEFAULT false,
  url text NOT NULL,
  source character varying(255) PRIMARY KEY
);

Clarification

The question leaves room for interpretation. This is how I understand the task:

Lock a maximum of limit URLs which fulfill some criteria and are not locked yet. To spread the load across sources, every URL should come from a different source.

DB design

Assuming a separate table source: this makes the job faster and easier. If you don't have such a table, create it; it's the proper design anyway:

CREATE TABLE source (
  source_id serial NOT NULL PRIMARY KEY
, source    text NOT NULL
);

CREATE TABLE webpage (
  source_id int NOT NULL REFERENCES source,
  url       text NOT NULL PRIMARY KEY,
  locked    boolean NOT NULL DEFAULT false,        -- may not be needed
  last      timestamp NOT NULL DEFAULT '-infinity' -- makes query simpler
);
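If you are migrating from the existing webpages table, the new tables could be populated along these lines (a sketch; column names are taken from the question, and COALESCE supplies the '-infinity' default for rows that were never crawled):

```sql
INSERT INTO source (source)
SELECT DISTINCT source
FROM   webpages;

INSERT INTO webpage (source_id, url, locked, last)
SELECT s.source_id, w.url, w.locked, COALESCE(w.last, '-infinity')
FROM   webpages w
JOIN   source   s USING (source);
```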

Alternatively, you can derive the list of distinct sources on the fly with a recursive CTE efficiently.

Basic solution with advisory locks

I am using advisory locks to make this safe and cheap even at the default READ COMMITTED isolation level:

UPDATE webpage w
SET    locked = TRUE
FROM  (
   SELECT (SELECT url
           FROM   webpage
           WHERE  source_id = s.source_id
           AND   (last >= refreshFrequency) IS NOT TRUE
           AND    locked = FALSE
           AND    pg_try_advisory_xact_lock(hashtext(url))  -- only true if free; advisory locks take int keys, so hash the text url
           LIMIT  1     -- get 1 URL per source
          ) AS url
   FROM  (
      SELECT source_id  -- the FK column in webpage
      FROM   source
      ORDER  BY random()
      LIMIT  limit      --  random selection of "limit" sources
      ) s
   FOR    UPDATE
   ) l
WHERE  w.url = l.url
RETURNING *;

Alternatively, you could work with advisory locks only and not use the table column locked at all. Basically, just run the SELECT statement; the locks are kept until the end of the transaction. You can use pg_try_advisory_lock() instead to keep the locks until the end of the session. Then UPDATE just once to set last when done (and possibly release the advisory lock).
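A lock-only workflow might look like this (a sketch; refreshFrequency stands for your actual cutoff timestamp, and hashtext() maps the text url to the integer key that the advisory-lock functions require). Note that with LIMIT, the lock function may be evaluated on more rows than are returned, so a few stray locks are possible:

```sql
-- claim work; session-level locks outlive the transaction
SELECT url
FROM   webpage
WHERE  (last >= refreshFrequency) IS NOT TRUE
AND    pg_try_advisory_lock(hashtext(url))  -- true only if nobody holds it
LIMIT  10;

-- ... crawl the returned URLs ...

-- per finished URL: record the crawl and release the lock
UPDATE webpage SET last = now() WHERE url = 'http://example.com/page';
SELECT pg_advisory_unlock(hashtext('http://example.com/page'));
```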

Other major points

  • In Postgres 9.3 or later you would use a LATERAL join instead of the correlated subquery.

  • I chose pg_try_advisory_xact_lock() because the lock can (and should) be released at the end of the transaction.

  • You get fewer than limit rows if some sources have no more URLs to crawl.

  • The random selection of sources is my wild but educated guess, since that information is not available. If your source table is big, there are faster ways.

  • refreshFrequency should really be called something like latest_last, since it's not a "frequency" but a timestamp or date.
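For reference, on Postgres 9.3+ the correlated subquery in the main query could be rewritten as a LATERAL join, roughly like this (a sketch under the same assumptions as above; limit and refreshFrequency are placeholders):

```sql
UPDATE webpage w
SET    locked = TRUE
FROM  (
   SELECT u.url
   FROM  (
      SELECT source_id
      FROM   source
      ORDER  BY random()
      LIMIT  limit                -- placeholder
      ) s
   CROSS  JOIN LATERAL (
      SELECT url
      FROM   webpage
      WHERE  source_id = s.source_id
      AND   (last >= refreshFrequency) IS NOT TRUE
      AND    locked = FALSE
      AND    pg_try_advisory_xact_lock(hashtext(url))
      LIMIT  1                    -- 1 URL per source
      ) u
   ) l
WHERE  w.url = l.url
RETURNING *;
```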

Recursive alternative

To get the full limit number of rows if available, use a RECURSIVE CTE and iterate over all sources until you have found enough, or no more can be found.

As I mentioned above, you may not need the column locked at all and can operate with advisory locks only (cheaper). Just set last at the end of the transaction, before you start the next round.

WITH RECURSIVE s AS (
   SELECT source_id, row_number() OVER (ORDER BY random()) AS rn
   FROM source  -- you might exclude "empty" sources early ...
   )
, page(source_id, rn, ct, url) AS (
   SELECT 0, 0, 0, ''::text   -- dummy init row
   UNION ALL
   SELECT s.source_id, s.rn
        , CASE WHEN p.url <> ''   -- p = previous round's row
               THEN p.ct + 1
               ELSE p.ct END  -- only inc. if url found last round
        , (SELECT url
           FROM   webpage
           WHERE  source_id = s.source_id
           AND   (last >= refreshFrequency) IS NOT TRUE
           AND    locked = FALSE  -- may not be needed
           AND    pg_try_advisory_xact_lock(hashtext(url))  -- hash text url to int key; only true if free
           LIMIT  1           -- get 1 URL per source
          ) AS url            -- try, may come up empty
   FROM   page p
   JOIN   s ON s.rn = p.rn + 1
   WHERE  CASE WHEN p.url <> ''
               THEN p.ct + 1
               ELSE p.ct END < limit  -- your limit here
   )
SELECT url
FROM   page
WHERE  url <> '';             -- exclude '' and NULL

Alternatively, if you need to manage locked too, use this query with the UPDATE from above.
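Concretely, the two CTEs can stay exactly as written, and only the final SELECT changes into an UPDATE, something like this (sketch):

```sql
-- keep the CTEs s and page from the recursive query above, then replace
-- the final SELECT with:
UPDATE webpage w
SET    locked = TRUE
FROM   page p
WHERE  w.url = p.url      -- the dummy row ('' / NULL) never matches a real url
RETURNING w.*;
```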

Further reading

You will love SKIP LOCKED in the upcoming Postgres 9.5.
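With 9.5 you can let the database itself skip rows other workers have already claimed; a minimal sketch (no per-source spreading here, since FOR UPDATE cannot be combined with DISTINCT ON; limit and refreshFrequency remain placeholders):

```sql
BEGIN;
UPDATE webpage w
SET    locked = TRUE
FROM  (
   SELECT url
   FROM   webpage
   WHERE  (last >= refreshFrequency) IS NOT TRUE
   AND    locked = FALSE
   LIMIT  limit                -- placeholder
   FOR    UPDATE SKIP LOCKED   -- ignore rows locked by other workers
   ) l
WHERE  w.url = l.url
RETURNING *;
COMMIT;
```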


First try:

UPDATE webpages
SET locked = TRUE
WHERE url IN 
    (
        SELECT DISTINCT ON (source) url
        FROM webpages
        WHERE
            (
                last IS NULL
                OR
                last < refreshFrequency
            )
            AND
            locked = FALSE
        LIMIT limit
    )
    AND
       (
           last IS NULL
           OR
           last < refreshFrequency
        )
        AND
        locked = FALSE

You are trying to update only records with locked = FALSE.
Imagine that there are the following records in the table:

URL       locked
----------------
A         false
A         true

The subquery in your update will return A.
Then the outer update will perform:

   UPDATE webpages
    SET locked = TRUE
    WHERE url IN ( 'A' )

and in effect all records in the table containing url = A will be updated,
regardless of their values in the locked column.

You need to apply the same WHERE condition to the outer update as in the subquery.
