
Stored procedures in batches in PostgreSQL

I need to save a lot of entities into a database. Saving an entity involves adding rows to several tables, with keys autogenerated by the insert into one table being used to insert rows into other tables. This logic led me to create and use a stored procedure. Calling this stored procedure for every entity separately (i.e. via statement.execute(...)) works fine, except that there will be billions of entities to save. So I made an attempt to do this in batches. However, with batches, a batch execution throws org.postgresql.util.PSQLException with the message 'A result was returned when none was expected.'

My stored procedure is as follows:

CREATE OR REPLACE FUNCTION insertSentence(warcinfoID varchar, recordID varchar, sentence varchar,
    sent_timestamp bigint, sect_ids smallint[]) RETURNS void AS $$
DECLARE
    warcinfoIdId integer := 0;
    recordIdId integer := 0;
    sentId integer := 0;
    id integer := 0;
BEGIN
    SELECT warcinfo_id_id INTO warcinfoIdId FROM warcinfo_id WHERE warcinfo_id_value = warcinfoID;
    IF NOT FOUND THEN
        INSERT INTO warcinfo_id (warcinfo_id_value) VALUES (warcinfoID)
            RETURNING warcinfo_id_id INTO STRICT warcinfoIdId;
    END IF;
    SELECT record_id_id INTO recordIdId FROM record_id WHERE record_id_value = recordID;
    IF NOT FOUND THEN
        INSERT INTO record_id (record_id_value) VALUES (recordID)
            RETURNING record_id_id INTO STRICT recordIdId;
    END IF;
    -- Find-or-insert loop: if a concurrent transaction inserts the same
    -- sentence first, the unique_violation handler clears sentId and the
    -- lookup is retried.
    LOOP
        SELECT sent_id INTO sentId FROM sentence_text
            WHERE md5(sent_text) = md5(sentence) AND sent_text = sentence;
        EXIT WHEN FOUND;
        BEGIN
            INSERT INTO sentence_text (sent_text) VALUES (sentence) RETURNING sent_id INTO STRICT sentId;
        EXCEPTION WHEN unique_violation THEN
            sentId := 0;
        END;
    END LOOP;
    INSERT INTO sentence_occurrence (warcinfo_id, record_id, sent_id, timestamp, sect_ids)
        VALUES (warcinfoIdId, recordIdId, sentId, TO_TIMESTAMP(sent_timestamp), sect_ids)
        RETURNING entry_id INTO STRICT id;
END;
$$ LANGUAGE plpgsql;

and the Scala code is as follows:

def partition2DB(iterator: Iterator[(String, String, String, Long, Array[Int])]): Unit = {
  Class.forName(driver)
  val conn = DriverManager.getConnection(connectionString)

  try {
    val statement = conn.createStatement()
    var i = 0
    iterator.foreach(r => {
      i += 1
      statement.addBatch(
        "select insertSentence('%s', '%s', '%s', %d, '{%s}');".format(
          r._1, r._2, r._3.replaceAll("'", "''"), r._4, r._5.mkString(","))
      )
      if (i % 1000 == 0) statement.executeBatch()
    })
    if (i % 1000 != 0) statement.executeBatch()
  } catch {
    case e: SQLException => println("exception caught: " + e.getNextException());
  } finally {
    conn.close
  }
}

Strangely, even though statement.executeBatch() throws an exception, it saves the entities that came before it. So this workaround makes things work:

def partition2DB(iterator: Iterator[(String, String, String, Long, Array[Int])]): Unit = {
  Class.forName(driver)
  val conn = DriverManager.getConnection(connectionString)

  try {
    var statement = conn.createStatement()
    var i = 0
    iterator.foreach(r => {
      i += 1
      statement.addBatch(
        "select insertSentence('%s', '%s', '%s', %d, '{%s}');".format(
          r._1, r._2, r._3.replaceAll("'", "''"), r._4, r._5.mkString(","))
      )
      if (i % 1000 == 0) {
        i = 0
        try {
          statement.executeBatch()
        } catch {
          case e: SQLException => statement = conn.createStatement()
        }
      }
    })
    if (i % 1000 != 0) {
      try {
        statement.executeBatch()
      } catch {
        case e: SQLException => statement = conn.createStatement()
      }
    }
  } catch {
    case e: SQLException => println("exception caught: " + e.getNextException());
  } finally {
    conn.close
  }
}

However, I would rather not rely on the undocumented PostgreSQL behaviour I'm currently using. I see that other people have also come across this problem.

Can someone suggest a solution?

Strangely, even though statement.executeBatch() throws an exception, it saves the entities that came before it.

That's because you didn't wrap the batch in a transaction. The JDBC spec doesn't (IIRC) make it clear whether batches should be implicitly wrapped in a transaction if one isn't already in progress, or fired as individual statements; nor whether the implementation should carry on with the batch after an error.

To get well-defined behaviour (and better performance), wrap the batch in a transaction.
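
A minimal sketch of that pattern, assuming the same Connection and addBatch(...) calls as in the question (the helper name is illustrative):

import java.sql.{Connection, SQLException, Statement}

// Sketch: execute one batch atomically, so nothing is kept on failure
// and there is no reliance on partially-applied batches.
def runBatch(conn: Connection)(fill: Statement => Unit): Unit = {
  conn.setAutoCommit(false)        // open an explicit transaction
  val st = conn.createStatement()
  try {
    fill(st)                       // the addBatch(...) calls go here
    st.executeBatch()
    conn.commit()                  // the whole batch commits, or none of it does
  } catch {
    case e: SQLException =>
      conn.rollback()              // discard the failed batch entirely
      throw e
  } finally {
    st.close()
  }
}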

  statement.addBatch( "select insertSentence('%s', '%s', '%s', %d, '{%s}');".format( r._1, r._2, r._3.replaceAll("'", "''"), r._4, r._5.mkString(",")) ) 

No! Step away from the keyboard! Come on, you're not a PHP programmer :p

You know better than to interpolate strings into SQL. Don't do that. Use PreparedStatement. In addition to being safer and more secure, it'll also be faster, since PgJDBC only has to send one statement to parse and can then re-use it. PreparedStatement is ideally suited for use in JDBC batches.
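
For illustration, a minimal sketch of the binding pattern, shown with a plain parameterized INSERT into the sentence_text table from the question (a select insertSentence(...) returns a row, which the driver still rejects inside a batch, so an INSERT makes a cleaner example):

import java.sql.Connection

// Sketch: bound parameters instead of string interpolation; the driver
// handles quoting, so no replaceAll("'", "''") is needed.
def insertSentences(conn: Connection, sentences: Iterator[String]): Unit = {
  val ps = conn.prepareStatement(
    "INSERT INTO sentence_text (sent_text) VALUES (?) ON CONFLICT (sent_text) DO NOTHING")
  var i = 0
  sentences.foreach { s =>
    ps.setString(1, s)
    ps.addBatch()
    i += 1
    if (i % 1000 == 0) ps.executeBatch()   // flush every 1000 rows
  }
  if (i % 1000 != 0) ps.executeBatch()     // flush the remainder
  ps.close()
}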

Now, stepping back a bit...

Saving an entity involves adding rows to several tables, with keys autogenerated by the insert into one table being used to insert rows into other tables. This logic led me to create and use a stored procedure.

That's the simple way to write it, but it won't perform wonderfully. You're doing lots of independent operations on different tables, lots of piecemeal index updates, etc. There's also procedure call overhead, the overhead of each individual query, and so on. Each BEGIN ... EXCEPTION ... block in PL/pgSQL has a nontrivial overhead too.

You're going to have problems with hundreds of thousands or millions of rows with this approach, let alone billions.

Relational databases think best in sets. If you're really looking at billions of rows, a proc-based approach will not work. You'll want to take batches of the raw input, insert them into a temp table, then use a series of queries over the temp data to insert it into the destination table(s).

You'll need to get familiar with INSERT INTO ... SELECT ..., UPDATE ... FROM ..., data-modifying common table expressions, and so on. If you're on PostgreSQL 9.5 you'll benefit from using INSERT ... ON CONFLICT ... for your upsert-like operations.
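
A hedged sketch of what that set-based flow could look like, using the table and column names from the stored procedure above (the staging table and its name are illustrative):

// Sketch: load a raw batch into a staging table, then fold it into the
// destination tables with set-based statements instead of per-row calls.
statement.executeUpdate(
  """
    |CREATE UNLOGGED TABLE IF NOT EXISTS staging (
    |    warcinfo_id_value char(36),
    |    record_id_value char(36),
    |    sent_text varchar,
    |    sent_timestamp bigint,
    |    sect_ids smallint ARRAY
    |);
  """.stripMargin)
// ... bulk-load the batch into staging here (COPY, or batched INSERTs) ...
statement.executeUpdate(
  """
    |INSERT INTO warcinfo_id (warcinfo_id_value)
    |    SELECT DISTINCT warcinfo_id_value FROM staging
    |    ON CONFLICT (warcinfo_id_value) DO NOTHING;
  """.stripMargin)
// ... the same pattern for record_id and sentence_text ...
statement.executeUpdate(
  """
    |INSERT INTO sentence_occurrence (warcinfo_id, record_id, sent_id, timestamp, sect_ids)
    |    SELECT w.warcinfo_id_id, r.record_id_id, t.sent_id,
    |           TO_TIMESTAMP(s.sent_timestamp), s.sect_ids
    |    FROM staging s
    |    JOIN warcinfo_id w ON w.warcinfo_id_value = s.warcinfo_id_value
    |    JOIN record_id r ON r.record_id_value = s.record_id_value
    |    JOIN sentence_text t ON t.sent_text = s.sent_text;
  """.stripMargin)
statement.executeUpdate("TRUNCATE staging;")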

It'll hurt to think this way for a while, but it's well worth it; you won't believe the performance you get when you work in sets rather than on individual items.

I can't write the whole thing for you: you've shown no raw data, no schema, and no explanation of the details. Which is fine, since that wasn't your question. It'd be too long anyway, and SO isn't a code-it-for-me site.

Okay, I got rid of the stored procedure to keep the batch from failing and, as a result, to stop relying on undocumented behaviour in case of batch failure. Batches are now wrapped in transactions, and Statement was replaced with PreparedStatement (in fact, that didn't improve the speed of this script). I made use of INSERT INTO ... SELECT ... and INSERT ... ON CONFLICT ..., so a lot of logic moved from the stored procedure into SQL commands.

Now it looks like this:

def partition2DB(iterator: Iterator[(String, String, String, Long, Array[Short])]): Unit = {
  val batchSize = 1000
  val nRetries = 10

  def updStatements(item: (String, String, String, Long, Array[Short]), c: Connection, statement1: PreparedStatement,
                    statement2: PreparedStatement, statement3: PreparedStatement, statement4: PreparedStatement) = {
    val sentence = if (item._3.length > 2712) item._3.substring(0, 2712) else item._3
    statement1.setString(1, item._1)
    statement2.setString(1, item._2)
    statement3.setString(1, sentence)
    statement4.setString(1, item._1)
    statement4.setString(2, item._2)
    statement4.setString(3, sentence)
    statement4.setString(4, sentence)
    statement4.setLong(5, item._4)
    statement4.setArray(6, c.createArrayOf("int4", item._5.map(new Integer(_)).asInstanceOf[Array[Object]]))
    statement1.addBatch()
    statement2.addBatch()
    statement3.addBatch()
    statement4.addBatch()
  }
  def executeStatements(statement1: PreparedStatement, statement2: PreparedStatement,
                        statement3: PreparedStatement, statement4: PreparedStatement) = {
    statement1.executeBatch()
    statement2.executeBatch()
    statement3.executeBatch()
    statement4.executeBatch()
  }

  Class.forName(driver)
  var conn: Connection = null

  try {
    conn = DriverManager.getConnection(connectionString)
    conn.setAutoCommit(false)
    val statement1 = conn.prepareStatement("INSERT INTO warcinfo_id (warcinfo_id_value) VALUES (?) ON CONFLICT (warcinfo_id_value) DO NOTHING;")
    val statement2 = conn.prepareStatement("INSERT INTO record_id (record_id_value) VALUES (?) ON CONFLICT (record_id_value) DO NOTHING;")
    val statement3 = conn.prepareStatement("INSERT INTO sentence_text (sent_text) VALUES (?) ON CONFLICT (sent_text) DO NOTHING;")
    val statement4 = conn.prepareStatement(
      """
        |INSERT INTO sentence_occurrence (warcinfo_id, record_id, sent_id, timestamp, sect_ids) VALUES (
        |    (SELECT warcinfo_id_id FROM warcinfo_id WHERE warcinfo_id_value = ?),
        |    (SELECT record_id_id FROM record_id WHERE record_id_value = ?),
        |    (SELECT sent_id FROM sentence_text WHERE md5(sent_text) = md5(?) AND sent_text = ?),
        |    TO_TIMESTAMP(?),
        |    ?
        |)
      """.stripMargin)
    var i = 0
    val batch = ListBuffer[(String, String, String, Long, Array[Short])]()
    conn.setAutoCommit(false)

    def executeBatch() = {
      var attempts = 0
      while (attempts < nRetries) {
        try {
          for (item <- batch) updStatements(item, conn, statement1, statement2, statement3, statement4)
          executeStatements(statement1, statement2, statement3, statement4)
          conn.commit()
          batch.clear()
          attempts += nRetries // success: jump past the retry limit to exit the loop
        } catch {
          case e: SQLException => {
            attempts += 1
            println("exception caught: " + e.getNextException)
            conn.rollback() // undo the failed attempt before retrying the whole batch
          }
        }
      }
    }

    iterator.foreach(r => {
      i += 1
      batch += r
      if (i % batchSize == 0) {
        executeBatch()
      }
    })
    if (i % batchSize != 0) {
      executeBatch()
    }
  } catch {
    case e: SQLException => println("exception caught: " + e)
  } finally {
    conn.close()
  }
}

This code doesn't seem quite neat to me, though...

The data are a stream of items, each corresponding to some sentence together with its timestamp and some identifiers. So the content of the r variable is something like this: ('4af93233-3515-43da-8b47-71b0dad99ccc', 'd5ea8a14-be65-4281-9a87-24dcbdc3f879', 'the definitive guide being hater the internet', 1362484800, [1])

Each item is stored in the 'sentence_occurrence' table and, if needed, in 'warcinfo_id', 'record_id', and 'sentence_text'.

The schema is the following:

statement.executeUpdate(
  """
    |CREATE TABLE warcinfo_id (
    |    warcinfo_id_id serial PRIMARY KEY,
    |    warcinfo_id_value char(36) UNIQUE NOT NULL
    |);
  """.stripMargin)
statement.executeUpdate(
  """
    |CREATE TABLE record_id (
    |    record_id_id serial PRIMARY KEY,
    |    record_id_value char(36) UNIQUE NOT NULL
    |);
  """.stripMargin)
statement.executeUpdate(
  """
    |CREATE TABLE sentence_text (
    |    sent_id serial PRIMARY KEY,
    |    sent_text varchar UNIQUE NOT NULL
    |);
  """.stripMargin)
statement.executeUpdate(
  """
    |CREATE TABLE sentence_occurrence (
    |    entry_id serial PRIMARY KEY,
    |    warcinfo_id integer NOT NULL,
    |    record_id integer NOT NULL,
    |    sent_id integer NOT NULL,
    |    timestamp timestamp NOT NULL,
    |    sect_ids smallint ARRAY
    |);
  """.stripMargin)

Added after Craig's comment:

Thanks, Craig. What is an operation on sets of input? Could you post a link to some example?

Additionally, I've got the following problem. If two batches attempt to insert the same record into some table simultaneously, I get a java.sql.BatchUpdateException with a message like this: "ERROR: deadlock detected. Detail: Process 31959 waits for ShareLock on transaction 24298876; blocked by process 31955. Process 31955 waits for ShareLock on transaction 24298877; blocked by process 31959." What is the right solution for such cases? I can think of retrying the failed attempt until it succeeds or reaches the retry limit, storing duplicates and then producing the final result table with SELECT DISTINCT ..., or playing around with isolation levels (for example trying 'Read uncommitted'). However, all of these seem like dangerous workarounds (the number of retries reaches the limit, I run out of disk space, I get wrong data in the database).
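
A common mitigation for this kind of deadlock, beyond the retry loop already in the code above, is consistent lock ordering: if every transaction inserts the shared rows in the same order, the circular wait is far less likely to form. A minimal, illustrative sketch using the batch and updStatements names from the code above:

// Sketch: sort each batch by its keys before adding it to the statements,
// so concurrent transactions touch warcinfo_id / record_id / sentence_text
// rows in the same order instead of in opposite orders.
val orderedBatch = batch.sortBy(item => (item._1, item._2, item._3))
for (item <- orderedBatch)
  updStatements(item, conn, statement1, statement2, statement3, statement4)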
