Redshift JDBC batch insert works slow for multiple rows

I have Java code for inserting into Redshift like this:

String query = "INSERT INTO table (id, name, value) VALUES (?, ?, ?)";
PreparedStatement ps = connection.prepareStatement(query);            
for (Record record : records) {
    ps.setInt(1, record.id);
    ps.setString(2, record.name);
    ps.setInt(3, record.value);
    ps.addBatch();
}
ps.executeBatch();

records contains a few thousand items. I tried to run this code with Postgres - it inserted all of them almost instantly, while with Redshift it takes 10+ minutes. After that I rewrote it to the following code:

String query = "INSERT INTO table (id, name, value) VALUES ";
for (Record record : records) {
    query += "(" + record.id + ",'" + record.name + "'," + record.value + "),";
}
query = query.substring(0, query.length() - 1); // drop the trailing comma
PreparedStatement ps = connection.prepareStatement(query);
ps.executeUpdate();

And this fixed the performance. This code works fine for both Redshift and Postgres. My question is: what's wrong with the first code snippet, and how can I fix it? (I assume that the first code snippet for Redshift simply ignores batching.)

Inserting single rows multiple times is never a good plan on a columnar database. Postgres is row-based and Redshift is column-based.

Each INSERT on Postgres just makes another row, but on Redshift each insert requires that the column data is read, one element is added, and then the column written back. While Redshift doesn't work on the entire column, just the last 1MB block, it has to read this block for each INSERT.

Also, since Redshift is clustered and the data for your table is distributed around the cluster, each subsequent INSERT is accessing a different block on a different slice of the table. All these INSERTs that are accessing single slices of the cluster are serialized by the single-threaded nature of your code, so each access to a single slice has to complete before the next INSERT can be issued.

Your second code adds lots of rows of data into a single INSERT statement, which is compiled, and the data is sent to all slices of the database, where only the data for each slice is stored and the rest discarded. This uses the parallelism of Redshift and only has to open the 1MB block on each slice once. However, there are still performance and scalability issues with this approach (common to approach #1, but not as bad). All the data is being sent through the query compiler and then on to every slice. This can slow down compile time and waste network bandwidth. All the data has to flow through the leader node, which is responsible for many database functions, and doing this for large amounts of data can lead to significant cluster-wide performance issues. The amount of data you can insert in this manner is limited by the size (in characters) of the max query length (16MB). There is more, but I'll stop there. While this approach is better from Redshift's point of view, it is far from ideal.
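If you do stay with multi-row INSERTs, one way to respect that 16MB limit is to flush each statement once it approaches a character budget. Below is a minimal sketch of that idea, not the original poster's code - it assumes the Record class and open connection from the question, and the 15,000,000-character budget and method name are illustrative:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

// Sketch: accumulate rows into one multi-row INSERT and flush before the
// statement text approaches Redshift's 16MB query-length limit.
void insertChunked(Connection connection, List<Record> records) throws SQLException {
    final int maxSqlChars = 15_000_000;  // illustrative budget, safely under 16MB
    final String prefix = "INSERT INTO table (id, name, value) VALUES ";
    StringBuilder sql = new StringBuilder(prefix);
    boolean hasRows = false;
    try (Statement st = connection.createStatement()) {
        for (Record r : records) {
            String row = "(" + r.id + ",'" + r.name.replace("'", "''") + "'," + r.value + ")";
            if (hasRows && sql.length() + row.length() + 1 > maxSqlChars) {
                st.executeUpdate(sql.toString());   // flush the full chunk
                sql = new StringBuilder(prefix);
                hasRows = false;
            }
            if (hasRows) {
                sql.append(',');
            }
            sql.append(row);
            hasRows = true;
        }
        if (hasRows) {
            st.executeUpdate(sql.toString());       // flush the remainder
        }
    }
}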

Bottom line - Postgres is a single instance (scale-up), row-based, OLTP database designed for single-row inserts, and Redshift is a clustered (scale-out), column-based, OLAP database designed for parallel bulk inserts. The COPY command causes each compute node in Redshift to connect to S3 to read component files of the input data. This allows for parallel actions by the Redshift cluster, independent network access to S3, and parallel processing of the read data. (If you really want your code to run fast, make it multi-threaded, write your S3 files in parallel, then issue a COPY command to Redshift.)
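Once the files are staged in S3, the COPY itself is just another SQL statement issued over JDBC. A minimal sketch follows - the table name, bucket, key prefix, IAM role, and file format are placeholders you would substitute with your own:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch: one COPY loads all staged files in parallel across the compute nodes.
void copyFromS3(Connection connection) throws SQLException {
    String copySql =
        "COPY table FROM 's3://my-bucket/staging/records_' " +               // common key prefix of the staged files
        "IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role' " + // role with read access to the bucket
        "FORMAT AS CSV GZIP";                                                 // match how the files were written
    try (Statement st = connection.createStatement()) {
        st.execute(copySql);
    }
}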

If you want better performance from Redshift JDBC batch inserts, you can follow the code below.

public void testInsert(List<TestObject> testLst) {
    StringBuilder query = new StringBuilder("INSERT INTO table VALUES ");
    try (Connection connection = config_to.getConnection();
         Statement statement = connection.createStatement()) {
        for (int n = 0; n < testLst.size(); n++) {
            if (n == 0) {
                query.append(getSqlForm(testLst.get(n)));
            } else if (n % 200 != 0) {
                query.append(",").append(getSqlForm(testLst.get(n)));
            } else {
                // every 200 rows, close the current multi-row INSERT and start a new one
                statement.addBatch(query.toString());
                query = new StringBuilder("INSERT INTO table VALUES ");
                query.append(getSqlForm(testLst.get(n)));
            }

            // every 1000 rows, send the accumulated statements to the server
            if (n > 0 && n % 1000 == 0) {
                statement.executeBatch();
            }
        }
        // flush whatever is left
        statement.addBatch(query.toString());
        statement.executeBatch();
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}

private static String getSqlForm(TestObject obj) {
    String result = "('%s','%s',%s)";
    return String.format(result, obj.getGuid(), obj.getName(), obj.getId());
}

Managing the insertion of multiple rows in a single transaction is fairly challenging and requires some manual programming. Using Statement, you can create a sizable SQL statement and quickly insert your data. However, there is a concern here, since it uses a Statement rather than a PreparedStatement, which leaves the program vulnerable to SQL injection and somewhat lowers performance. This strategy only works for non-interactive programs.
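If injection is a concern, the same multi-row idea can be built on a PreparedStatement by generating the placeholders and binding the values, so user-supplied strings never get spliced into the SQL text. A minimal sketch under those assumptions - the column names, chunk size, and setter types are guesses based on the TestObject accessors above:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Sketch: multi-row INSERT with bound parameters instead of string concatenation.
void insertWithPlaceholders(Connection connection, List<TestObject> rows) throws SQLException {
    final int chunkSize = 200;  // rows per INSERT statement
    for (int start = 0; start < rows.size(); start += chunkSize) {
        List<TestObject> chunk = rows.subList(start, Math.min(start + chunkSize, rows.size()));
        // builds "INSERT INTO table (guid, name, id) VALUES (?,?,?),(?,?,?),..."
        StringBuilder sql = new StringBuilder("INSERT INTO table (guid, name, id) VALUES ");
        for (int i = 0; i < chunk.size(); i++) {
            sql.append(i == 0 ? "(?,?,?)" : ",(?,?,?)");
        }
        try (PreparedStatement ps = connection.prepareStatement(sql.toString())) {
            int p = 1;
            for (TestObject obj : chunk) {
                ps.setString(p++, obj.getGuid());
                ps.setString(p++, obj.getName());
                ps.setLong(p++, obj.getId());    // assumed numeric id
            }
            ps.executeUpdate();  // one round trip per chunk
        }
    }
}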
