How to efficiently load data from CSV into Database?

I have a CSV/TSV file and want to load its data into a database. I am using Java or Python with PostgreSQL (I can't change that).

The problem is that I issue one INSERT query per row, which is not very efficient when there are, say, 600,000 rows. Is there a more efficient way to do it?

I was wondering whether I could combine many rows into one big query and execute that on my database, but I'm not sure whether it helps at all. Or should I split the data into, say, 100 pieces and execute 100 queries?

If the CSV file is compatible with the format required by copy from stdin, then the most efficient way is to use the CopyManager API.

See this answer or this answer for example code.
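As a rough sketch of what the statement handed to CopyManager might look like, the helper below only builds the COPY ... FROM STDIN text for a CSV or TSV file. The class name CopySql and the table people with its columns are made up for illustration, and table and column names are assumed to be trusted identifiers. With the PostgreSQL JDBC driver on the classpath, the resulting string would be passed to CopyManager.copyIn(String, Reader) together with a reader over the file.

```java
import java.util.List;

final class CopySql {
    // Builds a "COPY ... FROM STDIN" statement for CSV-style input.
    // Table and column names are inserted as-is, so they must be trusted.
    static String copyFromStdin(String table, List<String> columns,
                                char delimiter, boolean header) {
        // Reject characters that cannot be written safely as a one-character delimiter here.
        if (delimiter == '\'' || delimiter == '"' || delimiter == '\\'
                || delimiter == '\n' || delimiter == '\r') {
            throw new IllegalArgumentException("unsupported delimiter");
        }
        // A tab is written as a PostgreSQL escape string, anything else as a plain literal.
        String delimiterLiteral = (delimiter == '\t') ? "E'\\t'" : "'" + delimiter + "'";
        return "COPY " + table + " (" + String.join(", ", columns) + ")"
                + " FROM STDIN WITH (FORMAT csv, DELIMITER " + delimiterLiteral
                + ", HEADER " + header + ")";
    }

    public static void main(String[] args) {
        // Hypothetical table and columns, TSV input with a header line.
        System.out.println(copyFromStdin("people", List.of("id", "name"), '\t', true));
    }
}
```

Whether the file really is "compatible" depends on its quoting and escaping rules matching what the csv format of COPY expects, so it is worth trying this on a small sample first.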


If your input file isn't compatible with Postgres' copy command, you will need to write the INSERT yourself. But you can speed up the process by using JDBC batching:

Something along these lines:

PreparedStatement insert = connection.prepareStatement("insert into ...");
int batchSize = 1000;
int batchRow = 0;
// iterate over the lines from the file
while (...) {
   ... parse the line, extract the columns ...
   insert.setInt(1, ...);
   insert.setString(2, ...);
   insert.setXXX(...);
   // queue the row instead of executing it immediately
   insert.addBatch();
   batchRow++;
   // send the queued rows to the server once the batch is full
   if (batchRow == batchSize) {
     insert.executeBatch();
     batchRow = 0;
   }
}
// send the remaining rows of the last, partial batch
insert.executeBatch();
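For a more concrete picture, here is a minimal, self-contained sketch of the same batching loop. It assumes a tab-separated input without quoting and a hypothetical table people(id, name); the class name BatchLoader and its method are made up for illustration, so adapt the parsing and the setXXX calls to your own columns.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

final class BatchLoader {
    // Reads tab-separated "id<TAB>name" lines and inserts them in batches.
    // Returns the number of rows that were queued for insertion.
    static long load(Connection connection, BufferedReader lines, int batchSize)
            throws IOException, SQLException {
        // Hypothetical target table; replace with your own statement.
        String sql = "insert into people (id, name) values (?, ?)";
        long total = 0;
        try (PreparedStatement insert = connection.prepareStatement(sql)) {
            int pending = 0;
            String line;
            while ((line = lines.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                String[] columns = line.split("\t", -1);
                insert.setInt(1, Integer.parseInt(columns[0]));
                insert.setString(2, columns[1]);
                insert.addBatch();
                total++;
                if (++pending == batchSize) {
                    insert.executeBatch(); // flush a full batch
                    pending = 0;
                }
            }
            if (pending > 0) {
                insert.executeBatch(); // flush the last, partial batch
            }
        }
        return total;
    }
}
```

Turning off auto-commit on the connection and committing once at the end (or every few batches) usually helps as well, since otherwise every batch is its own transaction.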

Using reWriteBatchedInserts=true in your JDBC URL will improve performance even more.
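To show where that parameter goes, here is a small sketch that assembles such a URL. The helper name and the host, port and database values are placeholders, not anything from your setup. As far as I know, the driver then combines the batched single-row inserts into multi-row inserts behind the scenes, so the batching code itself stays the same.

```java
final class PgUrl {
    // Builds a PostgreSQL JDBC URL with batched-insert rewriting switched on.
    static String withBatchRewrite(String host, int port, String database) {
        return "jdbc:postgresql://" + host + ":" + port + "/" + database
                + "?reWriteBatchedInserts=true";
    }

    public static void main(String[] args) {
        // Placeholder connection details for illustration only.
        System.out.println(withBatchRewrite("localhost", 5432, "mydb"));
    }
}
```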

Assuming the server can access the file directly, you could try using the COPY FROM command. If your CSV is not in the right format, it might still be faster to convert it to something the COPY command will handle (e.g. while copying it to a location that the server can access).
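As an illustration of such a conversion, the sketch below rewrites one tab-separated line as a CSV line with the usual double-quote quoting. It assumes the input fields contain no tabs and use no quoting or escaping of their own; the class and method names are made up for this example.

```java
import java.util.StringJoiner;

final class CsvTranscoder {
    // Quotes a field only when it contains a character that is special in CSV.
    // An empty field is left unquoted, which COPY's csv format reads as NULL by default.
    static String toCsvField(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            // Embedded double quotes are escaped by doubling them.
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Converts one tab-separated line into one comma-separated line.
    static String tsvLineToCsv(String line) {
        StringJoiner csv = new StringJoiner(",");
        for (String field : line.split("\t", -1)) {
            csv.add(toCsvField(field));
        }
        return csv.toString();
    }

    public static void main(String[] args) {
        System.out.println(tsvLineToCsv("1\tSmith, John\tsaid \"hi\""));
    }
}
```

Applied line by line while the file is copied to the server, this avoids a separate pass over the data.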

 