
How can I optimize Postgres insert/update requests for a huge amount of data?

I'm working on a pathfinding project that uses topographic data of huge areas. In order to reduce the huge memory load, my plan is to pre-process the map data by creating nodes that are saved in a Postgres DB on start-up, and then accessed as needed by the algorithm.

I've created 3 docker containers for that: the postgres DB, Adminer and my python app. It works as expected with a small amount of data, so the communication between the containers or the application isn't the problem.

The way it works is that you give it a 2D array; it takes the first row, converts each element into a node and saves it in the DB using psycopg2.extras.execute_values before going to the second row, then the third... Once all nodes are registered, it updates each of them by searching for their neighbors and adding their ids in the right columns. That way the pre-processing takes longer, but I have easier access when running the algorithm.
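For reference, here is a minimal sketch of that per-row insert with execute_values. The table name, columns (nodes, x, y, elevation) and connection settings are assumptions for illustration; the real schema and filtering logic may differ:

```python
import psycopg2
from psycopg2.extras import execute_values

# Hypothetical connection settings; adjust to the actual container setup.
conn = psycopg2.connect("dbname=pathfinding user=postgres password=secret host=db")

def insert_row(cur, row_index, row):
    # One tuple per valid cell of this row of the 2D array;
    # None stands in for useless/invalid data and is skipped.
    values = [
        (row_index, col_index, elevation)
        for col_index, elevation in enumerate(row)
        if elevation is not None
    ]
    if values:
        execute_values(
            cur,
            "INSERT INTO nodes (y, x, elevation) VALUES %s",
            values,
            page_size=1000,  # rows sent per statement
        )

with conn, conn.cursor() as cur:
    grid = [[10.0, 12.5, None], [11.0, 13.0, 14.2]]  # stand-in for the real .tif data
    for y, row in enumerate(grid):
        insert_row(cur, y, row)
```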

However, I think the DB has trouble processing the data past a certain point. The map I gave it comes from a .tif file of 9600x14400 pixels (about 138 million cells), and even when ignoring useless/invalid data, that amounts to more than 10 million nodes.

Basically, it worked quite slowly but okay, until around 90% of the node creation process, where the data stopped being processed. Both the python and postgres containers were still running and responsive, but no more nodes were being created, and the neighbor-linking part of the pre-processing didn't start either. There was no error message on either side.

I've read that the row limit in a postgres table is absurdly high, but the table also becomes really slow once a lot of elements are in it, so could it be that it didn't crash or freeze, but just takes an insane amount of time to complete the remaining node creation requests?

Would reducing the batch size even more help in that regard? Or would splitting the table into multiple smaller ones be better?

My queries and the psycopg functions I was using were not optimized for the mass inserts and updates I was doing.

The changes I made were:

  • Reducing the batch size from 14k to 1k
  • Making larger SELECT queries instead of many smaller ones
  • Creating indexes on the important columns
  • Changing a plain UPDATE query to the UPDATE ... FROM format, also driven by execute_values instead of cursor.execute (see the sketch after this list)
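Below is a minimal sketch of that last change, assuming a hypothetical nodes(id, x, y, north_id) schema: neighbor ids computed in Python are written back in batches through a single UPDATE ... FROM (VALUES %s) driven by execute_values, with an index on the coordinate columns (my assumption for the "important columns") to keep the neighbor lookups fast.

```python
from psycopg2.extras import execute_values

def create_indexes(cur):
    # Index on the coordinate columns used by the neighbor SELECTs
    # (column names are assumptions for illustration).
    cur.execute("CREATE INDEX IF NOT EXISTS nodes_xy_idx ON nodes (x, y)")

def link_neighbors(cur, pairs):
    # pairs: list of (node_id, north_neighbor_id) tuples computed in Python.
    # A single statement updates the whole batch instead of issuing
    # one cursor.execute() per node.
    execute_values(
        cur,
        """
        UPDATE nodes AS n
        SET north_id = v.north_id
        FROM (VALUES %s) AS v (id, north_id)
        WHERE n.id = v.id
        """,
        pairs,
        page_size=1000,  # matches the reduced 1k batch size
    )
```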

It made the execution time go from an estimated 5.5 days to around 8 hours.
