
Processing large amount of data from PostgreSQL

I am looking for a way to process, in a reasonable time, a large amount of data loaded from the database.

The problem I am facing is that I have to read all the data from the database (currently around 30M rows) and then process it in Java. The processing itself is not the problem, but fetching the data from the database is. The fetch generally takes 1-2 minutes; however, I need it to be much faster than that. I am loading the data from the database straight into DTOs using the following query:

select id, id_post, id_comment, col_a, col_b from post_comment

Here id is the primary key, id_post and id_comment are foreign keys to their respective tables, and col_a and col_b are columns of small int data types. The foreign-key columns are indexed. The tools I am currently using for the job are Java, Spring Boot, Hibernate and PostgreSQL.

So far the only options that came to my mind are:

  1. Ditch Hibernate for this query and try to use a plain JDBC connection, hoping that it will be faster.
  2. Completely rewrite the processing algorithm from Java into an SQL procedure.

Did I miss something, or are these my only options? I am open to any ideas. Note that I only need to read the data, not change them in any way.

EDIT: The EXPLAIN ANALYZE output of the query used:

"Seq Scan on post_comment (cost=0.00..397818.16 rows=21809216 width=28) (actual time=0.044..6287.066 rows=21812469 loops=1), Planning Time: 0.124 ms, Execution Time: 8237.090 ms"

Do you need to process all rows at once, or can you process them one at a time?

If you can process them one at a time, you should try using a scrollable result set.

org.hibernate.Query query = ...;
query.setReadOnly(true);
ScrollableResults sr = query.scroll(ScrollMode.FORWARD_ONLY);

while(sr.next())
{
    MyClass myObject = (MyClass)sr.get()[0];
    // process the row for myObject here
}

This will still keep every object in the entity manager, and so will get progressively slower and slower. To avoid that issue, you can detach each object from the entity manager once you are done with it. This can only be done if the objects are not modified; if they are modified, the changes will NOT be persisted.

org.hibernate.Query query = ...;
query.setReadOnly(true);
ScrollableResults sr = query.scroll(ScrollMode.FORWARD_ONLY);

while(sr.next())
{
    MyClass myObject = (MyClass)sr.get()[0];
    // process the row for myObject here
    entityManager.detach(myObject);
}

If I were in your shoes, I would definitely bypass Hibernate and go directly to JDBC for this query. Hibernate is not made for dealing with large result sets, and it adds overhead for benefits that do not apply to cases like this one.

When you use JDBC, do not forget to set autocommit to false and to set a large fetch size (on the order of thousands), or else Postgres will first fetch all 21 million rows into memory before starting to yield them to you. (See https://stackoverflow.com/a/10959288/773113 )
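A minimal sketch of what that could look like with plain JDBC, assuming a javax.sql.DataSource configured elsewhere (for example by Spring Boot); the table and column names come from the question, everything else here is illustrative:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import javax.sql.DataSource;

public class PostCommentReader {

    private final DataSource dataSource; // assumed to be configured elsewhere

    public PostCommentReader(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public void processAll() throws Exception {
        try (Connection con = dataSource.getConnection()) {
            // Without this, the PostgreSQL JDBC driver buffers the whole result set in memory.
            con.setAutoCommit(false);
            try (PreparedStatement ps = con.prepareStatement(
                    "select id, id_post, id_comment, col_a, col_b from post_comment")) {
                // Stream rows from the server in chunks of a few thousand.
                ps.setFetchSize(5000);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        long id = rs.getLong("id");
                        long idPost = rs.getLong("id_post");
                        long idComment = rs.getLong("id_comment");
                        int colA = rs.getInt("col_a");
                        int colB = rs.getInt("col_b");
                        // process the row here
                    }
                }
            }
        }
    }
}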

Since you asked for ideas, I have seen this problem solved with the options below, depending on how they fit your environment:

  1. First try plain JDBC and Java: the code is simple, and you can do a test run against your database and data to see whether the improvement is enough. You will have to give up some of the other benefits of Hibernate here.
  2. Building on point 1, use multi-threading with multiple connections pulling data into one queue, then consume that queue to process or print the rows as you need (see the sketch after this list). You may also consider Kafka.
  3. If the data is going to keep growing, you can consider Spark, which can keep it all in memory and will be much faster.
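A rough sketch of option 2, using a BlockingQueue as the hand-off between reading and processing; the Row record, queue capacity and worker count are made-up values for illustration only:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class QueuedProcessing {

    // Hypothetical row holder; in practice this would mirror the DTO from the question.
    record Row(long id, long idPost, long idComment, int colA, int colB) {}

    // Sentinel that tells the workers there is nothing more to read.
    private static final Row POISON_PILL = new Row(-1, -1, -1, 0, 0);

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Row> queue = new ArrayBlockingQueue<>(10_000);
        int workers = 4;
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        // Worker threads: take rows off the queue and process them.
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        Row row = queue.take();
                        if (row == POISON_PILL) {
                            queue.put(POISON_PILL); // let the other workers see it too
                            return;
                        }
                        // process the row here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Reader: in real code this loop would iterate over a JDBC ResultSet
        // (one connection per reader thread) and put each row on the queue.
        for (long id = 0; id < 100; id++) {
            queue.put(new Row(id, id, id, 1, 2));
        }
        queue.put(POISON_PILL);

        pool.shutdown();
    }
}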

These are some of the options; please upvote if these ideas help you anywhere.

Why do you keep 30M rows in memory? It is better to rewrite this as pure SQL and use pagination based on id.

Suppose you are given 5 as the id of the last row read; you would then issue:

select id, id_post, id_comment, col_a, col_b from post_comment where id > 5 order by id limit 20

If you need to process the entire table, put the task in a cron job and process it in parts there as well; keeping and downloading all 30M rows at once is very expensive in memory, so work through the table in chunks: rows 0-20, then the next 20, and so on.
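A sketch of that keyset-pagination loop in plain JDBC; the batch size and method names here are illustrative, not from the original answer:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class KeysetPager {

    // Reads post_comment in batches, always continuing after the last id seen.
    public void processInBatches(Connection con, int batchSize) throws Exception {
        long lastId = 0;
        while (true) {
            int rowsInBatch = 0;
            try (PreparedStatement ps = con.prepareStatement(
                    "select id, id_post, id_comment, col_a, col_b " +
                    "from post_comment where id > ? order by id limit ?")) {
                ps.setLong(1, lastId);
                ps.setInt(2, batchSize);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        lastId = rs.getLong("id");
                        // process the row here
                        rowsInBatch++;
                    }
                }
            }
            if (rowsInBatch < batchSize) {
                break; // last page reached
            }
        }
    }
}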
