JVM does not release memory for processed ResultSet objects

I need to write ~50 million rows fetched from a JDBC ResultSet to a CSV file.
1.5 million rows written to a CSV file amount to approximately 1 GB.

jdbcTemplate.query(new CustomPreparedStatementCreator(arg), new ResultSetExtractor<Void>() {
    @Override
    public Void extractData(ResultSet rs) throws SQLException {
        while (rs.next()) {
            // transform each row's data (involves creation of objects)
            // write the transformed strings to csv file
        }
        return null; // nothing to return; rows are written as they are read
    }
});

The problem is that I have a heap of 8 GB and it fills up pretty fast.
Hence I run into java.lang.OutOfMemoryError before I get to 10 million rows.
Another limitation I have is the query read/write timeout, which is set to 30 minutes.

What can I do to recycle and reuse the JVM heap memory?
Especially the memory allocated for objects that I don't need anymore.

I read that forcing GC to run does not guarantee memory will be reclaimed.

What are my options? Should I defer the responsibility to non-GC languages
like C or C++, via JNA or JNI, to process the ResultSet?

[EDIT] It seems I am in a tough spot :D Adding more info as pointed out by @rzwitserloot

  1. I am reading (SELECT queries only) data from a data-virtualization tool that is hooked to a data lake.
  2. The data-virtualization tool's JDBC driver does support LIMIT, but the queries are designed by the business to return huge volumes of data. So I've got one shot to pull the data and generate a CSV, meaning I cannot avoid the giant SELECT or add a LIMIT clause.
  3. I need to check these properties: resultSetType, resultSetConcurrency, resultSetHoldability.

What I have already done:

First, I used a Producer-Consumer pattern to separate the JDBC fetch operations from the slow file write operations. This helped create CSV files containing 1-5 million rows before the 30-minute timeout.

Second, I increased the number of consumer threads and had each of them write to its own separate part-file, to be merged later into a single CSV file. This sped up the file writes and produced a CSV file containing 10-20 million rows before the 30-minute timeout.

I am creating objects inside the ResultSetExtractor and passing them to the consumer threads via a bounded queue. These objects are not needed once their data has been written to the file.
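For reference, the bounded-queue handoff described above can be sketched roughly as follows. This is only an illustration under assumptions, not the asker's actual code: the class name BoundedHandoff, the POISON_PILL end marker, and the use of plain strings as rows are all hypothetical. The point it shows is that put() blocks once the queue is full, so the number of row objects waiting to be written never exceeds the queue capacity.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class BoundedHandoff {
    // Hypothetical end-of-data marker; real rows are assumed never to equal it.
    private static final String POISON_PILL = "\u0000END-OF-ROWS";

    /** Pushes rowCount rows through a queue of the given capacity; returns how many were consumed. */
    static int run(int rowCount, int capacity) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        int[] consumed = {0};

        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String row = queue.take();       // blocks while the queue is empty
                    if (POISON_PILL.equals(row)) {
                        return;                      // producer signalled that it is done
                    }
                    consumed[0]++;                   // stand-in for writing the row to a part-file
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        try {
            for (int i = 0; i < rowCount; i++) {
                queue.put("row-" + i);               // blocks while the queue is full
            }
            queue.put(POISON_PILL);
            consumer.join();                         // join makes consumed[0] safely visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return consumed[0];
    }
}
```

With a handoff like this, a row becomes unreachable (and therefore collectable) as soon as the consumer has written it, provided nothing else keeps a reference to it.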

You've pasted very little code; one of the key clues is that, by design, the code you pasted has no memory issues. ResultSet is intentionally designed to be cursor-esque, meaning that in theory every .next() call results in TCP/IP traffic asking the DB to fetch another row. This is why result sets need to be closed: the database is maintaining a separate 'version' so that, assuming you're using serializable or some other clean-reads isolation level, any other transaction that was started (or rather, wasn't yet committed) when you opened the one you are in* doesn't have any effect on the data you are witnessing as you go through the .next() calls.

Now, the JDBC API is also quite flexible. A round trip per row is a lot of packets, traffic, and work, so in practice many JDBC drivers will either just send all the data at once (in which case closing the result set does nothing), or will at least send things in larger batches, so that most .next() calls result in no DB traffic, except for every 100th call or so.

Thus, there are 2 major possibilities here:

  1. The memory leak has nothing whatsoever to do with what you pasted; for example, you're writing your CSV data into an ever-growing buffer and you're not streaming it to disk at all. Triple check this. Add a LIMIT clause to your giant SELECT and put a giant for loop around it to simulate writing a ton of records without actually querying much through JDBC. If that still runs out of memory, it's not your DB layer.

  2. The JDBC driver is nevertheless backing its ResultSet implementation with something that continually takes up memory.
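To check the first possibility in isolation, a small harness along the following lines could help. It is a minimal sketch under assumptions: the class name CsvStreamCheck, the synthetic row contents, and the temp-file self-check are hypothetical and are not taken from the question. It writes rows straight through a BufferedWriter, so only one row plus the writer's fixed-size buffer is held in memory at a time.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

final class CsvStreamCheck {
    /** Streams rowCount synthetic rows to target, one row at a time; returns the number written. */
    static long writeRows(Path target, long rowCount) {
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (long i = 0; i < rowCount; i++) {
                // Stand-in for transforming one ResultSet row into a CSV line.
                out.write(i + ",name-" + i + ",value-" + (i * 31));
                out.newLine();
            }
            return rowCount;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes rowCount rows to a temporary file and returns how many lines ended up on disk. */
    static long selfCheck(long rowCount) {
        try {
            Path tmp = Files.createTempFile("rows", ".csv");
            try {
                writeRows(tmp, rowCount);
                try (Stream<String> lines = Files.lines(tmp, StandardCharsets.UTF_8)) {
                    return lines.count();   // counts lines lazily, without loading the file
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If a loop like this runs through tens of millions of rows with a flat heap while the real export does not, the growth is coming from somewhere other than the file-writing code.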

If it is #2, then you have 2 solutions:

  1. Make the DB engine not do that. ResultSets have 'features', and you ask for the feature(s) you need when you create them. For example, you can tell the system you want the result set to be so-called 'forward only'. The 3 properties most likely to result in non-memory-chewing ResultSets are resultSetType = ResultSet.TYPE_FORWARD_ONLY, resultSetConcurrency = ResultSet.CONCUR_READ_ONLY, and resultSetHoldability = ResultSet.CLOSE_CURSORS_AT_COMMIT. I don't actually know how to tell JdbcTemplate to do this, but it shouldn't be too difficult - JdbcTemplate ends up calling java.sql.Connection's prepareStatement method - make sure it calls the overload where you set all those properties to those values.

  2. If that doesn't work, work around it, using OFFSET/LIMIT (the syntax for this depends on your DB engine, unfortunately) to fetch one page at a time. Of course, if the table is being edited while you are doing this, that's going to mess with your results unless you have the serializable transaction isolation level set up, and you MUST add an ORDER BY clause of some sort, or you get no actual guarantee that rows are returned in the same order from query to query (and without that, OFFSET/LIMIT paging isn't going to do what you want). Needing this workaround would be a bit bizarre, though - what kind of exotic, third-rate, badly written DB engine and/or JDBC driver are you using, if this is happening to you?
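The first solution can be sketched in plain JDBC as follows. This is an illustration only: the class name ForwardOnlyStatements is hypothetical, and whether a given driver honours these settings (or supports this combination at all) depends on the driver. In the question's setup, the createPreparedStatement(Connection) method of the custom PreparedStatementCreator would presumably be the place to make this call.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

final class ForwardOnlyStatements {
    /**
     * Prepares sql so that its ResultSet is forward-only, read-only,
     * and closed at commit. These are requests to the driver; a driver
     * may reject or silently downgrade a combination it does not support.
     */
    static PreparedStatement prepare(Connection con, String sql) throws SQLException {
        return con.prepareStatement(
                sql,
                ResultSet.TYPE_FORWARD_ONLY,          // resultSetType
                ResultSet.CONCUR_READ_ONLY,           // resultSetConcurrency
                ResultSet.CLOSE_CURSORS_AT_COMMIT);   // resultSetHoldability
    }
}
```

The statement returned here is used exactly like any other PreparedStatement; the only difference is the overload of prepareStatement that creates it.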

*) "But, I'm not using transactions!" - yes, you are. 'auto commit = true' is what is commonly known as 'no transactions', but that name is misleading; it's simply 1 transaction per SQL statement you send. The only truly 'no transactions' modes are things your DB explicitly says exist outside of transactions, not-actually-safe DBs like MySQL with the MyISAM table type (which isn't really a DB in the first place), or intentionally lenient isolation levels.
