Java GZIPOutputStream seems to allocate unnecessary byte arrays?

Question

I have an application that processes text files and stores them, and to save some space it gzips the files. So I have some chained OutputStreams, and one of them is java.util.zip.GZIPOutputStream, which manages the compression.

To make sure I was not wasting memory somewhere, I profiled my process with the async profiler in IntelliJ, running a small loop a number of times over a file containing around 6MB of random data. For reference, I'm using Temurin JDK 18.

I was surprised to see a lot of memory allocations attributed to GZIPOutputStream (via the parent method): 1,601,318,568 samples

That's a bit strange. I know GZIPOutputStream / DeflaterOutputStream uses a buffer, but why is it doing so many allocations? I looked deeper into the code and noticed that the parent method in java.util.zip.DeflaterOutputStream does this when it writes a byte:

    public void write(int b) throws IOException {
        byte[] buf = new byte[1];
        buf[0] = (byte)(b & 0xff);
        write(buf, 0, 1);
    }

So it makes a new single-byte array for every single byte? That definitely seems like a lot of allocations. To see if it makes a difference, I extended GZIPOutputStream with a new class I called LowAllocGzipOutputStream, with an overriding method like this:

    private final byte[] singleByteBuff = new byte[1];

    @Override
    public void write(int b) throws IOException {
        singleByteBuff[0] = (byte)(b & 0xff);
        write(singleByteBuff, 0, 1);
    }
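
For context, here is a minimal sketch of what the complete subclass might look like, together with a round trip that checks the output still decompresses correctly. The constructor and the LowAllocDemo helper are assumptions added for illustration; only the field and the write(int) override come from the snippet above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Sketch of the subclass from the question; not part of the JDK. */
class LowAllocGzipOutputStream extends GZIPOutputStream {
    // Reused for every single-byte write instead of allocating a new array.
    // Because it is shared, concurrent write(int) calls on one instance could race.
    private final byte[] singleByteBuff = new byte[1];

    LowAllocGzipOutputStream(OutputStream out) throws IOException {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        singleByteBuff[0] = (byte) (b & 0xff);
        write(singleByteBuff, 0, 1);
    }
}

/** Hypothetical helper used only to exercise the class above. */
class LowAllocDemo {
    /** Compresses data one byte at a time, then decompresses it again. */
    static byte[] roundTrip(byte[] data) throws IOException {
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (OutputStream gz = new LowAllocGzipOutputStream(compressed)) {
            for (byte b : data) {
                gz.write(b);
            }
        }
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(compressed.toByteArray()))) {
            return in.readAllBytes();
        }
    }
}
```

The shared buffer is the one trade-off I can see in this sketch: the stock implementation gives every call its own array, while this version assumes a single writer per stream.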

I then profiled it again with my test case to see what would happen. The data was quite different: 162,262,880 samples

That is a pretty big reduction in allocations: 1,439,055,688 fewer samples.

So I'm left with a few questions that I haven't found answers for:

Why does GZIPOutputStream / DeflaterOutputStream allocate byte[]s like this? This is a class that comes with the JDK, so I'm sure it has been profiled and scrutinized heavily, but to my naive understanding it appears to be unnecessarily wasteful. Does the single-byte array eventually get optimized away by HotSpot or something? Does it not really add pressure to the garbage collector?
Is there a negative consequence to my cached singleByteBuff approach? So far I can't think of any issue it would cause. The benefit I see is that my app's memory profile is no longer dominated by DeflaterOutputStream byte[] allocations.

Answer 1

Having spent more time digging into streams, I will attempt to answer my own question, with a bit of guesswork:

From what I can measure, there's basically only one sane way to call an OutputStream if you care at all about performance, and it's this method:

    public void write(byte[] b, int off, int len)

There are several reasons for this:

1. Handling more bytes at a time can be more efficient for things like gzip
2. Reusing buffers where you can saves memory
3. Fewer function calls

Reason #3 is less obvious to an average Java developer. Normally you don't think about function calls that much. But if you're processing a billion bytes one byte at a time, that adds up! Function calls, little bits of work, and so on: all of those things that you could simply be doing less of add up needlessly.
In the original design for my app I thought I could treat OutputStreams like a state machine, where each input was a single byte. That is probably just the wrong way to think about data streams when buffers are involved.
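
To make the function-call point concrete, here is a rough sketch that counts how many times the array-based write is reached for the same data written byte by byte versus in one call. CountingGzipOutputStream and WriteCallComparison are hypothetical helpers invented for this illustration, not JDK classes, and the counts rely on write(int) delegating to write(byte[], int, int) as shown in the question.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper that counts how often the array-based write is invoked. */
class CountingGzipOutputStream extends GZIPOutputStream {
    long arrayWrites = 0;

    CountingGzipOutputStream() throws IOException {
        super(new ByteArrayOutputStream());
    }

    @Override
    public synchronized void write(byte[] buf, int off, int len) throws IOException {
        arrayWrites++;
        super.write(buf, off, len);
    }
}

class WriteCallComparison {
    /** Every write(int) ends up as its own write(byte[], 0, 1) call. */
    static long callsWhenWritingByteByByte(byte[] data) throws IOException {
        try (CountingGzipOutputStream gz = new CountingGzipOutputStream()) {
            for (byte b : data) {
                gz.write(b);
            }
            return gz.arrayWrites;
        }
    }

    /** The whole array goes through a single call. */
    static long callsWhenWritingInBulk(byte[] data) throws IOException {
        try (CountingGzipOutputStream gz = new CountingGzipOutputStream()) {
            gz.write(data, 0, data.length);
            return gz.arrayWrites;
        }
    }
}
```

For an array of n bytes the first method reports n calls and the second reports one, and each of those n calls also repeats the per-call bookkeeping inside the deflater.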

The only method you have to implement to make an OutputStream work is this:

    public abstract void write(int b)

Sometimes you really do need to write one byte, and this method is convenient for that. However, its convenience and simplicity are a trap. It's there for use, but if you care about performance you shouldn't rely on it. If you use it in production, far too much needless work will happen.

This is where I think the reasoning behind GZIPOutputStream's handling of this method comes from. If you've ever implemented an interesting OutputStream, it quickly becomes obvious that you really want all of your logic to flow into one of the methods. But if you care about performance at all, you'd never choose write(int b) for that; it would be crazy given how poorly it scales. So these simpler methods are implemented without much care. And if you are just writing one byte a few times, a few array allocations are inconsequential.
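
As an illustration of funnelling all logic into one method, here is a small sketch of a custom stream. LineCountingOutputStream and LineCountingDemo are hypothetical classes made up for this example; the real work lives in write(byte[], int, int), and write(int) is just a thin convenience wrapper, much like the JDK method quoted in the question.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Objects;

/** Hypothetical stream that counts '\n' bytes as they pass through. */
class LineCountingOutputStream extends FilterOutputStream {
    private long lines = 0;

    LineCountingOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        // Convenience path only: funnel into the array-based method.
        write(new byte[] {(byte) b}, 0, 1);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        Objects.checkFromIndexSize(off, len, b.length);
        // All of the stream's real logic lives here.
        for (int i = off; i < off + len; i++) {
            if (b[i] == '\n') {
                lines++;
            }
        }
        // Pass the whole range on in one call instead of byte by byte.
        out.write(b, off, len);
    }

    long lines() {
        return lines;
    }
}

class LineCountingDemo {
    /** Writes data in bulk, then one extra newline through the single-byte path. */
    static long countLines(byte[] data) throws IOException {
        try (LineCountingOutputStream s =
                 new LineCountingOutputStream(new ByteArrayOutputStream())) {
            s.write(data, 0, data.length);
            s.write('\n');
            return s.lines();
        }
    }
}
```

Written this way, the single-byte path costs an extra allocation per call, which is exactly the trade-off discussed above: simple to get right, and fine as long as it is not the hot path.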

In my question's example, I made a more efficient version of GZIPOutputStream's write(int b) by adding a single-byte buffer. And, as far as I can tell, it is more efficient! However, if you actually want your code to run efficiently, you'd still never use this method, no matter how optimized it could be. Your program would still be doing too much unnecessary work.
This is where I think the thinking behind the design comes from. write(int b) is there just so you can technically implement OutputStream and allow a single-byte write, but it's something you should almost always avoid, so why optimize an inherently flawed method?
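
If the calling code really is structured around single-byte writes, one possible mitigation (a sketch, not something I profiled for the original question) is to put a java.io.BufferedOutputStream in front of the gzip stream so that bytes are batched before they reach it. BufferedGzipDemo and its nested CountingGzip class are hypothetical helpers used only to observe how many array writes arrive; the 8192-byte buffer size is an arbitrary choice.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

class BufferedGzipDemo {
    /** Hypothetical helper: counts array-based writes that reach the gzip stream. */
    static final class CountingGzip extends GZIPOutputStream {
        long arrayWrites = 0;

        CountingGzip(OutputStream out) throws IOException {
            super(out);
        }

        @Override
        public synchronized void write(byte[] buf, int off, int len) throws IOException {
            arrayWrites++;
            super.write(buf, off, len);
        }
    }

    /** Writes data one byte at a time through a buffer placed in front of gzip. */
    static long gzipWritesWithBuffer(byte[] data) throws IOException {
        CountingGzip gz = new CountingGzip(new ByteArrayOutputStream());
        try (OutputStream out = new BufferedOutputStream(gz, 8192)) {
            for (byte b : data) {
                out.write(b); // stored in the buffer, forwarded later in chunks
            }
        }
        return gz.arrayWrites;
    }
}
```

The caller still pays for one method call per byte, but the gzip stream itself only sees a handful of chunk-sized writes, so the per-byte array allocation in DeflaterOutputStream.write(int) is never reached.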

That said, a bit of javadoc on any of these methods could have gone a long way toward educating me here.
