Direct ByteBuffer relative vs absolute read performance

While testing the read performance of a direct java.nio.ByteBuffer, I noticed that absolute reads are on average about 2x faster than relative reads. When I compare the source code of the relative and absolute reads, it is pretty much the same, except that the relative read maintains an internal counter (the buffer's position). Why do I see such a considerable difference in speed?
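For readers unfamiliar with the two access styles, here is a minimal sketch of the difference. The class and method names are made up for illustration; only the ByteBuffer calls come from the JDK.

```java
import java.nio.ByteBuffer;

final class RelativeVsAbsoluteDemo {

    /** Relative read: uses the buffer's current position and advances it by 8 bytes. */
    static long readRelative(ByteBuffer buf) {
        return buf.getLong();
    }

    /** Absolute read: uses the given byte index and leaves the position untouched. */
    static long readAbsolute(ByteBuffer buf, int index) {
        return buf.getLong(index);
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(16);
        buf.putLong(42L).putLong(7L);
        buf.rewind(); // position back to 0 before reading

        // Absolute read of the second long; position stays at 0.
        System.out.println(readAbsolute(buf, 8) + " position=" + buf.position());
        // Relative read of the first long; position moves to 8.
        System.out.println(readRelative(buf) + " position=" + buf.position());
    }
}
```

The only observable difference is the position bookkeeping, which is the "internal counter" mentioned above.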

Below is the source code of my JMH benchmark:

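import java.nio.ByteBuffer;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class DirectByteBufferReadBenchmark {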

    private static final int OBJ_SIZE = 8 + 4 + 1;
    private static final int NUM_ELEM = 10_000_000;

    @State(Scope.Benchmark)
    public static class Data {

        private ByteBuffer directByteBuffer;

        @Setup
        public void setup() {
            directByteBuffer = ByteBuffer.allocateDirect(OBJ_SIZE * NUM_ELEM);
            for (int i = 0; i < NUM_ELEM; i++) {
                directByteBuffer.putLong(i);
                directByteBuffer.putInt(i);
                directByteBuffer.put((byte) (i & 1));
            }
        }
    }



    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.SECONDS)
    public long testReadAbsolute(Data d) throws InterruptedException {
        long val = 0L;
        for (int i = 0; i < NUM_ELEM; i++) {
            int index = OBJ_SIZE * i;
            val += d.directByteBuffer.getLong(index);
            d.directByteBuffer.getInt(index + 8);
            d.directByteBuffer.get(index + 12);
        }
        return val;
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.SECONDS)
    public long testReadRelative(Data d) throws InterruptedException {
        d.directByteBuffer.rewind();

        long val = 0L;
        for (int i = 0; i < NUM_ELEM; i++) {
            val += d.directByteBuffer.getLong();
            d.directByteBuffer.getInt();
            d.directByteBuffer.get();
        }

        return val;
    }

    public static void main(String[] args) throws Exception {
        Options opt = new OptionsBuilder()
            .include(DirectByteBufferReadBenchmark.class.getSimpleName())
            .warmupIterations(5)
            .measurementIterations(5)
            .forks(3)
            .threads(1)
            .build();

        new Runner(opt).run();
    }
}

And these are the results of my benchmark run:

Benchmark                                        Mode  Cnt   Score   Error  Units
DirectByteBufferReadBenchmark.testReadAbsolute  thrpt   15  88.605 ± 9.276  ops/s
DirectByteBufferReadBenchmark.testReadRelative  thrpt   15  42.904 ± 3.018  ops/s

The test was run on a MacBook Pro (2.2 GHz Intel Core i7, 16 GB DDR3) with JDK 1.8.0_73.

UPDATE

I ran the same test with JDK 9-ea b134. Both tests show a ~10% speed increase, but the speed difference between the two remains similar.

# JMH 1.13 (released 45 days ago)
# VM version: JDK 9-ea, VM 9-ea+134
# VM invoker: /Library/Java/JavaVirtualMachines/jdk-9.jdk/Contents/Home/bin/java
# VM options: <none>


Benchmark                                        Mode  Cnt    Score    Error  Units
DirectByteBufferReadBenchmark.testReadAbsolute  thrpt   15  102.170 ± 10.199  ops/s
DirectByteBufferReadBenchmark.testReadRelative  thrpt   15   45.988 ±  3.896  ops/s

JDK 8 indeed generates worse code for the loop with relative ByteBuffer access.

JMH has a built-in perfasm profiler that prints the generated assembly code for the hottest regions. I've used it to compare the compiled testReadAbsolute vs. testReadRelative, and here are the main differences:

  1. The relative getLong / getInt / get calls update the position field of the ByteBuffer. The VM does not optimize these updates away: there are 3 memory writes on each loop iteration.

  2. The position range check is not eliminated: the conditional branches on each loop iteration remain in the compiled code.

  3. Since the redundant field updates and range checks make the loop body longer, the VM unrolls only 2 iterations of the loop. The compiled version of the loop with absolute access has 16 iterations unrolled.
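The first two differences can be pictured with a toy model. This is a simplified sketch and not the JDK source: ToyBuffer and its members are hypothetical names, and a plain long[] stands in for the off-heap memory.

```java
import java.nio.BufferUnderflowException;

final class ToyBuffer {
    private final long[] data; // stand-in for off-heap memory, one long per slot
    private final int limit;
    private int position;      // mutable state touched by every relative read

    ToyBuffer(long[] data) {
        this.data = data;
        this.limit = data.length;
    }

    /** Absolute read: range check on the argument, no state change. */
    long get(int index) {
        if (index < 0 || index >= limit) {
            throw new IndexOutOfBoundsException();
        }
        return data[index];
    }

    /** Relative read: range check on the position field, then a write back to it. */
    long get() {
        if (position >= limit) {
            throw new BufferUnderflowException();
        }
        int p = position;
        position = p + 1; // the kind of per-read field store described in item 1
        return data[p];
    }

    int position() {
        return position;
    }

    void rewind() {
        position = 0;
    }
}
```

In the absolute variant the index comes from the loop counter, so the compiler can reason about its range; in the relative variant both the check and the store go through a field of the buffer object, which is what the JDK 8 JIT left in the loop in this benchmark.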

testReadAbsolute is compiled very well: the main loop just reads 16 longs, sums them up and jumps to the next iteration if index < 10_000_000 - 16. The state of directByteBuffer is not updated. However, the JVM is not that smart for testReadRelative: it seems it cannot optimize field access of an object from outside.

There was much work in JDK 9 to optimize ByteBuffer. I've run the same test on JDK 9-ea b134 and verified that testReadRelative does not have redundant memory writes and range checks. Now it runs almost as fast as testReadAbsolute.

// JDK 1.8.0_92, VM 25.92-b14

Benchmark                                        Mode  Cnt   Score   Error  Units
DirectByteBufferReadBenchmark.testReadAbsolute  thrpt   10  99,727 ± 0,542  ops/s
DirectByteBufferReadBenchmark.testReadRelative  thrpt   10  47,126 ± 0,289  ops/s

// JDK 9-ea, VM 9-ea+134

Benchmark                                        Mode  Cnt    Score   Error  Units
DirectByteBufferReadBenchmark.testReadAbsolute  thrpt   10  109,369 ± 0,403  ops/s
DirectByteBufferReadBenchmark.testReadRelative  thrpt   10   97,140 ± 0,572  ops/s

UPDATE

In order to help the JIT compiler with optimization, I've introduced a local variable

ByteBuffer directByteBuffer = d.directByteBuffer;

in both benchmarks. Otherwise the extra level of indirection does not allow the compiler to eliminate the ByteBuffer.position field updates.
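A sketch of how the two read loops might look with that local variable, written as plain static methods so it runs without JMH. The class name LocalBufferRead, the fill helper and the numElem parameter are assumptions made for this example; the loop bodies follow the benchmark from the question.

```java
import java.nio.ByteBuffer;

final class LocalBufferRead {
    static final int OBJ_SIZE = 8 + 4 + 1;

    /** Hypothetical stand-in for the benchmark's @State class. */
    static final class Data {
        ByteBuffer directByteBuffer;
    }

    /** Fills a direct buffer with numElem records of (long, int, byte). */
    static Data fill(int numElem) {
        Data d = new Data();
        d.directByteBuffer = ByteBuffer.allocateDirect(OBJ_SIZE * numElem);
        for (int i = 0; i < numElem; i++) {
            d.directByteBuffer.putLong(i);
            d.directByteBuffer.putInt(i);
            d.directByteBuffer.put((byte) (i & 1));
        }
        return d;
    }

    static long readAbsolute(Data d, int numElem) {
        ByteBuffer directByteBuffer = d.directByteBuffer; // load the field once
        long val = 0L;
        for (int i = 0; i < numElem; i++) {
            int index = OBJ_SIZE * i;
            val += directByteBuffer.getLong(index);
            directByteBuffer.getInt(index + 8);
            directByteBuffer.get(index + 12);
        }
        return val;
    }

    static long readRelative(Data d, int numElem) {
        ByteBuffer directByteBuffer = d.directByteBuffer; // load the field once
        directByteBuffer.rewind();
        long val = 0L;
        for (int i = 0; i < numElem; i++) {
            val += directByteBuffer.getLong();
            directByteBuffer.getInt();
            directByteBuffer.get();
        }
        return val;
    }
}
```

Both methods return the same sum; the difference lies only in how the buffer is addressed, so any timing gap comes from the generated code rather than from the work done.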
