Fastest way to check if a byte array is all zeros

I have a byte[4096] and was wondering what the fastest way is to check whether all values are zero.

Is there any way faster than doing the following?

byte[] b = new byte[4096];
b[4095] = 1;
for(int i=0;i<b.length;i++)
    if(b[i] != 0)
        return false; // Not Empty

I have rewritten this answer because I was first summing all bytes. That is incorrect, as Java has signed bytes and a sum can cancel out to zero, so I need to OR the values instead. I have also changed the JVM warmup so that it is now correct.
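To illustrate why summing is unsafe with signed bytes, here is a minimal sketch. The class and method names are hypothetical and not part of the benchmark below. A sum can cancel out (for example 1 and -1), while a bitwise OR is zero only if no bit is set in any element.

```java
final class SumVersusOr {
    // Incorrect check: signed bytes such as 1 and -1 cancel each other out.
    static boolean allZeroBySum(byte[] array) {
        int sum = 0;
        for (byte b : array) {
            sum += b;
        }
        return sum == 0;
    }

    // Correct check: OR keeps every bit that is set anywhere in the array.
    static boolean allZeroByOr(byte[] array) {
        int acc = 0;
        for (byte b : array) {
            acc |= b;
        }
        return acc == 0;
    }

    public static void main(String[] args) {
        byte[] tricky = {1, -1};
        System.out.println(allZeroBySum(tricky)); // true, which is wrong
        System.out.println(allZeroByOr(tricky));  // false, as expected
    }
}
```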

Your best bet really is to simply loop over all values.

I suppose you have three major options available:

  1. OR all elements together and check the result.
  2. Do branchless comparisons.
  3. Do comparisons with a branch.

I don't know how well adding (or OR-ing) bytes performs at a low level in Java, but I do know that the CPU's branch predictor comes into play when you use branched comparisons.

Therefore I expect the following to happen with this loop:

byte[] array = new byte[4096];
for (byte b : array) {
    if (b != 0) {
        return false;
    }
}
  1. Relatively slow comparisons in the first few iterations, while the branch predictor is still seeding itself.
  2. Very fast branch comparisons after that, thanks to branch prediction, as every value should be zero anyway.

If the loop hits a non-zero value, the branch predictor will mispredict and that comparison will be slow, but at that point the computation is finished anyway because you return false. I think the cost of one failed branch prediction is an order of magnitude smaller than the cost of continuing to iterate over the array.

I furthermore believe that for (byte b : array) is fine to use, as it should get compiled directly into indexed array iteration. As far as I know there is no such thing as a PrimitiveArrayIterator, which would cause extra method calls (as when iterating over a list) until the code gets inlined.
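As a sketch of that point, the two methods below should behave identically. The second one spells out, roughly, the indexed loop that the Java Language Specification defines an enhanced for statement over an array to mean. The class and method names are made up for illustration.

```java
final class ArrayLoopForms {
    // Enhanced for loop over a primitive array: no Iterator object is involved.
    static boolean allZeroForEach(byte[] array) {
        for (byte b : array) {
            if (b != 0) {
                return false;
            }
        }
        return true;
    }

    // Roughly the indexed form the enhanced for loop above is defined to expand to.
    static boolean allZeroIndexed(byte[] array) {
        byte[] a = array;
        for (int i = 0; i < a.length; i++) {
            byte b = a[i];
            if (b != 0) {
                return false;
            }
        }
        return true;
    }
}
```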

Update

I wrote my own benchmarks, which give some interesting results. Unfortunately I couldn't use any of the existing benchmark tools, as they are pretty hard to get installed correctly.

I also decided to group options 1 and 2 together, as I think they are actually the same: with a branchless approach you usually OR everything together (minus the condition) and then check the final result. The condition here is x != 0, so OR-ing in a zero is presumably a no-op.

The code:

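import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Consumer;
import java.util.stream.IntStream;

public class Benchmark {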
    private void start() {
        //setup byte arrays
        List<byte[]> arrays = createByteArrays(700_000);

        //warmup and benchmark repeated
        arrays.forEach(this::byteArrayCheck12);
        benchmark(arrays, this::byteArrayCheck12, "byteArrayCheck12");

        arrays.forEach(this::byteArrayCheck3);
        benchmark(arrays, this::byteArrayCheck3, "byteArrayCheck3");

        arrays.forEach(this::byteArrayCheck4);
        benchmark(arrays, this::byteArrayCheck4, "byteArrayCheck4");

        arrays.forEach(this::byteArrayCheck5);
        benchmark(arrays, this::byteArrayCheck5, "byteArrayCheck5");
    }

    private void benchmark(final List<byte[]> arrays, final Consumer<byte[]> method, final String name) {
        long start = System.nanoTime();
        arrays.forEach(method);
        long end = System.nanoTime();
        double nanosecondsPerIteration = (end - start) * 1d / arrays.size();
        System.out.println("Benchmark: " + name + " / iterations: " + arrays.size() + " / time per iteration: " + nanosecondsPerIteration + "ns");
    }

    private List<byte[]> createByteArrays(final int amount) {
        Random random = new Random();
        List<byte[]> resultList = new ArrayList<>();
        for (int i = 0; i < amount; i++) {
            byte[] byteArray = new byte[4096];
            byteArray[random.nextInt(4096)] = 1;
            resultList.add(byteArray);
        }
        return resultList;
    }

    private boolean byteArrayCheck12(final byte[] array) {
        int sum = 0;
        for (byte b : array) {
            sum |= b;
        }
        return (sum == 0);
    }

    private boolean byteArrayCheck3(final byte[] array) {
        for (byte b : array) {
            if (b != 0) {
                return false;
            }
        }
        return true;
    }

    private boolean byteArrayCheck4(final byte[] array) {
        // OR all bytes together via a stream; the result is 0 only if every byte is 0.
        return (IntStream.range(0, array.length).map(i -> array[i]).reduce(0, (a, b) -> a | b) == 0);
    }

    private boolean byteArrayCheck5(final byte[] array) {
        return IntStream.range(0, array.length).map(i -> array[i]).anyMatch(i -> i != 0);
    }

    public static void main(String[] args) {
        new Benchmark().start();
    }
}

The surprising results:

Benchmark: byteArrayCheck12 / iterations: 700000 / time per iteration: 50.18817142857143ns
Benchmark: byteArrayCheck3 / iterations: 700000 / time per iteration: 767.7371985714286ns
Benchmark: byteArrayCheck4 / iterations: 700000 / time per iteration: 21145.03219857143ns
Benchmark: byteArrayCheck5 / iterations: 700000 / time per iteration: 10376.119144285714ns

This shows that OR-ing is a whole lot faster than the branching version, which is rather surprising, so I assume some low-level optimizations are being done.

As an extra I've included the stream variants, which I did not expect to be that fast anyhow.

Ran on a stock-clocked Intel i7-3770 with 16GB of 1600MHz RAM.

So I think the final answer is: it depends. It depends on how many times you are going to check the array consecutively. The "byteArrayCheck3" solution is always steady at 700~800ns.

Follow-up update

Things actually take another interesting turn: it turns out the JIT was optimizing almost all calculations away because the resulting values were not being used at all.

Thus I have the following new benchmark method:

private void benchmark(final List<byte[]> arrays, final Predicate<byte[]> method, final String name) {
    long start = System.nanoTime();
    boolean someUnrelatedResult = false;
    for (byte[] array : arrays) {
        someUnrelatedResult |= method.test(array);
    }
    long end = System.nanoTime();
    double nanosecondsPerIteration = (end - start) * 1d / arrays.size();
    System.out.println("Result: " + someUnrelatedResult);
    System.out.println("Benchmark: " + name + " / iterations: " + arrays.size() + " / time per iteration: " + nanosecondsPerIteration + "ns");
}

This ensures that the result of the benchmarks cannot be optimized away. The major issue was that the byteArrayCheck12 method was effectively void: the JIT noticed that (sum == 0) was never used, so it optimized away the entire method.

Thus we have the following new results (the result prints are omitted for clarity):

Benchmark: byteArrayCheck12 / iterations: 700000 / time per iteration: 1370.6987942857143ns
Benchmark: byteArrayCheck3 / iterations: 700000 / time per iteration: 736.1096242857143ns
Benchmark: byteArrayCheck4 / iterations: 700000 / time per iteration: 20671.230327142857ns
Benchmark: byteArrayCheck5 / iterations: 700000 / time per iteration: 9845.388841428572ns

Hence we might think that we can finally conclude that branch prediction wins. It could, however, also be because of the early returns: on average the offending byte will be in the middle of the byte array. So it is time for another method that does not return early:

private boolean byteArrayCheck3b(final byte[] array) {
    int hits = 0;
    for (byte b : array) {
        if (b != 0) {
            hits++;
        }
    }
    return (hits == 0);
}

In this way we still benefit from branch prediction, but we make sure that we cannot return early.

Which in turn gives us more interesting results again!

Benchmark: byteArrayCheck12 / iterations: 700000 / time per iteration: 1327.2817714285713ns
Benchmark: byteArrayCheck3 / iterations: 700000 / time per iteration: 753.31376ns
Benchmark: byteArrayCheck3b / iterations: 700000 / time per iteration: 1506.6772842857142ns
Benchmark: byteArrayCheck4 / iterations: 700000 / time per iteration: 21655.950115714284ns
Benchmark: byteArrayCheck5 / iterations: 700000 / time per iteration: 10608.70917857143ns

I think we can finally conclude that the fastest way is to use both early return and branch prediction, followed by OR-ing, followed by pure branch prediction. I suspect that all of those operations are highly optimized in native code.

Update: some additional benchmarking using long and int arrays.

After seeing suggestions on using long[] and int[], I decided it was worth investigating. These attempts may not be fully in line with the original answers anymore, but they may still be interesting.

Firstly, I changed the benchmark method to use generics:

private <T> void benchmark(final List<T> arrays, final Predicate<T> method, final String name) {
    long start = System.nanoTime();
    boolean someUnrelatedResult = false;
    for (T array : arrays) {
        someUnrelatedResult |= method.test(array);
    }
    long end = System.nanoTime();
    double nanosecondsPerIteration = (end - start) * 1d / arrays.size();
    System.out.println("Result: " + someUnrelatedResult);
    System.out.println("Benchmark: " + name + " / iterations: " + arrays.size() + " / time per iteration: " + nanosecondsPerIteration + "ns");
}

Then I performed conversions from byte[] to long[] and int[] respectively before the benchmarks. It was also necessary to set the maximum heap size to 10 GB.

List<long[]> longArrays = arrays.stream().map(byteArray -> {
    long[] longArray = new long[4096 / 8];
    ByteBuffer.wrap(byteArray).asLongBuffer().get(longArray);
    return longArray;
}).collect(Collectors.toList());
longArrays.forEach(this::byteArrayCheck8);
benchmark(longArrays, this::byteArrayCheck8, "byteArrayCheck8");

List<int[]> intArrays = arrays.stream().map(byteArray -> {
    int[] intArray = new int[4096 / 4];
    ByteBuffer.wrap(byteArray).asIntBuffer().get(intArray);
    return intArray;
}).collect(Collectors.toList());
intArrays.forEach(this::byteArrayCheck9);
benchmark(intArrays, this::byteArrayCheck9, "byteArrayCheck9");

private boolean byteArrayCheck8(final long[] array) {
    for (long l : array) {
        if (l != 0) {
            return false;
        }
    }
    return true;
}

private boolean byteArrayCheck9(final int[] array) {
    for (int i : array) {
        if (i != 0) {
            return false;
        }
    }
    return true;
}

Which gave the following results:

Benchmark: byteArrayCheck8 / iterations: 700000 / time per iteration: 259.8157614285714ns
Benchmark: byteArrayCheck9 / iterations: 700000 / time per iteration: 266.38013714285717ns

This path may be worth exploring if it is possible to get the bytes in such a format. However, when doing the transformations inside the benchmarked method, the times were around 2000 nanoseconds per iteration, so it is not worth it when you need to do the conversions yourself.

This may not be the fastest or most memory-efficient solution, but it's a one-liner:

byte[] arr = randomByteArray();
assert Arrays.equals(arr, new byte[arr.length]);

For Java 8, you can simply use this:

public static boolean isEmpty(final byte[] data){
    return IntStream.range(0, data.length).parallel().allMatch(i -> data[i] == 0);
}

I think that theoretically your way is the fastest way. In practice you might be able to make use of larger comparisons, as suggested by one of the commenters (a 1-byte comparison takes 1 instruction, but so does an 8-byte comparison on a 64-bit system).

Also, in languages closer to the hardware (C and variants) you can make use of something called vectorization, where you can perform a number of the comparisons/additions simultaneously. It looks like Java still doesn't have native support for it, but based on this answer you might be able to get some use out of it.

Also, in line with the other comments, I would say that with a 4k buffer it's probably not worth the time to try to optimize it (unless it is being called very often).

Someone suggested checking 4 or 8 bytes at a time. You actually can do this in Java:

LongBuffer longBuffer = ByteBuffer.wrap(b).asLongBuffer();
while (longBuffer.hasRemaining()) {
    if (longBuffer.get() != 0) {
        return false;
    }
}
return true;

Whether this is faster than checking byte values is uncertain, since there is so much potential for optimization.
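One caveat with the LongBuffer approach: asLongBuffer() only covers whole 8-byte groups, so for an array whose length is not a multiple of 8 the trailing bytes would go unchecked. A byte[4096] is not affected. The following is a minimal sketch of a general-purpose variant that also checks the tail. The class and method names are hypothetical, and no claim is made about how it performs relative to the plain byte loop.

```java
import java.nio.ByteBuffer;
import java.nio.LongBuffer;

final class WideZeroCheck {
    static boolean isAllZero(byte[] data) {
        // View the array as 64-bit words; this covers data.length / 8 whole longs.
        LongBuffer longs = ByteBuffer.wrap(data).asLongBuffer();
        while (longs.hasRemaining()) {
            if (longs.get() != 0L) {
                return false;
            }
        }
        // Check the up to 7 trailing bytes that the long view does not cover.
        for (int i = data.length & ~7; i < data.length; i++) {
            if (data[i] != 0) {
                return false;
            }
        }
        return true;
    }
}
```

For example, WideZeroCheck.isAllZero(new byte[13]) checks one long plus five tail bytes.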

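Notice: technical posts on this site follow the CC BY-SA 4.0 license. If you repost, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.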

 