
Is there a way to check if a buffer is in Brotli compressed format?

I'm an intern doing research into whether using Brotli compression in a piece of software provides a performance boost over the current release, which uses GZip.

My task is to change anything using GZip to use Brotli compression instead. One function I need to replace checks whether a buffer contains data that was compressed using GZip. It does this by checking the stream identifier at the beginning of the buffer:

bool isGzipped() const
{
    // Gzip file signature (0x1f8b)
    return
        (_bufferEnd >= _bufferStart + 2) &&
        (static_cast<unsigned char>(_bufferStart[0]) == 0x1f) &&
        (static_cast<unsigned char>(_bufferStart[1]) == 0x8b);
}

I want to create a similar function, bool isBrotliEncoded(). Is there a comparable quick check that can be done on Brotli-encoded buffers? I've had a look at the byte values of some of the compressed files that brotli produces, but I can't find a rule that holds for all of them. Some start with 0x5B, some with 0x1B, compressing an empty file results in 0x06, and files that have been compressed multiple times start with a range of different values. The end of each file is also inconsistent.

The only way I know of to test whether a buffer is in the correct format is to attempt decompression and wait for an error, which defeats the purpose of doing this test.

So my question is: Does anyone know how to check if a buffer has been compressed with Brotli without attempting decompression and waiting for failure?

Unfortunately, the raw brotli format is not well suited to such detection, even when simply trying to decompress and waiting for an error.

I ran a trial of one million brotli decompressions of random data. About 5% of them checked out as good brotli streams. So you've already got a problem right there. 3.5% of the million are a single byte, since there are nine one-byte values that are each a valid brotli stream. The mean length of the random valid streams was almost a megabyte.

For those in which an error was detected (about 95% of the million cases), 3.5% went more than a megabyte before the error was detected, and 1.4% went more than ten megabytes. The mean number of random bytes consumed before finding an error was 309 KB. Another problem.

In short, the probability of a false positive is relatively high, and the number of bytes to process to find a negative can be quite large.

If you are writing this software, then you should put your own header before the brotli data to aid in detection. Or you can use the brotli framing format that I developed at their request, which has a unique four-byte header before the brotli compressed stream. That would reduce the probability of a false positive dramatically.
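As a minimal sketch of the "put your own header before the brotli data" option, something like the following could work. The four magic bytes and the names kAppBrotliMagic, wrapBrotliPayload and hasAppBrotliHeader are made up for illustration; they are not part of Brotli and are not the signature used by the framing format mentioned above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical application-defined signature ('B','R','O', version 1).
// This is an assumption for the example, NOT a Brotli-defined value.
constexpr std::array<std::uint8_t, 4> kAppBrotliMagic = {{0x42, 0x52, 0x4F, 0x01}};

// Prepend the signature to a payload that was already compressed with Brotli.
std::vector<std::uint8_t> wrapBrotliPayload(const std::vector<std::uint8_t>& compressed)
{
    std::vector<std::uint8_t> framed;
    framed.reserve(kAppBrotliMagic.size() + compressed.size());
    framed.insert(framed.end(), kAppBrotliMagic.begin(), kAppBrotliMagic.end());
    framed.insert(framed.end(), compressed.begin(), compressed.end());
    assert(framed.size() == kAppBrotliMagic.size() + compressed.size());
    return framed;
}

// Quick check in the spirit of isGzipped(): compare the leading bytes only.
bool hasAppBrotliHeader(const char* bufferStart, const char* bufferEnd)
{
    const std::size_t n = kAppBrotliMagic.size();
    return bufferStart != nullptr &&
           bufferEnd >= bufferStart &&
           static_cast<std::size_t>(bufferEnd - bufferStart) >= n &&
           std::memcmp(bufferStart, kAppBrotliMagic.data(), n) == 0;
}
```

The writer would call wrapBrotliPayload() on the compressor output, and the reader would call hasAppBrotliHeader() and then skip the first four bytes before handing the rest to the decoder. A four-byte signature still matches random data about once in 2^32 buffers, so the check lowers the false-positive rate without eliminating it.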

Brotli is formally defined in RFC 7932. The format of the data stream is covered in Section 2: Compressed Representation Overview and Section 9: Compressed Data Format. Brotli does not employ leading/trailing identifiers like gzip does, but it does consist of a sequence of uncompressed headers and commands that describe the compressed data. They are not all aligned on byte boundaries, so you have to parse them at the bit level instead (Brotli is processed as a stream of bits and bytes). Refer to Section 10: Decoding Algorithm for how to read these headers. If you can parse out a few headers that follow the Brotli format without error, then it is a good bet that you are dealing with a Brotli compressed buffer.
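To give a feel for what bit-level header parsing looks like, here is a rough sketch that decodes only the very first field of the stream header, the window size (WBITS), following my reading of the variable-length code table in Section 9.1 of RFC 7932. The function names and the -1 sentinel are invented for this example, and it stops after the first field, so it is nowhere near a full validity check.

```cpp
#include <cassert>
#include <cstdint>

// Sentinel used by this sketch for the one reserved WBITS bit pattern.
constexpr int kInvalidWindowBits = -1;

// Decode WBITS from the first byte of a stream. Bits are consumed starting
// from the least-significant bit of the byte.
int decodeWindowBits(std::uint8_t firstByte)
{
    int wbits;
    if ((firstByte & 0x01) == 0) {
        wbits = 16;                               // 1-bit pattern "0"
    } else {
        const int n1 = (firstByte >> 1) & 0x07;   // next three bits
        if (n1 != 0) {
            wbits = 17 + n1;                      // 4-bit patterns: 18..24
        } else {
            const int n2 = (firstByte >> 4) & 0x07;
            if (n2 == 1) {
                return kInvalidWindowBits;        // pattern 0010001 is reserved
            }
            wbits = (n2 == 0) ? 17 : 8 + n2;      // 7-bit patterns: 17, 10..15
        }
    }
    assert(wbits >= 10 && wbits <= 24);
    return wbits;
}

// Count how many of the 256 possible first bytes the WBITS field alone rejects.
int countFirstBytesRejectedByWbits()
{
    int rejected = 0;
    for (int b = 0; b < 256; ++b) {
        if (decodeWindowBits(static_cast<std::uint8_t>(b)) == kInvalidWindowBits) {
            ++rejected;
        }
    }
    return rejected;
}
```

Under this reading, the first bytes from the question decode cleanly: 0x5B and 0x1B both give a window size of 22 bits, and 0x06 gives 16. Only two of the 256 possible first bytes are rejected by this field, which is why the first byte on its own says almost nothing and the later headers have to be parsed as well.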
