Variable length encoding of an integer

What's the best way to do variable-length encoding of an unsigned integer value in C#?


"The actual intent is to append a variable-length encoded integer (bytes) to a file header."

For example: the "Content-Length" HTTP header.

Can this be achieved with some changes to the logic below?


I have written some code which does that ....

A method I have used, which makes smaller values use fewer bytes, is to encode 7 bits of data + 1 bit of overhead per byte.

The encoding works only for non-negative values (zero and up), but can be modified if necessary to handle negative values as well.

The way the encoding works is like this:

  • Grab the lowest 7 bits of your value and store them in a byte; this is what you're going to output
  • Shift the value 7 bits to the right, getting rid of those 7 bits you just grabbed
  • If the value is non-zero (i.e. after you shifted away 7 bits from it), set the high bit of the byte you're going to output before you output it
  • Output the byte
  • If the value is non-zero (i.e. the same check that resulted in setting the high bit), go back and repeat the steps from the start

To decode:

  • Start at bit-position 0
  • Read one byte from the file
  • Store whether the high bit is set, and mask it away
  • OR the rest of the byte into your final value, shifted left to the bit-position you're at
  • If the high bit was set, increase the bit-position by 7, and repeat the steps, skipping the first one (don't reset the bit-position)

          39    32 31    24 23    16 15     8 7      0
value:            |DDDDDDDD|CCCCCCCC|BBBBBBBB|AAAAAAAA|
encoded: |0000DDDD|xDDDDCCC|xCCCCCBB|xBBBBBBA|xAAAAAAA| (note, stored in reverse order)
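
As a worked illustration of the encode and decode steps above, here is a minimal sketch that operates on a byte array instead of a file. The class and method names (VarintDemo, Encode, Decode) are made up for this example and are not part of the original answer.

```csharp
using System.Collections.Generic;

public static class VarintDemo
{
    // Encodes a value 7 bits at a time, least significant group first.
    public static byte[] Encode(uint value)
    {
        var output = new List<byte>();
        do
        {
            byte b = (byte)(value & 0x7F); // grab the lowest 7 bits
            value >>= 7;                   // shift them away
            if (value != 0)
                b |= 0x80;                 // high bit: more bytes follow
            output.Add(b);
        } while (value != 0);
        return output.ToArray();
    }

    // Decodes bytes produced by Encode, starting at bit-position 0.
    public static uint Decode(byte[] bytes)
    {
        uint value = 0;
        int shift = 0;
        foreach (byte b in bytes)
        {
            value |= (uint)(b & 0x7F) << shift; // OR in the 7 data bits
            if ((b & 0x80) == 0)
                break;                          // high bit clear: last byte
            shift += 7;
        }
        return value;
    }
}
```

For example, 300 is binary 1 0010 1100. The lowest 7 bits are 0x2C, and because the remaining value (2) is non-zero the first byte becomes 0xAC; the second byte is 0x02. VarintDemo.Encode(300) therefore yields the two bytes AC 02, and VarintDemo.Decode turns them back into 300.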

As you can see, the encoded value might occupy one additional byte that is only half used, due to the overhead of the control bits. If you expand this to a 64-bit value restricted to non-negative numbers (63 data bits), the ninth byte is completely used, so there is still only one byte of extra overhead; a full 64-bit unsigned value needs a tenth byte.

Note: Since the encoding stores values one byte at a time, always in the same order, big- or little-endian systems will not change the layout. The least significant 7-bit group is always stored first, and so on.

Ranges and their encoded size:

          0 -         127 : 1 byte
        128 -      16.383 : 2 bytes
     16.384 -   2.097.151 : 3 bytes
  2.097.152 - 268.435.455 : 4 bytes
268.435.456 -   max-int32 : 5 bytes
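
The table above follows directly from the 7-data-bits-per-byte rule. A small sketch that computes the encoded size for a given value (the names VarintSize and EncodedSize are hypothetical, chosen for this example):

```csharp
public static class VarintSize
{
    // Number of bytes the 7-bits-per-byte encoding needs for a value.
    public static int EncodedSize(uint value)
    {
        int size = 1;
        while (value >= 0x80) // more than 7 bits left: one more byte
        {
            value >>= 7;
            size++;
        }
        return size;
    }
}
```

Each range boundary in the table is a power of 2^7: 128 = 2^7, 16.384 = 2^14, 2.097.152 = 2^21 and 268.435.456 = 2^28.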

Here are C# implementations of both:

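// LINQPad script; needs: using System; using System.IO;
// Dump() is a LINQPad extension method; outside LINQPad, use Console.WriteLine instead.
void Main()
{
    // Write the value to a file in encoded form.
    using (FileStream stream = new FileStream(@"c:\temp\test.dat", FileMode.Create))
    using (BinaryWriter writer = new BinaryWriter(stream))
        writer.EncodeInt32(123456789);

    // Read it back and display the decoded value.
    using (FileStream stream = new FileStream(@"c:\temp\test.dat", FileMode.Open))
    using (BinaryReader reader = new BinaryReader(stream))
        reader.DecodeInt32().Dump();
}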

public static class Extensions
{
    /// <summary>
    /// Encodes the specified <see cref="Int32"/> value with a variable number of
    /// bytes, and writes the encoded bytes to the specified writer.
    /// </summary>
    /// <param name="writer">
    /// The <see cref="BinaryWriter"/> to write the encoded value to.
    /// </param>
    /// <param name="value">
    /// The <see cref="Int32"/> value to encode and write to the <paramref name="writer"/>.
    /// </param>
    /// <exception cref="ArgumentNullException">
    /// <para><paramref name="writer"/> is <c>null</c>.</para>
    /// </exception>
    /// <exception cref="ArgumentOutOfRangeException">
    /// <para><paramref name="value"/> is less than 0.</para>
    /// </exception>
    /// <remarks>
    /// See <see cref="DecodeInt32"/> for how to decode the value back from
    /// a <see cref="BinaryReader"/>.
    /// </remarks>
    public static void EncodeInt32(this BinaryWriter writer, int value)
    {
        if (writer == null)
            throw new ArgumentNullException("writer");
        if (value < 0)
            throw new ArgumentOutOfRangeException("value", value, "value must be 0 or greater");

        do
        {
            byte lower7bits = (byte)(value & 0x7f);
            value >>= 7;
            if (value > 0)
                lower7bits |= 128;
            writer.Write(lower7bits);
        } while (value > 0);
    }

    /// <summary>
    /// Decodes a <see cref="Int32"/> value from a variable number of
    /// bytes, originally encoded with <see cref="EncodeInt32"/> from the specified reader.
    /// </summary>
    /// <param name="reader">
    /// The <see cref="BinaryReader"/> to read the encoded value from.
    /// </param>
    /// <returns>
    /// The decoded <see cref="Int32"/> value.
    /// </returns>
    /// <exception cref="ArgumentNullException">
    /// <para><paramref name="reader"/> is <c>null</c>.</para>
    /// </exception>
    public static int DecodeInt32(this BinaryReader reader)
    {
        if (reader == null)
            throw new ArgumentNullException("reader");

        bool more = true;
        int value = 0;
        int shift = 0;
        while (more)
        {
            byte lower7bits = reader.ReadByte();
            more = (lower7bits & 128) != 0;
            value |= (lower7bits & 0x7f) << shift;
            shift += 7;
        }
        return value;
    }
}
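
The answer above mentions that the scheme can be modified to handle negative values. One common way to do that, not the answer's own code, is a "zigzag" mapping that interleaves negative and positive numbers so that values of small magnitude stay small before they are run through the 7-bit encoder. A minimal sketch with hypothetical names:

```csharp
public static class ZigZag
{
    // Maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ...
    public static uint Encode(int value)
    {
        // (value >> 31) is all ones for negative input, zero otherwise.
        return (uint)((value << 1) ^ (value >> 31));
    }

    // Inverse of Encode.
    public static int Decode(uint value)
    {
        return (int)(value >> 1) ^ -(int)(value & 1);
    }
}
```

Note that the mapped result is unsigned and can exceed int.MaxValue, so it would need an encoder that accepts a uint rather than the int-based EncodeInt32 shown above.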

You should first make a histogram of your values. If the distribution is uniform (that is, every bin of the histogram has about the same count), then you won't be able to encode the values more efficiently than their plain binary representation.

If your histogram is unbalanced (that is, if some values occur more often than others), then it might make sense to choose an encoding that uses fewer bits for these values, while using more bits for the other, unlikely values.

For example, if the numbers you need to encode are twice as likely to fit in 15 bits as not, you can use the 16th bit as a flag and store/send only 16 bits for those values (if the flag is zero, the upcoming byte completes a 16-bit number that fits in a 32-bit number). If the flag is 1, the upcoming 25 bits complete a 32-bit number. You lose one bit here, but because that case is unlikely, over many numbers you save more bits than you lose.
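
A byte-aligned variant of this flag idea can be sketched as follows. It is a simplification, not the exact bit layout described above: the top bit of the first byte is the flag, small values take 2 bytes (15 data bits) and everything else takes 4 bytes (31 data bits). The names FlaggedInt, Encode and Decode are assumptions made for this example.

```csharp
using System;

public static class FlaggedInt
{
    // Values below 2^15 take 2 bytes (flag 0); larger ones take 4 bytes (flag 1).
    public static byte[] Encode(uint value)
    {
        if (value < 0x8000)
            return new[] { (byte)(value >> 8), (byte)value };
        if (value > 0x7FFFFFFF)
            throw new ArgumentOutOfRangeException(nameof(value), "Only 31 data bits are available.");
        return new[]
        {
            (byte)((value >> 24) | 0x80), // flag bit plus the top 7 data bits
            (byte)(value >> 16),
            (byte)(value >> 8),
            (byte)value
        };
    }

    // Reads one value from the start of data and reports how many bytes it used.
    public static uint Decode(byte[] data, out int length)
    {
        if ((data[0] & 0x80) == 0)
        {
            length = 2;
            return (uint)(data[0] << 8 | data[1]);
        }
        length = 4;
        return (uint)((data[0] & 0x7F) << 24 | data[1] << 16 | data[2] << 8 | data[3]);
    }
}
```

Whether this pays off depends on the histogram: if most values fit in 15 bits, the average size approaches 2 bytes instead of 4, at the cost of one bit of range.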

Obviously, this is a trivial case. The extension to more than two cases is the Huffman algorithm, which assigns each value a code word that is close to optimal based on the probability of the value appearing.

There's also arithmetic coding, which does this too (and probably other algorithms).

In all cases, no solution can store uniformly random values more efficiently than the fixed-width representation already used in computer memory.

You have to weigh how long and how hard such a solution will be to implement against the savings you'll get in the end to know whether it's worth it. The language itself is not relevant here.

If small values are more common than large values, you can use Golomb coding.
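
To give an idea of what that looks like, here is a rough sketch of Golomb coding restricted to a power-of-two divisor 2^k (often called Rice coding). The quotient is written in unary and the remainder in k binary digits. For readability the bits are returned as a string of '0' and '1' characters; a real implementation would pack them into bytes. The names RiceCode, Encode and Decode are invented for this example.

```csharp
using System.Text;

public static class RiceCode
{
    // Quotient (value / 2^k) in unary: that many '1's followed by a '0'.
    // Remainder (value % 2^k) in exactly k binary digits.
    public static string Encode(uint value, int k)
    {
        var bits = new StringBuilder();
        bits.Append('1', (int)(value >> k));
        bits.Append('0');
        for (int i = k - 1; i >= 0; i--)
            bits.Append(((value >> i) & 1) == 1 ? '1' : '0');
        return bits.ToString();
    }

    // Reads one code word from the start of the bit string.
    public static uint Decode(string bits, int k)
    {
        int pos = 0;
        uint quotient = 0;
        while (bits[pos++] == '1') // count the unary part
            quotient++;
        uint remainder = 0;
        for (int i = 0; i < k; i++)
            remainder = (remainder << 1) | (uint)(bits[pos++] - '0');
        return (quotient << k) | remainder;
    }
}
```

With k = 2, the value 9 has quotient 2 and remainder 1, so RiceCode.Encode(9, 2) gives "11001", while 0 encodes as "000". Small values get short codes; large values grow linearly in the unary part, so k should be chosen to match the typical magnitude of the data.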

I know this question was asked quite a few years ago; however, for MIDI developers I thought I'd share some code from a personal MIDI project I'm working on. The code block is based on a segment from the book Maximum MIDI by Paul Messick (this example is a tweaked version for my own needs, but the concept is all there).

    public struct VariableLength
    {
        // Variable Length byte array to int
        public VariableLength(byte[] bytes)
        {
            int index = 0;
            int value = 0;
            byte b;
            do
            {
                value = (value << 7) | ((b = bytes[index]) & 0x7F);
                index++;
            } while ((b & 0x80) != 0);

            Length = index;
            Value = value;
            Bytes = new byte[Length];
            Array.Copy(bytes, 0, Bytes, 0, Length);
        }

        // Variable Length int to byte array
        public VariableLength(int value)
        {
            Value = value;
            byte[] bytes = new byte[4];
            int index = 0;
            int buffer = value & 0x7F;

            while ((value >>= 7) > 0)
            {
                buffer <<= 8;
                buffer |= 0x80;
                buffer += (value & 0x7F);
            }
            while (true)
            {
                bytes[index] = (byte)buffer;
                index++;
                if ((buffer & 0x80) > 0)
                    buffer >>= 8;
                else
                    break;
            }

            Length = index;
            Bytes = new byte[index];
            Array.Copy(bytes, 0, Bytes, 0, Length);
        }

        // Number of bytes used to store the variable length value
        public int Length { get; private set; }
        // Variable Length Value
        public int Value { get; private set; }
        // Bytes representing the integer value
        public byte[] Bytes { get; private set; }
    }

How to use:

public void Example()
{
    // Convert an integer into a variable-length byte array
    int varLenVal = 480;
    VariableLength encoded = new VariableLength(varLenVal);
    byte[] bytes = encoded.Bytes;   // { 131, 96 }

    // Convert a variable-length byte array into an integer
    byte[] varLenByte = new byte[2] { 131, 96 };
    VariableLength decoded = new VariableLength(varLenByte);
    int result = decoded.Value;     // 480
}
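
Note that MIDI variable-length quantities store the most significant 7-bit group first, whereas the 7-bit scheme described earlier in this thread stores the least significant group first, so the two formats are not interchangeable. A minimal side-by-side sketch (class and method names are hypothetical):

```csharp
using System.Collections.Generic;

public static class ByteOrderDemo
{
    // Least significant 7-bit group first.
    public static byte[] LowGroupFirst(uint value)
    {
        var bytes = new List<byte>();
        while (value >= 0x80)
        {
            bytes.Add((byte)(value | 0x80)); // low 7 bits plus continuation bit
            value >>= 7;
        }
        bytes.Add((byte)value);
        return bytes.ToArray();
    }

    // Most significant 7-bit group first, as in MIDI variable-length quantities.
    public static byte[] HighGroupFirst(uint value)
    {
        var bytes = new List<byte> { (byte)(value & 0x7F) }; // last byte: no continuation bit
        while ((value >>= 7) != 0)
            bytes.Insert(0, (byte)((value & 0x7F) | 0x80));
        return bytes.ToArray();
    }
}
```

For 480, HighGroupFirst gives the bytes 83 60 (131, 96, matching the example above), while LowGroupFirst gives E0 03.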

As Grimbly pointed out, there exist BinaryReader.Read7BitEncodedInt and BinaryWriter.Write7BitEncodedInt. However, these are internal methods that one cannot call from a BinaryReader or BinaryWriter object.

What you can do instead is copy the internal implementation from the reader and the writer:

public static int Read7BitEncodedInt(this BinaryReader br) {
    // Read out an Int32 7 bits at a time.  The high bit 
    // of the byte when on means to continue reading more bytes.
    int count = 0;
    int shift = 0;
    byte b;
    do {
        // Check for a corrupted stream.  Read a max of 5 bytes.
        // In a future version, add a DataFormatException.
        if (shift == 5 * 7)  // 5 bytes max per Int32, shift += 7
            throw new FormatException("Format_Bad7BitInt32");

        // ReadByte handles end of stream cases for us. 
        b = br.ReadByte();
        count |= (b & 0x7F) << shift;
        shift += 7;
    } while ((b & 0x80) != 0); 
    return count;
}   

public static void Write7BitEncodedInt(this BinaryWriter br, int value) {
    // Write out an int 7 bits at a time.  The high bit of the byte,
    // when on, tells reader to continue reading more bytes.
    uint v = (uint)value;   // support negative numbers
    while (v >= 0x80) {
        br.Write((byte)(v | 0x80));
        v >>= 7;
    }
    br.Write((byte)v);
}   

When you include this code in a static class of your project (extension methods must be declared in one), you'll be able to use the methods on any BinaryReader / BinaryWriter object. They've only been slightly modified to make them work outside of their original classes (for example by changing ReadByte() to br.ReadByte()). The comments are from the original source.
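
On newer runtimes the copy may not be needed. The sketch below assumes .NET 5 or later, where Read7BitEncodedInt and Write7BitEncodedInt are public members of BinaryReader and BinaryWriter; on older frameworks it will not compile and the copied extension methods above are the way to go. The class name BuiltIn7BitDemo and its RoundTrip method are invented for this example.

```csharp
using System.IO;
using System.Text;

public static class BuiltIn7BitDemo
{
    // Writes a value with the framework's 7-bit encoding into memory,
    // reports how many bytes it took, and reads it back.
    // Assumption: .NET 5 or later, where both methods are public.
    public static int RoundTrip(int value, out long encodedLength)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new BinaryWriter(stream, Encoding.UTF8, leaveOpen: true))
                writer.Write7BitEncodedInt(value);

            encodedLength = stream.Length;
            stream.Position = 0;

            using (var reader = new BinaryReader(stream))
                return reader.Read7BitEncodedInt();
        }
    }
}
```

Because the writer casts the value to uint first, a negative number such as -1 survives the round trip but always occupies the full 5 bytes.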
