How can I compress a sequence of integers?

I have an array that contains values in the range -255 to +255. For example, the array can look like this:

  int data[]={234,56,-4,24,56,78,23,89,234,68,-12,-253,45,128};

Here, the order must be preserved when decompressing; e.g. after the first term, 234, the next value must be 56.

So, what are the ways to compress an arbitrary sequence of numbers in which no repeating pattern can be observed?

A range of -255 to 255 means 511 distinct values, which fit in 9 bits. If you store the sign separately, that is 1 bit for the sign and one byte for the magnitude.

You can write your array as a byte array, where each byte is the absolute value of the corresponding int.

In a separate zone (a long, or perhaps a byte array), store the sign bits.
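A minimal sketch of this sign/magnitude split, written in Python for brevity. The function names `split_sign_magnitude` and `join_sign_magnitude` are made up for illustration, and packing the sign bits eight to a byte (lowest bit first) is just one possible layout.

```python
def split_sign_magnitude(values):
    """Split ints in [-255, 255] into magnitude bytes and packed sign bits."""
    magnitudes = bytearray()
    signs = bytearray((len(values) + 7) // 8)  # one bit per value
    for i, v in enumerate(values):
        if not -255 <= v <= 255:
            raise ValueError(f"value out of range: {v}")
        magnitudes.append(abs(v))
        if v < 0:
            signs[i // 8] |= 1 << (i % 8)  # set bit i to mark a negative value
    return bytes(magnitudes), bytes(signs)


def join_sign_magnitude(magnitudes, signs):
    """Inverse of split_sign_magnitude; the original order is preserved."""
    return [
        -m if (signs[i // 8] >> (i % 8)) & 1 else m
        for i, m in enumerate(magnitudes)
    ]


data = [234, 56, -4, 24, 56, 78, 23, 89, 234, 68, -12, -253, 45, 128]
magnitudes, signs = split_sign_magnitude(data)
assert join_sign_magnitude(magnitudes, signs) == data
print(len(magnitudes) + len(signs))  # 14 magnitude bytes + 2 sign bytes
```

For the 14 sample values this takes 16 bytes instead of the 56 bytes of a 32-bit int array.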

If there are truly no patterns in the data, then a useful compression algorithm is impossible. Don't even bother trying!

Of course, in this case, because all the numbers are in a restricted range, you do have a pattern in the bits: the high bits of each int are either all 0 (positive) or all 1 (negative).

Standard compression algorithms like zip would therefore work if you want to compress reasonably effectively (assuming you have a long enough array of numbers to make it worthwhile).
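As a rough illustration, the sketch below uses Python's `zlib` module (DEFLATE, the method commonly used inside zip files) on 10,000 random values stored as 4-byte ints. The array length, the fixed seed, and the little-endian layout are assumptions chosen for the demonstration, not part of the question.

```python
import random
import struct
import zlib

random.seed(0)  # fixed seed so the run is repeatable
values = [random.randint(-255, 255) for _ in range(10_000)]

# Store each value as a 4-byte little-endian signed int, like a plain int array.
raw = struct.pack(f"<{len(values)}i", *values)
compressed = zlib.compress(raw, 9)

# Decompressing gives back the same values in the same order.
restored = list(struct.unpack(f"<{len(values)}i", zlib.decompress(compressed)))
assert restored == values
print(len(raw), len(compressed))
```

The saving comes from the predictable high bytes of each int; the low 9 bits of random data remain incompressible.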

Alternatively, you can exploit the fact that you are effectively using 9-bit numbers. You could roll your own compression algorithm by laying out the numbers as a long stream of 9-bit chunks and putting this into a byte array.
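One way this 9-bit layout could look, sketched in Python. The names `pack9` and `unpack9` are hypothetical, and the sketch assumes 9-bit two's complement fields written most significant bit first, with the element count stored elsewhere.

```python
def pack9(values):
    """Pack ints in [-255, 255] as consecutive 9-bit two's complement fields."""
    out = bytearray()
    acc = 0      # bit accumulator
    nbits = 0    # number of bits currently held in acc
    for v in values:
        if not -255 <= v <= 255:
            raise ValueError(f"value out of range: {v}")
        acc = (acc << 9) | (v & 0x1FF)
        nbits += 9
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1  # keep only the bits not yet written
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)  # zero-pad the final byte
    return bytes(out)


def unpack9(packed, count):
    """Read `count` 9-bit fields back; the count must be stored separately."""
    values = []
    acc = 0
    nbits = 0
    pos = 0
    for _ in range(count):
        while nbits < 9:
            acc = (acc << 8) | packed[pos]
            pos += 1
            nbits += 8
        nbits -= 9
        field = (acc >> nbits) & 0x1FF
        acc &= (1 << nbits) - 1
        values.append(field - 512 if field & 0x100 else field)  # sign-extend
    return values


data = [234, 56, -4, 24, 56, 78, 23, 89, 234, 68, -12, -253, 45, 128]
packed = pack9(data)
assert unpack9(packed, len(data)) == data
print(len(packed))  # 14 values * 9 bits = 126 bits, which fits in 16 bytes
```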

In your situation (when no repeating pattern can be observed), variable-length coding may help you.

For example, Elias gamma coding and Exponential-Golomb coding. The general idea is that small numbers need only a few bits to be encoded. Exp-Golomb coding is used in the H.264/MPEG-4 AVC video compression standard. Sequences are very easy to encode and decode with it, and the coding is not hard to implement.
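To make the "small numbers need few bits" idea concrete, here is a small Python sketch of Elias gamma coding for positive integers. The function name `elias_gamma` and the use of a '0'/'1' string instead of a real bit stream are illustrative choices.

```python
def elias_gamma(n):
    """Elias gamma code of a positive integer, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("Elias gamma codes only positive integers")
    binary = bin(n)[2:]                      # binary digits, no leading zeros
    return "0" * (len(binary) - 1) + binary  # zero-bit length prefix, then the value


for n in (1, 2, 3, 4, 15, 16, 511):
    print(n, elias_gamma(n), len(elias_gamma(n)))
```

A code word is 2 * floor(log2(n)) + 1 bits long, so every n of 16 or more costs at least 9 bits. The scheme only beats a fixed 9-bit field when small magnitudes dominate the data.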

The way to code all integers is to set up a bijection, mapping the integers (0, 1, -1, 2, -2, 3, -3, ...) to (1, 2, 3, 4, 5, 6, 7, ...) before coding.
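This mapping can be written in closed form. A minimal Python sketch follows; the names `to_positive` and `from_positive` are invented for illustration.

```python
def to_positive(n):
    """Map 0, 1, -1, 2, -2, ... to 1, 2, 3, 4, 5, ..."""
    return 2 * n if n > 0 else 1 - 2 * n


def from_positive(m):
    """Inverse mapping: 1, 2, 3, 4, 5, ... back to 0, 1, -1, 2, -2, ..."""
    return m // 2 if m % 2 == 0 else -((m - 1) // 2)


print([to_positive(n) for n in (0, 1, -1, 2, -2, 3, -3)])  # [1, 2, 3, 4, 5, 6, 7]
assert all(from_positive(to_positive(n)) == n for n in range(-255, 256))
```

For the range in the question, the values -255..255 map onto exactly 1..511.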

For example:

The sequence [ 0, 2, 5, 8, 5, 2 ] (after bijection, counting from 0 because Exp-Golomb adds 1 to each value before writing it) would be encoded as 101100110000100100110011. As you can see, there are no repeating patterns in this sequence, yet it is encoded in only 24 bits.

Short description of the decoding process:

  1. Read from the input stream and count the leading zero bits (until you find a non-zero bit) -> zero_bits_count

  2. Read the next ( zero_bits_count + 1 ) bits from the input stream -> binary

  3. Convert binary to decimal

  4. Return ( decimal - 1 )

1... -> no leading zeros, zero_bits_count = 0 -> read next 1 bit -> [1]... -> [1] is 1 -> 1 - 1 = 0

011... -> [0] - one leading zero, zero_bits_count = 1 -> read next 2 bits -> [11]... -> [11] is 3 -> 3 - 1 = 2

00110... -> [00] - two leading zeros, zero_bits_count = 2 -> read next 3 bits -> [110]... -> [110] is 6 -> 6 - 1 = 5

etc.
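The decoding steps and the matching encoder can be sketched in Python as follows. The function names are hypothetical, the bit stream is modelled as a '0'/'1' string for readability, and the decoder assumes well-formed input.

```python
def exp_golomb_encode(values):
    """Order-0 Exp-Golomb code of non-negative ints, as a '0'/'1' string."""
    bits = []
    for v in values:
        if v < 0:
            raise ValueError("map signed values to non-negative ones first")
        binary = bin(v + 1)[2:]                        # value + 1 in binary
        bits.append("0" * (len(binary) - 1) + binary)  # zero prefix, then binary
    return "".join(bits)


def exp_golomb_decode(bits):
    """Decode a '0'/'1' string by following the four steps listed above."""
    values = []
    pos = 0
    while pos < len(bits):
        zero_bits_count = 0
        while bits[pos] == "0":            # step 1: count leading zero bits
            zero_bits_count += 1
            pos += 1
        width = zero_bits_count + 1
        binary = bits[pos:pos + width]     # step 2: read zero_bits_count + 1 bits
        pos += width
        values.append(int(binary, 2) - 1)  # steps 3 and 4: to decimal, minus 1
    return values


encoded = exp_golomb_encode([0, 2, 5, 8, 5, 2])
print(encoded)  # 101100110000100100110011
assert exp_golomb_decode(encoded) == [0, 2, 5, 8, 5, 2]
```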

If the numbers are essentially random and uniformly distributed, and order is to be preserved, then the best you can do is about 9 bits per symbol. At 9 bits, a single 9-bit value will be unused, i.e. -256 in a two's complement representation. That is convenient, since you can use it as an end symbol to mark the end of the list. That way you have also encoded the length of the list, which would have to be stored somehow anyway.
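A possible Python sketch of the end-symbol idea. The names `END`, `encode_with_end_marker` and `decode_until_end_marker` are made up, and the sketch builds the stream in one big integer for brevity rather than speed.

```python
import math

END = -256  # the one 9-bit two's complement value that the data never uses


def encode_with_end_marker(values):
    """Pack values as 9-bit fields and append the END marker."""
    fields = list(values) + [END]
    acc = 0
    for v in fields:
        if v != END and not -255 <= v <= 255:
            raise ValueError(f"value out of range: {v}")
        acc = (acc << 9) | (v & 0x1FF)
    nbits = 9 * len(fields)
    pad = -nbits % 8  # zero bits needed to reach a byte boundary
    return (acc << pad).to_bytes((nbits + pad) // 8, "big")


def decode_until_end_marker(packed):
    """Read 9-bit fields until the END marker appears; no length field needed."""
    acc = int.from_bytes(packed, "big")
    nbits = 8 * len(packed)
    values = []
    while nbits >= 9:
        nbits -= 9
        field = (acc >> nbits) & 0x1FF
        v = field - 512 if field & 0x100 else field  # sign-extend
        if v == END:
            return values
        values.append(v)
    raise ValueError("end marker not found")


data = [234, 56, -4, 24, 56, 78, 23, 89, 234, 68, -12, -253, 45, 128]
packed = encode_with_end_marker(data)
assert decode_until_end_marker(packed) == data
print(len(packed))     # 15 fields * 9 bits = 135 bits, which fits in 17 bytes
print(math.log2(511))  # bits of information in one uniformly distributed symbol
```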

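Notice: The technical posts on this site follow the CC BY-SA 4.0 license. If you repost, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.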

 