
Why can a textual representation of pi be compressed?

A random string should be incompressible.

require "xz"  # ruby-xz gem

pi = "31415..."  # decimal digits of pi, elided here
pi.size  # => 10000
XZ.compress(pi).size  # => 4540

A random hex string also gets significantly compressed. A random byte string, however, does not get compressed.

The string of pi only contains the bytes 48 through 57 (the ASCII codes for the digits 0 to 9). With a prefix code on the integers, this string can be heavily compressed. Essentially, I'm wasting space by representing my 10 different characters in full bytes (or 16 different characters, in the case of the hex string). Is this what's going on?

Can someone explain to me what the underlying method is, or point me to some sources?

It's a matter of information density. Compression is about removing redundant information.

In the string "314159", each character occupies 8 bits, and can therefore have any of 2^8 or 256 distinct values, but only 10 of those values are actually used. Even a painfully naive compression scheme could represent the same information using 4 bits per digit; this is known as Binary Coded Decimal. More sophisticated compression schemes can do better than that (a decimal digit carries log2(10), or about 3.32, bits of information), but at the expense of storing some extra information that allows for decompression.
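The 4-bits-per-digit idea can be sketched in a few lines of Ruby. This is only an illustration, not what XZ does internally: `pack_bcd` and `unpack_bcd` are hypothetical helper names, and using the nibble `f` as padding for odd-length input is an assumption made here so the round trip works.

```ruby
# Pack a string of decimal digits two per byte (Binary Coded Decimal).
# pack_bcd / unpack_bcd are illustrative names, not library functions.
def pack_bcd(digits)
  raise ArgumentError, "digits only" unless digits.match?(/\A\d*\z/)
  # Assumption: pad odd-length input with the nibble 0xF, which is never a digit.
  padded = digits.length.odd? ? digits + "f" : digits
  [padded].pack("H*")  # each pair of hex characters becomes one byte
end

def unpack_bcd(bytes)
  bytes.unpack1("H*").delete_suffix("f")  # drop the padding nibble, if present
end

digits = "314159"
packed = pack_bcd(digits)
puts packed.bytesize               # 3 bytes instead of 6
puts unpack_bcd(packed) == digits  # true

# Information content of one uniformly random decimal digit, in bits.
puts Math.log2(10).round(2)        # 3.32
```

BCD spends 4 bits where about 3.32 would suffice, which is why a general-purpose compressor can end up slightly below 50% on a long digit string.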

In a random hexadecimal string, each 8-bit character has 4 meaningful bits, so compression by nearly 50% should be possible. The longer the string, the closer you can get to 50%. If you know in advance that the string contains only hexadecimal digits, you can compress it by exactly 50%, but of course that loses the ability to compress anything else.
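As a rough sketch of the hexadecimal case, the snippet below packs a random hex string into exactly half its size and, for comparison, runs a general-purpose compressor over the same text. It uses Ruby's bundled `zlib` (DEFLATE) rather than XZ purely because it ships with Ruby, so the exact sizes will differ from what `XZ.compress` reports.

```ruby
require "securerandom"
require "zlib"

hex = SecureRandom.hex(5000)   # 10,000 random hexadecimal characters

# Knowing the alphabet in advance: two hex characters fit in one byte.
packed = [hex].pack("H*")
puts packed.bytesize             # 5000, exactly half
puts packed.unpack1("H*") == hex # true, the packing is lossless

# Not knowing the alphabet: the compressor has to discover it and
# store its own code tables, so it lands somewhat above 5000 bytes.
deflated = Zlib::Deflate.deflate(hex, Zlib::BEST_COMPRESSION)
puts deflated.bytesize
```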

In a random byte string, there is no opportunity for compression; you need the entire 8 bits per character to represent each value. If it's truly random, attempting to compress it will probably expand it slightly, since some additional information is needed to indicate that the output is compressed data.
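The slight expansion is easy to observe. The sketch below again assumes Ruby's bundled `zlib` in place of XZ; the amount of overhead depends on the container format, but the effect is the same in kind.

```ruby
require "securerandom"
require "zlib"

random_bytes = SecureRandom.random_bytes(10_000)
compressed = Zlib::Deflate.deflate(random_bytes, Zlib::BEST_COMPRESSION)

puts random_bytes.bytesize  # 10000
puts compressed.bytesize    # typically a few bytes more than 10000

# The data still round-trips; it just did not get any smaller.
puts Zlib::Inflate.inflate(compressed) == random_bytes  # true
```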

Explaining the details of how compression works is beyond both the scope of this answer and my expertise.

In addition to Keith Thompson's excellent answer, there's another point that's relevant to LZMA (which is the compression algorithm that the XZ format uses). The number pi does not consist of a single repeating string of digits, but neither is it completely random. It does contain substrings of digits which are repeated within the larger sequence. LZMA can detect these and store only a single copy of the repeated substring, reducing the size of the compressed data.
