Measuring efficiency of Huffman coding with Python bitstring

I have the following string that I would like to Huffman-encode and store efficiently into a bit array:

>>> print sequence
GTCAGGACAAGAAAGACAANTCCAATTNACATTATG|

The frequencies of the symbols in sequence are:

>>> print freqTuples
[(0.40540540540540543, 'A'), (0.1891891891891892, 'T'), (0.16216216216216217, 'C'), (0.16216216216216217, 'G'), (0.05405405405405406, 'N'), (0.02702702702702703, '|')]

I translate this into a Huffman code dictionary:

>>> print codeDict
{'A': '1', 'C': '010', 'G': '001', 'N': '0110', 'T': '000', '|': '0111'}
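For reference, a dictionary like this can be produced with a standard heap-based Huffman construction. The sketch below is not my exact code, just an illustration that consumes the (frequency, symbol) tuples in freqTuples; ties may be broken differently, so the exact codewords can differ even though the code lengths come out the same:

import heapq

def build_huffman_code(freq_tuples):
    # freq_tuples is a list of (frequency, symbol) pairs, as in freqTuples above.
    heap = [(weight, i, symbol) for i, (weight, symbol) in enumerate(freq_tuples)]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        # Repeatedly merge the two lowest-weight subtrees.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, count, (left, right)))
        count += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: recurse into both branches
            walk(node[0], prefix + '0')
            walk(node[1], prefix + '1')
        else:                            # leaf: node is a symbol
            codes[node] = prefix or '0'
    walk(heap[0][2], '')
    return codes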

I then use the Python bitstring package to translate the string, character by character, into an instance of the BitArray class, which I call bitArray; it contains the bits for each character encoded with its respective Huffman code:

>>> print bitArray.bin
0b001000010100100110101100111100110101101100000100101100000001101010100000010000010111
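The character-by-character translation itself is essentially the following (a simplified sketch of what I do, using bitstring's bin initializer):

from bitstring import BitArray

bitArray = BitArray()
for ch in sequence:
    # Append this character's Huffman codeword as raw bits.
    bitArray += BitArray(bin=codeDict[ch])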

Here is the bit array in bytes:

>>> print bitArray.tobytes()
!I\xac\xf3[\x04\xb0\x1a\xa0Ap

I must use tobytes() instead of bytes, as the bit array I generate does not divide evenly into 8-bit segments.

When I calculate the storage efficiency of the BitArray representation (the ratio of the sizes of the bit array and the input string), I get worse performance than if I had left the input string unencoded:

>>> sys.getsizeof(bitArray.tobytes()) / float(len(sequence))
1.2972972973

Am I measuring storage efficiency correctly? (If I encode longer input strings, this ratio improves, but it seems to approach an asymptotic limit of around 0.28. I'd like to confirm if this is the right way to measure things.)

Edit

The following two approaches yield different answers:

>>> print len(bitArray.tobytes()) / float(len(mergedSequence))
0.297297297297

>>> print bitArray.len / (8.*len(mergedSequence))
0.283783783784
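(For what it's worth, both figures come from the same 84-bit array: tobytes() pads it out to 11 whole bytes, i.e. 88 bits, so the first ratio is 88/296 ≈ 0.297 while the second is the unpadded 84/296 ≈ 0.284.)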

I'm not sure which to believe. But in the process of writing data to storage, I think I would need the byte representation, which makes me inclined towards choosing the first result.

I'm not really sure about the bitarray stuff, but shouldn't you just be able to do:

>>> len(bitArray.tobytes()) / float(len(sequence))

I'm not saying that will solve your problem, but it could be that the "getsizeof" thing (again, something I'm not really all that familiar with) is throwing you off.

From what you've written up there, it kind of looks like you're comparing apples to oranges a bit.

>>> sys.getsizeof(bitArray.tobytes()) / float(len(sequence))
1.2972972973

This implies that the encoded version is 30% longer than the original sequence.

I don't think you want to use getsizeof here -- if you want to minimize the size of the Python object, you should be using getsizeof(sequence) as well, rather than len.

If instead, you want to do what Huffman coding is meant to do, and minimize the binary representation, then you want to use len on both (assuming the sequence is represented as one-byte-per-character).

So, your real ratio is 11 / 37.

I assume you're using Huffman coding as an exercise, as this doesn't seem like a logical way to efficiently store what is just a four-bit code with a termination character. At the least, arithmetic coding would do better, since it effectively lets you use a base-5 encoding rather than base-2, which is optimal for 5 possible characters.

Really, I would assume that in a sequence long enough to be worth compressing, the G:A:C:T ratio is known, and/or a fixed-length 2-bit encoding will be just as efficient (since the ratios approach 1:1:1:1), given that you don't really need to encode the termination character.
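For illustration, a fixed-length 2-bit scheme for a pure G/A/C/T sequence would be something like this (hypothetical, and it deliberately ignores N and the terminator):

# Hypothetical fixed-length 2-bit code: 2 bits per base, a flat ratio of 0.25
# against 8-bit characters, with no code table needed per symbol.
TWO_BIT = {'G': '00', 'A': '01', 'C': '10', 'T': '11'}

def encode_2bit(seq):
    return ''.join(TWO_BIT[base] for base in seq)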

You know that the answer is wrong, because the Huffman dictionary uses fewer than 4 bits per character on average, so the real answer must be less than 0.5. If the dictionary and character frequencies don't change for longer strings, then the compression ratio shouldn't decrease toward an asymptotic limit as the string gets longer.
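To make that concrete, here is the expected number of bits per character computed straight from your code table and frequencies (a rough check, using the exact counts 15/7/6/6/2/1 out of 37 implied by your freqTuples):

# Weighted average codeword length for the frequencies and codes in the question.
freqs = {'A': 15/37.0, 'T': 7/37.0, 'C': 6/37.0, 'G': 6/37.0, 'N': 2/37.0, '|': 1/37.0}
lengths = {'A': 1, 'T': 3, 'C': 3, 'G': 3, 'N': 4, '|': 4}
avg_bits = sum(freqs[s] * lengths[s] for s in freqs)   # ~2.27 bits per character
print avg_bits / 8.0                                   # ~0.284, well under 0.5

That 0.284 is exactly the second figure in your edit, and it is the ratio the encoding should stay at no matter how long the string gets, as long as the frequencies stay the same.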

From the documentation of sys:

"getsizeof() calls the object’s __sizeof__ method and adds
 an additional garbage collector overhead if the object is
 managed by the garbage collector."

You need a function that will return the length of the bitstring itself, not the bitstring + overhead. The BitString documentation says that the len or length property returns the length in bits. So try doing:

bitArray.len / (8. * len(sequence))
