
Python: binary string to binary data

I have a problem with my Huffman coding project.

I have a string holding the binary representation of a file, but when I save it as a text file it is even larger than the original file. What I want is a way to save it as a binary file.

Example: after Huffman coding, let a, b, c and d be represented by the following "binary codes":

a="0010" b="010" c="110" d="101"

So a file with the text abcd is represented by the binary string "0010010110101".

If I save the concatenated binary string as a normal text file, it is larger than the original abcd.

But I need to save the concatenated binary string as a real binary file whose size is reduced. For the example, abcd = 8 bits * 4 = 32 bits originally, but afterwards I need it to be 13 bits.

I am doing this in Python.

import struct
with open("foo.bin", 'wb') as f:
    f.write(struct.pack('h', 0b0010010110101))

This will take 2 bytes (16 bits), because the number is packed as a short integer (h). You can define your own format string using the struct module, but files are made of whole bytes, so you cannot go below a granularity of one byte.
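A minimal sketch of the size difference, assuming Python 3 (where int.to_bytes is available). The helper name bits_to_bytes is made up for illustration. The 13-bit string still needs 2 bytes on disk, but that is far less than the 13 bytes it takes when stored as text:

```python
def bits_to_bytes(bits):
    """Pack a string of '0'/'1' characters into bytes (hypothetical helper).

    The value is left-padded with zero bits up to a whole number of bytes.
    """
    n_bytes = (len(bits) + 7) // 8          # round up to whole bytes
    return int(bits, 2).to_bytes(n_bytes, "big")

bits = "0010010110101"                      # "abcd" after Huffman coding
as_text = bits.encode("ascii")              # one byte per '0'/'1' character
as_binary = bits_to_bytes(bits)

print(len(as_text))    # 13
print(len(as_binary))  # 2
```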

EDIT

As per your comment, here's a bit of context:

When writing something to a file, it is always converted to binary. Characters are encoded using a rule called an encoding (such as ASCII), where each character is mapped to a number, itself represented in binary. This way, the number 00100100 (36) and the character '$' are the same thing. '$' is represented by 36 in the file, and the software layers between you and the file (such as an editor) will render every '00100100' they encounter as the character '$'.

Now when you write the string '00100100' into a file, it will write the characters '0', '1', etc. Each of them takes a full byte ('0' is 00110000, '1' is 00110001), so the string '00100100' is represented by the binary number 0011000000110000001100010011000000110000001100010011000000110000. This is necessary because, the input being a string, you need an unambiguous way of representing all possible 8-character strings, not only the ones made of 0s and 1s.
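To see this concretely, the following sketch (assuming Python 3) prints the bits that actually end up on disk when the 8-character string is written as ASCII text:

```python
text = "00100100"

# Each character becomes one ASCII byte: '0' is 0x30, '1' is 0x31.
on_disk = text.encode("ascii")
bit_view = " ".join(format(byte, "08b") for byte in on_disk)

print(len(on_disk))  # 8
print(bit_view)      # eight groups of 8 bits, one group per character
```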

The Python 2 API for writing files always writes strings, i.e. it performs this conversion from string to binary number automatically, and I don't know of any way to override that. (In Python 3, a file opened in binary mode takes a bytes object instead, but the idea is the same.) What you can do, however, is generate the string such that its binary representation is the actual binary string you wanted to write: if you want to write the number 00100100 to a file, you can just write f.write('$'), which is effectively the same thing.
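For comparison, a short sketch assuming Python 3, where a file opened in binary mode takes a bytes object rather than a string. The file name one_byte.bin and the use of a temporary directory are arbitrary choices for the example:

```python
import os
import tempfile

value = 0b00100100                      # 36, the ASCII code of '$'
payload = bytes([value])                # a bytes object holding that single byte

path = os.path.join(tempfile.mkdtemp(), "one_byte.bin")
with open(path, "wb") as f:
    f.write(payload)

print(payload)                 # b'$'
print(os.path.getsize(path))   # 1
```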

That is exactly what the struct module does: it generates a string of bytes, or characters, that exactly matches the number you give it.

In my example, I give it the number 0b0010010110101 and tell it to encode it as a short integer, i.e. on two bytes. If you execute struct.pack('h', 1205) in the Python interpreter, it will print out the two characters (bytes) \xb5\x04, which correspond to this number in 'byte-base', i.e. base 256. The least significant byte comes first, which is the little-endian convention ('h' uses the machine's native byte order, little-endian on x86). Indeed:

>>> 0x04 * 256 + 0xb5
1205
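The byte order of a plain 'h' is whatever the machine uses natively. As a sketch, the standard '<' and '>' prefixes of struct format strings make the order explicit (the b'...' literals shown assume Python 3):

```python
import struct

value = 0b0010010110101                 # 1205

little = struct.pack("<h", value)       # least significant byte first
big = struct.pack(">h", value)          # most significant byte first

print(little)  # b'\xb5\x04'
print(big)     # b'\x04\xb5'

# Unpacking with the matching format recovers the number either way.
print(struct.unpack("<h", little)[0])   # 1205
```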

Just like you can represent any decimal number in base 10 (e.g. 36), base 16 (e.g. 0x24) or base 2 (e.g. 0b100100), you can also represent it in base 256 via the ASCII encoding (e.g. '$'). struct does exactly that, and also provides a convenient 'fmt' string convention for the type of data you are writing. You can also do it directly by converting each of your bytes into the corresponding character:

def encode(binary):
    # Aligning on bytes: left-pad with zeros up to a multiple of 8
    # (adds nothing when the length is already a multiple of 8)
    binary = '0' * (-len(binary) % 8) + binary
    # Generating the corresponding character for each
    # byte encountered
    return ''.join(chr(int(binary[i:i+8], 2))
                   for i in xrange(0, len(binary), 8))

This is a crude and not very efficient way of proceeding, but it does convert every byte into its corresponding character and returns the resulting string, which you can write directly into a file:

>>> encode('001001001010100100100100100111110010101110100')
'\x04\x95$\x93\xe5t'

And indeed, writing this to a file produces 6 bytes, corresponding to the 6 characters:

with open("foo.bin", 'wb') as f:
    f.write('\x04\x95$\x93\xe5t')

>>> import os
>>> os.path.getsize("foo.bin")
6L
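For readers on Python 3, where chr produces text characters rather than bytes, a rough equivalent might look like the following. The name encode_py3 is hypothetical; it pads on the left like encode above and adds no padding when the input is already aligned on a byte boundary:

```python
def encode_py3(binary):
    """Convert a string of '0'/'1' characters to a bytes object."""
    # Left-pad with zeros so the length is a multiple of 8
    # (no padding is added when it already is).
    binary = "0" * (-len(binary) % 8) + binary
    # Turn each group of 8 bits into one byte value (0-255).
    return bytes(int(binary[i:i + 8], 2) for i in range(0, len(binary), 8))

print(encode_py3("001001001010100100100100100111110010101110100"))
# b'\x04\x95$\x93\xe5t'
```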

The struct module does exactly the same thing, except with a fixed format and in a more efficient fashion. Instead of getting the chr corresponding to each integer yourself, you can split the number into base-256 digits and let struct.pack do the conversion:

def encode2(binary):
    rawbytes = []
    while binary > 0:
        binary, byte = divmod(binary, 256)
        rawbytes.append(byte)
    fmt_string = '%sB' % len(rawbytes)
    print "Encoding %s into %s bytes (%s)" % (rawbytes, len(rawbytes), fmt_string)
    return struct.pack(fmt_string, *rawbytes)

>>> encode2(0b001001001010100100100100100111110010101110100)
Encoding [116L, 229L, 147L, 36L, 149L, 4L] into 6 bytes (6B)
't\xe5\x93$\x95\x04'

(Notice that these are the same characters as the ones output by encode. The only difference is the order, which depends on the endianness of the conversion.)

You can then decode these characters using struct as well, with the same format string:

>>> bytes = struct.unpack('6B', 't\xe5\x93$\x95\x04')
>>> bytes
(116, 229, 147, 36, 149, 4)
>>> bin(sum(x * 256 ** i for i, x in enumerate(bytes)))
'0b1001001010100100100100100111110010101110100'

This is our original number (bin simply drops the two leading zeros).
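One caveat for Huffman coding: leading zero bits are part of the code, and bin does not print them. A possible way around this, sketched here for Python 3 with the hypothetical helper decode_bits, is to remember the original bit count and pad back to it:

```python
def decode_bits(raw, n_bits):
    """Rebuild the bit string from little-endian bytes (hypothetical helper).

    n_bits is the length of the original bit string; it has to be stored
    or known separately, since the bytes alone do not record it.
    """
    number = int.from_bytes(raw, "little")
    return format(number, "0{}b".format(n_bits))

raw = b"t\xe5\x93$\x95\x04"             # the six bytes produced by encode2 above
print(decode_bits(raw, 45))
# 001001001010100100100100100111110010101110100
```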

The bottom line is: the Python file API can only process characters, which are effectively bytes. There might be some magic way of writing individual bits to a file, but I wouldn't count on it, as this introduces its own world of problems, and bytes are more than sufficient in 99% of cases. To write binary data, represent it in base 256 and convert each of its base-256 digits to the corresponding character. The binary representation of the resulting string is, by definition, your original number.
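Putting this together for the Huffman case, here is one possible round trip, sketched for Python 3. The file layout (one header byte holding the number of padding bits, followed by the packed bits) and the names pack and unpack are assumptions made for this example, not something prescribed above; the code table is the one from the question.

```python
CODES = {"a": "0010", "b": "010", "c": "110", "d": "101"}  # table from the question

def pack(text, codes=CODES):
    """Encode text and pack it as: 1 byte of padding count + the packed bits."""
    bits = "".join(codes[ch] for ch in text)
    pad = -len(bits) % 8                     # zero bits appended to fill the last byte
    bits += "0" * pad
    body = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return bytes([pad]) + body

def unpack(blob, codes=CODES):
    """Reverse pack(): strip the padding, then match prefix-free codes greedily."""
    pad, body = blob[0], blob[1:]
    bits = "".join(format(byte, "08b") for byte in body)
    if pad:
        bits = bits[:-pad]
    lookup = {code: ch for ch, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:                # a complete code word has been read
            out.append(lookup[current])
            current = ""
    return "".join(out)

blob = pack("abcd")
print(len(blob))      # 3: one header byte plus two data bytes
print(unpack(blob))   # abcd
```

The resulting bytes can be written with open(..., 'wb') as in the examples above. Storing the padding count is what lets the decoder tell real code bits from filler in the last byte.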

The binascii module can also be used:

import binascii

a = "1010"
b = "10"
c = "00"

data = a + b + c
hex_string = hex(int(data, 2))[2:]  #remove '0x'

with open('foo', 'wb') as f:
    f.write(binascii.unhexlify(hex_string))

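The hex_string must contain an even number of hex digits for unhexlify to work. For "0010010110101" the hex string is 4b5, which has three digits, so it needs to be padded first, for example with a leading zero (04b5).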
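A sketch of that padding step, assuming Python 3; the helper name bits_to_bytes_hex is made up for the example:

```python
import binascii

def bits_to_bytes_hex(bits):
    """Convert a '0'/'1' string to bytes via a hex string (hypothetical helper)."""
    hex_string = format(int(bits, 2), "x")
    if len(hex_string) % 2:                 # unhexlify needs an even number of digits
        hex_string = "0" + hex_string
    return binascii.unhexlify(hex_string)

print(bits_to_bytes_hex("0010010110101"))   # b'\x04\xb5'
print(bits_to_bytes_hex("10101000"))        # b'\xa8'
```

Note that leading zero bits that would fill a whole byte are lost in this conversion, so the original bit count still has to be tracked separately if it matters.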
