How to encode English plain-text (consisting only of letters a-z and whitespace) using a 5-bit character encoding in Python?

In Python, is there any way to encode English plain text (consisting only of the lowercase letters a-z and whitespace, i.e. a total of 27 characters) using a 5-bit character encoding? If so, please tell me how.

To be more specific, say I have a string: s="hello world". After encoding it with a 5-bit character encoding in Python, I want to save the string to an external file such that each character in that file takes only 5 bits of storage space.

Probably the best recognised 5-bit encoding is Baudot (and its derivatives ITA2 and USTTY). Properly speaking this is a shift-based encoding with separate letter and figure shifts, but you can confine your output to the letter shift.

Here's a quick example of encoding (encoding table taken from http://code.google.com/p/tweletype/source/browse/baudot.py ):

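# The index of a character in this table is its 5-bit code (letter shift only).
letters = "\x00E\x0AA SIU\x0DDRJNFCKTZLWHYPQOBG\x0EMXV\x0F"
s = "Hello World"
for c in s.upper():  # the table only contains upper-case letters
    print(letters.find(c))  # find() returns -1 for characters not in the table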
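
To get the text back, the same table can be indexed in the other direction. The following is only a minimal sketch: the table is copied from the snippet above, while the function names baudot_encode and baudot_decode are made up for illustration. Note that the round trip loses case, because the letter shift only has upper-case letters.

```python
# Letter-shift table copied from the snippet above: index == 5-bit code.
LETTERS = "\x00E\x0AA SIU\x0DDRJNFCKTZLWHYPQOBG\x0EMXV\x0F"


def baudot_encode(text):
    """Return a list of 5-bit codes (0..31) for text, letter shift only."""
    codes = []
    for ch in text.upper():
        code = LETTERS.find(ch)
        if code < 0:
            raise ValueError("no letter-shift code for %r" % ch)
        codes.append(code)
    return codes


def baudot_decode(codes):
    """Map 5-bit codes back to characters (always upper case)."""
    return "".join(LETTERS[code] for code in codes)


codes = baudot_encode("hello world")
print(codes)
print(baudot_decode(codes))  # HELLO WORLD
```

The codes are still stored one per list element here; they only take 5 bits each once they are packed into bytes.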

How about fewer than five bits? Testing with the first paragraph of the translated Lorem ipsum:

import gzip
text = 'But I must explain to you how all this mistaken idea of denouncing of a pleasure and praising pain was born and I will give you a complete account of the system, and expound the actual teachings of the great explorer of the truth, the master-builder of human happiness. No one rejects, dislikes, or avoids pleasure itself, because it is pleasure, but because those who do not know how to pursue pleasure rationally encounter consequences that are extremely painful. Nor again is there anyone who loves or pursues or desires to obtain pain of itself, because it is pain, but occasionally circumstances occur in which toil and pain can procure him some great pleasure. To take a trivial example, which of us ever undertakes laborious physical exercise, except to obtain some advantage from it? But who has any right to find fault with a man who chooses to enjoy a pleasure that has no annoying consequences, or one who avoids a pain that produces no resultant pleasure?'
text = ''.join(c for c in text.lower() if c.islower() or c == ' ')
encoded = gzip.compress(text.encode())
decoded = gzip.decompress(encoded).decode()
print('%.3f' % (len(encoded) / len(text) * 8), 'bits per char')
print('Roundtrip ok?', decoded == text)
print(len(set(text)), 'different chars in text')

The result:

4.135 bits per char
Roundtrip ok? True
26 different chars in text

Compression like this exploits not only the fact that there are only 27 possible characters, but also their differing frequencies and the repeated patterns in the text.
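
To put a number on the "differing frequencies" part, one can compare the cost of a fixed-width code for 27 equally likely symbols, log2(27) or about 4.755 bits, with the order-0 Shannon entropy of a text. This is a rough sketch only: the function names and the sample string are made up for illustration, and order-0 entropy looks at single-character frequencies, not at the repeated patterns that gzip also exploits.

```python
import math
from collections import Counter


def fixed_width_bits(alphabet_size):
    """Bits per character if every symbol is equally likely."""
    return math.log2(alphabet_size)


def entropy_bits(text):
    """Order-0 Shannon entropy of text, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum(n / total * math.log2(n / total) for n in counts.values())


sample = "the quick brown fox jumps over the lazy dog"  # hypothetical sample
print("%.3f bits/char for 27 equally likely symbols" % fixed_width_bits(27))
print("%.3f bits/char order-0 entropy of the sample" % entropy_bits(sample))
```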

I also tried lzma and bz2 instead of gzip, but for this particular example, gzip compressed best.
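
A comparison along those lines could look like the sketch below. The helper name bits_per_char and the short sample sentence are assumptions for illustration; with an input this short, each format's header overhead dominates, so the numbers will not resemble the result above, and which compressor wins depends on the input.

```python
import bz2
import gzip
import lzma


def bits_per_char(module, text):
    """Compress text with module.compress and return compressed bits per character."""
    raw = text.encode("ascii")
    packed = module.compress(raw)
    if module.decompress(packed) != raw:
        raise ValueError("round trip failed")
    return len(packed) * 8 / len(raw)


# Hypothetical sample; substitute the cleaned-up text from the snippet above.
sample = "nor again is there anyone who loves or pursues or desires to obtain pain of itself"

for name, module in (("gzip", gzip), ("bz2", bz2), ("lzma", lzma)):
    print("%-4s %.3f bits per char" % (name, bits_per_char(module, sample)))
```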

First, you'll need to convert the characters from ASCII to a 5-bit encoding. How you do that is up to you. One straightforward way:

class TooMuchBits(Exception):
    pass

def encode_str(data):
    buf = bytearray()
    for char in data:
        num = ord(char)

        # Lower case latin letters
        if num >= 97 and num <= 122:
            buf.append(num - 96)

        # Space
        elif num == 32:
            buf.append(27)

        else:
            raise TooMuchBits(char)

    return buf

def decode_str(data):
    buf = bytearray()
    for num in data:
        # bytearray.append() requires an int on Python 3, so append code points
        if num == 27:
            buf.append(32)        # space
        else:
            buf.append(num + 96)  # 1..26 -> 'a'..'z'

    return bytes(buf)

After that, you have 5-bit numbers which can be packed into 8-bit bytes. Something like this:

# This should not be more than 8
BITS = 5

def get_last_bits(value, count):
    return value & ((1<<count) - 1)

def pack(data):
    buf = bytearray(1)
    used_bits = 0

    for num in data:
        # All zeroes is a special value marking unused bits
        if not isinstance(num, int) or num <= 0 or num.bit_length() > BITS:
            raise TooMuchBits(num)

        # Character fully fits into available bits in current byte
        if used_bits <= 8 - BITS:
            buf[-1] |= num << used_bits
            used_bits += BITS

        # Character should be split into two different bytes
        else:
            # Put lowest bits into available space
            buf[-1] |= get_last_bits(num, 8 - used_bits) << used_bits
            # Put highest bits into next byte
            buf.append(num >> (8 - used_bits))
            used_bits += BITS - 8

    return bytes(buf)

def unpack(data):
    buf = bytearray()
    data = bytearray(data)

    # Characters are filled with logical OR and therefore initialized with zero
    char_value = 0
    char_bits_left = BITS

    for byte in data:
        data_bits_left = 8

        while data_bits_left >= char_bits_left:
            # Current character ends in current byte
            # Take bits from current data bytes and shift them to appropriate position
            char_value |= get_last_bits(byte, char_bits_left) << (BITS - char_bits_left)

            # Discard processed bits
            byte = byte >> char_bits_left
            data_bits_left -= char_bits_left

            # Zero means the end of the string. This is needed to detect unused space at the end of the data,
            # which would otherwise be decoded as a 0x0 character
            if char_value == 0:
                break

            # Store and initialize character 
            buf.append(char_value)
            char_value = 0
            char_bits_left = BITS

        # Collect bits left in current byte
        if data_bits_left:
            char_value |= byte
            char_bits_left -= data_bits_left

    return buf

This seems to work as expected:

test_string = "the quick brown fox jumps over the lazy dog"

encoded = encode_str(test_string)
packed = pack(encoded)
unpacked = unpack(packed)
decoded = decode_str(unpacked)

print("Test str (len: %d): %r" % (len(test_string), test_string))
print("Encoded (len: %d):  %r" % (len(encoded), encoded))
print("Packed (len: %d):   %r" % (len(packed), packed))
print("Unpacked (len: %d): %r" % (len(unpacked), unpacked))
print("Decoded (len: %d):  %r" % (len(decoded), decoded))

Output (from Python 2, which prints byte strings without a b prefix):

Test str (len: 43): 'the quick brown fox jumps over the lazy dog'
Encoded (len: 43):  bytearray(b'\x14\x08\x05\x1b\x11\x15\t\x03\x0b\x1b\x02\x12\x0f\x17\x0e\x1b\x06\x0f\x18\x1b\n\x15\r\x10\x13\x1b\x0f\x16\x05\x12\x1b\x14\x08\x05\x1b\x0c\x01\x1a\x19\x1b\x04\x0f\x07')
Packed (len: 27):   '\x14\x95\x1dk\x1ak\x0b\xf9\xae\xdb\xe6\xe1\xadj\x83s?[\xe4\xa6\xa8l\x16t\xde\xe4\x1d'
Unpacked (len: 43): bytearray(b'\x14\x08\x05\x1b\x11\x15\t\x03\x0b\x1b\x02\x12\x0f\x17\x0e\x1b\x06\x0f\x18\x1b\n\x15\r\x10\x13\x1b\x0f\x16\x05\x12\x1b\x14\x08\x05\x1b\x0c\x01\x1a\x19\x1b\x04\x0f\x07')
Decoded (len: 43):  'the quick brown fox jumps over the lazy dog'
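
The question also asks about saving the result to a file. A minimal, self-contained sketch of that step is below. It assumes the same mapping as above (a-z to 1..26, space to 27, 0 as padding) and is intended to produce the same low-bits-first byte layout as pack; the names char_to_code, pack_text, unpack_text, save_packed and load_packed, and the file name hello.bin, are made up for illustration. A file cannot hold a fraction of a byte, so n characters take ceil(5n/8) bytes on disk, which averages 5 bits per character.

```python
import os
import tempfile

BITS = 5  # same width as in the answer above


def char_to_code(ch):
    """a..z -> 1..26, space -> 27 (0 is reserved as padding)."""
    if ch == " ":
        return 27
    if "a" <= ch <= "z":
        return ord(ch) - 96
    raise ValueError("cannot encode %r in %d bits" % (ch, BITS))


def pack_text(text):
    """Pack text into bytes, BITS bits per character, low bits first."""
    acc, nbits, out = 0, 0, bytearray()
    for ch in text:
        acc |= char_to_code(ch) << nbits
        nbits += BITS
        while nbits >= 8:          # flush every complete byte
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:                      # final partial byte, zero-padded
        out.append(acc)
    return bytes(out)


def unpack_text(data):
    """Inverse of pack_text; stops at the first all-zero (padding) code."""
    acc, nbits, chars = 0, 0, []
    for byte in bytearray(data):
        acc |= byte << nbits
        nbits += 8
        while nbits >= BITS:
            code = acc & ((1 << BITS) - 1)
            acc >>= BITS
            nbits -= BITS
            if code == 0:
                return "".join(chars)
            if code > 27:
                raise ValueError("invalid code %d" % code)
            chars.append(" " if code == 27 else chr(code + 96))
    return "".join(chars)


def save_packed(path, text):
    with open(path, "wb") as f:    # binary mode: the payload is raw bytes
        f.write(pack_text(text))


def load_packed(path):
    with open(path, "rb") as f:
        return unpack_text(f.read())


path = os.path.join(tempfile.mkdtemp(), "hello.bin")  # hypothetical file name
save_packed(path, "hello world")
print(os.path.getsize(path), "bytes on disk")  # 7 bytes for 11 characters
print(load_packed(path))                       # hello world
```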
