Huffman encoding: how to write binary data in Python

I have tried methods using the struct module, as shown by the commented-out lines in my code, but they didn't work. Basically I have two options: I can either write the binary data code by code (my codes are bit sequences varying in length from 3 to 13 bits), or convert the whole string of n characters (n = 25000+ in this case) to binary data. But I don't know how to implement either method. Code:

import heapq
import binascii
import struct

def createFrequencyTupleList(inputFile):
    frequencyDic = {}

    # Count how often each character occurs in the input file.
    with open(inputFile, 'r') as sourceFile:
        for line in sourceFile:
            for char in line:
                if char in frequencyDic:
                    frequencyDic[char] += 1
                else:
                    frequencyDic[char] = 1

    tupleList = []
    for myKey in frequencyDic:
        tupleList.append((frequencyDic[myKey],myKey))
    return tupleList

def createHuffmanTree(frequencyList):
    heapq.heapify(frequencyList)
    n = len(frequencyList)
    for i in range(1,n):
        left = heapq.heappop(frequencyList)
        right = heapq.heappop(frequencyList)
        newNode = (left[0] + right[0], left, right)
        heapq.heappush(frequencyList, newNode)
    return frequencyList[0]

def printHuffmanTree(myTree, someCode,prefix=''):
    if len(myTree) == 2:
        someCode.append((myTree[1] + "@" + prefix))
    else:
        printHuffmanTree(myTree[1], someCode,prefix + '0')
        printHuffmanTree(myTree[2], someCode,prefix + '1')

def parseCode(char, myCode):
    for k in myCode:
        if char == k[0]:
            return k[2:]


if __name__ == '__main__':
    myList = createFrequencyTupleList('input')
    myHTree = createHuffmanTree(myList)
    myCode = []
    printHuffmanTree(myHTree, myCode)
    inputFile = open('input', 'r')
    outputFile = open('encoded_file2', "w+b")
    asciiString = ''
    n=0
    for line in inputFile:
        for char in line:
            #outputFile.write(parseCode(char, myCode))
            asciiString += parseCode(char, myCode)
            n += len(parseCode(char, myCode))
    #values = asciiString
    #print n
    #s = struct.Struct('25216s')
    #packed_data = s.pack(values)
    #print packed_data
    inputFile.close()
    #outputFile.write(packed_data)
    outputFile.close()

You're looking for this:

packed_data = ''.join(chr(int(asciiString[i:i+8], 2)) 
                         for i in range(0, len(asciiString), 8))

It takes 8 bits at a time from asciiString, interprets them as an integer, and outputs the corresponding byte.
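The one-liner above relies on Python 2, where chr() returns a single byte. As a rough sketch for Python 3 (pack_bits is a hypothetical helper name, not part of the original answer), the same packing can be done by building a bytes object directly:

```python
def pack_bits(bit_string):
    """Pack a string of '0'/'1' characters into bytes, 8 bits per byte."""
    # Each 8-character slice is parsed as a base-2 integer (0..255),
    # and bytes() turns the resulting integers into raw bytes.
    return bytes(int(bit_string[i:i + 8], 2)
                 for i in range(0, len(bit_string), 8))


packed = pack_bits('0100000101000010')  # two full bytes: 65 and 66
print(packed)  # b'AB'
```

The result can be passed to write() on a file opened in binary mode, such as outputFile in the question.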

The catch is that this approach requires the length of asciiString to be a multiple of 8 bits to work correctly. If it is not, you will effectively insert zero bits before the last few real bits.
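A small illustration of why a short final chunk is ambiguous (the variable names here are only for demonstration):

```python
short_chunk = '101'                  # a final chunk with only 3 real bits
value = int(short_chunk, 2)          # parsed as an integer
restored = format(value, '08b')      # written back out as a full byte
print(value, restored)               # 5 00000101

# '101' and '00000101' produce the same byte, so the reader cannot tell
# how many of the leading zeros were real data without extra information.
same_byte = int('101', 2) == int('00000101', 2)
print(same_byte)                     # True
```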

So you need to store the number of bits in the last byte somewhere, so that when you read the data back you know how many of its bits are real, instead of treating the padding zeros as data. You could try:

packed_data = chr(len(asciiString) % 8) + packed_data
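In Python 3 the header byte would be built with bytes([...]) rather than chr(). A minimal sketch of the same scheme, assuming a hypothetical helper named encode_bits:

```python
def encode_bits(bit_string):
    """Return a header byte (bit count of the last byte) plus the packed bits."""
    # 0 in the header means the last byte is completely full (8 bits).
    header = len(bit_string) % 8
    body = bytes(int(bit_string[i:i + 8], 2)
                 for i in range(0, len(bit_string), 8))
    return bytes([header]) + body


encoded = encode_bits('0100000101')  # one full byte, then 2 leftover bits
print(list(encoded))  # [2, 65, 1]
```

Here the leftover bits '01' are stored as the integer 1, and the header value 2 records that only two bits of that last byte are meaningful.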

Then when you read it back:

from itertools import chain

packed_input = coded_file.read()
# The first byte holds the bit count of the final byte; ord() turns it into an int.
last_byte_length, packed_input, last_byte = (ord(packed_input[0]),
                                             packed_input[1:-1],
                                             packed_input[-1])
if not last_byte_length: last_byte_length = 8
ascii_input = ''.join(chain((bin(ord(byte))[2:].zfill(8) for byte in packed_input),
                      (bin(ord(last_byte))[2:].zfill(last_byte_length),)))
# OR
# ascii_input = ''.join(chain(('{0:0=8b}'.format(ord(byte)) for byte in packed_input),
#                       (('{0:0=' + str(last_byte_length) + 'b}').format(ord(last_byte)),)))
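For Python 3, where indexing or iterating over bytes already yields integers, the ord() calls are unnecessary. A hedged sketch of the reading side, assuming the header-byte layout described above and a hypothetical helper named decode_bits:

```python
def decode_bits(packed):
    """Rebuild the '0'/'1' string from a header byte plus packed bits."""
    if len(packed) < 2:
        return ''  # only a header (or nothing): no bits were stored
    # A header of 0 means the last byte holds a full 8 bits.
    last_byte_length = packed[0] or 8
    body, last_byte = packed[1:-1], packed[-1]
    pieces = [format(byte, '08b') for byte in body]
    # The last byte is zero-padded only to its recorded width.
    pieces.append(format(last_byte, '0{}b'.format(last_byte_length)))
    return ''.join(pieces)


print(decode_bits(bytes([2, 65, 1])))  # 0100000101
```

The list of pieces is joined once at the end, which avoids repeatedly concatenating a long string for inputs of tens of thousands of bits.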

Edit: You either need to strip the '0b' prefix from the strings returned by bin() or, on Python 2.6 or newer, preferably use the alternate version I added, which uses string formatting instead of bin(), slicing, and zfill().

Edit: Thanks eryksun, using chain is a good way to avoid making a copy of the ASCII string. Also, the bin() version needs to call ord(byte).
