
Writing bits as bits to a file

So file systems deal with bytes, but I'm looking to read/write data to a file in bits.

I have a file that is ~850 MB and the goal is to get it under 100 MB. I used delta + Huffman encoding to generate a "code table" of binary. When you add up all the "bits" (aka the total number of 0s and 1s in the file) you get about 781,000,000 "bits", so theoretically I should be able to store these in about 90 MB or so. This is where I'm running into a problem.

Based on other answers I've seen around SO, this is the closest I've gotten:

import struct

with open(r'encoded_file.bin', 'wb') as f:
    for val in filedict:
        int_val = int(val[::-1], base=2)
        # 'i' packs every code into a full C int (typically 4 bytes)
        bin_array = struct.pack('i', int_val)
        f.write(bin_array)

The val passed along on each iteration is the binary string to be written. These codes do not have a fixed length; they range from 10 for the most common to 111011001111001100 for the longest. The average code length is 5 bits. The above code generates a file of about 600 MB, still way off the target.
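A rough check of the figures above suggests where the ~600 MB comes from: struct.pack('i', ...) emits one full C int per code, no matter how short the code is. The sketch below only redoes that arithmetic; the 5-bit average is approximate, and the variable names are illustrative rather than taken from the original code.

```python
import struct

# Figures quoted in the question; the 5-bit average is approximate.
total_bits = 781000000
avg_code_len = 5

# Every code, however short, is packed into one native C int.
int_size = struct.calcsize('i')                 # 4 on most platforms
packed = struct.pack('i', int('10'[::-1], 2))   # the shortest code, "10"

n_codes = total_bits // avg_code_len            # rough number of codes
size_as_ints = n_codes * int_size               # bytes with one int per code
size_as_bits = (total_bits + 7) // 8            # bytes with densely packed bits

print("one int per code: {0} bytes".format(size_as_ints))
print("packed bits:      {0} bytes".format(size_as_bits))
```

With a 4-byte int this gives roughly 625 million bytes for the one-int-per-code layout, against about 97.6 million bytes when the bits are packed back to back.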

Currently I am using Python 2.7; I can move to Python 3.x if I absolutely have to. Is this even possible in Python? Could a language like C or C++ do it more easily?

Note: because the bytes object is just an alias for str in Python 2, I wasn't able to find a (decent) way of writing the following that works for both versions without using if USING_VS_3.

As a minimal interface to go from a string of bits to bytes that can be written to a file, you can use something like this:

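import sys

# Assumed definition: the flag is used below but never defined in the answer.
USING_VS_3 = sys.version_info[0] >= 3

def _gen_parts(bits):
    # Yield one byte value for each full group of 8 bits.
    for start in range(0, len(bits), 8):
        b = int(bits[start:start+8], base=2)
        if USING_VS_3:
            yield b  # bytes() takes an iterable of ints
        else:
            yield chr(b)

def bits_to_bytes(bits):  # -> (bytes, "leftover")
    # Use a positive split index: -(len(bits) % 8) is 0 when the length is a
    # multiple of 8, which would put every bit in the leftover.
    split_i = len(bits) - len(bits) % 8
    byte_gen = _gen_parts(bits[:split_i])
    if USING_VS_3:
        whole = bytes(byte_gen)
    else:
        whole = "".join(byte_gen)
    return whole, bits[split_i:]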

So giving a string of binary data like '111011001111001100' to bits_to_bytes will return a 2-item tuple of (byte data to write to file) and (leftover bits).

Then a simple and un-optimized file interface to handle the partial-byte buffer could look like this:
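If moving to Python 3 is acceptable, the same split can be sketched without the version check by using int.to_bytes. The name bits_to_bytes_py3 is hypothetical and not part of the code above; this is a variant under that assumption, not a replacement for it.

```python
def bits_to_bytes_py3(bits):
    """Split a '0'/'1' string into (whole bytes, leftover bits). Python 3 only."""
    n_whole = len(bits) - len(bits) % 8      # number of bits that fill whole bytes
    if n_whole == 0:
        return b"", bits
    # Fixing the byte count keeps leading zero bytes that int() would drop.
    whole = int(bits[:n_whole], 2).to_bytes(n_whole // 8, "big")
    return whole, bits[n_whole:]

data, leftover = bits_to_bytes_py3('111011001111001100')
print(data, leftover)  # b'\xec\xf3' 00
```

The 18-bit example code fills two whole bytes, and the last two bits are handed back to be combined with whatever is written next.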

class Bit_writer:
    def __init__(self, file):
        self.file = file    # file object opened in binary mode ('wb')
        self.buffer = ""    # bits that do not fill a whole byte yet

    def write(self, bits):
        byte_data, self.buffer = bits_to_bytes(self.buffer + bits)
        self.file.write(byte_data)

    def close(self):
        # you may want to handle the padding differently?
        if self.buffer:
            # pad the final partial byte with zeros on the right
            byte_data, _ = bits_to_bytes("{0.buffer:0<8}".format(self))
            self.file.write(byte_data)
            self.buffer = ""
        self.file.close()

    def __enter__(self):  # This will let you use a 'with' block
        return self

    def __exit__(self, *unused):
        self.close()  # flush the leftover bits, then close the file
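Reading the data back is the mirror image of the writer. The generator below is a minimal sketch of that direction; read_bits and chunk_size are hypothetical names, not part of the code above. It cannot tell zero padding in the final byte from real data, so it assumes the decoder knows the total bit count or stops at an end marker.

```python
import io

def read_bits(file, chunk_size=4096):
    """Yield the contents of a binary file one '0'/'1' character at a time."""
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            break
        for byte in bytearray(chunk):        # bytearray yields ints on Python 2 and 3
            for ch in format(byte, '08b'):   # 8 bits per byte, most significant first
                yield ch

# The 18-bit example code, zero-padded on the right to 3 bytes.
raw = io.BytesIO(b'\xec\xf3\x00')
bits = ''.join(read_bits(raw))
print(bits[:18])  # 111011001111001100
```

In a real decoder the bits would be fed straight into the Huffman tree walk instead of being joined into one string, which would be far too large for a 781,000,000-bit file.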
