
Reading a big file in binary with custom line terminator and writing in smaller chunks in Python

I have a file that uses \x01 as its line terminator. That is, the line terminator is NOT a newline but the byte value 001. Its ASCII representation is ^A.

I want to split the file into pieces of 10 MB each. Here is what I came up with:

size = 10000  # chunk size in bytes (note: 10 MB would be 10 * 1024 * 1024)
i = 0
with open("in-file", "rb") as ifile:
    ofile = open("output0.txt", "wb")
    data = ifile.read(size)
    while data:
        ofile.write(data)
        ofile.close()
        data = ifile.read(size)
        i += 1
        ofile = open("output%d.txt" % i, "wb")
    ofile.close()

However, this results in files that are split at arbitrary byte offsets. I want each output file to end only at a byte value of 001, with the next read resuming from the following byte.

If it's just a one-byte terminator, you can do something like:

from itertools import takewhile

def read_line(f_object, terminal_byte):
    # still essentially one line: read a byte at a time until the terminator
    # (a bytes value like b"\x01"); takewhile(bool, ...) also stops at EOF,
    # where read(1) returns b"", instead of looping forever
    return b"".join(takewhile(bool, iter(lambda: f_object.read(1), terminal_byte)))
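As a quick check, here it is exercised on an in-memory stream (io.BytesIO stands in for a real file; the sample bytes are made up). Note the b"" prefixes throughout, since the file is opened in binary mode:

```python
import io
from itertools import takewhile

def read_line(f_object, terminal_byte):
    # read one byte at a time until the terminator, or EOF (read(1) -> b"")
    return b"".join(takewhile(bool, iter(lambda: f_object.read(1), terminal_byte)))

f = io.BytesIO(b"abc\x01def")
first = read_line(f, b"\x01")   # b"abc" -- terminator consumed, not included
second = read_line(f, b"\x01")  # b"def" -- stream ended with no terminator
```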

Then make a helper function that will read all the lines in a file:

def read_lines(f_object,terminal_byte):
    tmp = read_line(f_object,terminal_byte)
    while tmp:
        yield tmp
        tmp = read_line(f_object,terminal_byte)
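Iterating the generator over a small in-memory stream (io.BytesIO again; the sample data is illustrative) yields one record per terminator:

```python
import io
from itertools import takewhile

def read_line(f_object, terminal_byte):
    # one record, stopping at the terminator or at EOF
    return b"".join(takewhile(bool, iter(lambda: f_object.read(1), terminal_byte)))

def read_lines(f_object, terminal_byte):
    # yield records until read_line comes back empty
    tmp = read_line(f_object, terminal_byte)
    while tmp:
        yield tmp
        tmp = read_line(f_object, terminal_byte)

records = list(read_lines(io.BytesIO(b"one\x01two\x01three\x01"), b"\x01"))
# records == [b"one", b"two", b"three"]
```

One caveat: an empty record (two consecutive terminators) looks like EOF to read_lines and stops the iteration early.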

Then make a function that will chunk it up:

def make_chunks(f_object, terminal_byte, max_size):
    current_chunk = []
    current_chunk_size = 0
    for line in read_lines(f_object, terminal_byte):
        current_chunk.append(line)
        current_chunk_size += len(line)
        if current_chunk_size > max_size:
            yield b"".join(current_chunk)
            current_chunk = []
            current_chunk_size = 0
    if current_chunk:
        yield b"".join(current_chunk)
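A small trace (toy data and a tiny max_size, purely for illustration) shows the behavior: chunks always end on record boundaries, and a chunk can overshoot max_size by up to one record, since the chunk is only closed after the record that pushes it over:

```python
import io
from itertools import takewhile

def read_line(f_object, terminal_byte):
    return b"".join(takewhile(bool, iter(lambda: f_object.read(1), terminal_byte)))

def read_lines(f_object, terminal_byte):
    tmp = read_line(f_object, terminal_byte)
    while tmp:
        yield tmp
        tmp = read_line(f_object, terminal_byte)

def make_chunks(f_object, terminal_byte, max_size):
    current_chunk = []
    current_chunk_size = 0
    for line in read_lines(f_object, terminal_byte):
        current_chunk.append(line)
        current_chunk_size += len(line)
        if current_chunk_size > max_size:
            yield b"".join(current_chunk)
            current_chunk = []
            current_chunk_size = 0
    if current_chunk:
        yield b"".join(current_chunk)

f = io.BytesIO(b"aaaa\x01bbbb\x01cccc\x01dd\x01")
chunks = list(make_chunks(f, b"\x01", 8))
# chunks == [b"aaaabbbbcccc", b"dd"]: the first chunk closes once its
# running size (12) exceeds max_size (8)
```

Also note that read_line consumes each terminator without including it, so the terminators are not written back out; if downstream consumers need them, append terminal_byte to each record before joining.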

Then just do something like:

with open("my_binary.dat", "rb") as f_in:
    for i, chunk in enumerate(make_chunks(f_in, b"\x01", 1024 * 1000 * 10)):
        with open("out%d.dat" % i, "wb") as f_out:
            f_out.write(chunk)

There might be some way to do this with libraries (or even a great built-in way), but I'm not aware of any offhand.
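One follow-up worth noting: reading one byte at a time is slow on files of this size. A common faster pattern (a sketch of my own, not from the answer above or any library) reads large blocks and cuts each chunk after the last terminator found with bytes.rfind, carrying the unterminated tail over to the next block; unlike the code above, it keeps the terminators in the output:

```python
import io

def chunks_by_terminator(f_object, terminal_byte, max_size):
    # read max_size-byte blocks; cut each chunk right after the last
    # terminator seen, carrying the unterminated tail forward
    leftover = b""
    while True:
        block = f_object.read(max_size)
        if not block:
            break
        data = leftover + block
        cut = data.rfind(terminal_byte)
        if cut == -1:
            leftover = data  # no terminator in this block yet, keep accumulating
            continue
        yield data[:cut + 1]       # chunk ends just after a terminator
        leftover = data[cut + 1:]
    if leftover:  # trailing bytes with no final terminator
        yield leftover

chunks = list(chunks_by_terminator(io.BytesIO(b"aa\x01bb\x01cc\x01"), b"\x01", 4))
# chunks == [b"aa\x01", b"bb\x01", b"cc\x01"]
```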
