
Reading a big file in binary with custom line terminator and writing in smaller chunks in Python

I have a file that uses \x01 as its line terminator. That is, the line terminator is NOT a newline but the byte value 001. Its ASCII representation is ^A.

I want to split the file into pieces of 10 MB each. Here is what I came up with:

size = 10000  # chunk size in bytes (note: 10 MB would be 10 * 1024 * 1024)
i = 0
with open("in-file", "rb") as ifile:
    ofile = open("output0.txt", "wb")
    data = ifile.read(size)
    while data:
        ofile.write(data)
        ofile.close()
        data = ifile.read(size)
        i += 1
        ofile = open("output%d.txt" % i, "wb")
    ofile.close()

However, this results in files that are split at arbitrary byte offsets. I want each output file to end only at a byte value of 001, with the next read resuming from the following byte.

If it's just a one-byte terminator, you can do something like:

from itertools import takewhile

def read_line(f_object, terminal_byte):
    # still essentially one line: read a byte at a time until the terminator
    # (a bytes value like b"\x01"); takewhile(bool, ...) also stops at EOF,
    # where read(1) returns b"", instead of looping forever
    return b"".join(takewhile(bool, iter(lambda: f_object.read(1), terminal_byte)))
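As a quick check, here it is exercised on an in-memory stream (io.BytesIO stands in for a real file; the sample bytes are made up). Note the b"" prefixes throughout, since the file is opened in binary mode:

```python
import io
from itertools import takewhile

def read_line(f_object, terminal_byte):
    # read one byte at a time until the terminator, or EOF (read(1) -> b"")
    return b"".join(takewhile(bool, iter(lambda: f_object.read(1), terminal_byte)))

f = io.BytesIO(b"abc\x01def")
first = read_line(f, b"\x01")   # b"abc" -- terminator consumed, not included
second = read_line(f, b"\x01")  # b"def" -- stream ended with no terminator
```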

Then make a helper function that will read all the lines in a file:

def read_lines(f_object,terminal_byte):
    tmp = read_line(f_object,terminal_byte)
    while tmp:
        yield tmp
        tmp = read_line(f_object,terminal_byte)
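Iterating the generator over a small in-memory stream (io.BytesIO again; the sample data is illustrative) yields one record per terminator:

```python
import io
from itertools import takewhile

def read_line(f_object, terminal_byte):
    # one record, stopping at the terminator or at EOF
    return b"".join(takewhile(bool, iter(lambda: f_object.read(1), terminal_byte)))

def read_lines(f_object, terminal_byte):
    # yield records until read_line comes back empty
    tmp = read_line(f_object, terminal_byte)
    while tmp:
        yield tmp
        tmp = read_line(f_object, terminal_byte)

records = list(read_lines(io.BytesIO(b"one\x01two\x01three\x01"), b"\x01"))
# records == [b"one", b"two", b"three"]
```

One caveat: an empty record (two consecutive terminators) looks like EOF to read_lines and stops the iteration early.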

Then make a function that will chunk it up:

def make_chunks(f_object, terminal_byte, max_size):
    current_chunk = []
    current_chunk_size = 0
    for line in read_lines(f_object, terminal_byte):
        current_chunk.append(line)
        current_chunk_size += len(line)
        if current_chunk_size > max_size:
            yield b"".join(current_chunk)
            current_chunk = []
            current_chunk_size = 0
    if current_chunk:
        yield b"".join(current_chunk)
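A small trace (toy data and a tiny max_size, purely for illustration) shows the behavior: chunks always end on record boundaries, and a chunk can overshoot max_size by up to one record, since the chunk is only closed after the record that pushes it over:

```python
import io
from itertools import takewhile

def read_line(f_object, terminal_byte):
    return b"".join(takewhile(bool, iter(lambda: f_object.read(1), terminal_byte)))

def read_lines(f_object, terminal_byte):
    tmp = read_line(f_object, terminal_byte)
    while tmp:
        yield tmp
        tmp = read_line(f_object, terminal_byte)

def make_chunks(f_object, terminal_byte, max_size):
    current_chunk = []
    current_chunk_size = 0
    for line in read_lines(f_object, terminal_byte):
        current_chunk.append(line)
        current_chunk_size += len(line)
        if current_chunk_size > max_size:
            yield b"".join(current_chunk)
            current_chunk = []
            current_chunk_size = 0
    if current_chunk:
        yield b"".join(current_chunk)

f = io.BytesIO(b"aaaa\x01bbbb\x01cccc\x01dd\x01")
chunks = list(make_chunks(f, b"\x01", 8))
# chunks == [b"aaaabbbbcccc", b"dd"]: the first chunk closes once its
# running size (12) exceeds max_size (8)
```

Also note that read_line consumes each terminator without including it, so the terminators are not written back out; if downstream consumers need them, append terminal_byte to each record before joining.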

Then just do something like:

with open("my_binary.dat", "rb") as f_in:
    for i, chunk in enumerate(make_chunks(f_in, b"\x01", 1024 * 1000 * 10)):
        with open("out%d.dat" % i, "wb") as f_out:
            f_out.write(chunk)

There might be some way to do this with libraries (or even a great built-in way), but I'm not aware of any offhand.
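One follow-up worth noting: reading one byte at a time is slow on files of this size. A common faster pattern (a sketch of my own, not from the answer above or any library) reads large blocks and cuts each chunk after the last terminator found with bytes.rfind, carrying the unterminated tail over to the next block; unlike the code above, it keeps the terminators in the output:

```python
import io

def chunks_by_terminator(f_object, terminal_byte, max_size):
    # read max_size-byte blocks; cut each chunk right after the last
    # terminator seen, carrying the unterminated tail forward
    leftover = b""
    while True:
        block = f_object.read(max_size)
        if not block:
            break
        data = leftover + block
        cut = data.rfind(terminal_byte)
        if cut == -1:
            leftover = data  # no terminator in this block yet, keep accumulating
            continue
        yield data[:cut + 1]       # chunk ends just after a terminator
        leftover = data[cut + 1:]
    if leftover:  # trailing bytes with no final terminator
        yield leftover

chunks = list(chunks_by_terminator(io.BytesIO(b"aa\x01bb\x01cc\x01"), b"\x01", 4))
# chunks == [b"aa\x01", b"bb\x01", b"cc\x01"]
```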
