
fastest way to parse large binary file in python

I have files on the order of tens of GB, each composed of a mixture of around 10 different packed C structs. I need to iterate through every struct in a file for analysis, and I want to do that analysis in Python code. I never need to write to the files.

I don't think numpy can help here because the files aren't just a single repeating struct. I find struct.unpack much too slow.
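(For contrast, if a file really were one repeating packed struct, numpy could overlay it with a structured dtype and no Python-level loop; the field names below are made up:)

```python
import struct
import numpy as np

# Hypothetical fixed layout: big-endian u64, u32, u8 -- matches '>QLB'.
dtype = np.dtype([('timestamp', '>u8'), ('length', '>u4'), ('flag', 'u1')])

raw = struct.pack('>QLB', 1, 2, 3) * 2   # two fake records, 13 bytes each
arr = np.frombuffer(raw, dtype=dtype)    # zero-copy view over the bytes
print(arr['timestamp'])                  # vectorized access to one field
```

With mixed record types that trick doesn't apply directly, which is the problem.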

My idea so far is to use Cython and mmap the file, then iterate over the buffer and cast it to Cython C structs, in the hope of avoiding any unnecessary copying. The snag I ran into with this approach is that I can't use a Cython C struct pointer directly from Python and effectively have to write Python wrapper classes, which makes things slower and tedious to write. Does anyone know a way around this?

Are there other approaches that might work? I haven't considered ctypes yet.
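(Since ctypes comes up: one way to avoid hand-written wrapper classes is ctypes.Structure.from_buffer, which overlays a struct view on a writable buffer without copying. A minimal sketch with made-up field names; note that an mmap opened with ACCESS_READ is read-only, so you'd need from_buffer_copy, which copies, or mmap.ACCESS_COPY, which is writable:)

```python
import ctypes
import struct

# Hypothetical packed record: a one-byte 'kind' plus a 4-byte 'value'.
class Record(ctypes.Structure):
    _pack_ = 1  # match a packed C struct (no alignment padding)
    _fields_ = [('kind', ctypes.c_uint8),
                ('value', ctypes.c_uint32)]

# In real use the buffer would be an mmap of the file.
buf = bytearray(struct.pack('=BI', 7, 1234))  # '=': native order, no padding
rec = Record.from_buffer(buf)                 # zero-copy overlay on the buffer
print(rec.kind, rec.value)                    # fields read via attributes
```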

Are you sure the copying is actually the problem? Is something like this already too slow for you?

import struct

st = struct.Struct('>QLB')  # whatever your record layout is
with open(path, 'rb') as fp:  # path to your data file
    while True:
        data = fp.read(st.size)
        if not data:
            break
        a, b, c = st.unpack(data)
        do_something_with(a, b, c)

If so, perhaps using mmap and struct.unpack_from can get you a bit more speed.
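(A sketch of that combination, dispatching on a leading type byte; the tag scheme and struct formats are assumptions about your file layout:)

```python
import mmap
import struct

# Assumed layout: each record starts with a one-byte type tag that selects
# one of a handful of fixed struct formats (the formats are placeholders).
LAYOUTS = {
    0: struct.Struct('>QLB'),
    1: struct.Struct('>QQ'),
}

def iter_records(path):
    with open(path, 'rb') as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        offset, end = 0, len(mm)
        while offset < end:
            tag = mm[offset]          # record type byte (assumption)
            st = LAYOUTS[tag]
            # unpack_from reads straight out of the mmap -- no slice copy
            yield tag, st.unpack_from(mm, offset + 1)
            offset += 1 + st.size
```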

