
Fastest way to parse a large binary file in Python

I have files on the order of tens of GB that are composed of a mixture of ten or so packed C structs. I need to be able to iterate through each struct in the file for analysis, and I want to be able to do that analysis in Python code. I don't need to write to the file at all.

I don't think numpy can help here, because the files aren't just a single repeating struct. struct.unpack, I find, is much too slow.

My idea so far is to use Cython, mmap the file, then iterate and cast the buffer to Cython C structs, in the hope of avoiding any unnecessary copying. The snag I ran into with this approach, though, is that I can't use a Cython C struct pointer directly and effectively need to write Python wrapper classes, which makes things a bit slower and tedious to write. Does anyone know of a way around this?

I'm wondering if there are other approaches that might work. I haven't considered ctypes yet.
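If ctypes is worth exploring, one possible sketch is to declare a ctypes.Structure per record type and create zero-copy views over a buffer with from_buffer, avoiding both struct.unpack and hand-written wrapper classes. The Record layout and field names here are hypothetical stand-ins for the real packed structs:

```python
import ctypes
import struct

class Record(ctypes.BigEndianStructure):
    # Hypothetical layout equivalent to a '>QLB' format string;
    # substitute the real field types for each of your packed structs.
    _pack_ = 1
    _fields_ = [('a', ctypes.c_uint64),
                ('b', ctypes.c_uint32),
                ('c', ctypes.c_uint8)]

# Demo buffer standing in for an mmap'd file: three packed records.
buf = bytearray()
for i in range(3):
    buf += struct.pack('>QLB', i, i * 2, i + 7)

rec_size = ctypes.sizeof(Record)  # 13 bytes with _pack_ = 1
records = []
for off in range(0, len(buf), rec_size):
    rec = Record.from_buffer(buf, off)  # zero-copy view, no unpacking
    records.append((rec.a, rec.b, rec.c))
# records is now [(0, 0, 7), (1, 2, 8), (2, 4, 9)]
```

Note that from_buffer requires a writable buffer, so for a file you don't want to modify you would mmap it with ACCESS_COPY (copy-on-write) or fall back to from_buffer_copy, which does copy each record.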

Are you sure the copying is actually the problem? Is something like this already too slow for you?

import struct

st = struct.Struct('>QLB')  # whatever your record layout is

with open('data.bin', 'rb') as fp:
    while True:
        data = fp.read(st.size)
        if len(data) < st.size:  # EOF, or a truncated trailing record
            break
        a, b, c = st.unpack(data)
        do_something_with(a, b, c)

If so, perhaps using mmap and struct.unpack_from can get you a bit more speed.
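A rough sketch of that suggestion, simplified to a file of one record type (the '>QLB' format is a placeholder; with a mixture of struct types you would read a type tag and dispatch to the right struct.Struct, but the offset-stepping idea is the same). unpack_from reads directly out of the mapped pages, so there is no per-record read() call or intermediate bytes object:

```python
import mmap
import struct

st = struct.Struct('>QLB')  # placeholder format; substitute the real layout

def iter_records(path):
    """Yield unpacked tuples from a binary file without copying it into memory."""
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for off in range(0, len(mm) - st.size + 1, st.size):
                yield st.unpack_from(mm, off)  # unpack in place, no read()
```

Because the function is a generator, it keeps memory flat no matter how large the file is; the OS pages the mapping in and out as you iterate.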


© 2020-2024 STACKOOM.COM