
Performance issue with reading integers from a binary file at specific locations

I have a file with integers stored as binary and I'm trying to extract values at specific locations. It's one big serialized integer array for which I need values at specific indexes. I've created the following code, but it's terribly slow compared to the F# version I created before.

import os, struct

def read_values(filename, indices):
    # indices are sorted and unique
    values = []
    with open(filename, 'rb') as f:
        for index in indices:
            # each value is a 4-byte native-endian int32
            f.seek(index * 4, os.SEEK_SET)
            b = f.read(4)
            v = struct.unpack("@i", b)[0]
            values.append(v)
    return values

For comparison, here is the F# version:

open System
open System.IO

let readValue (reader:BinaryReader) cellIndex = 
    // set stream to correct location
    reader.BaseStream.Position <- cellIndex*4L
    match reader.ReadInt32() with
    | Int32.MinValue -> None
    | v -> Some(v)

let readValues fileName indices = 
    use reader = new BinaryReader(File.Open(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
    // Use list or array to force creation of values (otherwise reader gets disposed before the values are read)
    let values = List.map (readValue reader) (List.ofSeq indices)
    values

Any tips on how to improve the performance of the Python version, e.g. by using numpy?

Update

HDF5 works very well (from 5 seconds down to 0.8 seconds on my test file):

import tables

def read_values_hdf5(filename, indices):
    with tables.open_file(filename) as f:
        # fancy indexing reads only the requested elements from disk
        dset = f.root.raster
        return dset[indices]
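
For completeness, a minimal sketch of how such an HDF5 file might be created with PyTables (the dataset name raster matches the snippet above; the source file name and dtype are assumptions):

import numpy as np
import tables

# hypothetical: pack the raw int32 file into an HDF5 array named 'raster'
raw = np.fromfile("my_data.bin", dtype=np.int32)
with tables.open_file("my_data.h5", mode="w") as f:
    f.create_array(f.root, "raster", raw)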

Update 2

I went with np.memmap because the performance is similar to HDF5 and I already have numpy in production.
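
In case it is useful to someone, a minimal sketch of what that looks like (same function name as my original code; the int32 dtype is an assumption about the file layout):

import numpy as np

def read_values(filename, indices):
    # memory-map the file; only the pages containing the requested
    # values are actually read from disk
    arr = np.memmap(filename, dtype=np.int32, mode='r')
    return arr[indices]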

Depending heavily on your index file size, you might want to read it completely into a numpy array. If the file is not large, a complete sequential read may be faster than a large number of seeks.

One problem with the seek operations is that Python operates on buffered input. If the program were written in some lower-level language, using unbuffered I/O would be a good idea, as you only need a few values.
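
(In Python, too, you can request unbuffered binary I/O by passing buffering=0 to open; whether it actually helps depends on the access pattern, so treat the following as something to benchmark rather than a guaranteed win:)

# buffering=0 gives a raw, unbuffered binary file object
with open("my_index", "rb", buffering=0) as f:
    f.seek(4 * 1000)   # jump straight to the 1000th uint32
    b = f.read(4)      # no Python-level read-ahead buffering happens here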

import numpy as np

def read_values(filename, indices):
    # read the complete index file into memory in one sequential pass
    index_array = np.fromfile(filename, dtype=np.uint32)
    # look up the indices you need (indices being a list of indices)
    return index_array[indices]

If you would read almost all pages anyway (i.e. your indices are random and occur at a frequency of 1/1000 or more), this is probably faster. On the other hand, if you have a large index file and only want to pick a few indices, it is not so fast.

One more possibility, which might be the fastest, is to use the Python mmap module. The file is then memory-mapped, and only the pages actually required are accessed.

It should be something like this:

import mmap
import struct

with open("my_index", "rb") as f:
    # length 0 maps the whole file; ACCESS_READ matches the read-only handle
    memory_map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for i in indices:
        # the uint32 at position i:
        idx_value = struct.unpack('I', memory_map[4*i:4*i+4])[0]

(Note, I did not actually test that one, so there may be typos. Also, I did not care about endianness, so please check it is correct.)
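
If the byte order of the file is known, struct can make it explicit instead of relying on the native order (these are the standard struct format prefixes; the sample bytes are just an illustration):

import struct

# '<I' = little-endian uint32, '>I' = big-endian; bare 'I' uses native order
value = struct.unpack('<I', b'\x01\x00\x00\x00')[0]   # -> 1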

Happily, these can be combined by using numpy.memmap. It should keep your array on disk but give you numpy-style indexing. It should be as easy as:

import numpy as np

def read_values(filename, indices):
    # mode='r' memory-maps the file read-only
    index_arr = np.memmap(filename, dtype='uint32', mode='r')
    return index_arr[indices]
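
One thing worth checking here: the dtype has to match what is actually stored in the file. numpy dtype strings accept an explicit byte-order prefix, which matters if the file was written on a machine with a different endianness (the file name is again just an example):

import numpy as np

# '<u4' = little-endian uint32, '>u4' = big-endian uint32
index_arr = np.memmap("my_index", dtype='<u4', mode='r')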

I think this should be the easiest and fastest alternative. However, if "fast" is important, please test and profile.


EDIT: As the mmap solution seems to be gaining some popularity, I'll add a few words about memory-mapped files.

What is mmap?

Memory-mapped files are not something uniquely Pythonic; memory mapping is defined in the POSIX standard. It is a way to use devices or files as if they were just areas of memory.

File memory mapping is a very efficient way to randomly access fixed-length data files. It uses the same technology as virtual memory: reads and writes are ordinary memory operations, and if they touch a memory location that is not in physical RAM (a "page fault" occurs), the required file block (page) is read into memory.

The delay in random file access is mostly due to the physical rotation of the disks (SSDs are another story). On average, the block you need is half a rotation away; for a typical HDD this delay is approximately 5 ms, plus any data-handling delay. The overhead introduced by using Python instead of a compiled language is negligible compared to this delay.
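
As a rough sanity check of that number: a 7200 RPM disk makes 7200/60 = 120 rotations per second, i.e. about 8.3 ms per rotation, so the average half-rotation latency alone is roughly 4.2 ms before any seek and transfer time is added.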

If the file is read sequentially, the operating system usually uses a read-ahead cache to buffer the file before you even know you need it. For a randomly accessed big file this does not help at all. Memory mapping provides a very efficient way, because all blocks are loaded exactly when you need them and remain in the cache for further use. (This could in principle happen with fseek as well, because it might use the same technology behind the scenes. However, there is no guarantee, and there is in any case some overhead as the call wanders through the operating system.)

mmap can also be used to write files. It is very flexible in the sense that a single memory-mapped file can be shared by several processes. This may be very useful and efficient in some situations, and mmap can also be used for inter-process communication. In that case usually no file is specified for mmap; instead, the memory map is created with no file behind it.
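
A minimal sketch of such an anonymous (file-less) mapping in Python; on Unix a map created this way is inherited by forked child processes:

import mmap

# a file descriptor of -1 creates an anonymous map with no file behind it
shared = mmap.mmap(-1, 4096)
shared[:5] = b'hello'   # ordinary slice assignment writes into the map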

mmap is not very well known despite its usefulness and relative ease of use. It has, however, one important 'gotcha': the file size has to remain constant. If it changes during mmap, odd things may happen.

Is the indices list sorted? I think you could get better performance if the list were sorted, as you would make a lot fewer disk seeks.
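
A sketch of that idea, assuming the caller's ordering has to be preserved (read_values_sorted is a made-up name; np.argsort gives the visiting order):

import struct
import numpy as np

def read_values_sorted(filename, indices):
    # visit the file offsets in ascending order to reduce seeking,
    # then place each value back at its caller-visible position
    idx = np.asarray(indices)
    order = np.argsort(idx)
    values = np.empty(len(idx), dtype=np.int32)
    with open(filename, 'rb') as f:
        for pos in order:
            f.seek(int(idx[pos]) * 4)
            values[pos] = struct.unpack('@i', f.read(4))[0]
    return values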
