Cython: fast conversion of a binary string to an int array

Question

I have a large binary data file which I want to load into a C array for fast access. The data file just contains a sequence of 4-byte ints.

I get the data via the pkgutil.get_data function, which returns a binary string. The following code works:

import pkgutil
import struct

cdef int data[32487834]

def load_data():
    global data
    py_data = pkgutil.get_data('my_module', 'my_data')
    for i in range(32487834):
        data[i] = <int>struct.unpack('i', py_data[4*i:4*(i+1)])[0]
    return 0

load_data()

The problem is that this code is quite slow. Reading the whole data file can take 7 or 8 seconds. Reading the file directly into an array in C only takes 1-2 seconds, but I want to use pkgutil.get_data so that my module can reliably find the data wherever it gets installed.

So, my question is: what's the best way to do this? Is there a way to directly cast the data as an array of ints without all the calls to struct.unpack? And, as a secondary question, is there a way to simply get a pointer to the data to avoid copying 120MB of data unnecessarily?

Alternatively, is there a way to make pkgutil return the file path to the data instead of the data itself (in which case I can use C file IO to read the file quite quickly)?

EDIT:

Just for the record, here's the final code used (based on Veedrac's answer):

import pkgutil

from cpython cimport array
import array

cdef int[:] data

cdef void load_data():
    global data
    py_data = pkgutil.get_data('my_module', 'my_data')
    data = array.array('i', py_data)

load_data()

Everything is quite fast.

Answer 1

Chances are you should really just use Numpy:

import numpy
import random
import struct

data = struct.pack('i'*100, *[random.randint(0, 1000000) for _ in range(100)])

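# numpy.frombuffer replaces the deprecated binary mode of numpy.fromstring;
# it returns a read-only view of the bytes rather than a copy.
numpy.frombuffer(data, dtype="int32")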
#>>> array([642029, 967046, 599565, ...etc], dtype=int32)

Then just use any of the standard methods to get a pointer from that.

If you want to avoid Numpy, a faster but less platform-agnostic method would be to go via a char pointer:

cdef int *data_view = <int *><char *>data

This has lots of "undefined"-ness to it, so be careful. Also be careful not to modify the data!
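The cast above quietly assumes that a C int is 4 bytes, that the buffer length is a whole number of ints, and that the file's byte order matches the machine's. Here is a minimal sketch of checking those assumptions from Python before handing the buffer to C. The helper name check_int_layout and the idea that the file's byte order is known in advance are assumptions for illustration, not something from the question:

```python
import struct
import sys

def check_int_layout(data, file_byteorder="little"):
    """Return the number of ints in `data`, or raise ValueError if the
    buffer cannot safely be viewed as native C ints (hypothetical helper)."""
    int_size = struct.calcsize("i")  # size of a native C int on this platform
    if int_size != 4:
        raise ValueError("native int is %d bytes, expected 4" % int_size)
    if len(data) % int_size != 0:
        raise ValueError("data length %d is not a multiple of 4" % len(data))
    if sys.byteorder != file_byteorder:
        raise ValueError("file is %s-endian, machine is %s-endian"
                         % (file_byteorder, sys.byteorder))
    return len(data) // int_size

# Made-up sample: three ints packed in this machine's native byte order.
sample = struct.pack("=3i", 1, 2, 3)
count = check_int_layout(sample, sys.byteorder)
```

If any check fails, falling back to the struct or array route is probably the safer choice.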

A good compromise between the two would be to use cpython.array:

from cpython cimport array
import array

def main(data):
    cdef array.array[int] data_arr = array.array('i', data)
    cdef int *data_ptr = data_arr.data.as_ints

which gives you well-defined semantics and is fast while using only built-in libraries.
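On the secondary question about avoiding the copy: array.array('i', data) copies the bytes into a new buffer. In plain Python, a memoryview can reinterpret the same bytes as native ints without copying. The sketch below uses made-up sample data in place of the pkgutil.get_data result. Because bytes is immutable the view is read-only, so on the Cython side it would presumably have to be bound to a const int[:] memoryview; treat that part as an assumption to verify against your Cython version.

```python
import array
import struct

# Made-up sample standing in for pkgutil.get_data(...): five native ints.
raw = struct.pack("5i", 10, 20, 30, 40, 50)

# Copying route, as in the answer: builds a new array from the bytes.
copied = array.array("i", raw)

# Zero-copy route: reinterpret the same buffer as native ints.
view = memoryview(raw).cast("i")

print(list(copied))
print(view.tolist())
print(view.readonly)  # the underlying bytes object cannot be modified
```

Both routes yield the same values; the difference is only whether a second 120MB buffer gets allocated.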
