
Efficient cython file reading, string parsing, and array building

So I have some data files that look like this:

      47
   425   425  -3 15000 15000 900   385   315   3   370   330   2   340   330   2
   325   315   2   325   240   2   340   225   2   370   225   2   385   240   2
   385   315   2   475   240   3   460   240   2   460   255   2   475   255   2
   475   240   2   595   315   3   580   330   2   550   330   2   535   315   2
   535   240   2   550   225   2   580   225   2   595   240   2   595   315   2
   700   315   3   685   330   2   655   330   2   640   315   2   640   240   2
   655   225   2   685   225   2   700   240   2   700   315   2   700   315   3
  9076   456   2  9102   449   2  9127   443   2  9152   437   2  9178   433   2
  9203   430   2  9229   428   2  9254   427   2  9280   425   2  9305   425   2
     0     0 999  6865    259999
      20
   425   425  -3 15000 15000 900   385   315   3   370   330   2   340   330   2
   325   315   2   325   240   2   340   225   2   370   225   2   385   240   2
   385   315   2   475   240   3   460   240   2   460   255   2   475   255   2
   475   240   2   595   315   3   580   330   2   550   330   2   535   315   2

The first number is the number of points in the following block of text, and then the block of text has that many points with up to 5 points per line. Each point has 3 components (I'll call them x, y, z). x and y get 6 characters, while z gets 4, so each point takes 16 characters. Occasionally z is 9999, resulting in no space between y and z, so using split() will mess up parsing those lines. Also, all the numbers are integers (no decimals), but there are some negatives.
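
For example, a contrived two-point line where the second point has z = 9999 shows the problem (plain Python, purely for illustration):

line = '  9305   425  -3  9305   4259999'
print(line.split())    # ['9305', '425', '-3', '9305', '4259999'] -- y and z fuse
chunk = line[16:32]    # second 16-character point
print(int(chunk[0:6]), int(chunk[6:12]), int(chunk[12:16]))    # 9305 425 9999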

In the actual file the blocks are generally 1000 points long, with some blocks being smaller (at the end of a "page", where page breaks are denoted by z=9999).

My initial solution was to use regex:

import re
def get_points_regex(filename):
    with open(filename, 'r') as f:
        text = f.read()
    points = []
    for m in re.finditer(r'([ \d-]{6})([ \d-]{6})([ \d\-]{4})', text):
        point = tuple(int(i) for i in m.groups())
        points.append(point)
    return points

My test file is 55283 lines long (4.4 MB) and contains 274761 points.

Using timeit on get_points_regex I get 560 ms.
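
For reference, the timings quoted throughout were gathered in the notebook along these lines (invocation assumed, not from the original post):

%timeit get_points_regex(filename)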

I then figured that while finditer is memory efficient, generating thousands of match objects is slow when I don't need any of their features, so I made a version using re.findall:

def get_points_regex2(filename):
    with open(filename, 'r') as f:
        text = f.read()
    points = re.findall(r'([ \d-]{6})([ \d-]{6})([ \d\-]{4})', text)
    points = [tuple(map(int, point)) for point in points]
    return points

This version runs in 414 ms, 1.35x faster than finditer.

Then I was thinking that for such simple patterns regex might be overkill, so I made a version using pure Python:

def get_points_simple():
    points = []
    with open(filename, 'r') as f:
        for line in f:
            n_chunks = int(len(line)/16)
            for i in range(n_chunks):
                chunk = line[16*i:16*(i+1)]
                x = int(chunk[0:6])
                y = int(chunk[6:12])
                z = int(chunk[12:16])
                points.append((x, y, z))
    return points

This runs in 386 ms, 1.07x faster than regex.

Then I broke down and tried Cython for the first time. I'm just running it using the %%cython cell magic in a Jupyter notebook. I came up with this:

%%cython
def get_points_cython(filename):
    cdef int i, x, y, z
    points = []
    f = open(filename, 'r')
    for line in f:
        n_chunks = int(len(line)/16)
        for i in range(n_chunks):
            chunk = line[16*i:16*(i+1)]
            x = int(chunk[0:6])
            y = int(chunk[6:12])
            z = int(chunk[12:16])
            points.append((x, y, z))

    f.close()
    return points

The cython function runs in 196 ms (2x faster than pure Python).

I tried to simplify some expressions, like not using a context manager for file opening. While I declared the integers, I wasn't sure what else to do, so I left the rest alone. I made a couple of attempts at doing a 2D integer array instead of a list of tuples for points, but Python segfaulted (I'm assuming that's what happened; the IPython kernel died). I had cdef int points[1000000][3], then I assigned with statements like points[j][1] = x while incrementing j. From some light reading and very little C background I think that might be a rather large array? Stack vs. heap (I don't know what these really are)? Need things like malloc? I'm a bit lost on that stuff.
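
For context, a 1000000 x 3 array of C ints is roughly 12 MB, which far exceeds the typical C stack of a few MB, so a cdef stack array of that size would indeed crash; allocating it on the heap with malloc avoids this. A minimal sketch of that route (get_points_malloc is a hypothetical name; error handling kept minimal):

%%cython
from libc.stdlib cimport malloc, free

def get_points_malloc(filename):
    # Heap-allocate room for 1,000,000 points of 3 ints (~12 MB)
    cdef int *data = <int *>malloc(1000000 * 3 * sizeof(int))
    cdef int i, n_chunks, j = 0
    if data == NULL:
        raise MemoryError()
    try:
        with open(filename, 'r') as f:
            for line in f:
                n_chunks = len(line) // 16
                for i in range(n_chunks):
                    chunk = line[16*i:16*(i+1)]
                    data[3*j]     = int(chunk[0:6])
                    data[3*j + 1] = int(chunk[6:12])
                    data[3*j + 2] = int(chunk[12:16])
                    j += 1
        # Copy out into Python objects before freeing the C buffer
        return [(data[3*k], data[3*k + 1], data[3*k + 2]) for k in range(j)]
    finally:
        free(data)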

Next I had read that maybe I should just use Numpy since Cython is good at that. Following this I was able to create this function:

%%cython
import numpy as np
cimport numpy as np
DTYPE = np.int
ctypedef np.int_t DTYPE_t

def get_points_cython_numpy(filename):
    cdef int i, j, x, y, z
    cdef np.ndarray points = np.zeros([1000000, 3], dtype=DTYPE)
    f = open(filename, 'r')
    j = 0
    for line in f:
        n_chunks = int(len(line)/16)
        for i in range(n_chunks):
            chunk = line[16*i:16*(i+1)]
            x = int(chunk[0:6])
            y = int(chunk[6:12])
            z = int(chunk[12:16])
            points[j, 0] = x
            points[j, 1] = y
            points[j, 2] = z
            j = j + 1

    f.close()
    return points

Unfortunately this takes 263 ms, so a little slower.

Am I missing something obvious with Cython or the Python std lib that would make parsing this any faster, or is this about as fast as it gets for a file of this size?

I thought about pandas and numpy loading functions, but I figured the chunk-size header rows would complicate it too much. At one point I almost had something working with pandas read_fwf followed by DataFrame.values.reshape(-1, 3), then dropping rows with NaNs, but I knew that had to be slower by that point.
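
For reference, that pandas route might have looked roughly like the sketch below (function name assumed; untested against the real files). The count lines parse as mostly-NaN rows, so the same NaN filter that drops padding on short lines also drops them:

import numpy as np
import pandas as pd

def get_points_pandas(filename):
    # 5 points per line, each point being three fixed-width fields (6, 6, 4)
    df = pd.read_fwf(filename, widths=[6, 6, 4] * 5, header=None)
    arr = df.values.reshape(-1, 3)             # one point per row
    return arr[~np.isnan(arr).any(axis=1)].astype(int)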

Any ideas to speed this up would be greatly appreciated!

I'd love to get this below 100 ms so that a GUI can be updated rapidly from reading these files as they get generated. (Move slider > run background analysis > load data > plot results in real time.)

Here is a faster example; it uses fast_atoi() to convert string to int, and it's 2x faster than get_points_cython() on my pc. If the point-count lines all have the same width (8 chars), then I think I can speed it up further (about 12x faster than get_points_cython()).

%%cython
import numpy as np
cimport numpy as np
import cython

cdef int fast_atoi(char *buff):
    # Minimal atoi: scan to the NUL terminator, flag '-' (ASCII 45) as the
    # sign and accumulate digits '0'-'9' (ASCII 48-57), skipping spaces.
    cdef int c = 0, sign = 0, x = 0
    cdef char *p = buff
    while True:
        c = p[0]
        if c == 0:        # NUL terminator
            break
        if c == 45:       # '-'
            sign = 1
        elif c > 47 and c < 58:   # '0'..'9'
            x = x * 10 + c - 48
        p += 1
    return -x if sign else x

@cython.boundscheck(False)
@cython.wraparound(False)
def get_points_cython_numpy(filename):
    cdef int i, j, x, y, z, n_chunks
    cdef bytes line, chunk
    cdef int[:, ::1] points = np.zeros([500000, 3], np.int32)
    f = open(filename, 'rb')
    j = 0
    for line in f:
        n_chunks = int(len(line)/16)
        for i in range(n_chunks):
            chunk = line[16*i:16*(i+1)]
            x = fast_atoi(chunk[0:6])
            y = fast_atoi(chunk[6:12])
            z = fast_atoi(chunk[12:16])
            points[j, 0] = x
            points[j, 1] = y
            points[j, 2] = z
            j = j + 1

    f.close()
    return points.base[:j]
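
Hypothetical usage from a notebook cell (invocation assumed, not part of the answer); the function returns the array trimmed to the rows actually filled:

pts = get_points_cython_numpy(filename)
print(pts.shape, pts.dtype)   # (n_points, 3) int32
%timeit get_points_cython_numpy(filename)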

Here is the fastest method; the idea is to read the whole file content into a bytes object, and get the points data from it.

%%cython
import numpy as np
cimport numpy as np
import cython

@cython.boundscheck(False)
@cython.wraparound(False)
cdef inline int fast_atoi(char *buf, int size):
    cdef int i=0 ,c = 0, sign = 0, x = 0
    for i in range(size):
        c = buf[i]
        if c == 0:
            break
        if c == 45:
            sign = 1
        elif c > 47 and c < 58:
            x = x * 10 + c - 48
    return -x if sign else x

@cython.boundscheck(False)
@cython.wraparound(False)
def fastest_read_points(fn):
    cdef bytes buf
    with open(fn, "rb") as f:
        buf = f.read().replace(b"\n", b"")  # strip newlines; adjust for your platform's line ending

    cdef char * p = buf
    cdef int length = len(buf)
    cdef char * buf_end = p + length
    cdef int count = length // 16 * 2  # generous upper bound on the number of points
    cdef int[:, ::1] res = np.zeros((count, 3), np.int32)
    cdef int i, j, block_count
    i = 0
    while p < buf_end:
        block_count = fast_atoi(p, 8)  # count lines in the sample data are 8 chars wide
        p += 8
        for j in range(block_count):
            res[i, 0] = fast_atoi(p, 6)
            res[i, 1] = fast_atoi(p+6, 6)
            res[i, 2] = fast_atoi(p+12, 4)
            p += 16
            i += 1
    return res.base[:i]

Files that are fixed format and well behaved can be read efficiently with Numpy. The idea is to read the file into an array of strings and then convert to integers in one go. The tricky bit is the variable number of points per block and the placement of newline characters. One way to do it for your file is:

def read_chunk_numpy(fh, n_points):
    # 16 chars per point, plus one newline for every (up to) 5 points
    n_bytes = n_points * 16 + (n_points + 4) // 5

    txt_arr = np.fromfile(fh, 'S1', n_bytes)
    txt_arr = txt_arr[txt_arr != b'\n']    
    xyz = txt_arr.view('S6,S6,S4').astype('i,i,i')
    xyz.dtype.names = 'x', 'y', 'z'
    return xyz
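
To make the view/astype step concrete, here is a tiny standalone illustration on one 16-character point taken from the sample data:

import numpy as np
raw = np.frombuffer(b'   425   425  -3', dtype='S1')   # 16 single-byte strings
rec = raw.view('S6,S6,S4').astype('i,i,i')             # regroup, then parse to int32
print(rec)   # [(425, 425, -3)]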

Note that \n newline characters are assumed, so some more effort is needed for portability (with \r\n line endings, the extra \r bytes would need to be counted and filtered out as well). This gave me a huge speedup compared to the plain Python loop.
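
A possible way to handle that (a sketch with an assumed helper name, not part of the original answer) is to peek at the first line and feed the detected terminator width into the byte arithmetic above:

def newline_width(fname):
    # b'\r\n' (Windows) terminates lines with 2 bytes, b'\n' with 1;
    # read_chunk_numpy's n_bytes arithmetic needs the true width.
    with open(fname, 'rb') as fh:
        return 2 if fh.readline().endswith(b'\r\n') else 1

Test code: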

import numpy as np

def write_testfile(fname, n_points):
    with open(fname, 'wb') as fh:
        for _ in range(n_points // 1000):
            n_chunk = np.random.randint(900, 1100)
            fh.write((str(n_chunk).rjust(8) + '\n').encode())  # count line (bytes for Python 3)
            xyz = np.random.randint(10**4, size=(n_chunk, 3))
            for i in range(0, n_chunk, 5):
                for row in xyz[i:i+5]:
                    fh.write(b'%6i%6i%4i' % tuple(row))
                fh.write(b'\n')

def read_chunk_plain(fh, n_points):
    points = []
    count = 0
    # Use while-loop because `for line in fh` would mess with file pointer
    while True:
        line = fh.readline()
        n_chunks = int(len(line)/16)
        for i in range(n_chunks):
            chunk = line[16*i:16*(i+1)]
            x = int(chunk[0:6])
            y = int(chunk[6:12])
            z = int(chunk[12:16])
            points.append((x, y, z))

            count += 1
            if count == n_points:
                return points

def test(fname, read_chunk):
    with open(fname, 'rb') as fh:
        line = fh.readline().strip()
        while line:
            n = int(line)
            read_chunk(fh, n)
            line = fh.readline().strip()

fname = 'test.txt'
write_testfile(fname, 10**5)
%timeit test(fname, read_chunk_numpy)
%timeit test(fname, read_chunk_plain)
