
How to split and parse a big text file in python in a memory-efficient way?

I have quite a big text file to parse. The main pattern is as follows:

step 1

[n1 lines of headers]

  3  3  2
 0.25    0.43   12.62    1.22    8.97
12.89   89.72   34.87   55.45   17.62
 4.25   16.78   98.01    1.16   32.26
 0.90    0.78   11.87
step 2

[n2 != n1 lines of headers]

  3  3  2
 0.25    0.43   12.62    1.22    8.97
12.89   89.72   34.87   55.45   17.62
 4.25   16.78   98.01    1.16   32.26
 0.90    0.78   11.87
step 3

[(n3 != n1) and (n3 !=n2) lines of headers]

  3  3  2
 0.25    0.43   12.62    1.22    8.97
12.89   89.72   34.87   55.45   17.62
 4.25   16.78   98.01    1.16   32.26
 0.90    0.78   11.87

in other words:

A separator: step #

Headers of known length (measured in lines, not bytes)

Data 3-dimensional shape: nz, ny, nx

Data: Fortran formatting, ~10 floats per line in the original dataset

I just want to extract the data, convert them to floats, put them in a numpy array and ndarray.reshape it to the given shapes.
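
For a single block, the conversion itself is short. Here is a minimal sketch (the function name and arguments are illustrative, not from the original post), assuming the shape line and the raw data text have already been extracted as strings:

import numpy as np

def block_to_array(shape_line, data_text):
    # shape_line is the "  3  3  2" line; data_text is the raw text of the floats
    nz, ny, nx = (int(n) for n in shape_line.split())
    values = np.array(data_text.split(), dtype=float)  # split() handles newlines too
    return values.reshape((nz, ny, nx))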

I've already done a bit of programming... The main idea is:

  1. to get the offsets of each separator first ("step X")
  2. to skip nX (n1, n2, ...) lines + 1 to reach the data
  3. to read bytes from there all the way to the next separator.

I wanted to avoid regex at first since it would slow things down a lot. It already takes 3-4 minutes just to get the first step done (browsing the file to get the offset of each part).

The problem is that I'm basically using the file.tell() method to get the separator positions:

[file.tell() - len(sep) for line in file if sep in line]

The problem is two-fold:

  1. For smaller files, file.tell() gives the right separator positions; for longer files, it does not. I suspect that file.tell() should not be used in loops, neither with an explicit file.readline() nor with the implicit for line in file (I tried both). I don't know why, but the result is there: with big files, [file.tell() for line in file if sep in line] does not systematically give the position of the line right after a separator.
  2. len(sep) does not give the right offset correction to go back to the beginning of the "separator" line. sep is a string (bytes) containing the first line of the file (the first separator).

Does anyone know how I should parse that?

NB: I find the offsets first because I want to be able to browse inside the file: I might just want the 10th dataset or the 50000th one...
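
Once the offsets are known, jumping to any dataset is one seek() plus a few readline() calls. A hypothetical sketch (the helper name and the n_headers argument are illustrative; n_headers stands for everything between the separator and the shape line):

import numpy as np

def read_dataset(path, offsets, k, n_headers):
    """Read only the k-th dataset (0-based), given the separator offsets."""
    with open(path) as f:
        f.seek(offsets[k])                 # jump to the "step ..." line
        f.readline()                       # consume the separator itself
        for _ in range(n_headers):         # skip the header lines
            f.readline()
        nz, ny, nx = (int(n) for n in f.readline().split())
        values = []
        while len(values) < nz * ny * nx:  # read data lines until the block is full
            values.extend(float(x) for x in f.readline().split())
        return np.array(values).reshape((nz, ny, nx))

This assumes offsets[k] is the position of the start of the k-th separator line (see the corrected scan below).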

1- Finding the offsets

sep = "step "
with open("myfile") as f_in:
    offsets = [f_in.tell() for line in f_in if sep in line]

As I said, this works on the simple example, but not on the big file.

New test:

sep = "step "
offsets = []
with open("myfile") as f_in:
    for line in f_in:
        if sep in line:
            print(line)                  # the separator lines are matched correctly...
            offsets.append(f_in.tell())  # ...but this is not the next line's position

The lines printed correspond to the separators, no doubt about it. But the offsets obtained with f_in.tell() do not correspond to the next line. I guess the file is buffered in memory, and as I try to use f_in.tell() in the implicit loop, I do not get the current position but the end of the buffer. This is just a wild guess.

I got the answer: for-loops on a file and tell() do not get along very well, just like mixing for i in file and file.readline() raises an error.

So, use file.tell() only with file.readline() or file.read().
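
A minimal sketch of the offset scan rewritten that way (same sep and file as above). Saving tell() before each readline() also removes the need for the len(sep) correction, since the recorded position is the start of the separator line itself:

sep = "step "
offsets = []
with open("myfile") as f_in:
    pos = f_in.tell()            # position before the line about to be read
    line = f_in.readline()
    while line:
        if sep in line:
            offsets.append(pos)  # offset of the start of the separator line
        pos = f_in.tell()
        line = f_in.readline()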

Never ever use:

for line in file:
    [do stuff]
    offset = file.tell()

This is really a shame but that's the way it is.
