How to split and parse a big text file in python in a memory-efficient way?
I have quite a big text file to parse. The main pattern is as follows:
step 1
[n1 lines of headers]
3 3 2
0.25 0.43 12.62 1.22 8.97
12.89 89.72 34.87 55.45 17.62
4.25 16.78 98.01 1.16 32.26
0.90 0.78 11.87
step 2
[n2 != n1 lines of headers]
3 3 2
0.25 0.43 12.62 1.22 8.97
12.89 89.72 34.87 55.45 17.62
4.25 16.78 98.01 1.16 32.26
0.90 0.78 11.87
step 3
[(n3 != n1) and (n3 !=n2) lines of headers]
3 3 2
0.25 0.43 12.62 1.22 8.97
12.89 89.72 34.87 55.45 17.62
4.25 16.78 98.01 1.16 32.26
0.90 0.78 11.87
In other words:
- A separator: step #
- Headers of known length (a known number of lines, not bytes)
- Data of 3-dimensional shape: nz, ny, nx
- Data: fortran formatting, ~10 floats/line in the original dataset
I just want to extract the data, convert them to floats, put them in a numpy array, and ndarray.reshape it to the shapes given.
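For a single data block, that conversion step might look like the following sketch, run against an in-memory copy of the sample block above (the use of io.StringIO is only for the demo; the real code would read from the file at the right offset):

```python
import io

import numpy as np

# One data block as in the example above: a shape line (nz ny nx)
# followed by whitespace-separated floats, a few per line.
block = io.StringIO(
    "3 3 2\n"
    "0.25 0.43 12.62 1.22 8.97\n"
    "12.89 89.72 34.87 55.45 17.62\n"
    "4.25 16.78 98.01 1.16 32.26\n"
    "0.90 0.78 11.87\n"
)

nz, ny, nx = (int(n) for n in block.readline().split())
# split() ignores line breaks, so how many floats sit on each line is irrelevant.
data = np.array(block.read().split(), dtype=float).reshape(nz, ny, nx)
print(data.shape)  # (3, 3, 2)
```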
I've already done a bit of programming... The main idea is: I wanted to avoid regex at first, since these would slow things down a lot. It already takes 3-4 minutes just to get the first step done (browsing the file to get the offset of each part).
The problem is that I'm basically using the file.tell() method to get the separator positions:

[file.tell() - len(sep) for line in file if sep in line]
The problem is two-fold: for small files, file.tell() gives the right separator positions; for longer files, it does not. I suspect that file.tell() should not be used in loops, neither with an explicit file.readline() nor with the implicit for line in file (I tried both). I don't know why, but the result is there: with big files,

[file.tell() for line in file if sep in line]

does not systematically give the position of the line right after a separator. sep is a string (bytes) containing the first line of the file (the first separator). Does anyone know how I should parse that?
NB: I find the offsets first because I want to be able to browse inside the file: I might just want the 10th dataset or the 50000th one...
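One way to sidestep tell() entirely is to open the file in binary mode and count bytes by hand: len(line) on a bytes line is its exact on-disk length, so a running total gives reliable offsets regardless of buffering. A minimal sketch against an in-memory stand-in (the sample contents are made up):

```python
import io

# In-memory stand-in for the real file (contents are illustrative).
sample = (
    b"step 1\n"
    b"header\n"
    b"3 3 2\n"
    b"0.1 0.2\n"
    b"step 2\n"
    b"other header\n"
    b"3 3 2\n"
    b"0.3 0.4\n"
)

sep = b"step "
offsets = []
pos = 0
for line in io.BytesIO(sample):
    if line.startswith(sep):
        offsets.append(pos)  # byte offset of the separator line itself
    pos += len(line)         # advance by the exact byte length of the line

print(offsets)  # [0, 28]
```

Each offset can then be passed to seek() on a file opened in binary mode to jump straight to the n-th dataset.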
sep = "step "
with open("myfile") as f_in:
    offsets = [f_in.tell() for line in f_in if sep in line]
As I said, this is working in the simple example, but not on the big file.
New test:
sep = "step "
offsets = []
with open("myfile") as f_in:
    for line in f_in:
        if sep in line:
            print(line)
            offsets.append(f_in.tell())
The lines printed correspond to the separators, no doubt about it. But the offsets obtained with f_in.tell() do not correspond to the next line. I guess the file is buffered in memory, and as I try to use f_in.tell() in the implicit loop, I do not get the current position but the end of the buffer. This is just a wild guess.
I got the answer: for-loops on a file and tell() do not get along very well, just like mixing for i in file and file.readline() raises an error.
So, use file.tell() with file.readline() or file.read() only.
Never ever use:
for line in file:
    [do stuff]
    offset = file.tell()
This is really a shame, but that's the way it is.
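Following that rule, the readline()-based version of the offset scan would look like this sketch, run against a throwaway temp file (the sample contents are made up):

```python
import os
import tempfile

# Small illustrative file (contents are made up for the demo).
with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
    tmp.write("step 1\nheader\n1.0 2.0\nstep 2\nheader\n3.0 4.0\n")
    path = tmp.name

offsets = []
with open(path) as f_in:
    pos = f_in.tell()        # position *before* reading the line
    line = f_in.readline()
    while line:
        if line.startswith("step "):
            offsets.append(pos)
        pos = f_in.tell()
        line = f_in.readline()

# The saved positions can be fed back to seek() to jump to any dataset.
with open(path) as f_in:
    f_in.seek(offsets[1])
    second = f_in.readline()

os.remove(path)
print(offsets, second)
```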