
How to split and parse a big text file in Python in a memory-efficient way?

I have quite a big text file to parse. The main pattern is as follows:

step 1

[n1 lines of headers]

  3  3  2
 0.25    0.43   12.62    1.22    8.97
12.89   89.72   34.87   55.45   17.62
 4.25   16.78   98.01    1.16   32.26
 0.90    0.78   11.87
step 2

[n2 != n1 lines of headers]

  3  3  2
 0.25    0.43   12.62    1.22    8.97
12.89   89.72   34.87   55.45   17.62
 4.25   16.78   98.01    1.16   32.26
 0.90    0.78   11.87
step 3

[(n3 != n1) and (n3 !=n2) lines of headers]

  3  3  2
 0.25    0.43   12.62    1.22    8.97
12.89   89.72   34.87   55.45   17.62
 4.25   16.78   98.01    1.16   32.26
 0.90    0.78   11.87

in other words:

A separator: step #

Headers of known length (line numbers, not bytes)

Data 3-dimensional shape: nz, ny, nx

Data: Fortran formatting, ~10 floats/line in the original dataset

I just want to extract the data, convert it to floats, put it in a numpy array and ndarray.reshape it to the shape given.
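To make that concrete, a minimal sketch of the conversion for a single block could look like this (parse_block is just an illustrative name; it assumes the block's lines, the shape line plus the data lines, have already been isolated as a list of strings):

import numpy as np

def parse_block(block_lines):
    # First line of the block gives the 3-D shape: nz, ny, nx
    nz, ny, nx = (int(v) for v in block_lines[0].split())
    # The remaining lines are whitespace-separated floats (~10 per line)
    values = np.array(" ".join(block_lines[1:]).split(), dtype=float)
    return values.reshape(nz, ny, nx)

For the sample block above it would return an array of shape (3, 3, 2) holding the 18 values.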

I've already done a bit of programming... The main idea is

  1. to get the offsets of each separator first ("step X")
  2. skip nX (n1, n2...) lines + 1 to reach the data
  3. read bytes from there all the way to the next separator.
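A sketch of that plan (read_dataset and n_header_lines are illustrative names; it assumes the file is opened in binary mode, that offset is the byte position of a "step" line, and that n_header_lines counts every line, blanks included, between the separator and the shape line):

import numpy as np

def read_dataset(f, offset, n_header_lines):
    f.seek(offset)
    f.readline()                      # the "step X" separator line
    for _ in range(n_header_lines):   # skip the known number of header lines
        f.readline()
    nz, ny, nx = (int(v) for v in f.readline().split())
    # Keep reading lines until the expected number of floats has been collected,
    # rather than scanning for the next separator
    values = []
    while len(values) < nz * ny * nx:
        values.extend(float(v) for v in f.readline().split())
    return np.array(values).reshape(nz, ny, nx)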

I wanted to avoid regular expressions at first since they would slow things down a lot. It already takes 3-4 minutes just to get the first step done (browsing the file to get the offset of each part).

The problem is that I'm basically using the file.tell() method to get the separator positions:

[file.tell() - len(sep) for line in file if sep in line]

The problem is two-fold:

  1. for smaller files, file.tell() gives the right separator positions; for longer files, it does not. I suspect that file.tell() should not be used inside loops, whether with an explicit file.readline() or with the implicit for line in file (I tried both). I don't know why, but the result is there: with big files, [file.tell() for line in file if sep in line] does not systematically give the position of the line right after a separator.
  2. len(sep) does not give the right offset correction to go back to the beginning of the "separator" line. sep is a string (bytes) containing the first line of the file (the first separator).
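One possible workaround for both issues would be to open the file in binary mode and accumulate the byte offsets manually instead of calling tell() inside the loop; a sketch:

sep = b"step "
offsets = []
pos = 0
with open("myfile", "rb") as f_in:
    for line in f_in:
        if line.startswith(sep):
            offsets.append(pos)   # offset of the start of the "step" line itself
        pos += len(line)

In binary mode one character is one byte, so the running total is an exact position that can later be passed to seek(), and recording it before adding len(line) removes the need for any len(sep) correction.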

Does anyone know how I should parse this?

NB: I find the offsets first because I want to be able to browse inside the file: I might just want the 10th dataset or the 50000th one...
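For illustration, with such an offsets list available, grabbing only the 50000th dataset is a single seek (again a sketch, reusing the names from the snippets above):

with open("myfile", "rb") as f:
    f.seek(offsets[49999])    # byte offset of the 50000th "step" line
    print(f.readline())       # the separator line, read directly from there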

1- Finding the offsets

sep = "step "
with open("myfile") as f_in:
    offsets = [fin.tell() for line in fin if sep in line]

As I said, this works on the simple example, but not on the big file.

New test:

sep = "step "
offsets = []
with open("myfile") as f_in:
    for line in f_in:
        if sep in line:
            print(line)
            offsets.append(f_in.tell())

The lines printed correspond to the separators, no doubt about it. But the offsets obtained with f_in.tell() do not correspond to the next line. I guess the file is buffered in memory, and as I try to use f_in.tell() inside the implicit loop, I do not get the current position but the end of the read-ahead buffer. This is just a wild guess.

I got the answer: for loops on a file and tell() do not get along very well, just like mixing for line in file and file.readline() raises an error.

So, use file.tell() with file.readline() or file.read() only.
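Concretely, a tell()-friendly version of the offset scan uses readline() in a while loop (a sketch of the same logic as above):

sep = "step "
offsets = []
with open("myfile") as f_in:
    while True:
        pos = f_in.tell()        # position *before* reading the line
        line = f_in.readline()
        if not line:             # end of file
            break
        if sep in line:
            offsets.append(pos)  # start of the separator line, no correction needed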

Never ever use:

for line in file:
    [do stuff]
    offset = file.tell()

This is really a shame but that's the way it is.
