Using python to extract a few lines from a data file

Question

I have a large file with an enormous amount of data in it. I need to extract 3 lines every 5000 or so lines. The format of the data file is as follows:

...

O_sh          9215    1.000000   -2.304400   
 -1.0680E+00  1.3617E+00 -5.7138E+00  
O_sh          9216    1.000000   -2.304400  
 -8.1186E-01 -1.7454E+00 -5.8169E+00  
timestep    501      9216         0         3    0.000500  
   20.54      -11.85       35.64      
  0.6224E-02   23.71       35.64      
  -20.54      -11.86       35.64      
Li               1    6.941000    0.843200
  3.7609E-02  1.1179E-01  4.1032E+00
Li               2    6.941000    0.843200
  6.6451E-02 -1.3648E-01  1.0918E+01

...

What I need is the three lines after the line that starts with "timestep", so in this case I need the 3x3 array:

   20.54      -11.85       35.64      
  0.6224E-02   23.71       35.64      
  -20.54      -11.86       35.64

in an output file for each time the word "timestep" appears.

Then I need the average of all those arrays as a single array, where each element is the average of the values at the same position in every array in the whole file. I've been working on this for a while, but I haven't been able to extract the data correctly yet.

Thanks so much, and this is not for homework. Your advice will be helping the progress of science! =)

Thanks,

Answer 1

Assuming this is not homework, I think regex is overkill for the problem. If you know that you need the three lines after one that starts with 'timestep', why not approach the problem this way:

Matrices = []

with open('data.txt') as fh:
  for line in fh:
    # If we see timestep put the next three lines in our Matrices list.
    if line.startswith('timestep'):
      Matrices.append([next(fh) for _ in range(3)])

Per the comments: use next(fh) in this situation to keep the file handle in sync when you pull the next three lines from it. Thanks!
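
To see how this behaves on a small sample, here is a minimal sketch that reads from an in-memory file instead of data.txt. The helper name extract_blocks and the sample text are made up for illustration, and the conversion to floats is an added step (the loop above keeps the lines as raw strings).

```python
import io

def extract_blocks(fh, n_rows=3):
    """Return the n_rows lines after every 'timestep' line, parsed as floats."""
    blocks = []
    for line in fh:
        if line.startswith('timestep'):
            # next(fh) consumes lines from the same iterator as the for loop,
            # so the loop resumes after the block instead of re-reading it.
            rows = [next(fh) for _ in range(n_rows)]
            blocks.append([[float(x) for x in row.split()] for row in rows])
    return blocks

# Hypothetical sample in the same layout as the question's data file.
sample = io.StringIO(
    "O_sh 9216 1.000000 -2.304400\n"
    "timestep 501 9216 0 3 0.000500\n"
    "   20.54      -11.85       35.64\n"
    "  0.6224E-02   23.71       35.64\n"
    "  -20.54      -11.86       35.64\n"
    "Li 1 6.941000 0.843200\n"
)
blocks = extract_blocks(sample)
print(blocks[0])
```

Note that if the file ends less than three lines after a 'timestep' line, next(fh) raises StopIteration, so a real script may want to handle that case.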

Answer 2

I'd suggest using a coroutine (which is basically a generator that can accept values, if you are unfamiliar) to keep a running average as you iterate over your file.

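def running_avg():
    # 'total' avoids shadowing the built-in sum()
    count, total = 0, 0.0
    value = yield None
    while True:
        # compare against None so that a legitimate value of 0.0 is still counted
        if value is not None:
            total += value
            count += 1
        value = yield (total / count)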

# array for keeping running average
array = [[running_avg() for y in range(3)] for x in range(3)]

# advance to first yield before we begin
for row in array:
    for elem in row:
        next(elem)  # built-in next() works on Python 2.6+ and Python 3

with open('data.txt') as f:
    idx = None
    for line in f:
        if idx is not None and idx < 3:
            for i, elem in enumerate(line.strip().split()):
                array[idx][i].send(float(elem))
            idx += 1
        if line.startswith('timestep'):
            idx = 0

To convert the array into a list of averages, just call next() on each coroutine; it returns the current average:

averages = [[next(elem) for elem in row] for row in array]

And you'd get something like:

averages = [[20.54, -11.85, 35.64], [0.006224, 23.71, 35.64], [-20.54, -11.86, 35.64]]
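
For readers who have not used send() before, the same idea can be sketched on a single stream of numbers. This is an illustrative variant rather than the exact code above: it is written for Python 3, the sample values are made up, and the None check is an assumption added so that a value of 0.0 is still counted.

```python
def running_avg():
    """Coroutine: send() a number, get back the mean of everything sent so far."""
    count, total = 0, 0.0
    average = None              # nothing received yet
    while True:
        value = yield average
        if value is not None:   # a plain next() sends None; real zeros still count
            total += value
            count += 1
            average = total / count

avg = running_avg()
next(avg)                       # prime the coroutine: run to the first yield
for v in (20.54, 80.80, 0.0):
    current = avg.send(v)       # each send() returns the updated mean
print(current)
```

Calling next(avg) afterwards returns the same mean again without changing it, which is what the averages list comprehension relies on.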

Answer 3

Okay, so you can do this:

Algorithm:

Read the file line by line
if the line starts with "timestep":
    read the next three lines
    take the average as needed

Code:

def getArrays(f):
    # Running element-wise average of every 3x3 block that follows a "timestep" line.
    answer = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    count = 0  # number of blocks averaged so far
    line = f.readline()
    while line:
        if line.strip().startswith("timestep"):
            rows = [getFloats(f.readline()) for _ in range(3)]
            for r in range(3):
                for c in range(3):
                    # new_mean = (old_mean * count + value) / (count + 1)
                    answer[r][c] = (answer[r][c] * count + rows[r][c]) / (count + 1)
            count += 1  # count blocks, not lines
        line = f.readline()
    return answer

def getFloats(line):
    # float() already understands exponent notation such as "0.6224E-02"
    return [float(num) for num in line.split()]

getArrays returns a single 3x3 array holding the element-wise average of all the 3x3 blocks in the file. If you want the averaging done differently, post the details and I can incorporate that into this algorithm.

Hope this helps

Answer 4

Building on inspectorG4dget's and gddc's posts, here's a version that should do the reading, parsing, and averaging. Please point out my bugs! :)

    def averageArrays(filename):
        # initialize average variables then,
        # open the file and iterate through the lines until ...
        answer, count = [[0.0]*3 for _ in range(3)], 0
        with open(filename) as fh:
            for line in fh:
                if line.startswith('timestep'):  # ... we find 'timestep'!
                    # so, we read the three lines and sanitize them
                    # conversion to float happens here, which may be slow
                    raw_mat = [next(fh).strip().split() for _ in range(3)]
                    mat = []
                    for row in raw_mat:
                        mat.append([float(item) for item in row])
                    # now, update the running average incrementally, which
                    # avoids keeping a large sum that could overflow; see
                    # http://invisibleblocks.wordpress.com/2008/07/30/long-running-averages-without-the-sum-of-preceding-values/
                    # there are surely more pythonic ways to do this
                    count += 1
                    for r in range(3):
                        for c in range(3):
                            answer[r][c] += (mat[r][c] - answer[r][c]) / count
        return answer
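
The update inside the loop is the standard incremental-mean identity, mean_n = mean_(n-1) + (x_n - mean_(n-1)) / n. As a rough sanity check, the sketch below compares it with the ordinary sum-and-divide mean; the helper name incremental_mean and the sample values are made up for illustration.

```python
def incremental_mean(values):
    """Mean computed one value at a time, without keeping a running sum."""
    mean, count = 0.0, 0
    for x in values:
        count += 1
        mean += (x - mean) / count
    return mean

values = [20.54, 80.80, -14580.0, 35.64]   # made-up sample values
direct = sum(values) / len(values)
# The two results should agree up to floating-point rounding.
difference = abs(incremental_mean(values) - direct)
print(difference)
```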

Answer 5

import re

text = '''O_sh          9215    1.000000   -2.304400
 -1.0680E+00  1.3617E+00 -5.7138E+00
O_sh          9216    1.000000   -2.304400
 -8.1186E-01 -1.7454E+00 -5.8169E+00
timestep    501      9216         0         3    0.000500
   20.54      -11.85       35.64
  0.6224E-02   23.71       35.64
  -20.54      -11.86       35.64
Li               1    6.941000    0.843200
  3.7609E-02  1.1179E-01  4.1032E+00
Li               2    6.941000    0.843200
  6.6451E-02 -1.3648E-01  1.0918E+01
O_sh          9215    1.000000   -2.304400
 -1.0680E+00  1.3617E+00 -5.7138E+00
O_sh          9216    1.000000   -2.304400
 -8.1186E-01 -1.7454E+00 -5.8169E+00
timestep    501      9216         0         3    0.000500
   80.80      -14580       42.28
  7.5224E-01   777.1       42.28
  140.54      -33.86       42.28
Li               1    6.941000    0.843200
  3.7609E-02  1.1179E-01  4.1032E+00
Li               2    6.941000    0.843200
  6.6451E-02 -1.3648E-01  1.0918E+01'''

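# one data line: three numeric fields separated by spaces or tabs
lin = r'\r?\n{0}*({1}+){0}+({1}+){0}+({1}+){0}*'
# a "timestep" header followed by three data lines (nine captured numbers)
pat = ('^timestep.+' + 3*lin).format(r'[ \t]', r'[.\deE+-]')
regx = re.compile(pat, re.MULTILINE)

def moy(x):
    # mean of a sequence of numeric strings
    return sum(map(float, x)) / len(x)

# zip(*...) groups the values that share a matrix position across all matches
li = [moy(col) for col in zip(*regx.findall(text))]
# regroup the nine averages into three rows of three
res = [tuple(li[i:i+3]) for i in range(0, len(li), 3)]
print(res)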

result

[(50.67, -7295.925, 38.96), (0.379232, 400.40500000000003, 38.96), (60.0, -22.86, 38.96)]
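
The step that does most of the work in this script is zip(*...), which regroups the captured values by position across matches. A minimal sketch of that step in isolation, using two made-up matches of three values each (the real pattern captures nine per match):

```python
# Two hypothetical matches, each a flat tuple of captured number strings.
matches = [("20.54", "-11.85", "35.64"),
           ("80.80", "-14580", "42.28")]

# zip(*matches) pairs up the values that share a position in every match.
columns = list(zip(*matches))

# Averaging each pair gives the element-wise mean across matches.
means = [sum(map(float, col)) / len(col) for col in columns]
print(columns[0])
print(means)
```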

5 answers

solution1
3 2011-05-09 16:27:24

solution2
2 ACCEPTED 2011-05-09 17:15:12

solution3
1 2011-05-09 16:26:54

solution4
0 2011-05-09 18:08:04

solution5
0 2011-05-09 19:32:33
