Extracting a few lines from a data file using Python

Question

I have a very large file with a lot of data in it. I need to extract 3 lines roughly every 5000 lines. The data file has the following format:

...

O_sh          9215    1.000000   -2.304400   
 -1.0680E+00  1.3617E+00 -5.7138E+00  
O_sh          9216    1.000000   -2.304400  
 -8.1186E-01 -1.7454E+00 -5.8169E+00  
timestep    501      9216         0         3    0.000500  
   20.54      -11.85       35.64      
  0.6224E-02   23.71       35.64      
  -20.54      -11.86       35.64      
Li               1    6.941000    0.843200
  3.7609E-02  1.1179E-01  4.1032E+00
Li               2    6.941000    0.843200
  6.6451E-02 -1.3648E-01  1.0918E+01

...

What I need are the three lines that follow each line starting with "timestep", so in this case I need the 3x3 array:

   20.54      -11.85       35.64      
  0.6224E-02   23.71       35.64      
  -20.54      -11.86       35.64

written to the output file, once for every occurrence of the word "timestep".

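Then I only need the average of all these arrays as one array: a single array for the whole file, in which each element is the average of the elements at the same position in every array. I have been working on this for a while, but I haven't been able to extract the data correctly.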

Thanks a lot, and this is not homework. Your advice will help the advancement of science! =)

Thanks,

Answer 1

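Assuming this isn't homework, I think regular expressions are overkill for this problem. If you know that you need the three lines after a line that starts with "timestep", why not approach the problem like this: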

Matrices = []

with open('data.txt') as fh:
  for line in fh:
    # If we see timestep put the next three lines in our Matrices list.
    if line.startswith('timestep'):
      Matrices.append([next(fh) for _ in range(3)])

Per the comments: in this case you want to use next(fh) so that the file handle stays in sync while you pull the next three lines from it. Thanks!
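The list built above holds raw text lines. Here is a minimal sketch of turning them into numbers and averaging element-wise, assuming every block has exactly three rows of three whitespace-separated numbers. The names SAMPLE, read_matrices and average_matrices are hypothetical, and io.StringIO stands in for the real file:

```python
import io

# Hypothetical sample standing in for data.txt: two timestep blocks.
SAMPLE = """\
timestep    501      9216         0         3    0.000500
   20.54      -11.85       35.64
  0.6224E-02   23.71       35.64
  -20.54      -11.86       35.64
Li               1    6.941000    0.843200
  3.7609E-02  1.1179E-01  4.1032E+00
timestep    502      9216         0         3    0.000500
   22.54      -13.85       37.64
  0.6224E-02   25.71       37.64
  -22.54      -13.86       37.64
"""

def read_matrices(fh):
    """Return one 3x3 list of floats per 'timestep' line in fh."""
    matrices = []
    for line in fh:
        if line.startswith('timestep'):
            # next(fh) consumes the following lines from the same iterator
            rows = [next(fh).split() for _ in range(3)]
            matrices.append([[float(x) for x in row] for row in rows])
    return matrices

def average_matrices(matrices):
    """Element-wise mean of a non-empty list of 3x3 matrices."""
    n = len(matrices)
    return [[sum(m[r][c] for m in matrices) / n for c in range(3)]
            for r in range(3)]

matrices = read_matrices(io.StringIO(SAMPLE))
average = average_matrices(matrices)
print(average)
```

This keeps every matrix in memory, which is fine at one block per 5000 lines; for a much larger number of blocks a running average avoids storing them all.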

Answer 2

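I would suggest using a coroutine (basically a generator that can accept values, in case you are not familiar with them) to keep a running average as you iterate over the file.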

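def running_avg():
    # 'total' avoids shadowing the built-in sum()
    count, total = 0, 0.0
    value = yield None
    while True:
        # compare against None so that a legitimate 0.0 sample is still counted
        if value is not None:
            total += value
            count += 1
        # yield None until at least one sample has arrived
        value = yield (total / count if count else None)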

# array for keeping running average
array = [[running_avg() for y in range(3)] for x in range(3)]

# advance each coroutine to its first yield before we begin
for row in array:
    for elem in row:
        next(elem)

with open('data.txt') as f:
    idx = None
    for line in f:
        if idx is not None and idx < 3:
            for i, elem in enumerate(line.strip().split()):
                array[idx][i].send(float(elem))
            idx += 1
        if line.startswith('timestep'):
            idx = 0

To turn array into a list of averages, just call next on each coroutine; it returns the current average:

averages = [[next(elem) for elem in row] for row in array]

And you will get something like:

averages = [[20.54, -11.85, 35.64], [0.006224, 23.71, 35.64], [-20.54, -11.86, 35.64]]
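To see the coroutine protocol in isolation, here is a small sketch in Python 3 syntax. The name running_mean is hypothetical (it is not the function above); it shows the priming next call and how each send returns the updated mean:

```python
def running_mean():
    """Generator-based coroutine: send() a number, get the mean so far."""
    total, count = 0.0, 0
    mean = None          # nothing received yet
    while True:
        value = yield mean
        if value is not None:   # 0.0 is a valid sample, so test for None
            total += value
            count += 1
            mean = total / count

avg = running_mean()
next(avg)                 # prime: run to the first yield
first = avg.send(10.0)    # mean of [10.0]
second = avg.send(0.0)    # mean of [10.0, 0.0]
current = next(avg)       # read the current mean without adding a sample
print(first, second, current)
```

Calling next on a running coroutine is the same as sending None, which is why the None check lets you read the average without disturbing it.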

Answer 3

OK, so you can do this:

Algorithm:

Read the file line by line
if the line starts with "timestep":
    read the next three lines
    take the average as needed

Code:

def getArrays(f):
    answer = [[0.0, 0.0, 0.0] for _ in range(3)]
    count = 0  # number of timestep blocks averaged so far
    line = f.readline()
    while line:
        if line.strip().startswith("timestep"):
            rows = [getFloats(f.readline().strip()) for _ in range(3)]
            # fold the new block into the running average, element by element
            for r in range(3):
                for c in range(3):
                    answer[r][c] = ((answer[r][c] * count) + rows[r][c]) / (count + 1)
            count += 1
        line = f.readline()
    return answer

def getFloats(line):
    # float() already parses exponent notation such as "0.6224E-02"
    return [float(num) for num in line.split()]

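Now, answer holds the element-wise average of all the 3x3 arrays. If you want a different kind of average, post it and I can incorporate it into this algorithm. Otherwise, you can write a function that takes my array and computes the average you need.

Hope this helps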
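A note on the number format: Python's built-in float understands exponent notation such as 0.6224E-02 directly, so the parsing step needs no special handling for the "E". A minimal sketch, with parse_row as a hypothetical name:

```python
def parse_row(line):
    """Split a whitespace-separated line and convert every field to float."""
    return [float(field) for field in line.split()]

row = parse_row("  0.6224E-02   23.71       35.64      ")
print(row)
```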

Answer 4

Building on the posts by inspectorG4dget and gddc, here is a version that should do the reading, parsing and averaging. Please point out my mistakes! :)

    def averageArrays(filename):
        # initialize average variables then,
        # open the file and iterate through the lines until ...
        answer, count = [[0.0]*3 for _ in range(3)], 0
        with open(filename) as fh:
            for line in fh:
                if line.startswith('timestep'):  # ... we find 'timestep'!
                    # so , we read the three lines and sanitize them
                    # conversion to float happens here, which may be slow
                    raw_mat = [next(fh).strip().split() for _ in range(3)]
                    mat = []
                    for row in raw_mat:
                        mat.append([float(item) for item in row])
                    # now, update the running average, noting overflows as by
                    # http://invisibleblocks.wordpress.com/2008/07/30/long-running-averages-without-the-sum-of-preceding-values/
                    # there are surely more pythonic ways to do this
                    count += 1
                    for r in range(3):
                        for c in range(3):
                            answer[r][c] += (mat[r][c] - answer[r][c]) / count
        return answer
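The update rule used above, mean += (x - mean) / count, gives the same result as summing everything and dividing once, but never holds a large sum. Here is a small check on made-up numbers, with incremental_mean as a hypothetical helper:

```python
def incremental_mean(values):
    """Mean computed one sample at a time, without keeping a running sum."""
    mean, count = 0.0, 0
    for x in values:
        count += 1
        mean += (x - mean) / count
    return mean

samples = [20.54, 80.80, -11.85, 35.64]   # made-up sample values
direct = sum(samples) / len(samples)
print(incremental_mean(samples), direct)
```

The two printed values can differ in the last few digits because of floating-point rounding, so compare them with a tolerance rather than with ==.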

Answer 5

import re

text = '''O_sh          9215    1.000000   -2.304400
 -1.0680E+00  1.3617E+00 -5.7138E+00
O_sh          9216    1.000000   -2.304400
 -8.1186E-01 -1.7454E+00 -5.8169E+00
timestep    501      9216         0         3    0.000500
   20.54      -11.85       35.64
  0.6224E-02   23.71       35.64
  -20.54      -11.86       35.64
Li               1    6.941000    0.843200
  3.7609E-02  1.1179E-01  4.1032E+00
Li               2    6.941000    0.843200
  6.6451E-02 -1.3648E-01  1.0918E+01
O_sh          9215    1.000000   -2.304400
 -1.0680E+00  1.3617E+00 -5.7138E+00
O_sh          9216    1.000000   -2.304400
 -8.1186E-01 -1.7454E+00 -5.8169E+00
timestep    501      9216         0         3    0.000500
   80.80      -14580       42.28
  7.5224E-01   777.1       42.28
  140.54      -33.86       42.28
Li               1    6.941000    0.843200
  3.7609E-02  1.1179E-01  4.1032E+00
Li               2    6.941000    0.843200
  6.6451E-02 -1.3648E-01  1.0918E+01'''

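# one data row: three numeric fields separated by spaces or tabs
lin = r'\r?\n{0}*({1}+){0}+({1}+){0}+({1}+){0}*'
# a "timestep" line followed by three such rows (nine captured fields)
pat = ('^timestep.+' + 3*lin).format(r'[ \t]', r'[.\deE+-]')
regx = re.compile(pat, re.MULTILINE)

def moy(x):
    # mean of a sequence of numeric strings
    return sum(map(float, x)) / len(x)

# transpose the matches so each tuple holds one matrix position across all blocks
li = [moy(col) for col in zip(*regx.findall(text))]
# regroup the nine averages into three rows of three
res = [tuple(li[i:i+3]) for i in range(0, len(li), 3)]
print(res)

Result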

[(50.67, -7295.925, 38.96), (0.379232, 400.40500000000003, 38.96), (60.0, -22.86, 38.96)]
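The key step above is zip(*...), which transposes the list of matches so that values from the same matrix position end up together. A tiny sketch with made-up tuples in place of the regex output:

```python
# Each tuple stands for the captured strings of one block
# (shortened to three values for illustration).
matches = [('1.0', '2.0', '3.0'),
           ('3.0', '4.0', '5.0')]

# zip(*matches) groups the first values together, then the second, and so on.
columns = list(zip(*matches))

means = [sum(map(float, col)) / len(col) for col in columns]
print(columns)
print(means)
```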


Solution 1
3 2011-05-09 16:27:24

Solution 2
2 accepted 2011-05-09 17:15:12

Solution 3
1 2011-05-09 16:26:54

Solution 4
0 2011-05-09 18:08:04

Solution 5
0 2011-05-09 19:32:33
