Reading multiple lines after a keyword?

Question

I have an output file that prints out a matrix of numerical data. I need to search this file for the identifier at the start of each data set, which is:

GROUP      1 FIRST      1 LAST    163

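Here GROUP 1 is the first column of the matrix, FIRST 1 means the first non-zero element of that column is at position 1, and LAST 163 means the last non-zero element is at position 163. The matrix does not necessarily end at this LAST value; in this case, there are 172 values.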
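To make the header layout concrete, here is a minimal sketch of pulling the three integers out of such a line with a regular expression. The name `parse_header` is made up for illustration, and the pattern assumes the fields are always separated by runs of whitespace in the order GROUP, FIRST, LAST.

```python
import re

# Assumed layout: "GROUP <n> FIRST <n> LAST <n>", separated by runs of spaces.
HEADER = re.compile(r"GROUP\s+(\d+)\s+FIRST\s+(\d+)\s+LAST\s+(\d+)")


def parse_header(line):
    """Return (group, first, last) as ints, or None if the line is not a header."""
    match = HEADER.search(line)
    if match is None:
        return None
    group, first, last = (int(g) for g in match.groups())
    return group, first, last


print(parse_header("GROUP      1 FIRST      1 LAST    163"))  # (1, 1, 163)
print(parse_header("  7.150814E-02  9.866657E-03"))           # None
```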

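I would like to read this data into a form that is easier to work with. Here is an example of the results for the first two columns: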

GROUP      1 FIRST      1 LAST    163
  7.150814E-02  9.866657E-03  8.500540E-04  1.818338E-03  2.410691E-03  3.284499E-03  3.011986E-03  1.612432E-03
  1.674247E-03  3.436244E-03  3.655873E-03  4.056876E-03  4.560725E-03  2.462454E-03  2.567764E-03  5.359393E-03
  5.457415E-03  2.679373E-03  2.600020E-03  2.491592E-03  2.365089E-03  2.228494E-03  5.792616E-03  1.623274E-03
  1.475062E-03  1.331820E-03  1.195052E-03  2.832699E-03  7.298341E-04  6.301271E-04  1.377459E-03  1.048925E-03
  1.677453E-04  3.580640E-04  1.575301E-04  1.150545E-04  1.197719E-04  2.950028E-05  5.380539E-05  1.228784E-05
  1.627659E-05  4.522051E-05  7.736908E-06  1.758838E-05  8.161204E-06  6.103670E-06  6.431876E-06  1.585671E-06
  4.110246E-06  4.512924E-07  2.775227E-06  5.107739E-07  1.219448E-06  1.653674E-07  4.429047E-07  4.837661E-07
  2.036820E-07  3.449548E-07  1.457648E-07  4.494116E-07  1.629392E-07  1.300509E-07  1.730199E-07  8.130338E-08
  1.591993E-08  5.457638E-08  1.713141E-08  7.806754E-09  1.154869E-08  3.545961E-09  2.862203E-09  2.289470E-09
  4.324002E-09  2.243199E-09  2.627165E-09  2.273119E-09  1.973867E-09  1.710714E-09  1.468845E-09  1.772236E-09
  1.764492E-09  1.004393E-09  1.044698E-09  5.201382E-10  2.660613E-10  3.012732E-10  2.630323E-10  4.381052E-10
  2.521794E-10  9.213524E-11  2.619283E-10  3.591906E-11  1.449830E-10  1.867363E-11  1.230445E-10  1.108149E-11
  2.775004E-11  1.156249E-11  4.393752E-11  5.318751E-11  6.815569E-12  1.817489E-11  2.044674E-11  2.044673E-11
  1.931080E-11  1.931076E-11  1.817484E-11  2.044668E-11  5.486837E-12  7.681572E-12  1.536314E-11  7.132886E-12
  8.230253E-12  1.426577E-11  1.426577E-11  4.389468E-12  5.925780E-12  2.853153E-12  2.853153E-12  5.706307E-12
  5.706307E-12  2.194733E-12  3.292099E-12  5.267358E-12  2.194733E-12  3.072626E-12  4.828412E-12  4.389466E-12
  4.389465E-12  1.097366E-11  2.194732E-12  1.316839E-11  2.194732E-12  1.608784E-11  1.674222E-11  1.778860E-11
  6.993074E-12  2.622402E-12  9.090994E-12  5.769285E-12  1.573441E-12  6.861030E-12  4.782885E-12  8.768619E-13
  2.311727E-12  3.188589E-12  4.393636E-12  3.844430E-12  4.256331E-12  1.235709E-12  2.746020E-12  2.746020E-12
  8.238059E-13  2.608719E-12  1.445203E-12  4.817344E-13  1.445203E-12  7.609642E-14  2.536547E-13  2.000924E-13
  7.075681E-14  7.075681E-14  3.056704E-14
GROUP      2 FIRST      2 LAST    168
  6.740271E-02  8.310813E-03  3.609403E-03  1.307012E-03  2.949375E-03  3.605043E-03  1.612647E-03  1.640960E-03
  3.597806E-03  4.022993E-03  4.289805E-03  4.480576E-03  2.352539E-03  2.415121E-03  5.018262E-03  5.188098E-03
  2.589224E-03  2.546116E-03  2.472462E-03  2.374431E-03  2.260519E-03  5.981164E-03  1.700972E-03  1.556116E-03
  1.410140E-03  1.273499E-03  3.061941E-03  7.995844E-04  6.967963E-04  1.553994E-03  1.216266E-03  1.997540E-04
  4.426460E-04  1.990445E-04  1.470610E-04  1.539762E-04  3.814900E-05  7.024764E-05  1.611156E-05  2.136422E-05
  5.984886E-05  1.035646E-05  2.363444E-05  1.105747E-05  8.308678E-06  8.789299E-06  2.257693E-06  5.807418E-06
  6.248625E-07  3.822327E-06  6.987942E-07  1.660586E-06  2.240283E-07  5.983062E-07  6.513773E-07  2.735403E-07
  4.614998E-07  1.940877E-07  5.895136E-07  2.081549E-07  1.662117E-07  2.316650E-07  1.101916E-07  2.162701E-08
  7.493990E-08  2.341661E-08  1.072330E-08  1.606536E-08  4.945307E-09  3.936301E-09  3.147244E-09  5.945972E-09
  3.108514E-09  3.682241E-09  3.210760E-09  2.795020E-09  2.436545E-09  2.118219E-09  2.612622E-09  2.586657E-09
  1.432507E-09  1.457386E-09  7.264341E-10  3.803348E-10  4.514677E-10  3.959518E-10  6.541553E-10  3.707172E-10
  1.334816E-10  3.875547E-10  5.294296E-11  2.294557E-10  2.790137E-11  1.719152E-10  1.408339E-11  3.526731E-11
  1.469469E-11  5.583990E-11  6.759567E-11  8.766360E-12  2.337697E-11  2.629908E-11  2.629908E-11  2.483802E-11
  2.483802E-11  2.337697E-11  2.629908E-11  7.112706E-12  9.957791E-12  1.991557E-11  9.246516E-12  1.066906E-11
  1.849303E-11  1.849303E-11  5.690165E-12  7.681722E-12  3.698607E-12  3.698607E-12  7.397214E-12  7.397214E-12
  2.845082E-12  4.267624E-12  6.828199E-12  2.845082E-12  3.983115E-12  6.259180E-12  5.690165E-12  5.690165E-12
  1.422541E-11  2.845082E-12  1.707049E-11  2.845082E-12  2.095991E-11  2.193285E-11  2.330364E-11  1.096642E-11
  4.112407E-12  1.425635E-11  8.906802E-12  2.429128E-12  1.106603E-11  8.097092E-12  1.484468E-12  3.913596E-12
  5.398063E-12  8.624785E-12  7.546689E-12  8.355261E-12  2.425721E-12  5.390492E-12  5.390492E-12  1.617147E-12
  5.120967E-12  2.710198E-12  9.033993E-13  2.710198E-12  3.744092E-13  1.248030E-12  6.614939E-13  4.359798E-13
  4.359798E-13  1.364861E-13  4.856661E-15  4.856661E-15  4.856661E-15  4.856661E-15  4.856661E-15

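What I have currently works, except that it only reads the first line after the GROUP keyword line. How can I make it keep reading data until it reaches the next GROUP keyword?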

file_name = "test_data.txt"

import re
import io

group_pattern = re.compile(r"GROUP +\d+ FIRST +(?P<first>\d+) LAST +(?P<last>\d+)")


def read_data_from_file(file_name, start_identifier, end_identifier):
    results = []
    longest = 0

    with open(file_name) as file:
        t = file.read()
        t=t[t.find('MACRO'):]
        t=t[t.find(start_identifier)+len(start_identifier):t.find(end_identifier)]
        t=io.StringIO(t)
        for line in t:
            match = group_pattern.search(line)
            if match:
                first = int(match.group('first'))
                last = int(match.group('last'))
                data = [float(value) for value in next(t).split()]
                row = [0.0] * last
                for i, value in enumerate(data, start=first-1):
                    row[i] = value
                longest = max(longest, len(row))
                results.append(row)

    for row in results:
        if len(row) < longest:
            row.extend([0.0] * (longest-len(row)))
    return results

start_identifier = "SCATTER MOMENT      1"
end_identifier = "SCATTER MOMENT      2"

results = read_data_from_file(file_name, start_identifier, end_identifier)
print(results)

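What I want the code to produce is a matrix containing only the numerical data. In this case its size would be [2x168], but my full data set is [172x172]. I want each GROUP read in as one row of the matrix, with zeros filled in for every element that is not specified in the output data. The current code does almost all of this, except that it only reads the first line of data after the GROUP keyword line.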

Answer 1

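So I took a look at the data you provided in the question. I found what I think is a better and simpler way to extract these data points from the file. I did notice, however, that your code also looks for other things in the file that are not in the test data you posted, so you may need to adapt this slightly to work with your data set.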

def read_data_from_file(file_name):
    with open(file_name) as fp:
        index = -1
        matrices = []

        # Iterate over the file line by line, which keeps memory usage low
        for line in fp:

            # Skip blank lines so they are not mistaken for headers
            if not line.strip():
                continue

            # Headers are always on their own line and data lines always begin with
            # two spaces, so we can just look for lines that start with two spaces.
            # A line without these leading spaces is a header line: add a new
            # list to matrices and add one to index
            if not line.startswith('  '):
                index += 1
                matrices.append([])

            else:
                # Split on whitespace to get each data point; this also drops
                # the leading spaces and the trailing newline
                str_data_points = line.split()

                # Map the string data points to floats
                float_data_points = map(float, str_data_points)

                # Add those float data points to the list in matrices via index
                matrices[index].extend(float_data_points)

        max_matrix_length = max(map(len, matrices))

        for matrix in matrices:
            matrix.extend([0.0] * (max_matrix_length - len(matrix)))

        return matrices
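
If the values also need to land at their FIRST position, as the question asks, the line-by-line approach above can be combined with the header regex from the question. The following is a sketch under stated assumptions rather than a drop-in replacement: `read_groups` is a hypothetical name, it accepts any iterable of lines (an open file works), it assumes FIRST is a 1-based position, it assumes every line after a header contains only numbers, and it ignores data lines that appear before the first GROUP header.

```python
import re

# Same header pattern as in the question.
group_pattern = re.compile(r"GROUP +\d+ FIRST +(?P<first>\d+) LAST +(?P<last>\d+)")


def read_groups(lines):
    """Build one zero-padded row per GROUP from an iterable of text lines."""
    rows = []          # one list of floats per GROUP
    position = None    # index in rows[-1] where the next value goes

    for line in lines:
        match = group_pattern.search(line)
        if match:
            # New header: start a row of zeros up to LAST and move the
            # write position to FIRST (converted to a 0-based index).
            rows.append([0.0] * int(match.group("last")))
            position = int(match.group("first")) - 1
        elif position is not None:
            # Data line: keep filling the current row until the next
            # header shows up.
            for token in line.split():
                value = float(token)
                if position < len(rows[-1]):
                    rows[-1][position] = value
                else:
                    rows[-1].append(value)  # more values than LAST announced
                position += 1

    # Pad every row to the length of the longest one.
    longest = max((len(row) for row in rows), default=0)
    for row in rows:
        row.extend([0.0] * (longest - len(row)))
    return rows


sample = [
    "GROUP      1 FIRST      1 LAST      3",
    "  1.0E+00  2.0E+00",
    "  3.0E+00",
    "GROUP      2 FIRST      2 LAST      4",
    "  4.0E+00  5.0E+00  6.0E+00",
]
print(read_groups(sample))
# [[1.0, 2.0, 3.0, 0.0], [0.0, 4.0, 5.0, 6.0]]
```

Because the write position is carried across data lines, a group can span any number of lines; the second row in the sample starts with a zero because its FIRST is 2.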

Answer 2

Here is my solution for reading the data from a .txt file and producing a matrix-like output (each group is padded with 0.0 at the end).

import re

def read_data_from_file(file_path):
    GROUP_DATA = []
    MAX_ELEMENT_COUNT = 0

    with open(file_path) as f:
        for line in f:
            if 'GROUP' in line:
                GROUP_DATA.append([])
                MAX_ELEMENT_COUNT = max(MAX_ELEMENT_COUNT, int(re.findall(r'\d+', line)[-1]))
            else:
                values = line.split(' ')
                for value in values:
                    try:
                        GROUP_DATA[-1].append(float(value))
                    except ValueError:
                        pass

    for DATA in GROUP_DATA:
        if len(DATA) < MAX_ELEMENT_COUNT:
            DATA += [0.0] * (MAX_ELEMENT_COUNT - len(DATA))

    return GROUP_DATA

For the data given in the question, saved to data.txt, the output is as follows:

>>> import numpy as np  # just to check the output shape
>>> mat = read_data_from_file('data.txt')
>>> np.shape(mat)  # the output shape is as expected
(2, 168)

The size of the output matrix adapts to the given data.
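
A cheap way to confirm that every data line of a group was picked up is to count the values between headers and compare the count with LAST - FIRST + 1. This assumes FIRST and LAST are 1-based inclusive positions, which is consistent with the sample in the question (163 values for FIRST 1 LAST 163, and 167 values for FIRST 2 LAST 168). `check_group_sizes` is a hypothetical helper, not part of either answer, and it assumes every non-header line after a header holds only whitespace-separated numbers.

```python
import re

HEADER = re.compile(r"GROUP +(\d+) FIRST +(\d+) LAST +(\d+)")


def check_group_sizes(lines):
    """Return (group, expected, found) for each group whose value count differs."""
    counts = []  # one [group, expected, found] entry per header seen
    for line in lines:
        match = HEADER.search(line)
        if match:
            group, first, last = map(int, match.groups())
            counts.append([group, last - first + 1, 0])
        elif counts:
            # Count the whitespace-separated tokens on a data line.
            counts[-1][2] += len(line.split())
    return [tuple(entry) for entry in counts if entry[1] != entry[2]]
```

An empty result means every group had as many values as its header announced; with an open file it could be called as `check_group_sizes(fp)`.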

Answer 1
2 accepted 2021-06-04 16:06:29

Answer 2
0 2021-06-04 16:55:30
