Reading from more than one line after keyword?
I have an output file that prints a matrix of numeric data. I need to search this file for the identifier at the start of each data set, which is:
GROUP 1 FIRST 1 LAST 163
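Here GROUP 1 is the first column of the matrix, FIRST 1 means the first non-zero element of that column is at position 1, and LAST 163 means the last non-zero element is at position 163. The matrix does not necessarily end at this LAST value; in this case there are 172 values.
I want to read this data into a simpler form to work with. Here is an example of the results for the first two columns: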
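As a minimal sketch of how such a header line could be taken apart, the snippet below uses a regular expression with named groups. The names `HEADER` and `parse_header`, and the extra `group` capture, are assumptions made for illustration; it also assumes the fields are separated by one or more spaces.

```python
import re

# Hypothetical pattern: one or more spaces between every field.
HEADER = re.compile(
    r"GROUP +(?P<group>\d+) +FIRST +(?P<first>\d+) +LAST +(?P<last>\d+)"
)

def parse_header(line):
    """Return (group, first, last) as ints, or None if the line is not a header."""
    match = HEADER.search(line)
    if match is None:
        return None
    return tuple(int(match.group(name)) for name in ("group", "first", "last"))

print(parse_header("GROUP 1 FIRST 1 LAST 163"))   # (1, 1, 163)
print(parse_header("7.150814E-02 9.866657E-03"))  # None
```

Returning `None` for data lines lets a caller use the same function both to detect a header and to read its fields.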
GROUP 1 FIRST 1 LAST 163
7.150814E-02 9.866657E-03 8.500540E-04 1.818338E-03 2.410691E-03 3.284499E-03 3.011986E-03 1.612432E-03
1.674247E-03 3.436244E-03 3.655873E-03 4.056876E-03 4.560725E-03 2.462454E-03 2.567764E-03 5.359393E-03
5.457415E-03 2.679373E-03 2.600020E-03 2.491592E-03 2.365089E-03 2.228494E-03 5.792616E-03 1.623274E-03
1.475062E-03 1.331820E-03 1.195052E-03 2.832699E-03 7.298341E-04 6.301271E-04 1.377459E-03 1.048925E-03
1.677453E-04 3.580640E-04 1.575301E-04 1.150545E-04 1.197719E-04 2.950028E-05 5.380539E-05 1.228784E-05
1.627659E-05 4.522051E-05 7.736908E-06 1.758838E-05 8.161204E-06 6.103670E-06 6.431876E-06 1.585671E-06
4.110246E-06 4.512924E-07 2.775227E-06 5.107739E-07 1.219448E-06 1.653674E-07 4.429047E-07 4.837661E-07
2.036820E-07 3.449548E-07 1.457648E-07 4.494116E-07 1.629392E-07 1.300509E-07 1.730199E-07 8.130338E-08
1.591993E-08 5.457638E-08 1.713141E-08 7.806754E-09 1.154869E-08 3.545961E-09 2.862203E-09 2.289470E-09
4.324002E-09 2.243199E-09 2.627165E-09 2.273119E-09 1.973867E-09 1.710714E-09 1.468845E-09 1.772236E-09
1.764492E-09 1.004393E-09 1.044698E-09 5.201382E-10 2.660613E-10 3.012732E-10 2.630323E-10 4.381052E-10
2.521794E-10 9.213524E-11 2.619283E-10 3.591906E-11 1.449830E-10 1.867363E-11 1.230445E-10 1.108149E-11
2.775004E-11 1.156249E-11 4.393752E-11 5.318751E-11 6.815569E-12 1.817489E-11 2.044674E-11 2.044673E-11
1.931080E-11 1.931076E-11 1.817484E-11 2.044668E-11 5.486837E-12 7.681572E-12 1.536314E-11 7.132886E-12
8.230253E-12 1.426577E-11 1.426577E-11 4.389468E-12 5.925780E-12 2.853153E-12 2.853153E-12 5.706307E-12
5.706307E-12 2.194733E-12 3.292099E-12 5.267358E-12 2.194733E-12 3.072626E-12 4.828412E-12 4.389466E-12
4.389465E-12 1.097366E-11 2.194732E-12 1.316839E-11 2.194732E-12 1.608784E-11 1.674222E-11 1.778860E-11
6.993074E-12 2.622402E-12 9.090994E-12 5.769285E-12 1.573441E-12 6.861030E-12 4.782885E-12 8.768619E-13
2.311727E-12 3.188589E-12 4.393636E-12 3.844430E-12 4.256331E-12 1.235709E-12 2.746020E-12 2.746020E-12
8.238059E-13 2.608719E-12 1.445203E-12 4.817344E-13 1.445203E-12 7.609642E-14 2.536547E-13 2.000924E-13
7.075681E-14 7.075681E-14 3.056704E-14
GROUP 2 FIRST 2 LAST 168
6.740271E-02 8.310813E-03 3.609403E-03 1.307012E-03 2.949375E-03 3.605043E-03 1.612647E-03 1.640960E-03
3.597806E-03 4.022993E-03 4.289805E-03 4.480576E-03 2.352539E-03 2.415121E-03 5.018262E-03 5.188098E-03
2.589224E-03 2.546116E-03 2.472462E-03 2.374431E-03 2.260519E-03 5.981164E-03 1.700972E-03 1.556116E-03
1.410140E-03 1.273499E-03 3.061941E-03 7.995844E-04 6.967963E-04 1.553994E-03 1.216266E-03 1.997540E-04
4.426460E-04 1.990445E-04 1.470610E-04 1.539762E-04 3.814900E-05 7.024764E-05 1.611156E-05 2.136422E-05
5.984886E-05 1.035646E-05 2.363444E-05 1.105747E-05 8.308678E-06 8.789299E-06 2.257693E-06 5.807418E-06
6.248625E-07 3.822327E-06 6.987942E-07 1.660586E-06 2.240283E-07 5.983062E-07 6.513773E-07 2.735403E-07
4.614998E-07 1.940877E-07 5.895136E-07 2.081549E-07 1.662117E-07 2.316650E-07 1.101916E-07 2.162701E-08
7.493990E-08 2.341661E-08 1.072330E-08 1.606536E-08 4.945307E-09 3.936301E-09 3.147244E-09 5.945972E-09
3.108514E-09 3.682241E-09 3.210760E-09 2.795020E-09 2.436545E-09 2.118219E-09 2.612622E-09 2.586657E-09
1.432507E-09 1.457386E-09 7.264341E-10 3.803348E-10 4.514677E-10 3.959518E-10 6.541553E-10 3.707172E-10
1.334816E-10 3.875547E-10 5.294296E-11 2.294557E-10 2.790137E-11 1.719152E-10 1.408339E-11 3.526731E-11
1.469469E-11 5.583990E-11 6.759567E-11 8.766360E-12 2.337697E-11 2.629908E-11 2.629908E-11 2.483802E-11
2.483802E-11 2.337697E-11 2.629908E-11 7.112706E-12 9.957791E-12 1.991557E-11 9.246516E-12 1.066906E-11
1.849303E-11 1.849303E-11 5.690165E-12 7.681722E-12 3.698607E-12 3.698607E-12 7.397214E-12 7.397214E-12
2.845082E-12 4.267624E-12 6.828199E-12 2.845082E-12 3.983115E-12 6.259180E-12 5.690165E-12 5.690165E-12
1.422541E-11 2.845082E-12 1.707049E-11 2.845082E-12 2.095991E-11 2.193285E-11 2.330364E-11 1.096642E-11
4.112407E-12 1.425635E-11 8.906802E-12 2.429128E-12 1.106603E-11 8.097092E-12 1.484468E-12 3.913596E-12
5.398063E-12 8.624785E-12 7.546689E-12 8.355261E-12 2.425721E-12 5.390492E-12 5.390492E-12 1.617147E-12
5.120967E-12 2.710198E-12 9.033993E-13 2.710198E-12 3.744092E-13 1.248030E-12 6.614939E-13 4.359798E-13
4.359798E-13 1.364861E-13 4.856661E-15 4.856661E-15 4.856661E-15 4.856661E-15 4.856661E-15
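What I currently have works, except that it only reads the first line after the GROUP keyword line. How can I make it keep reading data until it reaches the next GROUP keyword?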
file_name = "test_data.txt"
import re
import io

group_pattern = re.compile(r"GROUP +\d+ FIRST +(?P<first>\d+) LAST +(?P<last>\d+)")

def read_data_from_file(file_name, start_identifier, end_identifier):
    results = []
    longest = 0
    with open(file_name) as file:
        t = file.read()
        t=t[t.find('MACRO'):]
        t=t[t.find(start_identifier)+len(start_identifier):t.find(end_identifier)]
        t=io.StringIO(t)
        for line in t:
            match = group_pattern.search(line)
            if match:
                first = int(match.group('first'))
                last = int(match.group('last'))
                data = [float(value) for value in next(t).split()]
                row = [0.0] * last
                for i, value in enumerate(data, start=first-1):
                    row[i] = value
                longest = max(longest, len(row))
                results.append(row)
    for row in results:
        if len(row) < longest:
            row.extend([0.0] * (longest-len(row)))
    return results

start_identifier = "SCATTER MOMENT 1"
end_identifier = "SCATTER MOMENT 2"
results = read_data_from_file(file_name, start_identifier, end_identifier)
print(results)
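What I want the code to produce is a matrix containing only the numeric data. In this case its size is [2x168], but my full data set is [172x172]. I want each GROUP read in as one row of the matrix, with zeros filled into every element that is not specified in the output data. The current code does almost all of this, except that it only reads the first line of data after the GROUP keyword line.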
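One way the behaviour described above could be sketched: keep appending values to the current row until the next GROUP header appears, starting each row with FIRST - 1 leading zeros. This is only an illustration, not the asker's final code. The name `read_groups` and the tiny sample are made up, and the sketch assumes every non-header line after a header contains only numbers.

```python
import re

GROUP_PATTERN = re.compile(r"GROUP +\d+ +FIRST +(?P<first>\d+) +LAST +(?P<last>\d+)")

def read_groups(lines):
    """Build one zero-padded row per GROUP block from an iterable of text lines."""
    rows = []
    current = None  # the row being filled, or None before the first header
    for line in lines:
        match = GROUP_PATTERN.search(line)
        if match:
            # New block: positions before FIRST are zeros.
            current = [0.0] * (int(match.group("first")) - 1)
            rows.append(current)
        elif current is not None:
            # Assumption: every other line after a header holds only numbers.
            current.extend(float(value) for value in line.split())
    # Pad every row on the right to the length of the longest row.
    longest = max((len(row) for row in rows), default=0)
    for row in rows:
        row.extend([0.0] * (longest - len(row)))
    return rows

# Made-up miniature of the file layout, not real data.
sample = [
    "GROUP 1 FIRST 1 LAST 3",
    "1.0E-02 2.0E-02",
    "3.0E-02",
    "GROUP 2 FIRST 2 LAST 4",
    "4.0E-02 5.0E-02",
    "6.0E-02",
]
print(read_groups(sample))
# [[0.01, 0.02, 0.03, 0.0], [0.0, 0.04, 0.05, 0.06]]
```

Because the function accepts any iterable of lines, it can be fed an open file object or the `io.StringIO` slice between the start and end identifiers.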
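So I looked at the data you provided in the question, and I found what I think is a better and simpler way to extract the data points from that file. However, I noticed that your code also looks for other things in the file that are not in the test data you posted, so you may need to adjust this slightly to work with your data set.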
def read_data_from_file(file_name):
    with open(file_name) as fp:
        index = -1
        matrices = []
        # Iterate over the file line by line via iter. Reduces memory usage
        for line in fp:
            # Since headers are always on their own line and data lines always begin with
            # two spaces, we can just look for lines that start with two spaces.
            # If we find a line without these spaces then it's the header line: add a new
            # list to matrices and add one to index
            if not line.startswith('  '):
                index += 1
                matrices.append([])
            else:
                # Slice str at index 2 to ignore the first two spaces
                # Then split by two spaces to get each data point
                str_data_points = line[2:].split('  ')
                # Map the string data points to floats
                float_data_points = map(lambda s: float(s), str_data_points)
                # Add those float data points to the list in matrices via index
                matrices[index].extend(float_data_points)
    max_matrix_length = max(map(lambda matrix: len(matrix), matrices))
    for matrix in matrices:
        matrix.extend([0.0] * (max_matrix_length - len(matrix)))
    return matrices
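The approach above depends on the exact spacing in the file. If that spacing is not guaranteed, a hedged alternative is to detect headers by the GROUP keyword and split on any run of whitespace. The helper names `is_header` and `split_values` are hypothetical.

```python
def is_header(line):
    """Treat any line whose first word is GROUP as a header line."""
    return line.lstrip().startswith("GROUP")

def split_values(line):
    """Parse a data line into floats, whatever the spacing between values."""
    # str.split() with no argument splits on any whitespace run and
    # ignores leading and trailing whitespace, including the newline.
    return [float(token) for token in line.split()]

print(is_header("GROUP 1 FIRST 1 LAST 163\n"))             # True
print(split_values("  7.150814E-02  9.866657E-03\n"))      # [0.07150814, 0.009866657]
```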
Here is my solution for reading the data from the .txt file and producing a matrix-like output (each group is padded with 0.0 at the end):
import re

def read_data_from_file(filepath):
    GROUP_DATA = []
    MAX_ELEMENT_COUNT = 0
    with open(filepath) as f:
        for line in f.readlines():
            if 'GROUP' in line:
                # Header line: start a new group; the last number on the line is LAST
                GROUP_DATA.append([])
                MAX_ELEMENT_COUNT = max(MAX_ELEMENT_COUNT, int(re.findall(r'\d+', line)[-1]))
            else:
                values = line.split(' ')
                for value in values:
                    try:
                        GROUP_DATA[-1].append(float(value))
                    except ValueError:
                        # Skip empty strings left over from splitting on single spaces
                        pass
    # Pad each group with 0.0 up to the largest LAST value seen
    for DATA in GROUP_DATA:
        if len(DATA) < MAX_ELEMENT_COUNT:
            DATA += [0.0] * (MAX_ELEMENT_COUNT - len(DATA))
    return GROUP_DATA
For the data given in the question, saved to data.txt, the output is as follows:
>>> import numpy as np  # just to check the output shape
>>> mat = read_data_from_file('data.txt')
>>> np.shape(mat)  # output shape as expected
(2, 168)
The size of the output matrix adapts to the given data.