
Python regular expression to parse text file

My goal is to extract some values from a text file and generate a plot with matplotlib...

I have several large (~100 MB) text log files generated by a Python script that calls TensorFlow. I saved the terminal output of the run like this:

python my_script.py 2>&1 | tee mylog.txt

Here is a snippet of the text file that I am trying to parse and convert into a dictionary:

Epoch 00001: saving model to /root/data-cache/data/tmp/models/ota-cfo-full_20200626-173916_01_0.05056382_0.99.h5

5938/5938 [==============================] - 4312s 726ms/step - loss: 0.1190 - accuracy: 0.9583 - val_loss: 0.0506 - val_accuracy: 0.9854

For each of the 100 epochs I specifically want to pull out the epoch number (0001), the time in seconds (4312), the loss (0.1190), the accuracy (0.9583), the val_loss (0.0506) and the val_accuracy, so that I can plot them with matplotlib.

The log file is full of other junk that I don't want, for example:

Epoch 1/100

   1/5938 [..............................] - ETA: 0s - loss: 1.7893 - accuracy: 0.31252020-06-26 17:39:45.253972: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1479] CUPTI activity buffer flushed
2020-06-26 17:39:45.255588: I tensorflow/core/profiler/internal/gpu/device_tracer.cc:216]  GpuTracer has collected 179 callback api events and 179 activity events.
2020-06-26 17:39:45.276306: I tensorflow/core/profiler/rpc/client/save_profile.cc:168] Creating directory: /root/data-cache/data/tmp/20200626-173933/train/plugins/profile/2020_06_26_17_39_45
2020-06-26 17:39:45.284235: I tensorflow/core/profiler/rpc/client/save_profile.cc:174] Dumped gzipped tool data for trace.json.gz to /root/data-cache/data/tmp/20200626-173933/train/plugins/profile/2020_06_26_17_39_45/ddfc870f32d1.trace.json.gz
2020-06-26 17:39:45.286639: I tensorflow/core/profiler/utils/event_span.cc:288] Generation of step-events took 0.049 ms

2020-06-26 17:39:45.288257: I tensorflow/python/profiler/internal/profiler_wrapper.cc:87] Creating directory: /root/data-cache/data/tmp/20200626-173933/train/plugins/profile/2020_06_26_17_39_45Dumped tool data for overview_page.pb to /root/data-cache/data/tmp/20200626-173933/train/plugins/profile/2020_06_26_17_39_45/ddfc870f32d1.overview_page.pb
Dumped tool data for input_pipeline.pb to /root/data-cache/data/tmp/20200626-173933/train/plugins/profile/2020_06_26_17_39_45/ddfc870f32d1.input_pipeline.pb
Dumped tool data for tensorflow_stats.pb to /root/data-cache/data/tmp/20200626-173933/train/plugins/profile/2020_06_26_17_39_45/ddfc870f32d1.tensorflow_stats.pb
Dumped tool data for kernel_stats.pb to /root/data-cache/data/tmp/20200626-173933/train/plugins/profile/2020_06_26_17_39_45/ddfc870f32d1.kernel_stats.pb


   2/5938 [..............................] - ETA: 6:24 - loss: 1.7824 - accuracy: 0.2656
   3/5938 [..............................] - ETA: 17:03 - loss: 1.7562 - accuracy: 0.2396
   4/5938 [..............................] - ETA: 22:27 - loss: 1.7368 - accuracy: 0.2344
   5/5938 [..............................] - ETA: 22:55 - loss: 1.7387 - accuracy: 0.2375
   6/5938 [..............................] - ETA: 24:16 - loss: 1.7175 - accuracy: 0.2656
   7/5938 [..............................] - ETA: 24:34 - loss: 1.6885 - accuracy: 0.2812
Epoch 54/100
1500/1500 [==============================] - ETA: 0s - loss: 0.0088 - accuracy: 0.9984       
Epoch 00054: saving model to /root/data-cache/data/tmp/models/ota-cfo-10k-clean_20200701-205945_54_0.0054215402_1.00.h5
1500/1500 [==============================] - 942s 628ms/step - loss: 0.0088 - accuracy: 0.9984 - val_loss: 0.0054 - val_accuracy: 0.9993
2020-07-02 14:42:29.102025: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 674 of 1000
2020-07-02 14:42:33.511163: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:233] Shuffle buffer filled.
Epoch 55/100
1500/1500 [==============================] - ETA: 0s - loss: 0.0136 - accuracy: 0.9978      
Epoch 00055: saving model to /root/data-cache/data/tmp/models/ota-cfo-10k-clean_20200701-205945_55_0.0036424326_1.00.h5
1500/1500 [==============================] - 948s 632ms/step - loss: 0.0136 - accuracy: 0.9978 - val_loss: 0.0036 - val_accuracy: 0.9990
2020-07-02 14:58:32.042963: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 690 of 1000
2020-07-02 14:58:36.302518: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:233] Shuffle buffer filled.

This almost works, but it does not capture the epoch:

import re

log_file = 'mylog.txt'
regular_exp = re.compile(r'(?P<loss>\d+\.\d+)\s+-\s+accuracy:\s*(?P<accuracy>\d+\.\d+)\s+-\s+val_loss:\s*(?P<val_loss>\d+\.\d+)\s*-\s*val_accuracy:\s*(?P<val_accuracy>\d+\.\d+)', re.M)
with open(log_file, 'r') as file:
    results = [match.groupdict() for match in regular_exp.finditer(file.read())]

I also tried just reading the file, but there are these strange \x08 characters everywhere.

from pprint import pprint as pp
log_file = 'mylog.txt'
text_file = open(log_file, "r")
lines = text_file.readlines()
pp (lines)
' 291/1500 [====>.........................] - ETA: 12:19 - loss: 0.7179 - '
 'accuracy: '
 '0.7163\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\n',
 ' 292/1500 [====>.........................] - ETA: 12:18 - loss: 0.7164 - '
 'accuracy: '
 '0.7168\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\n',

Could someone help me build a regular expression in Python that lets me turn these values into a dictionary?

What I am aiming for is something like this:

[{'iteration': '00', 'seconds': '1802', 'loss': '0.3430', 'accuracy': '0.8753', 'val_loss': '0.1110', 'val_accuracy': '0.9670', 'epoch_num': '00002', 'epoch_file': '/root/data-cache/data/tmp/models/ota-cfo-10k_20200527-001913_02_0.069291627_0.98.h5'}, {'iteration': '1500/1500', 'seconds': '1679', 'loss': '0.0849', 'accuracy': '0.9739', 'val_loss': '0.0693', 'val_accuracy': '0.9807', 'epoch_num': '00003', 'epoch_file': '/root/data-cache/data/tmp/models/ota-cfo-10k_20200527-001913_03_0.055876694_0.98.h5'}, {'iteration': '1500/1500', 'seconds': '1674', 'loss': '0.0742', 'accuracy': '0.9791', 'val_loss': '0.0559', 'val_accuracy': '0.9845', 'epoch_num': '00004', 'epoch_file': '/root/data-cache/data/tmp/models/ota-cfo-10k_20200527-001913_04_0.053867317_0.99.h5'}, {'iteration': '1500/1500', 'seconds': '1671', 'loss': '0.0565', 'accuracy': '0.9841', 'val_loss': '0.0539', 'val_accuracy': '0.9850', 'epoch_num': '00005', 'epoch_file': '/root/data-cache/data/tmp/models/ota-cfo-10k_20200527-001913_05_0.053266536_0.99.h5'}]

You can achieve this with a combination of the right regular expression, a list comprehension, groupdict and finditer.

First, we need a baseline, normalized text format. This is important: if you find that your text does not match it, try replacing all \x08 bytes (and any other unwanted bytes) with spaces. (\x08 simply represents a backspace.)
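If you do need that cleanup step, a minimal sketch (assuming the raw log is in mylog.txt, as in the question) could look like this:

# Read the raw log and replace the backspace (\x08) bytes written by the
# Keras progress bar with spaces, so the text matches the sample below.
with open('mylog.txt', 'r', errors='ignore') as f:
    data = f.read().replace('\x08', ' ')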

data = """Epoch 54/100
1500/1500 [==============================] - ETA: 0s - loss: 0.0088 - accuracy: 0.9984       
Epoch 00054: saving model to /root/data-cache/data/tmp/models/ota-cfo-10k-clean_20200701-205945_54_0.0054215402_1.00.h5
1500/1500 [==============================] - 942s 628ms/step - loss: 0.0088 - accuracy: 0.9984 - val_loss: 0.0054 - val_accuracy: 0.9993
2020-07-02 14:42:29.102025: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 674 of 1000
2020-07-02 14:42:33.511163: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:233] Shuffle buffer filled.
Epoch 55/100
1500/1500 [==============================] - ETA: 0s - loss: 0.0136 - accuracy: 0.9978      
Epoch 00055: saving model to /root/data-cache/data/tmp/models/ota-cfo-10k-clean_20200701-205945_55_0.0036424326_1.00.h5
1500/1500 [==============================] - 948s 632ms/step - loss: 0.0136 - accuracy: 0.9978 - val_loss: 0.0036 - val_accuracy: 0.9990
2020-07-02 14:58:32.042963: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 690 of 1000
2020-07-02 14:58:36.302518: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:233] Shuffle buffer filled."""

This is a one-to-one copy of the latest sample you provided. It seems to be the most "complete", so I will use it.

The regular expression you need is:

ETA: (?P<ETA>[\d\.]+)s - loss: (?P<loss>[\d\.]+) - accuracy: (?P<accuracy>[\d\.]+)\s+Epoch (?P<iteration>\d+)

See the demo.

Now you need a single line that does all the magic:

import re

info_list = [match.groupdict() for match in re.finditer(r'ETA: (?P<ETA>[\d\.]+)s - loss: (?P<loss>[\d\.]+) - accuracy: (?P<accuracy>[\d\.]+)\s+Epoch (?P<iteration>\d+)', data)]

For this specific case, though, I would definitely recommend compiling the pattern first.

DATA_PATTERN = re.compile(r'ETA: (?P<ETA>[\d\.]+)s - loss: (?P<loss>[\d\.]+) - accuracy: (?P<accuracy>[\d\.]+)\s+Epoch (?P<iteration>\d+)')
info_list = [match.groupdict() for match in DATA_PATTERN.finditer(data)]

Output:

[{'ETA': '0', 'loss': '0.0088', 'accuracy': '0.9984', 'iteration': '00054'},
 {'ETA': '0', 'loss': '0.0136', 'accuracy': '0.9978', 'iteration': '00055'}]
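Since the stated goal is a matplotlib plot, a minimal follow-up sketch (assuming the info_list built above and that matplotlib is installed) might be:

import matplotlib.pyplot as plt

# Convert the captured strings to numbers and plot the loss per epoch.
epochs = [int(d['iteration']) for d in info_list]
losses = [float(d['loss']) for d in info_list]

plt.plot(epochs, losses, marker='o', label='loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()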
