
Python regular expression to parse text file

My goal is to lift a few values from a text file and generate a plot using matplotlib...

I have several large (~100MB) text log files generated from a python script that is calling tensorflow. I save the terminal output from running the script like this:

python my_script.py 2>&1 | tee mylog.txt

Here's a snippet from the text file that I'm trying to parse and turn into a dictionary:

Epoch 00001: saving model to /root/data-cache/data/tmp/models/ota-cfo-full_20200626-173916_01_0.05056382_0.99.h5

5938/5938 [==============================] - 4312s 726ms/step - loss: 0.1190 - accuracy: 0.9583 - val_loss: 0.0506 - val_accuracy: 0.9854

I'm specifically trying to lift the epoch number (0001), the time in seconds (4312), loss (0.1190), accuracy (0.9583), val_loss (0.0506) and val_accuracy for 100 epochs so I can make a plot using matplotlib.

The log file is full of other junk that I don't want, like:

Epoch 1/100

   1/5938 [..............................] - ETA: 0s - loss: 1.7893 - accuracy: 0.31252020-06-26 17:39:45.253972: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1479] CUPTI activity buffer flushed
2020-06-26 17:39:45.255588: I tensorflow/core/profiler/internal/gpu/device_tracer.cc:216]  GpuTracer has collected 179 callback api events and 179 activity events.
2020-06-26 17:39:45.276306: I tensorflow/core/profiler/rpc/client/save_profile.cc:168] Creating directory: /root/data-cache/data/tmp/20200626-173933/train/plugins/profile/2020_06_26_17_39_45
2020-06-26 17:39:45.284235: I tensorflow/core/profiler/rpc/client/save_profile.cc:174] Dumped gzipped tool data for trace.json.gz to /root/data-cache/data/tmp/20200626-173933/train/plugins/profile/2020_06_26_17_39_45/ddfc870f32d1.trace.json.gz
2020-06-26 17:39:45.286639: I tensorflow/core/profiler/utils/event_span.cc:288] Generation of step-events took 0.049 ms

2020-06-26 17:39:45.288257: I tensorflow/python/profiler/internal/profiler_wrapper.cc:87] Creating directory: /root/data-cache/data/tmp/20200626-173933/train/plugins/profile/2020_06_26_17_39_45Dumped tool data for overview_page.pb to /root/data-cache/data/tmp/20200626-173933/train/plugins/profile/2020_06_26_17_39_45/ddfc870f32d1.overview_page.pb
Dumped tool data for input_pipeline.pb to /root/data-cache/data/tmp/20200626-173933/train/plugins/profile/2020_06_26_17_39_45/ddfc870f32d1.input_pipeline.pb
Dumped tool data for tensorflow_stats.pb to /root/data-cache/data/tmp/20200626-173933/train/plugins/profile/2020_06_26_17_39_45/ddfc870f32d1.tensorflow_stats.pb
Dumped tool data for kernel_stats.pb to /root/data-cache/data/tmp/20200626-173933/train/plugins/profile/2020_06_26_17_39_45/ddfc870f32d1.kernel_stats.pb


   2/5938 [..............................] - ETA: 6:24 - loss: 1.7824 - accuracy: 0.2656
   3/5938 [..............................] - ETA: 17:03 - loss: 1.7562 - accuracy: 0.2396
   4/5938 [..............................] - ETA: 22:27 - loss: 1.7368 - accuracy: 0.2344
   5/5938 [..............................] - ETA: 22:55 - loss: 1.7387 - accuracy: 0.2375
   6/5938 [..............................] - ETA: 24:16 - loss: 1.7175 - accuracy: 0.2656
   7/5938 [..............................] - ETA: 24:34 - loss: 1.6885 - accuracy: 0.2812
Epoch 54/100
1500/1500 [==============================] - ETA: 0s - loss: 0.0088 - accuracy: 0.9984       
Epoch 00054: saving model to /root/data-cache/data/tmp/models/ota-cfo-10k-clean_20200701-205945_54_0.0054215402_1.00.h5
1500/1500 [==============================] - 942s 628ms/step - loss: 0.0088 - accuracy: 0.9984 - val_loss: 0.0054 - val_accuracy: 0.9993
2020-07-02 14:42:29.102025: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 674 of 1000
2020-07-02 14:42:33.511163: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:233] Shuffle buffer filled.
Epoch 55/100
1500/1500 [==============================] - ETA: 0s - loss: 0.0136 - accuracy: 0.9978      
Epoch 00055: saving model to /root/data-cache/data/tmp/models/ota-cfo-10k-clean_20200701-205945_55_0.0036424326_1.00.h5
1500/1500 [==============================] - 948s 632ms/step - loss: 0.0136 - accuracy: 0.9978 - val_loss: 0.0036 - val_accuracy: 0.9990
2020-07-02 14:58:32.042963: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 690 of 1000
2020-07-02 14:58:36.302518: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:233] Shuffle buffer filled.

This almost works but doesn't get the epoch number:

import re

regular_exp = re.compile(r'(?P<loss>\d+\.\d+)\s+-\s+accuracy:\s*(?P<accuracy>\d+\.\d+)\s+-\s+val_loss:\s*(?P<val_loss>\d+\.\d+)\s*-\s*val_accuracy:\s*(?P<val_accuracy>\d+\.\d+)', re.M)
with open(log_file, 'r') as file:
    results = [ match.groupdict() for match in regular_exp.finditer(file.read()) ]

I've also tried just reading the file in, but it has these weird \x08 characters everywhere.

from pprint import pprint as pp
log_file = 'mylog.txt'
text_file = open(log_file, "r")
lines = text_file.readlines()
pp (lines)
' 291/1500 [====>.........................] - ETA: 12:19 - loss: 0.7179 - '
 'accuracy: '
 '0.7163\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\n',
 ' 292/1500 [====>.........................] - ETA: 12:18 - loss: 0.7164 - '
 'accuracy: '
 '0.7168\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\n',

Can someone help me construct a regular expression in python that will let me make a dictionary of these values?

My goal is something like this:

[{'iteration': '00', 'seconds': '1802', 'loss': '0.3430', 'accuracy': '0.8753', 'val_loss': '0.1110', 'val_accuracy': '0.9670', 'epoch_num': '00002', 'epoch_file': '/root/data-cache/data/tmp/models/ota-cfo-10k_20200527-001913_02_0.069291627_0.98.h5'},
 {'iteration': '1500/1500', 'seconds': '1679', 'loss': '0.0849', 'accuracy': '0.9739', 'val_loss': '0.0693', 'val_accuracy': '0.9807', 'epoch_num': '00003', 'epoch_file': '/root/data-cache/data/tmp/models/ota-cfo-10k_20200527-001913_03_0.055876694_0.98.h5'},
 {'iteration': '1500/1500', 'seconds': '1674', 'loss': '0.0742', 'accuracy': '0.9791', 'val_loss': '0.0559', 'val_accuracy': '0.9845', 'epoch_num': '00004', 'epoch_file': '/root/data-cache/data/tmp/models/ota-cfo-10k_20200527-001913_04_0.053867317_0.99.h5'},
 {'iteration': '1500/1500', 'seconds': '1671', 'loss': '0.0565', 'accuracy': '0.9841', 'val_loss': '0.0539', 'val_accuracy': '0.9850', 'epoch_num': '00005', 'epoch_file': '/root/data-cache/data/tmp/models/ota-cfo-10k_20200527-001913_05_0.053266536_0.99.h5'}]

You can achieve this using a combination of the correct regex, a list comprehension, groupdict, and finditer.

First things first - we need a baseline and standardized text format. This is important - if you think your text content does not match this, perhaps try replacing all \x08 bytes (and any other unnecessary bytes, for that matter) with blank space; a minimal clean-up sketch follows the sample data below. (\x08 just means backspace.)

data = """Epoch 54/100
1500/1500 [==============================] - ETA: 0s - loss: 0.0088 - accuracy: 0.9984       
Epoch 00054: saving model to /root/data-cache/data/tmp/models/ota-cfo-10k-clean_20200701-205945_54_0.0054215402_1.00.h5
1500/1500 [==============================] - 942s 628ms/step - loss: 0.0088 - accuracy: 0.9984 - val_loss: 0.0054 - val_accuracy: 0.9993
2020-07-02 14:42:29.102025: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 674 of 1000
2020-07-02 14:42:33.511163: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:233] Shuffle buffer filled.
Epoch 55/100
1500/1500 [==============================] - ETA: 0s - loss: 0.0136 - accuracy: 0.9978      
Epoch 00055: saving model to /root/data-cache/data/tmp/models/ota-cfo-10k-clean_20200701-205945_55_0.0036424326_1.00.h5
1500/1500 [==============================] - 948s 632ms/step - loss: 0.0136 - accuracy: 0.9978 - val_loss: 0.0036 - val_accuracy: 0.9990
2020-07-02 14:58:32.042963: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 690 of 1000
2020-07-02 14:58:36.302518: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:233] Shuffle buffer filled."""

That's a one-to-one replication of the latest example you provided. It seemed the most "complete", so I'll be using it.
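If your actual log file still contains the \x08 bytes shown in the question, a minimal clean-up sketch (assuming the mylog.txt produced by the tee command in the question) could produce the same kind of data string:

# Clean-up sketch: replace every backspace byte (\x08) with a space so the
# progress-bar rewrites no longer glue numbers and labels together.
with open('mylog.txt', 'r') as fh:
    data = fh.read().replace('\x08', ' ')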

The regex you need should be -

ETA: (?P<ETA>[\d\.]+)s - loss: (?P<loss>[\d\.]+) - accuracy: (?P<accuracy>[\d\.]+)\s+Epoch (?P<iteration>\d+)

Check out the demo!

Now, you need a single line to work out all the magic -

info_list = [match.groupdict() for match in re.finditer(r'ETA: (?P<ETA>[\d\.]+)s - loss: (?P<loss>[\d\.]+) - accuracy: (?P<accuracy>[\d\.]+)\s+Epoch (?P<iteration>\d+)', data)]

I definitely recommend compiling the pattern first though, for this specific case.

DATA_PATTERN = re.compile(r'ETA: (?P<ETA>[\d\.]+)s - loss: (?P<loss>[\d\.]+) - accuracy: (?P<accuracy>[\d\.]+)\s+Epoch (?P<iteration>\d+)')
info_list = [match.groupdict() for match in DATA_PATTERN.finditer(data)]

Output -

[{'ETA': '0', 'loss': '0.0088', 'accuracy': '0.9984', 'iteration': '00054'},
 {'ETA': '0', 'loss': '0.0136', 'accuracy': '0.9978', 'iteration': '00055'}]
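This captures the loss, accuracy and epoch number. If you also want the seconds, val_loss, val_accuracy and the saved-model path mentioned in the question, a hedged extension of the same finditer/groupdict approach could look like the sketch below. It assumes the \x08 bytes have already been stripped and that each "Epoch NNNNN: saving model to ..." line is immediately followed by its per-epoch summary line, as in the sample data above; the pattern name and group names are illustrative.

import re

# Sketch: span the "saving model" line and the following per-epoch summary
# line with one pattern, pulling out every field the question asks for.
FULL_PATTERN = re.compile(
    r'Epoch (?P<epoch_num>\d+): saving model to (?P<epoch_file>\S+)\s+'
    r'(?P<iteration>\d+/\d+) \[=+\] - (?P<seconds>\d+)s [^-]*- '
    r'loss: (?P<loss>[\d.]+) - accuracy: (?P<accuracy>[\d.]+) - '
    r'val_loss: (?P<val_loss>[\d.]+) - val_accuracy: (?P<val_accuracy>[\d.]+)'
)
results = [match.groupdict() for match in FULL_PATTERN.finditer(data)]

For the sample data above this yields two dictionaries, one per epoch, with the keys epoch_num, epoch_file, iteration, seconds, loss, accuracy, val_loss and val_accuracy - close to the goal format in the question.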
