How to collect all lines of data between keywords in a file - starting+ending at linebreaks
I am trying to collect specific information from very large log files but cannot figure out how to get the behavior I need.
For reference, an example log is sort of like this:
garbage I don't need - garbage I don't need
timestamp - date - server info - 'keyword 1' - data
more data more data more data more data
more data more data more data more data
more data more data 'keyword 2' - last bit of data
garbage I don't need - garbage I don't need
What I need is to find 'keyword 1', grab the whole line that keyword 1 is on (back to the timestamp), and all subsequent lines up to and including the whole line that 'keyword 2' is on (through the last bit of data).
So far I have tried a few things. I cannot get decent results with the re methods (findall, match, search, etc.); I cannot figure out how to grab the data before the match (even with a lookbehind), but more importantly, I cannot figure out how to have the capture stop at a phrase and not just a single character.
for match in re.findall('keyword1[keyword2]+|', showall.read()):
I also tried something like this:
start_capture = False
for current_line in fileName:
    if 'keyword1' in current_line:
        start_capture = True
    if start_capture:
        new_list.append(current_line)
    if 'keyword2' in current_line:
        return(new_list)
No matter what I tried, this returned an empty list.
Finally, I tried something like this:
def takewhile_plus_next(predicate, xs):
    for x in xs:
        if not predicate(x):
            break
        yield x
    yield x

with lastdb as f:
    lines = map(str.rstrip, f)
    skipped = dropwhile(lambda line: 'Warning: fatal assert' not in line, lines)
    lines_to_keep = takewhile_plus_next(lambda line: 'uptime:' not in line, skipped)
This last one took everything from keyword 1 to the EOF, which includes almost 100,000 lines of garbage data.
You can use a regex if you specify re.DOTALL and use lazy "anythings" (.*?) to match the start and end:
import re

regex = r"\n.*?(keyword 1).*?(keyword 2).*?$"

test_str = ("garbage I don't need - garbage I don't need\n"
            "timestamp - date - server info - 'keyword 1' - data\n"
            "more data more data more data more data\n"
            "more data more data more data more data\n"
            "more data more data 'keyword 2' - last bit of data\n"
            "garbage I don't need - garbage I don't need")

matches = re.finditer(regex, test_str, re.DOTALL | re.MULTILINE)

for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1
    print(match.group())  # your match is the whole group
Output:
timestamp - date - server info - 'keyword 1' - data
more data more data more data more data
more data more data more data more data
more data more data 'keyword 2' - last bit of data
You might need to strip('\n') from it, because the match begins with the newline that precedes the 'keyword 1' line.
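To make the leading newline concrete, here is a minimal sketch that reuses the pattern and sample text from this answer. The variable names raw and clean are only illustrative.

```python
import re

regex = r"\n.*?(keyword 1).*?(keyword 2).*?$"

test_str = ("garbage I don't need - garbage I don't need\n"
            "timestamp - date - server info - 'keyword 1' - data\n"
            "more data more data more data more data\n"
            "more data more data more data more data\n"
            "more data more data 'keyword 2' - last bit of data\n"
            "garbage I don't need - garbage I don't need")

match = re.search(regex, test_str, re.DOTALL | re.MULTILINE)

raw = match.group()        # starts with the "\n" that the pattern consumed
clean = raw.strip("\n")    # drop that newline before further processing

print(repr(raw[:10]))
print(clean)
```

Calling splitlines() on clean then gives the four wanted lines as a list.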
You can view it here: https://regex101.com/r/HWIALZ/1 - it also holds the explanation of the pattern. The short of it:
\n newline
.*? as few as possible anythings
(keyword 1) literal text - the () are only needed if you want the group
.*? as few as possible anythings
(keyword 2) literal text - again () are not needed
.*? as few as possible anythings
$ end of line
I included the () for clarity; if you do not evaluate the groups, you can remove them.
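One caveat with the pattern above: it starts at a \n, and under re.DOTALL the lazy .*? can cross line breaks, so the match begins at the first newline in the input. If several garbage lines precede the 'keyword 1' line, all but the first end up in the match. The following is a hedged sketch of a variant that anchors on the start of the 'keyword 1' line instead. The helper name extract_block and the extra garbage line in the sample are assumptions made for illustration, not part of the original answer.

```python
import re

# ^[^\n]*  : from the start of a line, stay on that line up to 'keyword 1'
# .*?      : as little as possible (DOTALL lets it cross lines) up to 'keyword 2'
# [^\n]*$  : the remainder of the 'keyword 2' line
BLOCK = re.compile(r"^[^\n]*keyword 1.*?keyword 2[^\n]*$",
                   re.DOTALL | re.MULTILINE)

def extract_block(text):
    """Return the first keyword 1 .. keyword 2 block, or None if absent."""
    m = BLOCK.search(text)
    return m.group() if m else None

sample = ("garbage I don't need - garbage I don't need\n"
          "second garbage line (added for this sketch)\n"
          "timestamp - date - server info - 'keyword 1' - data\n"
          "more data more data more data more data\n"
          "more data more data 'keyword 2' - last bit of data\n"
          "garbage I don't need - garbage I don't need")

block = extract_block(sample)
print(block)
```

Because the match starts at a line start rather than at a newline, no strip('\n') is needed here. Like any regex approach, it still requires the whole text in memory.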
The following is fast for any size of file. It extracted from a 250M log file of nearly 2 million lines in 3 seconds. The extracted portion was at the end of the file.
I would not recommend using list, regexes or other in-memory techniques if there is a chance your files won't fit in available memory.
Test text file startstop_text:
line 1 this should not appear in output
line 2 keyword1
line 3 appears in output
line 4 keyword2
line 5 this should not appear in output
Code:
from itertools import dropwhile

def keepuntil(contains_end_keyword, lines):
    # Yield lines up to and including the first one with the end keyword.
    for line in lines:
        yield line
        if contains_end_keyword(line):
            break

with open('startstop_text', 'r') as f:
    # Skip lines lazily until the start keyword appears.
    from_start_line = dropwhile(lambda line: 'keyword1' not in line, f)
    extracted = keepuntil(lambda line: 'keyword2' in line, from_start_line)
    for line in extracted:
        print(line.rstrip())
$ python startstop.py
line 2 keyword1
line 3 appears in output
line 4 keyword2
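If a log can contain more than one keyword1 ... keyword2 block, the same streaming idea can be written as a single generator. This is only a sketch under that assumption: the function name iter_blocks and the sample lines are hypothetical. It keeps at most one block in memory at a time.

```python
def iter_blocks(lines, start_kw, end_kw):
    """Yield each run of lines from a start_kw line through the next end_kw line."""
    block = None                          # None means "not inside a block"
    for line in lines:
        if block is None and start_kw in line:
            block = []                    # start capturing at this line
        if block is not None:
            block.append(line.rstrip("\n"))
            if end_kw in line:
                yield block               # the end line is included
                block = None              # look for the next start line

sample = [
    "line 1 this should not appear in output\n",
    "line 2 keyword1\n",
    "line 3 appears in output\n",
    "line 4 keyword2\n",
    "line 5 this should not appear in output\n",
    "line 6 keyword1\n",
    "line 7 keyword2\n",
]

blocks = list(iter_blocks(sample, "keyword1", "keyword2"))
for block in blocks:
    print(block)
```

With a real file, pass the open file object in place of sample. A block that starts but never reaches the end keyword before EOF is silently dropped in this sketch.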
None of the other responses worked, but I was able to solve this using a regular expression.
for match in re.findall(r".*keyword1[\s\S]*?keyword2:[\s\S]*?keyword3.*", log_file.read()):
    print(match)  # each match is one extracted block