How to collect all lines of data between keywords in a file - starting and ending at newlines

Question

I am trying to collect specific information from very large log files but cannot figure out how to get the behavior I need.

For reference, an example log looks something like this:

garbage I don't need - garbage I don't need
timestamp - date - server info - 'keyword 1' - data
more data more data more data more data
more data more data more data more data
more data more data 'keyword 2' - last bit of data
garbage I don't need - garbage I don't need

What I need is to find 'keyword 1', grab the whole line that 'keyword 1' is on (back to the timestamp), and grab all subsequent lines up to and including the whole line that 'keyword 2' is on (through the last bit of data).

So far I have tried a few things. I cannot get decent results with the re methods (findall, match, search, etc.): I cannot figure out how to grab the data before the match (even with a lookbehind) and, more importantly, I cannot figure out how to have the capture stop at a phrase and not just a single character.

for match in re.findall('keyword1[keyword2]+|', showall.read()):

I also tried something like this:

start_capture = False
for current_line in fileName:
    if 'keyword1' in current_line:
        start_capture = True
    if start_capture:
        new_list.append(current_line)
    if 'keyword2' in current_line:
        return(new_list)

No matter what I tried, this returned an empty list.

Finally, I tried something like this:

def takewhile_plus_next(predicate, xs):
    for x in xs:
        if not predicate(x):
            break
        yield x
    yield x
with lastdb as f:
    lines = map(str.rstrip, f)
    skipped = dropwhile(lambda line: 'Warning: fatal assert' not in line, lines)
    lines_to_keep = takewhile_plus_next(lambda line: 'uptime:' not in line, skipped)

This last one took everything from keyword 1 to the EOF, which includes almost 100,000 lines of garbage data.

Answer 1

You can use a regex if you specify re.DOTALL and use lazy match-anything quantifiers (.*?) to match the start and end:

import re

regex = r"\n.*?(keyword 1).*?(keyword 2).*?$"

test_str = ("garbage I don't need - garbage I don't need\n"
    "timestamp - date - server info - 'keyword 1' - data\n"
    "more data more data more data more data\n"
    "more data more data more data more data\n"
    "more data more data 'keyword 2' - last bit of data\n"
    "garbage I don't need - garbage I don't need")

matches = re.finditer(regex, test_str, re.DOTALL | re.MULTILINE)

for match in matches:
    print(match.group())  # your match is the whole group

Output:

timestamp - date - server info - 'keyword 1' - data 
more data more data more data more data
more data more data more data more data
more data more data 'keyword 2' - last bit of data

You might need to strip('\n') from it, because the match begins with the newline that precedes the 'keyword 1' line.

You can view it here: https://regex101.com/r/HWIALZ/1 - it also holds the explanation of the pattern. The short of it:

\n             newline
.*?            as few as possible of any character
(keyword 1)    literal text - the () are only needed if you want the group
.*?            as few as possible of any character
(keyword 2)    literal text - again, the () are not needed
.*?            as few as possible of any character
$              end of line

I included the () for clarity - if you do not evaluate the groups, you can remove them.
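
One caveat with the pattern above: under re.DOTALL the `.` also matches newlines, so the leading `\n.*?` can begin at an earlier newline than the one directly before the 'keyword 1' line. When several unwanted lines precede that line, they may be pulled into the match. The following is a minimal sketch of a variant that anchors at the start of the line and uses `[^\n]*` for the parts that must stay on one line. The names `pattern` and `log_text` and the sample text are made up for illustration; this is not the pattern from the regex101 link.

```python
import re

# ^ and $ anchor at line boundaries (re.MULTILINE). [^\n]* never crosses a line,
# while the lazy .*? in the middle may span several lines (re.DOTALL).
pattern = re.compile(r"^[^\n]*keyword 1.*?keyword 2[^\n]*$", re.DOTALL | re.MULTILINE)

# Hypothetical sample with two unwanted lines before the block.
log_text = (
    "garbage I don't need\n"
    "more garbage I don't need\n"
    "timestamp - date - server info - 'keyword 1' - data\n"
    "more data more data\n"
    "more data 'keyword 2' - last bit of data\n"
    "garbage I don't need\n"
)

blocks = [m.group() for m in pattern.finditer(log_text)]
for block in blocks:
    print(block)
```

With this variant the match starts at the beginning of the 'keyword 1' line, so no strip('\n') should be needed. It still requires the whole text to be in memory.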

Answer 2

The following is fast for any size of file. It extracts from a 250M log file of nearly 2 million lines in 3 seconds. The extracted portion was at the end of the file.

I would not recommend using lists, regexes, or other in-memory techniques if there is a chance your files won't fit in available memory.

Test text file startstop_text:

line 1 this should not appear in output
line 2 keyword1
line 3 appears in output
line 4 keyword2
line 5 this should not appear in output

Code:

from itertools import dropwhile


def keepuntil(contains_end_keyword, lines):
    for line in lines:
        yield line
        if contains_end_keyword(line):
            break


with open('startstop_text', 'r') as f:
    from_start_line = dropwhile(lambda line: 'keyword1' not in line, f)
    extracted = keepuntil(lambda line: 'keyword2' in line, from_start_line)
    for line in extracted:
        print(line.rstrip())


$ python startstop.py
line 2 keyword1
line 3 appears in output
line 4 keyword2
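
The dropwhile/keepuntil combination above stops after the first block. If a log may contain several keyword1...keyword2 blocks, the same line-by-line idea can be written as a single generator. This is only a sketch under that assumption: `iter_blocks` and the `sample` list are hypothetical names, and the sample adds a sixth line that is not in the test file above.

```python
from typing import Iterable, Iterator, List


def iter_blocks(lines: Iterable[str], start: str, end: str) -> Iterator[List[str]]:
    """Yield each run of lines from one containing `start` through one containing `end`."""
    block = None  # None means we are currently outside a block
    for line in lines:
        if block is None and start in line:
            block = []  # the start keyword opens a new block
        if block is not None:
            block.append(line.rstrip("\n"))
            if end in line:
                yield block  # the end keyword closes the block, inclusive
                block = None


# Stand-in for an open file object; any iterable of lines works.
sample = [
    "line 1 this should not appear in output\n",
    "line 2 keyword1\n",
    "line 3 appears in output\n",
    "line 4 keyword2\n",
    "line 5 this should not appear in output\n",
    "line 6 keyword1 and keyword2 on one line\n",
]

blocks = list(iter_blocks(sample, "keyword1", "keyword2"))
for block in blocks:
    print("\n".join(block))
```

Passing the file object from `open(...)` in place of `sample` keeps only the current block in memory. A block that is opened but never closed before the end of the input is silently dropped in this sketch.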

Answer 3

None of the other responses worked for me, but I was able to solve this using a regular expression.

for match in re.findall(r".*keyword1[\s\S]*?keyword2:[\s\S]*?keyword3.*", log_file.read()):
    print(match)  # placeholder body: handle each extracted block here
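
To illustrate why this pattern captures whole lines: without re.DOTALL, `.` does not match a newline, so the `.*` at each end extends only to the boundaries of the first and last line, while `[\s\S]*?` lazily spans the lines in between. A small sketch on made-up data follows (`sample_log` and its contents are assumptions for illustration, not the asker's real log):

```python
import re

# No flags: "." stops at newlines, "[\s\S]" matches any character including newlines.
pattern = re.compile(r".*keyword1[\s\S]*?keyword2:[\s\S]*?keyword3.*")

# Hypothetical sample text.
sample_log = (
    "garbage line\n"
    "timestamp - keyword1 - data\n"
    "more data\n"
    "keyword2: more data\n"
    "more data keyword3 - last bit of data\n"
    "garbage line\n"
)

matches = pattern.findall(sample_log)
for block in matches:
    print(block)
```

Because the pattern has no capturing groups, findall returns each full match as a string. Like the original, this reads the whole input into memory at once.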


Answer 1: 1, 2018-11-08 21:57:14
Answer 2: 1, 2018-11-09 04:48:50
Answer 3: -1, accepted, 2018-11-11 05:48:39
