How to collect all lines of data between keywords in a file - starting+ending at linebreaks

I am trying to collect specific information from very large log files but cannot figure out how to get the behavior I need.

For reference, an example log looks something like this:

garbage I don't need - garbage I don't need
timestamp - date - server info - 'keyword 1' - data
more data more data more data more data
more data more data more data more data
more data more data 'keyword 2' - last bit of data
garbage I don't need - garbage I don't need

What I need is to find 'keyword 1', grab the whole line that 'keyword 1' is on (back to the timestamp), and then all subsequent lines up to and including the whole line that 'keyword 2' is on (through the last bit of data).

So far I have tried a few things. I cannot get decent results with the re methods (findall, match, search, etc.); I cannot figure out how to grab the data before the match (even with a lookbehind) and, more importantly, I cannot figure out how to make the capture stop at a phrase rather than at a single character.

for match in re.findall('keyword1[keyword2]+|', showall.read()):

I also tried something like this:

start_capture = False
for current_line in fileName:
    if 'keyword1' in current_line:
        start_capture = True
    if start_capture:
        new_list.append(current_line)
    if 'keyword2' in current_line:
        return(new_list)

No matter what I tried, this returned an empty list.
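For comparison, here is a minimal sketch of the same flag-based idea wrapped in a function. It assumes the loop iterates over lines (an open file object or a list of strings) rather than over a file name: iterating over a plain string yields single characters, so a keyword would never match. Whether that was the cause of the empty list here is only a guess. The name `collect_block` and the sample lines are made up for illustration.

```python
def collect_block(lines, start_keyword, end_keyword):
    """Return the lines from the first one containing start_keyword
    through the next one containing end_keyword (both inclusive)."""
    collected = []
    capturing = False
    for line in lines:
        if not capturing and start_keyword in line:
            capturing = True
        if capturing:
            collected.append(line.rstrip("\n"))
            if end_keyword in line:
                break  # stop reading as soon as the end line is stored
    return collected


# Illustrative sample data, not the real log.
sample = [
    "garbage I don't need\n",
    "timestamp - date - server info - 'keyword 1' - data\n",
    "more data more data\n",
    "more data 'keyword 2' - last bit of data\n",
    "garbage I don't need\n",
]
block = collect_block(sample, "keyword 1", "keyword 2")
print("\n".join(block))
```

Passing an open file object as `lines` should work the same way, since file objects iterate line by line.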

Finally, I tried something like this:

def takewhile_plus_next(predicate, xs):
    for x in xs:
        if not predicate(x):
            break
        yield x
    yield x
with lastdb as f:
    lines = map(str.rstrip, f)
    skipped = dropwhile(lambda line: 'Warning: fatal assert' not in line, lines)
    lines_to_keep = takewhile_plus_next(lambda line: 'uptime:' not in line, skipped)

This last one took everything from keyword 1 to EOF, which includes almost 100,000 lines of garbage data.

You can use a regex if you specify re.DOTALL and use lazy "anything" matches (.*?) to match the start and end:

import re

regex = r"\n.*?(keyword 1).*?(keyword 2).*?$"

test_str = ("garbage I don't need - garbage I don't need\n"
    "timestamp - date - server info - 'keyword 1' - data\n"
    "more data more data more data more data\n"
    "more data more data more data more data\n"
    "more data more data 'keyword 2' - last bit of data\n"
    "garbage I don't need - garbage I don't need")

matches = re.finditer(regex, test_str, re.DOTALL | re.MULTILINE)

for match in matches:
    print(match.group())  # your match is the whole group

Output:

timestamp - date - server info - 'keyword 1' - data 
more data more data more data more data
more data more data more data more data
more data more data 'keyword 2' - last bit of data

You might need to strip('\n') the result, because the match begins with the newline that precedes the keyword 1 line.

You can view it here: https://regex101.com/r/HWIALZ/1 - it also holds the explanation of the pattern. The short of it:

\n        newline
   .*?    as few as possible anythings
   (keyword 1)   literal text - the () are only needed if you want the group
   .*?    as few as possible anythings
   (keyword 2)   literal text - again the () are not needed
   .*?    as few as possible anythings
$         end of line

I included the () for clarity - if you do not evaluate the groups, you can remove them.
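One caveat with this pattern: because re.DOTALL lets `.` cross newlines, the leading `\n.*?` starts at the first newline in the input, so if several garbage lines precede 'keyword 1' they are all captured. A block that begins on the very first line also has no preceding `\n` to match. A possible variant, offered as a sketch rather than a drop-in fix, anchors on line starts with `^` and uses `[^\n]*` to stay on the two keyword lines. The sample text and variable names below are illustrative.

```python
import re

# ^ and $ match at line boundaries (re.MULTILINE). [^\n]* cannot leave the
# current line, while .*? may span lines (re.DOTALL) but stops at the
# first 'keyword 2' it reaches.
pattern = re.compile(
    r"^[^\n]*keyword 1.*?keyword 2[^\n]*$",
    re.DOTALL | re.MULTILINE,
)

# Illustrative sample with two garbage lines before the block.
text = (
    "garbage I don't need\n"
    "more garbage I don't need\n"
    "timestamp - date - server info - 'keyword 1' - data\n"
    "more data more data\n"
    "more data 'keyword 2' - last bit of data\n"
    "garbage I don't need\n"
)

blocks = [m.group() for m in pattern.finditer(text)]
for block in blocks:
    print(block)
```

With this variant the match starts at the beginning of the 'keyword 1' line, so no strip is needed. It still reads the whole text into memory, like any regex over `read()`.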

The following is fast for any size of file. It extracts from a 250M log file of nearly 2 million lines in 3 seconds. The extracted portion was at the end of the file.

I would not recommend using list, regexes or other in-memory techniques if there is a chance your files won't fit in available memory.

Test text file startstop_text:

line 1 this should not appear in output
line 2 keyword1
line 3 appears in output
line 4 keyword2
line 5 this should not appear in output

Code:

from itertools import dropwhile


def keepuntil(contains_end_keyword, lines):
    for line in lines:
        yield line
        if contains_end_keyword(line):
            break


with open('startstop_text', 'r') as f:
    from_start_line = dropwhile(lambda line: 'keyword1' not in line, f)
    extracted = keepuntil(lambda line: 'keyword2' in line, from_start_line)
    for line in extracted:
        print(line.rstrip())


$ python startstop.py
line 2 keyword1
line 3 appears in output
line 4 keyword2
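If the log may contain more than one keyword1 to keyword2 block, the same streaming idea can be extended. The following is a sketch under that assumption: `iter_blocks` is a hypothetical helper name, and `io.StringIO` stands in for a real file so the example is self-contained.

```python
import io


def iter_blocks(lines, start_keyword, end_keyword):
    """Yield each block as a list of lines, reading one line at a time.

    A block that is still open at end of input is dropped.
    """
    block = None  # None means we are currently outside a block
    for line in lines:
        if block is None and start_keyword in line:
            block = []
        if block is not None:
            block.append(line.rstrip("\n"))
            if end_keyword in line:
                yield block
                block = None


# Stand-in for an open log file; the last line holds both keywords.
log = io.StringIO(
    "line 1 this should not appear in output\n"
    "line 2 keyword1\n"
    "line 3 appears in output\n"
    "line 4 keyword2\n"
    "line 5 this should not appear in output\n"
    "line 6 keyword1 and keyword2 on one line\n"
)
blocks = list(iter_blocks(log, "keyword1", "keyword2"))
for block in blocks:
    print(block)
```

Only the block currently being collected is held in memory, so this should scale in the same way as the single-block version as long as each block is reasonably small.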

None of the other answers worked for me, but I was able to solve this with a regex.

for match in re.findall(r".*keyword1[\s\S]*?keyword2:[\s\S]*?keyword3.*", log_file.read()):
    print(match)
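To illustrate how that pattern behaves, here is a small sketch on made-up sample text (the log lines and variable names are assumptions, not the real log). Without re.DOTALL, `.*` stays on one line, so the leading and trailing `.*` pick up the rest of the first and last lines, while `[\s\S]*?` lazily spans the lines in between.

```python
import re

pattern = re.compile(r".*keyword1[\s\S]*?keyword2:[\s\S]*?keyword3.*")

# Illustrative sample text only.
sample_log = (
    "garbage\n"
    "timestamp server-a keyword1 start\n"
    "data\n"
    "keyword2: details\n"
    "more data\n"
    "keyword3 end of block\n"
    "garbage\n"
)

# With no capture groups, findall returns the full text of each match.
matches = pattern.findall(sample_log)
for match in matches:
    print(match)
```

Like the other regex approach, this needs the whole file contents in memory.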


