python regex, match in multiline, but still want to get the line number

I have lots of log files and want to search them for patterns that span multiple lines. To locate the matched text easily, I also want to see the line number of each matched area.

Any good suggestions? (The code sample is copied.)

string="""
####1
ttteest
####1
ttttteeeestt

####2

ttest
####2
"""

import re
pattern = '.*?####(.*?)####'
matches= re.compile(pattern, re.MULTILINE|re.DOTALL).findall(string)
for item in matches:
    print("lineno: ?", "matched: ", item)

[UPDATE] The lineno is the actual line number.

So the output I want looks like:

    lineno: 1, 1
    ttteest
    lineno: 6, 2
    ttttteeeestt

What you want is a typical task that regex is not very good at: parsing.

You could read the logfile line by line and search each line for the strings you are using to delimit your search. You could use regex line by line, but it is less efficient than regular string matching unless you are looking for complicated patterns.

And if you are looking for complicated matches, I'd like to see them. Searching every line in a file for #### while maintaining the line count is easier without regex.
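A minimal sketch of that line-by-line approach, assuming the block layout from the question's sample (an opening #### line, body lines, a closing #### line). The helper name find_delimited_blocks and its marker parameter are made up for illustration. Line numbers here are 1-based and count the leading blank line, so they differ by one from the numbers in the question's desired output.

```python
def find_delimited_blocks(lines, marker="####"):
    """Yield (start_line, label, body_lines) for each marker-delimited block.

    Hypothetical helper: the marker and the open/body/close layout are
    assumptions taken from the sample data. Line numbers are 1-based.
    """
    start = label = None
    body = []
    for lineno, line in enumerate(lines, 1):
        line = line.rstrip("\n")
        if line.startswith(marker):
            if start is None:
                # Opening delimiter: remember where the block begins.
                start, label, body = lineno, line[len(marker):].strip(), []
            else:
                # Closing delimiter: emit the block and reset.
                yield start, label, body
                start = None
        elif start is not None:
            body.append(line)


sample = """
####1
ttteest
####1
ttttteeeestt

####2

ttest
####2
"""

for start, label, body in find_delimited_blocks(sample.splitlines(True)):
    print("lineno: %d, %s" % (start, label), body)
```

Because the function only needs an iterable of lines, an open file object can be passed in directly, so a large log never has to be read into memory at once.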

You can store the end offset of every line beforehand and then look up each match against those offsets afterwards.

import re

string="""
####1
ttteest
####1
ttttteeeestt

####2

ttest
####2
"""

# Record the end offset of every line up front.
line = []
for m in re.finditer(r'.*\n', string):
    line.append(m.end())

pattern = '.*?####(.*?)####'
match = re.compile(pattern, re.MULTILINE|re.DOTALL)
for m in match.finditer(string):
    # The first line whose end offset lies past the start of group 1 contains it.
    lineno = next(i for i in range(len(line)) if line[i] > m.start(1))
    print('lineno :%d, %s' % (lineno, m.group(1)))
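The lookup above scans the offset list from the start for every match. As a sketch of one possible refinement (not part of the original answer), the standard bisect module can do the same lookup by binary search. The helper names line_starts and lineno_at are made up for illustration, and the numbering stays 0-based so that it agrees with the code above on this sample.

```python
import bisect
import re


def line_starts(text):
    """Return the offset at which each line of `text` begins."""
    return [0] + [m.end() for m in re.finditer(r"\n", text)]


def lineno_at(starts, offset):
    """Map a character offset to a 0-based line number by binary search."""
    return bisect.bisect_right(starts, offset) - 1


text = "\n####1\nttteest\n####1\nttttteeeestt\n\n####2\n\nttest\n####2\n"
starts = line_starts(text)
pattern = re.compile(r".*?####(.*?)####", re.DOTALL)

results = [(lineno_at(starts, m.start(1)), m.group(1))
           for m in pattern.finditer(text)]
for lineno, body in results:
    print("lineno: %d, %r" % (lineno, body))
```

For a handful of matches the difference is negligible; it should only start to matter when a file has many lines and many matches.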

This can be done fairly efficiently by:

  • Finding all matches.
  • Looping over newlines, storing the {offset: line_number} mapping up until the last match.
  • For each match, reverse-finding the offset of the first newline before it and looking up its line number in the map.

This avoids counting back to the beginning of the file for every match.

The following function is similar to re.finditer:

def finditer_with_line_numbers(pattern, string, flags=0):
    '''
    A version of 're.finditer' that returns '(match, line_number)' pairs.
    '''
    import re

    matches = list(re.finditer(pattern, string, flags))
    if not matches:
        # This function is a generator, so a bare return ends the iteration.
        return

    end = matches[-1].start()
    # -1 so a failed 'rfind' maps to the first line.
    newline_table = {-1: 0}
    for i, m in enumerate(re.finditer('\\n', string), 1):
        # Don't find newlines past our last match.
        offset = m.start()
        if offset > end:
            break
        newline_table[offset] = i

    # Failing to find the newline is OK, -1 maps to 0.
    for m in matches:
        newline_offset = string.rfind('\n', 0, m.start())
        line_number = newline_table[newline_offset]
        yield (m, line_number)

If you want the contents of the line as well, you can replace the last loop with:

    for m in matches:
        newline_offset = string.rfind('\n', 0, m.start())
        newline_end = string.find('\n', m.end())
        if newline_end == -1:
            # No trailing newline: slice to the end instead of dropping the last character.
            newline_end = len(string)
        line = string[newline_offset + 1:newline_end]
        line_number = newline_table[newline_offset]
        yield (m, line_number, line)

Note that it would be nice to avoid creating a list from finditer; however, without it we would not know when to stop storing newlines (the table could end up holding many newlines even if the only pattern match is at the beginning of the file).

If it were important to avoid storing all matches, it's possible to make an iterator that scans newlines as needed, though I'm not sure this would give you much advantage in practice.
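A rough sketch of such an as-needed iterator, under the assumption that re.finditer yields matches in increasing order of start offset (which it does for non-overlapping matches). The name iter_matches_with_lineno is made up for illustration; line numbers are 0-based, the same convention as finditer_with_line_numbers.

```python
import re


def iter_matches_with_lineno(pattern, text, flags=0):
    """Lazily yield (match, line_number) pairs; line numbers are 0-based."""
    line_number = 0
    scanned_to = 0  # newlines before this offset are already counted
    for m in re.finditer(pattern, text, flags):
        # Matches arrive in increasing order, so only the gap since the
        # previous match start needs to be scanned for newlines.
        line_number += text.count("\n", scanned_to, m.start())
        scanned_to = m.start()
        yield m, line_number


text = "\n####1\nttteest\n####1\nttttteeeestt\n\n####2\n\nttest\n####2\n"
for m, lineno in iter_matches_with_lineno(r"^####\d+$", text, re.MULTILINE):
    print(lineno, m.group())
```

This keeps neither the match list nor a newline table, at the cost of no longer being able to look up an arbitrary offset later.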

The finditer function can tell you the character range that matched. From this you can use a simple newline regular expression to count how many newlines came before the match. Add one to the number of newlines to get the line number, since the convention when manipulating text in an editor is to call the first line 1 rather than 0.

import re

def multiline_re_with_linenumber():
    string="""
####1
ttteest
####1
ttttteeeestt

####2

ttest
####2
"""
    re_pattern = re.compile(r'.*?####(.*?)####', re.DOTALL)
    re_newline = re.compile(r'\n')
    count = 0
    for m in re_pattern.finditer(string):
        count += 1
        start_line = len(re_newline.findall(string, 0, m.start(1)))+1
        end_line = len(re_newline.findall(string, 0, m.end(1)))+1
        print('"""{}"""\nstart={}, end={}, instance={}'.format(m.group(1), start_line, end_line, count))

multiline_re_with_linenumber()

Gives this output:

"""1
ttteest
"""
start=2, end=4, instance=1
"""2

ttest
"""
start=7, end=10, instance=2

I believe this does more or less what you want:

import re

string="""
####1
ttteest
####1
ttttteeeestt

####2

ttest
####2
"""

pattern = '.*?####(.*?)####'
matches = re.compile(pattern, re.MULTILINE|re.DOTALL)
for match in matches.finditer(string):
    start, end = string[0:match.start()].count("\n"), string[0:match.end()].count("\n")
    print("lineno: %d-%d matched: %s" % (start, end, match.group()))

It might be a little slower than other options because it repeatedly slices the string and counts newlines from the beginning, but since the string in your example is small, I think the tradeoff is worth it for simplicity.

What we gain here is also the range of lines that match the pattern, which allows us to extract the whole string in one swoop as well. For what it's worth, we might optimize this further by counting the number of newlines inside the match instead of counting from the beginning of the string up to the end offset.
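That optimization could look roughly like the sketch below. It assumes the same sample data and pattern as the code above, and passes bounds to str.count so that no slice of the prefix is copied. Line numbers stay 0-based, as in the code above.

```python
import re

text = "\n####1\nttteest\n####1\nttttteeeestt\n\n####2\n\nttest\n####2\n"
pattern = re.compile(r".*?####(.*?)####", re.DOTALL)

spans = []
for match in pattern.finditer(text):
    # Newlines before the match give the (0-based) starting line.
    start = text.count("\n", 0, match.start())
    # Count newlines inside the match rather than rescanning from offset 0.
    end = start + match.group().count("\n")
    spans.append((start, end))
    print("lineno: %d-%d" % (start, end))
```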

import re

text = """
####1
ttteest
####1
ttttteeeestt

####2   

ttest
####2
"""

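# Raw strings keep the regex escapes (\d, \S, \1) intact.
pat = (r'^####(\d+)'
       r'(?:[^\S\n]*\n)*'
       r'\s*(.+?)\s*\n'
       r'^####\1(?=\D)')
regx = re.compile(pat, re.MULTILINE)

# findall yields (number after ####, body) tuples.
print('\n'.join("lineno: %s  matched: %s" % t
                for t in regx.findall(text)))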

result

lineno: 1  matched: ttteest
lineno: 2  matched: ttest
