Multiline regex match retrieving line numbers and matches

I'm attempting to iterate over all lines in a file to match a pattern that could:

  1. Occur anywhere in the file
  2. Occur multiple times in the same file
  3. Occur multiple times on the same line
  4. Be spread across multiple lines (the string I'm searching for with one regex pattern may contain line breaks)

An example input would be:

new File()
new
File()
there is a new File()
new
    
    
    
File()
there is not a matching pattern here File() new
new File() test new File() occurs twice on this line

Example output would be:

new File() Found on line 1  
new File() Found on lines 2 & 3 
new File() Found on line 4 
new File() Found on lines 5 & 9 
new File() Found on line 11
new File() Found on line 11 
6 occurrences of new File() pattern in test.txt (Filename)

The regex pattern would look something like:

pattern = r'new\s+File\s*\({1}\s*\){1}'

Looking at the docs, my understanding is that match, findall and finditer all return matches at the beginning of strings, and I don't see a way of using the search function, which looks for the regex at any location, when the string we're searching for spans multiple lines (number four in my requirements above).
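
A quick check of how these functions behave may help here. As far as the `re` documentation describes them, only `re.match` is anchored at the start of the string; `re.search`, `re.findall` and `re.finditer` scan the whole string, and `\s` matches newlines without any flags. The sample string below is made up for illustration.

```python
import re

pattern = r'new\s+File\s*\(\s*\)'
# Hypothetical sample: one match inside a line, one split across two lines.
text = "x = new File()\nnew\nFile()"

# re.match only tries at position 0, so it finds nothing here.
at_start = re.match(pattern, text)
print(at_start)

# re.finditer scans the whole string; \s+ also consumes the newline.
spans = [m.span() for m in re.finditer(pattern, text)]
print(spans)

# re.findall returns the matched substrings themselves.
found = re.findall(pattern, text)
print(found)
```

So the multi-line requirement does not need `re.MULTILINE` or `re.DOTALL`; those flags only change how `^`, `$` and `.` behave.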

It is simple enough to match more than one occurrence of the regex per line with the following.

Example input:

line = "new File() new File()"

Code:

import re

matches = []
while line:
    matchObj = re.search(r"new\s+File\s*\({1}\s*\){1}", line, re.MULTILINE | re.DOTALL)
    if not matchObj:
        break  # no further matches; without this the loop would never end
    matches.append(matchObj.group())
    line = line[matchObj.end():]  # keep searching after the current match

print(matches)

This prints the following matches (not including line numbers etc. for now):

['new File()', 'new File()']

Is there a way to do what I'm looking for with Python's regex?

You could first find all \n characters in the text and record their character indices. Since each \n starts a new line, the position of each value in this list is the number of the line that the \n character terminates. Then search for all occurrences of your pattern and use this list to look up the lines on which each match starts and ends.

import re
import bisect

text = """new 
File()
aa new File()
new
File()
there is a new File() and new
File() again
new
    
    
    
File()
there is not a matching pattern here File() new
new File() test new File() occurs twice on this line
"""

# character indices of all \n characters in text
nl = [m.start() for m in re.finditer("\n", text, re.MULTILINE|re.DOTALL)]

matches = list(re.finditer(r"(new\s+File\(\))", text, re.MULTILINE|re.DOTALL))
match_count = 0
for m in matches:
    match_count += 1
    r = range(bisect.bisect(nl, m.start()-1), bisect.bisect(nl, m.end()-1)+1)
    # note: the 4th positional argument of re.sub is count, not flags, so no flag is passed here
    print(re.sub(r"\s+", " ", m.group(1)), "found on line(s)", *r)
print(f"{match_count} occurrences of new File() found in file....")

Output (line numbers are zero-based):

new File() found on line(s) 0 1
new File() found on line(s) 2
new File() found on line(s) 3 4
new File() found on line(s) 5
new File() found on line(s) 5 6
new File() found on line(s) 7 8 9 10 11
new File() found on line(s) 13
new File() found on line(s) 13
8 occurrences of new File() found in file....
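
If 1-based line numbers are wanted, as in the question, the same newline-index idea can be wrapped in a small helper. This is only a sketch: `find_with_lines` is a hypothetical name, the sample text is made up, and only the first and last line of each match are reported.

```python
import bisect
import re

def find_with_lines(pattern, text):
    """Return (normalised match, first line, last line) tuples, 1-based."""
    # Offsets of every newline. A character at offset i sits on line
    # bisect_left(newlines, i) + 1, i.e. one more than the number of
    # newlines strictly before it.
    newlines = [i for i, ch in enumerate(text) if ch == "\n"]
    results = []
    for m in re.finditer(pattern, text):
        first = bisect.bisect_left(newlines, m.start()) + 1
        last = bisect.bisect_left(newlines, m.end() - 1) + 1
        results.append((re.sub(r"\s+", " ", m.group()), first, last))
    return results

sample = "new File()\nnew\nFile()\nthere is a new File()"
hits = find_with_lines(r"new\s+File\s*\(\s*\)", sample)
for found, first, last in hits:
    print(found, first, last)
```

Building the newline list once and bisecting into it keeps each lookup logarithmic, which matters mainly for large files with many matches.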

You can count the number of newlines before the match, then count the number of newlines in the match value, and combine the line numbers:

import re
s='new File()\nnew\nFile()\nthere is a new File()\nnew\n \n \n \nFile()\nthere is not a matching pattern here File() new\nnew File() test new File() occurs twice on this line'
pattern = r'new\s+File\s*\(\s*\)'
for m in re.finditer(pattern, s):
    linenums = [s[:m.start()].count('\n') + 1]
    for _ in range(m.group().count('\n')):
        linenums.append(linenums[-1] + 1)
    print('{} Found on line {}'.format(re.sub(r'\s+', ' ', m.group()), ", ".join(map(str,linenums))))


Output:

new File() Found on line 1
new File() Found on line 2, 3
new File() Found on line 4
new File() Found on line 5, 6, 7, 8, 9
new File() Found on line 11
new File() Found on line 11
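
To get closer to the output format in the question, including the final count, the newline-counting approach could be extended along these lines. This is a sketch under assumptions: `report` is a hypothetical helper, `test.txt` is the placeholder file name from the question, and only the first and last line of a multi-line match are shown.

```python
import re

PATTERN = r'new\s+File\s*\(\s*\)'

def report(text, name):
    """Build one report line per match, plus a final count line."""
    lines = []
    for m in re.finditer(PATTERN, text):
        first = text.count('\n', 0, m.start()) + 1   # 1-based line where the match starts
        last = first + m.group().count('\n')          # line where the match ends
        where = f"line {first}" if first == last else f"lines {first} & {last}"
        found = re.sub(r'\s+', ' ', m.group())        # collapse whitespace for display
        lines.append(f"{found} Found on {where}")
    # len(lines) is evaluated before the append, so it equals the match count.
    lines.append(f"{len(lines)} occurrences of new File() pattern in {name}")
    return lines

s = 'new File()\nnew\nFile()\nthere is a new File()\nnew\n \n \n \nFile()\nthere is not a matching pattern here File() new\nnew File() test new File() occurs twice on this line'
result = report(s, "test.txt")
print("\n".join(result))
```

For a real file, the text would be read in one go (for example with `open(name).read()`) so that matches spanning line breaks stay intact.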
