Multi-line regex fails to match even though individual items to
I'm trying to search through a bunch of large text files for specific information.
#!/usr/bin/env python
# Python 3.4
import re
sometext = """
lots
of
text here
Sentinel starts
--------------------
item_one item_one_result
item_two item_two_result
--------------------
lots
more
text here
Sentinel starts
--------------------
item_three item_three_result
item_four item_four_result
item_five item_five_result
--------------------
even
more
text here
Sentinel starts
--------------------
item_six item_six_result
--------------------
"""
sometextpattern = re.compile( '''.*Sentinel\s+starts.*$ # sentinel
^.*-+.*$ # dividing line
^.*\s+(?P<itemname>\w+)\s+(?P<itemvalue>\w+)\s+ # item details
^.*-+.*$ # dividing line
''', flags = re.MULTILINE | re.VERBOSE)
print( re.findall( sometextpattern, sometext ) )
Individually, the sentinel and dividing-line patterns each match on their own. How do I make them work together? That is, I would like this to print:
[('item_one','item_one_result'),('item_two','item_two_result'),('item_three','item_three_result'),('item_four','item_four_result'),('item_five','item_five_result'),('item_six','item_six_result')]
The regex multiline flag only makes ^ and $ match at the beginning and end of each line, respectively. If you want a single match to span multiple lines, you need to add the whitespace metacharacter \s to consume the newline:
.*Sentinel\s+starts.*$\s
^.*-+.*$\s
^.*\s+(?P<itemname>\w+)\s+(?P<itemvalue>\w+)\s+
^.*-+.*$
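As a minimal sketch of the point about ^, $ and \s (the two-line sample string and the variable names are made up for illustration), the following shows that re.MULTILINE only changes where the anchors match, and that something such as \s still has to consume the newline between two lines:

```python
import re

text = "first\nsecond\n"  # made-up sample, not the question's data

# Without MULTILINE, ^ and $ only anchor at the start and end of the whole string.
no_flag = re.findall(r'^\w+$', text)

# With MULTILINE, ^ and $ anchor at the start and end of every line.
per_line = re.findall(r'^\w+$', text, flags=re.MULTILINE)

# $ followed directly by ^ can never match: the newline between them is not consumed.
no_newline = re.findall(r'^\w+$^\w+$', text, flags=re.MULTILINE)

# Adding \s between the anchors consumes the newline, so the match spans two lines.
with_newline = re.findall(r'^\w+$\s^\w+$', text, flags=re.MULTILINE)

print(no_flag)       # []
print(per_line)      # ['first', 'second']
print(no_newline)    # []
print(with_newline)  # ['first\nsecond']
```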
Also, the string you are using does not have the required escaping. I would recommend using a raw string (r'') instead. That way you do not have to escape your backslashes.
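A small illustration of why the raw-string prefix matters, using \b as the example because Python itself interprets it in a normal string literal (the sample word and variable names are arbitrary):

```python
import re

# In a normal string literal, "\b" is the backspace character (one character).
plain = "\bword\b"
# In a raw string literal the backslash is kept, so the regex engine sees \b
# and treats it as a word boundary.
raw = r"\bword\b"

plain_hits = re.findall(plain, "a word here")
raw_hits = re.findall(raw, "a word here")

print(len(plain), len(raw))  # 6 8
print(plain_hits)            # []
print(raw_hits)              # ['word']
```

Sequences such as \s happen to survive in a normal string because Python does not recognise them as escapes, but recent Python versions warn about them, so the raw string is the safer habit.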
Use two capturing groups in order to get the text you want inside the list. This uses the third-party regex module, which supports \G:
>>> import regex
>>> text = """ lots
of
text here
Sentinel starts
--------------------
item_one item_one_result
item_two item_two_result
--------------------
lots
more
text here
Sentinel starts
--------------------
item_three item_three_result
item_four item_four_result
item_five item_five_result
--------------------
even
more
text here
Sentinel starts
--------------------
item_six item_six_result
--------------------"""
>>> regex.findall(r'(?:(?:\bSentinel starts\s*\n\s*-+\n\s*|-+)|(?<!^)\G) *(\w+) *(\w+)\n*', text)
[('item_one', 'item_one_result'), ('item_two', 'item_two_result'), ('item_three', 'item_three_result'), ('item_four', 'item_four_result'), ('item_five', 'item_five_result'), ('item_six', 'item_six_result')]
\s* matches zero or more whitespace characters and \S+ matches one or more non-whitespace characters. \G asserts the position at the end of the previous match, or at the start of the string for the first match.
Try these regexes:
for m in re.findall(r'(?:Sentinel starts\n[-\n]*)([^-]+)', sometext, flags=re.M):
    print(re.findall(r'(\w+)\s+(\w+)', m))
It gives you one list of (key, value) tuples per block:
# [('item_one', 'item_one_result'), ('item_two', 'item_two_result')]
# [('item_three', 'item_three_result'), ('item_four', 'item_four_result'), ('item_five', 'item_five_result')]
# [('item_six', 'item_six_result')]
If the text has trailing spaces, replace the regex in the for statement with this one:
r'(?:Sentinel starts\s+-*)([^-]*\b)'
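Putting the two steps together, here is a rough sketch that uses the trailing-space-tolerant pattern and flattens the per-block results into one list, which is the shape the question asks for. The function name extract_items and the shortened sample text are made up for illustration; it assumes every item line holds exactly two words and that item names and values contain no hyphens.

```python
import re

# Shortened, made-up sample with some trailing spaces.
sample = (
    "lots of text\n"
    "Sentinel starts  \n"
    "----------\n"
    "item_one item_one_result  \n"
    "item_two item_two_result\n"
    "----------\n"
    "more text\n"
    "Sentinel starts\n"
    "----------\n"
    "item_three item_three_result\n"
    "----------\n"
)

def extract_items(text):
    """Return a flat list of (name, value) tuples from all sentinel blocks."""
    items = []
    # Step 1: capture everything between the opening and closing dividing lines.
    for block in re.findall(r'Sentinel starts\s+-*([^-]*\b)', text):
        # Step 2: split each captured block into (name, value) pairs.
        items.extend(re.findall(r'(\w+)\s+(\w+)', block))
    return items

result = extract_items(sample)
print(result)
# [('item_one', 'item_one_result'), ('item_two', 'item_two_result'),
#  ('item_three', 'item_three_result')]
```

Because \s+ also matches newlines, a line with an odd number of words would pair its last word with the first word of the next line, so this only suits input shaped like the sample.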