
Multi-line regex fails to match even though individual items to

I'm trying to search through a bunch of large text files for specific information.

#!/usr/bin/env python
# python 3.4
import re
sometext = """
    lots
    of
    text here
    Sentinel starts
    --------------------
    item_one               item_one_result
    item_two               item_two_result
    --------------------
    lots
    more
    text here
    Sentinel starts
    --------------------
    item_three               item_three_result
    item_four                item_four_result
    item_five                item_five_result
    --------------------
    even
    more
    text here
    Sentinel starts
    --------------------
    item_six                item_six_result
    --------------------
    """
sometextpattern = re.compile( '''.*Sentinel\s+starts.*$                           # sentinel
                                 ^.*-+.*$                                         # dividing line
                                 ^.*\s+(?P<itemname>\w+)\s+(?P<itemvalue>\w+)\s+  # item details
                                 ^.*-+.*$                                         # dividing line                                  
                              ''', flags = re.MULTILINE | re.VERBOSE)
print( re.findall( sometextpattern, sometext ) )

Individually, the sentinel and dividing-line parts each match on their own. How do I make them work together? That is, I would like this to print:

[('item_one','item_one_result'),('item_two','item_two_result'),('item_three','item_three_result'),('item_four','item_four_result'),('item_five','item_five_result'),('item_six','item_six_result')]

The re.MULTILINE flag only makes ^ and $ match at the beginning and end of each line, respectively. Neither anchor consumes the newline character. If you want a pattern to span several lines, you need something that matches the newline itself, for example the whitespace metacharacter \s:

.*Sentinel\s+starts.*$\s
^.*-+.*$\s
^.*\s+(?P<itemname>\w+)\s+(?P<itemvalue>\w+)\s+
^.*-+.*$
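
To see why the extra `\s` matters, here is a minimal sketch using only the standard `re` module. The two-line string and the trimmed `sample` text are stand-ins made up for illustration (the sample keeps the question's layout with less filler), and `amended` is simply the pattern above written as a raw, verbose string.

```python
import re

# With re.MULTILINE, ^ and $ are zero-width: they assert a position at a
# line boundary but never consume the newline character itself.
two_lines = "a\nb"
no_newline = re.findall(r"^a$^b$", two_lines, flags=re.MULTILINE)
with_newline = re.findall(r"^a$\s^b$", two_lines, flags=re.MULTILINE)
print(no_newline)    # []
print(with_newline)  # ['a\nb']

# Trimmed stand-in for the question's text (same layout, less filler).
sample = """
    text here
    Sentinel starts
    --------------------
    item_one               item_one_result
    item_two               item_two_result
    --------------------
    more text
    Sentinel starts
    --------------------
    item_three               item_three_result
    item_four                item_four_result
    item_five                item_five_result
    --------------------
    even more
    Sentinel starts
    --------------------
    item_six                item_six_result
    --------------------
    """

amended = re.compile(r"""
    .*Sentinel\s+starts.*$\s                          # sentinel, then its newline
    ^.*-+.*$\s                                        # dividing line, then its newline
    ^.*\s+(?P<itemname>\w+)\s+(?P<itemvalue>\w+)\s+   # a single item line
    ^.*-+.*$                                          # dividing line
    """, flags=re.MULTILINE | re.VERBOSE)

pairs = amended.findall(sample)
print(pairs)  # [('item_two', 'item_two_result'), ('item_six', 'item_six_result')]
```

The pattern now matches across lines, but it describes only one item pair between the two dividers, so each match yields at most one pair, and on this sample the three-item block does not match at all. Getting every item needs either a repeatable item part or a second pass over each block, which is what the other answers do.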


Also, the pattern string you are using is an ordinary string literal, so its backslashes are not protected from Python's own string escaping. I would recommend using a raw string (r'...') instead. That way you do not have to escape your backslashes.
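
As a small illustration of the raw-string advice (a sketch, not taken from the question): `\b` is a case where the difference is visible, because Python turns it into a backspace character in an ordinary literal. The names `plain` and `raw` and the test sentence are made up for the example.

```python
import re

# In an ordinary string literal "\b" is the backspace character (one char);
# in a raw string r"\b" stays as backslash + "b", the regex word boundary.
plain = "\bcat\b"
raw = r"\bcat\b"
print(len(plain), len(raw))             # 5 7
print(re.findall(plain, "a cat sat"))   # []
print(re.findall(raw, "a cat sat"))     # ['cat']
```

Sequences such as `\s` and `\w` are not string escapes that Python recognizes, so they happen to reach the regex engine unchanged even without the `r` prefix, but relying on that is fragile and newer Python versions warn about it.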

Use two capturing groups, in order, to get the text you want inside the list. This uses the third-party regex module, because the standard re module does not support \G.

>>> import regex
>>> text = """    lots
    of
    text here
    Sentinel starts
    --------------------
    item_one               item_one_result
    item_two               item_two_result
    --------------------
    lots
    more
    text here
    Sentinel starts
    --------------------
    item_three               item_three_result
    item_four                item_four_result
    item_five                item_five_result
    --------------------
    even
    more
    text here
    Sentinel starts
    --------------------
    item_six                item_six_result
    --------------------"""
>>> regex.findall(r'(?:(?:\bSentinel starts\s*\n\s*-+\n\s*|-+)|(?<!^)\G) *(\w+) *(\w+)\n*', text)
[('item_one', 'item_one_result'), ('item_two', 'item_two_result'), ('item_three', 'item_three_result'), ('item_four', 'item_four_result'), ('item_five', 'item_five_result'), ('item_six', 'item_six_result')]

\s* matches zero or more whitespace characters and \w+ matches one or more word characters. \G asserts the position at the end of the previous match, or at the start of the string for the first match.
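
If installing the third-party `regex` package is not an option, one alternative is a plain line-by-line scan with no `\G` at all. This is a different technique from the pattern above, sketched under the assumption that every block has the sentinel line, a dashes-only divider, item lines of exactly two fields, and a closing divider. The function name `parse_items`, its state names, and the trimmed `sample` are hypothetical.

```python
def is_divider(stripped):
    """A divider is a non-empty line made only of dashes."""
    return bool(stripped) and set(stripped) == {"-"}


def parse_items(text):
    """Collect (name, value) pairs found between the dividers after each sentinel."""
    items = []
    state = "outside"  # outside -> after_sentinel -> in_block -> outside
    for line in text.splitlines():
        stripped = line.strip()
        if state == "outside":
            if stripped == "Sentinel starts":
                state = "after_sentinel"
        elif state == "after_sentinel":
            # The line right after the sentinel must be the opening divider.
            state = "in_block" if is_divider(stripped) else "outside"
        else:  # in_block
            if is_divider(stripped):
                state = "outside"  # closing divider ends the block
            else:
                parts = stripped.split()
                if len(parts) == 2:
                    items.append((parts[0], parts[1]))
    return items


# Trimmed stand-in for the question's text (same layout, less filler).
sample = """
    text here
    Sentinel starts
    --------------------
    item_one               item_one_result
    item_two               item_two_result
    --------------------
    more text
    Sentinel starts
    --------------------
    item_three               item_three_result
    --------------------
    """

print(parse_items(sample))
# [('item_one', 'item_one_result'), ('item_two', 'item_two_result'), ('item_three', 'item_three_result')]
```

Because it works one line at a time, the same loop can be fed an open file object's lines instead of `text.splitlines()`, which may suit large files better than loading everything into one string.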


Try these regexes:

for m in re.findall(r'(?:Sentinel starts\n[-\n]*)([^-]+)', sometext, flags=re.M ):
    print(list(re.findall(r'(\w+)\s+(\w+)', m)))

It gives you a list of (key, value) tuples for each block:

# [('item_one', 'item_one_result'), ('item_two', 'item_two_result')]
# [('item_three', 'item_three_result'), ('item_four', 'item_four_result')]

Because the lines in the text carry extra spaces (the sample is indented), replace the regex in the for statement with this one:

r'(?:Sentinel starts\s+-*)([^-]*\b)'
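
Put together, the two-pass idea with the revised block regex can be sketched as follows. The names `block_re`, `pair_re`, and the trimmed `sample` are made up for the example, and the per-block lists are flattened into a single list to match the output the question asks for.

```python
import re

# Trimmed stand-in for the question's text (same layout, less filler).
sample = """
    text here
    Sentinel starts
    --------------------
    item_one               item_one_result
    item_two               item_two_result
    --------------------
    more text
    Sentinel starts
    --------------------
    item_three               item_three_result
    item_four                item_four_result
    item_five                item_five_result
    --------------------
    even more
    Sentinel starts
    --------------------
    item_six                item_six_result
    --------------------
    """

# Pass 1: capture everything between the opening and closing divider.
block_re = re.compile(r"(?:Sentinel starts\s+-*)([^-]*\b)")
# Pass 2: split each captured block into (name, value) pairs.
pair_re = re.compile(r"(\w+)\s+(\w+)")

pairs = []
for block in block_re.findall(sample):
    pairs.extend(pair_re.findall(block))

print(pairs)
# [('item_one', 'item_one_result'), ('item_two', 'item_two_result'),
#  ('item_three', 'item_three_result'), ('item_four', 'item_four_result'),
#  ('item_five', 'item_five_result'), ('item_six', 'item_six_result')]
```

One caveat with this approach: `[^-]*` stops at the first dash, so it assumes that item names and values never contain a `-` character.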
