How to extract text between matching strings including match strings and lines

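I am using Python to extract certain strings that lie between matching strings. The strings come from a list, which is itself generated dynamically by a separate Python function. The list I am working with looks like this: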

sample_list = ['line1 this line a first line',
        'line1 this line is also considered as line one...',
        'line1 this line is the first line',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 this contain other strings',
        'line1 this may contain other strings as well',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 what the heck is it...'
        ]

The output I want looks like this:

line1 this line is the first line
line2 this line is second line to be included in output
line3 this should also be included in output
line1 this may contain other strings as well
line2 this line is second line to be included in output
line3 this should also be included in output

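As you can see, I want to extract the text/lines that start with line1 and end with line3 (through to the end of that line). The final output should include the matching words themselves (i.e. line1 and line3).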

The code I tried is:

import re

# Convert list to string first
list_to_str = '\n'.join(sample_list)
# Get desired output
print(re.findall('\nline1(.*?)\nline2(.*?)\nline3($)', list_to_str, re.DOTALL))

This is what I get as output:

[]

Any help is appreciated.

Edit 1: I did some more work and found a near solution:

matches = (re.findall(r"^line1(.*)\nline2(.*)\nline3(.*)$", list_to_str, re.MULTILINE))

for match in matches:
    print('\n'.join(match))

It gives me this output:

 this line is the first line
 this line is second line to be included in output
 this is the third and it should also be included in output
 this may contain other strings as well
 this line is second line to be included in output...
 this is the third should also be included in output

The output is almost correct, but it does not include the matched text.
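One way to keep the matched words, sketched here on the assumption that each wanted block is exactly one line1 line immediately followed by one line2 line and one line3 line, is to move each keyword inside its capture group. The names `pattern` and `matches` are only illustrative.

```python
import re

sample_list = ['line1 this line a first line',
        'line1 this line is also considered as line one...',
        'line1 this line is the first line',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 this contain other strings',
        'line1 this may contain other strings as well',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 what the heck is it...'
        ]

list_to_str = '\n'.join(sample_list)

# Same idea as the Edit 1 pattern, but each keyword now sits inside its group,
# so it is returned as part of the captured text.
pattern = re.compile(r"^(line1.*)\n(line2.*)\n(line3.*)$", re.MULTILINE)
matches = pattern.findall(list_to_str)  # list of (line1, line2, line3) tuples

for match in matches:
    print('\n'.join(match))
```

With re.MULTILINE, `^` and `$` anchor at every line boundary, and because `.` does not match a newline here, each group stays within a single line.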

If you are looking for a sequence of lines 1, 2 and 3 with no repeats, this does it:

line1.*\s*(?!\s|line[13])line2.*\s*(?!\s|line[12])line3.*

Explanation

 line1 .* \s*             # line 1 plus newline(s)
 (?! \s | line [13] )     # Next cannot be line 1 or 3 (or whitespace)
 line2 .* \s*             # line 2 plus newline(s)
 (?! \s | line [12] )     # Next cannot be line 1 or 2 (or whitespace)
 line3 .*                 # line 3 

If you want to capture the line contents, just wrap each .* in a capture group, i.e. (.*)
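A minimal sketch of how this pattern might be applied in Python, assuming the list has been joined with newlines as in the question. The pattern contains no capture groups, so each match object's group(0) is the whole three-line block, keywords included. The names `pattern` and `blocks` are only illustrative.

```python
import re

sample_list = ['line1 this line a first line',
        'line1 this line is also considered as line one...',
        'line1 this line is the first line',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 this contain other strings',
        'line1 this may contain other strings as well',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 what the heck is it...'
        ]

text = '\n'.join(sample_list)

# No re.DOTALL: '.' stops at the newline, so each '.*' covers one line only.
# The '\s*' parts consume the newline(s) between consecutive lines.
pattern = re.compile(r"line1.*\s*(?!\s|line[13])line2.*\s*(?!\s|line[12])line3.*")

# group(0) is the full match, i.e. the line1/line2/line3 block as one string
blocks = [m.group(0) for m in pattern.finditer(text)]

print('\n'.join(blocks))
```

A line1 line that is followed by another line1 line cannot start a match, which is why only the last line1 before each line2 ends up in the result.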

This may not be the clearest way (you may prefer a regular expression), but it outputs what you want:

sample_list = ['line1 this line a first line',
        'line1 this line is also considered as line one...',
        'line1 this line is the first line',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 this contain other strings',
        'line1 this may contain other strings as well',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 what the heck is it...'
        ]
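output = []
line1 = ""
line2 = ""
line3 = ""
prevStart = ""  # prefix of the last line accepted into the current sequence
for text in sample_list:
    if prevStart == "":
        # Waiting for a sequence to start
        if text.startswith("line1"):
            prevStart = "line1"
            line1 = text
    elif prevStart == "line1":
        if text.startswith("line2"):
            prevStart = "line2"
            line2 = text
        elif text.startswith("line1"):
            # A repeated line1 replaces the earlier one
            line1 = text
            prevStart = "line1"
        else:
            prevStart = ""
    elif prevStart == "line2":
        if text.startswith("line3"):
            prevStart = ""
            line3 = text
        elif text.startswith("line1"):
            # Sequence broken by a new line1: start over from this line
            prevStart = "line1"
            line1 = text
        else:
            prevStart = ""
    # A complete line1/line2/line3 run was collected: store it and reset
    if line1 != "" and line2 != "" and line3 != "":
        output.append(line1)
        output.append(line2)
        output.append(line3)
        line1 = ""
        line2 = ""
        line3 = ""

for line in output:
    print(line)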

The output of this code is:

line1 this line is the first line
line2 this line is second line to be included in output
line3 this should also be included in output
line1 this may contain other strings as well
line2 this line is second line to be included in output
line3 this should also be included in output
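The same result can probably be reached without tracking state by sliding a three-line window over the list. This is only a sketch: the helper name `extract_blocks` and its `prefixes` parameter are made up for illustration, and it assumes the three lines must be directly adjacent.

```python
def extract_blocks(lines, prefixes=("line1", "line2", "line3")):
    """Return every run of consecutive lines whose prefixes match `prefixes` in order."""
    n = len(prefixes)
    result = []
    for i in range(len(lines) - n + 1):
        window = lines[i:i + n]
        # Keep the window only if each line starts with its expected prefix
        if all(line.startswith(prefix) for line, prefix in zip(window, prefixes)):
            result.extend(window)
    return result


sample_list = ['line1 this line a first line',
        'line1 this line is also considered as line one...',
        'line1 this line is the first line',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 this contain other strings',
        'line1 this may contain other strings as well',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 what the heck is it...'
        ]

output = extract_blocks(sample_list)
for line in output:
    print(line)
```

Because the window always holds exactly three adjacent lines, a run of repeated line1 lines needs no special handling: only the window that begins at the last line1 before a line2 can match.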
