
Extract a string after a text with regex in Python

I have a doc file with the following structure:

This is a fairy tale written by

    John Doe and Mary Smith
    
    Auckland,somewhere
    
 This story is awesome

I would like to extract the two lines of text, which are:

        John Doe and Mary Smith
        
        Auckland,somewhere

and append those values to a list using regex. The two lines I want to extract always sit between the lines This is a fairy tale written by and This story is awesome. How can I do that? I have tried some combinations with before_keyword,keyword,after_keyword=text.partition(regex), but no luck at all.

You can use a regex with re.DOTALL, which lets . match any character, including newlines. Once you have the text between the two delimiters, you can use another regex, this time without re.DOTALL, to extract the lines that contain at least one non-whitespace character (\S).

import re

lst = []

with open('input.txt') as f:
    text = f.read()

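# Raw strings keep backslashes such as \S intact for the regex engine.
match = re.search(r'This is a fairy tale written by(.*?)This story is awesome',
                  text, re.DOTALL)

if match:
    # Keep only lines with at least one non-whitespace character.
    lst.extend(re.findall(r'.*\S.*', match.group(1)))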

print(lst)

Gives:

['    John Doe and Mary Smith', '    Auckland,somewhere']
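If the leading indentation in the result is not wanted, the same two-step idea can be wrapped in a small function that strips each line. This is only a sketch: the name extract_between and the inline sample text are made up for illustration, and re.escape is used on the assumption that both delimiters are literal strings rather than patterns.

```python
import re

def extract_between(text, start, end):
    """Return the non-blank lines found between two literal delimiters."""
    # re.escape makes the delimiters match as plain text, not regex syntax.
    pattern = re.escape(start) + r'(.*?)' + re.escape(end)
    match = re.search(pattern, text, re.DOTALL)
    if not match:
        return []
    # Drop blank lines and strip surrounding whitespace from the rest.
    return [line.strip() for line in match.group(1).splitlines() if line.strip()]

# Hypothetical sample mirroring the structure shown in the question.
sample = (
    "This is a fairy tale written by\n"
    "\n"
    "    John Doe and Mary Smith\n"
    "    \n"
    "    Auckland,somewhere\n"
    "    \n"
    " This story is awesome\n"
)

result = extract_between(sample,
                         'This is a fairy tale written by',
                         'This story is awesome')
print(result)
```

Returning an empty list when the delimiters are missing keeps the caller from having to check for None.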

You may start with this:

re.search(r'(?<=This is a fairy tale written by\n).*?(?=\n\s*This story is awesome)', s, re.MULTILINE|re.DOTALL).group(0)

and fine-tune this regex. re.MULTILINE may be omitted, since the pattern has no ^ or $ anyway, but re.DOTALL is required to let . match newlines as well. The regex above uses lookbehind and lookahead, (?<=...) and (?=...). If you do not like that, you can use parentheses to capture instead.
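The capturing-parentheses variant might look like the sketch below. The sample string s is an assumption that mirrors the question's file, and the final list comprehension is an extra step added here to drop blank lines.

```python
import re

# Hypothetical sample text with the same layout as in the question.
s = (
    "This is a fairy tale written by\n"
    "\n"
    "    John Doe and Mary Smith\n"
    "    \n"
    "    Auckland,somewhere\n"
    "    \n"
    " This story is awesome\n"
)

# The delimiters are matched normally; only the parenthesised part is captured.
pattern = r'This is a fairy tale written by\n(.*?)\n\s*This story is awesome'
m = re.search(pattern, s, re.DOTALL)

# Guard against a missing match instead of calling .group() on None.
captured = m.group(1) if m else ''
lines = [line.strip() for line in captured.splitlines() if line.strip()]
print(lines)
```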

If you can create a list of strings from your doc file, there is no need for a regex. This simple program will do:

fileContent = ['This is a fairy tale written by','John Doe and Mary Smith','Auckland,somewhere','This story is awesome',
               'Some other things', 'story texts', 'Not Important data',
               'This is a fairy tale written by','Kem Cho?','Majama?','This story is awesome', 'Not important data']
               
authorsList = []
for i in range(len(fileContent)-3):
    if fileContent[i] == 'This is a fairy tale written by' and fileContent[i+3] == 'This story is awesome':
        authorsList.append([fileContent[i+1], fileContent[i+2]])

print(authorsList)

Here I simply check for 'This is a fairy tale written by' followed three items later by 'This story is awesome' and, if both are found, append the two items between them to the list.

Output:

[['John Doe and Mary Smith', 'Auckland,somewhere'], ['Kem Cho?', 'Majama?']]
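The loop above assumes exactly two lines between the markers. If that count can vary, or the list still contains blank lines, a small state-tracking loop may be more forgiving. The following is a sketch under those assumptions, and collect_blocks is a hypothetical helper name:

```python
START = 'This is a fairy tale written by'
END = 'This story is awesome'

def collect_blocks(lines, start=START, end=END):
    """Group the non-blank lines found between each start/end pair."""
    blocks = []
    current = None  # None means we are outside a start/end pair
    for line in lines:
        text = line.strip()
        if text == start:
            current = []            # open a new block
        elif text == end and current is not None:
            blocks.append(current)  # close the block and keep it
            current = None
        elif current is not None and text:
            current.append(text)    # non-blank line inside a block
    return blocks

file_content = ['This is a fairy tale written by', 'John Doe and Mary Smith',
                'Auckland,somewhere', 'This story is awesome',
                'Some other things', 'story texts', 'Not Important data',
                'This is a fairy tale written by', 'Kem Cho?', 'Majama?',
                'This story is awesome', 'Not important data']

authors = collect_blocks(file_content)
print(authors)
```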

Try using this instead. It should match anything between these two strings. re.DOTALL is needed so that . can also match the newlines between them.

re.search(r'(?<=This is a fairy tale).*?(?=This story is awesome)', text, re.DOTALL)
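On the partition attempt mentioned in the question: str.partition splits on a literal separator string rather than on a regex pattern, which is probably why passing a regex to it did not work. Since both delimiters here are fixed strings, two partition calls may be enough. A minimal sketch, assuming the file content is already in a string (the variable names are illustrative):

```python
# Hypothetical sample text with the same layout as in the question.
text = (
    "This is a fairy tale written by\n"
    "\n"
    "    John Doe and Mary Smith\n"
    "    \n"
    "    Auckland,somewhere\n"
    "    \n"
    " This story is awesome\n"
)

# partition returns (before, separator, after); separator is '' if not found.
_, found_start, rest = text.partition('This is a fairy tale written by')
between, found_end, _ = rest.partition('This story is awesome')

if found_start and found_end:
    # Keep the non-blank lines between the two delimiters, stripped.
    lst = [line.strip() for line in between.splitlines() if line.strip()]
else:
    lst = []

print(lst)
```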
