
re.findall multiline python

re.findall with re.M is not finding the multiline text I'm trying to search for

I am trying to extract all multiline strings matching a pattern from a file.

Example from the file book.txt:

Title: Le Morte D'Arthur, Volume I (of II)
       King Arthur and of his Noble Knights of the Round Table

Author: Thomas Malory

Editor: William Caxton

Release Date: March, 1998  [Etext #1251]
Posting Date: November 6, 2009

Language: English

Title: Pride and Prejudice

Author: Jane Austen

Posting Date: August 26, 2008 [EBook #1342]
Release Date: June, 1998
Last Updated: October 17, 2016

Language: English

The following code returns only the first line of each title, e.g. Le Morte D'Arthur, Volume I (of II):

re.findall(r'^Title:\s(.+)$', book, re.M)

I am expecting the output to be

[" Le Morte D'Arthur, Volume I (of II)\n King Arthur and of his Noble Knights of the Round Table", ' Pride and Prejudice']

To clarify:
- The second line is optional; it's there in some files and not in others. There is also more text after the second line that I don't want to read.
- Using re.findall(r'Title: (.+\n.+)$', text, flags=re.MULTILINE) works, but fails if the second line is just blank.
- I'm running Python 3.7.
- I'm converting a txt file to a string and then running re on the str.
- The following don't work either (re.S and re.DOTALL are the same flag):
re.findall(r'^Title:\s(.+)$', text, re.S)
re.findall(r'^Title:\s(.+)$', text, re.DOTALL)
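The behaviour of the two flags described in the question can be reproduced with a small sketch. The sample string below is a shortened stand-in for book.txt, and the names text, multiline_only and dotall_only are made up for illustration.

```python
import re

# Shortened sample in the shape of book.txt (assumed for illustration).
text = (
    "Title: Le Morte D'Arthur, Volume I (of II)\n"
    "       King Arthur and of his Noble Knights of the Round Table\n"
    "\n"
    "Author: Thomas Malory\n"
    "\n"
    "Title: Pride and Prejudice\n"
    "\n"
    "Author: Jane Austen\n"
)

# re.M: ^ and $ match at every line boundary, but . still stops at "\n",
# so each match is confined to a single line.
multiline_only = re.findall(r'^Title:\s(.+)$', text, re.M)

# re.S (alias re.DOTALL): . matches "\n", but ^ and $ anchor only to the
# whole string, so the greedy .+ runs to the end of the text.
dotall_only = re.findall(r'^Title:\s(.+)$', text, re.S)

print(multiline_only)
print(len(dotall_only), dotall_only[0].endswith("Jane Austen\n"))
```

With re.M alone the second title line is never reached, and with re.S alone a single match swallows the rest of the string, which is consistent with neither attempt giving the expected list.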

I'm guessing that this expression,

(?<=Title:\s)(.*?)\s*(?=Author)

might be close to what you want. It captures everything between "Title: " and the next "Author", so it relies on an Author line following each title.


Test

import re

regex = r"(?<=Title:\s)(.*?)\s*(?=Author)"

# Each title must be followed by an "Author" line for the lookahead to match.
test_str = ("Title: Le Morte D'Arthur, Volume I (of II)\n"
    "       King Arthur and of his Noble Knights of the Round Table\n\n"
    "Author: Thomas Malory\n\n"
    "Title: Pride and Prejudice\n\n"
    "Author: Jane Austen\n")

# re.DOTALL lets . match newlines, so a title can span several lines.
print(re.findall(regex, test_str, re.DOTALL))

Output

["Le Morte D'Arthur, Volume I (of II)\n       King Arthur and of his Noble Knights of the Round Table", 'Pride and Prejudice']

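You can use your regex with the DOTALL flag, which allows . to match the newline character:

re.findall(r'^Title:\s(.+)$', book, re.DOTALL)

Note that without re.M, ^ and $ anchor only to the start and end of the whole string, so the greedy .+ captures everything after the first "Title: " through the end of book. The output below is what you get when book contains only the two title lines (with no trailing newline).

Output:

["Le Morte D'Arthur, Volume I (of II)\n       King Arthur and of his Noble Knights of the Round Table"]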
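As a further sketch (not taken from either answer above), the optional second line can be matched explicitly. This assumes, as in the book.txt sample, that a continuation line starts with spaces or tabs and that a blank line ends the title. The sample text and the names pattern, titles and cleaned are made up for illustration.

```python
import re

# Shortened sample in the shape of book.txt (assumed for illustration).
text = (
    "Title: Le Morte D'Arthur, Volume I (of II)\n"
    "       King Arthur and of his Noble Knights of the Round Table\n"
    "\n"
    "Author: Thomas Malory\n"
    "\n"
    "Title: Pride and Prejudice\n"
    "\n"
    "Author: Jane Austen\n"
)

# Title line, then at most one continuation line that begins with
# spaces or tabs. Using [ \t]+ instead of \s+ keeps the match from
# crossing a blank line into the next field.
pattern = re.compile(r'^Title:[ \t]*(.+(?:\n[ \t]+\S.*)?)', re.M)

titles = pattern.findall(text)

# Optional clean-up: collapse the newline and indentation into one space.
cleaned = [" ".join(t.split()) for t in titles]

print(titles)
print(cleaned)
```

Changing the trailing ? to * would accept any number of indented continuation lines instead of at most one.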

