re.findall with re.M is not finding the multiline text I'm trying to search for
I am trying to extract all multiline strings matching a pattern from a file.
Example from the file book.txt:
Title: Le Morte D'Arthur, Volume I (of II)
King Arthur and of his Noble Knights of the Round Table
Author: Thomas Malory
Editor: William Caxton
Release Date: March, 1998 [Etext #1251]
Posting Date: November 6, 2009
Language: English
Title: Pride and Prejudice
Author: Jane Austen
Posting Date: August 26, 2008 [EBook #1342]
Release Date: June, 1998
Last Updated: October 17, 2016
Language: English
The following code returns only the first line of the title, Le Morte D'Arthur, Volume I (of II):
re.findall(r'^Title:\s(.+)$', book, re.M)
I am expecting the output to be:
[" Le Morte D'Arthur, Volume I (of II)\n King Arthur and of his Noble Knights of the Round Table", ' Pride and Prejudice']
To clarify,
- The second line is optional: it's there in some files and not in others. There is also more text after the second line that I don't want to read.
- Using re.findall(r'Title: (.+\n.+)$', text, flags=re.MULTILINE) works but fails if the second line is just blank.
- I'm running Python 3.7.
- I'm converting a txt file to a string and then running re on the str.
- The following don't work either (re.S and re.DOTALL are the same flag):
re.findall(r'^Title:\s(.+)$', text, re.S)
re.findall(r'^Title:\s(.+)$', text, re.DOTALL)
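The behaviour of the two-line pattern from the list above can be reproduced on a shortened sample. The sample string here is an assumption standing in for book.txt, with a blank line placed after the second title:

```python
import re

# Hypothetical sample, not the real book.txt: the second title is
# followed by a blank line instead of a subtitle.
text = (
    "Title: Le Morte D'Arthur, Volume I (of II)\n"
    " King Arthur and of his Noble Knights of the Round Table\n"
    "Author: Thomas Malory\n"
    "Title: Pride and Prejudice\n"
    "\n"
    "Author: Jane Austen\n"
)

# The pattern demands a non-empty second line (.+ needs at least one
# character and . does not match "\n"), so a title followed by a blank
# line produces no match at all.
two_lines = re.findall(r"Title: (.+\n.+)$", text, flags=re.MULTILINE)
print(two_lines)
```

Only the first title is returned; the second one is dropped because its following line is empty.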
My guess is that this expression,
(?<=Title:\s)(.*?)\s*(?=Author)
is close to what you want. It relies on an Author line following each title.
import re

regex = r"(?<=Title:\s)(.*?)\s*(?=Author)"
# Each title must be followed by an Author line; otherwise the
# lookahead (?=Author) never matches and findall returns [].
test_str = ("Title: Le Morte D'Arthur, Volume I (of II)\n"
            " King Arthur and of his Noble Knights of the Round Table\n"
            "Author: Thomas Malory\n\n"
            "Title: Le Morte D'Arthur, Volume I (of II)\n"
            " King Arthur and of his Noble Knights of the Round Table\n"
            "Author: Thomas Malory\n")
# re.DOTALL lets . match newlines, so the lazy (.*?) can span the second line.
print(re.findall(regex, test_str, re.DOTALL))
Output:
["Le Morte D'Arthur, Volume I (of II)\n King Arthur and of his Noble Knights of the Round Table", "Le Morte D'Arthur, Volume I (of II)\n King Arthur and of his Noble Knights of the Round Table"]
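If an Author line is not guaranteed to follow the title, another option is to treat the second line as an optional continuation. The following is only a sketch under two assumptions that are not stated in the question: header lines always look like "Key: value", and a subtitle line never contains a colon after a run of word characters and spaces. The sample string and the name TITLE_RE are made up for illustration.

```python
import re

# Hypothetical sample, not the real book.txt: the second record has no
# subtitle and the third has a blank line after the title.
book = (
    "Title: Le Morte D'Arthur, Volume I (of II)\n"
    " King Arthur and of his Noble Knights of the Round Table\n"
    "Author: Thomas Malory\n"
    "Title: Pride and Prejudice\n"
    "Author: Jane Austen\n"
    "Title: Emma\n"
    "\n"
    "Author: Jane Austen\n"
)

TITLE_RE = re.compile(
    r"^Title:[ \t]*"       # header keyword at the start of a line
    r"(.+"                 # first title line
    r"(?:\n(?![\w ]+:)"    # optional next line, unless it looks like "Key:"
    r"[ \t]*\S.*)?"        # ...and only if it is not blank
    r")$",
    re.MULTILINE,          # ^ and $ match at every line boundary
)

titles = TITLE_RE.findall(book)
print(titles)
```

Because re.DOTALL is not used, each .+ and .* stays on its own line, and the only newline that can be consumed is the explicit \n inside the optional group. A subtitle such as "A Tale: Part One" would be mistaken for a header under these assumptions.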
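You can use your regex with the DOTALL flag, which allows . to match the newline character:
re.findall(r'^Title:\s(.+)$', book, re.DOTALL)
Output, if book contains only the title and its second line:
["Le Morte D'Arthur, Volume I (of II)\n King Arthur and of his Noble Knights of the Round Table"]
Note that .+ is greedy and, without re.M, ^ and $ anchor to the whole string, so on a longer file this captures everything after the first Title: through to the end.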
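The difference between the two flags can be checked on a shortened sample. The sample string is an assumption standing in for the contents of book.txt:

```python
import re

# Hypothetical sample, not the real book.txt.
text = (
    "Title: Le Morte D'Arthur, Volume I (of II)\n"
    " King Arthur and of his Noble Knights of the Round Table\n"
    "Author: Thomas Malory\n"
    "Title: Pride and Prejudice\n"
    "Author: Jane Austen\n"
)

# re.M: ^ and $ match at every line boundary, but . still stops at
# newlines, so each match is confined to a single line.
multiline_only = re.findall(r"^Title:\s(.+)$", text, re.M)

# re.DOTALL: . also matches newlines, but ^ only matches at the start of
# the string, so the greedy .+ runs from the first title to the end.
dotall_only = re.findall(r"^Title:\s(.+)$", text, re.DOTALL)

print(multiline_only)
print(len(dotall_only))
```

With re.M there is one single-line match per title; with re.DOTALL there is a single match holding the rest of the string.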