I have a string like so:
'\n479 Appendix I\n1114\nAppendix I 481\n'
and want to use a regular expression to find and return
['479 Appendix I', 'Appendix I 481']
I first tried this expression:
pattern = r'''
(?: \d+ \s)? Appendix \s+ \w+ (?: \s \d+)?
'''
regex = re.compile(pattern, re.VERBOSE)
regex.findall(s)
But this returns
['479 Appendix I\n1114', 'Appendix I 481']
because \s also matches \n. Following one of the answers in the post "Python regex match space only", I tried the following:
pattern = r'''
(?: \d+ [^ \S\t\n])? Appendix \s+ \w+ (?: [^ \S\t\n] \d+)?
'''
regex = re.compile(pattern, re.VERBOSE)
regex.findall(s)
which however didn't return the desired result, giving:
['Appendix I', 'Appendix I']
What expression would work in this case?
import re
s = '\n479 Appendix I\n1114\nAppendix I 481\n'
for g in re.findall(r'^.*[^\d\n].*$', s, flags=re.M):
    print(g)
Prints:
479 Appendix I
Appendix I 481
This regex matches every line that contains at least one character other than a digit or a newline. With re.M, ^ and $ anchor at the start and end of each line, so each match is one whole line.
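To see what this line filter accepts and rejects, here is a small sketch on a slightly extended input. The blank line and the "Chapter 2" line are hypothetical additions, not part of the original string; they are there only to show that the pattern keeps any line that is not purely digits, whether or not it mentions "Appendix".

```python
import re

# Hypothetical input: the original string plus a blank line and an unrelated heading.
sample = '\n479 Appendix I\n1114\n\nChapter 2\nAppendix I 481\n'

# Keep every line that has at least one character other than a digit.
matches = re.findall(r'^.*[^\d\n].*$', sample, flags=re.M)
print(matches)
# ['479 Appendix I', 'Chapter 2', 'Appendix I 481']
```

If the input can contain other non-numeric lines, a pattern that names "Appendix" explicitly is probably the safer choice.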
This regex is a bit more robust than the one in the other answer because it explicitly anchors at "Appendix":
pattern = r'(?:\d*[\t ]+)?Appendix\s+\w+(?:[\t ]+\d*)?'
re.findall(pattern, s)
# ['479 Appendix I', 'Appendix I 481']
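As for why the second attempt in the question returned only 'Appendix I': under re.VERBOSE, whitespace is ignored everywhere except inside a character class, so the space in [^ \S\t\n] is a literal space and the class excludes it. The sketch below first checks what that class can still match, then shows one possible verbose rewrite using [ \t]. The rewrite assumes the number and the heading are always separated by spaces or tabs on the same line; the names leftover and regex are only illustrative.

```python
import re

s = '\n479 Appendix I\n1114\nAppendix I 481\n'

# The class from the question: "not space, not non-whitespace, not tab, not newline".
# Only unusual whitespace such as \r, \f or \v is left, so a plain space never matches.
leftover = re.findall(r'[^ \S\t\n]', 'a b\tc\nd\re')
print(leftover)  # ['\r']

# A verbose rewrite (a sketch): allow only spaces and tabs between the parts.
pattern = r'''
(?: \d+ [ \t]+ )?      # optional leading number, then spaces/tabs only
Appendix [ \t]+ \w+    # the heading itself, kept on a single line
(?: [ \t]+ \d+ )?      # optional trailing number, after spaces/tabs only
'''
regex = re.compile(pattern, re.VERBOSE)
print(regex.findall(s))
# ['479 Appendix I', 'Appendix I 481']
```

Using \d+ rather than \d* inside the optional groups means a group matches only when a number is actually present, so trailing spaces without a number are not pulled into the result.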