
Python Regular Expression to match space but not newline

I have a string like so:

'\n479 Appendix I\n1114\nAppendix I 481\n'

and want to use a regular expression to find and return

['479 Appendix I', 'Appendix I 481']

I first tried this expression:

pattern = r'''
(?: \d+ \s)? Appendix \s+ \w+ (?: \s \d+)?
'''

regex = re.compile(pattern, re.VERBOSE)

regex.findall(s)

But this returns

['479 Appendix I\n1114', 'Appendix I 481']

because \s also matches \n. Following one of the answers in the post "Python regex match space only", I tried the following:

pattern = r'''
(?: \d+ [^ \S\t\n])? Appendix \s+ \w+ (?: [^ \S\t\n] \d+)?
'''

regex = re.compile(pattern, re.VERBOSE)

regex.findall(s)

which however didn't return the desired result, giving:

['Appendix I', 'Appendix I']

What expression would work in this case?

import re

s = '\n479 Appendix I\n1114\nAppendix I 481\n'

for g in re.findall(r'^.*[^\d\n].*$', s, flags=re.M):
    print(g)

Prints:

479 Appendix I
Appendix I 481

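This regex matches every line that contains at least one character other than a digit or a newline. With re.M, ^ and $ anchor at the start and end of each line, and . never crosses a newline, so each match stays within a single line.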
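To make the behaviour of the line-based pattern concrete, here is a small sketch that compares it with a plain-string filter and shows one caveat. The names by_regex, by_split, extra and unrelated are illustrative only, and the splitlines() version is assumed to be equivalent only for ASCII input like the sample string.

```python
import re

s = '\n479 Appendix I\n1114\nAppendix I 481\n'

# Regex from the answer: keep lines with at least one non-digit character.
by_regex = re.findall(r'^.*[^\d\n].*$', s, flags=re.M)

# Plain-string filter with the same result for this input:
# drop empty lines and lines made up only of digits.
by_split = [line for line in s.splitlines() if line and not line.isdigit()]

# Caveat: any non-numeric line passes, whether or not it mentions "Appendix".
extra = 'Contents\n12\n'
unrelated = re.findall(r'^.*[^\d\n].*$', extra, flags=re.M)

print(by_regex)
print(by_split)
print(unrelated)
```

If only lines mentioning "Appendix" should be returned, a pattern that names that word explicitly is the safer choice.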

This regex is a bit more robust than the one in the other answer because it explicitly requires the word "Appendix" and allows only tabs or spaces, never a newline, next to the optional numbers:

import re

# Raw string so that \d, \t, \s and \w reach the regex engine unchanged.
pattern = r'(?:\d*[\t ]+)?Appendix\s+\w+(?:[\t ]+\d*)?'
re.findall(pattern, s)
# ['479 Appendix I', 'Appendix I 481']
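For completeness, here is a sketch of why the second attempt in the question fails and how the verbose pattern could be repaired. Under re.VERBOSE, whitespace is ignored everywhere except inside a character class, so the literal space in [^ \S\t\n] is part of the class and excludes the space character from matching. The double-negative class [^\S\n] ("whitespace, but not a newline") is a common idiom; the names broken and fixed are illustrative, and this is one possible repair rather than the only one.

```python
import re

s = '\n479 Appendix I\n1114\nAppendix I 481\n'

# Attempt from the question: under re.VERBOSE a literal space inside [...]
# is still significant, so this class also excludes the space character.
broken = re.compile(r'''
(?: \d+ [^ \S\t\n])? Appendix \s+ \w+ (?: [^ \S\t\n] \d+)?
''', re.VERBOSE)

# [^\S\n] reads "not non-whitespace and not newline",
# i.e. any whitespace character except \n.
fixed = re.compile(r'''
(?: \d+ [^\S\n]+ )?      # optional leading number on the same line
Appendix [^\S\n]+ \w+    # the word "Appendix" and its label
(?: [^\S\n]+ \d+ )?      # optional trailing number on the same line
''', re.VERBOSE)

broken_result = broken.findall(s)
fixed_result = fixed.findall(s)
print(broken_result)
print(fixed_result)
```

Unlike the non-verbose pattern above, this variant also forbids a newline between "Appendix" and its label, and it requires at least one digit in each optional number.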
