
python: how to split this string with a regex?

Simple one here but I'm fairly new to Python.

I have a string like this:

this is page one of an article 
<!--pagebreak page two --> this is page two 
<!--pagebreak--> this is the third page 
<!--pagebreak page four --> last page
// newlines added for readability

I need to split the string using this regex: <!--pagebreak(*.?)--> - the idea is that sometimes the <!--pagebreak--> comments have a 'title' (which I use in my templates), other times they don't.

I tried this:

re.split("<!--pagebreak*.?-->", str)

which returned only the items with 'titles' in the pagebreak (and didn't split them correctly either). What am I doing wrong here?

Change *.? to .*?:

re.split("<!--pagebreak.*?-->", str)

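Your current regex matches any number of literal k's (the * applies to the final k of pagebreak), optionally followed by one arbitrary character (.?). A comment whose title is longer than one character therefore never matches.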
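To see the difference concretely, here is a small sketch that runs both patterns over the sample text from the question, joined onto one line as in the original string. The variable names are only illustrative.

```python
import re

# Sample text from the question, on a single line.
text = ("this is page one of an article "
        "<!--pagebreak page two --> this is page two "
        "<!--pagebreak--> this is the third page "
        "<!--pagebreak page four --> last page")

broken = r"<!--pagebreak*.?-->"  # zero or more k's, then at most one character
fixed = r"<!--pagebreak.*?-->"   # any run of characters, as short as possible

broken_parts = re.split(broken, text)
fixed_parts = re.split(fixed, text)

print(len(broken_parts))  # 2: only the untitled <!--pagebreak--> matched
print(len(fixed_parts))   # 4: every page break matched
```

With the broken pattern, the titled comments are left inside the returned pieces, which is why the pages were not split correctly.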

Also, I would recommend using raw strings (r"...") for your regular expressions. It's not necessary in this case, but it's an easy way to spare yourself a few headaches.
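As an illustration of the kind of headache raw strings avoid, consider a pattern that uses \b (a word boundary in regex syntax). This is a made-up example, not something the pagebreak pattern needs.

```python
import re

# In a normal string "\b" is a single backspace character (ASCII 8).
# In a raw string it stays as backslash + b, which re reads as a word boundary.
normal = "\bpage\b"
raw = r"\bpage\b"

print(len(normal))  # 6
print(len(raw))     # 8

text = "this is page one"
print(re.search(normal, text))       # None: the text contains no backspace characters
print(re.search(raw, text).group())  # page
```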

You swapped the . and the *. The correct regex is:

<!--pagebreak.*?-->
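The trailing ? matters when several comments sit on the same line. A rough comparison of the greedy and non-greedy forms, using a shortened, hypothetical version of the sample text:

```python
import re

# Shortened sample on one line, with two pagebreak comments.
text = ("page one <!--pagebreak page two --> page two "
        "<!--pagebreak--> page three")

# Greedy: .* runs to the LAST "-->" on the line, swallowing page two.
greedy = re.split(r"<!--pagebreak.*-->", text)

# Non-greedy: .*? stops at the FIRST "-->" after each "<!--pagebreak".
lazy = re.split(r"<!--pagebreak.*?-->", text)

print(greedy)  # ['page one ', ' page three']
print(lazy)    # ['page one ', ' page two ', ' page three']
```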

Definitely an issue of swapping the . and the *. The "." matches any character (except a newline, by default), the "*" repeats it zero or more times, and the trailing "?" makes that repetition non-greedy, so the match stops at the first "-->" instead of the last.

import re

s = """this is page one of an article 
<!--pagebreak page two --> this is page two 
<!--pagebreak--> this is the third page 
<!--pagebreak page four --> last page"""

# print() with parentheses works on both Python 2 and Python 3
print(re.split(r'<!--pagebreak.*?-->', s))

Outputs:

['this is page one of an article \n', ' this is page two \n', ' this is the third page \n', ' last page']
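Since the question mentions that the titles are used in templates, it may help to keep them. One possible approach, sketched here as an assumption about what is wanted: put a capturing group around .*? so that re.split also returns the captured text. The names parts, titles, bodies and pages are illustrative.

```python
import re

s = """this is page one of an article
<!--pagebreak page two --> this is page two
<!--pagebreak--> this is the third page
<!--pagebreak page four --> last page"""

# With a capturing group, re.split includes the captured text in the result,
# so the list alternates: body, title, body, title, ..., body.
parts = re.split(r'<!--pagebreak(.*?)-->', s)

bodies = [p.strip() for p in parts[0::2]]
# The first page has no preceding comment, so give it an empty title.
titles = [''] + [t.strip() for t in parts[1::2]]

pages = list(zip(titles, bodies))
for title, body in pages:
    print(repr(title), repr(body))
```

Pages without a title come back with an empty string, so a template can test for that case.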

