Simple one here but I'm fairly new to Python.
I have a string like this:
this is page one of an article
<!--pagebreak page two --> this is page two
<!--pagebreak--> this is the third page
<!--pagebreak page four --> last page
// newlines added for readability
I need to split the string using this regex: <!--pagebreak(*.?)-->
The idea is that sometimes the <!--pagebreak--> comments have a 'title' (which I use in my templates), and other times they don't.
I tried this:
re.split("<!--pagebreak*.?-->", str)
which returned only the items with 'titles' in the pagebreak (and didn't split them correctly either). What am I doing wrong here?
Change *.? into .*?:
re.split("<!--pagebreak.*?-->", str)
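Your current regex matches "<!--pagebrea" followed by any number of literal k's (k*), then at most one arbitrary character (.?), then "-->". It can therefore never match a pagebreak comment that has a title after the k.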
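To see the difference between the two patterns side by side, here is a small sketch. The sample text is the string from the question, written on one line on the assumption that the real string has no newlines (the question says they were added for readability); the variable names are just for illustration.

```python
import re

# The question's string on a single line (an assumption about the real data).
text = ("this is page one of an article "
        "<!--pagebreak page two --> this is page two "
        "<!--pagebreak--> this is the third page "
        "<!--pagebreak page four --> last page")

# Original pattern: "k*" is zero or more k's, ".?" is at most one character.
broken = re.split(r"<!--pagebreak*.?-->", text)

# Corrected pattern: ".*?" is any run of characters, as short as possible.
fixed = re.split(r"<!--pagebreak.*?-->", text)

print(len(broken))  # 2, only the bare <!--pagebreak--> is matched
print(len(fixed))   # 4, one piece per page
```

With the original pattern the titled comments stay embedded in the two resulting pieces, which is probably why the result looked wrong.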
Also, I would recommend using raw strings (r"...") for your regular expressions. It's not necessary in this case, but it's an easy way to spare yourself a few headaches.
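As a rough illustration of the kind of headache raw strings avoid, consider a pattern that uses \b (word boundary). This example is not from the question; the "page" pattern is only an assumption chosen to show the effect.

```python
import re

# In a normal string literal "\b" is a single backspace character.
# In a raw string r"\b" is a backslash plus "b", which re reads as a word boundary.
plain = "\bpage\b"
raw = r"\bpage\b"

print(len(plain), len(raw))           # 6 8
print(re.findall(plain, "page one"))  # []
print(re.findall(raw, "page one"))    # ['page']

# The pagebreak pattern contains no backslashes, so both spellings are identical.
print("<!--pagebreak.*?-->" == r"<!--pagebreak.*?-->")  # True
```

The last line is why the raw string is optional for this particular regex.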
You swapped the . with the *. The correct regex is:
<!--pagebreak.*?-->
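If the titles are needed as well, which the parentheses in the question's regex suggest, one option is to keep the capturing group around .*?. When the pattern contains a capturing group, re.split also returns the captured text between the pieces. A minimal sketch, with a shortened sample string and made-up variable names:

```python
import re

# Shortened sample text, only for illustration.
s = "intro <!--pagebreak page two --> second <!--pagebreak--> third"

# The group (.*?) captures whatever sits between "pagebreak" and "-->".
parts = re.split(r"<!--pagebreak(.*?)-->", s)
print(parts)  # ['intro ', ' page two ', ' second ', '', ' third']

# Odd positions hold the captured titles, even positions hold the page text.
titles = [t.strip() for t in parts[1::2]]
pages = [p.strip() for p in parts[0::2]]
print(titles)  # ['page two', '']
print(pages)   # ['intro', 'second', 'third']
```

An empty title string here means the comment was a bare <!--pagebreak-->.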
Definitely an issue of swapping the . and *. The "." matches any character (except a newline, by default), and the asterisk repeats it zero or more times, taking as many characters as it can get. The non-greedy qualifier "?" limits that, so the match stops at the first "-->".
import re
s = """this is page one of an article
<!--pagebreak page two --> this is page two
<!--pagebreak--> this is the third page
<!--pagebreak page four --> last page"""
print(re.split(r'<!--pagebreak.*?-->', s))
Outputs:
['this is page one of an article\n', ' this is page two\n', ' this is the third page\n', ' last page']
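The non-greedy "?" matters most when several comments sit on the same line, which is likely the case for the real string since the newlines were only added for readability. A small comparison, using a shortened sample string that is only an assumption about the data:

```python
import re

# Two pagebreak comments on one line (shortened sample for illustration).
s = ("page one <!--pagebreak page two --> page two "
     "<!--pagebreak--> page three")

# Non-greedy: each match ends at the nearest "-->".
lazy = re.split(r"<!--pagebreak.*?-->", s)

# Greedy: the match runs from the first "<!--pagebreak" to the last "-->".
greedy = re.split(r"<!--pagebreak.*-->", s)

print(lazy)    # ['page one ', ' page two ', ' page three']
print(greedy)  # ['page one ', ' page three']
```

The greedy version silently swallows the middle page, so the "?" is worth keeping even though the multi-line example above would work without it.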