python: how to split this string with a regex?

Question

Simple one here but I'm fairly new to Python.

I have a string like this:

this is page one of an article 
<!--pagebreak page two --> this is page two 
<!--pagebreak--> this is the third page 
<!--pagebreak page four --> last page
// newlines added for readability

I need to split the string on the pagebreak comments using a regex. The idea is that sometimes the comments have a 'title' (which I use in my templates), other times they don't.

I tried this:

re.split("<!--pagebreak*.?-->", str)

which returned only the items with 'titles' in the pagebreak (and didn't split them correctly either). What am I doing wrong here?

Answer 1

Change *.? to .*?:

re.split("<!--pagebreak.*?-->", str)

Your current regex matches any number of literal k's (k*), optionally followed by a single arbitrary character (.?). That is why it only matches the comment without a title.

Also, I would recommend using raw strings (r"...") for your regular expressions. It's not necessary in this case, but it's an easy way to spare yourself a few headaches.
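
To make the difference concrete, here is a small sketch that runs both patterns on the sample string from the question. The variable names (text, broken, fixed) are only for illustration, and the string is written on one line, as the question implies.

```python
import re

text = (
    "this is page one of an article "
    "<!--pagebreak page two --> this is page two "
    "<!--pagebreak--> this is the third page "
    "<!--pagebreak page four --> last page"
)

# Original attempt: "k*" is zero or more literal k's and ".?" is at most one
# arbitrary character, so a comment that carries a title never matches.
broken = re.split(r"<!--pagebreak*.?-->", text)

# Fixed pattern: ".*?" lazily matches any run of characters up to the
# first "-->" that follows.
fixed = re.split(r"<!--pagebreak.*?-->", text)

print(len(broken))  # only the untitled comment was used as a separator
print(fixed)
```

With the broken pattern the titled comments are left inside the resulting pieces, which matches the behaviour described in the question.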

Answer 2

You swapped the . with the *. The correct regex is:

<!--pagebreak.*?-->

Answer 3

Definitely a case of swapping the . and the *. The "." matches any single character (except a newline, by default), the asterisk repeats it zero or more times, and the non-greedy qualifier "?" makes the match stop at the first "-->" instead of the last.

import re

s = """this is page one of an article 
<!--pagebreak page two --> this is page two 
<!--pagebreak--> this is the third page 
<!--pagebreak page four --> last page"""

print(re.split(r'<!--pagebreak.*?-->', s))

Outputs:

['this is page one of an article \n', ' this is page two \n', ' this is the third page \n', ' last page']
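
Since the question mentions that the titles are used in templates, here is a possible extension, not part of the original answers: wrapping .*? in a capturing group makes re.split return the captured text as well. The names parts, pages and titles are hypothetical, and stripping whitespace is an assumption about what the templates need.

```python
import re

s = """this is page one of an article 
<!--pagebreak page two --> this is page two 
<!--pagebreak--> this is the third page 
<!--pagebreak page four --> last page"""

# With a capturing group, re.split keeps the captured text in the result,
# so the list alternates: page, title, page, title, ..., page.
parts = re.split(r"<!--pagebreak(.*?)-->", s)

# Even indexes hold the page bodies.
pages = [p.strip() for p in parts[0::2]]

# Odd indexes hold the titles; an untitled comment captures an empty string,
# which is turned into None here. The first page has no pagebreak before it.
titles = [None] + [t.strip() or None for t in parts[1::2]]

for title, page in zip(titles, pages):
    print(title, "|", page)
```

Each page then sits next to its optional title, so a template can check for None before rendering a heading.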

3 answers

solution1: 2 ACCEPTED 2012-10-04 08:41:03
solution2: 2 2012-10-04 08:41:35
solution3: 2 2012-10-04 08:52:22
