Match last occurrence with regex

Question

I would like to match the last occurrence of a pattern using regex.

I have some text structured this way:

Pellentesque habitant morbi tristique senectus et netus et
lesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae
ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam
egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>

I want to match the last text between two <br> tags, in my case <br>Tizi Ouzou<br>, ideally just the Tizi Ouzou string.

Note that there is some whitespace after the last <br>.

I've tried this:

<br>.*<br>\s*$

but it selects everything from the first <br> to the last.

NB: I'm using Python, and I'm testing my regex with pythex.

Answer 1

For me the clearest way is:

>>> re.findall('<br>(.*?)<br>', text)[-1]
'Tizi Ouzou'
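
One caveat is worth sketching here. `re.findall` returns non-overlapping matches, so each closing <br> is consumed and cannot also open the next match. With the sample text this still returns the right value, but with a different number of segments the last one can be skipped. The variant below assumes the delimiter is always the literal string <br> and uses a lookahead so that the closing tag is not consumed. The helper name `last_between_br` and the `tricky` input are made up for illustration.

```python
import re

def last_between_br(text):
    """Return the text between the last two <br> tags, or None if there are fewer than two."""
    # (?=<br>) checks for the closing tag without consuming it, so the same
    # <br> can also serve as the opening tag of the next match.
    # re.DOTALL lets '.' match newlines in case a segment spans several lines.
    matches = re.findall(r'<br>(.*?)(?=<br>)', text, flags=re.DOTALL)
    return matches[-1] if matches else None

sample = "egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>   "
print(last_between_br(sample))

# Three <br> tags: the plain pattern pairs the first two and never sees 'b'.
tricky = "<br>a<br>b<br>"
print(re.findall(r'<br>(.*?)<br>', tricky))
print(last_between_br(tricky))
```

If the input is guaranteed to have an even number of <br> tags, the one-liner above is enough; the lookahead version simply removes that assumption.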

Answer 2

A non-regex approach using the built-in str methods:

text = """
Pellentesque habitant morbi tristique senectus et netus et
lesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae
ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam
egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>       """

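# Split from the right at most twice: the last piece is the trailing
# whitespace and the second-to-last piece is the text we want.
res = text.rsplit('<br>', 2)[-2]
print(res)  # Tizi Ouzou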

Answer 3

Have a look at the related questions: you shouldn't parse HTML with regex. Use an HTML parser instead. For Python, I hear Beautiful Soup is the way to go.

Anyway, if you want to do it with regex, you need to make sure that .* cannot go past another <br>. To do that, before consuming each character we can use a lookahead to make sure that it doesn't start another <br>:

<br>(?:(?!<br>).)*<br>\s*$
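
As a minimal sketch of how this pattern could be used from Python: the capturing group around the repeated part, the `re.DOTALL` flag, and the name `LAST_SEGMENT` are additions made here for illustration, and the sample text is shortened.

```python
import re

# Each '.' is guarded by (?!<br>), so the repeated group can never run past
# another <br>. The outer parentheses capture the text between the tags.
# re.DOTALL lets '.' match newlines in case the last segment spans lines.
LAST_SEGMENT = re.compile(r'<br>((?:(?!<br>).)*)<br>\s*$', re.DOTALL)

text = "egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>   "

match = LAST_SEGMENT.search(text)
print(match.group(1) if match else None)
```

Because of the `\s*$` anchor, the pattern only matches when the final <br> is followed by nothing but whitespace; otherwise `search` returns None.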

Answer 4

You can use a greedy quantifier with a restricted character class (assuming you have no tags between your <br> tags):

<br>([^<]*)<br>\s*$

or

<br>((?:[^<]+|<(?!br>))*)<br>\s*$

to allow tags inside.

Since the string you are looking for is Tizi Ouzou without the surrounding <br> tags, you can extract the first capturing group.
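
The difference between the two patterns can be sketched in Python as follows. The constant names and the `nested` input (which puts a <b> tag inside the last segment) are hypothetical and chosen only to show where the first pattern stops matching.

```python
import re

# First pattern: the segment may not contain any '<' at all.
NO_INNER_TAGS = re.compile(r'<br>([^<]*)<br>\s*$')
# Second pattern: a '<' is allowed as long as it does not start '<br>'.
ALLOW_INNER_TAGS = re.compile(r'<br>((?:[^<]+|<(?!br>))*)<br>\s*$')

plain = "egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>   "
nested = "egestas <br>semper<br><b>Tizi</b> Ouzou<br>   "

print(NO_INNER_TAGS.search(plain).group(1))
print(NO_INNER_TAGS.search(nested))            # no match: the segment contains '<'
print(ALLOW_INNER_TAGS.search(nested).group(1))
```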

Answer 5

How about [^<>]* instead of .*:

import re


text = """Pellentesque habitant morbi tristique senectus et netus et
lesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae
ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam
egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br> """


# A raw string keeps \s intact; print() works on both Python 2 and 3.
print(re.search(r'<br>([^<>]*)<br>\s*$', text).group(1))

prints

Tizi Ouzou

Answer 6

Try:

re.match(r'(?s).*<br>(?=.*<br>)(.*)<br>', s).group(1)

It first consumes all data up to the last <br>, then backtracks until a look-ahead confirms that there is another <br> after it, and finally extracts the content between them.

It yields:

Tizi Ouzou

EDIT: No need for the look-ahead. An alternative (with the same result), based on a comment by m.buettner:

re.match(r'(?s).*<br>(.*)<br>', s).group(1)
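
A small self-contained sketch of both variants, assuming a shortened two-line sample in `s`. It passes `re.DOTALL` as a flag, which is equivalent to the inline `(?s)` modifier, and also shows why that modifier matters when the text spans several lines.

```python
import re

# Shortened sample; the newline stands in for the multi-line paragraph.
s = "Pellentesque habitant\negestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>   "

# The leading greedy .* runs to the end, then backtracks to the last <br>
# that is still followed by another <br>.
with_lookahead = re.match(r'.*<br>(?=.*<br>)(.*)<br>', s, re.DOTALL)
without_lookahead = re.match(r'.*<br>(.*)<br>', s, re.DOTALL)

print(with_lookahead.group(1))
print(without_lookahead.group(1))

# Without DOTALL, '.' stops at the newline and re.match (anchored at the
# start of the string) finds nothing.
print(re.match(r'.*<br>(.*)<br>', s))
```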

6 answers

solution1
15 2013-08-24 19:56:33

solution2
14 ACCEPTED 2013-08-24 19:45:21

solution3
7 2013-08-24 19:46:51

solution4
6 2013-08-24 19:44:16

solution5
4 2013-08-24 19:46:29

solution6
3 2013-08-24 19:44:45
