Match last occurrence with regex

I would like to match the last occurrence of a pattern using regex.

I have some text structured this way:

Pellentesque habitant morbi tristique senectus et netus et
lesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae
ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam
egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>                        

I want to match the last text between two <br> tags, in my case <br>Tizi Ouzou<br>, and ideally just the Tizi Ouzou string.

Note that there is some whitespace after the last <br>.

I've tried this:

<br>.*<br>\s*$

but it selects everything starting from the first <br> to the last.

NB: I'm using Python, and I'm testing my regex with pythex.

For me the clearest way is:

>>> re.findall('<br>(.*?)<br>', text)[-1]
'Tizi Ouzou'
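One caveat worth checking: findall returns non-overlapping matches, and each match consumes its closing <br>, so that tag cannot also open the next match. It works on the sample text, which has four <br> tags, but with an odd number of tags the last segment can be skipped. A minimal sketch of a variant that uses a lookahead so the closing tag is not consumed (the function name last_between_br is made up for illustration, and . still does not cross newlines here):

```python
import re

def last_between_br(text):
    # (?=<br>) checks for the closing tag without consuming it,
    # so the same <br> can also open the next match.
    matches = re.findall(r'<br>(.*?)(?=<br>)', text)
    return matches[-1] if matches else None

sample = "egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>    "
print(last_between_br(sample))             # Tizi Ouzou
print(last_between_br("<br>a<br>b<br>"))   # b

# The consuming version misses 'b' when there are three tags:
print(re.findall(r'<br>(.*?)<br>', "<br>a<br>b<br>"))  # ['a']
```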

A non-regex approach using the built-in str methods:

text = """
Pellentesque habitant morbi tristique senectus et netus et
lesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae
ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam
egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>       """

res = text.rsplit('<br>', 2)[-2]
#Tizi Ouzou

Have a look at the related questions: you shouldn't parse HTML with regex. Use an HTML parser instead. For Python, I hear Beautiful Soup is the way to go.
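As a rough sketch of the parser route, here is one way to do it with the standard library's html.parser rather than Beautiful Soup (swapped in only so the example has no third-party dependency). The class name BrSegments and the helper last_br_segment are hypothetical names for illustration:

```python
from html.parser import HTMLParser

class BrSegments(HTMLParser):
    """Collects the text found between consecutive <br> tags."""

    def __init__(self):
        super().__init__()
        self.segments = []    # completed segments, in document order
        self._current = None  # stays None until the first <br> is seen

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            # A <br> closes the segment in progress (if any) and opens a new one.
            if self._current is not None:
                self.segments.append("".join(self._current))
            self._current = []

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

def last_br_segment(html):
    parser = BrSegments()
    parser.feed(html)
    parser.close()
    return parser.segments[-1] if parser.segments else None

sample = "egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>    "
print(last_br_segment(sample))  # Tizi Ouzou
```

Text after the final <br> (here, the trailing whitespace) is never followed by a closing <br>, so it is not recorded as a segment.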

Anyway, if you want to do it with regex, you need to make sure that .* cannot go past another <br> . To do that, before consuming each character we can use a lookahead to make sure that it doesn't start another <br> :

<br>(?:(?!<br>).)*<br>\s*$
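A quick way to try this from Python, as a sketch: the capturing group around the repeated part is an addition here (the pattern above has none), so that only the inner text is returned.

```python
import re

text = """ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam
egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>       """

# (?!<br>). matches one character only if a <br> does not start at that
# position, so the repeated group can never run across a tag.
pattern = re.compile(r'<br>((?:(?!<br>).)*)<br>\s*$')

match = pattern.search(text)
last_segment = match.group(1) if match else None
print(last_segment)  # Tizi Ouzou
```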

You can use a greedy quantifier with a restricted character class (assuming you have no tags between your <br> tags):

<br>([^<]*)<br>\s*$

or

<br>((?:[^<]+|<(?!br>))*)<br>\s*$

to allow tags inside.

Since the string you are looking for is Tizi Ouzou without the <br> tags, you can extract the first capturing group.
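To illustrate the difference between the two patterns, here is a small sketch run against a made-up sample in which the last segment contains a <b> tag (the sample string is an assumption, not from the question):

```python
import re

# Last segment contains an inner tag, which the simpler [^<]* pattern rejects.
sample = "egestas <br>semper<br>Tizi <b>Ouzou</b><br>   "

no_tags = re.compile(r'<br>([^<]*)<br>\s*$')
tags_ok = re.compile(r'<br>((?:[^<]+|<(?!br>))*)<br>\s*$')

print(no_tags.search(sample))           # None
print(tags_ok.search(sample).group(1))  # Tizi <b>Ouzou</b>
```

The second pattern accepts a < only when it does not begin <br>, which is why other tags pass through while the capture still stops at the next <br>.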

How about [^<>]* instead of .* :

import re


text = """Pellentesque habitant morbi tristique senectus et netus et
lesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae
ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam
egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br> """


# Raw string avoids escape warnings; print() works on Python 2 and 3
print(re.search(r'<br>([^<>]*)<br>\s*$', text).group(1))

prints

Tizi Ouzou

Try:

re.match(r'(?s).*<br>(?=.*<br>)(.*)<br>', s).group(1)

The leading .* first consumes everything up to the last <br>, then backtracks until the look-ahead confirms that another <br> follows. The group then captures the content between those two tags.

It yields:

Tizi Ouzou

EDIT: No need for the look-ahead. An alternative (with the same result), based on a comment by m.buettner:

re.match(r'(?s).*<br>(.*)<br>', s).group(1)
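A self-contained sketch comparing the two versions, with s defined (it is left undefined in the snippets above). The helper name last_segment is hypothetical, and the None check is added because re.match returns None when the text has fewer than two <br> tags:

```python
import re

s = """Donec eu libero sit amet quam
egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>       """

def last_segment(pattern, text):
    # re.match anchors at the start; (?s) lets . match newlines so the
    # leading .* can skip over the multi-line text before the tags.
    m = re.match(pattern, text)
    return m.group(1) if m else None

with_lookahead = last_segment(r'(?s).*<br>(?=.*<br>)(.*)<br>', s)
without_lookahead = last_segment(r'(?s).*<br>(.*)<br>', s)

print(with_lookahead)     # Tizi Ouzou
print(without_lookahead)  # Tizi Ouzou
```

The look-ahead is redundant because the trailing (.*)<br> already requires a later <br>: if the leading .*<br> lands on the final tag, the rest of the pattern fails and the engine backtracks to the previous tag anyway.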
