Match last occurrence with regex

I would like to match the last occurrence of a pattern using regex.

I have some text structured this way:

Pellentesque habitant morbi tristique senectus et netus et
lesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae
ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam
egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>                        

I want to match the last text between two <br> tags, in my case <br>Tizi Ouzou<br>, and ideally just the Tizi Ouzou string.

Note that there is some whitespace after the last <br>.

I've tried this:

<br>.*<br>\s*$

but it selects everything from the first <br> to the last one.

NB: I'm using Python, and I'm using pythex to test my regex.

For me the clearest way is:

>>> re.findall('<br>(.*?)<br>', text)[-1]
'Tizi Ouzou'
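One caveat with this approach: re.findall returns non-overlapping matches, so each closing <br> is consumed and cannot also open the next match. It works on the sample text, but with a different number of segments the last one can be skipped. A minimal sketch, assuming a shortened variant of the sample text (the names shorter, consuming and lookahead are made up for illustration), compares it with a lookahead that leaves the closing <br> unconsumed:

```python
import re

text = "egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>    "
# Hypothetical variant of the sample with one segment fewer.
shorter = "egestas <br>tizi ouzou<br>Tizi Ouzou<br>    "

consuming = r'<br>(.*?)<br>'      # closing <br> is part of the match
lookahead = r'<br>(.*?)(?=<br>)'  # closing <br> is only checked, not consumed

on_sample = re.findall(consuming, text)
on_shorter = re.findall(consuming, shorter)
with_lookahead = re.findall(lookahead, shorter)

print(on_sample)       # ['semper', 'Tizi Ouzou']
print(on_shorter)      # ['tizi ouzou']
print(with_lookahead)  # ['tizi ouzou', 'Tizi Ouzou']
```

With the lookahead version, taking [-1] gives the last segment regardless of how many <br> tags precede it.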

A non-regex approach using the built-in str methods:

text = """
Pellentesque habitant morbi tristique senectus et netus et
lesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae
ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam
egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>       """

res = text.rsplit('<br>', 2)[-2]
#Tizi Ouzou

Have a look at the related questions: you shouldn't parse HTML with regex. Use an HTML parser instead. For Python, I hear Beautiful Soup is the way to go.
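To illustrate the parser route without a third-party dependency, here is a rough sketch using the standard library's html.parser module rather than Beautiful Soup. The class name BrSegments and its attributes are hypothetical, and the sketch assumes the fragment contains only text and <br> tags:

```python
from html.parser import HTMLParser


class BrSegments(HTMLParser):
    """Collect the text runs that sit between two consecutive <br> tags."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._current = None  # stays None until the first <br> is seen

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            # A new <br> closes the run opened by the previous one.
            if self._current is not None:
                self.segments.append("".join(self._current))
            self._current = []

    def handle_data(self, data):
        # Text before the first <br> is ignored.
        if self._current is not None:
            self._current.append(data)


text = "egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>    "
parser = BrSegments()
parser.feed(text)
parser.close()
last = parser.segments[-1]
print(last)  # Tizi Ouzou
```

The trailing whitespace after the final <br> is never followed by another <br>, so it is not recorded as a segment.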

Anyway, if you want to do it with regex, you need to make sure that .* cannot go past another <br>. To do that, before consuming each character we can use a negative lookahead to make sure that it doesn't start another <br>:

<br>(?:(?!<br>).)*<br>\s*$
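As a quick check in Python, the pattern can be used with re.search. The capturing group around the repeated part is an addition for illustration, so that the text can be pulled out without the surrounding tags:

```python
import re

text = """Pellentesque habitant morbi tristique senectus et netus et
egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>       """

# (?!<br>). consumes one character only if it does not begin another <br>,
# so the repeated group can never run across a tag boundary.
pattern = r'<br>((?:(?!<br>).)*)<br>\s*$'

match = re.search(pattern, text)
whole = match.group(0)  # includes both tags and the trailing whitespace
inner = match.group(1)  # just the text between the last two <br> tags
print(inner)  # Tizi Ouzou
```

Note that . does not match newlines by default, so this assumes the last segment sits on a single line.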

You can use a greedy quantifier with a restricted character class (assuming you have no tags between your <br> tags):

<br>([^<]*)<br>\s*$

or

<br>((?:[^<]+|<(?!br>))*)<br>\s*$

to allow tags inside.

Since the string you are looking for is Tizi Ouzou without the <br> tags, you can extract the first capturing group.
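The difference between the two patterns shows up once the last segment contains another tag. A small sketch, assuming a made-up input in which the last segment includes a <b> element (the variable names are illustrative only):

```python
import re

plain = "egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>   "
# Hypothetical input with a tag inside the last segment.
nested = "egestas <br>semper<br>Tizi <b>Ouzou</b><br>   "

no_tags = r'<br>([^<]*)<br>\s*$'
tags_ok = r'<br>((?:[^<]+|<(?!br>))*)<br>\s*$'

plain_result = re.search(no_tags, plain).group(1)
nested_strict = re.search(no_tags, nested)       # no match: [^<]* stops at <b>
nested_result = re.search(tags_ok, nested).group(1)

print(plain_result)   # Tizi Ouzou
print(nested_strict)  # None
print(nested_result)  # Tizi <b>Ouzou</b>
```

In the second pattern, <(?!br>) lets a < through only when it does not start a <br>, which is what allows other tags inside the captured text.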

How about [^<>]* instead of .*:

import re


text = """Pellentesque habitant morbi tristique senectus et netus et
lesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae
ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam
egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br> """


# Raw string avoids invalid-escape warnings for \s; print() works on Python 3.
print(re.search(r'<br>([^<>]*)<br>\s*$', text).group(1))

prints

Tizi Ouzou

Try:

re.match(r'(?s).*<br>(?=.*<br>)(.*)<br>', s).group(1)

It first consumes all data up to the last <br>, then backtracks until the look-ahead confirms that there is another <br> after it, and finally extracts the content between the two.

It yields:

Tizi Ouzou

EDIT: No need for the look-ahead. An alternative (with the same result) based on a comment by m.buettner:

re.match(r'(?s).*<br>(.*)<br>', s).group(1)
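The look-ahead is redundant because the trailing (.*)<br> already requires a further <br> after the one matched first. A short sketch to confirm that both versions agree on the sample text (s is the sample string, as in the snippets above; the other names are illustrative):

```python
import re

s = """Pellentesque habitant morbi tristique senectus et netus et
egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>       """

# (?s) lets . match newlines, so the leading greedy .* can run over the
# whole multi-line text before backtracking to the last usable <br>.
with_lookahead = re.match(r'(?s).*<br>(?=.*<br>)(.*)<br>', s).group(1)
without_lookahead = re.match(r'(?s).*<br>(.*)<br>', s).group(1)

print(with_lookahead)     # Tizi Ouzou
print(without_lookahead)  # Tizi Ouzou
```

Both rely on re.match anchoring at the start of the string and on the greedy .* pushing the match as far right as possible.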
