Trying to match this regex

I have been trying to get this regex to match, to no avail. What I need is a non-greedy match that captures the last number before a specific word, in this case Next.

Here is the text:

<a href="/forum/view-forum/standard-trading-shops/page/1">Prev</a>
<a href="/forum/view-forum/standard-trading-shops/page/1">1</a>
<a class="current" href="/forum/view-forum/standard-trading-shops/page/2">2</a>
<a href="/forum/view-forum/standard-trading-shops/page/3">3</a>
<a href="/forum/view-forum/standard-trading-shops/page/4">4</a>
<span class="separator">...</span><a href="/forum/view-forum/standard-trading-shops/page/3029">3029</a>
<a href="/forum/view-forum/standard-trading-shops/page/3030">3030</a>
<a href="/forum/view-forum/standard-trading-shops/page/3">Next</a>

I need to find 3030 as my answer, which by extension is the highest number in the passage.

What I tried:

(/d)+.*?Next

This, however, always matches the first number (the 1 on the second line) instead of the highest number, 3030. It was my understanding that .*? does a non-greedy match, which should match the latest occurrence.

Can anyone help me? thanks M

^[\s\S]*>(\d+)<

You can try this. Grab group 1 (the first capture). See the demo.

https://regex101.com/r/sJ9gM7/28

Here you do a greedy match up to a number, so this stops at the last occurrence of a number between > and <. The . does not match newlines by default, so either DOTALL or [\s\S] can be used.
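As a rough illustration of the greedy behaviour described above, here is a minimal Python sketch. It assumes the pagination markup from the question is held in a string; the names `html` and `last_number` are made up for this example.

```python
import re

# Pagination markup copied from the question.
html = '''<a href="/forum/view-forum/standard-trading-shops/page/1">Prev</a>
<a href="/forum/view-forum/standard-trading-shops/page/1">1</a>
<a class="current" href="/forum/view-forum/standard-trading-shops/page/2">2</a>
<a href="/forum/view-forum/standard-trading-shops/page/3">3</a>
<a href="/forum/view-forum/standard-trading-shops/page/4">4</a>
<span class="separator">...</span><a href="/forum/view-forum/standard-trading-shops/page/3029">3029</a>
<a href="/forum/view-forum/standard-trading-shops/page/3030">3030</a>
<a href="/forum/view-forum/standard-trading-shops/page/3">Next</a>'''

# [\s\S]* is greedy and also crosses newlines, so the engine first consumes
# the whole input and then backtracks to the LAST place where >digits< fits.
match = re.search(r'^[\s\S]*>(\d+)<', html)
last_number = match.group(1) if match else None
print(last_number)
```

Note that this picks the last number that sits between > and < anywhere in the input, so it only gives the page before Next as long as no later element contains a bare number.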

Parsing HTML with regexes is generally ill-advised. This website explains why and gives you better alternatives in all major languages.

You haven't specified which language you're working in, but this regex will work in most cases:

(\d+)(?:<[^>]+>[^<]*){2}Next

Debuggex Demo

The number will be in the first capture group. Effectively I'm saying that the number should be followed by {2} instances of <, then any characters that aren't >, up to the closing >, optionally followed by some characters that aren't <. After those 2 instances of <something> should come the word Next.
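Since the language wasn't specified, here is one possible way to try this pattern, sketched in Python. The variable names `html`, `pattern` and `page` are illustrative only, and the input is the markup from the question.

```python
import re

# Pagination markup copied from the question.
html = '''<a href="/forum/view-forum/standard-trading-shops/page/1">Prev</a>
<a href="/forum/view-forum/standard-trading-shops/page/1">1</a>
<a class="current" href="/forum/view-forum/standard-trading-shops/page/2">2</a>
<a href="/forum/view-forum/standard-trading-shops/page/3">3</a>
<a href="/forum/view-forum/standard-trading-shops/page/4">4</a>
<span class="separator">...</span><a href="/forum/view-forum/standard-trading-shops/page/3029">3029</a>
<a href="/forum/view-forum/standard-trading-shops/page/3030">3030</a>
<a href="/forum/view-forum/standard-trading-shops/page/3">Next</a>'''

# A number, then exactly two tags (each optionally followed by non-tag text),
# then the literal word Next. [^<]* also matches the newline between the links.
pattern = re.compile(r'(\d+)(?:<[^>]+>[^<]*){2}Next')

match = pattern.search(html)
page = match.group(1) if match else None
print(page)
```

Unlike a pattern that simply takes the last number in the input, this one anchors on the word Next, so it depends on exactly two tags (here the closing </a> and the opening <a ...>) sitting between the number and Next.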

Using BeautifulSoup is the preferred method for parsing HTML.

s = """<a href="/forum/view-forum/standard-trading-shops/page/1">Prev</a>
<a href="/forum/view-forum/standard-trading-shops/page/1">1</a>
<a class="current" href="/forum/view-forum/standard-trading-shops/page/2">2</a>
<a href="/forum/view-forum/standard-trading-shops/page/3">3</a>
<a href="/forum/view-forum/standard-trading-shops/page/4">4</a>
<span class="separator">...</span><a href="/forum/view-forum/standard-trading-shops/page/3029">3029</a>
<a href="/forum/view-forum/standard-trading-shops/page/3030">3030</a>
<a href="/forum/view-forum/standard-trading-shops/page/3">Next</a>"""

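from bs4 import BeautifulSoup

# Name the parser explicitly; without it bs4 picks one itself and warns.
soup = BeautifulSoup(s, "html.parser")
# Each link's text ends up on its own line, mirroring the markup above.
text = soup.text.splitlines()
index = text.index('Next')
# The entry just before 'Next' is the last page number.
result = text[index-1]

>>> print(result)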
3030

Not as elegant as a regular expression, but it's the proper way to do it.


 