
Matching from the last occurrence of a character in a string with regex

Yes I know, don't parse html with regex. That said:

I am trying to capture the content between a pair of tags when the opening tag contains the word "Title".

I started with:

(?P<QUALIFY_TITLE><(.*?)(title)(.*?)>)(.*?)?(?<CAPTURE>KnownTermIWant)(.*?)(\<\/.*?>)

Here the named group CAPTURE is a known word/string I am looking for. I also capture the QUALIFY_TITLE named group for research purposes, because I don't want the string/term unless I can 'qualify' it in this way.

However, if I have part of an html that looks like this:

<div class="wwm"><div class="inbox"><input name="language-id" type="hidden" id="language-id" value="" /><input name="widget-page-handle" type="hidden" id="widget-page-handle" value="wwm4widget_post" /><input name="email-page-handle" type="hidden" id="email-page-handle" value="wwm4widget_emailpopup" /><div id="divWidget" style="display: block;" class="vhWidget"> <div id="divShareLink" style="display: block;" class="shareLink"><div id="divTitle" class="title">KnownTermIWant</title>

Although I get the CAPTURE string I want (KnownTermIWant), the QUALIFY_TITLE string starts from the very first "<".

I am trying to have QUALIFY_TITLE start from the last "<" before the title, not the first. In other words, QUALIFY_TITLE should be:

<div id="divTitle

or even

<div id="divTitle" class="title">

but I am currently getting

<div class="wwm"><div class="inbox"><input name="language-id" type="hidden" id="language-id" value="" /><input name="widget-page-handle" type="hidden" id="widget-page-handle" value="wwm4widget_post" /><input name="email-page-handle" type="hidden" id="email-page-handle" value="wwm4widget_emailpopup" /><div id="divWidget" style="display: block;" class="vhWidget"> <div id="divShareLink" style="display: block;" class="shareLink"><div id="divTitle" class="title"

The problem is that a regex-search will try to match at the first possible opportunity, and non-greedy quantifiers ( *? instead of * ) do not affect whether something is a match. For example, given the string abcd , the regex .*?d will match the whole thing, because .*? will still match as much as it needs to in order to ensure that the regex matches.

Do you see what I mean?
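A quick way to check this is with Python's re module (the choice of Python here is an assumption; the question does not name a regex engine). The sketch below shows that the lazy quantifier only changes where the match ends, never where it starts:

```python
import re

# On "abcd", lazy and greedy give the same result: the match must start at
# index 0 (the leftmost possible position) and must reach the "d".
lazy = re.search(r".*?d", "abcd")
greedy = re.search(r".*d", "abcd")
print(lazy.group(0), lazy.start())      # abcd 0
print(greedy.group(0), greedy.start())  # abcd 0

# With two candidates for the final "d", laziness picks the nearer END,
# but the START is still the leftmost position in both cases.
lazy_two = re.search(r".*?d", "abcd abcd")
greedy_two = re.search(r".*d", "abcd abcd")
print(lazy_two.group(0))    # abcd
print(greedy_two.group(0))  # abcd abcd
```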

So you need to make your subexpressions more precise; for example, instead of <(.*?)(title)(.*?)> , you should write <([^>]*)(title)([^>]*)> .
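As a rough illustration of the difference, here is a sketch in Python (an assumed engine) that runs both subexpressions against a shortened copy of the HTML from the question. The shortened html string is an assumption made for readability, not the full input:

```python
import re

# Shortened version of the HTML from the question (assumption, for brevity).
html = (
    '<div class="wwm"><div class="inbox">'
    '<div id="divTitle" class="title">KnownTermIWant</title>'
)

# Original subexpression: "." may cross ">" so the match begins at the first "<".
loose = re.search(r"<(.*?)(title)(.*?)>", html)

# Suggested subexpression: "[^>]" cannot leave the tag it started in.
strict = re.search(r"<([^>]*)(title)([^>]*)>", html)

print(loose.group(0))   # spans all three opening tags
print(strict.group(0))  # only the tag that contains "title"
```

Note that both patterns are case-sensitive, so it is the lowercase "title" in class="title" that matches here, not the "Title" in id="divTitle".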

The problem

There's only one problem here: you are matching exactly what you asked for :)

The process

If you want to match only the last tag, ask yourself this question:

"What is inside every preceding tag, but not inside the one I want?"

The conclusion

The answer is the open/close tags themselves:

(?P<QUALIFY_TITLE><([^<>]*?)(title)(.*?)>)(.*?)?(?<CAPTURE>KnownTermIWant)(.*?)(\<\/.*?>)
                    ^^^^^
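For anyone who wants to try this, here is a minimal sketch in Python. Two things in it are assumptions rather than part of the answer: Python as the engine, and the spelling (?P<CAPTURE>...) for the second named group, since Python's re module does not accept the (?<CAPTURE>...) form used above. The html string is a shortened copy of the question's input.

```python
import re

# Shortened version of the HTML from the question (assumption, for brevity).
html = (
    '<div class="wwm"><div class="inbox">'
    '<input name="language-id" type="hidden" id="language-id" value="" />'
    '<div id="divTitle" class="title">KnownTermIWant</title>'
)

pattern = re.compile(
    r"(?P<QUALIFY_TITLE><([^<>]*?)(title)(.*?)>)"  # [^<>] keeps us inside one tag
    r"(.*?)?"
    r"(?P<CAPTURE>KnownTermIWant)"  # Python spelling of the named group
    r"(.*?)(</.*?>)"
)

match = pattern.search(html)
print(match.group("QUALIFY_TITLE"))  # <div id="divTitle" class="title">
print(match.group("CAPTURE"))        # KnownTermIWant
```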

Your code was quite a big mess, but I'm going to answer the question in the title, in a much more simplified way:

In this sample code:

<div>Example text<div>Foo bar</div> Hello world <div>Lorem ipsum</div></div> hi

if you want to match from the first <div> to the last </div> , you could just use a greedy quantifier, such as + or * :

/<div>(.*)<\/div>/

That will match everything from the first <div> through the very last </div> .
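To see the effect, here is a small sketch in Python (an assumed engine; the /.../ delimiters above are dropped because Python takes the pattern as a plain string). It contrasts the greedy pattern with its lazy counterpart on the sample line:

```python
import re

sample = (
    "<div>Example text<div>Foo bar</div> Hello world "
    "<div>Lorem ipsum</div></div> hi"
)

# Greedy: ".*" grabs as much as possible, so the match ends at the LAST </div>.
greedy = re.search(r"<div>(.*)</div>", sample)

# Lazy: ".*?" grabs as little as possible, so the match ends at the FIRST </div>.
lazy = re.search(r"<div>(.*?)</div>", sample)

print(greedy.group(0))
print(lazy.group(0))
```

Both matches start at the first <div>; only the end position differs, and the trailing " hi" is not part of either match.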


If this doesn't answer your question, the regular expression gets complicated very quickly (each extra requirement makes it basically exponentially more complex), so, like you said in your very first line, just use a parser.
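As one possible shape for the parser route, here is a minimal sketch using html.parser from Python's standard library. The class name TitleTextCollector and the rule for what counts as a qualifying tag (the tag name or any attribute value contains "title", case-insensitively) are assumptions made for illustration; the question does not pin down the exact rule.

```python
from html.parser import HTMLParser


class TitleTextCollector(HTMLParser):
    """Collects text that directly follows an opening tag mentioning "title"."""

    def __init__(self):
        super().__init__()
        self.in_title_tag = False
        self.qualifying_tags = []  # (tag name, attribute dict) pairs
        self.texts = []            # text found right after a qualifying tag

    def handle_starttag(self, tag, attrs):
        # Assumed rule: tag name or any attribute value contains "title".
        parts = [tag] + [value or "" for _, value in attrs]
        self.in_title_tag = any("title" in part.lower() for part in parts)
        if self.in_title_tag:
            self.qualifying_tags.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        # Any closing tag ends the region we are interested in.
        self.in_title_tag = False

    def handle_data(self, data):
        if self.in_title_tag and data.strip():
            self.texts.append(data.strip())


# Shortened version of the HTML from the question (assumption, for brevity).
html = (
    '<div class="wwm"><div class="inbox">'
    '<input name="language-id" type="hidden" id="language-id" value="" />'
    '<div id="divTitle" class="title">KnownTermIWant</title>'
)

collector = TitleTextCollector()
collector.feed(html)
print(collector.qualifying_tags)
print(collector.texts)
```

The parser tolerates the mismatched </title> in the sample, and each qualifying tag is reported on its own, so the "last < before the title" problem does not come up.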
