So, I have this text:
<a href="/find/1">testing</a>
<strong>a known text</strong>
<p>testing2</p>
<p>this paragraphs are dynamically</p>
...
<a href="/find/2/">testing again</a>
<a href="/find/3/">testing and again</a>
I want to get all the hrefs that appear after the "a known text" element.
I use this regex to get all the matches: (?<=<a\ href=")/find/.*?(?=")
But I also get /find/1, which is a result I don't want.
I've tried this: a known tex[\w\W](?<=<a\ href=")/find/*?(?=")
but it's not working, and I have no idea how to do this correctly. Basically, I want to get only /find/2/ and /find/3/.
PS: I am not actually writing C#; I'm using software that is written in C# and uses the C# regex engine.
I have this regex, which is a bit different from Marcin's, since I'm not used to having variable-length patterns in lookbehinds:
var regex = new Regex(@"(?:a known text|(?<!^)\G)[\w\W]+?((?<=<a\ href="")/find/.*?(?=""))");
I believe this should make the regex a little more efficient.
\G is an anchor that matches where the previous match ended, so after finding the first /find/ link, the regex tries matching again from that point. I had to add the negative lookbehind (?<!^) because \G also matches at the very start of the input on the first attempt, which would let the regex match before a known text has been seen.
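As a minimal sketch of how the \G-based regex might be used from C#, the snippet below runs it over the sample from the question and collects capture group 1 from each match. The names LinkExtractor and Extract are hypothetical and only there for illustration; the pattern itself is the one given above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Pattern from the answer: start at "a known text", or (via \G) where the
    // previous match ended. (?<!^) stops \G from matching at the start of the input.
    private static readonly Regex LinksAfterMarker = new Regex(
        @"(?:a known text|(?<!^)\G)[\w\W]+?((?<=<a\ href="")/find/.*?(?=""))");

    // Hypothetical helper: returns every /find/... href found after the marker text.
    public static List<string> Extract(string html)
    {
        var links = new List<string>();
        foreach (Match m in LinksAfterMarker.Matches(html))
        {
            links.Add(m.Groups[1].Value); // group 1 holds the href value only
        }
        return links;
    }

    public static void Main()
    {
        // Sample text from the question.
        string html = string.Join("\n",
            "<a href=\"/find/1\">testing</a>",
            "<strong>a known text</strong>",
            "<p>testing2</p>",
            "<p>this paragraphs are dynamically</p>",
            "...",
            "<a href=\"/find/2/\">testing again</a>",
            "<a href=\"/find/3/\">testing and again</a>");

        foreach (string link in Extract(html))
        {
            Console.WriteLine(link); // prints /find/2/ then /find/3/
        }
    }
}
```

Each later match starts exactly where the previous one ended, which is why the link before the marker text is never reached.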
Concerning your regex, a known tex[\w\W](?<=<a\ href=")/find/*?(?="), the small mistakes were forgetting the quantifier after [\w\W] and the dot before *? after /find/. Using a known tex[\w\W]+?(?<=<a\ href=")(/find/.*?)(?=") would have got you only /find/2/ (with a greedy + instead of +? you would get only the last link, /find/3/), which is already better than nothing!
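To see why the corrected version of that regex returns a single link, here is a small hedged sketch that compares a lazy and a greedy quantifier after [\w\W]. The class name SingleMatchDemo and the helper FirstCapture are hypothetical; only the regex fragments come from the discussion above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SingleMatchDemo
{
    // Shared tail of the pattern: the href value becomes capture group 1.
    private const string Tail = @"(?<=<a\ href="")(/find/.*?)(?="")";

    // Hypothetical helper: builds the regex with the given quantifier ("+?" or "+")
    // and returns group 1 of the first match, or an empty string if nothing matches.
    public static string FirstCapture(string quantifier, string html)
    {
        var regex = new Regex(@"a known tex[\w\W]" + quantifier + Tail);
        Match m = regex.Match(html);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        string html = string.Join("\n",
            "<a href=\"/find/1\">testing</a>",
            "<strong>a known text</strong>",
            "<p>testing2</p>",
            "<a href=\"/find/2/\">testing again</a>",
            "<a href=\"/find/3/\">testing and again</a>");

        // Lazy: stops at the first link after the marker text.
        Console.WriteLine(FirstCapture("+?", html)); // prints /find/2/
        // Greedy: runs to the end, then backtracks to the last link.
        Console.WriteLine(FirstCapture("+", html));  // prints /find/3/
    }
}
```

Either way there is only one match, because the literal marker text occurs only once and nothing lets the regex resume after it.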
EDIT: As AlanMoore rightly pointed out, you can simplify the regex:
var regex = new Regex(@"(?:a known text|(?<!^)\G)[\w\W]+?<a href=""(/find/.*?)""");
And to make the . match newlines, we can use (?s) and remove the [\w\W] part:
var regex = new Regex(@"(?s)(?:a known text|(?<!^)\G).*?<a href=""(/find/.*?)""");