Before anyone says it: I know I should use a proper parser, but for my use case a regular expression is the better fit.
I have the following regex to try to match text outside of HTML tags:
(?<!<[^>]*)(?<Text>.+?)
However, this also seems to match the opening bracket of each tag, i.e. <. How can I fix this?
Example input:
<span style="color:blue">some <strong>bold</strong> text</span>
Expected:
some bold text
Got:
<some <bold< text<
The problem is that you are using . (a dot), which matches any character (except a line break, by default). Replace it with a negated character class such as [^<>], which matches any character except < and >, and use a greedy quantifier, either * (to match 0 or more occurrences) or + (to match 1 or more occurrences):
(?<!<[^>]*)(?<Text>[^<>]*)
See the regex demo
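As a rough illustration, the corrected pattern can be tried in TypeScript, assuming a runtime whose regex engine supports variable-length lookbehind and the (?<Text>...) named-group syntax (ES2018 or later). This is only a sketch: the helper name textOutsideTags is made up for the example, the + variant is used so that empty matches are skipped, and it assumes no attribute value contains a literal >.

```typescript
// Pattern from the answer, using + instead of * so empty matches are skipped.
// The named group (?<Text>...) is also capture group 1.
const TEXT_OUTSIDE_TAGS = "(?<!<[^>]*)(?<Text>[^<>]+)";

// Hypothetical helper: collects every run of text that is not inside a tag.
function textOutsideTags(html: string): string[] {
  const re = new RegExp(TEXT_OUTSIDE_TAGS, "g");
  const parts: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(html)) !== null) {
    parts.push(m[1]); // group 1 is the Text group
  }
  return parts;
}

const input = '<span style="color:blue">some <strong>bold</strong> text</span>';
const found = textOutsideTags(input);
console.log(found.join("")); // prints "some bold text"
```

The lookbehind rejects every position that is preceded by a < with no > in between, which is exactly the inside of a tag, so only the three text runs "some ", "bold" and " text" are returned.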
BTW, using (?<Text>.+?) at the end of the pattern makes the regex engine match only 1 character at a time, since +? is a lazy quantifier that matches 1 or more occurrences but as few as possible (and since 1 is enough, it will always match just 1 character). A lazily quantified pattern usually needs some other pattern after it; otherwise it rarely captures the intended text.
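The lazy-quantifier behaviour can be seen in a small TypeScript sketch. The helper allMatches and the sample string are made up for the illustration, and the lookahead (?=<) is just one possible "pattern after" the lazy part, not something from the original regex.

```typescript
// Hypothetical helper: returns every match of `pattern` in `text`.
function allMatches(pattern: string, text: string): string[] {
  const re = new RegExp(pattern, "g");
  const out: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    out.push(m[0]);
  }
  return out;
}

const sample = "some <b>";

// Lazy quantifier at the very end: one character is enough, so every
// match is a single character.
const lazyAtEnd = allMatches(".+?", sample);
console.log(lazyAtEnd.join("|")); // prints "s|o|m|e| |<|b|>"

// Lazy quantifier followed by a lookahead: it has to expand until the
// next character is "<".
const lazyBeforeAngle = allMatches(".+?(?=<)", sample);
console.log(lazyBeforeAngle.join("|")); // prints "some "
```

With nothing after it, .+? stops after one character on every attempt. Once something must follow, the engine keeps extending the lazy part until that following pattern can match.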