Match text outside of html tags

Question

Before anyone says it: I know I should use a proper parser, but for my use case a regular expression is the better choice.

I have the following regex to try to match text outside of HTML tags:

(?<!<[^>]*)(?<Text>.+?)

However, this also seems to match the opening bracket of each tag, i.e. <. How can I fix this?

Example input:

<span style="color:blue">some <strong>bold</strong> text</span>

Expected:

some bold text

Got:

<some <bold< text<

Link to RegexStorm.

Answer 1

The problem is that you are using ., which matches any character (except a newline, by default), so it also matches < and >. Replace it with a negated character class such as [^<>], which matches any character except < and >, and use a greedy quantifier: * (zero or more occurrences) or + (one or more occurrences):

(?<!<[^>]*)(?<Text>[^<>]*)

See the regex demo
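
For readers who want to try the idea outside .NET, here is a minimal Python sketch. It is not the pattern above: Python's built-in re module rejects variable-width lookbehind such as (?<!<[^>]*), so this version swaps in a negative lookahead that rejects any run of text that is followed by > before the next <. The function name extract_text is made up for illustration, and the sketch assumes attribute values never contain < or >.

```python
import re

# Assumption: Python's `re` cannot compile the variable-width lookbehind
# from the .NET pattern, so the check is done with a lookahead instead.
# [^<>]+         : a run of characters that are neither < nor >
# (?![^<>]*>)    : reject the run if a > comes before the next <,
#                  because that means the run sits inside a tag
TEXT_OUTSIDE_TAGS = re.compile(r"(?P<Text>[^<>]+)(?![^<>]*>)")


def extract_text(html: str) -> str:
    """Join every chunk of text that lies outside <...> tags."""
    return "".join(m.group("Text") for m in TEXT_OUTSIDE_TAGS.finditer(html))


sample = '<span style="color:blue">some <strong>bold</strong> text</span>'
print(extract_text(sample))  # some bold text
```

The sketch uses + rather than * so that the engine does not report empty matches between adjacent tags.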

BTW, using (?<Text>.+?) at the end of the pattern makes the regex engine match just one character at a time: +? is a lazy quantifier that matches one or more occurrences but as few as possible, and since nothing follows it, one character is always enough. A lazily quantified subpattern usually needs some other pattern after it; otherwise it rarely captures the intended text.
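
The lazy-quantifier behavior described above can be seen in a small Python sketch. Lazy quantifiers behave the same way in Python's re module as in .NET for these simple patterns; the variable names and sample strings are only illustrative.

```python
import re

# A trailing lazy quantifier stops as soon as it can: one character per match.
lazy_matches = re.findall(r".+?", "abc")
print(lazy_matches)  # ['a', 'b', 'c']

# The greedy form takes as much as possible in a single match.
greedy_matches = re.findall(r".+", "abc")
print(greedy_matches)  # ['abc']

# With something after it (here a lookahead for <), the lazy quantifier
# has to keep expanding until that following pattern can match.
lazy_with_follower = re.search(r".+?(?=<)", "some <b>bold</b>").group()
print(repr(lazy_with_follower))  # 'some '
```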

solution1
4 ACCEPTED 2017-01-12 12:17:42
