Match text outside of HTML tags

Before anyone says it: I know I should use a proper parser, but for my use case a regular expression is the better fit.

I have the following regex to try to match text outside of HTML tags:

(?<!<[^>]*)(?<Text>.+?)

However, this also seems to match the opening bracket of each tag, i.e. <. How can I fix this?

Example input:

<span style="color:blue">some <strong>bold</strong> text</span>

Expected:

some bold text

Got:

<some <bold< text<

Link to RegexStorm.

The problem is that you are using ., which matches any character (other than a newline by default), including <. Replace it with a negated character class such as [^<>], which matches any character except < and >, and use a greedy quantifier: * (to match 0 or more occurrences) or + (to match 1 or more occurrences):

(?<!<[^>]*)(?<Text>[^<>]*)

See the regex demo
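
For a quick check outside RegexStorm, here is a minimal TypeScript sketch of the difference between the two patterns. It assumes a JavaScript engine with ES2018 lookbehind support (for example a recent Node.js), used here as a stand-in for the .NET engine the question appears to target. The names findAll, SAMPLE, ORIGINAL, and FIXED are hypothetical, and the fixed pattern uses + rather than * so that empty matches are skipped.

```typescript
// Sample input from the question.
const SAMPLE = '<span style="color:blue">some <strong>bold</strong> text</span>';

// Patterns are passed as strings to the RegExp constructor (no backslashes to escape here).
const ORIGINAL = "(?<!<[^>]*)(?<Text>.+?)";  // lazy dot: one character per match
const FIXED = "(?<!<[^>]*)(?<Text>[^<>]+)";  // negated class: a whole run of text per match

// Hypothetical helper: return every match of `pattern` in `input`.
function findAll(pattern: string, input: string): string[] {
  const re = new RegExp(pattern, "g");
  const out: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(input)) !== null) {
    // The Text group spans the whole match, because the lookbehind consumes nothing.
    out.push(m[0]);
    // Guard against an endless loop if a pattern ever matches the empty string.
    if (m[0].length === 0) re.lastIndex++;
  }
  return out;
}

console.log(findAll(ORIGINAL, SAMPLE).join("")); // <some <bold< text<
console.log(findAll(FIXED, SAMPLE).join(""));    // some bold text
```

With [^<>]* instead of [^<>]+, the pattern can also succeed with an empty match in front of each <, so + is usually the more convenient choice when only non-empty text is wanted.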

BTW, placing (?<Text>.+?) at the end of the pattern makes the regex engine match only 1 char at a time: +? is a lazy quantifier that matches 1 or more occurrences but as few as possible, and since 1 is enough, it always stops after a single char. A lazily quantified subpattern usually needs some other pattern after it; otherwise it rarely captures the intended text.
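
The lazy-quantifier behaviour can be seen in isolation with a small TypeScript sketch. It is only an illustration under assumptions: the sample string and the names firstMatch, lazyAlone, lazyAnchored, and greedy are hypothetical, and a JavaScript engine is used in place of the one from the question.

```typescript
const text = "bold</strong>";

// Hypothetical helper: text of the first match, or null when nothing matches.
function firstMatch(re: RegExp, input: string): string | null {
  const m = re.exec(input);
  return m === null ? null : m[0];
}

// Lazy quantifier with nothing after it: a single character satisfies "+?".
const lazyAlone = firstMatch(/.+?/, text);          // "b"
// Lazy quantifier followed by a lookahead: it must expand until "<" comes next.
const lazyAnchored = firstMatch(/.+?(?=<)/, text);  // "bold"
// Greedy quantifier for comparison: it runs to the end of the line.
const greedy = firstMatch(/.+/, text);              // "bold</strong>"

console.log(lazyAlone, lazyAnchored, greedy);
```

The lookahead (?=<) is just one possible trailing pattern; any subpattern that has to match after the lazy part forces it to expand in the same way.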
