Regex to match “>” in HTML

I need a regex that matches the ">" character in an HTML string but does not match the closing bracket of a tag. Example:

<span id="bla"> bla bla a > b bla bla bla <a>bla </a> </span>

The regex should match only the ">" between a and b.

You can use a negative lookbehind: (?<!\\<[^>]+)\\> (untested).

This will match any > character that isn't preceded by the beginning of an HTML tag (a sequence starting with < and not containing >).

The following regular expression should work:

([^/]>)+
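
A quick check of this pattern against the example from the question may be useful before relying on it. The sketch below uses Python's built-in re module (an assumption; the thread does not name a language) and simply lists what the pattern matches. It appears to match the closing brackets of tags as well, since the only ">" it skips is one directly preceded by "/".

```python
import re

html = '<span id="bla"> bla bla a > b bla bla bla <a>bla </a> </span>'

# Any ">" whose preceding character is not "/" is matched, which includes
# the brackets that close <span ...>, <a>, </a> and </span>.
matches = [m.group(0) for m in re.finditer(r'([^/]>)+', html)]
print(matches)
```

Only one of those matches is the stray ">" between a and b, so the pattern would need further filtering to answer the question.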

What you need is a regex that finds "unpaired" greater-than signs; >s that are not preceded by a < as you'd find in a tag.

Try this: "(?<!\\<[^<>]+)\\>". It should match a greater-than sign that is not part of an HTML tag; that is, not part of a construct consisting of a less-than sign, some number of characters other than angle brackets, and then a greater-than sign.

EDIT: put in SLak's suggestions. I'll keep the < in the "not match" block just in case the less-than being matched is also not part of a tag, for instance << or <-. It shouldn't hurt the pattern's ability to match proper tags.
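
The lookbehind in these patterns has variable width, which not every regex engine accepts; Python's built-in re module, for one, rejects it. As a rough sketch of the same idea without lookbehind (a swapped-in technique, not the one from the answers above), an alternation can consume whole tag-like runs first and capture only a ">" that is left over. The names TAG_OR_STRAY_GT and stray_gt_positions are made up for illustration.

```python
import re

# First alternative consumes a whole tag-like run "<...>";
# second alternative captures a ">" that was not consumed that way.
TAG_OR_STRAY_GT = re.compile(r'<[^<>]+>|(>)')

def stray_gt_positions(html):
    """Return the indexes of ">" characters that do not close a tag-like run."""
    return [m.start(1) for m in TAG_OR_STRAY_GT.finditer(html)
            if m.group(1) is not None]

html = '<span id="bla"> bla bla a > b bla bla bla <a>bla </a> </span>'
positions = stray_gt_positions(html)
print(positions)
```

Like the lookbehind versions, this treats markup naively: a ">" inside a quoted attribute value or a comment would be misclassified.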

A specific solution rather than just an admonition:

" Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away. " - http://www.crummy.com/software/BeautifulSoup/

Don't use regex to parse HTML:

" Among programmers of any experience, it is generally regarded as A Bad Idea to attempt to parse HTML with regular expressions. " - Link

and " You can't parse [X]HTML with regex " - 4352 votes at the time of this posting

" Parsing HTML is a solved problem. You do not need to solve it. You just need to be lazy. Be lazy, use ... " something designed for that purpose.
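
To make the parser route concrete, here is a minimal sketch that uses Python's standard-library html.parser instead of Beautiful Soup (a substitution chosen so the example has no third-party dependency). It collects only the text between tags, so any ">" that survives is by construction not a tag bracket. The class name TextCollector and the helper text_of are hypothetical.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects the text found between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser with runs of text outside of tags.
        self.chunks.append(data)

def text_of(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)

html = '<span id="bla"> bla bla a > b bla bla bla <a>bla </a> </span>'
text = text_of(html)
print(text.count(">"))
```

Searching the collected text for ">" then needs nothing more than an ordinary string search.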
