regex splitting tags in the string

I have the following regex, (<.*?>.*?</.*?>|[\w[-]]+)\p{Punct}*, which works for most strings containing tags. However, if a tag is not preceded by a space, the match breaks the tag apart.

Please help me modify this regex so that it doesn't break tags. All I want is to split on spaces, except where the space is inside a tag.

For example:

BIRD-<abc attr="co_1">ab</span> @apos;<abc attr="co_12">cd</span>FEE DEF

should split into:

BIRD-<abc attr="co_1">ab</span>
@apos;<abc attr="co_12">cd</span>FEE
DEF

I am currently using a Matcher to match this pattern and get the tokens:

Matcher matcher = REGEX.matcher(newString);

while (matcher.find()) 
{
    token = matcher.group();
}

Try this:

.*?<.*?>.*?</.*?>[^\s]*

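With matcher.find() it matches the two tokens in your example that contain tags. Note that the second match includes the leading space, and a trailing plain word such as DEF is not matched at all, because the pattern requires every token to contain a tag.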
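
A variation that also picks up plain words such as DEF can be sketched in Java as below. This is only an illustration under some assumptions: tags come in simple open/close pairs without nesting, and attribute values never contain < or >. The class name TagTokenizer, the method tokenize and the pattern itself are made up for this sketch and do not come from the posts above. The idea is to treat a token as a run of either whole elements or single characters that are neither whitespace nor <.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagTokenizer {
    // A token is one or more of:
    //   <[^>]*>.*?</[^>]*>  an opening tag, its content, and the next closing tag
    //   [^\s<]              any single character that is not whitespace or '<'
    // Assumption: no nested elements and no '<' or '>' inside attribute values.
    private static final Pattern TOKEN =
            Pattern.compile("(?:<[^>]*>.*?</[^>]*>|[^\\s<])+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input =
                "BIRD-<abc attr=\"co_1\">ab</span> @apos;<abc attr=\"co_12\">cd</span>FEE DEF";
        for (String token : tokenize(input)) {
            System.out.println(token);
        }
    }
}
```

For the example string from the question this prints the three expected tokens, one per line. A stray < with no matching closing tag is skipped rather than included in a token, so input that breaks the assumptions above needs a different approach.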

I would be wary of doing this kind of parsing with a regex. The pattern you are suggesting, as well as various adaptations of it, may start behaving unexpectedly if attribute values contain the > and/or < characters. The following example would throw your pattern off:

<element attr="></>">value</element>

Any time you need to parse or process an XML file, I would advise you to consider using a proper XML parser. Please see this answer for a longer explanation.
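
As a rough illustration of the parser route, here is a minimal sketch using the JDK's built-in DOM parser (javax.xml.parsers). The class name XmlAttrDemo and the helper parseRoot are made up for this example. Two caveats: XML requires a literal < inside an attribute value to be escaped as &lt;, so the sketch uses an attribute containing only >, and the sample string in the question closes <abc ...> with </span>, which is not well-formed, so a strict parser would reject it as written.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class XmlAttrDemo {
    // Parses a small XML string and returns its root element.
    // Any parser failure is rethrown as an unchecked exception.
    static Element parseRoot(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        // The '>' inside the attribute value does not confuse the parser.
        Element root = parseRoot("<element attr=\"a > b\">value</element>");
        System.out.println(root.getTagName());         // element
        System.out.println(root.getAttribute("attr")); // a > b
        System.out.println(root.getTextContent());     // value
    }
}
```

The parser separates the attribute value from the element content on its own, which is the part a regex tends to get wrong.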
