
Regex splitting tags in the string

I have the following regex, (<.*?>.*?</.*?>|[\w[-]]+)\p{Punct}*, which works perfectly for most strings with tags, but if a tag is not preceded by a space, it breaks the tag while finding a match.

Please help me modify this regex so that it doesn't break tags. All I want is to split on spaces, but not when the space is inside a tag.

For example:

BIRD-<abc attr="co_1">ab</span> @apos;<abc attr="co_12">cd</span>FEE DEF

should split into:

BIRD-<abc attr="co_1">ab</span>
@apos;<abc attr="co_12">cd</span>FEE
DEF

I am currently using a Matcher to match this pattern and get the tokens:

Matcher matcher = REGEX.matcher(newString);

while (matcher.find()) 
{
    token = matcher.group();
}

Try this:

.*?<.*?>.*?</.*?>[^\s]*

It should produce the result you expect for the tokens that contain tags.
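One caveat when driving that pattern with matcher.find(): every match needs a `<`, so a trailing plain token such as DEF is not returned, and later matches can begin with the separating space. As an alternative, here is a minimal sketch that treats a token as a run of whole elements and non-space characters. The class name `TagAwareTokenizer`, the method `tokenize`, and the pattern itself are illustrative assumptions, not something given in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: the name and the pattern are assumptions for illustration.
final class TagAwareTokenizer {
    // A token is one or more of:
    //   - a whole element: opening tag, lazily matched content, closing tag
    //   - a single character that is neither whitespace nor '<'
    private static final Pattern TOKEN =
            Pattern.compile("(?:<[^>]*>.*?</[^>]*>|[^\\s<])+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sample =
                "BIRD-<abc attr=\"co_1\">ab</span> @apos;<abc attr=\"co_12\">cd</span>FEE DEF";
        for (String token : tokenize(sample)) {
            System.out.println(token);
        }
    }
}
```

For the sample string this yields the three tokens listed in the question. It still assumes that attribute values never contain `>` and that elements are not nested, so it is a sketch for simple inputs rather than a general parser.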

I would be wary of performing that type of parsing with regex. The pattern you are suggesting, as well as various adaptations of it, may start behaving weirdly if attributes contain the > and/or < characters. The following example, for instance, would throw your pattern off:

<element attr="></>">value</element>
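To make the failure concrete, here is a small sketch that runs the tag part of the question's regex, <.*?>.*?</.*?>, against that element. The class name `NaiveTagPatternDemo` and the method `firstMatch` are hypothetical names used only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: shows where the lazy tag pattern stops matching.
final class NaiveTagPatternDemo {
    // The tag alternative taken from the question's regex.
    private static final Pattern TAG = Pattern.compile("<.*?>.*?</.*?>");

    // Returns the first match of the tag pattern, or null if there is none.
    static String firstMatch(String input) {
        Matcher matcher = TAG.matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String input = "<element attr=\"></>\">value</element>";
        System.out.println(firstMatch(input));
    }
}
```

The match ends inside the attribute value, at `<element attr="></>`, because the lazy quantifiers stop at the first `>` and the first `</` they can find, regardless of quoting.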

Any time you need to parse or process an XML file, I would advise you to consider using a proper XML parser. Please see this answer for a longer explanation.
