Java regex: split at whitespace except whitespace inside XML tags

I have English sentences in which some words are wrapped in XML tags, for example:

<XXX>word1</XXX> word2 word3 <YYY>word4 word5 word6</YYY> word7 word8 word9 word10 <ZZZ>word11 word12</ZZZ>.

Only the three tags shown in the sentence can occur ( <XXX> , <YYY> , <ZZZ> ). The number of words inside any of those tags is unbounded.

I need to split the sentence at whitespace while ignoring whitespace inside those XML tags. The code looks like this:

String mySentence = "<XXX>word1</XXX> word2 word3 <YYY>word4 word5 word6</YYY> word7 word8 word9 word10 <ZZZ>word11 word12</ZZZ>.";
String[] mySentenceSplit = mySentence.split("someUnknownRegex");
for (int i = 0; i < mySentenceSplit.length; i++) {
    System.out.println(mySentenceSplit[i]);
}

For the example above, the output should be:

mySentenceSplit[0] = <XXX>word1</XXX>
mySentenceSplit[1] = word2 
mySentenceSplit[2] = word3 
mySentenceSplit[3] = <YYY>word4 word5 word6</YYY>
mySentenceSplit[4] = word7 
mySentenceSplit[5] = word8 
mySentenceSplit[6] = word9 
mySentenceSplit[7] = word10
mySentenceSplit[8] = <ZZZ>word11 word12</ZZZ>.

What do I have to put in place of "someUnknownRegex" to achieve this?

Instead of splitting, match the tokens using a capturing group and a backreference:

String sentence = "<XXX>word1</XXX> word2 word3 <YYY>word4 word5 word6</YYY> word7 word8 word9 word10 <ZZZ>word11 word12</ZZZ>.";
Pattern pattern = Pattern.compile("<(\\w+)[^>]*>.*?</\\1>\\.?|\\S+");
Matcher matcher = pattern.matcher(sentence);

while (matcher.find()) {
    System.out.println(matcher.group());
}

Output:

<XXX>word1</XXX>
word2
word3
<YYY>word4 word5 word6</YYY>
word7
word8
word9
word10
<ZZZ>word11 word12</ZZZ>.
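The answer above gives the pattern without explaining it. The following is a sketch of the same pattern written with Pattern.COMMENTS so that each piece can be annotated. The class name TokenMatcher and the method tokens are made up for illustration, and it assumes tags are never nested inside a tag of the same name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenMatcher {
    // Same pattern as in the answer; in COMMENTS mode whitespace in the
    // pattern is ignored and '#' starts a comment that runs to end of line.
    private static final Pattern TOKEN = Pattern.compile(
          "<(\\w+)[^>]*>   # opening tag, its name is captured as group 1\n"
        + ".*?             # tag content, as short as possible\n"
        + "</\\1>          # closing tag with the same name (backreference)\n"
        + "\\.?            # optional period directly after the closing tag\n"
        + "|               # otherwise\n"
        + "\\S+            # a run of non-whitespace characters\n",
        Pattern.COMMENTS);

    // Hypothetical helper: collects every match as one token.
    static List<String> tokens(String sentence) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(sentence);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String sentence = "<XXX>word1</XXX> word2 word3 <YYY>word4 word5 word6</YYY> "
                + "word7 word8 word9 word10 <ZZZ>word11 word12</ZZZ>.";
        for (String token : tokens(sentence)) {
            System.out.println(token);
        }
    }
}
```

Because the tagged alternative comes first, a tagged span is always consumed as a whole before \S+ gets a chance to break it at its inner spaces.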

Here is the split regex you want:

String[] split = str.split(" +(?=[^<]*(<[^/]|$))");
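As a runnable sketch of a lookahead-based split of this kind (the class name LookaheadSplit and the method split are made up for illustration): a run of spaces is a split point only if the next '<' after it starts an opening tag, or if no '<' follows at all. It assumes the text contains no nested tags and no stray '<' characters.

```java
import java.util.Arrays;
import java.util.List;

class LookaheadSplit {
    // Split on runs of spaces that are followed, after any non-'<' text,
    // by either an opening tag ('<' not followed by '/') or the end of input.
    // A space inside a tagged span reaches "</" first, so it is not a split point.
    static final String SPLIT_REGEX = " +(?=[^<]*(<[^/]|$))";

    // Hypothetical helper wrapping String.split with the regex above.
    static List<String> split(String sentence) {
        return Arrays.asList(sentence.split(SPLIT_REGEX));
    }

    public static void main(String[] args) {
        String str = "<XXX>word1</XXX> word2 word3 <YYY>word4 word5 word6</YYY> "
                + "word7 word8 word9 word10 <ZZZ>word11 word12</ZZZ>.";
        for (String part : split(str)) {
            System.out.println(part);
        }
    }
}
```

Note the two closing parentheses at the end of the pattern: one closes the inner group (<[^/]|$) and one closes the lookahead. With only one, Pattern compilation fails with a PatternSyntaxException.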

kiltek, I'm resurrecting this question because it has a simple regex solution that wasn't mentioned. (I found your question while doing some research for a regex bounty quest.)

With all the usual disclaimers about using regex to parse XML, here is a simple regex that does it:

<.*?</[^>]*>|( )

The left side of the alternation matches complete tagged spans, from an opening tag through its closing tag. We ignore these matches. The right side matches a space and captures it to Group 1, and we know these are the right spaces because they were not consumed by the expression on the left.

Here is working code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Program {
    public static void main(String[] args) {
        String subject = "<XXX>word1</XXX> word2 word3 <YYY>word4 word5 word6</YYY> word7 word8 word9 word10 <ZZZ>word11 word12</ZZZ>";
        // The left alternative consumes whole tagged spans; only spaces outside tags reach Group 1.
        Pattern regex = Pattern.compile("<.*?</[^>]*>|( )");
        Matcher m = regex.matcher(subject);
        StringBuffer b = new StringBuffer();
        while (m.find()) {
            if (m.group(1) != null) {
                // A space outside any tag: mark it as a split point.
                m.appendReplacement(b, "SplitHere");
            } else {
                // A tagged span: copy it back unchanged (quoted so '$' and '\' stay literal).
                m.appendReplacement(b, Matcher.quoteReplacement(m.group(0)));
            }
        }
        m.appendTail(b);
        String replaced = b.toString();
        String[] splits = replaced.split("SplitHere");
        for (String split : splits) System.out.println(split);
    } // end main
} // end Program
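The placeholder approach breaks if the input itself happens to contain the text "SplitHere". As a variation on the same alternation trick, and not part of the original answer, the sketch below cuts the subject at the offsets of the Group 1 matches instead. The class name OffsetSplit and the method split are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OffsetSplit {
    // Same regex as above: tagged spans on the left, a captured space on the right.
    private static final Pattern TAG_OR_SPACE = Pattern.compile("<.*?</[^>]*>|( )");

    // Hypothetical helper: cuts the subject at every space captured in Group 1.
    static List<String> split(String subject) {
        List<String> parts = new ArrayList<>();
        Matcher m = TAG_OR_SPACE.matcher(subject);
        int start = 0; // start offset of the piece currently being collected
        while (m.find()) {
            if (m.group(1) != null) {
                // A space outside any tag: close the current piece here.
                parts.add(subject.substring(start, m.start()));
                start = m.end();
            }
            // Otherwise a tagged span was matched; it stays inside the current piece.
        }
        parts.add(subject.substring(start)); // whatever follows the last split point
        return parts;
    }

    public static void main(String[] args) {
        String subject = "<XXX>word1</XXX> word2 word3 <YYY>word4 word5 word6</YYY> "
                + "word7 word8 word9 word10 <ZZZ>word11 word12</ZZZ>";
        for (String part : split(subject)) {
            System.out.println(part);
        }
    }
}
```

Unlike String.split, this version keeps empty pieces, so two consecutive spaces outside a tag produce an empty string in the result.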

Reference

  1. How to match pattern except in situations s1, s2, s3
  2. How to match a pattern unless...
