Splitting a string into the parts that match a regular expression and the parts that do not

Question

I currently have a program that finds all the regex matches in a string. For a different part of the program, however, I want both the parts that match the regex and the parts that don't.

So if I had <h1> hello world </h1>, I would want to be able to split it into [ <h1> , hello world , </h1> ].

Does anyone have any ideas on how to go about this?

Here is my code that scans the string to find the parts matching the regex:

ArrayList<String> foundTags = new ArrayList<String>();
Pattern p = Pattern.compile("<(.*?)>");
Matcher m = p.matcher(HTMLLine);
while(m.find()){
    foundTags.add(m.group(0));
}

Answer 1

For example:

String text = "testing<hi>bye</hi><b>bla bla!";
Pattern p = Pattern.compile("<(.*?)>");
Matcher m = p.matcher(text);
int last_match = 0; // index just past the previous match
List<String> splitted = new ArrayList<>();
while (m.find()) {
    // text between the previous match and this one (may be empty)
    splitted.add(text.substring(last_match, m.start()));
    // the match itself
    splitted.add(m.group());
    last_match = m.end();
}
// whatever is left after the last match
splitted.add(text.substring(last_match));
System.out.println(splitted.toString());

This prints [testing, <hi>, bye, </hi>, , <b>, bla bla!]. The empty element between </hi> and <b> appears because the two tags are adjacent.

Is that what you want? If you don't want the empty elements, you can easily change the code to omit them:

while (m.find()) {
    if(last_match != m.start())
        splitted.add(text.substring(last_match,m.start()));
    splitted.add(m.group());
    last_match = m.end();
}
if(last_match != text.length())
    splitted.add(text.substring(last_match));
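
The loop above can be wrapped into a reusable method. The following is only a minimal sketch: the class name RegexSplitter and the method name splitKeepingMatches are hypothetical and not part of the original answer, and it assumes the pattern never produces a zero-length match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexSplitter {
    // Hypothetical helper (not from the original answer): returns the matches
    // and the non-empty text between them, in their original order.
    static List<String> splitKeepingMatches(String text, Pattern pattern) {
        List<String> parts = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        int lastEnd = 0; // index just past the previous match
        while (m.find()) {
            if (m.start() > lastEnd) {
                // non-matching text before this match
                parts.add(text.substring(lastEnd, m.start()));
            }
            parts.add(m.group()); // the match itself
            lastEnd = m.end();
        }
        if (lastEnd < text.length()) {
            // non-matching text after the last match
            parts.add(text.substring(lastEnd));
        }
        return parts;
    }

    public static void main(String[] args) {
        Pattern tag = Pattern.compile("<(.*?)>");
        System.out.println(splitKeepingMatches("<h1> hello world </h1>", tag));
    }
}
```

Note that the text between the tags keeps its surrounding spaces. Call trim() on the non-matching parts if that is not wanted.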

Bear in mind, as pointed out in the comments, that using regex to parse arbitrary HTML/XML is in general a bad idea.

Answer 2

You can use regex capturing groups to retrieve the different parts of the match. For example:

ArrayList<String> list = new ArrayList<String>();
Pattern p = Pattern.compile("(<.*?>)(.*)(<.*?>)");
Matcher m = p.matcher("<h1> Hello World </h1>");
while(m.find()){
    list.add(m.group(1));
    list.add(m.group(2));
    list.add(m.group(3));
}

This would give you the list you wanted: ["<h1>", " Hello World ", "</h1>"]. Note that group number 0 is the full matched expression.
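
One thing worth checking before relying on this pattern is how it behaves on a line with more than one pair of tags. The sketch below wraps the same pattern in a method so it can be tried on different inputs. The class name GroupSplitDemo and the method name splitByGroups are hypothetical, introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupSplitDemo {
    // Same pattern as in the answer: opening tag, anything, closing tag.
    private static final Pattern P = Pattern.compile("(<.*?>)(.*)(<.*?>)");

    // Hypothetical helper: collects groups 1 to 3 of every match.
    static List<String> splitByGroups(String input) {
        List<String> parts = new ArrayList<>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            parts.add(m.group(1)); // first tag
            parts.add(m.group(2)); // everything up to the last tag (greedy)
            parts.add(m.group(3)); // last tag
        }
        return parts;
    }

    public static void main(String[] args) {
        // A single pair of tags splits as expected.
        System.out.println(splitByGroups("<h1> Hello World </h1>"));
        // With two pairs, the greedy middle group swallows the inner tags.
        System.out.println(splitByGroups("<b>a</b><i>b</i>"));
    }
}
```

Because the middle group (.*) is greedy, the second input yields "<b>", "a</b><i>b" and "</i>", so this approach fits lines with exactly one pair of tags. An input without any tags produces no match and therefore an empty list.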

Answer 1
0, accepted, 2013-03-26 02:26:54

Answer 2
0, 2013-03-26 02:35:43
