
Splitting up a string by what matches and does not match the regex

I currently have a program that can find all the regex matches in a string. For a different part, however, I want both the parts that match the regex and the parts that don't.

So if I had <h1> hello world </h1>, I would want to be able to split it up into [<h1>, hello world, </h1>].

Does anyone have any ideas on how they would go about this?

Here is my code that goes through the string to find the parts that match the regex:

ArrayList<String> foundTags = new ArrayList<String>();
Pattern p = Pattern.compile("<(.*?)>");
Matcher m = p.matcher(HTMLLine);
while(m.find()){
    foundTags.add(m.group(0));
}

For example:

String text = "testing<hi>bye</hi><b>bla bla!";
Pattern p = Pattern.compile("<(.*?)>");
Matcher m = p.matcher(text);
int last_match = 0;
List<String> splitted = new ArrayList<>();
while (m.find()) {
    // text between the previous match and this one (may be empty)
    splitted.add(text.substring(last_match, m.start()));
    // the match itself
    splitted.add(m.group());
    last_match = m.end();
}
// whatever is left after the last match
splitted.add(text.substring(last_match));
System.out.println(splitted.toString());

This prints [testing, <hi>, bye, </hi>, , <b>, bla bla!]

Is that what you want? You can easily change the code to omit empty elements if you don't want them:

while (m.find()) {
    if(last_match != m.start())
        splitted.add(text.substring(last_match,m.start()));
    splitted.add(m.group());
    last_match = m.end();
}
if(last_match != text.length())
    splitted.add(text.substring(last_match));
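
For reference, here is a minimal self-contained sketch of the same loop wrapped in a method, so it can be compiled and run as is. The class name SplitByRegex and the helper splitKeepingMatches are made up for this illustration; they are not part of the original snippets or of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitByRegex {

    // Hypothetical helper (name chosen for this sketch): returns the matches of
    // `pattern` and the text between them, in order, skipping empty pieces.
    static List<String> splitKeepingMatches(String text, Pattern pattern) {
        List<String> parts = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        int lastMatchEnd = 0;
        while (m.find()) {
            // non-matching text between the previous match and this one
            if (lastMatchEnd != m.start()) {
                parts.add(text.substring(lastMatchEnd, m.start()));
            }
            // the matching text itself
            parts.add(m.group());
            lastMatchEnd = m.end();
        }
        // trailing non-matching text after the last match
        if (lastMatchEnd != text.length()) {
            parts.add(text.substring(lastMatchEnd));
        }
        return parts;
    }

    public static void main(String[] args) {
        Pattern tag = Pattern.compile("<(.*?)>");
        System.out.println(splitKeepingMatches("testing<hi>bye</hi><b>bla bla!", tag));
        System.out.println(splitKeepingMatches("<h1> hello world </h1>", tag));
    }
}
```

With the input from the question, the second call should give the three pieces <h1>, " hello world " (with its surrounding spaces) and </h1>. If the whitespace is unwanted, calling trim() on the non-matching pieces is one option.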

Bear in mind, as pointed out in the comments: using regex to parse arbitrary HTML/XML is in general a bad idea.

You can use regex capturing groups to retrieve the different parts of the match. For example:

ArrayList<String> list = new ArrayList<String>();
Pattern p = Pattern.compile("(<.*?>)(.*)(<.*?>)");
Matcher m = p.matcher("<h1> Hello World </h1>");
while(m.find()){
    list.add(m.group(1));
    list.add(m.group(2));
    list.add(m.group(3));
}

This would give you the list you wanted: ["<h1>", " Hello World ", "</h1>"]. Note that group number 0 is the full matched expression.
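
To make the group numbering concrete, here is a small sketch that collects group 0 through groupCount() for the first match. The class GroupDemo and the helper groupsOfFirstMatch are hypothetical names used only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {

    // Hypothetical helper (name chosen for this sketch): returns groups
    // 0..groupCount() of the first match, or an empty array if nothing matches.
    static String[] groupsOfFirstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        // groupCount() does not include group 0, hence the + 1
        String[] groups = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            groups[i] = m.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] groups = groupsOfFirstMatch("(<.*?>)(.*)(<.*?>)", "<h1> Hello World </h1>");
        for (int i = 0; i < groups.length; i++) {
            System.out.println("group " + i + ": [" + groups[i] + "]");
        }
    }
}
```

Keep in mind that the middle group (.*) is greedy, so this pattern only suits a line with a single pair of tags. On an input such as "<a>x</a><b>y</b>" the middle group would capture "x</a><b>y" rather than just "x".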
