Splitting up a string by what matches and does not match the regex
I currently have a program that can find all the regex matches in a string. For a different part, however, I want both the parts that match the regex and the parts that don't.
So if I had <h1> hello world </h1>, I would want to be able to split it up into [<h1>, hello world, </h1>].
Does anyone have any ideas on how to go about this?
Here is my code that splits up the string to find the parts matching the regex:
ArrayList<String> foundTags = new ArrayList<String>();
Pattern p = Pattern.compile("<(.*?)>");
Matcher m = p.matcher(HTMLLine);
while (m.find()) {
    foundTags.add(m.group(0));
}
For example:
String text = "testing<hi>bye</hi><b>bla bla!";
Pattern p = Pattern.compile("<(.*?)>");
Matcher m = p.matcher(text);
int last_match = 0;
List<String> splitted = new ArrayList<>();
while (m.find()) {
    // text between the previous match and this one (may be empty)
    splitted.add(text.substring(last_match, m.start()));
    // the match itself
    splitted.add(m.group());
    last_match = m.end();
}
// whatever follows the last match
splitted.add(text.substring(last_match));
System.out.println(splitted.toString());
prints [testing, <hi>, bye, </hi>, , <b>, bla bla!]
Is that what you want? You can easily change the code to omit empty elements if you don't want them:
while (m.find()) {
    if (last_match != m.start()) {
        splitted.add(text.substring(last_match, m.start()));
    }
    splitted.add(m.group());
    last_match = m.end();
}
if (last_match != text.length()) {
    splitted.add(text.substring(last_match));
}
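The loop above can be wrapped into a reusable method. The following is only a minimal sketch under stated assumptions: the class name RegexSplitSketch and the method name splitKeepingMatches are hypothetical names chosen for illustration, and the pattern is the same <(.*?)> used above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not from the original post.
class RegexSplitSketch {

    // Same tag pattern as in the answer above.
    private static final Pattern TAG = Pattern.compile("<(.*?)>");

    // Returns matched and unmatched pieces in their original order,
    // skipping pieces that would be empty.
    static List<String> splitKeepingMatches(String text) {
        List<String> parts = new ArrayList<>();
        Matcher m = TAG.matcher(text);
        int lastMatch = 0;
        while (m.find()) {
            if (lastMatch != m.start()) {
                parts.add(text.substring(lastMatch, m.start()));
            }
            parts.add(m.group());
            lastMatch = m.end();
        }
        if (lastMatch != text.length()) {
            parts.add(text.substring(lastMatch));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitKeepingMatches("testing<hi>bye</hi><b>bla bla!"));
        // prints [testing, <hi>, bye, </hi>, <b>, bla bla!]
    }
}
```

Unmatched pieces keep their surrounding whitespace, so for the input from the question the middle element is " hello world " with both spaces; call trim() on the pieces if that is not wanted.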
Bear in mind, as pointed out in the comments: using regex to parse arbitrary HTML/XML is in general a bad idea.
You can use regex capturing groups to retrieve the different parts of the match. For example:
ArrayList<String> list = new ArrayList<String>();
Pattern p = Pattern.compile("(<.*?>)(.*)(<.*?>)");
Matcher m = p.matcher("<h1> Hello World </h1>");
while (m.find()) {
    list.add(m.group(1)); // opening tag
    list.add(m.group(2)); // text between the tags
    list.add(m.group(3)); // closing tag
}
This would give you the list you wanted: ["<h1>", " Hello World ", "</h1>"]. Note that group number 0 is the full matched expression.
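To make the group numbering concrete, here is a small sketch that collects group 0 together with groups 1 to 3 for the first match. The class name GroupSketch and the method name groups are hypothetical. It also shows a limitation of this pattern: because the middle group (.*) is greedy, an input with more than one element puts everything up to the last tag into group 2.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class GroupSketch {

    // Same three-group pattern as in the answer above.
    private static final Pattern P = Pattern.compile("(<.*?>)(.*)(<.*?>)");

    // Returns group 0 (the whole match) followed by groups 1..3 of the
    // first match, or an empty array when nothing matches.
    static String[] groups(String input) {
        Matcher m = P.matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(groups("<h1> Hello World </h1>")));
        // prints [<h1> Hello World </h1>, <h1>,  Hello World , </h1>]
        System.out.println(Arrays.toString(groups("<a>x</a><b>y</b>")));
        // prints [<a>x</a><b>y</b>, <a>, x</a><b>y, </b>]
    }
}
```

So the grouping approach fits a line that holds exactly one element; for lines with several tags, the find loop with substring shown earlier is likely the safer choice.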