Complex regular expression in Java
I have a problem (one that seems rather complex to me) that I'm using regular expressions in Java for.
I can receive any text string, and it must be of the format:
M:<some text>:D:<either a url or string>:C:<some more text>:Q:<a number>
I started with this regular expression for extracting the text between the M:, :D:, :C: and :Q: markers:
String pattern2 = "(M:|:D:|:C:|:Q:.*?)([a-zA-Z_\\.0-9]+)";
That works fine if the <either a url or string> part is just an alphanumeric string, but it all falls apart when the embedded string is a URL of the form:
tcp://someurl.something:port
Can anyone help me adjust the regular expression above so that the text extracted after :D: can be either a URL or an alphanumeric string?
Here's an example:
public static void main(String[] args) {
String name = "M:myString1:D:tcp://someurl.com:8989:C:myString2:Q:1";
boolean matchFound = false;
ArrayList<String> values = new ArrayList<>();
String pattern2 = "(M:|:D:|:C:|:Q:.*?)([a-zA-Z_\\.0-9]+)";
Matcher m3 = Pattern.compile(pattern2).matcher(name);
while (m3.find()) {
matchFound = true;
String m = m3.group(2);
System.out.println("regex found match: " + m);
values.add(m);
}
}
For the example above, the results I want are:
myString1
tcp://someurl.com:8989
myString2
1
Note that the strings can be of variable length and are alphanumeric, but some extra characters must be allowed (such as the :// of the URL format and the . and - characters).
You mention that the format is constant:
M:<some text>:D:<either a url or string>:C:<some more text>:Q:<a number>
Capture groups can do this for you with the pattern:
"M:(.*):D:(.*):C:(.*):Q:(.*)"
Or you can call String.split() with a pattern of "M:|:D:|:C:|:Q:". However, the split returns an empty element at the first index, because the input starts with the M: delimiter. The remaining elements follow in order.
public static void main(String[] args) throws Exception {
System.out.println("Regex: ");
String data = "M:<some text>:D:tcp://someurl.something:port:C:<some more text>:Q:<a number>";
Matcher matcher = Pattern.compile("M:(.*):D:(.*):C:(.*):Q:(.*)").matcher(data);
if (matcher.matches()) {
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println(matcher.group(i));
}
}
System.out.println();
System.out.println("String.split(): ");
String[] pieces = data.split("M:|:D:|:C:|:Q:");
for (String piece : pieces) {
System.out.println(piece);
}
}
Results:
Regex:
<some text>
tcp://someurl.something:port
<some more text>
<a number>
String.split():
<some text>
tcp://someurl.something:port
<some more text>
<a number>
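Because the split produces that empty leading element, a small helper can drop it before the values are used. The following is only a sketch: the class name SplitFields and the method name fields are made up for illustration, and it assumes the markers never appear inside the field values themselves (for example, a text field containing "M:" would be split again).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the original answer.
class SplitFields {
    // Splits on the four markers and drops the empty element that precedes "M:".
    static List<String> fields(String data) {
        String[] pieces = data.split("M:|:D:|:C:|:Q:");
        if (pieces.length == 0) {
            // Input consisted only of delimiters; nothing to return.
            return Collections.emptyList();
        }
        // pieces[0] is "" when the input starts with the "M:" delimiter.
        return Arrays.asList(pieces).subList(1, pieces.length);
    }

    public static void main(String[] args) {
        String name = "M:myString1:D:tcp://someurl.com:8989:C:myString2:Q:1";
        for (String field : fields(name)) {
            System.out.println(field);
        }
    }
}
```

The URL survives intact here because none of the four delimiters occurs inside "tcp://someurl.com:8989"; the match is case-sensitive, so the "m:" in "com:" is not treated as "M:".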
To extract the URL/text part you don't need a regular expression. Use:
int startPos = input.indexOf(":D:")+":D:".length();
int endPos = input.indexOf(":C:", startPos);
String urlOrText = input.substring(startPos, endPos);
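The same indexOf approach can be wrapped in a method that also copes with input where a marker is missing, since indexOf returns -1 in that case and substring would otherwise throw or return the wrong span. This is a sketch under assumptions: the class name UrlOrTextExtractor and the choice to return null on malformed input are illustrative, not something the answer prescribes.

```java
// Hypothetical wrapper around the indexOf/substring approach.
class UrlOrTextExtractor {
    private static final String START = ":D:";
    private static final String END = ":C:";

    // Returns the text between ":D:" and the next ":C:", or null if either marker is missing.
    static String extract(String input) {
        int marker = input.indexOf(START);
        if (marker < 0) {
            return null; // no ":D:" marker
        }
        int startPos = marker + START.length();
        int endPos = input.indexOf(END, startPos);
        if (endPos < 0) {
            return null; // no ":C:" marker after ":D:"
        }
        return input.substring(startPos, endPos);
    }

    public static void main(String[] args) {
        System.out.println(extract("M:myString1:D:tcp://someurl.com:8989:C:myString2:Q:1"));
    }
}
```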
Assuming you need to do some validation along with the parsing, break the regex into separate parts like this:
String m_regex = "[\\w.]+"; // in Java, a . inside [] is just a literal dot
String url_regex = "."; // placeholder: there are plenty of URL regexes online, pick your favorite
String d_regex = "(?:" + url_regex + "|\\p{Alnum}+)"; // a URL or a sequence of alphanumeric characters
String c_regex = "[\\w.]+"; // you may want this to be more restrictive; not sure
String q_regex = "\\d+"; // what sort of number exactly? assuming any string of digits here
String regex = "M:(?<M>" + m_regex + "):"
+ "D:(?<D>" + d_regex + "):"
+ "C:(?<C>" + c_regex + "):"
+ "Q:(?<Q>" + q_regex + ")";
Pattern p = Pattern.compile(regex);
It might be a good idea to keep the compiled pattern in a static field and build it in a static initializer block, so that the temporary regex strings don't clutter the class with essentially useless fields.
Then you can retrieve each part by its name:
Matcher m = p.matcher( input );
if (m.matches()) {
String m_part = m.group( "M" );
...
String q_part = m.group( "Q" );
}
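Put together, the named-group approach might look like the sketch below. Several things here are assumptions rather than part of the answer: the class name NamedGroupParser, the parse method and its Map return type, and above all the URL sub-pattern, which is a deliberately simplified scheme://host[:port] form chosen for illustration and not a full URL validator. Each group name must be unique within the pattern, and named groups need Java 7 or later.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical end-to-end example of the named-group approach.
class NamedGroupParser {
    private static final Pattern PATTERN;

    static {
        // Temporary strings live only inside the static initializer.
        String text = "[\\w.]+";
        // Assumption: simplified URL form scheme://host[:port], not a general URL regex.
        String url = "\\w+://[\\w.-]+(?::\\d+)?";
        String urlOrText = "(?:" + url + "|\\p{Alnum}+)";
        String number = "\\d+";
        PATTERN = Pattern.compile(
                "M:(?<M>" + text + "):"
                + "D:(?<D>" + urlOrText + "):"
                + "C:(?<C>" + text + "):"
                + "Q:(?<Q>" + number + ")");
    }

    // Returns the four parts keyed by group name, or null if the input is not valid.
    static Map<String, String> parse(String input) {
        Matcher m = PATTERN.matcher(input);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> parts = new LinkedHashMap<>();
        for (String name : new String[] {"M", "D", "C", "Q"}) {
            parts.put(name, m.group(name));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(parse("M:myString1:D:tcp://someurl.com:8989:C:myString2:Q:1"));
    }
}
```

Because matches() requires the whole input to fit the pattern, a string with a non-numeric Q part or an unexpected character in one of the text fields is rejected rather than partially parsed.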
You can go a step further by creating a RegexGroup interface whose implementing objects each represent one part of the regex, with a name and the actual sub-pattern. You definitely lose simplicity that way, though, which makes the code harder to understand at a quick glance. (I wouldn't do this; I'm just pointing out that it's possible and has its own benefits.)