
Java Regular Expressions using Pattern and Matcher

My question is related to regular expressions in Java, and in particular to multiple matches for a given search pattern. All of the information I need is on one line, which contains aliases (e.g. SA), each mapped to an IP address. The entries are separated by commas, and I need to extract each one.

SA "239.255.252.1", SB "239.255.252.2", SC "239.255.252.3", SD "239.255.252.4"

My regex looks like this:

Pattern alias = Pattern.compile("(\\S+)\\s+\"(\\d+\\.\\d+\\.\\d+\\.\\d+)\"");
Matcher match = alias.matcher(lineInFile);
while (match.find()) {
    // do something
}

This works, but I'm not totally happy with it: since introducing this small piece of code, my program has slowed down a bit (< 1 sec), which is enough to notice a difference.

So my question is: am I going about this in the correct manner? Is there a more efficient or more lightweight solution that does not need a while (match.find()) loop and/or the Pattern/Matcher classes?

If the line cannot contain anything other than alias definitions, then using .matches() instead of .find() might speed up the search on non-matching lines.

I'm afraid your code looks pretty efficient already. Here's my version:

Matcher match = Pattern
                .compile("(\\w+)\\s+\"(\\d+\\.\\d+\\.\\d+\\.\\d+)\"")
                .matcher(lineInFile);  
while(match.find()) {  
    //do something  
}

There are two micro-optimizations:

  1. The pattern is inlined rather than kept in an extra variable.
  2. For the alias, the pattern searches for word characters (\w) rather than non-space characters (\S).

Actually, if you do a lot of processing like this and the pattern never changes, you should keep the compiled pattern in a constant:

private static final Pattern PATTERN = Pattern
            .compile("(\\w+)\\s+\"(\\d+\\.\\d+\\.\\d+\\.\\d+)\"");

Matcher match = PATTERN.matcher(lineInFile);  
while(match.find()) {  
    //do something  
}
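
To make the "do something" part concrete, here is a minimal sketch of how the constant pattern could be used to collect the aliases into a map. The class name AliasParser and the method parse are hypothetical names chosen for illustration; only the pattern itself comes from the answer above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AliasParser {
    // Compiled once and shared; Pattern instances are immutable and thread-safe.
    private static final Pattern PATTERN =
            Pattern.compile("(\\w+)\\s+\"(\\d+\\.\\d+\\.\\d+\\.\\d+)\"");

    // Hypothetical helper: maps alias -> IP in the order the entries appear on the line.
    static Map<String, String> parse(String lineInFile) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher match = PATTERN.matcher(lineInFile);
        while (match.find()) {
            // group(1) is the alias, group(2) is the quoted IP address without the quotes
            result.put(match.group(1), match.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "SA \"239.255.252.1\", SB \"239.255.252.2\"";
        System.out.println(parse(line)); // prints {SA=239.255.252.1, SB=239.255.252.2}
    }
}
```

A LinkedHashMap is used here only so that the entries keep their order from the line; a plain HashMap would work equally well if order does not matter.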

Update: I took some time on RegExr to come up with a much more specific pattern, which, as a bonus, should only detect valid IP addresses. I know it's ugly as hell, but my guess is that it's pretty efficient, as it eliminates most of the backtracking:

([A-Z]+)\s*\"((?:1[0-9]{2}|2(?:(?:5[0-5]|[0-9]{2})|[0-9]{1,2})\.)
{3}(?:1[0-9]{2}|2(?:5[0-5]|[0-9]{2})|[0-9]{1,2}))

(Wrapped for readability. All backslashes need to be escaped in Java, but you can test the pattern on RegExr as it is with the OP's test string.)
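
As written, the pattern above appears to attach the \. only to the alternative that starts with 2, and 2(?:5[0-5]|[0-9]{2}) also accepts values such as 299. It matches the OP's test string, but it is probably not a general IPv4 validator. The following is a sketch of one common way to restrict each octet to 0-255. It is not the answer's original pattern; the names StrictAliasPattern, OCTET and firstIp are hypothetical, and rejecting leading zeros is an assumption about the desired format.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StrictAliasPattern {
    // One octet in the range 0-255, without leading zeros (an assumption, see above).
    private static final String OCTET = "(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])";

    // Upper-case alias, whitespace, then a quoted dotted quad of four octets.
    static final Pattern PATTERN = Pattern.compile(
            "([A-Z]+)\\s+\"(" + OCTET + "(?:\\." + OCTET + "){3})\"");

    // Hypothetical helper: IP of the first alias definition on the line, or null if none matches.
    static String firstIp(String line) {
        Matcher m = PATTERN.matcher(line);
        return m.find() ? m.group(2) : null;
    }
}
```

Whether the stricter pattern is faster or slower than the simple \d+ version would need to be measured; its main benefit is rejecting out-of-range octets.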

You can improve your regex to "(\\S{2})\\s+\"((\\d{1,3}\\.){3}\\d{1,3})\"" by specifying the IP address more explicitly.

Try out the performance of a StringTokenizer. It does not use regular expressions. (If you are concerned about using a legacy class, take a look at its source and see how it is done.)

StringTokenizer st = new StringTokenizer(lineInFile, " ,\"");
while(st.hasMoreTokens()){
    String key = st.nextToken();
    String ip = st.nextToken();
    System.out.println(key + " ip: " +  ip);
}

I don't know if this will yield a big performance benefit, but you could also first do

string.split(", ") // separate groups

and then

string.split(" ?\"") // separate alias from IP address

on each of the resulting groups.
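
Put together, the two split calls might look like the sketch below. The class name SplitAliasParser and the method parse are hypothetical, and the sketch assumes entries are separated by exactly a comma followed by one space, as in the OP's sample line.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SplitAliasParser {
    // Hypothetical helper: maps alias -> IP using String.split only.
    static Map<String, String> parse(String line) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String group : line.split(", ")) {       // e.g. SA "239.255.252.1"
            String[] parts = group.split(" ?\"");     // alias, then IP; the trailing empty string is dropped
            if (parts.length >= 2) {                  // skip anything that is not an alias/IP pair
                result.put(parts[0], parts[1]);
            }
        }
        return result;
    }
}
```

Note that String.split takes a regular expression as its argument, so this variant is not regex-free; it is worth measuring against the Pattern/Matcher loop before adopting it.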

Precompiling and reusing the Pattern object is (IMO) likely to be the most effective optimization. Pattern compilation is potentially an expensive step.

Reusing the Matcher instance (e.g. using reset(CharSequence)) might help, but I doubt that it will make much difference.
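
For reference, reusing a single Matcher across many lines could be sketched as follows. The class name ReusedMatcher and the method aliases are hypothetical; the pattern is the one from the earlier answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReusedMatcher {
    private static final Pattern PATTERN =
            Pattern.compile("(\\w+)\\s+\"(\\d+\\.\\d+\\.\\d+\\.\\d+)\"");

    // Hypothetical helper: collects every alias from several lines with one Matcher.
    static List<String> aliases(List<String> lines) {
        List<String> found = new ArrayList<>();
        Matcher match = PATTERN.matcher("");   // created once, with placeholder input
        for (String line : lines) {
            match.reset(line);                 // point the same Matcher at the next line
            while (match.find()) {
                found.add(match.group(1));
            }
        }
        return found;
    }
}
```

Unlike Pattern, a Matcher is not safe for concurrent use, so a reused instance should stay local to one thread.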

The regex itself cannot be optimized significantly. One possible speedup would be to replace (\\d+\\.\\d+\\.\\d+\\.\\d+) with ([0-9\\.]+). This might help because it reduces the number of potential backtrack points, but you'd need to do some experiments to be sure. The obvious downside is that it also matches character sequences that are not valid IP addresses.

If you're noticing a difference of < 1 sec on that piece of code, then your input string must contain around a million (or at least some 100k) entries. I think that's pretty fair performance, and I cannot see how you could significantly optimize it without writing your own specialized parser.
