Optimizing a regular expression

Question

I'm using the following code to discard unsupported physical interfaces and subinterfaces from routers that connect to a big ISP network (by big I mean tens of thousands of routers):

private final static Pattern INTERFACES_TO_FILTER = 
   Pattern.compile("unrouted VLAN|GigabitEthernet.+-mpls layer|FastEthernet.+-802\\.1Q vLAN subif"); 

// Simplification
List<String> interfaces;
// lots of irrelevant code to query the routers 

for (String intf : interfaces) {
   if (INTERFACES_TO_FILTER.matcher(intf).find()) {
      // code to prevent the interface from being used
   } 
}

The idea is to discard entries such as:

unrouted VLAN 2000 for GigabitEthernet2/11.2000
GigabitEthernet1/2-mpls layer
FastEthernet6/0/3.2000-802.1Q vLAN subif

This code is hit often (several times per minute) over huge sets of interfaces (some routers have 50k+ subinterfaces). Caching doesn't help much either, because new subinterfaces are configured and discarded very often. The plan is to optimize the regex so that the procedure completes a tad faster (every nanosecond counts). Can you guys enlighten me?

Note: mpls layer and 802.1Q are supported for other kinds of interfaces; unrouted VLANs aren't.

Answer 1

Some string search algorithms let you search a string of length n for k strings at once, more cheaply than the obvious O(n*k) cost.

They usually compare a rolling hash against a list of precomputed hashes of your words. A prime example is the Rabin-Karp algorithm. The wiki page even has a section about this. There are more advanced versions of the principle out there as well, but the basic idea is easy to understand.

No idea whether there are already Java libraries that do this (I'd think so), but that's what I'd try, although 5 strings is rather few here (and the different lengths make it more complex too). So better check whether a good KMP string search isn't faster. I'd think that would be by far the best solution really (the default Java API uses a naive string search, so use a lib).
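For reference, a minimal sketch of the KMP (Knuth-Morris-Pratt) search mentioned above. The class and method names are made up for illustration, and whether this beats String.indexOf on strings this short is something only a benchmark on real data can tell.

```java
class KmpSketch {
    // fail[i] = length of the longest proper prefix of pattern[0..i]
    // that is also a suffix of pattern[0..i].
    static int[] failureTable(String pattern) {
        int[] fail = new int[pattern.length()];
        int k = 0;
        for (int i = 1; i < pattern.length(); i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) {
                k = fail[k - 1];
            }
            if (pattern.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            fail[i] = k;
        }
        return fail;
    }

    // Returns the index of the first occurrence of pattern in text, or -1.
    // The failure table is passed in so it can be built once per pattern
    // and reused across all interface names.
    static int indexOf(String text, String pattern, int[] fail) {
        if (pattern.isEmpty()) {
            return 0;
        }
        int k = 0; // number of pattern characters matched so far
        for (int i = 0; i < text.length(); i++) {
            while (k > 0 && text.charAt(i) != pattern.charAt(k)) {
                k = fail[k - 1];
            }
            if (text.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            if (k == pattern.length()) {
                return i - k + 1;
            }
        }
        return -1;
    }
}
```

The table depends only on the pattern, so for a fixed set of search strings it would be computed once (for example in a static initializer) and shared by every call.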

About your regexes: a backtracking regex implementation for performance-critical search code? I doubt that's a good idea.

PS: If you posted a test set and a test harness for your problem, chances are good that people would try to see by how much they could beat the current favorite. It has worked before... human nature is so easy to trick :)

Answer 2

I'm answering my own question for future reference, although the credit goes to @piotrekkr, since he was the one who pointed the way. Kudos also to @JB and @ratchet. I ended up using matches(), and the logic using indexOf and several contains calls was almost as fast (that's news to me; I always assumed that a single regex would be faster than several calls to contains).

Here's a solution that is several times faster (according to the profiler, about 7 times less time is spent in Matcher class methods):

^(?:unrouted VLAN.++|GigabitEthernet.+?-mpls layer|FastEthernet.+?-802\\.1Q vLAN subif)$
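A minimal sketch of how the anchored pattern could be used with matches(), next to one possible plain-string variant. The answer does not show its indexOf/contains logic, so filteredByStrings below is a guess at an equivalent using startsWith and endsWith; the class and method names are assumptions for illustration.

```java
import java.util.regex.Pattern;

class InterfaceFilterSketch {
    // Anchored pattern from the answer, compiled once and used with matches().
    static final Pattern ANCHORED = Pattern.compile(
        "^(?:unrouted VLAN.++|GigabitEthernet.+?-mpls layer|FastEthernet.+?-802\\.1Q vLAN subif)$");

    static boolean filteredByRegex(String intf) {
        return ANCHORED.matcher(intf).matches();
    }

    // Hypothetical plain-string variant (not the author's code).
    // It differs from the regex in edge cases such as an empty middle part,
    // e.g. "GigabitEthernet-mpls layer" passes here but not in the regex.
    static boolean filteredByStrings(String intf) {
        return intf.startsWith("unrouted VLAN")
            || (intf.startsWith("GigabitEthernet") && intf.endsWith("-mpls layer"))
            || (intf.startsWith("FastEthernet") && intf.endsWith("-802.1Q vLAN subif"));
    }

    public static void main(String[] args) {
        String[] samples = {
            "unrouted VLAN 2000 for GigabitEthernet2/11.2000",
            "GigabitEthernet1/2-mpls layer",
            "FastEthernet6/0/3.2000-802.1Q vLAN subif",
            "GigabitEthernet1/2"
        };
        for (String s : samples) {
            System.out.println(s + " -> regex=" + filteredByRegex(s)
                + ", strings=" + filteredByStrings(s));
        }
    }
}
```

Both variants rely on the unsupported entries starting with a known prefix, which holds for the sample entries in the question but should be checked against real router output.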

Answer 3

If your problem is that you are searching for a number of long string constants, I would recommend using a Java analog of the standard C tool "lex".

A quick googling took me to JFlex. I haven't used this particular tool and there may be others available, but it is an example of the kind of tool I would look for.

Answer 4

If you must use regex for this, try changing to this one:

^(?:unrouted VLAN)|(?:GigabitEthernet.+?-mpls layer)|(?:FastEthernet.+?-802\.1Q vLAN subif)

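^ makes the engine match from the beginning of the string, not anywhere in the string. Note that, as written, ^ belongs only to the first alternative, so the other two can still match anywhere; wrapping the whole alternation in a single group, ^(?:...|...|...), anchors all three.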

.+? makes + ungreedy (lazy)

(?:...) makes () a non-capturing group
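A small sketch of how ^ interacts with top-level alternation in java.util.regex: in the pattern as posted, the anchor is part of the first alternative only. The test string and the names below are made up for illustration.

```java
import java.util.regex.Pattern;

class AnchorPrecedenceDemo {
    // Pattern as posted: ^ applies to "unrouted VLAN" only.
    static final Pattern AS_POSTED = Pattern.compile(
        "^(?:unrouted VLAN)|(?:GigabitEthernet.+?-mpls layer)|(?:FastEthernet.+?-802\\.1Q vLAN subif)");

    // Variant with the whole alternation inside one group, so ^ applies to all three.
    static final Pattern GROUPED = Pattern.compile(
        "^(?:unrouted VLAN|GigabitEthernet.+?-mpls layer|FastEthernet.+?-802\\.1Q vLAN subif)");

    public static void main(String[] args) {
        // Hypothetical input where the interface name is not at the start.
        String s = "uplink to GigabitEthernet1/2-mpls layer";
        System.out.println(AS_POSTED.matcher(s).find()); // true: second alternative is unanchored
        System.out.println(GROUPED.matcher(s).find());   // false: nothing matches at position 0
    }
}
```

With the grouped form, a failed match is rejected after inspecting only the start of the string, which is where the speed-up from anchoring comes from.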

4 answers

Answer 1
2 2011-12-29 18:54:30

Answer 2
2 accepted 2011-12-29 21:14:30

Answer 3
1 2011-12-29 19:13:44

Answer 4
1 2011-12-29 19:22:25
