
Optimization of regular expression for custom key-value pairs

I am trying to extract some key-value pairs plus their preceding text from a large file, but the regular expression I am using runs very slowly and needs optimizing.

The input consists of fairly short strings with 1 or 2 key-value pairs, like

one two three/1234==five/5678 some other text

or

one two three/1234==five/5678 some other text four/910==five/1112 more text

The (apparently suboptimal) regular expression used is

(.*?)\\s*([^ /]+)\\s*/\\s*([\\d]+)\\s*==\\s*([^ /]+)\\s*/\\s*([\\d]+)\\s*

(Spaces may appear in numerous places within the string, hence the repeated \\s* elements.)

Sample code to test the above:

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class RegexTest {
    public static void main(String[] args) {
      // Short form: one key-value pair. Overwritten below by the long form.
      String text = "one two three/1234==five/5678 some other text";
      // Long form: two key-value pairs.
      text = "one two three/1234==five/5678 some other text four/910==five/1112 more text";
      String regex = "(.*?)\\s*([^ /]+)\\s*/\\s*([\\d]+)\\s*==\\s*([^ /]+)\\s*/\\s*([\\d]+)\\s*";
      Matcher matcher = Pattern.compile(regex).matcher(text);
      int end = 0; // end offset of the most recent match
      System.out.println("--------------------------------------------------");
      while (matcher.find()) {
        System.out.println("\"" + matcher.group(1) + "\"");               // preceding text
        System.out.println(matcher.group(2) + " == " + matcher.group(3)); // first pair
        System.out.println(matcher.group(4) + " == " + matcher.group(5)); // second pair
        end = matcher.end();
        System.out.println("--------------------------------------------------");
      }
      // Whatever follows the last match is the trailing text.
      System.out.println(text.substring(end).trim());
    }
  }

The output is the key-value pairs plus the preceding text (all extracted fields are required). For example, for the longer string, the output is:

--------------------------------------------------
"one two"
three == 1234
five == 5678
--------------------------------------------------
"some other text"
four == 910
five == 1112
--------------------------------------------------
more text

In other words, the matcher.find() loop finds 1 or 2 matches, depending on whether the string has the short or the long form (1 or 2 key-value pairs, respectively).

The problem is that extraction is slow and, depending on the variation of the input string, the find() method sometimes takes a long time to complete.

Is there a better form for the regular expression that would significantly speed up processing?

It's never a good idea to put (.*?) at the beginning of a regex.

First, it can be slow. Although in theory non-greedy matches can be handled efficiently (see, for example, Russ Cox's re2 implementation), many regex implementations do not handle non-greedy matches very well, especially in the case where the find operation is going to fail. I don't know whether the Java regex implementation falls into this category or not, but there's no reason to tempt fate.

Second, it's pointless. The semantics of regex searching is that the first possible match will be found, which is identical to the semantics of .*?. To get the capture (.*?), you only need the substring from the end of the previous match (or the beginning of the string) to the beginning of the current match. That's trivial, especially since you're already tracking the end of the previous match.
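As a rough sketch of that idea (not tested against the real file): the leading (.*?) and the outer \\s* padding are dropped, and the preceding text is rebuilt from the match offsets. The class name PairExtractor and the helper extract() are made up for illustration, and the capture groups are renumbered because group 1 is gone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption, not part of the question.
class PairExtractor {
    // The question's regex without the leading (.*?) and without the outer \s*.
    private static final Pattern PAIRS = Pattern.compile(
        "([^ /]+)\\s*/\\s*(\\d+)\\s*==\\s*([^ /]+)\\s*/\\s*(\\d+)");

    // Returns, for each match, the preceding text and the two pairs,
    // followed by the text after the last match.
    static List<String> extract(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = PAIRS.matcher(text);
        int end = 0; // end offset of the previous match (0 before the first one)
        while (m.find()) {
            // Text between the previous match and this one replaces group (.*?).
            out.add(text.substring(end, m.start()).trim());
            out.add(m.group(1) + " == " + m.group(2));
            out.add(m.group(3) + " == " + m.group(4));
            end = m.end();
        }
        out.add(text.substring(end).trim());
        return out;
    }

    public static void main(String[] args) {
        String text = "one two three/1234==five/5678 some other text four/910==five/1112 more text";
        for (String field : extract(text)) {
            System.out.println(field);
        }
    }
}
```

The trim() calls take over the job of the \\s* that used to surround the match, so the extracted fields should come out the same as in the question for the two sample strings.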

How are you reading the file? If you read the file line-by-line with BufferedReader#readLine() or Scanner#nextLine(), all you need to do is add \\G (that is, \G written as a Java string literal) to the beginning of your regex. It acts like \\A the first time you apply the regex, anchoring the match to the beginning of the string. If that match succeeds, the next find() will be anchored to the position where the previous match ended. If it doesn't find a match starting right there, it gives up and doesn't look for any more matches in that string.
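A minimal sketch of the \\G variant, assuming each line is matched on its own. The class name AnchoredFind and the helper prefixes() are illustrative only; the regex is the one from the question with \\G prepended (and [\\d] shortened to \\d).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are assumptions made for this sketch.
class AnchoredFind {
    // \G pins every match attempt to the start of the input or to the end of the previous match.
    private static final Pattern ANCHORED = Pattern.compile(
        "\\G(.*?)\\s*([^ /]+)\\s*/\\s*(\\d+)\\s*==\\s*([^ /]+)\\s*/\\s*(\\d+)\\s*");

    // Collects the "preceding text" capture (group 1) of every match on one line.
    static List<String> prefixes(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = ANCHORED.matcher(line);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(prefixes(
            "one two three/1234==five/5678 some other text four/910==five/1112 more text"));
        // A line without any pair: find() fails without retrying at later start positions.
        System.out.println(prefixes("no pairs on this line"));
    }
}
```

In a real program the lines would come from BufferedReader#readLine() or Scanner#nextLine(); fixed strings are used here only to keep the sketch self-contained.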

EDIT: I'm assuming each of the sequences you want to match, whether it's one key/value pair or two, is on its own line. If you read the file one line at a time, you can run the code in your question on each line.

As for why your regex is so slow, it's because the regex engine has to make multiple match attempts (possibly hundreds of them) on every non-matching line before it gives up. It isn't smart enough to realize that if the first attempt on a given line fails, no further attempts on that line will do any good. So it bumps forward one position and tries again. And it keeps doing that for the whole line.

If you were only expecting one match per line, I would say to use a start-of-line anchor (^ in MULTILINE mode).
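For the one-match-per-line case, a sketch along these lines might work when the whole text is held in a single string. Note one deliberate change from the question's regex, which is my assumption rather than part of the answer: \\s* and [^ /] can both match a newline, so they are narrowed to [ \\t]* and [^\\s/] here to keep a match from running across lines. The class name LineAnchored and the helper firstKeys() are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are assumptions made for this sketch.
class LineAnchored {
    // In MULTILINE mode ^ matches at the start of every line, so at most
    // one match attempt per line can get past the anchor.
    private static final Pattern LINE_PAIR = Pattern.compile(
        "^(.*?)[ \\t]*([^\\s/]+)[ \\t]*/[ \\t]*(\\d+)[ \\t]*==[ \\t]*([^\\s/]+)[ \\t]*/[ \\t]*(\\d+)",
        Pattern.MULTILINE);

    // Returns the first key of the pair found on each matching line.
    static List<String> firstKeys(String text) {
        List<String> keys = new ArrayList<>();
        Matcher m = LINE_PAIR.matcher(text);
        while (m.find()) {
            keys.add(m.group(2));
        }
        return keys;
    }

    public static void main(String[] args) {
        String text = "alpha/1==beta/2 tail\n"
                    + "no pairs here\n"
                    + "intro gamma/3==delta/4\n";
        System.out.println(firstKeys(text));
    }
}
```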
