Need fresh eyes for Java regular expression, which is too greedy

I have a string of the form:

canonical_class_name[key1="value1",key2="value2",key3="value3",...] 

The purpose is to capture the canonical_class_name in a group, followed by alternating key and value groups. Currently it does not match a test string (testString in the following program).

There must be at least one key/value pair, but there may be many such pairs.

Question: Currently the regex grabs the canonical class name and the first key correctly, but then it gobbles up everything until the last double quote. How do I make it grab the key/value pairs lazily?

Here is the regular expression which the following program puts together:

(\S+)\[\s*(\S+)\s*=\s*"(.*)"\s*(?:\s*,\s*(\S+)\s*=\s*"(.*)"\s*)*\]

Depending on your preference, you may find the program's version easier to read.

If my program is passed the String:

org.myobject[key1=\"value1\", key2=\"value2\", key3=\"value3\"]

...these are the groups I get:

Group1 contains: org.myobject
Group2 contains: key1
Group3 contains: value1", key2="value2", key3="value3

One more note: using String.split() I could simplify the expression, but I'm treating this as a learning experience to improve my understanding of regex, so I don't want to use such a shortcut.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BasicORMParser {
    String regex =
            "canonicalName\\[ map (?: , map )*\\]"
            .replace("canonicalName", "(\\S+)")
            .replace("map", "key = \"value\"")
            .replace("key", "(\\S+)")
            .replace("value", "(.*)")
            .replace(" ", "\\s*"); 

    List<String> getGroups(String ormString){
        List<String> values = new ArrayList<>();
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(ormString);
        if (!matcher.matches()) {
            String msg = String.format("String failed regex validation. Required: %s , found: %s", regex, ormString);
            throw new RuntimeException(msg);
        }
        if(matcher.groupCount() < 2){
            String msg = String.format("Did not find Class and at least one key value.");
            throw new RuntimeException(msg);
        }
        // Groups are numbered 1..groupCount() inclusive, so use <= to include the last one.
        for(int i = 1; i <= matcher.groupCount(); i++){
            values.add(matcher.group(i));
        }
        return values;
    }
}

You practically answered the question yourself: make them lazy. That is, use lazy (aka non-greedy or reluctant) quantifiers. Just change each (\S+) to (\S+?), and each (.*) to (.*?). But if it were me, I'd change those subexpressions so they can never match too much, regardless of greediness. For example, you could use ([^\s\[]+) for the class name, ([^\s=]+) for the key, and "([^"]*)" for the value.
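As a rough sketch of the second suggestion, the question's regex can be rebuilt with those negated character classes. The class name NegatedClassDemo and the helper firstValue are made up for illustration; only the three subexpressions come from the answer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: same shape as the question's regex, but every
// subexpression excludes its own delimiter, so it cannot overrun.
public class NegatedClassDemo {
    static final Pattern STRICT = Pattern.compile(
        "([^\\s\\[]+)\\[\\s*([^\\s=]+)\\s*=\\s*\"([^\"]*)\"\\s*"
        + "(?:,\\s*([^\\s=]+)\\s*=\\s*\"([^\"]*)\"\\s*)*\\]");

    // Returns the first captured value (group 3), or null if the input does not match.
    static String firstValue(String input) {
        Matcher m = STRICT.matcher(input);
        return m.matches() ? m.group(3) : null;
    }

    public static void main(String[] args) {
        String s = "org.myobject[key1=\"value1\", key2=\"value2\", key3=\"value3\"]";
        System.out.println(firstValue(s)); // prints: value1
    }
}
```

Because "([^"]*)" can never cross a double quote, group 3 stops at the end of the first value whether the quantifier is greedy or not.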

I don't think that's going to solve your real problem, though. Once you've got it so it correctly matches all the key/value pairs, you'll find that it only captures the first pair (groups #2 and #3) and the last pair (groups #4 and #5). That's because each time (?:\s*,\s*(\S+)\s*=\s*"(.*)"\s*)* gets repeated, those two groups get their contents overwritten, and whatever they captured on the previous iteration is lost. There's no getting around it: this is at least a two-step operation. For example, you could match all of the key/value pairs as a block, then break out the individual pairs.
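One way the two-step idea might look is sketched below. The names TwoStepParse, OUTER, PAIR and parsePairs are hypothetical, and the second step only extracts pairs; it does not validate the separators between them.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a two-step parse.
public class TwoStepParse {
    // Step 1: the class name, then everything between the brackets as one block.
    static final Pattern OUTER = Pattern.compile("([^\\s\\[]+)\\[(.*)\\]");
    // Step 2: a single key="value" pair; find() is called repeatedly on the block.
    static final Pattern PAIR = Pattern.compile("([^\\s=,]+)\\s*=\\s*\"([^\"]*)\"");

    static Map<String, String> parsePairs(String input) {
        Matcher outer = OUTER.matcher(input);
        if (!outer.matches()) {
            throw new IllegalArgumentException("Not of the form name[key=\"value\",...]: " + input);
        }
        Map<String, String> pairs = new LinkedHashMap<>(); // keeps insertion order
        Matcher pair = PAIR.matcher(outer.group(2));
        while (pair.find()) {
            // Each find() is a separate match, so nothing gets overwritten.
            pairs.put(pair.group(1), pair.group(2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String s = "org.myobject[key1=\"value1\", key2=\"value2\", key3=\"value3\"]";
        System.out.println(parsePairs(s)); // prints: {key1=value1, key2=value2, key3=value3}
    }
}
```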

One more thing. This line:

if(matcher.groupCount() < 2){

...probably isn't doing what you think it does. groupCount() reflects a static property of the Pattern object: it tells how many capturing groups there are in the regex. Whether the match succeeds or fails, groupCount() will always return the same value, in this case five. If the match succeeds, some of the capture groups may be null (indicating that they didn't participate in the match), but there will always be five of them.
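A small illustration of that point, using a made-up three-group pattern rather than the one from the question (GroupCountDemo and its helpers are hypothetical names):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo: groupCount() depends on the pattern, not on the input.
public class GroupCountDemo {
    static final Pattern P = Pattern.compile("(a)(b)?(c)?");

    // Number of capturing groups reported after attempting a match.
    static int countFor(String input) {
        Matcher m = P.matcher(input);
        m.matches(); // the result is deliberately ignored
        return m.groupCount();
    }

    // Number of groups that actually took part in a successful match.
    static int participatingGroups(String input) {
        Matcher m = P.matcher(input);
        if (!m.matches()) {
            return 0;
        }
        int n = 0;
        for (int i = 1; i <= m.groupCount(); i++) {
            if (m.group(i) != null) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(countFor("a"));            // prints: 3
        System.out.println(countFor("xyz"));          // prints: 3 (even though the match fails)
        System.out.println(participatingGroups("a")); // prints: 1
    }
}
```

To check how many groups actually captured something, test each group(i) against null instead of relying on groupCount().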


EDIT: I suspect this is what you were trying for initially:

Pattern p = Pattern.compile(
    "(?:([^\\s\\[]+)\\[|\\G)([^\\s=]+)=\"([^\"]*)\"[,\\s]*");

String s = "org.myobject[key1=\"value1\", key2=\"value2\", key3=\"value3\"]";
Matcher m = p.matcher(s);
while (m.find())
{
  if (m.group(1) != null)
  {
    System.out.printf("class : %s%n", m.group(1));
  }
  System.out.printf("key : %s, value : %s%n", m.group(2), m.group(3));
}

Output:

class : org.myobject
key : key1, value : value1
key : key2, value : value2
key : key3, value : value3

The key to understanding the regex is this part: (?:([^\s\[]+)\[|\G). On the first pass it matches the class name and the opening square bracket. After that, \G takes over, anchoring the next match to the position where the previous match ended.
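To see the effect of \G in isolation, here is a minimal sketch on a simpler input: a comma-separated run of digits with one bad item in the middle. ContiguousMatchDemo and digits are hypothetical names, and the patterns are not taken from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo: \G forces each match to start where the previous one ended.
public class ContiguousMatchDemo {
    // Collects group 1 of every successive find() on the input.
    static List<String> digits(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "1,2,x,3";
        // With \G, matching stops at the first item that does not fit.
        System.out.println(digits("\\G(\\d),?", s)); // prints: [1, 2]
        // Without \G, find() skips over the bad item and carries on.
        System.out.println(digits("(\\d),?", s));    // prints: [1, 2, 3]
    }
}
```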

For non-greedy matching, append a ? after the quantifier. For example, .*? matches the fewest characters possible.
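A quick side-by-side comparison on a shortened version of the test string may help. GreedyVsLazy and firstQuoted are hypothetical names used only for this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo comparing a greedy and a lazy quantifier on the same input.
public class GreedyVsLazy {
    // Returns group 1 of the first match found, or null if there is none.
    static String firstQuoted(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "key1=\"value1\", key2=\"value2\"";
        // Greedy: .* runs to the end, then backs up to the LAST double quote.
        System.out.println(firstQuoted("\"(.*)\"", s));  // prints: value1", key2="value2
        // Lazy: .*? expands one character at a time and stops at the FIRST closing quote.
        System.out.println(firstQuoted("\"(.*?)\"", s)); // prints: value1
    }
}
```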
