Regex to split nested coordinate strings

I have a String of the format "[(1, 2), (2, 3), (3, 4)]", with an arbitrary number of elements. I'm trying to split it on the commas separating the coordinates, that is, to retrieve (1, 2), (2, 3), and (3, 4).

Can I do it with Java regex? I'm a complete noob, but I'm hoping Java regex is powerful enough for it. If it isn't, could you suggest an alternative?

From Java 5 on, you can use java.util.Scanner:

Scanner sc = new Scanner(coords); // coords is the input String (Scanner has no no-arg constructor)
sc.useDelimiter("\\D+"); // skip everything that is not a digit
List<Coord> result = new ArrayList<Coord>();
while (sc.hasNextInt()) {
    result.add(new Coord(sc.nextInt(), sc.nextInt()));
}
return result;

EDIT: We don't know how many coordinates are passed in the string coords.
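For reference, here is a self-contained sketch of the Scanner approach. The class name ScannerCoords and the use of int[] {x, y} in place of a Coord class are assumptions made only so the example compiles on its own; it also assumes the values are non-negative, since the minus sign is a non-digit and would be swallowed by the delimiter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerCoords {
    // Reads the integers two at a time; each pair is returned as int[] {x, y}.
    static List<int[]> parse(String coords) {
        List<int[]> result = new ArrayList<>();
        try (Scanner sc = new Scanner(coords)) {
            sc.useDelimiter("\\D+"); // every run of non-digits separates two numbers
            while (sc.hasNextInt()) {
                int x = sc.nextInt();
                int y = sc.nextInt(); // throws NoSuchElementException if a number is unpaired
                result.add(new int[] {x, y});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (int[] c : parse("[(1, 2), (2, 3), (3, 4)]")) {
            System.out.printf("x=%d, y=%d%n", c[0], c[1]);
        }
    }
}
```

Because the loop runs until the input is exhausted, it does not need to know in advance how many coordinates the string holds.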

You can use String#split() for this.

String string = "[(1, 2), (2, 3), (3, 4)]";
string = string.substring(1, string.length() - 1); // Get rid of the square brackets.
String[] parts = string.split("(?<=\\))(,\\s*)(?=\\()");
for (String part : parts) {
    part = part.substring(1, part.length() - 1); // Get rid of parentheses.
    String[] coords = part.split(",\\s*");
    int x = Integer.parseInt(coords[0]);
    int y = Integer.parseInt(coords[1]);
    System.out.printf("x=%d, y=%d\n", x, y);
}

The (?<=\\)) positive lookbehind means that the delimiter must be preceded by ). The (?=\\() positive lookahead means that it must be followed by (. The (,\\s*) means that the string is split on the , and any whitespace after it. The \\ are there just to escape regex-specific chars.
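To see why the lookarounds matter, the following sketch compares a plain comma split with the guarded one. The class and method names (LookaroundSplit, naive, guarded) are made up for this illustration.

```java
public class LookaroundSplit {
    // Splits on every comma, including the one inside each pair.
    static String[] naive(String s) {
        return s.split(",\\s*");
    }

    // Splits only on a comma that sits between ")" and "(".
    // The lookarounds are zero-width, so the parentheses stay in the parts.
    static String[] guarded(String s) {
        return s.split("(?<=\\)),\\s*(?=\\()");
    }

    public static void main(String[] args) {
        String s = "(1, 2), (2, 3), (3, 4)";
        System.out.println(naive(s).length);   // 6
        System.out.println(guarded(s).length); // 3
        for (String part : guarded(s)) {
            System.out.println(part);
        }
    }
}
```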

That said, the particular String is recognizable as the outcome of List#toString(). Are you sure you're doing things the right way? ;)
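As an illustration of that remark, a list of objects whose toString() returns "(x, y)" prints in exactly this format. The Point class below is hypothetical and not taken from the question.

```java
import java.util.Arrays;
import java.util.List;

public class ToStringDemo {
    // Hypothetical coordinate type, used only for this illustration.
    static final class Point {
        final int x;
        final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }

        @Override
        public String toString() {
            return "(" + x + ", " + y + ")";
        }
    }

    static String render() {
        List<Point> points = Arrays.asList(new Point(1, 2), new Point(2, 3), new Point(3, 4));
        return points.toString(); // elements joined with ", " inside [ ]
    }

    public static void main(String[] args) {
        System.out.println(render()); // [(1, 2), (2, 3), (3, 4)]
    }
}
```

If the string really comes from such a list, passing the list itself around avoids the parsing step altogether.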

Update: as per the comments, you can indeed also do it the other way round and get rid of the non-digits:

String string = "[(1, 2), (2, 3), (3, 4)]";
String[] parts = string.split("\\D.");
for (int i = 1; i < parts.length; i += 3) {
    int x = Integer.parseInt(parts[i]);
    int y = Integer.parseInt(parts[i + 1]);
    System.out.printf("x=%d, y=%d\n", x, y);
}

Here the \\D means that the string is split on any non-digit (\\d stands for digit). The . after it consumes one more character, which eliminates the blank matches after the digits. The blank entries that remain (one at the start and one between pairs) are why the loop starts at index 1 and advances by 3. I must however admit that I'm not sure how to eliminate the blank matches before the digits. I'm not a trained regex guru yet. Hey, Bart K, can you do it better?
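One possible way around the blank entries, offered as a sketch and not as the answer the author was asking for: strip the leading non-digits first, then split on runs of non-digits with \\D+. The class name DigitsOnlySplit and the int[] {x, y} representation are assumptions, and negative numbers are not handled.

```java
import java.util.ArrayList;
import java.util.List;

public class DigitsOnlySplit {
    static List<int[]> parse(String input) {
        List<int[]> result = new ArrayList<>();
        // Remove the leading non-digits so split() yields no empty first element.
        String trimmed = input.replaceFirst("^\\D+", "");
        if (trimmed.isEmpty()) {
            return result; // the input contained no digits
        }
        // \\D+ treats each whole run of non-digits as a single separator;
        // split() drops trailing empty strings on its own.
        String[] parts = trimmed.split("\\D+");
        for (int i = 0; i + 1 < parts.length; i += 2) {
            result.add(new int[] {Integer.parseInt(parts[i]), Integer.parseInt(parts[i + 1])});
        }
        return result;
    }

    public static void main(String[] args) {
        for (int[] c : parse("[(1, 2), (2, 3), (3, 4)]")) {
            System.out.printf("x=%d, y=%d%n", c[0], c[1]);
        }
    }
}
```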

After all, it's ultimately better to use a parser for this. See Hubert's answer on this topic.

If you do not require the expression to validate the syntax around the coordinates, this should do:

\(\d+,\s\d+\)

This expression will return several matches (three with the input from your example).
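A minimal sketch of collecting those matches in Java with Matcher#find(); the class name FindPairs and the helper findPairs are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindPairs {
    // Same expression as above, with the backslashes doubled for a Java string literal.
    private static final Pattern PAIR = Pattern.compile("\\(\\d+,\\s\\d+\\)");

    static List<String> findPairs(String input) {
        List<String> pairs = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            pairs.add(m.group()); // the whole match, parentheses included
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String pair : findPairs("[(1, 2), (2, 3), (3, 4)]")) {
            System.out.println(pair);
        }
    }
}
```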

In your question, you state that you want to "retrieve (1, 2), (2, 3), and (3, 4)". In case you actually need the pair of values associated with each coordinate, you can drop the parentheses and modify the regex to do some captures:

(\d+),\s(\d+)

The Java code will look something like this:

import java.util.regex.*;

public class Test {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("(\\d+),\\s(\\d+)");
        Matcher matcher = pattern.matcher("[(1, 2), (2, 3), (3, 4)]");

        while (matcher.find()) {
            int x = Integer.parseInt(matcher.group(1));
            int y = Integer.parseInt(matcher.group(2));
            System.out.printf("x=%d, y=%d\n", x, y);
        }
    }
}

Will there always be 3 groups of coordinates that need to be analyzed?

You could try:

\\[(\\(\\d+,\\s*\\d+\\)), (\\(\\d+,\\s*\\d+\\)), (\\(\\d+,\\s*\\d+\\))\\]
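A sketch of how a fixed-count expression like this could be used with Matcher#matches(). Note the assumptions: the inner part of each group is written here as \\d+,\\s*\\d+ so that it accepts the space after the comma and multi-digit numbers seen in the question's input, and the names ThreePairs and parse are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ThreePairs {
    // One capturing group per coordinate pair, parentheses included.
    private static final String PAIR = "(\\(\\d+,\\s*\\d+\\))";
    private static final Pattern THREE =
            Pattern.compile("\\[" + PAIR + ", " + PAIR + ", " + PAIR + "\\]");

    // Returns the three pairs, or an empty array if the input has another shape.
    static String[] parse(String input) {
        Matcher m = THREE.matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        return new String[] {m.group(1), m.group(2), m.group(3)};
    }

    public static void main(String[] args) {
        for (String pair : parse("[(1, 2), (2, 3), (3, 4)]")) {
            System.out.println(pair);
        }
    }
}
```

This only works when the number of pairs is known up front; an input with two or four pairs simply fails to match.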

If you use regex, you are going to get lousy error reporting, and things will get exponentially more complicated if your requirements change (for instance, if you have to parse the sets in different square brackets into different groups).

I recommend you just write the parser by hand; it's about 10 lines of code and shouldn't be very brittle. Track everything you are doing: open parens, close parens, open square brackets and close square brackets. It's like a switch statement with 5 options (and a default), really not that bad.

For a minimal approach, open parens and open square brackets can be ignored, so there are really only 3 cases.

This would be the bare minimum.

// Java-like pseudocode; "input" is the string to parse and "output" stands for
// whatever collects the results
int valueA = 0;
String lastValue = "";
StringTokenizer tokens = new StringTokenizer(input, "[](),", true);

while (tokens.hasMoreTokens()) {
    String token = tokens.nextToken();

    // The token before the ) is the second int of the pair, and the first should
    // already be stored
    if (token.equals(")"))
        output.addResult(valueA, Integer.parseInt(lastValue.trim()));

    // The token before the comma is the first int of the pair
    else if (token.equals(","))
        valueA = Integer.parseInt(lastValue.trim());

    // Just store off this token and deal with it when we hit the proper delim
    else
        lastValue = token;
}

This is no better than a minimal regex-based solution, EXCEPT that it will be MUCH easier to maintain and enhance (add error checking, add a stack for paren and square-bracket matching, and check for misplaced commas and other invalid syntax).

As an example of expandability, if you had to place different sets of square-bracket-delimited groups into different output sets, the addition would be as simple as:

    // When we close the square bracket, start a new output group.
    else if(token.equals("]"))
        output.startNewGroup();

And checking for parens is as easy as creating a stack of chars and pushing each [ or ( onto the stack; then, when you get a ] or ), pop the stack and assert that it matches. Also, when you are done, make sure your stack.size() == 0.
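Put together, the tokenizer loop and the stack check might look like the following runnable sketch. Everything beyond the idea itself is an assumption made for the example: the class name HandParser, returning pairs as int[] {x, y}, adding the space to the delimiter set so that whitespace can be skipped, and reporting errors with IllegalArgumentException.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.StringTokenizer;

public class HandParser {
    static List<int[]> parse(String input) {
        List<int[]> output = new ArrayList<>();
        Deque<Character> open = new ArrayDeque<>(); // unmatched "[" and "(" seen so far
        int valueA = 0;
        String lastValue = "";
        // Passing true makes the tokenizer return the delimiters as tokens too.
        StringTokenizer tokens = new StringTokenizer(input, "[](), ", true);

        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            switch (token) {
                case "[":
                case "(":
                    open.push(token.charAt(0));
                    break;
                case ")":
                    expect(open, '(');
                    // The token before ")" is the second int of the pair.
                    output.add(new int[] {valueA, Integer.parseInt(lastValue)});
                    break;
                case "]":
                    expect(open, '[');
                    break;
                case ",":
                    // Only a comma inside parentheses separates the two ints of a pair.
                    if (!open.isEmpty() && open.peek() == '(') {
                        valueA = Integer.parseInt(lastValue);
                    }
                    break;
                case " ":
                    break; // whitespace carries no meaning
                default:
                    lastValue = token; // keep it until we hit the proper delimiter
            }
        }
        if (!open.isEmpty()) {
            throw new IllegalArgumentException("Unclosed " + open.peek());
        }
        return output;
    }

    // Pops the stack and checks that the popped opener is the expected one.
    private static void expect(Deque<Character> open, char wanted) {
        if (open.isEmpty() || open.pop() != wanted) {
            throw new IllegalArgumentException("Unbalanced input, no matching " + wanted);
        }
    }

    public static void main(String[] args) {
        for (int[] c : parse("[(1, 2), (2, 3), (3, 4)]")) {
            System.out.printf("x=%d, y=%d%n", c[0], c[1]);
        }
    }
}
```

Malformed numbers surface as NumberFormatException from Integer.parseInt, which is itself a subclass of IllegalArgumentException, so callers can catch a single exception type.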

With regexes, you can split on (?<=\\)), which uses a positive lookbehind:

String[] subs = str.replaceAll("\\[", "").replaceAll("\\]", "").split("(?<=\\)),");

With simple string functions, you can drop the [ and ] and use string.split("),"), then append the ) again to each piece that lost it. (In Java, String#split() takes a regex, so the ) has to be escaped there: split("\\),").)
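A sketch of the plain-string variant in Java. It assumes the separator between pairs is exactly "), " (closing parenthesis, comma, one space) and uses Pattern.quote so that split() treats it as literal text; the names PlainSplit and splitPairs are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class PlainSplit {
    static List<String> splitPairs(String input) {
        List<String> pairs = new ArrayList<>();
        // Drop the surrounding [ and ].
        String inner = input.substring(1, input.length() - 1);
        if (inner.isEmpty()) {
            return pairs;
        }
        // Pattern.quote turns "), " into a literal, so the ")" needs no manual escaping.
        String[] parts = inner.split(Pattern.quote("), "));
        for (int i = 0; i < parts.length; i++) {
            // split() consumed the ")" of every piece except the last one.
            pairs.add(i < parts.length - 1 ? parts[i] + ")" : parts[i]);
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String pair : splitPairs("[(1, 2), (2, 3), (3, 4)]")) {
            System.out.println(pair);
        }
    }
}
```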
