Extracting a specific pattern using regular expressions

Question

I'm having a hard time using regular expressions in Java, even after reading numerous tutorials online. I'm trying to extract parts of a received String to be used later in my application.

Here are examples of the Strings that may be received:

53248 <CERCLE> 321 211 55 </CERCLE>
57346 <RECTANGLE> 272 99 289 186 </RECTANGLE>

The first number is to be extracted as a sequence number. The word between <> is to be extracted as well, and so is the sequence of numbers between the opening and closing tags.

Here is my pattern:

"(\\d+)\\s*<(\\w+)>\\s*((\\d+\\s*)+)\\s*</\\w*>.*"

Here is the code for my method so far:

public decompose(String s) throws IllegalArgumentException {

    Pattern pattern = Pattern.compile(PATTERN);
    Matcher matcher = pattern.matcher(s);

    noSeq = Integer.parseInt(matcher.group(1));
    type = typesFormes.valueOf(matcher.group(2));
    strCoords = matcher.group(3).split(" ");

}

The problem is that when I run the code, all my matcher groups are at -1 for some reason (not found, I guess). I've been banging my head on this for a while and any suggestion is welcome :) Thanks.

Answer 1

Simply try String#split():

  String str="53248 <CERCLE> 321 211 55 </CERCLE>";
  String[] array=str.split("(\\s<|>\\s)"); 
  // simple regex (space < OR > space)

Note: use \\s+ instead of \\s if there may be more than one space.

Use the first three values of the array, which in this case are 53248, CERCLE and 321 211 55.

Complete code:

String str = "53248 <CERCLE> 321 211 55 </CERCLE>";
String[] array = str.split("(\\s<|>\\s)");

int noSeq = Integer.valueOf(array[0]);
String type = array[1];
String strCoords = array[2];

System.out.println(noSeq+", "+type+", "+strCoords);

Output:

53248, CERCLE, 321 211 55
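
To make the \\s+ note concrete, here is a minimal sketch of the same split-based approach that tolerates runs of spaces and also turns the coordinate string into numbers. The class name SplitSketch and the helper names splitMessage and parseCoords are made up for illustration; they are not part of the original answer.

```java
import java.util.Arrays;

public class SplitSketch {

    // Split on "whitespace followed by <" or "> followed by whitespace".
    // \\s+ (instead of \\s) tolerates more than one space around the tags.
    static String[] splitMessage(String str) {
        return str.split("\\s+<|>\\s+");
    }

    // Turn a coordinate string such as "321 211 55" into an int array.
    static int[] parseCoords(String coords) {
        return Arrays.stream(coords.trim().split("\\s+"))
                     .mapToInt(Integer::parseInt)
                     .toArray();
    }

    public static void main(String[] args) {
        String[] parts = splitMessage("53248   <CERCLE>  321 211 55 </CERCLE>");
        int noSeq = Integer.parseInt(parts[0]);
        String type = parts[1];
        int[] coords = parseCoords(parts[2]);
        System.out.println(noSeq + ", " + type + ", " + Arrays.toString(coords));
    }
}
```

Keep in mind that split() does not validate the input: a malformed message still produces an array, just with unexpected contents, so the array length is worth checking before indexing into it.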

Answer 2

You just needed to tell the matcher to start matching the pattern against the input string. This works for me on ideone:

String s = "53248 <CERCLE> 321 211 55 </CERCLE>";
String PATTERN = "(\\d+)\\s*<(\\w+)>\\s*((\\d+\\s*)+)\\s*</\\w*>.*";
Pattern pattern = Pattern.compile(PATTERN);
Matcher matcher = pattern.matcher(s);
matcher.find();                         // aye, there's the rub
System.out.println(matcher.group(1));
System.out.println(matcher.group(2));
System.out.println(matcher.group(3));

Output was:

53248
CERCLE
321 211 55

The find() method, when successful, will let the matcher yield the information you want. From the javadocs:

If the match succeeds then more information can be obtained via the start, end, and group methods.

The documentation for group() says something similarly indicative (note the words "during the previous match operation"):

Returns the input subsequence captured by the given group during the previous match operation.
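
A small sketch of what that means in practice: calling group() on a fresh Matcher, before any match operation has been attempted, throws an IllegalStateException, while the same call after a successful find() returns the captured text. The class and method names below are hypothetical, chosen only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindBeforeGroup {

    // Same pattern as in the question.
    static final Pattern PATTERN =
            Pattern.compile("(\\d+)\\s*<(\\w+)>\\s*((\\d+\\s*)+)\\s*</\\w*>.*");

    // Returns true if group(1) fails because no match was attempted yet.
    static boolean groupWithoutFindThrows(String s) {
        Matcher matcher = PATTERN.matcher(s);
        try {
            matcher.group(1);   // no find()/matches() call before this
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    // Returns group(1) after a successful find(), or null if nothing matched.
    static String firstGroupAfterFind(String s) {
        Matcher matcher = PATTERN.matcher(s);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "53248 <CERCLE> 321 211 55 </CERCLE>";
        System.out.println(groupWithoutFindThrows(s));
        System.out.println(firstGroupAfterFind(s));
    }
}
```

matches() would also work here in place of find(); it requires the whole input to match, which this pattern allows because it ends in .*.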

Answer 3

As @2rs2ts pointed out, the problem was the missing matcher.find() call.

I would further improve it like this:

final String PATTERN = "(\\d+)\\s*<(\\w+)>\\s*([\\d\\s]+)\\s*</\\2>.*";
String s = "53248 <CERCLE> 321 211 55 </CERCLE>";
Pattern pattern = Pattern.compile(PATTERN);
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
    System.out.println(matcher.group(1));
    System.out.println(matcher.group(2));
    System.out.println(matcher.group(3).trim());
}

Some improvements:

In the pattern, you can simplify ((\\d+\\s*)+) to ([\\d\\s]+). For your purpose, it's equivalent.
In the pattern, you probably want to match <CERCLE> with a closing </CERCLE>, not </OTHER>. You can do that using \\2, which is a back reference to the 2nd capture group.
You can tell from the result of matcher.find() whether anything was matched.
Before you split the list of numbers in the middle, you might want to remove the possible trailing whitespace at the end using .trim().
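
Putting these points together, a decompose-style constructor might look like the sketch below. It is only one possible shape, under a few assumptions: the class name ShapeMessage is invented, the type is kept as a plain String rather than the question's typesFormes enum (which is not shown), and a failed match is reported with the IllegalArgumentException the question's method already declares.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ShapeMessage {

    // \\2 is a back reference: the closing tag must repeat the opening tag's name.
    private static final Pattern PATTERN =
            Pattern.compile("(\\d+)\\s*<(\\w+)>\\s*([\\d\\s]+)\\s*</\\2>.*");

    final int noSeq;
    final String type;          // the question converts this to an enum instead
    final String[] strCoords;

    ShapeMessage(String s) {
        Matcher matcher = PATTERN.matcher(s);
        // find() must succeed before any group() call is allowed.
        if (!matcher.find()) {
            throw new IllegalArgumentException("Unrecognized message: " + s);
        }
        noSeq = Integer.parseInt(matcher.group(1));
        type = matcher.group(2);
        // trim() drops the trailing whitespace captured by [\\d\\s]+,
        // and \\s+ tolerates several spaces between coordinates.
        strCoords = matcher.group(3).trim().split("\\s+");
    }

    public static void main(String[] args) {
        ShapeMessage m = new ShapeMessage("57346 <RECTANGLE> 272 99 289 186 </RECTANGLE>");
        System.out.println(m.noSeq + " " + m.type + " " + String.join(",", m.strCoords));
    }
}
```

Compiling the pattern once into a static field avoids recompiling it for every message.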
