
Using regex to extract specific pattern

I'm having a hard time using regular expressions in Java, even after reading numerous tutorials online. I'm trying to extract parts of a String I receive so they can be used later in my application.

Here are examples of the possible String received:

53248 <CERCLE> 321 211 55 </CERCLE>
57346 <RECTANGLE> 272 99 289 186 </RECTANGLE>

The first number is to be extracted as a sequence number. The word between <> is to be extracted as well, and so is the sequence of numbers between the opening and closing tags.

Here is my pattern:

"(\\d+)\\s*<(\\w+)>\\s*((\\d+\\s*)+)\\s*</\\w*>.*"

Here is the code for my method so far:

public decompose(String s) throws IllegalArgumentException {

    Pattern pattern = Pattern.compile(PATTERN);
    Matcher matcher = pattern.matcher(s);

    noSeq = Integer.parseInt(matcher.group(1));
    type = typesFormes.valueOf(matcher.group(2));
    strCoords = matcher.group(3).split(" ");

}

The problem is that when I run the code, all my matcher groups are at -1 for some reason (not found, I guess). I've been banging my head on this for a while and any suggestion is welcome :) Thanks.

Simply try String#split():

  String str="53248 <CERCLE> 321 211 55 </CERCLE>";
  String[] array=str.split("(\\s<|>\\s)"); 
  // simple regex (space < OR > space)

Note: Use \\s+ instead of \\s if there may be more than one space.

Use the first three values of the array, which will be 53248, CERCLE and 321 211 55 in this case.

Complete code:

String str = "53248 <CERCLE> 321 211 55 </CERCLE>";
String[] array = str.split("(\\s<|>\\s)");

int noSeq = Integer.valueOf(array[0]);
String type = array[1];
String strCoords = array[2];

System.out.println(noSeq+", "+type+", "+strCoords);

Output:

53248, CERCLE, 321 211 55
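
As a follow-up to the note about \\s+, here is a minimal sketch of the same split approach that tolerates any amount of whitespace (including none) around the angle brackets and also converts the coordinates to integers. The class name SplitSketch and the helpers splitParts and parseCoords are made up for illustration; they are not part of the original code.

```java
import java.util.Arrays;

public class SplitSketch {
    // Hypothetical helper: split on '<' and '>' together with any
    // whitespace that precedes '<' or follows '>'.
    static String[] splitParts(String line) {
        return line.trim().split("\\s*<|>\\s*");
    }

    // Hypothetical helper: turn "321 211 55" into {321, 211, 55},
    // accepting one or more whitespace characters between numbers.
    static int[] parseCoords(String coords) {
        return Arrays.stream(coords.trim().split("\\s+"))
                     .mapToInt(Integer::parseInt)
                     .toArray();
    }

    public static void main(String[] args) {
        // Irregular spacing on purpose.
        String[] parts = splitParts("57346   <RECTANGLE>272 99  289 186</RECTANGLE>");
        int noSeq = Integer.parseInt(parts[0]);
        String type = parts[1];
        int[] coords = parseCoords(parts[2]);
        System.out.println(noSeq + ", " + type + ", " + Arrays.toString(coords));
    }
}
```

Keep in mind that split() does no validation: a line that does not have the expected shape just produces a different array, so check the array length before indexing if the input is not trusted.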

You just needed to tell the matcher to start matching the pattern against the input string. This works for me on ideone:

String s = "53248 <CERCLE> 321 211 55 </CERCLE>";
String PATTERN = "(\\d+)\\s*<(\\w+)>\\s*((\\d+\\s*)+)\\s*</\\w*>.*";
Pattern pattern = Pattern.compile(PATTERN);
Matcher matcher = pattern.matcher(s);
matcher.find();                         // aye, there's the rub
System.out.println(matcher.group(1));
System.out.println(matcher.group(2));
System.out.println(matcher.group(3));

Output was:

53248
CERCLE
321 211 55

The find() method, when successful, will let the matcher yield the information you want. From the javadocs:

If the match succeeds then more information can be obtained via the start, end, and group methods.

The documentation for group() says something similar:

Returns the input subsequence captured by the given group during the previous match operation.
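
To make the point about the "previous match operation" concrete, here is a small sketch showing that Matcher.group(int) throws an IllegalStateException when no match has been attempted yet, and works once find() has succeeded. The class and method names are invented for the demonstration, and the pattern is a shortened version of the one in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindBeforeGroup {
    // Shortened pattern: sequence number followed by the opening tag.
    static final Pattern P = Pattern.compile("(\\d+)\\s*<(\\w+)>");

    // Asks for group(1) without any match attempt; this throws.
    static String groupWithoutFind(String s) {
        Matcher m = P.matcher(s);
        return m.group(1);
    }

    // Attempts a match first; returns null when nothing matches.
    static String groupAfterFind(String s) {
        Matcher m = P.matcher(s);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "53248 <CERCLE> 321 211 55 </CERCLE>";
        try {
            groupWithoutFind(s);
        } catch (IllegalStateException e) {
            System.out.println("group() before find(): " + e.getClass().getSimpleName());
        }
        System.out.println("group() after find(): " + groupAfterFind(s));
    }
}
```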

As @2rs2ts pointed out, the problem was the missing matcher.find() call.

I would further improve it like this:

final String PATTERN = "(\\d+)\\s*<(\\w+)>\\s*([\\d\\s]+)\\s*</\\2>.*";
String s = "53248 <CERCLE> 321 211 55 </CERCLE>";
Pattern pattern = Pattern.compile(PATTERN);
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
    System.out.println(matcher.group(1));
    System.out.println(matcher.group(2));
    System.out.println(matcher.group(3).trim());
}

Some improvements:

  • In the pattern, you can simplify ((\\d+\\s*)+) to ([\\d\\s]+). For your purpose, it's equivalent.
  • In the pattern, you probably want to match <CERCLE> with a closing </CERCLE>, not </OTHER>. You can do that using \\2, which is a back reference to the 2nd capture group.
  • You can tell from the result of matcher.find() whether anything was matched.
  • Before you split the list of numbers in the middle, you might want to remove the possible trailing whitespace using .trim().
