What is the correct format of a Java String REGEX to identify a DOI

Question

I am doing some research on identifying DOIs in free-format text.

I am using Java 8 and REGEX.

I have found these regexes, which are supposed to fulfil my requirements:

/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i
/^10.1002/[^\s]+$/i
/^10.\d{4}/\d+-\d+X?(\d+)\d+<[\d\w]+:[\d\w]*>\d+.\d+.\w+;\d$/i
/^10.1021/\w\w\d++$/i
/^10.1207/[\w\d]+\&\d+_\d+$/i

The code I am trying is:

private static final Pattern pattern_one = Pattern.compile("/^10.\\d{4,9}/[-._;()/:A-Z0-9]+$/i", Pattern.CASE_INSENSITIVE);

Matcher matcher = pattern_one.matcher("http://journals.ametsoc.org/doi/full/10.1175/JPO3002.1");
while (matcher.find()) {
    System.out.print("Start index: " + matcher.start());
    System.out.print(" End index: " + matcher.end() + " ");
    System.out.println(matcher.group());
}

However, the matcher doesn't find anything.

Where have I gone wrong?

UPDATE

I have encountered a valid DOI that my set of regexes does not match.

Here's an example DOI: 10.1175/1520-0485(2002)032<0870:CT>2.0.CO;2

Why doesn't this pattern work?

/^10.\d{4}/\d+-\d+X?(\d+)\d+<[\d\w]+:[\d\w]*>\d+.\d+.\w+;\d$/i

Answer 1

Your pattern looks incorrect to me. You are currently using this:

/^10.\\d{4,9}/[-._;()/:A-Z0-9]+$/i

But I think you intend to use this:

^.*/10\\.\\d{4,9}/[-._;()/:A-Z0-9]+$

One problem is that you are using JavaScript regex syntax (or some other language's syntax): the /.../i delimiters are not part of a Java pattern. Also, you were not escaping the literal dot in the regex, and the start-of-input anchor ^ was misplaced, since the DOI does not sit at the start of the URL.

Code:

String pattern = "^.*/10\\.\\d{4,9}/[-._;()/:A-Z0-9]+$";
String url = "http://journals.ametsoc.org/doi/full/10.1175/JPO3002.1";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(url);
if (m.find( )) {
    System.out.println("Found value: " + m.group(0) );
} else {
    System.out.println("NO MATCH");
}

Demo here:

Rextester
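Note that with the leading `^.*/`, `m.group(0)` in the code above is the whole URL rather than just the DOI. A minimal sketch of one way to pull out only the DOI is to wrap that part in a capturing group. The class name `DoiFromUrl`, the helper `extractDoi`, and the added `CASE_INSENSITIVE` flag are assumptions for illustration, not part of the answer's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DoiFromUrl {
    // Same character classes as the pattern above, with a capturing group
    // around the DOI. The CASE_INSENSITIVE flag is an assumption, added to
    // mirror the /i option of the original regexes.
    private static final Pattern DOI_IN_URL = Pattern.compile(
            "^.*/(10\\.\\d{4,9}/[-._;()/:A-Z0-9]+)$", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the DOI at the end of the URL, or null if none is found.
    static String extractDoi(String url) {
        Matcher m = DOI_IN_URL.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String url = "http://journals.ametsoc.org/doi/full/10.1175/JPO3002.1";
        System.out.println("DOI only: " + extractDoi(url)); // DOI only: 10.1175/JPO3002.1
    }
}
```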

Answer 2

In Java, a regex is written as a String. In other languages, the regex is quoted using /.../ , with options like i given after the closing / . So what is written as /XXX/i is done in Java like this:

// Using flags parameter
Pattern p = Pattern.compile("XXX", Pattern.CASE_INSENSITIVE);

// Using embedded flags
Pattern p = Pattern.compile("(?i)XXX");

In most languages, a regex is used to find a matching substring. Java can do that too, using the find() method (or any of the replaceXxx() regex methods). However, Java also has the matches() method, which matches against the entire string, eliminating the need for the begin and end boundary matchers ^ and $ .

Anyway, your problem is that the regex has both ^ and $ boundary matchers, which means it only works if the string contains nothing but the text you want to match. Since you actually want to find a substring, remove those matchers.

To search for one of multiple patterns, use the | alternation operator.

And finally, since a Java regex is given as a String literal, any characters that are special in a string literal, most notably the backslash \ , need to be escaped (so the regex \d is written as "\\d").
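The difference between find() and matches() can be sketched as follows. The class and helper names are hypothetical, and the pattern is the first regex from the question with its /.../i delimiters and ^ $ anchors removed and the literal dot escaped.

```java
import java.util.regex.Pattern;

public class FindVsMatches {
    // First regex from the question, without delimiters or anchors (dot escaped)
    static final Pattern DOI = Pattern.compile(
            "10\\.\\d{4,9}/[-._;()/:A-Z0-9]+", Pattern.CASE_INSENSITIVE);

    // find(): succeeds if any substring of the text matches
    static boolean containsDoi(String text) {
        return DOI.matcher(text).find();
    }

    // matches(): succeeds only if the entire text matches
    static boolean isExactlyDoi(String text) {
        return DOI.matcher(text).matches();
    }

    public static void main(String[] args) {
        String url = "http://journals.ametsoc.org/doi/full/10.1175/JPO3002.1";
        String doi = "10.1175/JPO3002.1";
        System.out.println(containsDoi(url));  // true
        System.out.println(isExactlyDoi(url)); // false
        System.out.println(isExactlyDoi(doi)); // true
    }
}
```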

So, to build a single regex that can find substrings matching any of the following:

/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i
/^10.1002/[^\s]+$/i
/^10.\d{4}/\d+-\d+X?(\d+)\d+<[\d\w]+:[\d\w]*>\d+.\d+.\w+;\d$/i
/^10.1021/\w\w\d++$/i
/^10.1207/[\w\d]+\&\d+_\d+$/i

You would write it like this:

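// Literal dots and parentheses are escaped (\\. \\( \\)) so they match only themselves
String regex = "10\\.\\d{4,9}/[-._;()/:A-Z0-9]+" +
              "|10\\.1002/[^\\s]+" +
              "|10\\.\\d{4}/\\d+-\\d+X?\\(\\d+\\)\\d+<[\\d\\w]+:[\\d\\w]*>\\d+\\.\\d+\\.\\w+;\\d" +
              "|10\\.1021/\\w\\w\\d++" +
              "|10\\.1207/[\\w\\d]+\\&\\d+_\\d+";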
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);

String input = "http://journals.ametsoc.org/doi/full/10.1175/JPO3002.1";
Matcher m = p.matcher(input);
while (m.find()) {
    System.out.println("Start index: " + m.start() +
                       " End index: " + m.end() +
                       " " + m.group());
}

Output

Start index: 37 End index: 54 10.1175/JPO3002.1
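On the escaping point, the DOI from the question's UPDATE makes a useful test case. In the third regex as originally written, an unescaped ( ... ) is a capturing group rather than literal parentheses, and an unescaped . matches any character, so the pattern cannot consume the "(2002)" part of that DOI. The sketch below compares the unescaped and escaped forms; the class and method names are hypothetical, and matches() is used only to check the whole sample string.

```java
import java.util.regex.Pattern;

public class DoiEscapingSketch {
    // Sample DOI taken from the question's UPDATE
    static final String SAMPLE = "10.1175/1520-0485(2002)032<0870:CT>2.0.CO;2";

    // As originally written: ( ) is a capturing group, . matches any character
    static final Pattern UNESCAPED = Pattern.compile(
            "10.\\d{4}/\\d+-\\d+X?(\\d+)\\d+<[\\d\\w]+:[\\d\\w]*>\\d+.\\d+.\\w+;\\d",
            Pattern.CASE_INSENSITIVE);

    // Literal parentheses and dots escaped
    static final Pattern ESCAPED = Pattern.compile(
            "10\\.\\d{4}/\\d+-\\d+X?\\(\\d+\\)\\d+<[\\d\\w]+:[\\d\\w]*>\\d+\\.\\d+\\.\\w+;\\d",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: true if the pattern matches the entire string
    static boolean matchesWhole(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println("unescaped: " + matchesWhole(UNESCAPED, SAMPLE)); // unescaped: false
        System.out.println("escaped:   " + matchesWhole(ESCAPED, SAMPLE));   // escaped:   true
    }
}
```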
